Re: Possibility to specify some type of files in a directory as input

2008-06-05 Thread 志远
Put the input path like: dir1/type1*.txt Hi, I need some help setting up my map-reduce job to consider only a certain type of file as input in a specific directory. For example, suppose there is a directory dir1 and I have files like type1_1.txt type1_2.txt type1_3.txt type2_1.txt type2_2.txt
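
A minimal sketch of the glob approach from the reply above, using the 0.x mapred API. FileInputFormat expands its input paths as glob patterns, so only the matching files become input splits. GlobInputDriver and the output path are placeholders, not code from the thread:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class GlobInputDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(GlobInputDriver.class);
            // Input paths are expanded as globs, so only the type1_*.txt files
            // under dir1 are read; type2_*.txt never becomes an input split.
            FileInputFormat.setInputPaths(conf, new Path("dir1/type1_*.txt"));
            FileOutputFormat.setOutputPath(conf, new Path("dir1_out"));
            JobClient.runJob(conf);  // identity mapper/reducer by default
        }
    }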

Possibility to specify some type of files in a directory as input

2008-06-05 Thread novice user
Hi, I need some help setting up my map-reduce job to consider only a certain type of file as input in a specific directory. For example, suppose there is a directory dir1 and I have files like type1_1.txt type1_2.txt type1_3.txt type2_1.txt type2_2.txt and if I want to consider only those files

Gigablast.com search engine, 10 billion pages!!!

2008-06-05 Thread Dan Segel
Our ultimate goal is basically to replicate the gigablast.com search engine. They claim to have fewer than 500 servers holding 10 billion pages indexed, spidered, and updated on a routine basis... I am looking at 500 million pages indexed per node, with a total of 20 nodes. Each node

Re: Monthly user group meeting

2008-06-05 Thread Chris Doherty
On Wed, Jun 04, 2008 at 07:56:45PM -0700, Otis Gospodnetic said: The videos from the Hadoop summit are still not available: http://developer.yahoo.com/blogs/hadoop/2008/04/hadoop_summit_slides_and_video.html And at this point it looks like they never will be available :( I followed the link

Re: Monthly user group meeting

2008-06-05 Thread Otis Gospodnetic
Aha, I see, I see: the videos were added to http://research.yahoo.com/node/2104. When I last checked that page, there were only slides there. Thanks. Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Chris Doherty [EMAIL PROTECTED]

RE: compressed/encrypted file

2008-06-05 Thread Haijun Cao
Arun/John, thanks for the update. For security reasons we also need to encrypt the files; there is currently no support for encryption, so we will have to roll our own. Again, I'd like to know if anybody here does encryption, and if so, what algorithm and how key/password distribution is handled.

Re: compressed/encrypted file

2008-06-05 Thread Allen Wittenauer
On 6/5/08 11:38 AM, Ted Dunning [EMAIL PROTECTED] wrote: We encrypt log files using standard AES; I wrote an input format to deal with it. Key distribution should be done better than we do it. My preference would be to insert an auth key into the job conf, which is then used by
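
Ted's input format is not shown in the thread; the following is only a sketch of the decryption side of such an approach. The AES key travels in the job conf under a made-up property name, "my.aes.key", which is exactly the weakness discussed next, since the job conf itself is not protected. It assumes commons-codec for Base64 and a per-file IV obtained elsewhere:

    import java.io.InputStream;
    import javax.crypto.Cipher;
    import javax.crypto.CipherInputStream;
    import javax.crypto.spec.IvParameterSpec;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.commons.codec.binary.Base64;
    import org.apache.hadoop.mapred.JobConf;

    public class AesStreams {
        // Wrap a raw HDFS stream so a RecordReader sees plaintext.
        // "my.aes.key" is a hypothetical property name; storing the key in
        // the job conf is the key-distribution weak point raised below.
        public static InputStream decrypting(InputStream raw, JobConf conf, byte[] iv)
                throws Exception {
            byte[] key = Base64.decodeBase64(conf.get("my.aes.key").getBytes("UTF-8"));
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"),
                        new IvParameterSpec(iv));
            return new CipherInputStream(raw, cipher);
        }
    }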

Re: compressed/encrypted file

2008-06-05 Thread Ted Dunning
Security and Hadoop are not particularly compatible concepts. Things may improve when user authentication exists. The lack of security on job confs is the major motivation for making sure the auth key is time-limited. If and when something like Kerberos user authentication exists, then Kerberos

Re: compressed/encrypted file

2008-06-05 Thread Allen Wittenauer
On 6/5/08 11:57 AM, Ted Dunning [EMAIL PROTECTED] wrote: Can you suggest an alternative way to communicate a secret to Hadoop tasks, short of embedding it into source code? This is one of the reasons why we use HOD: the job isolation helps prevent data leaks from one job to the

Re: compressed/encrypted file

2008-06-05 Thread Stuart Sierra
On Wed, Jun 4, 2008 at 6:52 PM, Arun C Murthy [EMAIL PROTECTED] wrote: With the current compression codecs available in Hadoop (zlib/gzip/lzo), it is not possible to split up a compressed file and then process it in parallel. However, once we get bzip2 to work, we could split up the
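
Because a zlib/gzip/lzo stream cannot be entered mid-file, the framework hands such a file to a single map task rather than splitting it. A sketch of opening one through Hadoop's codec factory (the path is a placeholder):

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class ReadCompressed {
        public static InputStream open(Configuration conf, Path file) throws Exception {
            FileSystem fs = FileSystem.get(conf);
            // The codec is chosen by file extension (.gz, .lzo, ...);
            // null means the file is treated as uncompressed.
            CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
            return codec == null ? fs.open(file)
                                 : codec.createInputStream(fs.open(file));
        }
    }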

MapWritable as output value of Reducer

2008-06-05 Thread Tarandeep Singh
Hi, can I use MapWritable as the output value of a Reducer? If yes, how will the (key, value) pairs in the MapWritable object be written to the file? What output format should I use in this case? Further, I want to chain the output of the first map-reduce job to another map-reduce job,
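
One common answer, sketched here under the assumption of the 0.x mapred API (not necessarily what the poster ended up with): write the first job's output as a SequenceFile, which serializes any Writable value directly, and point the second job's input format at it so no text representation of the MapWritable is ever needed:

    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;

    public class ChainConfig {
        static void configure(JobConf first, JobConf second) {
            // First job: Text keys, MapWritable values, stored as a SequenceFile.
            first.setOutputKeyClass(Text.class);
            first.setOutputValueClass(MapWritable.class);
            first.setOutputFormat(SequenceFileOutputFormat.class);
            // Second job reads the (Text, MapWritable) pairs straight back.
            second.setInputFormat(SequenceFileInputFormat.class);
        }
    }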

Re: compressed/encrypted file

2008-06-05 Thread Ted Dunning
Yes, that is what I meant. Not particularly good, but possibly the best we can do with Hadoop (for a while). If Hadoop handled the ticket for us in a secure way, then I would feel better. On Thu, Jun 5, 2008 at 3:40 PM, Haijun Cao [EMAIL PROTECTED] wrote: If and when something like Kerberos

Re: MapWritable as output value of Reducer

2008-06-05 Thread Yang Chen
I believe the (key, value) structure is the same for both the input and the output file. In this case, you can consider a job flow like the one below:

    JobConf confA = new JobConf(A.class);
    confA.setJobName("A");
    confA.setOutputKeyClass(Text.class);
    confA.setOutputValueClass(IntWritable.class);

local bytes written (high io, low memory usage)

2008-06-05 Thread Haijun Cao
I noticed that the local bytes written/read stats in my map-reduce job are really high, 2x, 3x, 4x the HDFS bytes. When does the Hadoop mapred framework write to the local fs? Is it done when JVM memory is not enough and data is spilled to disk? How can I configure it so that it does not spill to disk?
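
Local writes cannot be eliminated entirely: the map side always writes its final sorted output to the local fs for the reducers to fetch, and the reduce side spills during its merge. What the sort-buffer settings control is how many intermediate spill-and-merge passes happen before that final write. A sketch with the 0.x property names (the values are illustrative, not recommendations):

    import org.apache.hadoop.mapred.JobConf;

    public class SpillTuning {
        public static void tune(JobConf conf) {
            // Bigger in-memory sort buffer => fewer intermediate spills (default 100 MB).
            conf.setInt("io.sort.mb", 200);
            // More streams merged per pass when spills do happen (default 10).
            conf.setInt("io.sort.factor", 20);
        }
    }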