Re: Why single thread for HDFS?

2010-07-05 Thread Bernd Fondermann
On Mon, Jul 5, 2010 at 07:47, Bardia Afshin brandon...@gmail.com wrote:
> What's the unsubscribe link?
To unsubscribe, send mail to general-unsubscr...@hadoop.apache.org. Many Apache MLs have an unsubscribe footer. Anyone volunteering to make this happen for this list, too? Bernd

Re: Displaying Map output in MapReduce

2010-07-05 Thread Aaron Kimball
If you set the number of reduce tasks to zero, the outputs of the mappers will be sent directly to the OutputFormat. You can debug the map phase of a job by disabling the reducer and inspecting the mapper outputs, then re-enable the reducer once you've got the mapping part of the job working.
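The effect can be shown without a cluster: with zero reducers there is no shuffle or sort, so records leave in map-emission order rather than grouped by key. A minimal standalone sketch (plain Java, no Hadoop dependency; the word-count mapper is illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.AbstractMap.SimpleEntry;
import java.util.Map.Entry;

public class MapOnlySketch {
    // A stand-in for a mapper: emit (word, 1) for each token of the input line.
    static List<Entry<String, Integer>> map(String line) {
        List<Entry<String, Integer>> out = new ArrayList<>();
        for (String w : line.split("\\s+")) out.add(new SimpleEntry<>(w, 1));
        return out;
    }

    public static void main(String[] args) {
        // With numReduceTasks == 0, each mapper's output is handed
        // directly to the OutputFormat: no shuffle, no sort, no grouping.
        for (Entry<String, Integer> kv : map("hadoop hdfs hadoop")) {
            System.out.println(kv.getKey() + "\t" + kv.getValue());
        }
    }
}
```

In a real driver the switch is a single call, `job.setNumReduceTasks(0)`, which is what makes this debugging technique cheap to toggle.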

Hadoop versions distributions

2010-07-05 Thread Evert Lammerts
There are a number of different versions and distributions of Hadoop which, as far as I understand, all differ from each other. I know that in the 0.20-append branch, files in HDFS can be appended, and that the Y! distribution (0.20.S) implements security features through Kerberos. And then there…

Is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated?

2010-07-05 Thread zhangguoping zhangguoping
Hi, is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated? I did not find a @deprecated comment in the source file in 0.20.2, but I cannot use the following: job.setOutputFormatClass(org.apache.hadoop.mapred.lib.MultipleOutputFormat). The type does not match.

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
Segel, Jay, thanks for the reply!
> Your parallelism comes from multiple tasks running on different nodes within the cloud. By default you get one map/reduce job per block. You can write your own splitter to increase this and then get more parallelism.
Sounds like an elegant solution. We can modify the…
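The arithmetic behind that remark can be sketched standalone (plain Java; the 64 MB figure is only the classic default HDFS block size, not taken from the thread): each block normally becomes one input split and hence one map task, and shrinking the maximum split size yields more, smaller splits and thus more parallelism.

```java
public class SplitMath {
    // Number of input splits for a file, given a maximum split size in bytes.
    static long numSplits(long fileSize, long maxSplitSize) {
        return (fileSize + maxSplitSize - 1) / maxSplitSize; // ceiling division
    }

    public static void main(String[] args) {
        long block = 64L * 1024 * 1024;      // classic default HDFS block size
        long file = 1024L * 1024 * 1024;     // a 1 GB input file
        System.out.println(numSplits(file, block));     // one map task per block: 16
        System.out.println(numSplits(file, block / 4)); // smaller splits -> 64 tasks
    }
}
```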

Re: Why single thread for HDFS?

2010-07-05 Thread Todd Lipcon
On Mon, Jul 5, 2010 at 5:08 AM, elton sky eltonsky9...@gmail.com wrote:
> Segel, Jay, thanks for the reply! Your parallelism comes from multiple tasks running on different nodes within the cloud. By default you get one map/reduce job per block. You can write your own splitter to increase this and…

Re: Hadoop versions distributions

2010-07-05 Thread Todd Lipcon
On Mon, Jul 5, 2010 at 1:12 AM, Evert Lammerts evert.lamme...@sara.nl wrote:
> There are a number of different versions and distributions of Hadoop which, as far as I understand, all differ from each other. I know that in the 0.20-append branch, files in HDFS can be appended, and that the Y!…

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
> There's actually an open ticket somewhere to make distcp do this using the new concat() API in the NameNode.
Where can I find that open ticket?
> concat() allows several files to be combined into one file at the metadata level, so long as a number of restrictions are met. The work hasn't been done…
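Conceptually, what concat() buys you: the source files' block lists are spliced onto the target in the NameNode's metadata, so no data bytes move at all. A toy model of that idea (plain Java; HDFS's real call is `FileSystem.concat(Path trg, Path[] srcs)`, everything else here is illustrative, and the real API adds restrictions this sketch ignores, e.g. all but the last block of each source being full-sized):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ConcatSketch {
    // A file is modeled as its ordered list of block IDs.
    static List<String> concat(List<String> target, List<List<String>> sources) {
        List<String> result = new ArrayList<>(target);
        for (List<String> src : sources) result.addAll(src); // splice block lists; no data copied
        return result;
    }

    public static void main(String[] args) {
        List<String> part0 = Arrays.asList("blk_1", "blk_2");
        List<String> part1 = Arrays.asList("blk_3");
        System.out.println(concat(part0, Arrays.asList(part1)));
        // one file, same blocks, zero bytes moved
    }
}
```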

Re: Why single thread for HDFS?

2010-07-05 Thread Allen Wittenauer
On Jul 5, 2010, at 5:01 PM, elton sky wrote:
> Well, this sounds good when you have many small files: you concat() them into a big one. I am talking about splitting a big file into blocks and copying a few blocks in parallel.
Basically, your point is that hadoop dfs -cp is relatively slow and…

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
> Basically, your point is that hadoop dfs -cp is relatively slow and could be made faster. If HDFS had a more multi-threaded design, it would make cp operations faster.
What I mean is: if we have the size of a file, we can parallelize by computing its blocks. Otherwise we couldn't. On Tue, Jul 6, 2010…

Re: Partitioned Datasets Map/Reduce

2010-07-05 Thread Hemanth Yamijala
Hi, I have written my custom partitioner for partitioning datasets. I want to partition two datasets using the same partitioner and then, in the next MapReduce job, I want each mapper to handle the same partition from the two sources and perform some function such as joining etc. How I can…
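The property this plan relies on: a partitioner is a pure function of the key, so applying the same partition function to both datasets routes equal join keys to the same partition number, letting partition i of dataset A be joined against partition i of dataset B. The core of such a function, shown standalone (plain Java; this is the same formula Hadoop's default HashPartitioner uses, while the class and key names here are illustrative):

```java
public class KeyPartitioner {
    // Same formula as Hadoop's HashPartitioner.getPartition():
    // mask off the sign bit, then take the modulus.
    static int getPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int parts = 8;
        // The same key from either dataset lands in the same partition number.
        System.out.println(getPartition("user42", parts) == getPartition("user42", parts));
        System.out.println(getPartition("user42", parts)); // some value in [0, 8)
    }
}
```

Note the sign-bit mask: `hashCode()` can be negative, and without the mask the modulus could return a negative partition number.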