Re: Displaying Map output in MapReduce

2010-07-05 Thread Aaron Kimball
If you set the number of reduce tasks to zero, the outputs of the mappers will be sent directly to the OutputFormat. You can debug the map phase of a job by disabling the reduce phase and inspecting the mapper outputs, then re-enable the reducer after you've got the mapping part of the job running correctly.
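A map-only job along these lines can be configured as below. This is a minimal sketch of a new-API (0.20.x) driver; the class names and paths (DebugMapJob, MyMapper, the args) are hypothetical placeholders, not from the thread:

```java
// Sketch of a map-only driver for debugging a mapper in isolation.
// DebugMapJob, MyMapper, and the input/output paths are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DebugMapJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "map-only debug run");
    job.setJarByClass(DebugMapJob.class);
    job.setMapperClass(MyMapper.class);  // the mapper under test
    job.setNumReduceTasks(0);            // zero reducers: map output goes
                                         // straight to the OutputFormat
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Once the mapper's output files look right, restoring the reducer is just removing the setNumReduceTasks(0) call.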

Re: Can we modify files in HDFS?

2010-07-05 Thread Aaron Kimball
On Tue, Jun 29, 2010 at 2:57 AM, Steve Loughran wrote:
> elton sky wrote:
>> thanks Jeff,
>> So... it is a significant drawback. As a matter of fact, there are many cases where we need to modify.
> When people say "Hadoop filesystems are not posix", this is what they mean. No locks, no r

Hadoop versions & distributions

2010-07-05 Thread Evert Lammerts
There are a number of different versions and distributions of Hadoop which, as far as I understand, all differ from each other. I know that in the 0.20-append branch, files in HDFS can be appended, and that the Y! distribution (0.20.S) implements security features through Kerberos. And then there a

Is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated?

2010-07-05 Thread zhangguoping zhangguoping
Hi, is org.apache.hadoop.mapred.lib.MultipleOutputFormat deprecated? I did not find a @deprecated comment in the source file in 0.20.2, but I cannot use the following: job.setOutputFormatClass(org.apache.hadoop.mapred.lib.MultipleOutputFormat); the type does not match.
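One likely cause of the mismatch (an assumption; the thread does not confirm it): Job.setOutputFormatClass expects a subclass of the new-API org.apache.hadoop.mapreduce.OutputFormat, while MultipleOutputFormat belongs to the old org.apache.hadoop.mapred API, which is configured through JobConf instead. A sketch of the old-API configuration, where MyMultipleOutputFormat and MyDriver are hypothetical names:

```java
// Sketch: the old mapred API is driven through JobConf, not Job.
// MyMultipleOutputFormat would be a hypothetical subclass of
// org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);
conf.setOutputFormat(MyMultipleOutputFormat.class);  // old-API setter

// By contrast, org.apache.hadoop.mapreduce.Job.setOutputFormatClass(...)
// only accepts org.apache.hadoop.mapreduce.OutputFormat subclasses,
// hence the type mismatch when passing a mapred.lib class to it.
```

So the class is not deprecated in 0.20.2; it simply cannot be mixed with the new-API Job object.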

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
Segel, Jay, thanks for the reply!

> Your parallelism comes from multiple tasks running on different nodes within the cloud. By default you get one map/reduce job per block. You can write your own splitter to increase this and then get more parallelism.

Sounds like an elegant solution. We can modify th

Re: Why single thread for HDFS?

2010-07-05 Thread Todd Lipcon
On Mon, Jul 5, 2010 at 5:08 AM, elton sky wrote:
> Segel, Jay, thanks for reply!
>
>> Your parallelism comes from multiple tasks running on different nodes within the cloud. By default you get one map/reduce job per block. You can write your own splitter to increase this and then get more

Re: Hadoop versions & distributions

2010-07-05 Thread Todd Lipcon
On Mon, Jul 5, 2010 at 1:12 AM, Evert Lammerts wrote:
> There are a number of different versions and distributions of Hadoop which, as far as I understand, all differ from each other. I know that in the 0.20-append branch, files in HDFS can be appended, and that the Y! distribution (0.20.S)

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
> There's actually an open ticket somewhere to make distcp do this using the new concat() API in the NameNode.

Where can I find that "open ticket"?

> concat() allows several files to be combined into one file at the metadata level, so long as a number of restrictions are met. The work hasn't been

Re: Why single thread for HDFS?

2010-07-05 Thread Allen Wittenauer
On Jul 5, 2010, at 5:01 PM, elton sky wrote:
> Well, this sounds good when you have many small files: you concat() them into a big one. I am talking about splitting a big file into blocks and copying a few blocks in parallel.

Basically, your point is that hadoop dfs -cp is relatively slow and co

Re: Why single thread for HDFS?

2010-07-05 Thread elton sky
> Basically, your point is that hadoop dfs -cp is relatively slow and could be made faster. If HDFS had a more multi-threaded design, it would make cp operations faster.

What I mean is: if we have the size of a file, we can parallelize by computing its blocks. Otherwise we couldn't. On Tue, Jul 6, 2010
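The point that knowing a file's size lets you compute every block's offset up front, and therefore copy all blocks concurrently, can be sketched in plain Java outside HDFS. The block size and thread count below are arbitrary illustration values, not anything HDFS prescribes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelBlockCopy {
    // Copy src into a new array, one task per fixed-size block.
    public static byte[] copyInBlocks(byte[] src, int blockSize, int threads)
            throws InterruptedException, ExecutionException {
        byte[] dst = new byte[src.length];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> futures = new ArrayList<>();
        // Because the total size is known, every block's offset and length
        // can be computed before any copying starts, so the blocks are
        // independent and can be copied concurrently.
        for (int off = 0; off < src.length; off += blockSize) {
            final int start = off;
            final int len = Math.min(blockSize, src.length - start);
            futures.add(pool.submit(() ->
                System.arraycopy(src, start, dst, start, len)));
        }
        for (Future<?> f : futures) f.get();  // wait for every block
        pool.shutdown();
        return dst;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = new byte[1 << 20];
        new java.util.Random(42).nextBytes(data);
        byte[] copy = copyInBlocks(data, 64 * 1024, 4);
        System.out.println(java.util.Arrays.equals(data, copy)); // prints true
    }
}
```

Without the size up front, the loop bounds cannot be computed and the copy degenerates into a sequential stream read, which is the limitation being discussed.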

Re: Partitioned Datasets Map/Reduce

2010-07-05 Thread Hemanth Yamijala
Hi,

> I have written my custom partitioner for partitioning datasets. I want to partition two datasets using the same partitioner and then in the next mapreduce job, I want each mapper to handle the same partition from the two sources and perform some function such as joining etc. How I