Bug in FSDataSet?

2011-08-03 Thread Shai Erera
Hi, I've been trying to embed MiniDFSCluster into my unit tests for a long time, always giving up because it always failed, until yesterday I gave it another try and accidentally ran the test with an Oracle JVM (my default is IBM's), and it passed! I run on Windows 7 64-bit, with hadoop-0.20.2.jar.
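
For readers trying the same thing, here is a minimal sketch of embedding MiniDFSCluster in a JUnit test on the 0.20.x line; the test class name and the path written are illustrative, and the four-argument constructor shown is the one available in 0.20.2:

    import junit.framework.TestCase;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.MiniDFSCluster;

    // Illustrative sketch: spin up an in-process mini DFS cluster,
    // write a file, check it exists, and shut the cluster down.
    public class MiniDfsSmokeTest extends TestCase {
      public void testWriteFile() throws Exception {
        Configuration conf = new Configuration();
        // 1 datanode, format the cluster, default rack configuration
        MiniDFSCluster cluster = new MiniDFSCluster(conf, 1, true, null);
        try {
          FileSystem fs = cluster.getFileSystem();
          Path p = new Path("/test/hello.txt");   // illustrative path
          fs.create(p).close();
          assertTrue(fs.exists(p));
        } finally {
          cluster.shutdown();
        }
      }
    }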

TestLocalDFS Fail

2011-04-18 Thread Shai Erera
Hi, I've checked out Hadoop 0.20.2 from http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.2, and from Cygwin I run 'ant test-core -Dtestcase=TestLocalDFS'. The test fails. Nothing is printed to the console, but build/test/TEST-org.apache.hadoop.hdfs.TestLocalDFS.txt shows errors like

Re: Best practices for jobs with large Map output

2011-04-18 Thread Shai Erera
from the OutputFormats bundled with Hadoop. You might start there. Again, it's not clear what your goal is or what you mean by "index". Are the input records changed before being written by the reduce? Or is the purpose of this job only to concatenate index fil

Re: Best practices for jobs with large Map output

2011-04-15 Thread Shai Erera
bq. If you can change your job to handle metadata backed by a store in HDFS
I have two Mappers, one that works with HDFS and one with GPFS. The GPFS one does exactly that -- it stores the indexes in GPFS (which all Mappers and Reducers see, as a shared location) and outputs just the pointer to tha
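
A hypothetical sketch of that pattern (the mapper persists its index to a shared location and emits only a pointer); the class name, the shared path, and writeIndexToSharedStore() are all made-up placeholders:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical sketch: build the (large) index outside the map-output
    // path, e.g. on a shared filesystem such as GPFS, and emit only the
    // path to it. writeIndexToSharedStore() is a placeholder.
    public class PointerEmittingMapper
        extends Mapper<LongWritable, Text, LongWritable, Text> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String indexPath = writeIndexToSharedStore(value);
        // Emit just the pointer; the reducer opens the paths it receives.
        context.write(new LongWritable(1), new Text(indexPath));
      }

      private String writeIndexToSharedStore(Text record) throws IOException {
        // Placeholder for the real indexing + write-to-shared-storage logic.
        return "/gpfs/shared/indexes/part-" + record.hashCode();
      }
    }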

Re: Best practices for jobs with large Map output

2011-04-15 Thread Shai Erera
s? I don't mind writing some classes if that's what it takes ... Shai. On Thu, Apr 14, 2011 at 9:50 PM, Harsh J wrote: Hello Shai, On Fri, Apr 15, 2011 at 12:01 AM, Shai Erera wrote: Hi I'm running on Hadoop 0.20.2 and I have a job with the follow

Best practices for jobs with large Map output

2011-04-14 Thread Shai Erera
Hi, I'm running on Hadoop 0.20.2 and I have a job with the following nature (see the driver sketch after this list):
* Mapper outputs very large records (50 to 200 MB)
* Reducer (single) merges all those records together
* Map output key is a constant (could be a NullWritable, but currently it's a LongWritable(1))
* Reducer doesn't care
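
A minimal driver sketch for a job of that shape, assuming the new (org.apache.hadoop.mapreduce) API available in 0.20.2; the mapper and reducer stubs are placeholders standing in for the real logic:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeJobDriver {

      // Stub mapper: emits each input value under the constant key 1.
      // In the real job each value would be a 50-200 MB record.
      public static class LargeRecordMapper
          extends Mapper<LongWritable, Text, LongWritable, Text> {
        private static final LongWritable ONE = new LongWritable(1);
        @Override
        protected void map(LongWritable k, Text v, Context c)
            throws IOException, InterruptedException {
          c.write(ONE, v);
        }
      }

      // Stub reducer: sees all values for the single key and merges them.
      public static class MergingReducer
          extends Reducer<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void reduce(LongWritable k, Iterable<Text> vals, Context c)
            throws IOException, InterruptedException {
          for (Text v : vals) {
            c.write(k, v);
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "merge-large-map-output");
        job.setJarByClass(MergeJobDriver.class);
        job.setMapperClass(LargeRecordMapper.class);
        job.setReducerClass(MergingReducer.class);
        job.setNumReduceTasks(1);                     // single reducer
        job.setMapOutputKeyClass(LongWritable.class); // constant key, e.g. 1
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }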

Re: Control the number of Mappers

2010-11-25 Thread Shai Erera
has your input format. You may find http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/examples/MultiFileWordCount.html useful. On Thu, Nov 25, 2010 at 12:01 PM, Shai Erera wrote: I wasn't talking about how to configure

Re: Control the number of Mappers

2010-11-25 Thread Shai Erera
t critical if it can't be done, but it can improve the performance of my job if it can be done. Thanks, Shai. On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes wrote: Hi, 2010/11/25 Shai Erera: Is there a way to make MapReduce create exactly N Mappers? More speci

Control the number of Mappers

2010-11-25 Thread Shai Erera
Hi, is there a way to make MapReduce create exactly N Mappers? More specifically, if, say, my data can be split into 200 Mappers and I have only 100 cores, how can I ensure only 100 Mappers will be created? The number of cores is not something I know in advance, so writing a special InputFormat might
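
For context, with the old (org.apache.hadoop.mapred) API the requested map count is only a hint; a sketch of what can actually be tuned, assuming a hypothetical driver class, is:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.JobConf;

    // Illustrative only: with the old (mapred) API the requested map count
    // is passed to InputFormat.getSplits() as a hint. Block size and the
    // minimum split size still bound the result, so forcing exactly N
    // mappers generally needs a custom InputFormat whose getSplits()
    // returns exactly N splits.
    public class MapCountHint {
      public static void main(String[] args) {
        JobConf conf = new JobConf(MapCountHint.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        conf.setNumMapTasks(100);                         // hint, not a guarantee
        conf.setLong("mapred.min.split.size", 64L << 20); // merge small splits
        // ... set mapper/reducer classes, then submit with JobClient.runJob(conf)
      }
    }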

Implement Writable which de-serializes to disk

2010-11-25 Thread Shai Erera
Hi, I need to implement a Writable that contains a lot of data, and unfortunately I cannot break it down into smaller pieces. The output of a Mapper is potentially a large record, which can be of any size ranging from a few tens of MBs to a few hundreds of MBs. Is there a way for me to de-serialize the Wri
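
One possible approach, sketched below under the assumption that spilling to a local temp file is acceptable: a custom Writable whose readFields() streams the payload to disk instead of holding it in memory (this class is hypothetical, not part of Hadoop):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    // Hypothetical class, not part of Hadoop: readFields() streams the
    // payload to a local temp file instead of materializing it in memory.
    // Note the framework still moves the serialized bytes through the
    // shuffle; this only avoids holding the whole record in RAM.
    public class FileBackedWritable implements Writable {
      private File backingFile;  // holds the payload after readFields()
      private long length;

      public void write(DataOutput out) throws IOException {
        out.writeLong(length);
        FileInputStream in = new FileInputStream(backingFile);
        try {
          byte[] buf = new byte[64 * 1024];
          long remaining = length;
          while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) throw new IOException("unexpected end of backing file");
            out.write(buf, 0, n);
            remaining -= n;
          }
        } finally {
          in.close();
        }
      }

      public void readFields(DataInput in) throws IOException {
        length = in.readLong();
        backingFile = File.createTempFile("large-record", ".bin");
        FileOutputStream out = new FileOutputStream(backingFile);
        try {
          byte[] buf = new byte[64 * 1024];
          long remaining = length;
          while (remaining > 0) {
            int chunk = (int) Math.min(buf.length, remaining);
            in.readFully(buf, 0, chunk);  // DataInput offers no partial reads
            out.write(buf, 0, chunk);
            remaining -= chunk;
          }
        } finally {
          out.close();
        }
      }

      public File getBackingFile() {
        return backingFile;
      }
    }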

Re: Pipelining Mappers and Reducers

2010-08-08 Thread Shai Erera
Thursday, July 29, 2010, Ferdy Galema wrote: Very well. Could you keep us informed on how your instant merging plans work out? We're actually running a similar indexing process. It's very interesting to be able to start merging Lucene index

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
ing 400 maps and 10 reduces followed by another job with a single reducer will not benefit if the single reducer has to process the same amount of data that the previous reducers have been outputting. Therefore it completely depends on what your reducer actually does.

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
un intended) the amount of data in the pipeline. For example, running 400 maps and 10 reduces followed by another job with a single reducer will not benefit if the single reducer has to process the same amount of data that the previous reducers have been outputting. Therefore it c

Re: Pipelining Mappers and Reducers

2010-07-29 Thread Shai Erera
merged, you could use "hadoop fs -getmerge ..." to pull a merged copy of the DFS. Btw I share your opinion on keeping your Map/Reduce functions singlethreaded (thus simple) when possible. The Hadoop framework will be able to run your application concurrently by usin
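
For completeness, the -getmerge shell command mentioned above has a programmatic counterpart in FileUtil.copyMerge; a sketch with illustrative paths:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    // Sketch: merge a job's part files from DFS into a single local file,
    // similar in spirit to "hadoop fs -getmerge". Paths are illustrative.
    public class GetMergeExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem dfs = FileSystem.get(conf);
        FileSystem local = FileSystem.getLocal(conf);
        FileUtil.copyMerge(dfs, new Path("/path/to/job-output"),
                           local, new Path("/tmp/merged-output"),
                           false /* keep the sources */, conf,
                           "" /* no separator between files */);
      }
    }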

Re: Pipelining Mappers and Reducers

2010-07-28 Thread Shai Erera
y question, I believe that the copy stage may start before all mappers are finished. However, the sorting and application of your reduce function can not proceed until each mapper is finished. Could you describe your problem in more detail? Regards, Greg Lawre

Re: Pipelining Mappers and Reducers

2010-07-27 Thread Shai Erera
Thanks for the prompt response, Amogh! I'm kind of a rookie with Hadoop, so please forgive my perhaps "too rookie" questions :). Regarding "Check the property mapred.reduce.slowstart.completed.maps": from what I read here (http://hadoop.apache.org/common/docs/current/mapred-default.html), this parameter contro
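
For reference, a sketch of setting that property programmatically (the 0.80 value is illustrative; the class name is a placeholder):

    import org.apache.hadoop.conf.Configuration;

    // Sketch only: this property tells the framework what fraction of map
    // tasks must complete before reduce tasks are launched (and the copy
    // phase begins). The 0.80 here is illustrative; the 0.20.x default is 0.05.
    public class SlowstartExample {
      public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.80f);
        return conf;
      }
    }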

Pipelining Mappers and Reducers

2010-07-27 Thread Shai Erera
Hi, I have a scenario for which I'd like to write a MR job in which Mappers do some work and eventually the output of all Mappers needs to be combined by a single Reducer. Each Mapper outputs a key that is distinct from all other Mappers, meaning the Reducer.reduce() method always receives a single eleme
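
A hypothetical sketch of that "one distinct key per Mapper" pattern, keying the output by the map task id so each reduce() call receives a single value (class name and buffering strategy are illustrative only):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical illustration of "each Mapper outputs one distinct key":
    // the map task's numeric id is used as the key, so every reduce() call
    // receives exactly one value. The in-memory buffering is only for the
    // sake of a short example.
    public class DistinctKeyMapper
        extends Mapper<LongWritable, Text, IntWritable, Text> {

      private final StringBuilder combined = new StringBuilder();

      @Override
      protected void map(LongWritable key, Text value, Context context) {
        combined.append(value.toString()).append('\n');  // per-record work
      }

      @Override
      protected void cleanup(Context context)
          throws IOException, InterruptedException {
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        context.write(new IntWritable(taskId), new Text(combined.toString()));
      }
    }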