Re: Business logic in cleanup?

2011-11-16 Thread Harsh J
I'm sure you understand all implications here so I'll just answer your questions, inline. On Thu, Nov 17, 2011 at 9:53 AM, Something Something wrote: > Is the idea of writing business logic in cleanup method of a Mapper good or > bad?  We think we can make our Mapper run faster if we keep accumulating data in a HashMap …

Business logic in cleanup?

2011-11-16 Thread Something Something
Is the idea of writing business logic in the cleanup method of a Mapper good or bad? We think we can make our Mapper run faster if we keep accumulating data in a HashMap in a Mapper, and later, in the cleanup() method, write it out. 1) Does the Map/Reduce paradigm guarantee that cleanup will always be called …
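
The accumulate-in-map, write-in-cleanup idea being asked about is the classic in-mapper combining pattern. A minimal plain-Java sketch of the control flow follows; the Context interface here is an illustrative stand-in for Hadoop's Mapper.Context, reduced to a single collect() call, and the class names are invented for this sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the accumulate-in-map / emit-in-cleanup pattern.
public class InMapperCombine {

    // Stand-in for Hadoop's Mapper.Context (illustrative, not Hadoop's API)
    interface Context { void collect(String key, long value); }

    static class CountingMapper {
        private final Map<String, Long> counts = new HashMap<>();

        // map() only accumulates; nothing is written per record
        void map(String word) {
            counts.merge(word, 1L, Long::sum);
        }

        // Hadoop runs cleanup() once after the last map() call of a task
        // attempt; if the attempt dies first, the whole attempt is retried,
        // so no partial output from the HashMap is ever committed.
        void cleanup(Context ctx) {
            for (Map.Entry<String, Long> e : counts.entrySet()) {
                ctx.collect(e.getKey(), e.getValue());
            }
        }
    }

    static Map<String, Long> demo() {
        Map<String, Long> out = new HashMap<>();
        CountingMapper m = new CountingMapper();
        for (String w : new String[] {"a", "b", "a"}) {
            m.map(w);
        }
        m.cleanup(out::put);  // "a" -> 2, "b" -> 1 emitted only here
        return out;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The tradeoff the thread hints at: the HashMap must fit in the mapper's heap for the whole task, whereas a Combiner achieves a similar reduction in map output without that memory risk.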

Re: Mahout and Hadoop

2011-11-16 Thread bejoy . hadoop
Hey Bish AFAIK the Cloudera repository has Mahout now, so maybe it is included in the latest CDH3u2 demo VM from Cloudera as well, but I'm not sure since I haven't checked yet. Please check the Cloudera downloads for more details. AFAIK Mahout is just a collection of …

Mahout and Hadoop

2011-11-16 Thread Bish Maten
I have Hadoop installed on an Ubuntu VM and next need Mahout installed on this Ubuntu virtual machine. I was hoping there is already a preconfigured Hadoop and Mahout available? The Hadoop install was simple, but the Mahout install appears to have some specific installation directories, dependencies …

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Dmitriy Ryaboy
Libjars works if your MR job is initialized correctly. Here's a code snippet: public static void main(String[] args) throws Exception { GenericOptionsParser optParser = new GenericOptionsParser(args); int exitCode = ToolRunner.run(optParser.getConfiguration(), new MyMRJob(), …
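
A plausible completion of Dmitriy's truncated snippet, written out as a driver class. This is a sketch only: it needs hadoop-core on the classpath and is not runnable standalone, and MyMRJob (the thread's job class) is assumed to implement org.apache.hadoop.util.Tool:

```java
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.ToolRunner;

public class MyMRJobDriver {
    public static void main(String[] args) throws Exception {
        // GenericOptionsParser strips -libjars/-files/-archives out of args
        // and records them in the Configuration before the job sees the
        // remaining, application-specific arguments.
        GenericOptionsParser optParser = new GenericOptionsParser(args);
        int exitCode = ToolRunner.run(optParser.getConfiguration(),
                                      new MyMRJob(),
                                      optParser.getRemainingArgs());
        System.exit(exitCode);
    }
}
```

This is why -libjars silently does nothing for some jobs later in the thread: the generic options only take effect when the main class routes its arguments through GenericOptionsParser (directly or via ToolRunner).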

RE: Enable Hyperthreading? / Typical M/R ratios and rules of thumb

2011-11-16 Thread Jeffrey Buell
> > How much RAM do you have? > > > > A good rule of thumb is to use 1-1.5G for maps and 2G per reduce > > (vmem). Ensure your OS has at least 2G of memory. > > > > Thus, with 24G and dual quad cores you should be at 8-10m/2r. Scale > up > > if you have more memory. > > Would you say RAM was the m…
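
Arun's rule of thumb can be sanity-checked with simple arithmetic. The 1.5G-per-map, 2G-per-reduce, and 2G-for-the-OS figures below are the ones quoted in the thread; the method name and structure are just for this sketch:

```java
// Back-of-the-envelope slot sizing from the rule of thumb quoted above:
// ~1.5G vmem per map, ~2G per reduce, ~2G reserved for the OS.
public class SlotSizing {

    static long mapSlots(long ramGb, long reduceSlots) {
        long osGb = 2;        // keep at least 2G for the OS
        double mapGb = 1.5;   // per-map vmem budget
        long reduceGb = 2 * reduceSlots;
        return (long) ((ramGb - osGb - reduceGb) / mapGb);
    }

    public static void main(String[] args) {
        // 24G box, 2 reduce slots -> (24 - 2 - 4) / 1.5 = 12 map slots by
        // memory alone; the thread caps it at 8-10 because a dual
        // quad-core box only has 8 cores to run them on.
        System.out.println(mapSlots(24, 2));
    }
}
```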

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
I agree. It will eventually get us in trouble. That's why we want to get the -libjars option to work, but it's not working.. arrrghhh.. It's the simplest things in engineering that take the longest time... -:) Can you see why this may not work? /Users/xyz/hadoop-0.20.2/bin/hadoop jar /Users/xy…

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Friso van Vollenhoven
Do you use the Maven jar-with-deps default assembly? That layout works too, but it will give you problems eventually when you have different classes with the same package and name. Java jar files are regular ZIP files; they can contain duplicate entries. I don't know whether your packaging creates dup…

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
Thanks Bejoy & Friso. When I use the all-in-one jar file created by Maven I get this: Mkdirs failed to create /Users/xyz/hdfs/hadoop-unjar4743660161930001886/META-INF/license Do you recall coming across this? Our 'all-in-one' jar is not exactly as you described it. It doesn't contain an …

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Friso van Vollenhoven
We usually package our jobs as a single jar with a /lib directory inside it that holds all the other jars the job code depends on. Hadoop understands this layout when run as 'hadoop jar'. So the jar layout would be something like: /META-INF/manifest.mf /com/mypackage/MyMapperClass. …
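
Concretely, the layout Friso describes looks something like the listing below (entry names are illustrative; only /META-INF and /com/mypackage appear in his message):

```
/META-INF/MANIFEST.MF
/com/mypackage/MyMapperClass.class
/com/mypackage/MyReducerClass.class
/lib/guava-r09.jar          <- dependency jars nested whole, not unpacked
/lib/commons-lang-2.6.jar
```

'hadoop jar' unpacks this and puts the jars under /lib on the task classpath. Contrast with a jar-with-dependencies assembly, which unpacks every dependency into one flat class tree, where same-named entries can silently shadow each other.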

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Bejoy Ks
Hi You can find the usage examples of libjars and files at the following Apache url http://hadoop.apache.org/common/docs/current/mapred_tutorial.html#Usage "Running wordcount example with -libjars, -files and -archives: hadoop jar hadoop-examples.jar wordcount -files cachefile.txt -libjars …
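
The tutorial command Bejoy quotes, written out in full (the jar and file names are the tutorial's own placeholders, reproduced here from memory of that page, so treat them as illustrative):

```shell
hadoop jar hadoop-examples.jar wordcount \
    -files cachefile.txt \
    -libjars mylib.jar \
    -archives myarchive.zip \
    input output
```

Note that the generic options (-files, -libjars, -archives) must come before the application's own arguments, and they only take effect when the main class routes its arguments through GenericOptionsParser/ToolRunner.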

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
Bejoy - Thanks for the reply. The '-libjars' option is not working for me with 'hadoop jar'. Also, as per the documentation (http://hadoop.apache.org/common/docs/current/commands_manual.html#jar): Generic Options The following options are supported by dfsadmin …

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Something Something
Hmm... there must be a different way 'cause we don't need to do that to run Pig jobs. On Tue, Nov 15, 2011 at 10:58 PM, Daan Gerits wrote: > There might be different ways but currently we are storing our jars onto > HDFS and register them from there. They will be copied to the machine once > the …

Re: Enable Hyperthreading? / Typical M/R ratios and rules of thumb

2011-11-16 Thread Tom Hall
Thanks Arun, On Mon, Nov 14, 2011 at 4:34 AM, Arun Murthy wrote: > How much RAM do you have? > > A good rule of thumb is to use 1-1.5G for maps and 2G per reduce > (vmem). Ensure your OS has at least 2G of memory. > > Thus, with 24G and dual quad cores you should be at 8-10m/2r. Scale up > if you …

Re: Distributing our jars to all machines in a cluster

2011-11-16 Thread Bejoy Ks
Hi To distribute application specific jars or files you can just do the same with the 'hadoop jar' command, like: hadoop jar sample.jar com.test.Samples.Application -files file1.txt,file2.csv -libjars custom_connector.jar,json_util.jar input_dir output_dir (the lists after -files and -libjars are comma separated, with no spaces). But this would happen for ev…

Re: how to implement error thresholds in a map-reduce job ?

2011-11-16 Thread Mapred Learn
Thanks Harsh for the descriptive response. This means that all mappers would finish before we can find out if there were errors, right? Even though the first mapper might have reached this threshold. Thanks. Sent from my iPhone On Nov 15, 2011, at 9:21 PM, Harsh J wrote: > Ah so the threshold is …
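
The per-mapper half of the problem can be sketched in plain Java. All class and method names below are invented for the sketch; in a real job the count would live in a Hadoop Counter, and a truly global threshold is only visible to the driver after the job completes, which is exactly the limitation this thread is discussing:

```java
// Each mapper can fail fast on its own error count, but it cannot see
// other mappers' counts mid-job; global counter totals are only
// available to the driver once the job finishes.
public class BadRecordThreshold {

    static class RecordValidator {
        private final long maxErrors;
        private long errors = 0;

        RecordValidator(long maxErrors) { this.maxErrors = maxErrors; }

        // Returns true if the record should be processed; kills the task
        // attempt (here: throws) once the local threshold is crossed.
        // In Hadoop the increment would be context.getCounter(...).increment(1).
        boolean accept(boolean parsedOk) {
            if (parsedOk) return true;
            errors++;
            if (errors > maxErrors) {
                throw new IllegalStateException(
                    "bad-record threshold exceeded: " + errors);
            }
            return false;
        }

        long errorCount() { return errors; }
    }

    public static void main(String[] args) {
        RecordValidator v = new RecordValidator(2);
        v.accept(false);
        v.accept(false);
        System.out.println(v.errorCount());  // a third failure would throw
    }
}
```

For a job-wide threshold, the usual workaround is for the driver to read the aggregated counter after the job (or poll running-job counters periodically) and fail the pipeline then, which matches the "all mappers finish first" observation above.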