RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Devaraj k
Hi Utkarsh, The usage of the jar command is like this: Usage: hadoop jar <jar> [mainClass] args... If you want the commons-math3.jar to be available for all the tasks, you can do any one of these: 1. Copy the jar file into the $HADOOP_HOME/lib dir, or 2. Use the generic option -libjars. Can you give the s

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Utkarsh Gupta
Hi Devaraj, I have already copied the required jar file into the $HADOOP_HOME/lib folder. Can you tell me where to add the generic option -libjars? The stack trace is: hadoop$ bin/hadoop jar WordCount.jar /user/hduser1/input/ /user/hduser1/output 12/04/04 12:45:51 WARN mapred.JobClient: Use GenericOptionsPa

Re: Including third party jar files in Map Reduce job

2012-04-04 Thread Bejoy Ks
Hi Utkarsh You can add third party jars to your map reduce job elegantly in the following ways 1) use -libjars: hadoop jar jarName.jar com.driver.ClassName -libjars /home/some/dir/somejar.jar 2) include the third party jars in the /lib folder while packaging your application 3) If you a

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Devaraj k
As Bejoy mentioned, if you have copied the jar to $HADOOP_HOME, then you should copy it to all the nodes in the cluster. (or) If you want to make use of the -libjars option, your application should implement Tool to support generic options. Please check the below link for more details. http://hadoo
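
A minimal sketch of the Tool-based driver being described here, using the WordCount name from this thread (the mapper/reducer wiring is elided and assumed):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Implementing Tool lets ToolRunner strip generic options such as
    // -libjars before the remaining arguments reach run().
    public class WordCount extends Configured implements Tool {
      @Override
      public int run(String[] args) throws Exception {
        Job job = new Job(getConf(), "wordcount"); // getConf(), not new Configuration()
        job.setJarByClass(WordCount.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // setMapperClass/setReducerClass/output types would be set here
        return job.waitForCompletion(true) ? 0 : 1;
      }

      public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new Configuration(), new WordCount(), args));
      }
    }

It could then be invoked along the lines of: hadoop jar WordCount.jar WordCount -libjars /path/to/commons-math3.jar /user/hduser1/input/ /user/hduser1/output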

Re: how to overwrite output in HDFS?

2012-04-04 Thread Ioan Eugen Stan
On 03.04.2012 12:34, Fang Xin wrote: Hi, all I'm writing my own map-reduce code using Eclipse with the Hadoop plug-in. I've specified input and output directories in the project properties (two folders, namely input and output). My problem is that each time when I do some modification and try to ru
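
Ioan's reply is cut off above; one common way to handle this (a sketch, not necessarily what Ioan went on to suggest) is to delete the output directory from the driver before each run:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CleanOutput {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("output"); // the directory name is an assumption
        if (fs.exists(output)) {
          fs.delete(output, true); // true = delete recursively
        }
        // ...then configure and submit the job as usual...
      }
    }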

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Utkarsh Gupta
I have tried implementing the Tool interface as mentioned at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html but the -libjars option is not working. I have copied the jar to all the nodes in the $HADOOP_HOME/lib folder but I am still getting the same error. The map task comple

Calling one MR job within another MR job

2012-04-04 Thread Stuti Awasthi
Hi all, We have a usecase in which I start with a first MR1 job with input file File1.txt, and from this job, call another MR2 job with input File2.txt. So: MRjob1{ Map(){ MRJob2(File2.txt) } } MRJob2{ Processing } My queries are: is this kind of approach possible, and how much are the

Re: Calling one MR job within another MR job

2012-04-04 Thread Ashwanth Kumar
Have you tried using Oozie <http://incubator.apache.org/oozie/>? On Wed, Apr 4, 2012 at 4:04 PM, Stuti Awasthi wrote: > Hi all, > > We have a usecase in which I start with first MR1 job with input file as > File1.txt, and from this job, call another MR2 job with input as File2.t

RE: Calling one MR job within another MR job

2012-04-04 Thread Stuti Awasthi
Hi Ashwanth, No, I have not tried Oozie. I want to attain this simply through Java Map Reduce jobs. Any ideas? From: ashwanth.ku...@gmail.com [mailto:ashwanth.ku...@gmail.com] On Behalf Of Ashwanth Kumar Sent: Wednesday, April 04, 2012 4:13 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Ca

Re: Calling one MR job within another MR job

2012-04-04 Thread Ashwanth Kumar
I have not tried it, but looking into the Oozie source code should get you some ideas, as Oozie uses something called LauncherMapper which launches other MR jobs. On Wed, Apr 4, 2012 at 4:19 PM, Stuti Awasthi wrote: > Hi Ashwanth, > > No, I have not tried Oozie. I want to attain this

Re: Calling one MR job within another MR job

2012-04-04 Thread Ashwanth Kumar
Have you tried using JobConf / JobClient for starting new jobs? Also refer here - http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining on Job Chaining. On Wed, Apr 4, 2012 at 4:19 PM, Stuti Awasthi wrote: > Hi Ashwanth, > > No, I have not tried Oozie. I want to attain

RE: Calling one MR job within another MR job

2012-04-04 Thread Ravi teja ch n v
Hi Stuti, If you are looking for MRjob2 to run after MRjob1, i.e. a job dependency, you can use the JobControl API, where you can manage the dependencies. Calling another job from a mapper is not a good idea. Thanks, Ravi Teja From: Stuti Awasthi [stutiawa
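
A minimal sketch of the JobControl approach Ravi mentions, using the old (mapred) API of that era; conf1 and conf2 are assumed to be fully configured JobConfs:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        JobConf conf1 = new JobConf(); // assumed configured for the File1 job
        JobConf conf2 = new JobConf(); // assumed configured for the File2 job
        Job job1 = new Job(conf1);
        Job job2 = new Job(conf2);
        job2.addDependingJob(job1);          // job2 starts only after job1 succeeds
        JobControl control = new JobControl("mr1-then-mr2");
        control.addJob(job1);
        control.addJob(job2);
        Thread runner = new Thread(control); // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);                // poll until both jobs complete
        }
        control.stop();
      }
    }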

RE: Calling one MR job within another MR job

2012-04-04 Thread Stuti Awasthi
Hi Ravi, There is no job dependency, so I cannot use chained MR or JobControl as you suggested. I have 2 relatively big files. I start processing with File1 as input to the MR1 job; this processing requires finding data from File2. One way to do this is to loop through File2 and get the data. Othe

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Utkarsh Gupta
Hi Devaraj, The code is running now after copying the jar to each node. I might have been making some mistake previously. Thanks Devaraj and Bejoy :) -Original Message- From: Devaraj k [mailto:devara...@huawei.com] Sent: Wednesday, April 04, 2012 2:08 PM To: mapreduce-user@hadoop.apache.org Subject:

RE: Calling one MR job within another MR job

2012-04-04 Thread Stuti Awasthi
Hi Ashwanth, My scenario is not resolved by chaining jobs, as in chaining the output of one MR job is the input to the other MR job. Nor can I use the JobControl API, as that tells Job1 to wait till Job2 is complete. In my scenario, processing of each line of File1 is dependent on simultaneous processing of Fi

Accessing local filesystem with org.apache.hadoop.fs.FileSystem

2012-04-04 Thread Pedro Costa
I'm trying to open a local file with the FileSystem class: FileSystem srcFs = FileSystem.get(conf); FSDataInputStream i = srcFs.open(p); but I get file not found. The path is correct, but I think that my class is accessing HDFS instead of my local filesystem. Can I use the FileSystem to access l

Re: Accessing local filesystem with org.apache.hadoop.fs.FileSystem

2012-04-04 Thread Ashwanth Kumar
Did you try adding file:// in front of the path? On Wed, Apr 4, 2012 at 5:43 PM, Pedro Costa wrote: > I'm trying to open a local file with the FileSystem class. > > FileSystem srcFs = FileSystem.get(conf); > FSDataInputStream i = srcFs.open(p); > > but I get file not found. The path is correc

RE: Accessing local filesystem with org.apache.hadoop.fs.FileSystem

2012-04-04 Thread Devaraj k
Please try this to access the local file system: FileSystem fileSystem = FileSystem.getLocal(conf); FSDataInputStream i = fileSystem.open(p); Thanks Devaraj From: ashwanth.ku...@gmail.com [ashwanth.ku...@gmail.com] on behalf of Ashwanth Kumar [ashwa
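
Devaraj's two lines expanded into a self-contained sketch (it dumps whatever local path is passed as the first argument to stdout):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class LocalRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // getLocal() always returns the local filesystem,
        // regardless of what fs.default.name points at.
        FileSystem local = FileSystem.getLocal(conf);
        FSDataInputStream in = local.open(new Path(args[0]));
        try {
          IOUtils.copyBytes(in, System.out, conf, false); // copy file to stdout
        } finally {
          in.close();
        }
      }
    }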

Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz
Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ The main issue here is that this tutorial is written with the old APIs (Hadoop 0.18 I

RE: Calling one MR job within another MR job

2012-04-04 Thread Ravi teja ch n v
Hi Stuti, In that case, you can run the job with the dependent file (File2) first, then go for the job using File1. Then your second mapper can use the already processed output. I guess this will solve the problem you have mentioned. Thanks, Ravi Teja From:

Re: Calling one MR job within another MR job

2012-04-04 Thread praveenesh kumar
Try looking into the distributed cache... maybe it solves your problem? Regards, Praveenesh On Wed, Apr 4, 2012 at 6:01 PM, Ravi teja ch n v wrote: > Hi Stuti, > > In that case, you can run the job with the dependent file (File2) first, then > go for the job using File1. > > Then your second mappe
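
A sketch of the distributed-cache idea, using the org.apache.hadoop.filecache API of that era (the HDFS path is an assumption for illustration):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheFile2 {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Ship File2 (already in HDFS) to every task node.
        DistributedCache.addCacheFile(new URI("/user/hduser1/File2.txt"), conf);
        // ...configure and submit the job over File1 with this conf...
        // In the mapper's configure()/setup(), the local copy is found via:
        //   Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      }
    }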

RE: Calling one MR job within another MR job

2012-04-04 Thread Devaraj k
Hi Stuti, If you want to deal with different types of files in the map phase, you can use the org.apache.hadoop.mapred.lib.MultipleInputs API (different input formats and mappers), and then the output of those mappers can be of the same type. After the map phase, the partitioner can send the map outputs from file1 and file2
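
A sketch of the MultipleInputs idea: one mapper per file, both emitting the same key/value types so the reduce phase sees matching keys from both files together. File1Mapper and File2Mapper are hypothetical classes, named here only for illustration:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class JoinDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JoinDriver.class);
        // Hypothetical mappers; each must emit the same output types.
        MultipleInputs.addInputPath(conf, new Path("/user/hduser1/File1.txt"),
            TextInputFormat.class, File1Mapper.class);
        MultipleInputs.addInputPath(conf, new Path("/user/hduser1/File2.txt"),
            TextInputFormat.class, File2Mapper.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        // ...set the joining reducer and output path, then submit via JobClient...
      }
    }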

Re: Including third party jar files in Map Reduce job

2012-04-04 Thread Harsh J
Utkarsh, A log like "12/04/04 15:21:00 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same." indicates you haven't implemented the Tool approach properly (or aren't calling its run()). On Wed, Apr 4, 2012 at 5:25 PM, Utkarsh G

RE: Calling one MR job within another MR job

2012-04-04 Thread jagatsingh
Hello Stuti The way you have explained it, it seems we can think about caching file2 on the nodes in advance. -- Just out of context, this is the same way replicated joins are handled in Pig, in which one file (file2) to be joined is cached in memory while file1 is streamed. Regards Jagat - Original

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Utkarsh Gupta
Hi Harsh, I have implemented Tool like this:
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        int rc = ToolRunner.run(configuration, new WordCount(), args);
        System.exit(rc);
    }
    @Override
    public int run(S

Re: Including third party jar files in Map Reduce job

2012-04-04 Thread Harsh J
When using Tool, do not use: Configuration conf = new Configuration(); Instead get config from the class: Configuration conf = getConf(); This is documented at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html On Wed, Apr 4, 2012 at 6:25 PM, Utkarsh Gupta wrote
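
Applied to Utkarsh's driver, the fix lands inside run(); a fragment (Utkarsh's actual run() body is truncated above):

    public int run(String[] args) throws Exception {
      // Wrong: a fresh Configuration drops everything GenericOptionsParser
      // parsed from -libjars / -files / -D:
      //   Configuration conf = new Configuration();
      // Right: reuse the Configuration that ToolRunner injected via setConf():
      Configuration conf = getConf();
      Job job = new Job(conf, "wordcount");
      // ...the rest of the job setup is unchanged...
      return job.waitForCompletion(true) ? 0 : 1;
    }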

Re: Calling one MR job within another MR job

2012-04-04 Thread Praveen Kumar K J V S
Dear Stuti, As per the mail chain I understand you want to do a SetJoin on two sets File1 and File2 with some join function F(F1,F2). On this assumption, please find my reply below: Set join is not simple, especially if the input is very large. It essentially does a cartesian product between

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread Utkarsh Gupta
Hi Harsh, This worked; this was exactly what I was looking for. The warning has gone and now I can add third party jar files using the DistributedCache.addFileToClassPath() method. Now there is no need to copy the jar to each node's $HADOOP_HOME/lib folder. Thanks a lot Utkarsh -Original Message-
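
For reference, a fragment of the call Utkarsh describes, as it would sit inside a Tool's run() (the HDFS jar path is an assumption):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    // Adds a jar already stored in HDFS to the classpath of every task,
    // so nothing needs to be copied into $HADOOP_HOME/lib on the nodes.
    Configuration conf = getConf();
    DistributedCache.addFileToClassPath(
        new Path("/user/hduser1/lib/commons-math3.jar"), conf);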

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Jagat Singh
Hello Marcos Yes, the Yahoo tutorials are pretty old but they still explain the concepts of Map Reduce and HDFS beautifully. The way the tutorials have been divided into subsections, each building on the previous one, is awesome. I remember when I started I was dug in there for many days. The t

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Marcos Ortiz
On 04/04/2012 09:15 AM, Jagat Singh wrote: Hello Marcos Yes, the Yahoo tutorials are pretty old but they still explain the concepts of Map Reduce and HDFS beautifully. The way the tutorials have been divided into subsections, each building on the previous one, is awesome. I remember when I star

Re: Including third party jar files in Map Reduce job

2012-04-04 Thread Ioan Eugen Stan
On 04.04.2012 16:01, Harsh J wrote: When using Tool, do not use: Configuration conf = new Configuration(); Instead get config from the class: Configuration conf = getConf(); This is documented at http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Tool.html I wish I k

Does the combiner always work if specified?

2012-04-04 Thread Sudip Sinha
Hi, I've read that the combiner only works if it is specified AND the sort memory buffer overflows in the mapper. http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201107.mbox/%3c374d8f3f-b8b1-499f-bedb-bfee32190...@hortonworks.com%3E But when I run a Hadoop streaming job in R using RHado

port 8080 in YARN

2012-04-04 Thread Radim Kolar
Which application/service runs on port 8080 in YARN by default? I need to change the port.

RE: Including third party jar files in Map Reduce job

2012-04-04 Thread GUOJUN Zhu
Just a note for "-libjars": all the Hadoop Tool options ("-libjars", "-files", ...) have to come before the customized options. Any option after the first one that the parser does not understand will be considered a customized option and ignored by the generic parser. Zhu, Guojun Modeling Sr Graduate 571-3824370 guojun_.

Re: port 8080 in YARN

2012-04-04 Thread Harsh J
MR Client's ShuffleHandler uses it by default. Tweakable via mapreduce.shuffle.port in mapred-site.xml. While you are at it, would you also like to offer a quick patch on https://issues.apache.org/jira/browse/MAPREDUCE-3493 to document this in mapred-default.xml file? 2012/4/4 Radim Kolar : > Whi
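
For reference, the override Harsh describes would look something like this in mapred-site.xml (the port value here is an arbitrary example):

    <property>
      <name>mapreduce.shuffle.port</name>
      <value>8081</value>
    </property>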

Sharing data between maps

2012-04-04 Thread Kevin Savage
Hi, I'm currently working on some simulation software that models engineering facilities. As input we have two big chunks of data, one about the design of the site and one about the climate the site is in. As we have an extensive set of climate data (about 1000 locations) we thought it would

Re: Sharing data between maps

2012-04-04 Thread John Armstrong
On 04/04/2012 05:00 PM, Kevin Savage wrote: However, what we have is one big file of design data that needs to go to all the maps and many big files of climate data that need to go to one map each. I've not been able to work out if there is a good way of doing this in Hadoop. It sounds like "

Re: Sharing data between maps

2012-04-04 Thread Kevin Savage
On 4 Apr 2012, at 22:07, John Armstrong wrote: > On 04/04/2012 05:00 PM, Kevin Savage wrote: >> However, what we have is one big file of design data that needs to go to all >> the maps and many big files of climate data that need to go to one map each. >> I've not been able to work out if there

Re: Yahoo Hadoop Tutorial with new APIs?

2012-04-04 Thread Robert Evans
I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate thi

(Un-)Deprecated APIs and javadoc examples

2012-04-04 Thread Steven Willis
(I just finished writing this when I noticed the email from Marcos bringing up similar issues with the Yahoo tutorials.) I've read through the following: http://www.mail-archive.com/mapreduce-dev@hadoop.apache.org/msg01833.html http://www.mail-archive.com/general@hadoop.apache.org/msg0462

Re: (Un-)Deprecated APIs and javadoc examples

2012-04-04 Thread Harsh J
Hi Steve, With 0.23 (or 2.0, going forward), it is alright to use either set. You can also continue using the new API - it is fairly complete in 2.0 and usable (compared to 1.x). There are still some shadows over API deprecation, however. We do want to support both APIs for the longer term, per our ear

Re: port 8080 in YARN

2012-04-04 Thread madhu phatak
Hi Harsh, I added a patch for MAPREDUCE-3493. Can you have a look at it and say whether it is correct? On Thu, Apr 5, 2012 at 12:36 AM, Harsh J wrote: > MR Client's ShuffleHandler uses it by default. Tweakable via > mapreduce.shuffle.port in mapred-site.xml. > > While you are at it, would you also lik

v0.20.203: How to stop creation of part-r-XXXXX files by reducer

2012-04-04 Thread Piyush Kansal
Hi Friends, In the reducer, I am dumping all the data to my customized set of files (using File I/O APIs) and thus not using the regular "context.write()". I am also creating filenames for these files at run time. This functionality is working fine. But I am also getting "0 byte" part-r-X fil

RE: Calling one MR job within another MR job

2012-04-04 Thread Stuti Awasthi
Thanks everyone, So with this discussion, there are 2 main opinions I got: 1. Do not call one MR job from inside another MR job. 2. One can use the distributed cache (but it is not good for very large files). I want to design the system so that I can do the processing efficiently. So if I run MR jo

Re: v0.20.203: How to stop creation of part-r-XXXXX files by reducer

2012-04-04 Thread Harsh J
Piyush, This has been asked many times. You'll find your answer here: http://search-hadoop.com/m/R4PzD1IoGjj2 On Thu, Apr 5, 2012 at 11:21 AM, Piyush Kansal wrote: > Hi Friends, > > In the reducer, I am dumping all the data to my customize set of files > (using File I/O APIs) and thus not using
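
The usual answer to this question (and presumably what the linked thread points at) is LazyOutputFormat, which only creates a part file once something is actually written; a sketch, noting it may not be present in every 0.20.x release:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class NoEmptyParts {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "custom-output");
        // Wrap the real output format: part-r-XXXXX files are then created
        // only when the reducer actually writes through context.write().
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        // ...rest of the job setup and submission...
      }
    }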