Re: Submitting and running hadoop jobs Programmatically
Madhu, Ditch the '*' in the classpath element that has the configuration directory. The directory ought to be on the classpath, not the files, AFAIK. Try it and let us know if it then picks up the proper config (right now, it's using local mode).

On Wed, Jul 27, 2011 at 10:25 AM, madhu phatak phatak@gmail.com wrote: Hi, I am submitting the job as follows:

java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1

I get the log in the CLI as below:

11/07/27 10:22:54 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/07/27 10:22:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/07/27 10:22:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/07/27 10:22:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/07/27 10:22:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/hadoop-1331241340/.staging/job_local_0001

It doesn't create any job in Hadoop.

On Tue, Jul 26, 2011 at 5:11 PM, Devaraj K devara...@huawei.com wrote: Madhu, can you check the client logs, whether any error/exception is coming while submitting the job? Devaraj K

-----Original Message----- From: Harsh J [mailto:ha...@cloudera.com] Sent: Tuesday, July 26, 2011 5:01 PM To: common-user@hadoop.apache.org Subject: Re: Submitting and running hadoop jobs Programmatically

Yes. Internally, it calls the regular submit APIs.

On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak phatak@gmail.com wrote: I am using JobControl.add() to add a job, running the JobControl in a separate thread, and using JobControl.allFinished() to see whether all jobs have completed.
Is this the same as Job.submit()?

On Tue, Jul 26, 2011 at 4:08 PM, Harsh J ha...@cloudera.com wrote: Madhu, do you get a specific error message / stack trace? Could you also paste your JT logs?

On Tue, Jul 26, 2011 at 4:05 PM, madhu phatak phatak@gmail.com wrote: Hi, I am using the same APIs, but I am not able to run the jobs by just adding the configuration files and jars. It never creates a job in Hadoop; it just shows cleaning up the staging area and fails.

On Tue, Jul 26, 2011 at 3:46 PM, Devaraj K devara...@huawei.com wrote: Hi Madhu, you can submit the jobs using the Job APIs programmatically from any system. The job submission code can be written this way:

// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);
// Specify various job-specific parameters
job.setJobName("myjob");
job.setInputPath(new Path("in"));
job.setOutputPath(new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);
// Submit the job
job.submit();

For submitting this, you need to add the Hadoop jar files and configuration files to the classpath of the application from which you want to submit the job. You can refer to these docs for more info on the Job API: http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html Devaraj K

-----Original Message----- From: madhu phatak [mailto:phatak@gmail.com] Sent: Tuesday, July 26, 2011 3:29 PM To: common-user@hadoop.apache.org Subject: Submitting and running hadoop jobs Programmatically

Hi, I am working on an open source project, Nectar (https://github.com/zinnia-phatak-dev/Nectar), where I am trying to create Hadoop jobs depending upon user input. I was using the Java Process API to run the bin/hadoop shell script to submit the jobs, but that seems like a poor approach because the process creation model is not consistent across operating systems. Is there a better way to submit the jobs than invoking the shell script?
I am using hadoop-0.21.0 and I am running my program as the same user under which Hadoop is installed. Some older threads said that if I add the configuration files to the classpath it will work fine, but I am not able to run it that way. So, has anyone tried this before? If so, could you please give detailed instructions on how to achieve it? Thanks in advance for your help. Regards, Madhukara Phatak -- Harsh J
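Harsh's fix at the top of this thread — put the conf *directory* itself on the classpath, not `conf/*` — can be sketched as follows. This is illustrative only: the paths are taken from Madhu's command line and need not exist on your machine.

```shell
# Configuration loads core-site.xml etc. as classpath *resources*, so the
# directory containing them must be on the classpath. The jar wildcards
# (lib/*) are fine: the Java 6+ launcher expands a trailing '*' to jars
# only, which is exactly why it never picks up the XML files in conf/*.
HADOOP_CONF_DIR=/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf
CP="Nectar-analytics-0.0.1-SNAPSHOT.jar:$HADOOP_CONF_DIR"
CP="$CP:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/*"
echo "$CP"
# java -cp "$CP" com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1
```

With the bare directory on the classpath, the client picks up fs.default.name and mapred.job.tracker from the site files instead of falling back to local mode (the `job_local_0001` staging path in the log above is the tell-tale sign of local mode).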
questions regarding data storage and inputformat
Hi Folks, I have a bunch of binary files which I've stored in a SequenceFile. The name of the file is the key, the data is the value, and I've stored them sorted by key. (I'm not tied to using a SequenceFile for this.) The current test data is only 50MB, but the real data will be 500MB - 1GB. My M/R job requires that its input be several of these records in the sequence file, which is determined by the key. The sorting mentioned above keeps these all packed together.

1. Any reason not to use a sequence file for this? Perhaps a MapFile? Since I've sorted it, I don't need random access, but I do need to be aware of the keys, as I need to be sure that I get all of the relevant keys sent to a given mapper.

2. Looks like I want a custom InputFormat for this, extending SequenceFileInputFormat. Do you agree? I'll gladly take some opinions on this, as I ultimately want to split based on what's in the file, which might be a little unorthodox.

3. Another idea might be to create separate seq files for chunks of records and make them non-splittable, ensuring that each goes to a single mapper. Assuming I can get away with this, see any pros/cons with that approach?

Thanks, Tom -- === Skybox is hiring. http://www.skyboximaging.com/careers/jobs
Re: Submitting and running hadoop jobs Programmatically
On 27/07/11 05:55, madhu phatak wrote: Hi I am submitting the job as follows java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1

My code to submit jobs (via a declarative configuration) is up online: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup

It's LGPL, but ask nicely and I'll change the header to Apache. That code doesn't set up the classpath by pushing out more JARs (I'm planning to push out .groovy scripts instead), but it can also poll for job completion, take a timeout (useful in small test runs), and do other things. I currently mainly use it for testing.
Re: Submitting and running hadoop jobs Programmatically
Thank you. Will have a look at it.

On Wed, Jul 27, 2011 at 3:28 PM, Steve Loughran ste...@apache.org wrote: On 27/07/11 05:55, madhu phatak wrote: Hi I am submitting the job as follows java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1

My code to submit jobs (via a declarative configuration) is up online: http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/components/submitter/SubmitterImpl.java?revision=8590&view=markup

It's LGPL, but ask nicely and I'll change the header to Apache. That code doesn't set up the classpath by pushing out more JARs (I'm planning to push out .groovy scripts instead), but it can also poll for job completion, take a timeout (useful in small test runs), and do other things. I currently mainly use it for testing.
Re: error of loading logging class
It's the problem of multiple versions of the same jar.

On Thu, Jul 21, 2011 at 5:15 PM, Steve Loughran ste...@apache.org wrote: On 20/07/11 07:16, Juwei Shi wrote: Hi, We faced a problem loading a logging class when starting the name node. It seems that Hadoop can not find commons-logging-*.jar. We have tried other commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. It does not work! The following are error logs from the starting console:

I'd drop the -api file as it isn't needed, and as you say, avoid duplicate versions. Make sure that log4j is at the same point in the class hierarchy too (e.g. in hadoop/lib). To debug commons-logging, tell it to log to stderr; it's useful in emergencies:

-Dorg.apache.commons.logging.diagnostics.dest=STDERR
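A quick way to spot the duplicate-version problem described above is to list the jars in a lib directory with their version suffixes stripped and report any artifact that appears more than once. A sketch, assuming jars follow the usual name-version.jar convention (as in hadoop/lib); the mkdir/touch lines only build a toy directory for the demonstration:

```shell
# Toy lib/ dir with a duplicated commons-logging, for demonstration.
mkdir -p lib
touch lib/commons-logging-1.0.4.jar lib/commons-logging-1.1.1.jar lib/log4j-1.2.15.jar
# Strip the directory and the -<version>.jar suffix, then report any
# artifact name occurring more than once.
ls lib/*.jar | sed -e 's#.*/##' -e 's/-[0-9][0-9a-zA-Z.]*\.jar$//' | sort | uniq -d
# → commons-logging
```

Run against the real hadoop/lib (and anything else on the daemon's classpath), any line this prints is a candidate for the "multiple versions of the same jar" conflict.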
Re: questions regarding data storage and inputformat
1. Any reason not to use a sequence file for this? Perhaps a MapFile? Since I've sorted it, I don't need random access, but I do need to be aware of the keys, as I need to be sure that I get all of the relevant keys sent to a given mapper.

MapFile *may* be better here (see my answer for 2 below).

2. Looks like I want a custom InputFormat for this, extending SequenceFileInputFormat. Do you agree? I'll gladly take some opinions on this, as I ultimately want to split based on what's in the file, which might be a little unorthodox.

If you need to split based on where certain keys are in the file, then a SequenceFile isn't a great solution. It would require that your InputFormat scan through all of the data just to find split points. Assuming you know what keys to split on ahead of time, you could use MapFiles and find the exact split point more quickly.

3. Another idea might be to create separate seq files for chunks of records and make them non-splittable, ensuring that they go to a single mapper. Assuming I can get away with this, see any pros/cons with that approach?

Separate sequence files would require the least amount of custom code. -Joey -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Re: Multiple Output Formats
Roger, Or you can take a look at Hadoop's MultipleOutputs class. Thanks. Alejandro

On Tue, Jul 26, 2011 at 11:30 PM, Luca Pireddu pire...@crs4.it wrote: On July 26, 2011 06:11:33 PM Roger Chen wrote: Hi all, I am attempting to implement MultipleOutputFormat to write data to multiple files depending on the output keys and values. Can somebody provide a working example of how to implement this in Hadoop 0.20.2? Thanks!

Hello, I have a working sample here: http://biodoop-seal.bzr.sourceforge.net/bzr/biodoop-seal/trunk/annotate/head%3A/src/it/crs4/seal/demux/DemuxOutputFormat.java It extends FileOutputFormat. -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: Cygwin not working with Hadoop and Eclipse Plugin
See (inline at ***) Cheers, A Df From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org; A Df abbey_dragonfor...@yahoo.com Sent: Tuesday, 26 July 2011, 21:29 Subject: Re: Cygwin not working with Hadoop and Eclipse Plugin A Df, On Wed, Jul 27, 2011 at 1:42 AM, A Df abbey_dragonfor...@yahoo.com wrote: Harsh: See (inline at the **) I hope it's easy to follow; as for the other responses, I was not sure how to respond to get everything into one. Sorry for top posting! Np! I don't strongly enforce a style of reply so long as it is visible, and readable :) Eric, where would I put the line below? Please explain in newbie terms, thanks: PATH=$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin *** I have added it to the PATH variable as $PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin, I hope this is correct. You'd set this in your Windows environment. A good guide (googled link): http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx The last version I'd heard had a no-complaints, fully-working eclipse plugin along with it was Hadoop 0.20.2 (although stable is 203, I've seen lots of issues pop up with the eclipse plugin from members on the ML, but someone else can comment better on whether it's fixed for 204 or is a non-issue). I've used this one personally on Windows myself and things work. I think there was just one issue one could encounter somehow and I'd covered it in a blog post some time ago, here: http://www.harshj.com/2010/07/18/making-the-eclipse-plugin-work-for-hadoop/ ** I tried to use the patch but my cygwin gives the error: bash: patch: command not found Feared you may face it. You need to install the patch program from Cygwin's package manager/installer. I believe the package name is (iirc): patchutils *** I installed patchutils but now when I reach the ant command it gives the error: bash: ant: command not found. I searched for an ant plugin but I don't see any.
How do I get to run the line ant -Declipse.home=$ECLIPSE_HOME binary? Beyond that, the tutorial at v-lad.org is the one I'd recommend following. It has worked well for me over time. ** yes, the screenshots and instructions are easy to follow, just that I seem to always have a problem with the plugin or cygwin. What specific error do you get when you load the plugin or start the daemons via the cygwin shell, etc.? It's easier for folks to answer if they see an error message or a stacktrace. I wanted to test on Windows first to get a feel for Hadoop since I am new to it and also because I am a newbie Unix/Linux user. I have been trying to follow the tutorials shown at the link above but each time I run into errors with the plugin, or the import not being recognized, or JAVA_HOME not being set. Can I please get some help? Thanks. I'd say use Linux when/where possible. A VM is a good choice as well, as James pointed out above, if your hardware can handle it. ** Harsh and James, I tried the vmware from the Yahoo tutorial but I had problems with the plugin too. You can set up a raw linux VM and install stuff atop. I've had better success with the VMs Cloudera offers: https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM (They come ready with the whole stack). But basically it all boils down to using a Linux VM, wherever you source it from. -- Harsh J
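On the PATH question in this thread: for a Cygwin shell, Eric's line can also go into ~/.bashrc so it is applied to every new shell. A minimal sketch (paths assume a default C:\cygwin install, as in the thread):

```shell
# Append Cygwin's bin directories so tools like patch and ant resolve
# from any shell. In Cygwin, Windows drive C: appears as /cygdrive/c.
export PATH="$PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin"
echo "$PATH"
```

Note that ant itself is not an Eclipse plugin but a standalone build tool (Apache Ant); installing it and putting its bin directory on PATH the same way is what makes `ant -Declipse.home=$ECLIPSE_HOME binary` runnable from the shell.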
cygwin not connecting to Hadoop server
Hi All: I have Hadoop 0.20.2 and I am using cygwin on Windows 7. I modified the files as shown below for the Hadoop configuration.

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>
</configuration>

Then I have the PATH variable with $PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin. I added JAVA_HOME to the file cygwin\home\Williams\hadoop-0.20.2\conf\hadoop-env.sh. My Java home is now at C:\Java\jdk1.6.0_26, so there is no space in the path. I also turned off my firewall. However, I get the error from the command line:

CODE
Williams@TWilliams-LTPC ~ $ pwd
/home/Williams
Williams@TWilliams-LTPC ~ $ cd hadoop-0.20.2
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/start-all.sh
starting namenode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-namenode-TWilliams-LTPC.out
localhost: starting datanode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-datanode-TWilliams-LTPC.out
localhost: starting secondarynamenode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-secondarynamenode-TWilliams-LTPC.out
starting jobtracker, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-jobtracker-TWilliams-LTPC.out
localhost: starting tasktracker, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-tasktracker-TWilliams-LTPC.out
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/hadoop fs -put conf input
11/07/27 17:11:28 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 0 time(s).
11/07/27 17:11:30 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 1 time(s).
11/07/27 17:11:32 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 2 time(s).
11/07/27 17:11:34 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 3 time(s).
11/07/27 17:11:36 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 4 time(s).
11/07/27 17:11:38 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 5 time(s).
11/07/27 17:11:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 6 time(s).
11/07/27 17:11:43 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 7 time(s).
11/07/27 17:11:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 8 time(s).
11/07/27 17:11:47 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 9 time(s).
Bad connection to FS. command aborted.
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/hadoop fs -put conf input
11/07/27 17:17:29 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 0 time(s).
11/07/27 17:17:31 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 1 time(s).
11/07/27 17:17:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 2 time(s).
11/07/27 17:17:35 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 3 time(s).
11/07/27 17:17:37 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 4 time(s).
11/07/27 17:17:39 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 5 time(s).
11/07/27 17:17:41 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 6 time(s).
11/07/27 17:17:44 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 7 time(s).
11/07/27 17:17:46 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 8 time(s).
11/07/27 17:17:48 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 9 time(s).
Bad connection to FS. command aborted.
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ ping 127.0.0.1:9100
Ping request could not find host 127.0.0.1:9100. Please check the name and try again.
/CODE

I am not sure why the address seems to have localhost/127.0.0.1, which seems to be repeating itself. The conf files are fine. I also know that when Hadoop is running there is a web interface to check, but do the default ones work from cygwin, which are:
* NameNode - http://localhost:50070/
* JobTracker - http://localhost:50030/
I wanted to give cygwin a try once more before just switching to a Cloudera Hadoop VMware image. I was hoping that it would not have so many problems just to get it working on Windows! Thanks again. Cheers, A Df
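A note on diagnosing the retries above: "Retrying connect to server: localhost/127.0.0.1:9100" almost always means nothing is listening on that port, i.e. the NameNode died or never started (and `ping` cannot take a port number, which is why `ping 127.0.0.1:9100` fails). A bash-only sketch for checking the port before digging into the config (bash's /dev/tcp pseudo-device opens a TCP connection):

```shell
# Returns "open" if something accepts a TCP connection on host:port,
# "closed" otherwise (connection errors are silenced).
check_port() {
  if (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null; then
    echo "open"
  else
    echo "closed"
  fi
}
# If this prints "closed", the NameNode isn't running - check its .log
# file under the logs/ directory (not just the .out file).
check_port localhost 9100
```

The usual culprit in this scenario is an unformatted or mis-formatted NameNode, which is what the reply below asks about.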
Re: cygwin not connecting to Hadoop server
Hi A Df, Did you format the NameNode first? Can you check the NN logs whether the NN is started or not? Regards, Uma

** This email and its attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained here in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient(s) is prohibited. If you receive this email in error, please notify the sender by phone or email immediately and delete it! *

- Original Message - From: A Df abbey_dragonfor...@yahoo.com Date: Wednesday, July 27, 2011 9:55 pm Subject: cygwin not connecting to Hadoop server To: common-user@hadoop.apache.org common-user@hadoop.apache.org

Hi All: I have Hadoop 0.20.2 and I am using cygwin on Windows 7. I modified the files as shown below for the Hadoop configuration.

conf/core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9100</value>
  </property>
</configuration>

conf/hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

conf/mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9101</value>
  </property>
</configuration>

Then I have the PATH variable with $PATH:/cygdrive/c/cygwin/bin:/cygdrive/c/cygwin/usr/bin. I added JAVA_HOME to the file cygwin\home\Williams\hadoop-0.20.2\conf\hadoop-env.sh. My Java home is now at C:\Java\jdk1.6.0_26, so there is no space in the path. I also turned off my firewall. However, I get the error from the command line:

CODE
Williams@TWilliams-LTPC ~ $ pwd
/home/Williams
Williams@TWilliams-LTPC ~ $ cd hadoop-0.20.2
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/start-all.sh
starting namenode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-namenode-TWilliams-LTPC.out
localhost: starting datanode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-datanode-TWilliams-LTPC.out
localhost: starting secondarynamenode, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-secondarynamenode-TWilliams-LTPC.out
starting jobtracker, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-jobtracker-TWilliams-LTPC.out
localhost: starting tasktracker, logging to /home/Williams/hadoop-0.20.2/bin/../logs/hadoop-Williams-tasktracker-TWilliams-LTPC.out
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/hadoop fs -put conf input
11/07/27 17:11:28 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 0 time(s).
11/07/27 17:11:30 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 1 time(s).
11/07/27 17:11:32 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 2 time(s).
11/07/27 17:11:34 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 3 time(s).
11/07/27 17:11:36 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 4 time(s).
11/07/27 17:11:38 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 5 time(s).
11/07/27 17:11:40 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 6 time(s).
11/07/27 17:11:43 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 7 time(s).
11/07/27 17:11:45 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 8 time(s).
11/07/27 17:11:47 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 9 time(s).
Bad connection to FS. command aborted.
Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/hadoop fs -put conf input
11/07/27 17:17:29 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 0 time(s).
11/07/27 17:17:31 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 1 time(s).
11/07/27 17:17:33 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 2 time(s).
11/07/27 17:17:35 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 3 time(s).
11/07/27 17:17:37 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 4 time(s).
11/07/27 17:17:39 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 5 time(s).
11/07/27 17:17:41 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:9100. Already tried 6 time(s).
11/07/27 17:17:44 INFO
cannot get configuration settings from API
Good afternoon, while writing a MapReduce job, I need to get the value of some configuration settings. For instance, I need to get the value of dfs.write.packet.size inside the reducer, so I write, using the context of the reducer:

Configuration the_conf = context.getConfiguration();
int data_packet_size = the_conf.getInt("dfs.write.packet.size", 0);

However, this does not return 64KB (which is the default value); it gives 0 instead. Could you please help me by telling me how I can get and set the value of these configuration parameters? Thank you in advance, Sofia
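One likely explanation, offered as an educated guess rather than a confirmed answer from this thread: Configuration.getInt() returns the supplied fallback (here 0) whenever the key is simply absent from the Configuration object, and dfs.write.packet.size is an HDFS default that is not necessarily loaded into the Configuration a MapReduce task sees. Declaring the value explicitly in the job's site configuration makes it visible to getInt(); a hypothetical hdfs-site.xml fragment:

```xml
<!-- Make the packet size an explicit, queryable setting.
     65536 bytes = the 64KB default mentioned above. -->
<property>
  <name>dfs.write.packet.size</name>
  <value>65536</value>
</property>
```

Alternatively, the driver can call conf.setInt("dfs.write.packet.size", 65536) on the job's Configuration before submission, which propagates the value to mappers and reducers.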
Re: cygwin not connecting to Hadoop server
See inline at **. More questions and many thanks. :D From: Uma Maheswara Rao G 72686 mahesw...@huawei.com To: common-user@hadoop.apache.org; A Df abbey_dragonfor...@yahoo.com Cc: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Wednesday, 27 July 2011, 17:31 Subject: Re: cygwin not connecting to Hadoop server Hi A Df, Did you format the NameNode first? ** I had formatted it already, but then I reinstalled Java and upgraded the plugins in cygwin, so I reformatted it again. :D Yes, it worked!! I am not sure of all the steps that got it to finally work, but I will have to document it to prevent this headache in the future. Although I typed ssh localhost too, so the question is, do I need to type ssh localhost each time I need to run hadoop? Also, since I need to work with Eclipse, maybe you can have a look at my post about the plugin because I can't get the patch to work. The subject is Re: Cygwin not working with Hadoop and Eclipse Plugin. I plan to read up on how to write programs for Hadoop. I am using the tutorial at Yahoo, but if you know of any really good ones about coding with Hadoop, or just about understanding Hadoop, then please let me know. Can you check the NN logs whether NN is started or not? ** I checked, and the previous runs had some logs missing, but now the last one has all 5 logs and I got two conf files in xml. I also copied out the other output files, which I plan to examine. Where do I specify the output extension format that I want for my output file? I was hoping for a txt file; it shows the output in a file with no extension, even though I can read it in Notepad++. I also got to view the web interface at: NameNode - http://localhost:50070/ JobTracker - http://localhost:50030/ ** See below for the working version, finally!!
Thanks CMD Williams@TWilliams-LTPC ~/hadoop-0.20.2 $ bin/hadoop jar hadoop-0.20.2-examples.jar grep input 11/07/27 17:42:20 INFO mapred.FileInputFormat: Total in 11/07/27 17:42:20 INFO mapred.JobClient: Running job: j 11/07/27 17:42:21 INFO mapred.JobClient: map 0% reduce 11/07/27 17:42:33 INFO mapred.JobClient: map 15% reduc 11/07/27 17:42:36 INFO mapred.JobClient: map 23% reduc 11/07/27 17:42:39 INFO mapred.JobClient: map 38% reduc 11/07/27 17:42:42 INFO mapred.JobClient: map 38% reduc 11/07/27 17:42:45 INFO mapred.JobClient: map 53% reduc 11/07/27 17:42:48 INFO mapred.JobClient: map 69% reduc 11/07/27 17:42:51 INFO mapred.JobClient: map 76% reduc 11/07/27 17:42:54 INFO mapred.JobClient: map 92% reduc 11/07/27 17:42:57 INFO mapred.JobClient: map 100% redu 11/07/27 17:43:06 INFO mapred.JobClient: map 100% redu 11/07/27 17:43:09 INFO mapred.JobClient: Job complete: 11/07/27 17:43:09 INFO mapred.JobClient: Counters: 18 11/07/27 17:43:09 INFO mapred.JobClient: Job Counters 11/07/27 17:43:09 INFO mapred.JobClient: Launched r 11/07/27 17:43:09 INFO mapred.JobClient: Launched m 11/07/27 17:43:09 INFO mapred.JobClient: Data-local 11/07/27 17:43:09 INFO mapred.JobClient: FileSystemCo 11/07/27 17:43:09 INFO mapred.JobClient: FILE_BYTES 11/07/27 17:43:09 INFO mapred.JobClient: HDFS_BYTES 11/07/27 17:43:09 INFO mapred.JobClient: FILE_BYTES 11/07/27 17:43:09 INFO mapred.JobClient: HDFS_BYTES 11/07/27 17:43:09 INFO mapred.JobClient: Map-Reduce F 11/07/27 17:43:09 INFO mapred.JobClient: Reduce inp 11/07/27 17:43:09 INFO mapred.JobClient: Combine ou 11/07/27 17:43:09 INFO mapred.JobClient: Map input 11/07/27 17:43:09 INFO mapred.JobClient: Reduce shu 11/07/27 17:43:09 INFO mapred.JobClient: Reduce out 11/07/27 17:43:09 INFO mapred.JobClient: Spilled Re 11/07/27 17:43:09 INFO mapred.JobClient: Map output 11/07/27 17:43:09 INFO mapred.JobClient: Map input 11/07/27 17:43:09 INFO mapred.JobClient: Combine in 11/07/27 17:43:09 INFO mapred.JobClient: Map output 11/07/27 17:43:09 INFO 
mapred.JobClient: Reduce inp 11/07/27 17:43:09 WARN mapred.JobClient: Use GenericOpt e arguments. Applications should implement Tool for the 11/07/27 17:43:09 INFO mapred.FileInputFormat: Total in 11/07/27 17:43:09 INFO mapred.JobClient: Running job: j 11/07/27 17:43:10 INFO mapred.JobClient: map 0% reduce 11/07/27 17:43:22 INFO mapred.JobClient: map 100% redu 11/07/27 17:43:31 INFO mapred.JobClient: map 100% redu 11/07/27 17:43:36 INFO mapred.JobClient: map 100% redu 11/07/27 17:43:38 INFO mapred.JobClient: Job complete: 11/07/27 17:43:39 INFO mapred.JobClient: Counters: 18 11/07/27 17:43:39 INFO mapred.JobClient: Job Counters 11/07/27 17:43:39 INFO mapred.JobClient: Launched r 11/07/27 17:43:39 INFO mapred.JobClient: Launched m 11/07/27 17:43:39 INFO mapred.JobClient: Data-local 11/07/27 17:43:39 INFO mapred.JobClient: FileSystemCo 11/07/27 17:43:39 INFO mapred.JobClient: FILE_BYTES 11/07/27 17:43:39 INFO mapred.JobClient: HDFS_BYTES 11/07/27 17:43:39 INFO mapred.JobClient: FILE_BYTES 11/07/27 17:43:39 INFO
Re: questions regarding data storage and inputformat
3. Another idea might be to create separate seq files for chunks of records and make them non-splittable, ensuring that they go to a single mapper. Assuming I can get away with this, see any pros/cons with that approach? Separate sequence files would require the least amount of custom code. Thanks for the response, Joey. So, if I were to do the above, I would still need a custom record reader to put all the keys and values together, right? Thanks, Tom -- === Skybox is hiring. http://www.skyboximaging.com/careers/jobs
OSX starting hadoop error
All, when starting hadoop on OSX I am getting this error. Is there a fix for it?

java[22373:1c03] Unable to load realm info from SCDynamicStore
RE: Hadoop-streaming using binary executable c program
Hi Bobby, I just want to ask you if there is a way of using a reducer, or something like concatenation, to glue my outputs from the mapper and output them as a single file and segment of the predicted RNA 2D structure? FYI: I have used -reducer NONE before:

HADOOP_HOME$ bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output /user/yehdego/RF-out -reducer NONE -verbose

and a sample of my output, using the mapper on two different slave nodes, looks like this:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGC and [...(((...))).]. (-13.46)
GGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU .(((.((......).. (-11.00)

and I want to concatenate and output them as a single predicted RNA sequence structure:

AUACCCGCAAAUUCACUCAAAUCUGUAAUAGGUUUGUCAUUCAAAUCUAGUGCAAAUAUUACUUUCGCCAAUUAGGUAUAAUAAUGGUAAGCGGGACAAGACUCGACAUUUGAUACACUAUUUAUCAAUGGAUGUCUUCU [...(((...))).]..(((.((......).. 

Regards, Daniel T. Yehdego Computational Science Program University of Texas at El Paso, UTEP dtyehd...@miners.utep.edu

From: dtyehd...@miners.utep.edu To: common-user@hadoop.apache.org Subject: RE: Hadoop-streaming using binary executable c program Date: Tue, 26 Jul 2011 16:23:10 + Good afternoon Bobby, Thanks so much, now it's working excellently, and the speed is also reasonable. Once again, thank you. Regards, Daniel T. Yehdego Computational Science Program University of Texas at El Paso, UTEP dtyehd...@miners.utep.edu

From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Mon, 25 Jul 2011 14:47:34 -0700 Subject: Re: Hadoop-streaming using binary executable c program This is likely to be slow and it is not ideal. The ideal would be to modify pknotsRG to be able to read from stdin, but that may not be possible.
The shell script would probably look something like the following:

#!/bin/sh
rm -f temp.txt
while read line
do
  echo "$line" >> temp.txt
done
exec pknotsRG temp.txt

Place it in a file, say hadoopPknotsRG. Then you probably want to run:

chmod +x hadoopPknotsRG

After that you want to test it with:

hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | ./hadoopPknotsRG

If that works, then you can try it with Hadoop streaming:

HADOOP_HOME$ bin/hadoop jar /data/yehdego/hadoop-0.20.2/hadoop-0.20.2-streaming.jar -mapper ./hadoopPknotsRG -file /data/yehdego/hadoop-0.20.2/pknotsRG -file /data/yehdego/hadoop-0.20.2/hadoopPknotsRG -input /user/yehdego/RF00028_B.bpseqL3G5_seg_Centered_Method.txt -output /user/yehdego/RF-out -reducer NONE -verbose

--Bobby

On 7/25/11 3:37 PM, Daniel Yehdego dtyehd...@miners.utep.edu wrote: Good afternoon Bobby, Thanks, you gave me great help in finding out what the problem was. After I ran the command line you suggested, I found out that there was a segmentation error. The binary executable program pknotsRG only reads a file with a sequence in it. This means there should be a shell script, as you have said, that will take the data coming from stdin and write it to a temporary file. Any idea how to do this job in a shell script? The thing is, I am from a biology background and don't have much experience in CS. Looking forward to hearing from you. Thanks so much. Regards, Daniel T. Yehdego Computational Science Program University of Texas at El Paso, UTEP dtyehd...@miners.utep.edu From: ev...@yahoo-inc.com To: common-user@hadoop.apache.org Date: Fri, 22 Jul 2011 12:39:08 -0700 Subject: Re: Hadoop-streaming using binary executable c program I would suggest that you do the following to help you debug.

hadoop fs -cat /user/yehdego/RNAData/RF00028_B.bpseqL3G5_seg_Centered_Method.txt | head -2 | /data/yehdego/hadoop-0.20.2/pknotsRG-1.3/src/pknotsRG -

This is simulating what hadoop streaming is doing.
Here we are taking the first two lines of the input file and feeding them to the stdin of pknotsRG. The first step is to make sure that you can get your program to run correctly with something like this. You may need to change the command line of pknotsRG to get it to read the data it is processing from stdin instead of from a file. Alternatively, you may need to write a shell script that takes the data coming from stdin, writes it to a file, and then calls pknotsRG on that file.
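Daniel's question at the top of this thread, gluing the per-segment mapper outputs into one predicted structure, is essentially the job of a single concatenating reducer. Below is a Hadoop-free Java sketch of just that merge step; the SegmentConcatenator class and its (sequence, structure) pair layout are illustrative assumptions, not part of the streaming API:

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for what a single concatenating reducer would do: collect each
// mapper's (sequence, structure) segment in order and glue the pieces
// into one sequence line and one matching dot-bracket structure line.
public class SegmentConcatenator {
    public static String[] concatenate(List<String[]> segments) {
        StringBuilder sequence = new StringBuilder();
        StringBuilder structure = new StringBuilder();
        for (String[] segment : segments) {
            sequence.append(segment[0]);   // predicted RNA sequence piece
            structure.append(segment[1]);  // matching structure piece
        }
        return new String[] { sequence.toString(), structure.toString() };
    }

    public static void main(String[] args) {
        List<String[]> segments = new ArrayList<>();
        segments.add(new String[] { "AUACCC", "[...((" });
        segments.add(new String[] { "GGGACA", ".(((.(" });
        String[] merged = concatenate(segments);
        System.out.println(merged[0]);
        System.out.println(merged[1]);
    }
}
```

In streaming terms, replacing -reducer NONE with a single reduce task whose reducer concatenates its input lines would have the same effect; the exact reducer command to use is left open here.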
Re: questions regarding data storage and inputformat
You could either use a custom RecordReader, or you could override the run() method on your Mapper class to do the merging before calling the map() method.

-Joey

On Wed, Jul 27, 2011 at 11:09 AM, Tom Melendez t...@supertom.com wrote:

3. Another idea might be to create separate seq files for each chunk of records and make them non-splittable, ensuring that they go to a single mapper. Assuming I can get away with this, do you see any pros/cons with that approach?

Separate sequence files would require the least amount of custom code.

Thanks for the response, Joey. So, if I were to do the above, I would still need a custom record reader to put all the keys and values together, right?

Thanks, Tom

-- === Skybox is hiring. http://www.skyboximaging.com/careers/jobs

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
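Joey's second suggestion, overriding run() to merge records before map() sees them, can be sketched without the Hadoop classes. In real code you would subclass org.apache.hadoop.mapreduce.Mapper and loop over context.nextKeyValue(); the MergingMapper below is only a self-contained stand-in, with an assumed fixed group size:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Simplified, Hadoop-free sketch of overriding the mapper's run loop to
// buffer consecutive records and merge them before invoking map().
// MergingMapper and its method shapes are illustrative, not the Hadoop API.
public class MergingMapper {
    private final int groupSize;
    public final List<String> output = new ArrayList<>();

    public MergingMapper(int groupSize) { this.groupSize = groupSize; }

    // Stands in for Mapper.run(Context): pull records, merge, then map.
    public void run(Iterator<String> records) {
        List<String> buffer = new ArrayList<>();
        while (records.hasNext()) {
            buffer.add(records.next());
            if (buffer.size() == groupSize) {
                map(String.join("\n", buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            map(String.join("\n", buffer)); // flush the trailing partial group
        }
    }

    // Stands in for Mapper.map(): here it just records the merged value.
    protected void map(String mergedValue) { output.add(mergedValue); }
}
```

The same buffering could live inside a custom RecordReader instead, which keeps the Mapper untouched at the cost of more plumbing.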
RE: Build Hadoop 0.20.2 from source
Hi Vighnesh, Also, Cloudera has a decent screencast that walks you through building in Eclipse:

http://www.cloudera.com/blog/2009/04/configuring-eclipse-for-hadoop-development-a-screencast/
http://wiki.apache.org/hadoop/EclipseEnvironment

-Eric

-Original Message- From: Uma Maheswara Rao G 72686 [mailto:mahesw...@huawei.com] Sent: Wednesday, July 27, 2011 12:47 AM To: common-user@hadoop.apache.org Subject: Re: Build Hadoop 0.20.2 from source

Hi Vighnesh,

Step 1) Download the code base from the Apache svn repository.
Step 2) In the root folder you will find the build.xml file. In that folder just execute a) ant and b) ant eclipse; this will generate the Eclipse project settings files.

After this you can import the project directly into Eclipse.

Regards, Uma

- Original Message - From: Vighnesh Avadhani vighnesh.avadh...@gmail.com Date: Wednesday, July 27, 2011 11:08 am Subject: Build Hadoop 0.20.2 from source To: common-user@hadoop.apache.org

Hi, I want to build Hadoop 0.20.2 from source using the Eclipse IDE. Can anyone help me with this?

Regards, Vighnesh
Replication and failure
Just trying to understand what happens if there are 3 nodes with replication set to 3 and one node fails. Do the writes fail too? If there is a link I can look at, that would be great. I tried searching but didn't find any definitive answer. Thanks, Mohit
File System Counters.
Hello, I don't know if this question has been answered already. I am trying to understand the overlap between FILE_BYTES_READ and HDFS_BYTES_READ. What are the various components that contribute to this counter? For example, when I see FILE_BYTES_READ for a specific task (Map or Reduce), is it purely due to the spill during the sort phase? If an HDFS read happens on a non-local node, does the counter increase on the node where the data block resides? What happens when the data is local? Does the counter increase for both HDFS_BYTES_READ and FILE_BYTES_READ? From the values I am seeing, this looks to be the case, but I am not sure. I am not very fluent in Java, so I don't fully understand the source. :-( Raj
Re: Submitting and running hadoop jobs Programmatically
Thank you Harsh. I am able to run the jobs by ditching the '*'.

On Wed, Jul 27, 2011 at 11:41 AM, Harsh J ha...@cloudera.com wrote:

Madhu, Ditch the '*' in the classpath element that has the configuration directory. The directory ought to be on the classpath, not the files, AFAIK. Try it and let us know if it then picks up the proper config (right now, it's using the local mode).

On Wed, Jul 27, 2011 at 10:25 AM, madhu phatak phatak@gmail.com wrote:

Hi, I am submitting the job as follows:

java -cp Nectar-analytics-0.0.1-SNAPSHOT.jar:/home/hadoop/hadoop-for-nectar/hadoop-0.21.0/conf/*:$HADOOP_COMMON_HOME/lib/*:$HADOOP_COMMON_HOME/* com.zinnia.nectar.regression.hadoop.primitive.jobs.SigmaJob input/book.csv kkk11fffrrw 1

I get the log in the CLI as below:

11/07/27 10:22:54 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/07/27 10:22:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/07/27 10:22:54 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/07/27 10:22:54 WARN mapreduce.JobSubmitter: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/07/27 10:22:54 INFO mapreduce.JobSubmitter: Cleaning up the staging area file:/tmp/hadoop-hadoop/mapred/staging/hadoop-1331241340/.staging/job_local_0001

It doesn't create any job in Hadoop.

On Tue, Jul 26, 2011 at 5:11 PM, Devaraj K devara...@huawei.com wrote:

Madhu, Can you check the client logs to see whether any error/exception occurs while submitting the job?

Devaraj K

-Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Tuesday, July 26, 2011 5:01 PM To: common-user@hadoop.apache.org Subject: Re: Submitting and running hadoop jobs Programmatically

Yes. Internally, it calls the regular submit APIs.
On Tue, Jul 26, 2011 at 4:32 PM, madhu phatak phatak@gmail.com wrote:

I am using JobControl.add() to add a job, running the job control in a separate thread, and using JobControl.allFinished() to see whether all jobs have completed. Does this work the same as Job.submit()?

On Tue, Jul 26, 2011 at 4:08 PM, Harsh J ha...@cloudera.com wrote:

Madhu, Do you get a specific error message / stack trace? Could you also paste your JT logs?

On Tue, Jul 26, 2011 at 4:05 PM, madhu phatak phatak@gmail.com wrote:

Hi, I am using the same APIs, but I am not able to run the jobs by just adding the configuration files and jars. It never creates a job in Hadoop; it just shows "cleaning up staging area" and fails.

On Tue, Jul 26, 2011 at 3:46 PM, Devaraj K devara...@huawei.com wrote:

Hi Madhu, You can submit jobs programmatically using the Job APIs from any system. The job submission code can be written this way:

// Create a new Job
Job job = new Job(new Configuration());
job.setJarByClass(MyJob.class);

// Specify various job-specific parameters
job.setJobName("myjob");
job.setInputPath(new Path("in"));
job.setOutputPath(new Path("out"));
job.setMapperClass(MyJob.MyMapper.class);
job.setReducerClass(MyJob.MyReducer.class);

// Submit the job
job.submit();

For submitting this, you need to add the Hadoop jar files and configuration files to the classpath of the application from which you want to submit the job. You can refer to these docs for more info on the Job APIs:

http://hadoop.apache.org/mapreduce/docs/current/api/org/apache/hadoop/mapreduce/Job.html

Devaraj K

-Original Message- From: madhu phatak [mailto:phatak@gmail.com] Sent: Tuesday, July 26, 2011 3:29 PM To: common-user@hadoop.apache.org Subject: Submitting and running hadoop jobs Programmatically

Hi, I am working on an open source project, Nectar (https://github.com/zinnia-phatak-dev/Nectar), where I am trying to create Hadoop jobs depending upon the user input. I was using the Java Process API to run the bin/hadoop shell script to submit the jobs.
But that doesn't seem like a good way, because the process creation model is not consistent across different operating systems. Is there a better way to submit the jobs than invoking the shell script? I am using the hadoop-0.21.0 version, and I am running my program as the same user under which hadoop is installed. Some older threads said that if I add the configuration files to the classpath it will work fine, but I am not able to run it that way. So has anyone tried this before? If so, can you please give detailed instructions on how to achieve it? Thanks in advance for your help. Regards,
Hadoop Question
Hi All, How can I determine if a file is being written to (by any thread) in HDFS? I have a continuous process on the master node which is tracking a particular folder in HDFS for files to process. On the slave nodes, I am creating files in the same folder using the following code:

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FileSystem;
import java.io.OutputStream;

OutputStream oStream = fileSystem.create(path);
IOUtils.write("Some String", oStream);
IOUtils.closeQuietly(oStream);

At the master node, I am getting the earliest modified file in the folder. At times, when I try reading the file, I get nothing in it, most likely because the slave is still finishing writing to the file. Is there any way to somehow tell the master that the slave is still writing to the file, and to check the file some time later for the actual content?

Thanks,
--
Nitin Khandelwal
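One common answer to this kind of visibility problem (an assumption on my part, not something confirmed in this thread) is to have the writer create the file under a temporary name and rename it into place only after closing the stream, so the watching process never picks up a half-written file. With HDFS the rename would go through FileSystem.rename(); the sketch below demonstrates the pattern on the local filesystem with java.nio, and the class and method names are illustrative:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Write-then-rename convention: the writer produces "name.tmp", and only
// after the data is fully written does it rename to the final "name" that
// the watcher looks for. Readers therefore only ever see complete files.
public class WriteThenRename {
    public static Path writeComplete(Path dir, String name, String data) throws IOException {
        Path tmp = dir.resolve(name + ".tmp");  // invisible to the watcher
        Path fin = dir.resolve(name);
        Files.write(tmp, data.getBytes(StandardCharsets.UTF_8));
        Files.move(tmp, fin, StandardCopyOption.ATOMIC_MOVE); // publish atomically
        return fin;
    }
}
```

The master-side watcher then simply ignores names ending in ".tmp" (or, with the HDFS variant, never sees them because they live in a staging directory until renamed).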