Re: Question on Hadoop Streaming
Does your job end with an error? I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

The first option says to use your script as the mapper; the second says to ship your script as part of the job.

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler ro...@ormium.de wrote:

Hi, I've got the following setup for NGS read alignment:

A script accepting data on stdin/stdout:

cat /root/bowtiestreaming.sh
cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
/home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2>/root/bowtie.log

A file copied to HDFS:

hadoop fs -put SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

A streaming job invoked with only a mapper:

hadoop jar hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 -output SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned -mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

The output file cannot be found when I try to display it:

hadoop fs -cat /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
11/12/06 09:07:47 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
cat: File does not exist: /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned

The file looks like this (tab separated):

head SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
@SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA I3I+I(%BH43%III7I(5III*II+
@SRR014475.2 :1:1:112:26 length=36 GNNTTCCCCAACTTCCAAATCACCTAAC I!!II=I@II5II)/$;%+*/%%##
@SRR014475.3 :1:1:101:937 length=36 GAAGATCCGGTACAACCCTGATGTAAATGGTA IAIIAII%I0G
@SRR014475.4 :1:1:124:64 length=36 GAACACATAGAACAACAGGATTCGCCAGAACACCTG IIICI+@5+)'(-';%$;+;
@SRR014475.5 :1:1:108:897 length=36 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT I0I:I'+IG3II46II0C@=III()+:+2$
@SRR014475.6 :1:1:106:14 length=36 GNNNTNTAGCATTAAGTAATTGGT I!!!I!I6I*+III:%IB0+I.%?
@SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT III0%%)%I.II;III.(I@E2*'+1;;#;'
@SRR014475.8 :1:1:123:8 length=36 GNNNTTNN I!!!$(!!
@SRR014475.9 :1:1:118:88 length=36 GGAAACTGGCGCGCTACCAGGTAACGCGCCAC IIIGIAA4;1+16*;*+)'$%#$%
@SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA IICII;CGIDI?%$I:%6)C*;#;

and the result like this:

cat SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 |./bowtiestreaming.sh |head
@SRR014475.3 :1:1:101:937 length=36 + gi|110640213|ref|NC_008253.1| 3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA IAIIAII%I0G 0 7:TC,27:GT
@SRR014475.4 :1:1:124:64 length=36 + gi|110640213|ref|NC_008253.1| 2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG IIICI+@5+)'(-';%$;+; 0 30:TC
@SRR014475.5 :1:1:108:897 length=36 + gi|110640213|ref|NC_008253.1| 4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT I0I:I'+IG3II46II0C@=III()+:+2$ 0 5:CA,28:GT,29:CG,30:AT,34:CT
@SRR014475.9 :1:1:118:88 length=36 - gi|110640213|ref|NC_008253.1| 3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC %$#%$')+*;*61+1;4AAIGIII 0
@SRR014475.15 :1:1:87:967 length=36 + gi|110640213|ref|NC_008253.1| 4474247 GACTACACGATCGCCTGCCTTAATATTCTTTACACC A27II7CIII*I5I+F?II' 0 6:GA,26:GT
@SRR014475.20 :1:1:108:121 length=36 - gi|110640213|ref|NC_008253.1| 37761 AATGCATATTGAGAGTGTGATTATTAGC ID4II'2IIIC/;B?FII 0 12:CT
@SRR014475.23 :1:1:75:54 length=36 + gi|110640213|ref|NC_008253.1| 2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA CI;';29=9I.4%EE2)*' 0
@SRR014475.24 :1:1:89:904 length=36 - gi|110640213|ref|NC_008253.1| 3216193 ATTAGTGTTAAGATTTCTATATTGTTGAGGCC #%);%;$EI-;$%8%I%I/+III 0 18:CT,21:GT,30:CT,31:TG,34:AT
@SRR014475.27 :1:1:74:887 length=36 - gi|110640213|ref|NC_008253.1| 540567
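A detail worth checking in this setup: a streaming mapper must emit its records on stdout and keep everything else on stderr, or the output never materializes cleanly. The sketch below is a stand-in only (tr replaces the actual bowtie invocation, and all paths are hypothetical), showing the stdin/stdout contract a script like bowtiestreaming.sh has to satisfy:

```shell
# Build a minimal stand-in mapper: records arrive on stdin, results
# leave on stdout, diagnostics go to stderr so they cannot corrupt
# the map output. 'tr' stands in for the real bowtie call.
cat > /tmp/mapper.sh <<'EOF'
#!/bin/sh
echo "mapper started" >&2      # diagnostics belong on stderr, not stdout
tr 'a-z' 'A-Z'                 # stand-in for: bowtie -m 1 -q e_coli --12 -
EOF
chmod +x /tmp/mapper.sh

# Simulate what Hadoop Streaming does: pipe records through the mapper.
printf 'acgt\n' | /tmp/mapper.sh 2>/dev/null
```

Testing the script with a plain pipe like this (as Romeo did with cat file | ./bowtiestreaming.sh) exercises exactly the contract the streaming framework relies on.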
Re: Automate Hadoop installation
Also, check out Ambari (http://incubator.apache.org/ambari/), which is still in Incubator status. How do Ambari and Puppet compare?

Regards,
Praveen

On Tue, Dec 6, 2011 at 1:00 PM, alo alt wget.n...@googlemail.com wrote:

Hi, to deploy software I suggest pulp: https://fedorahosted.org/pulp/wiki/HowTo For a package-based distro (debian, redhat, centos) you can build apache's hadoop, package it and deploy it. Configs, as Cos says, go over puppet. If you use redhat / centos, take a look at spacewalk.

best,
Alex

On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote:

There's that great project called BigTop (in the apache incubator) which provides for building the Hadoop stack. Part of what it provides is a set of Puppet recipes which will allow you to do exactly what you're looking for, with perhaps some minor corrections. Seriously, look at Puppet - otherwise it will be a living nightmare of configuration mismanagement.

Cos

On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:

Hi all, Can anyone guide me on how to automate the hadoop installation/configuration process? I want to install hadoop on 10-20 nodes, which may even grow to 50-100 nodes. I know we can use configuration tools like puppet or shell-scripts. Has anyone done it? How can we do hadoop installations on so many machines in parallel? What are the best practices for this?

Thanks,
Praveenesh

--
Alexander Lorenz
http://mapredit.blogspot.com
Think of the environment: please don't print this email unless you really need to.
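The configuration-management advice above boils down to rendering identical config files from one template and pushing them to every node. A minimal sketch under assumed values (the namenode host and target path are placeholders; Puppet or the BigTop recipes do this properly at scale):

```shell
# Render a core-site.xml from a single template variable so every node
# in the cluster receives an identical configuration file.
NAMENODE=master.example.com   # placeholder host name
cat > /tmp/core-site.xml <<EOF
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://${NAMENODE}:9000</value>
  </property>
</configuration>
EOF

# A push step would then copy the rendered file out to each node,
# e.g. scp /tmp/core-site.xml $host:/etc/hadoop/conf/ inside a loop.
grep '<value>' /tmp/core-site.xml
```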
Re: Question on Hadoop Streaming
Hi Brock,

I'm not getting any errors. I'm issuing the following command now:

hadoop jar hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 -output SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned -mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0 -file bowtiestreaming.sh

The only error I get using cat hadoop-0.21.0/logs/* | grep Exception is:

org.apache.hadoop.fs.ChecksumException: Checksum error: file:/root/hadoop-0.21.0/logs/history/job_201112060917_0002_root at 2620416
2011-12-06 11:14:34,515 WARN org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill -13816: No such process
2011-12-06 11:14:43,039 WARN org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill -13862: No such process
2011-12-06 11:14:46,282 WARN org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill -13891: No such process
2011-12-06 11:14:49,841 WARN org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell command org.apache.hadoop.util.Shell$ExitCodeException: kill -13978: No such process

Best regards,
Romeo

On 12/06/2011 10:49 AM, Brock Noland wrote:

Does your job end with an error? I am guessing what you want is: -mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh' The first option says to use your script as the mapper; the second says to ship your script as part of the job.
Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler ro...@ormium.de wrote:
[...]
Re: Multiple Mappers for Multiple Tables
MultipleInputs takes multiple Paths (files), not DBs, as input. As mentioned earlier, export the tables into HDFS using either Sqoop or a native DB export tool and then do the processing. Sqoop is configured to use the native DB export tool whenever possible.

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote:

Thanks Bejoy, I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter. Are these paths just ignored here?

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Hi Justin, Just to add to my response: if you need to fetch data from an rdbms in your mapper using custom mapreduce code, you can use DBInputFormat in your mapper class with MultipleInputs. You have to be careful with the number of mappers for your application, as dbs are constrained by a limit on maximum simultaneous connections. Also, ensure that the same query is not executed n times in n mappers, all fetching the same data; that would just waste network bandwidth. Sqoop + Hive would be my recommendation and a good combination for such use cases. If you have Pig competency you can also look into pig instead of hive. Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:

Justin, if I get your requirement right, you need to get data in from multiple rdbms sources, join it, and maybe run some more custom operations on top of that. For this you don't need to write custom mapreduce code unless it is really required. You can achieve the same in two easy steps:
- Import data from RDBMS into Hive using SQOOP (Import)
- Use hive to do the join and processing on this data
Hope it helps!..

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com wrote:

I would like to join some db tables, possibly from different databases, in a MR job. I would essentially like to use MultipleInputs, but that seems file oriented. I need a different mapper for each db table. Suggestions?

Thanks!
Justin Vincent
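The Sqoop route suggested above amounts to one import per table. The sketch below only assembles and prints the command rather than running it (the JDBC URL, table, and target directory are placeholders, since the real connection details depend on the database):

```shell
# Assemble a Sqoop import command for one table. On a real cluster you
# would execute the command itself instead of echoing it.
DB_URL="jdbc:mysql://dbhost/sales"   # placeholder connection string
TABLE="orders"                       # placeholder table name
SQOOP_CMD="sqoop import --connect $DB_URL --table $TABLE \
 --target-dir /user/hadoop/$TABLE --num-mappers 4"
echo "$SQOOP_CMD" | tee /tmp/sqoop_cmd.txt
```

--num-mappers caps the parallel connections, which addresses Bejoy's caution about database connection limits; a second import for the other table, followed by a Hive join, completes the two-step plan.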
Re: Running a job continuously
If the requirement is real-time data processing, Flume will not suffice, as there is a time lag between the collection of files by Flume and the processing done by Hadoop. Consider frameworks like S4, Storm (from Twitter), HStreaming etc., which suit real-time processing.

Regards,
Praveen

On Tue, Dec 6, 2011 at 10:39 AM, Ravi teja ch n v raviteja.c...@huawei.com wrote:

Hi Burak,

"Bejoy Ks, I have a continuous inflow of data but I think I need a near real-time system."

Just to add to Bejoy's point: with Oozie, you can specify a data dependency for running your job. When a specific amount of data is in, you can configure Oozie to run your job. I think this will satisfy your requirement.

Regards,
Ravi Teja

From: burakkk [burak.isi...@gmail.com]
Sent: 06 December 2011 04:03:59
To: mapreduce-u...@hadoop.apache.org
Cc: common-user@hadoop.apache.org
Subject: Re: Running a job continuously

Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to execute the MR job with the same algorithm, but different files arrive with different velocity. Both Storm and facebook's hadoop are designed for that, but I want to use the apache distribution.

Bejoy Ks, I have a continuous inflow of data but I think I need a near real-time system.

Mike Spreitzer, both output and input are continuous. The output isn't relevant to the input. All I want is that all the incoming files are processed by the same job and the same algorithm. For example, think about the wordcount problem. When you want to run wordcount, you implement this: http://wiki.apache.org/hadoop/WordCount But when the program reaches job.waitForCompletion(true);, the job will somehow end. When you want to make it run continuously, what will you do in hadoop without other tools?
One more thing: my assumption is that the input file's name is filename_timestamp (filename_20111206_0030).

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }

On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Burak, if you have a continuous inflow of data, you can choose flume to aggregate the files into larger sequence files or so if they are small, and push that data onto hdfs when you have a substantial chunk (equal to the hdfs block size). Based on your SLAs you need to schedule your jobs using oozie or a simple shell script. In very simple terms:
- push input data (could be from a flume collector) into a staging hdfs dir
- before triggering the job (hadoop jar), copy the input from staging to the main input dir
- execute the job
- archive the input and output into archive dirs (any other dirs)
- the output archive dir could be the source of output data
- delete the output dir and empty the input dir
Hope it helps!...

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:

Hi everyone, I want to run a MR job continuously, because I have streaming data and I try to analyze it all the time in my way (algorithm). For example, say you want to solve the wordcount problem. It's the simplest one :) If you have some files and new files keep arriving, how do you handle it? You could execute a MR job per file, but you would have to do it repeatedly. So what do you think?

Thanks
Best regards...
--
BURAK ISIKLI | http://burakisikli.wordpress.com
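Bejoy's staging-directory cycle can be sketched as a shell script. Local directories stand in for the hdfs dirs and `true` for the hadoop jar invocation, so only the control flow is shown; all paths here are hypothetical:

```shell
# staging -> input -> run job -> archive -> clean: the cycle described above.
BASE=/tmp/staging_demo
rm -rf "$BASE"
mkdir -p "$BASE/staging" "$BASE/input" "$BASE/archive"

echo "record" > "$BASE/staging/part-0"   # data landing from the collector
mv "$BASE/staging/"* "$BASE/input/"      # promote staged data to the input dir
true                                     # stand-in for: hadoop jar wordcount.jar ...
mv "$BASE/input/"* "$BASE/archive/"      # archive the processed input
ls "$BASE/archive"
```

On a real cluster each mkdir/mv would be a hadoop fs -mkdir / fs -mv, and Oozie or cron would drive the loop.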
RE: Multiple Mappers for Multiple Tables
Hi Justin,

If it is not feasible for you to do as Praveen suggested, here is what you can do:

1. You can write a customized InputFormat which creates different connections for different data sources and returns splits from those data source tables. Internally you can use DBInputFormat for each data source in your customized InputFormat if you can.

2. If your mapper input is not the same for the two data sources, you can write one mapper which internally delegates to the mapper corresponding to the input split (you can refer to MultipleInputs for this).

MultipleInputs doesn't support DBInputFormat; it supports only input formats which use a file path as the input path. If you explain your use case in more detail, I may be able to help you better.

Devaraj K

-----Original Message-----
From: Praveen Sripati [mailto:praveensrip...@gmail.com]
Sent: Tuesday, December 06, 2011 4:11 PM
To: common-user@hadoop.apache.org
Subject: Re: Multiple Mappers for Multiple Tables

MultipleInputs takes multiple Paths (files), not DBs, as input. As mentioned earlier, export the tables into HDFS using either Sqoop or a native DB export tool and then do the processing. Sqoop is configured to use the native DB export tool whenever possible.

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote:

Thanks Bejoy, I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a Path parameter. Are these paths just ignored here?

On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Hi Justin, Just to add to my response: if you need to fetch data from an rdbms in your mapper using custom mapreduce code, you can use DBInputFormat in your mapper class with MultipleInputs. You have to be careful with the number of mappers for your application, as dbs are constrained by a limit on maximum simultaneous connections.
[...]
Hadoop 0.21
Hi All,

According to the Hadoop release notes, version 0.21.0 should not be considered stable or suitable for production:

"23 August, 2010: release 0.21.0 available. This release contains many improvements, new features, bug fixes and optimizations. It has not undergone testing at scale and should not be considered stable or suitable for production. This release is being classified as a minor release, which means that it should be API compatible with 0.20.2."

Is this still the case?

Thank you,
Saurabh
Re: Hadoop 0.21
Yep.

J-D

On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com wrote:
[...]
Version of Hadoop That Will Work With HBase?
Hi,

Can someone please tell me which versions of hadoop contain the 0.20-append code and will work with HBase? According to the HBase docs (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work with HBase, but it does not appear to.

Thanks!
Re: Version of Hadoop That Will Work With HBase?
0.20.205 should work, and so should CDH3 or 0.20-append branch builds (no longer maintained after 0.20.205 replaced them, though). What problem are you facing? Have you ensured HBase does not have a bad hadoop version jar in its lib/?

On Wed, Dec 7, 2011 at 12:55 AM, jcfol...@pureperfect.com wrote:

Hi, Can someone please tell me which versions of hadoop contain the 0.20-append code and will work with HBase? According to the HBase docs (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work with HBase, but it does not appear to. Thanks!

--
Harsh J
Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)
Thanks guys, I'll get with operations to do the upgrade.

Chris

On Mon, Dec 5, 2011 at 4:11 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

Hi Chris,

From the stack trace, it looks like a JVM corruption issue. It is a known issue and has been fixed in CDH3u2; I believe an upgrade would solve your issues. https://issues.apache.org/jira/browse/MAPREDUCE-3184

Then regarding your queries, I'd try to help you out a bit. In mapreduce, the data transfer between map and reduce happens over HTTP. If jetty is down then that won't happen, which means map output in one location won't be accessible to a reducer in another location. The map outputs are on the local file system and not on HDFS, so even if the data node on that machine is up, we can't get the data in the above circumstances.

Hope it helps!..

Regards
Bejoy.K.S

On Tue, Dec 6, 2011 at 2:15 AM, Chris Curtin curtin.ch...@gmail.com wrote:

Hi,

Using: Version: 0.20.2-cdh3u0, r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14, 8 node cluster, 64 bit Centos

We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer jobs. When we investigate, it looks like the TaskTracker on the node being fetched from is not running. Looking at the logs, we see what looks like a self-initiated shutdown:

2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of tasks it ran: 0
2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught Throwable in JVMRunner. Aborting TaskTracker.
java.lang.NullPointerException
    at org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
    at org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
    at org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118

Then the reducers have the following:

2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask: java.net.ConnectException: Connection refused
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:529)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
    at sun.net.www.http.HttpClient.init(HttpClient.java:233)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.http.HttpClient.New(HttpClient.java:323)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201112050908_0169_r_05_0: Failed fetch #2 from attempt_201112050908_0169_m_02_0
2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_201112050908_0169_m_02_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask: Shuffle failed with too many fetch failures and insufficient progress! Killing task attempt_201112050908_0169_r_05_0.
2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty box, next contact in 8 seconds
2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
RE: Version of Hadoop That Will Work With HBase?
Sadly, CDH3 is not an option, although I wish it were. I need to get an official release of HBase from apache to work. I've tried every version of HBase 0.89 and up with 0.20.205 and all of them throw EOFExceptions. Which version of Hadoop core should I be using? HBase 0.94 ships with a 20-append version which doesn't work (throws an EOFException), but when I tried replacing it with the hadoop-core included with hadoop 0.20.205 I still got the same exception.

Thanks

-------- Original Message --------
Subject: Re: Version of Hadoop That Will Work With HBase?
From: Harsh J ha...@cloudera.com
Date: Tue, December 06, 2011 2:32 pm
To: common-user@hadoop.apache.org

0.20.205 should work, and so should CDH3 or 0.20-append branch builds (no longer maintained after 0.20.205 replaced them, though). What problem are you facing? Have you ensured HBase does not have a bad hadoop version jar in its lib/?

On Wed, Dec 7, 2011 at 12:55 AM, jcfol...@pureperfect.com wrote:
[...]

--
Harsh J
Re: Version of Hadoop That Will Work With HBase?
For the record, this thread was started from another discussion on user@hbase. 0.20.205 does work with HBase 0.90.4; I think the OP was a little too quick in saying it doesn't.

J-D

On Tue, Dec 6, 2011 at 11:44 AM, jcfol...@pureperfect.com wrote:
[...]
Re: Version of Hadoop That Will Work With HBase?
Did you set dfs.support.append to true? It is not enabled by default in 0.20.205 (unlike 20.append) On Tue, Dec 6, 2011 at 11:25 AM, jcfol...@pureperfect.com wrote: Hi, Can someone please tell me which versions of hadoop contain the 20-appender code and will work with HBase? According to the Hbase docs (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work with HBase but it does not appear to. Thanks!
Re: Splitting SequenceFile in controlled manner
Majid,

Sync markers are written into sequence files already; they are part of the format. This is nothing to worry about - and it is simple enough to test and be confident about. The mechanism is the same as reading a text file with newlines - the reader will read past the boundary in order to complete a record if it has to.

On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote:

Hadoop writes a SequenceFile in key-value pair (record) format. Consider a large unbounded log file. Hadoop will split the file based on block size and save the blocks on multiple data nodes. Is it guaranteed that each key-value pair will reside in a single block? Or may we have a case where the key is in one block on node 1 and the value (or parts of it) is in a second block on node 2? If we may have meaningless splits, what is the solution? Sync markers? Another question: does hadoop automatically write sync markers, or should we write them manually?
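The text-file analogy Harsh draws can be shown with a few lines of shell (a toy illustration only; real input splits are handled by the InputFormat, not by tail and sed):

```shell
# Three newline-terminated records; imagine a split boundary at byte 3,
# in the middle of 'alpha'. The reader that owns the first split reads
# past its boundary to finish the 'alpha' record; the reader that starts
# mid-record skips ahead to the next newline before emitting records.
printf 'alpha\nbeta\ngamma\n' > /tmp/records.txt
tail -c +4 /tmp/records.txt | sed -n '2,$p'   # second "split": skip partial record
```

SequenceFile readers do the analogous thing, except the record boundary they seek is the next sync marker rather than a newline.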
RE: Version of Hadoop That Will Work With HBase?
Yes. From what I have read, it needs to be set in both hdfs-site.xml and hbase-site.xml. It's working now. I don't really know why, but it is.

Thanks!

-------- Original Message --------
Subject: Re: Version of Hadoop That Will Work With HBase?
From: Jitendra Pandey jiten...@hortonworks.com
Date: Tue, December 06, 2011 2:51 pm
To: common-user@hadoop.apache.org

Did you set dfs.support.append to true? It is not enabled by default in 0.20.205 (unlike 20.append).

On Tue, Dec 6, 2011 at 11:25 AM, jcfol...@pureperfect.com wrote:
[...]
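For reference, the property in question looks like this (a sketch of the hdfs-site.xml entry; per the thread, the same property is mirrored in hbase-site.xml):

```xml
<!-- hdfs-site.xml: enable the append/sync support that HBase relies on -->
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
```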
Re: Splitting SequenceFile in controlled manner
So if we have a map job analysing only the second block of the log file, it should not need to transfer any other parts of the file from other nodes, because that block is a standalone, meaningful split? Am I right? On Tue, Dec 6, 2011 at 11:32 PM, Harsh J ha...@cloudera.com wrote: Majid, Sync markers are written into sequence files already; they are part of the format. This is nothing to worry about - and it is simple enough to test and be confident about. The mechanism is the same as reading a text file with newlines - the reader will read past the boundary in order to complete a record if it has to. On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote: Hadoop writes a SequenceFile in key-value pair (record) format. Consider we have a large unbounded log file. Hadoop will split the file based on block size and save the blocks on multiple data nodes. Is it guaranteed that each key-value pair will reside in a single block? Or may we have a case where the key is in one block on node 1 and the value (or part of it) is in a second block on node 2? If we may have meaningless splits, then what is the solution? Sync markers? Another question: does Hadoop automatically write sync markers, or should we write them manually?
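A minimal sketch (plain Python, not Hadoop code) of the boundary rule being discussed, with newlines standing in for sync markers: each reader skips the partial record its split starts inside (that record belongs to the previous split) and reads past the end of its split to finish its last record, so every record is processed exactly once even when splits cut through records.

```python
# Toy model of how split readers recover whole records from arbitrary
# byte-range splits. Newlines play the role of sync markers here; this
# illustrates the mechanism, it is not Hadoop's actual implementation.

def read_split(data: bytes, start: int, length: int):
    """Return the complete records 'owned' by the split [start, start+length)."""
    end = start + length
    pos = start
    if start != 0:
        # Back up one byte and skip to just past the next newline: the record
        # we start inside (if any) belongs to the previous split, while a
        # record beginning exactly at `start` is correctly kept.
        nl = data.find(b"\n", start - 1)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Keep reading while a record *starts* before the split end, even if
    # finishing it means reading bytes that live in the next split/block.
    while pos < end and pos < len(data):
        nl = data.find(b"\n", pos)
        if nl == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:nl])
        pos = nl + 1
    return records

data = b"rec1\nrecord-two\nr3\nlast-record\n"
split_size = 8  # deliberately cuts through records
splits = [read_split(data, off, split_size) for off in range(0, len(data), split_size)]
all_records = [r for s in splits for r in s]
print(all_records)  # every record appears exactly once, none torn
```

So yes: a mapper assigned one split reads mostly local data, fetching at most a few trailing bytes from the next block to complete its final record.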
Re: Hadoop 0.21
I second Vinod's idea. Get the latest stable release from Cloudera. Their binaries are near perfect! On Tue, Dec 6, 2011 at 1:46 PM, T Vinod Gupta tvi...@readypulse.com wrote: Saurabh, It's best if you go through the HBase book - Lars George's HBase: The Definitive Guide. Your best bet is to build all binaries yourself or get a stable build from Cloudera. I was in this situation a few months ago and had to spend a lot of time before I was able to get a production-ready HBase version up and running. thanks vinod On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com wrote: Hi All, According to the Hadoop release notes, version 0.21.0 should not be considered stable or suitable for production: 23 August, 2010: release 0.21.0 available This release contains many improvements, new features, bug fixes and optimizations. It has not undergone testing at scale and should not be considered stable or suitable for production. This release is being classified as a minor release, which means that it should be API compatible with 0.20.2. Is this still the case? Thank you, Saurabh -- --- Get your facts first, then you can distort them as you please. --
Re: Question on Hadoop Streaming
Hi, the following command works: hadoop jar hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input input -output output2 -mapper /root/bowtiestreaming.sh -reducer NONE Best Regards, Romeo On 12/06/2011 10:49 AM, Brock Noland wrote: Does your job end with an error? I am guessing what you want is: -mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh' The first option says to use your script as a mapper and the second says to ship your script as part of the job. Brock On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler ro...@ormium.de wrote: Hi, I've got the following setup for NGS read alignment: A script accepting data from stdin/stdout: cat /root/bowtiestreaming.sh cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/ /home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2>/root/bowtie.log A file copied to HDFS: hadoop fs -put SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 A streaming job invoked with only the mapper: hadoop jar hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 -output SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned -mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0 The file cannot be found even though it is displayed: hadoop fs -cat /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned 11/12/06 09:07:47 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30 11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated. 
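For what it's worth, Streaming's contract with the mapper is just lines: the framework feeds input records to the executable on stdin and treats each stdout line as an output record (with zero reduce tasks, as in the -jobconf mapred.reduce.tasks=0 invocation above, map output is written out as-is). A hypothetical Python mapper obeying that contract could look like this; the prefix transform is purely illustrative, standing in for piping each record through the aligner:

```python
import sys

# Hypothetical Hadoop Streaming mapper sketch. Streaming delivers input
# records as lines on stdin and collects each stdout line as an output
# record; the transform below is illustrative only.

def map_line(line: str) -> str:
    # Tag the first tab-separated field of each record (stand-in for
    # real per-record work such as alignment).
    fields = line.rstrip("\n").split("\t")
    fields[0] = "mapped:" + fields[0]
    return "\t".join(fields)

if __name__ == "__main__":
    for line in sys.stdin:
        print(map_line(line))
```

Because the contract is only stdin/stdout, the same script can be tested locally with a plain pipe (cat input | ./mapper.py) before submitting it as a job.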
Instead, use mapreduce.task.attempt.id cat: File does not exist: /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned The file looks like this (tab separated): head SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 @SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA I3I+I(%BH43%III7I(5III*II+ @SRR014475.2 :1:1:112:26 length=36 GNNTTCCCCAACTTCCAAATCACCTAAC I!!II=I@II5II)/$;%+*/%%## @SRR014475.3 :1:1:101:937 length=36 GAAGATCCGGTACAACCCTGATGTAAATGGTA IAIIAII%I0G @SRR014475.4 :1:1:124:64 length=36 GAACACATAGAACAACAGGATTCGCCAGAACACCTG IIICI+@5+)'(-';%$;+; @SRR014475.5 :1:1:108:897 length=36 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT I0I:I'+IG3II46II0C@=III()+:+2$ @SRR014475.6 :1:1:106:14 length=36 GNNNTNTAGCATTAAGTAATTGGT I!!!I!I6I*+III:%IB0+I.%? @SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT III0%%)%I.II;III.(I@E2*'+1;;#;' @SRR014475.8 :1:1:123:8 length=36 GNNNTTNN I!!!$(!! @SRR014475.9 :1:1:118:88 length=36 GGAAACTGGCGCGCTACCAGGTAACGCGCCAC IIIGIAA4;1+16*;*+)'$%#$% @SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA IICII;CGIDI?%$I:%6)C*;#; and the result like this: cat SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 |./bowtiestreaming.sh |head @SRR014475.3 :1:1:101:937 length=36 + gi|110640213|ref|NC_008253.1| 3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA IAIIAII%I0G 0 7:TC,27:GT @SRR014475.4 :1:1:124:64 length=36 + gi|110640213|ref|NC_008253.1| 2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG IIICI+@5+)'(-';%$;+; 0 30:TC @SRR014475.5 :1:1:108:897 length=36 + gi|110640213|ref|NC_008253.1| 4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT I0I:I'+IG3II46II0C@=III()+:+2$ 0 5:CA,28:GT,29:CG,30:AT,34:CT @SRR014475.9 :1:1:118:88 length=36 - gi|110640213|ref|NC_008253.1| 3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC %$#%$')+*;*61+1;4AAIGIII 0 @SRR014475.15 :1:1:87:967 length=36 + gi|110640213|ref|NC_008253.1| 4474247 
GACTACACGATCGCCTGCCTTAATATTCTTTACACC A27II7CIII*I5I+F?II' 0 6:GA,26:GT @SRR014475.20 :1:1:108:121 length=36- gi|110640213|ref|NC_008253.1| 37761 AATGCATATTGAGAGTGTGATTATTAGC ID4II'2IIIC/;B?FII 0 12:CT @SRR014475.23 :1:1:75:54 length=36 + gi|110640213|ref|NC_008253.1| 2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA CI;';29=9I.4%EE2)*' 0 @SRR014475.24 :1:1:89:904 length=36 - gi|110640213|ref|NC_008253.1| 3216193
HDFS Backup nodes
Does Hadoop 0.20.205 support configuring HDFS Backup Nodes? Thanks, Praveenesh