Re: Question on Hadoop Streaming

2011-12-06 Thread Brock Noland
Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

First option says use your script as a mapper and second says ship
your script as part of the job.
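
For illustration, the full invocation along these lines would be a sketch like
the one below (paths taken from your original command; the bare script name in
-mapper assumes -file has shipped bowtiestreaming.sh into the task's working
directory):

hadoop jar hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar \
  -input SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 \
  -output SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned \
  -mapper bowtiestreaming.sh \
  -file /root/bowtiestreaming.sh \
  -jobconf mapred.reduce.tasks=0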

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzler ro...@ormium.de wrote:
 Hi,

 I've got the following setup for NGS read alignment:


 A script accepting data from stdin/out:
 
 cat /root/bowtiestreaming.sh
 cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
 /home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log



 A file copied to HDFS:
 
 hadoop fs -put
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

 A streaming job invoked with only the mapper:
 
 hadoop jar
 hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
 -output
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
 -mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

 The file cannot be found even though it is displayed:
 
 hadoop fs -cat
 /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
 11/12/06 09:07:47 INFO security.Groups: Group mapping
 impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
 cacheTimeout=30
 11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
 Instead, use mapreduce.task.attempt.id
 cat: File does not exist:
 /user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned


 The file looks like this (tab separated):
 head
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
 @SRR014475.1 :1:1:108:111 length=36     GAGACGTCGTCCTCAGTACATATA
    I3I+I(%BH43%III7I(5III*II+
 @SRR014475.2 :1:1:112:26 length=36      GNNTTCCCCAACTTCCAAATCACCTAAC
    I!!II=I@II5II)/$;%+*/%%##
 @SRR014475.3 :1:1:101:937 length=36     GAAGATCCGGTACAACCCTGATGTAAATGGTA
    IAIIAII%I0G
 @SRR014475.4 :1:1:124:64 length=36      GAACACATAGAACAACAGGATTCGCCAGAACACCTG
    IIICI+@5+)'(-';%$;+;
 @SRR014475.5 :1:1:108:897 length=36     GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
    I0I:I'+IG3II46II0C@=III()+:+2$
 @SRR014475.6 :1:1:106:14 length=36      GNNNTNTAGCATTAAGTAATTGGT
    I!!!I!I6I*+III:%IB0+I.%?
 @SRR014475.7 :1:1:118:934 length=36     GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
    III0%%)%I.II;III.(I@E2*'+1;;#;'
 @SRR014475.8 :1:1:123:8 length=36       GNNNTTNN
    I!!!$(!!
 @SRR014475.9 :1:1:118:88 length=36      GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
    IIIGIAA4;1+16*;*+)'$%#$%
 @SRR014475.10 :1:1:92:122 length=36     ATTTGCTGCCAATGGCGAGATTACGAATAATA
    IICII;CGIDI?%$I:%6)C*;#;


 and the result like this:

 cat
 SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
 |./bowtiestreaming.sh |head
 @SRR014475.3 :1:1:101:937 length=36     +
 gi|110640213|ref|NC_008253.1|   3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA
    IAIIAII%I0G  0       7:TC,27:GT
 @SRR014475.4 :1:1:124:64 length=36      +
 gi|110640213|ref|NC_008253.1|   2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG
    IIICI+@5+)'(-';%$;+;  0       30:TC
 @SRR014475.5 :1:1:108:897 length=36     +
 gi|110640213|ref|NC_008253.1|   4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
    I0I:I'+IG3II46II0C@=III()+:+2$  0
 5:CA,28:GT,29:CG,30:AT,34:CT
 @SRR014475.9 :1:1:118:88 length=36      -
 gi|110640213|ref|NC_008253.1|   3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC
    %$#%$')+*;*61+1;4AAIGIII  0
 @SRR014475.15 :1:1:87:967 length=36     +
 gi|110640213|ref|NC_008253.1|   4474247 GACTACACGATCGCCTGCCTTAATATTCTTTACACC
    A27II7CIII*I5I+F?II'  0       6:GA,26:GT
 @SRR014475.20 :1:1:108:121 length=36    -
 gi|110640213|ref|NC_008253.1|   37761   AATGCATATTGAGAGTGTGATTATTAGC
    ID4II'2IIIC/;B?FII  0       12:CT
 @SRR014475.23 :1:1:75:54 length=36      +
 gi|110640213|ref|NC_008253.1|   2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA
    CI;';29=9I.4%EE2)*'  0
 @SRR014475.24 :1:1:89:904 length=36     -
 gi|110640213|ref|NC_008253.1|   3216193 ATTAGTGTTAAGATTTCTATATTGTTGAGGCC
    #%);%;$EI-;$%8%I%I/+III  0
 18:CT,21:GT,30:CT,31:TG,34:AT
 @SRR014475.27 :1:1:74:887 length=36     -
 gi|110640213|ref|NC_008253.1|   540567  

Re: Automate Hadoop installation

2011-12-06 Thread Praveen Sripati
Also, check out Ambari (http://incubator.apache.org/ambari/), which is still
in Incubator status. How do Ambari and Puppet compare?

Regards,
Praveen

On Tue, Dec 6, 2011 at 1:00 PM, alo alt wget.n...@googlemail.com wrote:

 Hi,

 to deploy software I suggest pulp:
 https://fedorahosted.org/pulp/wiki/HowTo

 For a package-based distro (Debian, Red Hat, CentOS) you can build Apache's
 hadoop, package it and deploy it. Configs, as Cos says, go over Puppet. If you
 use Red Hat / CentOS, take a look at Spacewalk.

 best,
  Alex


 On Mon, Dec 5, 2011 at 8:20 PM, Konstantin Boudnik c...@apache.org wrote:

  There is this great project called BigTop (in the Apache Incubator) which
  provides for building the Hadoop stack.
 
  Part of what it provides is a set of Puppet recipes which will allow you
  to do exactly what you're looking for, with perhaps some minor
  corrections.
 
  Seriously, look at Puppet - otherwise it will be a living nightmare of
  configuration mismanagement.
 
  Cos
 
  On Mon, Dec 05, 2011 at 04:02PM, praveenesh kumar wrote:
   Hi all,
  
    Can anyone guide me on how to automate the hadoop
    installation/configuration process?
    I want to install hadoop on 10-20 nodes, which may even grow to 50-100
    nodes.
    I know we can use configuration tools like Puppet or shell scripts.
    Has anyone done it?
   
    How can we do hadoop installations on so many machines in parallel? What
    are the best practices for this?
  
   Thanks,
   Praveenesh
 



 --
 Alexander Lorenz
 http://mapredit.blogspot.com

 Think of the environment: please don't print this email unless you
 really need to.



Re: Question on Hadoop Streaming

2011-12-06 Thread Romeo Kienzler

Hi Brock,

I'm not getting any errors.

I'm issuing the following command now:

hadoop jar 
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar 
-input 
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1 
-output 
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned 
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0 -file 
bowtiestreaming.sh


The only error I get using cat hadoop-0.21.0/logs/* |grep Exception is:
org.apache.hadoop.fs.ChecksumException: Checksum error: 
file:/root/hadoop-0.21.0/logs/history/job_201112060917_0002_root at 2620416
2011-12-06 11:14:34,515 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13816: No 
such process
2011-12-06 11:14:43,039 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13862: No 
such process
2011-12-06 11:14:46,282 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13891: No 
such process
2011-12-06 11:14:49,841 WARN 
org.apache.hadoop.mapreduce.util.ProcessTree: Error executing shell 
command org.apache.hadoop.util.Shell$ExitCodeException: kill -13978: No 
such process



Best regards,

Romeo

On 12/06/2011 10:49 AM, Brock Noland wrote:

Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

First option says use your script as a mapper and second says ship
your script as part of the job.

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzlerro...@ormium.de  wrote:

Hi,

I've got the following setup for NGS read alignment:


A script accepting data from stdin/out:

cat /root/bowtiestreaming.sh
cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
/home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log



A file copied to HDFS:

hadoop fs -put
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

A streaming job invoked with only the mapper:

hadoop jar
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
-output
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

The file cannot be found even though it is displayed:

hadoop fs -cat
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
11/12/06 09:07:47 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
cat: File does not exist:
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned


The file looks like this (tab separated):
head
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
@SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA
I3I+I(%BH43%III7I(5III*II+
@SRR014475.2 :1:1:112:26 length=36  GNNTTCCCCAACTTCCAAATCACCTAAC
I!!II=I@II5II)/$;%+*/%%##
@SRR014475.3 :1:1:101:937 length=36 GAAGATCCGGTACAACCCTGATGTAAATGGTA
IAIIAII%I0G
@SRR014475.4 :1:1:124:64 length=36  GAACACATAGAACAACAGGATTCGCCAGAACACCTG
IIICI+@5+)'(-';%$;+;
@SRR014475.5 :1:1:108:897 length=36 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
I0I:I'+IG3II46II0C@=III()+:+2$
@SRR014475.6 :1:1:106:14 length=36  GNNNTNTAGCATTAAGTAATTGGT
I!!!I!I6I*+III:%IB0+I.%?
@SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
III0%%)%I.II;III.(I@E2*'+1;;#;'
@SRR014475.8 :1:1:123:8 length=36   GNNNTTNN
I!!!$(!!
@SRR014475.9 :1:1:118:88 length=36  GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
IIIGIAA4;1+16*;*+)'$%#$%
@SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA
IICII;CGIDI?%$I:%6)C*;#;


and the result like this:

cat
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
|./bowtiestreaming.sh |head
@SRR014475.3 :1:1:101:937 length=36 +
gi|110640213|ref|NC_008253.1|   3393863 

Re: Multiple Mappers for Multiple Tables

2011-12-06 Thread Praveen Sripati
MultipleInputs takes multiple Paths (files) and not a DB as input. As mentioned
earlier, export the tables into HDFS either using Sqoop or a native DB export
tool and then do the processing. Sqoop is configured to use the native DB
export tool whenever possible.
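
As an illustration only (the JDBC URL, database and table names below are
hypothetical), a Sqoop import of a single table into HDFS could look like:

sqoop import \
  --connect jdbc:mysql://dbhost/salesdb \
  --username etl -P \
  --table orders \
  --target-dir /user/etl/orders \
  --num-mappers 4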

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote:

 Thanks Bejoy,
 I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
 Path parameter. Are these paths just ignored here?

 On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Hi Justin,
  Just to add on to my response. If you need to fetch data from an
   RDBMS in your mapper using your custom mapreduce code, you can use
   DBInputFormat in your mapper class with MultipleInputs. You have to be
   careful with the number of mappers for your application, as DBs are
   constrained by a limit on maximum simultaneous connections. Also you
   need to ensure that the same query is not executed n times in n mappers
   all fetching the same data; that would just be a waste of network
   bandwidth. Sqoop + Hive would be my recommendation and a good combination
   for such use cases. If you have Pig competency you can also look into Pig
   instead of Hive.
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:
 
   Justin
    If I get your requirement right, you need to pull in data from
    multiple RDBMS sources and do a join on the same, and maybe some more
    custom operations on top of this. For this you don't need to go in for
    writing your custom mapreduce code unless it is really required. You can
    achieve the same in two easy steps:
   - Import data from RDBMS into Hive using SQOOP (Import)
   - Use hive to do some join and processing on this data
  
   Hope it helps!..
  
   Regards
   Bejoy.K.S
  
  
   On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com
  wrote:
  
    I would like to join some db tables, possibly from different databases,
    in a MR job.
  
   I would essentially like to use MultipleInputs, but that seems file
   oriented. I need a different mapper for each db table.
  
   Suggestions?
  
   Thanks!
  
   Justin Vincent
  
  
  
 



Re: Running a job continuously

2011-12-06 Thread Praveen Sripati
If the requirement is real-time data processing, using Flume
will not suffice, as there is a time lag between the collection of files
by Flume and the processing done by Hadoop. Consider frameworks like S4,
Storm (from Twitter), HStreaming etc., which suit real-time processing.

Regards,
Praveen

On Tue, Dec 6, 2011 at 10:39 AM, Ravi teja ch n v
raviteja.c...@huawei.comwrote:

 Hi Burak,

 Bejoy Ks, I have a continuous inflow of data but I think I need a near
 real-time system.

 Just to add to Bejoy's point,
 with Oozie you can specify a data dependency for running your job.
 When a specific amount of data is in, you can configure Oozie to run your
 job.
 I think this will satisfy your requirement.

 Regards,
 Ravi Teja

 
 From: burakkk [burak.isi...@gmail.com]
 Sent: 06 December 2011 04:03:59
 To: mapreduce-u...@hadoop.apache.org
 Cc: common-user@hadoop.apache.org
 Subject: Re: Running a job continuously

 Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to
 execute the MR job with the same algorithm but the different files arrive at
 different rates.

 Both Storm and Facebook's Hadoop are designed for that, but I want to use the
 Apache distribution.

 Bejoy Ks, I have a continuous inflow of data, but I think I need a near
 real-time system.

 Mike Spreitzer, both output and input are continuous. The output isn't
 related to a particular input. All I want is that all the incoming files are
 processed by the same job and the same algorithm.
 For example, think about the wordcount problem. When you want to run
 wordcount, you implement it like this:
 http://wiki.apache.org/hadoop/WordCount

 But when the program reaches the line job.waitForCompletion(true);, the job
 will eventually end. When you want to make it run continuously, what do you
 do in hadoop without other tools?
 One more thing: assume that the input file's name is
 filename_timestamp (filename_20111206_0030).

 public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     Job job = new Job(conf, "wordcount");
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     job.setMapperClass(Map.class);
     job.setReducerClass(Reduce.class);
     job.setInputFormatClass(TextInputFormat.class);
     job.setOutputFormatClass(TextOutputFormat.class);
     FileInputFormat.addInputPath(job, new Path(args[0]));
     FileOutputFormat.setOutputPath(job, new Path(args[1]));
     job.waitForCompletion(true);
 }

 On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Burak
  If you have a continuous inflow of data, you can choose Flume to
   aggregate the files into larger sequence files or similar if they are
   small, and push that data onto hdfs when you have a substantial chunk of
   data (roughly equal to the hdfs block size). Based on your SLAs you need
   to schedule your jobs using oozie or a simple shell script. In very simple
   terms (a rough shell sketch of these steps follows after the list):
  - push input data (could be from flume collector) into a staging hdfs dir
  - before triggering the job(hadoop jar) copy the input from staging to
  main input dir
  - execute the job
  - archive the input and output into archive dirs(any other dirs).
 - the output archive dir could be source of output data
  - delete output dir and empty input dir
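
A minimal shell sketch of the steps above, assuming hypothetical HDFS
directories /data/staging, /data/input, /data/output and /data/archive and a
placeholder job jar/class (adjust names and add error handling for real use):

#!/bin/bash
# Rough sketch only - all paths and the jar/class names are placeholders.
STAGING=/data/staging
INPUT=/data/input
OUTPUT=/data/output
ARCHIVE=/data/archive/$(date +%Y%m%d%H%M)

# move staged input into the job's input dir
hadoop fs -mkdir $INPUT
hadoop fs -mv "$STAGING/*" $INPUT

# run the job
hadoop jar myjob.jar com.example.MyJob $INPUT $OUTPUT

# archive the input and output, leaving the dirs clean for the next run
hadoop fs -mkdir $ARCHIVE
hadoop fs -mv $INPUT $ARCHIVE/input
hadoop fs -mv $OUTPUT $ARCHIVE/output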
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 2:19 AM, burakkk burak.isi...@gmail.com wrote:
 
  Hi everyone,
   I want to run a MR job continuously, because I have streaming data and I
   try to analyze it all the time with my own algorithm. For example, say you
   want to solve the wordcount problem. It's the simplest one :) If you have
   multiple files and new files keep coming, how do you handle it?
   You could execute a MR job per file, but you would have to do it
   repeatedly. So what do you think?
 
  Thanks
  Best regards...
 
  --
 
   BURAK ISIKLI | http://burakisikli.wordpress.com
 
 
 


 --

  BURAK ISIKLI | http://burakisikli.wordpress.com



RE: Multiple Mappers for Multiple Tables

2011-12-06 Thread Devaraj K
Hi Justin,

   If it is not feasible for you to do as Praveen suggested, here is what you
can do.

1. You can write a customized InputFormat which creates different
connections for different data sources and returns splits from those data
source tables. Internally you can use DBInputFormat for each data source in
your customized InputFormat if possible.

2. If your mapper input is not the same for the two data sources, you can
write one mapper which internally delegates to the mapper corresponding to
the input split (you can refer to MultipleInputs for how this is done).

MultipleInputs doesn't support DBInputFormat; it supports only input
formats which use a file path as the input path.

If you explain your use case with more details, I may help you better.



Devaraj K 

-Original Message-
From: Praveen Sripati [mailto:praveensrip...@gmail.com] 
Sent: Tuesday, December 06, 2011 4:11 PM
To: common-user@hadoop.apache.org
Subject: Re: Multiple Mappers for Multiple Tables

MultipleInputs takes multiple Paths (files) and not a DB as input. As mentioned
earlier, export the tables into HDFS either using Sqoop or a native DB export
tool and then do the processing. Sqoop is configured to use the native DB
export tool whenever possible.

Regards,
Praveen

On Tue, Dec 6, 2011 at 3:44 AM, Justin Vincent justi...@gmail.com wrote:

 Thanks Bejoy,
 I was looking at DBInputFormat with MultipleInputs. MultipleInputs takes a
 Path parameter. Are these paths just ignored here?

 On Mon, Dec 5, 2011 at 2:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

  Hi Justin,
  Just to add on to my response. If you need to fetch data from an
   RDBMS in your mapper using your custom mapreduce code, you can use
   DBInputFormat in your mapper class with MultipleInputs. You have to be
   careful with the number of mappers for your application, as DBs are
   constrained by a limit on maximum simultaneous connections. Also you
   need to ensure that the same query is not executed n times in n mappers
   all fetching the same data; that would just be a waste of network
   bandwidth. Sqoop + Hive would be my recommendation and a good combination
   for such use cases. If you have Pig competency you can also look into Pig
   instead of Hive.
 
  Hope it helps!...
 
  Regards
  Bejoy.K.S
 
  On Tue, Dec 6, 2011 at 1:36 AM, Bejoy Ks bejoy.had...@gmail.com wrote:
 
   Justin
    If I get your requirement right, you need to pull in data from
    multiple RDBMS sources and do a join on the same, and maybe some more
    custom operations on top of this. For this you don't need to go in for
    writing your custom mapreduce code unless it is really required. You can
    achieve the same in two easy steps:
   - Import data from RDBMS into Hive using SQOOP (Import)
   - Use hive to do some join and processing on this data
  
   Hope it helps!..
  
   Regards
   Bejoy.K.S
  
  
   On Tue, Dec 6, 2011 at 12:13 AM, Justin Vincent justi...@gmail.com
  wrote:
  
    I would like to join some db tables, possibly from different databases,
    in a MR job.
  
   I would essentially like to use MultipleInputs, but that seems file
   oriented. I need a different mapper for each db table.
  
   Suggestions?
  
   Thanks!
  
   Justin Vincent
  
  
  
 




Hadoop 0.21

2011-12-06 Thread Saurabh Sehgal
Hi All,

According to the Hadoop release notes, version 0.21.0 should not be
considered stable or suitable for production:

23 August, 2010: release 0.21.0 available
This release contains many improvements, new features, bug fixes and
optimizations. It has not undergone testing at scale and should not be
considered stable or suitable for production. This release is being
classified as a minor release, which means that it should be API
compatible with 0.20.2.


Is this still the case ?

Thank you,

Saurabh


Re: Hadoop 0.21

2011-12-06 Thread Jean-Daniel Cryans
Yep.

J-D

On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com wrote:
 Hi All,

 According to the Hadoop release notes, version 0.21.0 should not be
 considered stable or suitable for production:

 23 August, 2010: release 0.21.0 available
 This release contains many improvements, new features, bug fixes and
 optimizations. It has not undergone testing at scale and should not be
 considered stable or suitable for production. This release is being
 classified as a minor release, which means that it should be API
 compatible with 0.20.2.


 Is this still the case ?

 Thank you,

 Saurabh


Version of Hadoop That Will Work With HBase?

2011-12-06 Thread jcfolsom


Hi,


Can someone please tell me which versions of hadoop contain the
20-appender code and will work with HBase? According to the Hbase docs
(http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work
with HBase but it does not appear to.


Thanks!



Re: Version of Hadoop That Will Work With HBase?

2011-12-06 Thread Harsh J
0.20.205 should work, and so should CDH3 or 0.20-append branch builds
(no longer maintained, after 0.20.205 replaced it though).

What problem are you facing? Have you ensured HBase does not have a
bad hadoop version jar in its lib/?
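
One quick way to check (the install paths below are hypothetical; adjust to
your layout):

# compare the hadoop jar HBase bundles with the one your cluster runs
ls /usr/local/hbase/lib/hadoop-core-*.jar
ls /usr/local/hadoop-0.20.205.0/hadoop-core-*.jar

# if they differ, swap in the cluster's jar and restart HBase
rm /usr/local/hbase/lib/hadoop-core-*.jar
cp /usr/local/hadoop-0.20.205.0/hadoop-core-*.jar /usr/local/hbase/lib/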

On Wed, Dec 7, 2011 at 12:55 AM,  jcfol...@pureperfect.com wrote:


 Hi,


 Can someone please tell me which versions of hadoop contain the
 20-appender code and will work with HBase? According to the Hbase docs
 (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work
 with HBase but it does not appear to.


 Thanks!




-- 
Harsh J


Re: MAX_FETCH_RETRIES_PER_MAP (TaskTracker dying?)

2011-12-06 Thread Chris Curtin
Thanks guys, I'll get with operations to do the upgrade.

Chris

On Mon, Dec 5, 2011 at 4:11 PM, Bejoy Ks bejoy.had...@gmail.com wrote:

 Hi Chris
  From the stack trace, it looks like a JVM corruption issue. It is
  a known issue and has been fixed in CDH3u2; I believe an upgrade would
  solve your issues.
 https://issues.apache.org/jira/browse/MAPREDUCE-3184

  Then regarding your queries, I'll try to help you out a bit. In mapreduce
  the data transfer between map and reduce happens over HTTP. If Jetty is
  down then that won't happen, which means map output in one location won't
  be accessible to a reducer in another location. The map outputs are on the
  local file system (LFS) and not on HDFS, so even if the data node on the
  machine is up we can't get the data in the above circumstances.

 Hope it helps!..

 Regards
 Bejoy.K.S


 On Tue, Dec 6, 2011 at 2:15 AM, Chris Curtin curtin.ch...@gmail.com
 wrote:

  Hi,
 
  Using: Version 0.20.2-cdh3u0,
 r81256ad0f2e4ab2bd34b04f53d25a6c23686dd14,
   8 node cluster, 64 bit Centos
 
  We are occasionally seeing MAX_FETCH_RETRIES_PER_MAP errors on reducer
  jobs. When we investigate it looks like the TaskTracker on the node being
  fetched from is not running. Looking at the logs we see what looks like a
  self-initiated shutdown:
 
  2011-12-05 14:10:48,632 INFO org.apache.hadoop.mapred.JvmManager: JVM :
  jvm_201112050908_0222_r_1100711673 exited with exit code 0. Number of
 tasks
  it ran: 0
  2011-12-05 14:10:48,632 ERROR org.apache.hadoop.mapred.JvmManager: Caught
  Throwable in JVMRunner. Aborting TaskTracker.
  java.lang.NullPointerException
 at
 
 
 org.apache.hadoop.mapred.DefaultTaskController.logShExecStatus(DefaultTaskController.java:145)
 at
 
 
 org.apache.hadoop.mapred.DefaultTaskController.launchTask(DefaultTaskController.java:129)
 at
 
 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.runChild(JvmManager.java:472)
 at
 
 
 org.apache.hadoop.mapred.JvmManager$JvmManagerForType$JvmRunner.run(JvmManager.java:446)
  2011-12-05 14:10:48,634 INFO org.apache.hadoop.mapred.TaskTracker:
  SHUTDOWN_MSG:
  /
  SHUTDOWN_MSG: Shutting down TaskTracker at had11.atlis1/10.120.41.118
  /
 
  Then the reducers have the following:
 
 
  2011-12-05 14:12:00,962 WARN org.apache.hadoop.mapred.ReduceTask:
  java.net.ConnectException: Connection refused
   at java.net.PlainSocketImpl.socketConnect(Native Method)
   at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
   at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
   at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
   at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
   at java.net.Socket.connect(Socket.java:529)
   at sun.net.NetworkClient.doConnect(NetworkClient.java:158)
   at sun.net.www.http.HttpClient.openServer(HttpClient.java:394)
   at sun.net.www.http.HttpClient.openServer(HttpClient.java:529)
   at sun.net.www.http.HttpClient.init(HttpClient.java:233)
   at sun.net.www.http.HttpClient.New(HttpClient.java:306)
   at sun.net.www.http.HttpClient.New(HttpClient.java:323)
   at
 
 
 sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:970)
   at
 
 
 sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:911)
   at
 
 
 sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:836)
   at
 
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getInputStream(ReduceTask.java:1525)
   at
 
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.setupSecureConnection(ReduceTask.java:1482)
   at
 
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1390)
   at
 
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1301)
   at
 
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1233)
 
  2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Task
  attempt_201112050908_0169_r_05_0: Failed fetch #2 from
  attempt_201112050908_0169_m_02_0
  2011-12-05 14:12:00,962 INFO org.apache.hadoop.mapred.ReduceTask: Failed
 to
  fetch map-output from attempt_201112050908_0169_m_02_0 even after
  MAX_FETCH_RETRIES_PER_MAP retries...  or it is a read error,  reporting
 to
  the JobTracker
  2011-12-05 14:12:00,962 FATAL org.apache.hadoop.mapred.ReduceTask:
 Shuffle
  failed with too many fetch failures and insufficient progress!Killing
 task
  attempt_201112050908_0169_r_05_0.
  2011-12-05 14:12:00,966 WARN org.apache.hadoop.mapred.ReduceTask:
  attempt_201112050908_0169_r_05_0 adding host had11.atlis1 to penalty
  box, next contact in 8 seconds
  2011-12-05 14:12:00,966 INFO org.apache.hadoop.mapred.ReduceTask:
  attempt_201112050908_0169_r_05_0: Got 1 map-outputs from previous
 

RE: Version of Hadoop That Will Work With HBase?

2011-12-06 Thread jcfolsom

Sadly, CDH3 is not an option although I wish it was. I need to get an
official release of HBase from apache to work.

I've tried every version of HBase 0.89 and up with 0.20.205 and all of
them throw EOFExceptions. Which version of Hadoop core should I be
using? HBase 0.94 ships with a 20-append version which doesn't work (it
throws an EOFException), but when I tried replacing it with the
hadoop-core included with hadoop 0.20.205 I still get the same
exception.

Thanks


   Original Message 
 Subject: Re: Version of Hadoop That Will Work With HBase?
 From: Harsh J ha...@cloudera.com
 Date: Tue, December 06, 2011 2:32 pm
 To: common-user@hadoop.apache.org
 
 0.20.205 should work, and so should CDH3 or 0.20-append branch builds
 (no longer maintained, after 0.20.205 replaced it though).
 
 What problem are you facing? Have you ensured HBase does not have a
 bad hadoop version jar in its lib/?
 
 On Wed, Dec 7, 2011 at 12:55 AM, jcfol...@pureperfect.com wrote:
 
 
  Hi,
 
 
  Can someone please tell me which versions of hadoop contain the
  20-appender code and will work with HBase? According to the Hbase
docs
  (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should
work
  with HBase but it does not appear to.
 
 
  Thanks!
 
 
 
 
 -- 
 Harsh J



Re: Version of Hadoop That Will Work With HBase?

2011-12-06 Thread Jean-Daniel Cryans
For the record, this thread was started from another discussion in
user@hbase. 0.20.205 does work with HBase 0.90.4, I think the OP was a
little too quick saying it doesn't.

J-D

On Tue, Dec 6, 2011 at 11:44 AM,  jcfol...@pureperfect.com wrote:

 Sadly, CDH3 is not an option although I wish it was. I need to get an
 official release of HBase from apache to work.

 I've tried every version of HBase 0.89 and up with 0.20.205 and all of
 them throw EOFExceptions. Which version of Hadoop core should I be
 using? HBase 0.94 ships with a 20-append version which doesn't work (it
 throws an EOFException), but when I tried replacing it with the
 hadoop-core included with hadoop 0.20.205 I still get the same
 exception.

 Thanks


   Original Message 
  Subject: Re: Version of Hadoop That Will Work With HBase?
  From: Harsh J ha...@cloudera.com
  Date: Tue, December 06, 2011 2:32 pm
  To: common-user@hadoop.apache.org

  0.20.205 should work, and so should CDH3 or 0.20-append branch builds
  (no longer maintained, after 0.20.205 replaced it though).

  What problem are you facing? Have you ensured HBase does not have a
  bad hadoop version jar in its lib/?

  On Wed, Dec 7, 2011 at 12:55 AM, jcfol...@pureperfect.com wrote:
  
  
   Hi,
  
  
   Can someone please tell me which versions of hadoop contain the
   20-appender code and will work with HBase? According to the Hbase
 docs
   (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should
 work
   with HBase but it does not appear to.
  
  
   Thanks!
  



  --
  Harsh J



Re: Version of Hadoop That Will Work With HBase?

2011-12-06 Thread Jitendra Pandey
Did you set dfs.support.append to true? It is not enabled by default in
0.20.205 (unlike 20.append)
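
A quick way to verify, plus the property itself as a sketch (install paths are
hypothetical; the snippet goes inside the configuration element of both files,
followed by a restart of HDFS and HBase):

# verify the flag in both configs (paths are hypothetical)
grep -A 1 'dfs.support.append' /usr/local/hadoop-0.20.205.0/conf/hdfs-site.xml
grep -A 1 'dfs.support.append' /usr/local/hbase/conf/hbase-site.xml

# the property block to place inside <configuration> in each file:
cat <<'EOF'
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
EOF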

On Tue, Dec 6, 2011 at 11:25 AM, jcfol...@pureperfect.com wrote:



 Hi,


 Can someone please tell me which versions of hadoop contain the
 20-appender code and will work with HBase? According to the Hbase docs
 (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should work
 with HBase but it does not appear to.


 Thanks!




Re: Splitting SequenceFile in controlled manner

2011-12-06 Thread Harsh J
Majid,

Sync markers are written into sequence files already, they are part of the 
format. This is nothing to worry about - and is simple enough to test and be 
confident about. The mechanism is same as reading a text file with newlines - 
the reader will ensure reading off the boundary data in order to complete a 
record if it has to.

On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote:

 hadoop writes to a SequenceFile in key-value pair (record) format.
 Consider we have a large unbounded log file. Hadoop will split the file
 based on block size and save the blocks on multiple data nodes. Is it
 guaranteed that each key-value pair will reside in a single block? Or may we
 have a case where the key is in one block on node 1 and the value (or part
 of it) is in a second block on node 2? If we may end up with meaningless
 splits, then what is the solution? Sync markers?
 
 Another question is: does hadoop automatically write sync markers or should
 we write them manually?



RE: Version of Hadoop That Will Work With HBase?

2011-12-06 Thread jcfolsom



Yes. From what I have read, it needs to be set in both hdfs-site.xml and
in hbase-site.xml. It's working now. I don't really know why, but it is.
Thanks!


   Original Message 
 Subject: Re: Version of Hadoop That Will Work With HBase?
 From: Jitendra Pandey jiten...@hortonworks.com
 Date: Tue, December 06, 2011 2:51 pm
 To: common-user@hadoop.apache.org
 
 Did you set dfs.support.append to true? It is not enabled by default in
 0.20.205 (unlike 20.append)
 
 On Tue, Dec 6, 2011 at 11:25 AM, jcfol...@pureperfect.com wrote:
 
 
 
  Hi,
 
 
  Can someone please tell me which versions of hadoop contain the
  20-appender code and will work with HBase? According to the Hbase
docs
  (http://hbase.apache.org/book/hadoop.html), Hadoop 0.20.205 should
work
  with HBase but it does not appear to.
 
 
  Thanks!
 
 



Re: Splitting SequenceFile in controlled manner

2011-12-06 Thread Majid Azimi
So if we have a map job analysing only the second block of the log file, it
should not need to transfer any other parts of the file from other nodes,
because that block is a standalone, meaningful split? Am I right?

On Tue, Dec 6, 2011 at 11:32 PM, Harsh J ha...@cloudera.com wrote:

 Majid,

 Sync markers are written into sequence files already, they are part of the
 format. This is nothing to worry about - and is simple enough to test and
 be confident about. The mechanism is same as reading a text file with
 newlines - the reader will ensure reading off the boundary data in order to
 complete a record if it has to.

 On 07-Dec-2011, at 1:25 AM, Majid Azimi wrote:

   hadoop writes to a SequenceFile in key-value pair (record) format.
   Consider we have a large unbounded log file. Hadoop will split the file
   based on block size and save the blocks on multiple data nodes. Is it
   guaranteed that each key-value pair will reside in a single block? Or may
   we have a case where the key is in one block on node 1 and the value (or
   part of it) is in a second block on node 2? If we may end up with
   meaningless splits, then what is the solution? Sync markers?
  
   Another question is: does hadoop automatically write sync markers or
   should we write them manually?




Re: Hadoop 0.21

2011-12-06 Thread Rita
I second Vinod's idea. Get the latest stable from Cloudera. Their binaries
are near perfect!


On Tue, Dec 6, 2011 at 1:46 PM, T Vinod Gupta tvi...@readypulse.com wrote:

 Saurabh,
  It's best if you go through the HBase book - Lars George's HBase: The
  Definitive Guide.
  Your best bet is to build all the binaries yourself or get a stable build
  from Cloudera.
  I was in this situation a few months ago and had to spend a lot of time
  before I was able to get a production-ready HBase version up and running.

 thanks
 vinod

 On Tue, Dec 6, 2011 at 10:41 AM, Saurabh Sehgal saurabh@gmail.com
 wrote:

  Hi All,
 
  According to the Hadoop release notes, version 0.21.0 should not be
  considered stable or suitable for production:
 
  23 August, 2010: release 0.21.0 available
  This release contains many improvements, new features, bug fixes and
  optimizations. It has not undergone testing at scale and should not be
  considered stable or suitable for production. This release is being
  classified as a minor release, which means that it should be API
  compatible with 0.20.2.
 
 
  Is this still the case ?
 
  Thank you,
 
  Saurabh
 




-- 
--- Get your facts first, then you can distort them as you please.--


Re: Question on Hadoop Streaming

2011-12-06 Thread Romeo Kienzler

Hi,

the following command works:

hadoop jar 
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar 
-input input -output output2 -mapper /root/bowtiestreaming.sh -reducer NONE


Best Regards,

Romeo

On 12/06/2011 10:49 AM, Brock Noland wrote:

Does your job end with an error?

I am guessing what you want is:

-mapper bowtiestreaming.sh -file '/root/bowtiestreaming.sh'

First option says use your script as a mapper and second says ship
your script as part of the job.

Brock

On Tue, Dec 6, 2011 at 4:59 PM, Romeo Kienzlerro...@ormium.de  wrote:

Hi,

I've got the following setup for NGS read alignment:


A script accepting data from stdin/out:

cat /root/bowtiestreaming.sh
cd /home/streamsadmin/crossbow-1.1.2/bin/linux32/
/home/streamsadmin/crossbow-1.1.2/bin/linux32/bowtie -m 1 -q e_coli --12 - 2> /root/bowtie.log



A file copied to HDFS:

hadoop fs -put
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1

A streaming job invoked with only the mapper:

hadoop jar
hadoop-0.21.0/mapred/contrib/streaming/hadoop-0.21.0-streaming.jar -input
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
-output
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
-mapper '/root/bowtiestreaming.sh' -jobconf mapred.reduce.tasks=0

The file cannot be found even though it is displayed:

hadoop fs -cat
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned
11/12/06 09:07:47 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/12/06 09:07:48 WARN conf.Configuration: mapred.task.id is deprecated.
Instead, use mapreduce.task.attempt.id
cat: File does not exist:
/user/root/SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1.aligned


The file looks like this (tab separated):
head
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
@SRR014475.1 :1:1:108:111 length=36 GAGACGTCGTCCTCAGTACATATA
I3I+I(%BH43%III7I(5III*II+
@SRR014475.2 :1:1:112:26 length=36  GNNTTCCCCAACTTCCAAATCACCTAAC
I!!II=I@II5II)/$;%+*/%%##
@SRR014475.3 :1:1:101:937 length=36 GAAGATCCGGTACAACCCTGATGTAAATGGTA
IAIIAII%I0G
@SRR014475.4 :1:1:124:64 length=36  GAACACATAGAACAACAGGATTCGCCAGAACACCTG
IIICI+@5+)'(-';%$;+;
@SRR014475.5 :1:1:108:897 length=36 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
I0I:I'+IG3II46II0C@=III()+:+2$
@SRR014475.6 :1:1:106:14 length=36  GNNNTNTAGCATTAAGTAATTGGT
I!!!I!I6I*+III:%IB0+I.%?
@SRR014475.7 :1:1:118:934 length=36 GGTTACTACTCTGCGACTCCTCGCAGAAGAGACGCT
III0%%)%I.II;III.(I@E2*'+1;;#;'
@SRR014475.8 :1:1:123:8 length=36   GNNNTTNN
I!!!$(!!
@SRR014475.9 :1:1:118:88 length=36  GGAAACTGGCGCGCTACCAGGTAACGCGCCAC
IIIGIAA4;1+16*;*+)'$%#$%
@SRR014475.10 :1:1:92:122 length=36 ATTTGCTGCCAATGGCGAGATTACGAATAATA
IICII;CGIDI?%$I:%6)C*;#;


and the result like this:

cat
SRR014475.lite.nodoublequotewithendsnocommas.fastq.received.1-read-per-line-format.1
|./bowtiestreaming.sh |head
@SRR014475.3 :1:1:101:937 length=36 +
gi|110640213|ref|NC_008253.1|   3393863 GAAGATCCGGTACAACCCTGATGTAAATGGTA
IAIIAII%I0G  0   7:TC,27:GT
@SRR014475.4 :1:1:124:64 length=36  +
gi|110640213|ref|NC_008253.1|   2288633 GAACACATAGAACAACAGGATTCGCCAGAACACCTG
IIICI+@5+)'(-';%$;+;  0   30:TC
@SRR014475.5 :1:1:108:897 length=36 +
gi|110640213|ref|NC_008253.1|   4389356 GGAAGAGATGAAGTGGGTCGTTGTGGTGTGTTTGTT
I0I:I'+IG3II46II0C@=III()+:+2$  0
5:CA,28:GT,29:CG,30:AT,34:CT
@SRR014475.9 :1:1:118:88 length=36  -
gi|110640213|ref|NC_008253.1|   3598410 GTGGCGCGTTACCTGGTAGCGCGCCAGTTTCC
%$#%$')+*;*61+1;4AAIGIII  0
@SRR014475.15 :1:1:87:967 length=36 +
gi|110640213|ref|NC_008253.1|   4474247 GACTACACGATCGCCTGCCTTAATATTCTTTACACC
A27II7CIII*I5I+F?II'  0   6:GA,26:GT
@SRR014475.20 :1:1:108:121 length=36-
gi|110640213|ref|NC_008253.1|   37761   AATGCATATTGAGAGTGTGATTATTAGC
ID4II'2IIIC/;B?FII  0   12:CT
@SRR014475.23 :1:1:75:54 length=36  +
gi|110640213|ref|NC_008253.1|   2465453 GGTTTCTTTCTGCGCAGATGCCAGACGGTCTTTATA
CI;';29=9I.4%EE2)*'  0
@SRR014475.24 :1:1:89:904 length=36 -
gi|110640213|ref|NC_008253.1|   3216193 

HDFS Backup nodes

2011-12-06 Thread praveenesh kumar
Does hadoop 0.20.205 support configuring HDFS backup nodes?

Thanks,
Praveenesh