Re: Print execution time

2009-04-08 Thread Sharad Agarwal
Also available on jobtracker web ui.

Farhan Husain wrote:
 The logs might help.

 On Tue, Apr 7, 2009 at 7:28 PM, Mithila Nagendra mnage...@asu.edu wrote:

 Hey all!

 Is there a way to print out the execution time of a map-reduce task? Is there
 an inbuilt function or an option to be used with bin/hadoop?

 Thanks!
 Mithila Nagendra
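
A minimal sketch of getting this from the driver (an illustration only, assuming
the 0.18-era JobClient API; the class name TimedJob is hypothetical):
JobClient.runJob() blocks until the job completes, so wall-clock timing around it
gives the job's execution time, and the jobtracker web UI shows per-job and
per-task times as well.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TimedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TimedJob.class);
    // ... normal job configuration (input/output paths, mapper, reducer) ...

    long start = System.currentTimeMillis();
    JobClient.runJob(conf);                 // blocks until the job completes
    long elapsed = System.currentTimeMillis() - start;

    System.out.println("Job finished in " + (elapsed / 1000.0) + " seconds");
  }
}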







Re: Very asymmetric data allocation

2009-04-08 Thread Marcus Herou
Great thanks for the info!

Right after I finished my last question I started to think about how Hadoop
measures data allocation. Are the figures presented actually the size of
HDFS on each machine, or the amount of disk allocated as measured by issuing
something like df?

The reason why I am asking is that df -h is quite close to the figures
presented in the GUI but it could be a coincidence.

//Marcus
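
A minimal sketch of reading those per-node figures programmatically (an
illustration only; class and package names are as on 0.18, where they live under
org.apache.hadoop.dfs, and may differ slightly on other releases). DatanodeInfo
exposes both the raw capacity of the node's data directories and the DFS-used
bytes, which would explain why the numbers track df fairly closely.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
// 0.18.x package names; later releases moved these under org.apache.hadoop.hdfs
import org.apache.hadoop.dfs.DatanodeInfo;
import org.apache.hadoop.dfs.DistributedFileSystem;

public class NodeUsage {
  public static void main(String[] args) throws Exception {
    DistributedFileSystem dfs =
        (DistributedFileSystem) FileSystem.get(new Configuration());
    for (DatanodeInfo node : dfs.getDataNodeStats()) {
      System.out.println(node.getHostName()
          + " capacity=" + node.getCapacity()      // raw disk capacity, bytes
          + " dfsUsed=" + node.getDfsUsed()        // bytes held in HDFS blocks
          + " remaining=" + node.getRemaining());  // bytes still usable by HDFS
    }
  }
}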

On Tue, Apr 7, 2009 at 4:02 PM, Koji Noguchi knogu...@yahoo-inc.com wrote:

 Marcus,

 One known issue in 0.18.3 is HADOOP-5465.

 Copy-paste from
 https://issues.apache.org/jira/browse/HADOOP-4489?focusedCommentId=12693956&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12693956

 Hairong said:
  This bug might be caused by HADOOP-5465. Once a datanode hits
 HADOOP-5465, NameNode sends an empty replication request to the data
 node on every reply to a heartbeat, thus not a single scheduled block
 deletion request can be sent to the data node.

 (Also, if you're always writing from one of the nodes, that node is more
 likely to get full.)



 Nigel, not sure if this is the issue, but it would be nice to have
 0.18.4 out.


 Koji



 -Original Message-
 From: Marcus Herou [mailto:marcus.he...@tailsweep.com]
 Sent: Tuesday, April 07, 2009 12:45 AM
 To: hadoop-u...@lucene.apache.org
 Subject: Very asymmetric data allocation

 Hi.

 We are running Hadoop 0.18.3 and noticed a strange issue when one of our
 machines ran out of disk yesterday.
 As you can see in the table below, the server mapredcoord is 66.91%
 allocated while the others are almost empty.
 How can that be?

 Any information about this would be very helpful.

 mapredcoord is also our jobtracker.

 //Marcus

 Node         Last Contact  Admin State  Size (GB)  Used (%)  Remaining (GB)  Blocks
 mapredcoord  2             In Service   416.69     66.91     90.94           19806
 mapreduce2   2             In Service   416.69     6.71      303.54          456
 mapreduce3   2             In Service   416.69     0.44      351.69          3975
 mapreduce4   0             In Service   416.69     0.25      355.82          1549
 mapreduce5   2             In Service   416.69     0.42      347.68          3995
 mapreduce6   0             In Service   416.69     0.43      352.7           3982
 mapreduce7   0             In Service   416.69     0.5       351.91          4079
 mapreduce8   1             In Service   416.69     0.48      350.15          4169


 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/
 http://blogg.tailsweep.com/




-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
http://blogg.tailsweep.com/


Getting free and used space

2009-04-08 Thread Stas Oskin
Hi.

I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.


Re: Too many fetch errors

2009-04-08 Thread xiaolin guo
I have checked the log and found that for each map task there are 3
failures, which look like machine1 (failed) -> machine2 (failed) ->
machine1 (failed) -> machine2 (succeeded). All failures are "Too many fetch
failures", and I am sure there is no firewall between the two nodes; at
least port 50060 can be accessed from a web browser.

How can I check whether two nodes can fetch mapper outputs from one
another?  I have no idea how reducers fetch this data ...

Thanks!

On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball aa...@cloudera.com wrote:

 Xiaolin,

 Are you certain that the two nodes can fetch mapper outputs from one
 another? If it's taking that long to complete, it might be the case that
 what makes it complete is just that eventually it abandons one of your
 two
 nodes and runs everything on a single node where it succeeds -- defeating
 the point, of course.

 Might there be a firewall between the two nodes that blocks the port used
 by
 the reducer to fetch the mapper outputs? (I think this is on 50060 by
 default.)

 - Aaron

 On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo xiao...@hulu.com wrote:

  This simple map-reduce application takes nearly 1 hour to finish running
  on the two-node cluster, due to lots of Failed/Killed task attempts, while
  on the single-node cluster this application only takes 1 minute ... I am
  quite confused about why there are so many Failed/Killed attempts ...
 
  On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo xiao...@hulu.com wrote:
 
   I am trying to set up a small Hadoop cluster; everything was OK before I
   moved from a single-node cluster to a two-node cluster. I followed the
   article
   http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
   to configure the master and slaves. However, when I tried to run the example
   wordcount map-reduce application, the reduce task got stuck at 19% for a
   long time. Then I got a notice "INFO mapred.JobClient: TaskId :
   attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch
   errors" and an error message: "Error reading task outputslave".
  
   All map tasks in both task nodes had been finished which could be
  verified
   in task tracker pages.
  
   Both nodes work well in single node mode . And the Hadoop file system
  seems
   to be healthy in multi-node mode.
  
   Can anyone help me with this issue?  Have already got entangled in this
   issue for a long time ...
  
   Thanks very much!
  
 



Re: Too many fetch errors

2009-04-08 Thread xiaolin guo
Fixed the problem.
The problem is that one of the nodes cannot resolve the name of the other
node. Even if I use IP addresses in the masters and slaves files, Hadoop
will use the name of the node instead of the IP address ...
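
For reference, a sketch of the kind of /etc/hosts entries that make the names
resolvable (the host names and addresses below are made up; the same mappings
need to be present on both nodes, matching whatever names the nodes actually
report, since Hadoop advertises node names rather than the configured IPs):

192.168.0.1   master
192.168.0.2   slave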

On Wed, Apr 8, 2009 at 7:26 PM, xiaolin guo xiao...@hulu.com wrote:

 I have checked the log and found that for each map task there are 3
 failures, which look like machine1 (failed) -> machine2 (failed) ->
 machine1 (failed) -> machine2 (succeeded). All failures are "Too many fetch
 failures", and I am sure there is no firewall between the two nodes; at
 least port 50060 can be accessed from a web browser.

 How can I check whether two nodes can fetch mapper outputs from one
 another?  I have no idea how reducers fetch this data ...

 Thanks!


 On Wed, Apr 8, 2009 at 2:21 AM, Aaron Kimball aa...@cloudera.com wrote:

 Xiaolin,

 Are you certain that the two nodes can fetch mapper outputs from one
 another? If it's taking that long to complete, it might be the case that
 what makes it complete is just that eventually it abandons one of your
 two
 nodes and runs everything on a single node where it succeeds -- defeating
 the point, of course.

 Might there be a firewall between the two nodes that blocks the port used
 by
 the reducer to fetch the mapper outputs? (I think this is on 50060 by
 default.)

 - Aaron

 On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo xiao...@hulu.com wrote:

  This simple map-reduce application takes nearly 1 hour to finish running
  on the two-node cluster, due to lots of Failed/Killed task attempts, while
  on the single-node cluster this application only takes 1 minute ... I am
  quite confused about why there are so many Failed/Killed attempts ...
 
  On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo xiao...@hulu.com wrote:
 
   I am trying to set up a small Hadoop cluster; everything was OK before I
   moved from a single-node cluster to a two-node cluster. I followed the
   article
   http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
   to configure the master and slaves. However, when I tried to run the example
   wordcount map-reduce application, the reduce task got stuck at 19% for a
   long time. Then I got a notice "INFO mapred.JobClient: TaskId :
   attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch
   errors" and an error message: "Error reading task outputslave".
  
   All map tasks in both task nodes had been finished which could be
  verified
   in task tracker pages.
  
   Both nodes work well in single node mode . And the Hadoop file system
  seems
   to be healthy in multi-node mode.
  
   Can anyone help me with this issue?  Have already got entangled in
 this
   issue for a long time ...
  
   Thanks very much!
  
 





using cascading for map-reduce

2009-04-08 Thread vishal s. ghawate
Hi,
I am trying to use the Cascading API as a replacement for writing map-reduce
jobs directly. Can anybody give me a detailed description of how to run it?

I have followed the instructions given on http://www.cascading.org/ but
couldn't get it working, so if anybody has run it, please guide me on this.



Re: Getting free and used space

2009-04-08 Thread Brian Bockelman

Hey Stas,

Did you try this as a privileged user?  There might be some permission  
errors... in most of the released versions, getUsed() is only  
available to the Hadoop superuser.  It may be that the exception isn't  
propagating correctly.


Brian

On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:


Hi.

I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.




using cascading

2009-04-08 Thread vishal s. ghawate
Hi,
I am new to the Cascading concept. Can anybody help me with how to run the
Cascading example? I followed the http://cascading.org link but it's not
helping me. Please can anybody explain the steps needed to run the Cascading
example?



RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
that only popped up when the additional nodes were added.  The solution was to 
put the following entry in hadoop-site.xml:

<property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
</property>

Thanks to 'jdcryans' and 'digarok' from IRC for the help.

-kevin

-Original Message-
From: Kevin Eppinger [mailto:keppin...@adknowledge.com] 
Sent: Tuesday, April 07, 2009 1:05 PM
To: core-user@hadoop.apache.org
Subject: Hadoop data nodes failing to start

Hello everyone-

So I have a 5 node cluster that I've been running for a few weeks with no 
problems.  Today I decided to add nodes and double its size to 10.  After doing 
all the setup and starting the cluster, I discovered that four out of the 10 
nodes had failed to startup.  Specifically, the data nodes didn't start.  The 
task trackers seemed to start fine.  Thinking it was something I did 
incorrectly with the expansion, I then reverted back to the 5 node 
configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
files:

2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting Periodic 
block scanner.
2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
9269 blocks got processed in 1128 msecs
2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
Exiting due to:java.nio.channels.ClosedSelectorException
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
at 
org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
at java.lang.Thread.run(Thread.java:619)

After this the data node shuts down.  This same message is appearing on all the 
failed nodes.  Help!

-kevin


Re: Hadoop data nodes failing to start

2009-04-08 Thread Jean-Daniel Cryans
Kevin,

I'm glad it worked for you.

We talked a bit about 5114 yesterday, any chance of trying 0.18 branch
on that same cluster without the socket timeout thing?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger
keppin...@adknowledge.com wrote:
 FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
 that only popped up when the additional nodes were added.  The solution was 
 to put the following entry in hadoop-site.xml:

 <property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
 </property>

 Thanks to 'jdcryans' and 'digarok' from IRC for the help.

 -kevin

 -Original Message-
 From: Kevin Eppinger [mailto:keppin...@adknowledge.com]
 Sent: Tuesday, April 07, 2009 1:05 PM
 To: core-user@hadoop.apache.org
 Subject: Hadoop data nodes failing to start

 Hello everyone-

 So I have a 5 node cluster that I've been running for a few weeks with no 
 problems.  Today I decided to add nodes and double its size to 10.  After 
 doing all the setup and starting the cluster, I discovered that four out of 
 the 10 nodes had failed to startup.  Specifically, the data nodes didn't 
 start.  The task trackers seemed to start fine.  Thinking it was something I 
 did incorrectly with the expansion, I then reverted back to the 5 node 
 configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
 starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
 files:

 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting 
 Periodic block scanner.
 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
 9269 blocks got processed in 1128 msecs
 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
 DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
 1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
 Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at 
 org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

 After this the data node shuts down.  This same message is appearing on all 
 the failed nodes.  Help!

 -kevin



Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Tom White
Does it work if you use addArchiveToClassPath()?

Also, it may be more convenient to use GenericOptionsParser's -libjars option.

Tom
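
A minimal sketch of the -libjars route (an illustration only: it assumes the
driver is run through ToolRunner so that GenericOptionsParser actually sees the
option; the class and jar names reuse the ones from the example quoted below):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TestDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // By the time run() is called, GenericOptionsParser has consumed -libjars
    // and arranged for the listed jars to be shipped to the tasks' classpath.
    JobConf conf = new JobConf(getConf(), TestDriver.class);
    // ... set input/output paths, mapper class, etc. ...
    JobClient.runJob(conf);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // e.g.  bin/hadoop jar aaronTest1.jar TestDriver -libjars aaronTest2.jar
    System.exit(ToolRunner.run(new TestDriver(), args));
  }
}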

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote:
 Hi all,

 I'm stumped as to how to use the distributed cache's classpath feature. I
 have a library of Java classes I'd like to distribute to jobs and use in my
 mapper; I figured the DCache's addFileToClassPath() method was the correct
 means, given the example at
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.


 I've boiled it down to the following non-working example:

 in TestDriver.java:


  private void runJob() throws IOException {
    JobConf conf = new JobConf(getConf(), TestDriver.class);

    // do standard job configuration.
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    conf.setMapperClass(TestMapper.class);
    conf.setNumReduceTasks(0);

    // load aaronTest2.jar into the dcache; this contains the class ValueProvider
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));
    DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"), conf);

    // run the job.
    JobClient.runJob(conf);
  }


  and then in TestMapper:

  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {

    try {
      ValueProvider vp = (ValueProvider) Class.forName("ValueProvider").newInstance();
      Text val = vp.getValue();
      output.collect(new LongWritable(1), val);
    } catch (ClassNotFoundException e) {
      throw new IOException("not found: " + e.toString()); // newInstance() throws to here.
    } catch (Exception e) {
      throw new IOException("Exception: " + e.toString());
    }
  }


 The class ValueProvider is to be loaded from aaronTest2.jar. I can verify
 that this code works if I put ValueProvider into the main jar I deploy. I
 can verify that aaronTest2.jar makes it into the
 ${mapred.local.dir}/taskTracker/archive/

 But when run with ValueProvider in aaronTest2.jar, the job fails with:

 $ bin/hadoop jar aaronTest1.jar TestDriver
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
 attempt_200903012210_0005_m_00_0, Status : FAILED
 java.io.IOException: not found: java.lang.ClassNotFoundException:
 ValueProvider
    at TestMapper.map(Unknown Source)
    at TestMapper.map(Unknown Source)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


 Do I need to do something else (maybe in Mapper.configure()?) to actually
 classload the jar? The documentation makes me believe it should already be
 in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.

 Thanks,
 - Aaron



Re: using cascading for map-reduce

2009-04-08 Thread Erik Holstad
Hi!
If you are interested in Cascading I recommend you to ask on the Cascading
mailing list or come ask in the irc channel.
The mailing list can be found at the bottom left corner of www.cascading.org
.

Regards Erik


Re: Getting free and used space

2009-04-08 Thread Stas Oskin
Hi.

Thanks for the explanation.

Now for the easier part - how do I specify the user when connecting? :)

Is it a config file level, or run-time level setting?

Regards.

2009/4/8 Brian Bockelman bbock...@cse.unl.edu

 Hey Stas,

 Did you try this as a privileged user?  There might be some permission
 errors... in most of the released versions, getUsed() is only available to
 the Hadoop superuser.  It may be that the exception isn't propagating
 correctly.

 Brian


 On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:

  Hi.

 I'm trying to use the API to get the overall used and free spaces.

 I tried this function getUsed(), but it always returns 0.

 Any idea?

 Thanks.





Re: Getting free and used space

2009-04-08 Thread Brian Bockelman

Hey Stas,

What we do locally is apply the latest patch for this issue: 
https://issues.apache.org/jira/browse/HADOOP-4368

This makes getUsed (actually, it switches to FileSystem.getStatus) not  
a privileged action.


As far as specifying the user ... gee, I can't think of it off the top  
of my head.  It's a variable you can insert into the JobConf, but I'd  
have to poke around google or the code to remember which one (I try to  
not override it if possible).


Brian

On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:


Hi.

Thanks for the explanation.

Now for the easier part - how do I specify the user when  
connecting? :)


Is it a config file level, or run-time level setting?

Regards.

2009/4/8 Brian Bockelman bbock...@cse.unl.edu


Hey Stas,

Did you try this as a privileged user?  There might be some  
permission
errors... in most of the released versions, getUsed() is only  
available to

the Hadoop superuser.  It may be that the exception isn't propagating
correctly.

Brian


On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:

Hi.


I'm trying to use the API to get the overall used and free spaces.

I tried this function getUsed(), but it always returns 0.

Any idea?

Thanks.








run the pipes wordcount example with nopipe

2009-04-08 Thread Jianmin Woo
Hi,

After several days of investigation, I can finally run the wordcount-nopipe
example in the hadoop-0.19.1 package. However, I made some changes and am not
sure whether this is the proper/correct way. Could anyone please help to verify?

1. Start the job with the -inputformat argument set to
org.apache.hadoop.mapred.pipes.WordCountInputFormat.
2. Since the C++ RecordReader/Writer is used, no Java-based RecordReader/Writer
should be used. However, I got an error like "RecordReader defined while not
needed" from pipes. After checking org.apache.hadoop.mapred.pipes.Submitter.java,
I found this code snippet:

if (results.hasOption("-inputformat")) {
  setIsJavaRecordReader(job, true);
  job.setInputFormat(getClass(results, "-inputformat", job,
                              InputFormat.class));
}

So it seems that with -inputformat specified, the Java RecordReader will be
enabled; this caused the earlier error. I then commented out the line
setIsJavaRecordReader(job, true); and the example can be run. Is this the
proper way to make the wordcount-nopipe example work? I see from the code that
there is a commented-out line
//cli.addArgument("javareader", false, "is the RecordReader in Java");
Should this line be uncommented to support disabling the Java RecordReader via
a command-line option?

3. It seems that wordcount-nopipe only works for input/output with a local URI
like file:///home/...  Is this true?

Thanks,
Jianmin


  

RE: Example of deploying jars through DistributedCache?

2009-04-08 Thread Brian MacKay

I use addArchiveToClassPath, and it works for me.

DistributedCache.addArchiveToClassPath(new Path(path), conf);

I was curious about this block of code.  Why are you copying to tmp?

FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));

-Original Message-
From: Tom White [mailto:t...@cloudera.com] 
Sent: Wednesday, April 08, 2009 9:36 AM
To: core-user@hadoop.apache.org
Subject: Re: Example of deploying jars through DistributedCache?

Does it work if you use addArchiveToClassPath()?

Also, it may be more convenient to use GenericOptionsParser's -libjars option.

Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote:
 Hi all,

 I'm stumped as to how to use the distributed cache's classpath feature. I
 have a library of Java classes I'd like to distribute to jobs and use in my
 mapper; I figured the DCache's addFileToClassPath() method was the correct
 means, given the example at
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.


 I've boiled it down to the following non-working example:

 in TestDriver.java:


  private void runJob() throws IOException {
    JobConf conf = new JobConf(getConf(), TestDriver.class);

    // do standard job configuration.
    FileInputFormat.addInputPath(conf, new Path(input));
    FileOutputFormat.setOutputPath(conf, new Path(output));

    conf.setMapperClass(TestMapper.class);
    conf.setNumReduceTasks(0);

    // load aaronTest2.jar into the dcache; this contains the class
 ValueProvider
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(aaronTest2.jar), new
 Path(tmp/aaronTest2.jar));
    DistributedCache.addFileToClassPath(new Path(tmp/aaronTest2.jar),
 conf);

    // run the job.
    JobClient.runJob(conf);
  }


  and then in TestMapper:

  public void map(LongWritable key, Text value,
 OutputCollectorLongWritable, Text output,
      Reporter reporter) throws IOException {

    try {
      ValueProvider vp = (ValueProvider)
 Class.forName(ValueProvider).newInstance();
      Text val = vp.getValue();
      output.collect(new LongWritable(1), val);
    } catch (ClassNotFoundException e) {
      throw new IOException(not found:  + e.toString()); // newInstance()
 throws to here.
    } catch (Exception e) {
      throw new IOException(Exception: + e.toString());
    }
  }


 The class ValueProvider is to be loaded from aaronTest2.jar. I can verify
 that this code works if I put ValueProvider into the main jar I deploy. I
 can verify that aaronTest2.jar makes it into the
 ${mapred.local.dir}/taskTracker/archive/

 But when run with ValueProvider in aaronTest2.jar, the job fails with:

 $ bin/hadoop jar aaronTest1.jar TestDriver
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
 attempt_200903012210_0005_m_00_0, Status : FAILED
 java.io.IOException: not found: java.lang.ClassNotFoundException:
 ValueProvider
    at TestMapper.map(Unknown Source)
    at TestMapper.map(Unknown Source)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


 Do I need to do something else (maybe in Mapper.configure()?) to actually
 classload the jar? The documentation makes me believe it should already be
 in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.

 Thanks,
 - Aaron






Re: Web ui

2009-04-08 Thread Rasit OZDAS
@Nick, I use Ajax very often and have previously done projects with ZK
and jQuery; I can easily say that GWT was the easiest of them.
JavaScript is only needed where the core features aren't enough. I can
safely assume that we won't need any inline JavaScript.

@Philip,
Thanks for the pointer. That is a better solution than I imagined,
actually, and I won't have to wait since it's a resolved issue.

-- 
M. Raşit ÖZDAŞ


RE: Hadoop data nodes failing to start

2009-04-08 Thread Kevin Eppinger
Unfortunately not.  I don't have much leeway to experiment with this cluster.

-kevin

-Original Message-
From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel 
Cryans
Sent: Wednesday, April 08, 2009 8:30 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop data nodes failing to start

Kevin,

I'm glad it worked for you.

We talked a bit about 5114 yesterday, any chance of trying 0.18 branch
on that same cluster without the socket timeout thing?

Thx,

J-D

On Wed, Apr 8, 2009 at 9:24 AM, Kevin Eppinger
keppin...@adknowledge.com wrote:
 FYI:  Problem fixed.  It was apparently a timeout condition present in 0.18.3 
 that only popped up when the additional nodes were added.  The solution was 
 to put the following entry in hadoop-site.xml:

 <property>
   <name>dfs.datanode.socket.write.timeout</name>
   <value>0</value>
 </property>

 Thanks to 'jdcryans' and 'digarok' from IRC for the help.

 -kevin

 -Original Message-
 From: Kevin Eppinger [mailto:keppin...@adknowledge.com]
 Sent: Tuesday, April 07, 2009 1:05 PM
 To: core-user@hadoop.apache.org
 Subject: Hadoop data nodes failing to start

 Hello everyone-

 So I have a 5 node cluster that I've been running for a few weeks with no 
 problems.  Today I decided to add nodes and double its size to 10.  After 
 doing all the setup and starting the cluster, I discovered that four out of 
 the 10 nodes had failed to startup.  Specifically, the data nodes didn't 
 start.  The task trackers seemed to start fine.  Thinking it was something I 
 did incorrectly with the expansion, I then reverted back to the 5 node 
 configuration but I'm experiencing the same problem...with only 2 of 5 nodes 
 starting correctly.  Here is what I'm seeing in the hadoop-*-datanode*.log 
 files:

 2009-04-07 12:35:40,628 INFO org.apache.hadoop.dfs.DataNode: Starting 
 Periodic block scanner.
 2009-04-07 12:35:45,548 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 
 9269 blocks got processed in 1128 msecs
 2009-04-07 12:35:45,584 ERROR org.apache.hadoop.dfs.DataNode: 
 DatanodeRegistration(10.254.165.223:50010, storageID=DS-202528624-10.254.13
 1.244-50010-1238604807366, infoPort=50075, ipcPort=50020):DataXceiveServer: 
 Exiting due to:java.nio.channels.ClosedSelectorException
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:66)
        at sun.nio.ch.SelectorImpl.selectNow(SelectorImpl.java:88)
        at sun.nio.ch.Util.releaseTemporarySelector(Util.java:135)
        at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:120)
        at 
 org.apache.hadoop.dfs.DataNode$DataXceiveServer.run(DataNode.java:997)
        at java.lang.Thread.run(Thread.java:619)

 After this the data node shuts down.  This same message is appearing on all 
 the failed nodes.  Help!

 -kevin



Chaining Multiple Map reduce jobs.

2009-04-08 Thread asif md
hi everyone,

I have to chain multiple map-reduce jobs (actually 2 to 4 jobs); each of
the jobs depends on the output of the preceding job. In the reducer of each job
I'm doing very little, just grouping by key from the maps. I want to give the
output of one MapReduce job to the next job without having to go to
disk. Does anyone have any ideas on how to do this?

Thanx.


Re: Getting free and used space

2009-04-08 Thread Aaron Kimball
You can insert this property into the jobconf, or specify it on the command
line, e.g.: -D hadoop.job.ugi=username,group,group,group.

- Aaron
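
A minimal sketch of the jobconf route (an illustration only; "hadoop,supergroup"
is a placeholder for whatever superuser and group your cluster actually uses):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class UsedSpace {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // user,group[,group...] -- must be set before FileSystem.get()
    conf.set("hadoop.job.ugi", "hadoop,supergroup");

    FileSystem fs = FileSystem.get(conf);
    System.out.println("Used bytes: " + fs.getUsed());
  }
}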

On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman bbock...@cse.unl.eduwrote:

 Hey Stas,

 What we do locally is apply the latest patch for this issue:
 https://issues.apache.org/jira/browse/HADOOP-4368

 This makes getUsed (actually, it switches to FileSystem.getStatus) not a
 privileged action.

 As far as specifying the user ... gee, I can't think of it off the top of
 my head.  It's a variable you can insert into the JobConf, but I'd have to
 poke around google or the code to remember which one (I try to not override
 it if possible).

 Brian


 On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:

  Hi.

 Thanks for the explanation.

 Now for the easier part - how do I specify the user when connecting? :)

 Is it a config file level, or run-time level setting?

 Regards.

 2009/4/8 Brian Bockelman bbock...@cse.unl.edu

  Hey Stas,

 Did you try this as a privileged user?  There might be some permission
 errors... in most of the released versions, getUsed() is only available
 to
 the Hadoop superuser.  It may be that the exception isn't propagating
 correctly.

 Brian


 On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:

 Hi.


 I'm trying to use the API to get the overall used and free spaces.

 I tried this function getUsed(), but it always returns 0.

 Any idea?

 Thanks.







Re: How many people are using Hadoop Streaming?

2009-04-08 Thread Aaron Kimball
Excellent. Thanks
- A

On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley omal...@apache.org wrote:


 On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:

  Owen,

 Is binary streaming actually readily available?


 https://issues.apache.org/jira/browse/HADOOP-1722




Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Aaron Kimball
Ooh. The other DCache-based operations assume that you're dcaching files
already resident in HDFS. I guess this assumes that the filenames are on the
local filesystem.

- Aaron

On Wed, Apr 8, 2009 at 8:32 AM, Brian MacKay brian.mac...@medecision.comwrote:


 I use addArchiveToClassPath, and it works for me.

 DistributedCache.addArchiveToClassPath(new Path(path), conf);

 I was curious about this block of code.  Why are you copying to tmp?

 FileSystem fs = FileSystem.get(conf);
 fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));

 -Original Message-
 From: Tom White [mailto:t...@cloudera.com]
 Sent: Wednesday, April 08, 2009 9:36 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Example of deploying jars through DistributedCache?

 Does it work if you use addArchiveToClassPath()?

 Also, it may be more convenient to use GenericOptionsParser's -libjars
 option.

 Tom

 On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote:
  Hi all,
 
  I'm stumped as to how to use the distributed cache's classpath feature. I
  have a library of Java classes I'd like to distribute to jobs and use in
 my
  mapper; I figured the DCache's addFileToClassPath() method was the
 correct
  means, given the example at
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
 .
 
 
  I've boiled it down to the following non-working example:
 
  in TestDriver.java:
 
 
   private void runJob() throws IOException {
 JobConf conf = new JobConf(getConf(), TestDriver.class);
 
 // do standard job configuration.
 FileInputFormat.addInputPath(conf, new Path(input));
 FileOutputFormat.setOutputPath(conf, new Path(output));
 
 conf.setMapperClass(TestMapper.class);
 conf.setNumReduceTasks(0);
 
 // load aaronTest2.jar into the dcache; this contains the class
  ValueProvider
 FileSystem fs = FileSystem.get(conf);
 fs.copyFromLocalFile(new Path(aaronTest2.jar), new
  Path(tmp/aaronTest2.jar));
 DistributedCache.addFileToClassPath(new Path(tmp/aaronTest2.jar),
  conf);
 
 // run the job.
 JobClient.runJob(conf);
   }
 
 
   and then in TestMapper:
 
   public void map(LongWritable key, Text value,
  OutputCollectorLongWritable, Text output,
   Reporter reporter) throws IOException {
 
 try {
   ValueProvider vp = (ValueProvider)
  Class.forName(ValueProvider).newInstance();
   Text val = vp.getValue();
   output.collect(new LongWritable(1), val);
 } catch (ClassNotFoundException e) {
   throw new IOException(not found:  + e.toString()); //
 newInstance()
  throws to here.
 } catch (Exception e) {
   throw new IOException(Exception: + e.toString());
 }
   }
 
 
  The class ValueProvider is to be loaded from aaronTest2.jar. I can
 verify
  that this code works if I put ValueProvider into the main jar I deploy. I
  can verify that aaronTest2.jar makes it into the
  ${mapred.local.dir}/taskTracker/archive/
 
  But when run with ValueProvider in aaronTest2.jar, the job fails with:
 
  $ bin/hadoop jar aaronTest1.jar TestDriver
  09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
 process
  : 10
  09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
 process
  : 10
  09/03/01 22:36:04 INFO mapred.JobClient: Running job:
 job_200903012210_0005
  09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
  09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
  attempt_200903012210_0005_m_00_0, Status : FAILED
  java.io.IOException: not found: java.lang.ClassNotFoundException:
  ValueProvider
 at TestMapper.map(Unknown Source)
 at TestMapper.map(Unknown Source)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 at
  org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 
 
  Do I need to do something else (maybe in Mapper.configure()?) to actually
  classload the jar? The documentation makes me believe it should already
 be
  in the classpath by doing only what I've done above. I'm on Hadoop
 0.18.3.
 
  Thanks,
  - Aaron
 






Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Lukáš Vlček
Hi,
I am by far not a Hadoop expert, but I think you cannot start a Map task
until the previous Reduce is finished. This means that you probably have to
store the Map output to disk first (because a] it may not fit into memory and
b] you would risk data loss if the system crashes).
As for job chaining, you can check the JobControl class
(http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html).

Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702

Regards,
Lukas
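
A minimal JobControl sketch (an illustration only, using the old
org.apache.hadoop.mapred API; conf1 and conf2 stand for two already-configured
jobs, with the second job's input path pointing at the first job's output
directory):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ChainJobs {
  public static void runChain(JobConf conf1, JobConf conf2) throws Exception {
    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    job2.addDependingJob(job1);            // job2 will not start until job1 succeeds

    JobControl control = new JobControl("chain");
    control.addJob(job1);
    control.addJob(job2);

    Thread runner = new Thread(control);   // JobControl implements Runnable
    runner.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);                  // poll until both jobs are done
    }
    control.stop();
  }
}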

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote:

 hi everyone,

 i have to chain multiple map reduce jobs  actually 2 to 4 jobs , each of
 the jobs depends on the o/p of preceding job. In the reducer of each job
 I'm
 doing very little  just grouping by key from the maps. I want to give the
 output of one MapReduce job to the next job without having to go to the
 disk. Does anyone have any ideas on how to do this?

 Thanx.




-- 
http://blog.lukas-vlcek.com/


Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread Nathan Marz
You can also try decreasing the replication factor for the  
intermediate files between jobs. This will make writing those files  
faster.
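
A small sketch of that idea (an illustration only; conf stands for the JobConf
of a job whose output is just intermediate data for the next stage):

import org.apache.hadoop.mapred.JobConf;

public class IntermediateConf {
  static void lowerOutputReplication(JobConf conf) {
    // Intermediate output can be regenerated by re-running the job if a
    // replica is lost, so it doesn't need the default replication factor.
    conf.setInt("dfs.replication", 2);
  }
}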


On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:


Hi,
by far I am not an Hadoop expert but I think you can not start Map  
task

until the previous Reduce is finished. Saying this it means that you
probably have to store the Map output to the disk first (because a]  
it may
not fit into memory and b] you would risk data loss if the system  
crashes).

As for the job chaining you can check JobControl class (
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html)



Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702

Regards,
Lukas

On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote:


hi everyone,

i have to chain multiple map reduce jobs  actually 2 to 4 jobs ,  
each of
the jobs depends on the o/p of preceding job. In the reducer of  
each job

I'm
doing very little  just grouping by key from the maps. I want to  
give the
output of one MapReduce job to the next job without having to go to  
the

disk. Does anyone have any ideas on how to do this?

Thanx.





--
http://blog.lukas-vlcek.com/




Re: BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread Todd Lipcon
Hi Bing,

The issue here is that BytesWritable uses an internal buffer which is grown
but not shrunk. The cause of this is that Writables in general are single
instances that are shared across multiple input records. If you look at the
internals of the input reader, you'll see that a single BytesWritable is
instantiated, and then each time a record is read, it's read into that same
instance. The purpose here is to avoid the allocation cost for each row.

The end result is, as you've seen, that getBytes() returns an array which
may be larger than the actual amount of data. In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents, not zero.

Unfortunately, if the protobuffer API doesn't allow you to deserialize out
of a smaller portion of a byte array, you're out of luck and will have to do
the copy like you've mentioned. I imagine, though, that there's some way
around this in the protobuffer API - perhaps you can use a
ByteArrayInputStream here to your advantage.

Hope that helps
-Todd
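
Two small sketches of consuming only the valid bytes (an illustration only; on
0.17/0.18 the accessors are get()/getSize(), renamed to getBytes()/getLength()
in later releases):

import java.io.ByteArrayInputStream;
import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableUtil {
  // Copies just the first getSize() bytes; everything past that is stale buffer.
  static byte[] exactCopy(BytesWritable value) {
    return Arrays.copyOf(value.get(), value.getSize());
  }

  // Avoids the copy entirely by bounding a stream to the valid region,
  // e.g. for a deserializer that accepts an InputStream.
  static ByteArrayInputStream validRegion(BytesWritable value) {
    return new ByteArrayInputStream(value.get(), 0, value.getSize());
  }
}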

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote:


 I tried to store protocolbuffer as BytesWritable in a sequence file Text,
 BytesWritable.  It's stored using SequenceFile.Writer(new Text(key), new
 BytesWritable(protobuf.convertToBytes())).  When reading the values from
 key/value pairs using value.get(), it returns more then what's stored.
 However, value.getSize() returns the correct number.  This means in order
 to
 convert the byte[] to protocol buffer again, I have to do
 Arrays.copyOf(value.get(), value.getSize()).  This happens on both version
 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
 few entries in the sequence file below.  The extra bytes in value.get() all
 have values of zero.

 value.getSize(): 7066   value.get().length: 10599
 value.getSize(): 36456  value.get().length: 54684
 value.getSize(): 32275  value.get().length: 54684
 value.getSize(): 40561  value.get().length: 54684
 value.getSize(): 16855  value.get().length: 54684
 value.getSize(): 66304  value.get().length: 99456
 value.getSize(): 26488  value.get().length: 99456
 value.getSize(): 59327  value.get().length: 99456
 value.getSize(): 36865  value.get().length: 99456

 --
 View this message in context:
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread bzheng

I tried to store a protocol buffer as BytesWritable in a sequence file <Text,
BytesWritable>.  It's stored using SequenceFile.Writer(new Text(key), new
BytesWritable(protobuf.convertToBytes())).  When reading the values from
key/value pairs using value.get(), it returns more than what's stored.
However, value.getSize() returns the correct number.  This means that in order
to convert the byte[] back to a protocol buffer, I have to do
Arrays.copyOf(value.get(), value.getSize()).  This happens on both versions
0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for a
few entries in the sequence file are below.  The extra bytes in value.get() all
have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456

-- 
View this message in context: 
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread Gaurav Chandalia
Arrays.copyOf isn't required; protocol buffers have a method to merge
from bytes. You can do:

protobuf.newBuilder().mergeFrom(value.getBytes(), 0, value.getLength())

The above is for Hadoop 0.19.1; the corresponding BytesWritable method names
for earlier versions of Hadoop might be slightly different.


--
gaurav



On Apr 8, 2009, at 7:13 PM, Todd Lipcon wrote:


Hi Bing,

The issue here is that BytesWritable uses an internal buffer which  
is grown
but not shrunk. The cause of this is that Writables in general are  
single
instances that are shared across multiple input records. If you look  
at the
internals of the input reader, you'll see that a single  
BytesWritable is
instantiated, and then each time a record is read, it's read into  
that same
instance. The purpose here is to avoid the allocation cost for each  
row.


The end result is, as you've seen, that getBytes() returns an array  
which

may be larger than the actual amount of data. In fact, the extra bytes
(between .getSize() and .get().length) have undefined contents, not  
zero.


Unfortunately, if the protobuffer API doesn't allow you to  
deserialize out
of a smaller portion of a byte array, you're out of luck and will  
have to do
the copy like you've mentioned. I imagine, though, that there's some  
way

around this in the protobuffer API - perhaps you can use a
ByteArrayInputStream here to your advantage.

Hope that helps
-Todd

On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote:



I tried to store protocolbuffer as BytesWritable in a sequence file  
Text,
BytesWritable.  It's stored using SequenceFile.Writer(new  
Text(key), new
BytesWritable(protobuf.convertToBytes())).  When reading the values  
from
key/value pairs using value.get(), it returns more then what's  
stored.
However, value.getSize() returns the correct number.  This means in  
order

to
convert the byte[] to protocol buffer again, I have to do
Arrays.copyOf(value.get(), value.getSize()).  This happens on both  
version
0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample  
sizes for a
few entries in the sequence file below.  The extra bytes in  
value.get() all

have values of zero.

value.getSize(): 7066   value.get().length: 10599
value.getSize(): 36456  value.get().length: 54684
value.getSize(): 32275  value.get().length: 54684
value.getSize(): 40561  value.get().length: 54684
value.getSize(): 16855  value.get().length: 54684
value.getSize(): 66304  value.get().length: 99456
value.getSize(): 26488  value.get().length: 99456
value.getSize(): 59327  value.get().length: 99456
value.getSize(): 36865  value.get().length: 99456

--
View this message in context:
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.






Re: BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread bzheng

Thanks for the clarification.  Though I still find it strange that the get()
method doesn't return only what's actually stored, regardless of buffer size.
Is there any reason why you'd want to use/examine what's in the buffer?


Todd Lipcon-4 wrote:
 
 Hi Bing,
 
 The issue here is that BytesWritable uses an internal buffer which is
 grown
 but not shrunk. The cause of this is that Writables in general are single
 instances that are shared across multiple input records. If you look at
 the
 internals of the input reader, you'll see that a single BytesWritable is
 instantiated, and then each time a record is read, it's read into that
 same
 instance. The purpose here is to avoid the allocation cost for each row.
 
 The end result is, as you've seen, that getBytes() returns an array which
 may be larger than the actual amount of data. In fact, the extra bytes
 (between .getSize() and .get().length) have undefined contents, not zero.
 
 Unfortunately, if the protobuffer API doesn't allow you to deserialize out
 of a smaller portion of a byte array, you're out of luck and will have to
 do
 the copy like you've mentioned. I imagine, though, that there's some way
 around this in the protobuffer API - perhaps you can use a
 ByteArrayInputStream here to your advantage.
 
 Hope that helps
 -Todd
 
 On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote:
 

 I tried to store protocolbuffer as BytesWritable in a sequence file
 Text,
 BytesWritable.  It's stored using SequenceFile.Writer(new Text(key), new
 BytesWritable(protobuf.convertToBytes())).  When reading the values from
 key/value pairs using value.get(), it returns more then what's stored.
 However, value.getSize() returns the correct number.  This means in order
 to
 convert the byte[] to protocol buffer again, I have to do
 Arrays.copyOf(value.get(), value.getSize()).  This happens on both
 version
 0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for
 a
 few entries in the sequence file below.  The extra bytes in value.get()
 all
 have values of zero.

 value.getSize(): 7066   value.get().length: 10599
 value.getSize(): 36456  value.get().length: 54684
 value.getSize(): 32275  value.get().length: 54684
 value.getSize(): 40561  value.get().length: 54684
 value.getSize(): 16855  value.get().length: 54684
 value.getSize(): 66304  value.get().length: 99456
 value.getSize(): 26488  value.get().length: 99456
 value.getSize(): 59327  value.get().length: 99456
 value.getSize(): 36865  value.get().length: 99456

 --
 View this message in context:
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.


 
 

-- 
View this message in context: 
http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread Todd Lipcon
On Wed, Apr 8, 2009 at 7:14 PM, bzheng bing.zh...@gmail.com wrote:


 Thanks for the clarification.  Though I still find it strange why not have
 the get() method return what's actually stored regardless of buffer size.
 Is there any reason why you'd want to use/examine what's in the buffer?


Because doing so requires an array copy. It's important for Hadoop
performance to avoid needless copies of data. Most
APIs that take byte[] arrays have a version that includes an offset and
length.

-Todd





 Todd Lipcon-4 wrote:
 
  Hi Bing,
 
  The issue here is that BytesWritable uses an internal buffer which is
  grown
  but not shrunk. The cause of this is that Writables in general are single
  instances that are shared across multiple input records. If you look at
  the
  internals of the input reader, you'll see that a single BytesWritable is
  instantiated, and then each time a record is read, it's read into that
  same
  instance. The purpose here is to avoid the allocation cost for each row.
 
  The end result is, as you've seen, that getBytes() returns an array which
  may be larger than the actual amount of data. In fact, the extra bytes
  (between .getSize() and .get().length) have undefined contents, not zero.
 
  Unfortunately, if the protobuffer API doesn't allow you to deserialize
 out
  of a smaller portion of a byte array, you're out of luck and will have to
  do
  the copy like you've mentioned. I imagine, though, that there's some way
  around this in the protobuffer API - perhaps you can use a
  ByteArrayInputStream here to your advantage.
 
  Hope that helps
  -Todd
 
  On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote:
 
 
  I tried to store protocolbuffer as BytesWritable in a sequence file
  Text,
  BytesWritable.  It's stored using SequenceFile.Writer(new Text(key),
 new
  BytesWritable(protobuf.convertToBytes())).  When reading the values from
  key/value pairs using value.get(), it returns more then what's stored.
  However, value.getSize() returns the correct number.  This means in
 order
  to
  convert the byte[] to protocol buffer again, I have to do
  Arrays.copyOf(value.get(), value.getSize()).  This happens on both
  version
  0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes for
  a
  few entries in the sequence file below.  The extra bytes in value.get()
  all
  have values of zero.
 
  value.getSize(): 7066   value.get().length: 10599
  value.getSize(): 36456  value.get().length: 54684
  value.getSize(): 32275  value.get().length: 54684
  value.getSize(): 40561  value.get().length: 54684
  value.getSize(): 16855  value.get().length: 54684
  value.getSize(): 66304  value.get().length: 99456
  value.getSize(): 26488  value.get().length: 99456
  value.getSize(): 59327  value.get().length: 99456
  value.getSize(): 36865  value.get().length: 99456
 
  --
  View this message in context:
 
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: BytesWritable get() returns more bytes than what's stored

2009-04-08 Thread Alex Loddengaard
FYI: this (open) JIRA might be interesting to you:

http://issues.apache.org/jira/browse/HADOOP-3788

Alex

On Wed, Apr 8, 2009 at 7:18 PM, Todd Lipcon t...@cloudera.com wrote:

 On Wed, Apr 8, 2009 at 7:14 PM, bzheng bing.zh...@gmail.com wrote:

 
  Thanks for the clarification.  Though I still find it strange why not
 have
  the get() method return what's actually stored regardless of buffer size.
  Is there any reason why you'd want to use/examine what's in the buffer?
 

 Because doing so requires an array copy. It's important for hadoop
 performance to avoid needless copies of data when they're unnecessary. Most
 APIs that take byte[] arrays have a version that includes an offset and
 length.

 -Todd



 
 
  Todd Lipcon-4 wrote:
  
   Hi Bing,
  
   The issue here is that BytesWritable uses an internal buffer which is
   grown
   but not shrunk. The cause of this is that Writables in general are
 single
   instances that are shared across multiple input records. If you look at
   the
   internals of the input reader, you'll see that a single BytesWritable
 is
   instantiated, and then each time a record is read, it's read into that
   same
   instance. The purpose here is to avoid the allocation cost for each
 row.
  
   The end result is, as you've seen, that getBytes() returns an array
 which
   may be larger than the actual amount of data. In fact, the extra bytes
   (between .getSize() and .get().length) have undefined contents, not
 zero.
  
   Unfortunately, if the protobuffer API doesn't allow you to deserialize
  out
   of a smaller portion of a byte array, you're out of luck and will have
 to
   do
   the copy like you've mentioned. I imagine, though, that there's some
 way
   around this in the protobuffer API - perhaps you can use a
   ByteArrayInputStream here to your advantage.
  
   Hope that helps
   -Todd
  
   On Wed, Apr 8, 2009 at 4:59 PM, bzheng bing.zh...@gmail.com wrote:
  
  
   I tried to store protocolbuffer as BytesWritable in a sequence file
   Text,
   BytesWritable.  It's stored using SequenceFile.Writer(new Text(key),
  new
   BytesWritable(protobuf.convertToBytes())).  When reading the values
 from
   key/value pairs using value.get(), it returns more then what's stored.
   However, value.getSize() returns the correct number.  This means in
  order
   to
   convert the byte[] to protocol buffer again, I have to do
   Arrays.copyOf(value.get(), value.getSize()).  This happens on both
   version
   0.17.2 and 0.18.3.  Does anyone know why this happens?  Sample sizes
 for
   a
   few entries in the sequence file below.  The extra bytes in
 value.get()
   all
   have values of zero.
  
   value.getSize(): 7066   value.get().length: 10599
   value.getSize(): 36456  value.get().length: 54684
   value.getSize(): 32275  value.get().length: 54684
   value.getSize(): 40561  value.get().length: 54684
   value.getSize(): 16855  value.get().length: 54684
   value.getSize(): 66304  value.get().length: 99456
   value.getSize(): 26488  value.get().length: 99456
   value.getSize(): 59327  value.get().length: 99456
   value.getSize(): 36865  value.get().length: 99456
  
   --
   View this message in context:
  
 
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22962146.html
   Sent from the Hadoop core-user mailing list archive at Nabble.com.
  
  
  
  
 
  --
  View this message in context:
 
 http://www.nabble.com/BytesWritable-get%28%29-returns-more-bytes-then-what%27s-stored-tp22962146p22963309.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: Chaining Multiple Map reduce jobs.

2009-04-08 Thread jason hadoop
Chapter 8 of my book covers this in detail; the alpha chapter should be
available at the Apress web site.
Chain mapping rules!
http://www.apress.com/book/view/1430219424

On Wed, Apr 8, 2009 at 3:30 PM, Nathan Marz nat...@rapleaf.com wrote:

 You can also try decreasing the replication factor for the intermediate
 files between jobs. This will make writing those files faster.


 On Apr 8, 2009, at 3:14 PM, Lukáš Vlček wrote:

  Hi,
 by far I am not an Hadoop expert but I think you can not start Map task
 until the previous Reduce is finished. Saying this it means that you
 probably have to store the Map output to the disk first (because a] it may
 not fit into memory and b] you would risk data loss if the system
 crashes).
 As for the job chaining you can check JobControl class (

 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/jobcontrol/JobControl.html)

 Also you can look at https://issues.apache.org/jira/browse/HADOOP-3702

 Regards,
 Lukas

 On Wed, Apr 8, 2009 at 11:30 PM, asif md asif.d...@gmail.com wrote:

  hi everyone,

 i have to chain multiple map reduce jobs  actually 2 to 4 jobs , each
 of
 the jobs depends on the o/p of preceding job. In the reducer of each job
 I'm
 doing very little  just grouping by key from the maps. I want to give
 the
 output of one MapReduce job to the next job without having to go to the
 disk. Does anyone have any ideas on how to do this?

 Thanx.




 --
 http://blog.lukas-vlcek.com/





-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


CloudBurst: Hadoop for DNA Sequence Analysis

2009-04-08 Thread Michael Schatz
Hadoop Users,

I just wanted to announce my Hadoop application 'CloudBurst' is available
open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome to, for example, map out differences in one
individual's genome compared to the reference genome. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application to
scale up linearly to large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz


RE: CloudBurst: Hadoop for DNA Sequence Analysis

2009-04-08 Thread Dmitry Pushkarev
As a matter of fact, it is nowhere near data-intensive: it does take gigabytes
of input data to process, but it is mostly RAM- and CPU-intensive.
Post-processing of alignment files, though, is exactly where Hadoop excels. As
far as I understand, the majority of the time is spent on DP alignment, whereas
navigation in seed space and the N*log(N) sort require only a fraction of that
time - that was my experience applying a Hadoop cluster to sequencing human
genomes.



---
Dmitry Pushkarev
+1-650-644-8988

-Original Message-
From: michael.sch...@gmail.com [mailto:michael.sch...@gmail.com] On Behalf
Of Michael Schatz
Sent: Wednesday, April 08, 2009 9:19 PM
To: core-user@hadoop.apache.org
Subject: CloudBurst: Hadoop for DNA Sequence Analysis

Hadoop Users,

I just wanted to announce my Hadoop application 'CloudBurst' is available
open source at:
http://cloudburst-bio.sourceforge.net

In a nutshell, it is an application for mapping millions of short DNA
sequences to a reference genome to, for example, map out differences in one
individual's genome compared to the reference genome. As you might imagine,
this is a very data-intensive problem, but Hadoop enables the application to
scale up linearly to large clusters.

A full description of the program is available in the journal
Bioinformatics:
http://bioinformatics.oxfordjournals.org/cgi/content/abstract/btp236

I also wanted to take this opportunity to thank everyone on this mailing
list. The discussions posted were essential for navigating the ins and outs
of hadoop during the development of CloudBurst.

Thanks everyone!

Michael Schatz

http://www.cbcb.umd.edu/~mschatz