RE: not a SequenceFile?

2009-06-18 Thread Shravan Mahankali
Thanks for your suggestion Eason.

I think the sequence file issue is resolved; however, success is still far
off!

Below is the command I executed in the terminal:
#/bin/hadoop jar AggregateWordCount.jar
org.apache.hadoop.examples.AggregateWordCount words/*
aggregatewordcount_output 2 textinputformat

Error:

09/06/18 11:51:42 INFO mapred.JobClient: Task Id : attempt_200906181145_0005_r_01_1, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass
        at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createInstance(UserDefinedValueAggregatorDescriptor.java:57)
        at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createAggregator(UserDefinedValueAggregatorDescriptor.java:64)
        at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.<init>(UserDefinedValueAggregatorDescriptor.java:76)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.getValueAggregatorDescriptor(ValueAggregatorJobBase.java:54)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.getAggregatorDescriptors(ValueAggregatorJobBase.java:65)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.initializeMySpec(ValueAggregatorJobBase.java:74)
        at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.configure(ValueAggregatorJobBase.java:42)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:240)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:242)
        at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createInstance(UserDefinedValueAggregatorDescriptor.java:52)
        ... 10 more


However, I can see that the
org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass class is
available in AggregateWordCount.jar; below is the listing of this jar:

# jar -tvf AggregateWordCount.jar
 0 Tue Jun 09 18:11:42 IST 2009 META-INF/
71 Tue Jun 09 18:11:42 IST 2009 META-INF/MANIFEST.MF
 0 Tue Jun 09 18:10:58 IST 2009 org/
 0 Tue Jun 09 18:10:58 IST 2009 org/apache/
 0 Tue Jun 09 18:10:58 IST 2009 org/apache/hadoop/
 0 Tue Jun 09 18:10:58 IST 2009 org/apache/hadoop/examples/
  1298 Tue Jun 09 18:11:32 IST 2009 org/apache/hadoop/examples/AggregateWordCount$WordCountPlugInClass.class
   846 Tue Jun 09 18:11:32 IST 2009 org/apache/hadoop/examples/AggregateWordCount.class


Could you please advise what the problem might be here?

Thank You,
Shravan Kumar. M 
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
-
This email and any files transmitted with it are confidential and intended
solely for the use of the individual or entity to whom they are addressed.
If you have received this email in error please notify the system
administrator - netopshelpd...@catalytic.com
-Original Message-
From: Eason.Lee [mailto:leongf...@gmail.com] 
Sent: Thursday, June 18, 2009 11:27 AM
To: core-user@hadoop.apache.org; shravan.mahank...@catalytic.com
Subject: Re: not a SequenceFile?

You'd better run it like this:

bin/hadoop jar hadoop-*-examples.jar aggregatewordcount input output
numOfReducers textinputformat

Pass textinputformat as the last argument to use TextInputFormat instead of
the default SequenceFile input.

Hope it is helpful!

2009/6/18 Shravan Mahankali shravan.mahank...@catalytic.com

 Hi Nick,



 Thanks for your response.



 I am very new to Hadoop. I was trying to execute the AggregateWordCount
 example program provided in the Hadoop distribution, from Linux, but
 was having the issue stated in my earlier email.



 I have also tried executing the MultiFetch example with no success.



 If this example program needs the input file to be a sequence file, how
 should I create one? Please advise.



 Thank You,

 Shravan Kumar. M

 Catalytic Software Ltd. [SEI-CMMI Level 5 Company]

 -

 This email and any files transmitted with it are confidential and intended
 solely for the use of the 

Re: not a SequenceFile?

2009-06-18 Thread Eason.Lee
http://www.nabble.com/ClassNotFoundException-td23441528.html

You must make all of the required jars available to all of your tasks. You
can either install them on all the tasktracker machines and set up the
tasktracker classpath to include them, or distribute them via the
distributed cache.
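
For what it's worth, a minimal driver-side sketch of the distributed cache
route might look like this (the HDFS path is made up and the class name is
just a placeholder; API as in the 0.19/0.20 releases):

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class AddJarToTaskClasspath {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(AddJarToTaskClasspath.class);
        // The jar must already be in HDFS, e.g.:
        //   bin/hadoop fs -put AggregateWordCount.jar /libs/AggregateWordCount.jar
        // Every task's classloader will then see the classes inside it.
        DistributedCache.addFileToClassPath(new Path("/libs/AggregateWordCount.jar"), conf);
        // ... set input/output paths and formats, then JobClient.runJob(conf);
    }
}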


2009/6/18 Shravan Mahankali shravan.mahank...@catalytic.com

 Thanks for your suggestion Eason.

 I think the sequence file issue is resolved; however, success is still far
 off! [...]

Data replication and moving computation

2009-06-18 Thread rajeev gupta

I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS cluster
and the replication factor is 1. A large file sits on one of those three
cluster machines in its local file system. If I put that file into HDFS, will
it be divided and distributed across all three machines? I have this doubt
because HDFS is built on the idea that moving computation is cheaper than
moving data.

If the file is distributed across all three machines, there will be a lot of
data transfer, whereas if the file is NOT distributed, the compute power of
the other machines will be unused. Am I missing something here?

-Raj



  


Re: Hadoop Eclipse Plugin

2009-06-18 Thread Rajeev Gupta
You need to give:
1) Map/Reduce Master Host: the host where start-mapred.sh was run (the JobTracker).
2) Map/Reduce Master Port: 19001 (see the hadoop-site.xml file)
3) DFS Master Host: the host where start-dfs.sh was run (the NameNode)
4) DFS Master Port: 19000

These parameters will be sufficient to access HDFS. You may need to set up
some advanced parameters to give permissions to the Windows user on the hosts
where Hadoop is running.
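
As a quick sanity check outside Eclipse, a small standalone client can
confirm that the host/port values you type into the plugin actually reach the
cluster. The hosts below are placeholders and the ports are just the example
values from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckMasters {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same values you would enter in the plugin's DFS / Map-Reduce master fields.
        conf.set("fs.default.name", "hdfs://dfs-master-host:19000");
        conf.set("mapred.job.tracker", "mapred-master-host:19001");
        FileSystem fs = FileSystem.get(conf);
        // If this lists the HDFS root, the DFS master settings are correct.
        for (FileStatus status : fs.listStatus(new Path("/"))) {
            System.out.println(status.getPath());
        }
    }
}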

Thanks and regards.
-Rajeev Gupta



   
From: Praveen Yarlagadda <praveen.yarlagadd...@gmail.com>
To: core-user@hadoop.apache.org
Date: 06/18/2009 08:39 AM
Subject: Hadoop Eclipse Plugin
Please respond to: core-u...@hadoop.apache.org




Hi,

I have a problem configuring Hadoop Map/Reduce plugin with Eclipse.

Setup Details:

I have a namenode, a jobtracker and two data nodes, all running on ubuntu.
My set up works fine with example programs. I want to connect to this setup
from eclipse.

namenode - 10.20.104.62 - 54310(port)
jobtracker - 10.20.104.53 - 54311(port)

I run Eclipse on a different Windows machine. I want to configure the
map/reduce plugin with Eclipse, so that I can access HDFS from Windows.

Map/Reduce master
Host - With jobtracker IP, it did not work
Port - With jobtracker port, it did not work

DFS master
Host - With namenode IP, It did not work
Port - With namenode port, it did not work

I tried the other combination too, giving namenode details for the Map/Reduce
master and jobtracker details for the DFS master. That did not work either.

If anyone has configured the plugin with Eclipse, please let me know. Even
pointers on how to configure it would be highly appreciated.

Thanks,
Praveen




Re: Data replication and moving computation

2009-06-18 Thread Harish Mallipeddi
On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta graj1...@yahoo.com wrote:


 I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS
 cluster and replication factor is 1. A large file is there on one of those
 three cluster machines in its local file system. If I put that file in HDFS
 will it be divided and distributed across all three machines? I had this
 doubt as HDFS moving computation is cheaper than moving data.

 If file is distributed across all three machines, lots of data transfer
 will be there, whereas, if file is NOT distributed then compute power of
 other machine will be unused. Am I missing something here?

 -Raj



Irrespective of what you set as the replication factor, large files will
always be split into chunks (chunk size is what you set as your HDFS
block-size) and they'll be distributed across your entire cluster.
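
If you want to see the distribution for yourself, something along these lines
(a sketch against the 0.19/0.20 FileSystem API; the path is just an example)
prints which datanodes hold each block of a file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/raj/largefile.dat");   // example path
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block; getHosts() names the datanodes holding it.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (int i = 0; i < blocks.length; i++) {
            System.out.println("block " + i + " on hosts: "
                    + java.util.Arrays.toString(blocks[i].getHosts()));
        }
    }
}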


-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

2009-06-18 Thread Roshan James
I did get this working. InputSplit information is not returned clearly. You
may want to look at this thread -
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cee216d470906121602k7f914179u5d9555e7bb080...@mail.gmail.com%3e


On Thu, Jun 18, 2009 at 12:49 AM, Jianmin Woo jianmin_...@yahoo.com wrote:


 I tried this example and it seems that the input/output should only be in
 file:///... format to get correct results.

 - Jianmin




 
 From: Viral K khaju...@yahoo-inc.com
 To: core-user@hadoop.apache.org
 Sent: Thursday, June 18, 2009 8:57:47 AM
 Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from
 input splits


 Does anybody have any updates on this?

 How can we have our own RecordReader in Hadoop pipes?  When I try to print
 the context.getInputSplit, I get the filenames along with some junk
 characters.  As a result the file open fails.

 Anybody got it working?

 Viral.



 11 Nov. wrote:
 
   I traced into the C++ recordreader code:
     WordCountReader(HadoopPipes::MapContext& context) {
       std::string filename;
       HadoopUtils::StringInStream stream(context.getInputSplit());
       HadoopUtils::deserializeString(filename, stream);
       struct stat statResult;
       stat(filename.c_str(), &statResult);
       bytesTotal = statResult.st_size;
       bytesRead = 0;
       cout << filename << endl;
       file = fopen(filename.c_str(), "rt");
       HADOOP_ASSERT(file != NULL, "failed to open " + filename);
     }
 
   I got nothing for the filename variable, which showed the InputSplit is
   empty.
 
  2008/3/4, 11 Nov. nov.eleve...@gmail.com:
 
  hi colleagues,
 I have set up the single node cluster to test pipes examples.
 wordcount-simple and wordcount-part work just fine. but
   wordcount-nopipe can't run. Here is my command line:
 
   bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input
  input/ -output out-dir-nopipe1
 
  and here is the error message printed on my console:
 
  08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set.  User
  classes may not be found. See JobConf(Class) or JobConf#setJar(String).
  08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to
  process : 1
  08/03/03 23:23:07 INFO mapred.JobClient: Running job:
  job_200803032218_0004
  08/03/03 23:23:08 INFO mapred.JobClient:  map 0% reduce 0%
  08/03/03 23:23:11 INFO mapred.JobClient: Task Id :
  task_200803032218_0004_m_00_0, Status : FAILED
  java.io.IOException: pipe child exception
  at org.apache.hadoop.mapred.pipes.Application.abort(
  Application.java:138)
  at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(
  PipesMapRunner.java:83)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(
  TaskTracker.java:1787)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readByte(DataInputStream.java:250)
  at
  org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java
  :313)
  at
 org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java
  :335)
  at
  org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(
  BinaryProtocol.java:112)
 
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to
 open
  at /home/hadoop/hadoop-0.15.2-single-cluster
  /src/examples/pipes/impl/wordcount-nopipe.cc:67 in
  WordCountReader::WordCountReader(HadoopPipes::MapContext)
 
 
  Could anybody tell me how to fix this? That will be appreciated.
  Thanks a lot!
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Pipes-example-wordcount-nopipe.cc-failed-when-reading-from-input-splits-tp15807856p24084734.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.






Re: JobControl for Pipes?

2009-06-18 Thread Roshan James
Can you give me a URL or so to both? I can't seem to find either one after a
couple of basic web searches.

Also, when you say JobControl is coming to Hadoop - I can already see the
Java JobControl classes that let one express dependencies between jobs. So
I assume this already works in Java - does it not? I was asking whether this
functionality is exposed via Pipes in some way.

Roshan


On Wed, Jun 17, 2009 at 10:59 PM, jason hadoop jason.had...@gmail.comwrote:

 Job control is coming with the Hadoop workflow manager; in the meantime
 there is Cascading by Chris Wensel. I do not have any personal experience
 with either. I do not know how pipes interacts with either.

 On Wed, Jun 17, 2009 at 12:43 PM, Roshan James 
 roshan.james.subscript...@gmail.com wrote:

  Hello, Is there any way to express dependencies between map-reduce jobs
  (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs?  The
  provided header Pipes.hh does not seem to reflect any such capabilities.
 
  best,
  Roshan
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Getting Task ID inside a Mapper

2009-06-18 Thread Mark Desnoyer
Hi,

I was wondering if it's possible to get a hold of the task id inside a
mapper? I can't seem to find a way by trolling through the API reference.
I'm trying to implement a Map Reduce version of Latent Dirichlet Allocation
and I need to be able to initialize a random number generator in a task
specific way so that if the task fails and is rerun elsewhere, the results
are the same. Thanks in advance.

Cheers,
Mark Desnoyer


Re: Getting Task ID inside a Mapper

2009-06-18 Thread Piotr Praczyk
Hi
Why don't you provide a seed for the random generator, generated outside the
task? Then when the task fails, you can provide the same value, stored
somewhere outside.
You could use the task configuration to do so.
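
Something like this driver-side fragment would do it (the property name
"lda.random.seed" is made up for the sketch):

import org.apache.hadoop.mapred.JobConf;

public class LdaDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LdaDriver.class);
        // Generate the seed once, outside the tasks, and ship it in the job
        // configuration so a rerun of a failed task sees the same value.
        conf.setLong("lda.random.seed", System.currentTimeMillis());
        // ... set mapper class, input/output paths, then JobClient.runJob(conf);
    }
}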

I don't know anything about obtaining the task ID from within.


regards
Piotr

2009/6/18 Mark Desnoyer mdesno...@gmail.com

 Hi,

 I was wondering if it's possible to get a hold of the task id inside a
 mapper? I cant' seem to find a way by trolling through the API reference.
 I'm trying to implement a Map Reduce version of Latent Dirichlet Allocation
 and I need to be able to initialize a random number generator in a task
 specific way so that if the task fails and is rerun elsewhere, the results
 are the same. Thanks in advance.

 Cheers,
 Mark Desnoyer



Re: Getting Task ID inside a Mapper

2009-06-18 Thread Mark Desnoyer
Thanks! I'll try that.

-Mark

On Thu, Jun 18, 2009 at 10:27 AM, Jingkei Ly jingkei...@detica.com wrote:

 I think you can use job.getInt("mapred.task.partition", -1) to get the
 mapper ID, which should be the same for the mapper across task reruns.
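
 A rough sketch of a mapper base class that combines both suggestions (the
 job-wide seed property is hypothetical; mapred.task.partition is the value
 mentioned above):

 import java.util.Random;
 import org.apache.hadoop.mapred.JobConf;
 import org.apache.hadoop.mapred.MapReduceBase;

 public class DeterministicMapperBase extends MapReduceBase {
     protected Random random;

     @Override
     public void configure(JobConf job) {
         // mapred.task.partition is stable across reruns of the same map task,
         // so seeding from it (plus an optional job-wide seed) gives repeatable
         // per-task random streams.
         int partition = job.getInt("mapred.task.partition", 0);
         long jobSeed = job.getLong("lda.random.seed", 0L);  // hypothetical property
         random = new Random(jobSeed * 31 + partition);
     }
 }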

 -Original Message-
 From: Piotr Praczyk [mailto:piotr.prac...@gmail.com]
 Sent: 18 June 2009 15:19
 To: core-user@hadoop.apache.org
 Subject: Re: Getting Task ID inside a Mapper

 Hi
 Why don't you provide a seed of random generator generated outside the
 task
 ? Then when the task fails, you can provide the same value stored
 somewhere
 outside.
 You could use the task configuration to do so.

 I don't know anything about obtaining the task ID from within.


 regards
 Piotr

 2009/6/18 Mark Desnoyer mdesno...@gmail.com

  Hi,
 
  I was wondering if it's possible to get a hold of the task id inside a
  mapper? I cant' seem to find a way by trolling through the API
 reference.
  I'm trying to implement a Map Reduce version of Latent Dirichlet
 Allocation
  and I need to be able to initialize a random number generator in a
 task
  specific way so that if the task fails and is rerun elsewhere, the
 results
  are the same. Thanks in advance.
 
  Cheers,
  Mark Desnoyer
 



 This message should be regarded as confidential. If you have received this
 email in error please notify the sender and destroy it immediately.
 Statements of intent shall only become binding when confirmed in hard copy
 by an authorised signatory.  The contents of this email may relate to
 dealings with other companies within the Detica Group plc group of
 companies.

 Detica Limited is registered in England under No: 1337451.

 Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
 England.





Fwd: Need help

2009-06-18 Thread ashish pareek
Hello,
            I am doing my master's, and my final year project is on Hadoop, so
I would like to know something about Hadoop clusters, i.e., are new versions
of Hadoop able to handle heterogeneous hardware? If you have any information
regarding this, please mail me, as my project is in a heterogeneous
environment.


Thanks!

Regards,
Ashish Pareek


Re: JobControl for Pipes?

2009-06-18 Thread jason hadoop
http://www.cascading.org/
https://issues.apache.org/jira/browse/HADOOP-5303 (oozie)

On Thu, Jun 18, 2009 at 6:19 AM, Roshan James 
roshan.james.subscript...@gmail.com wrote:

 Can you give me a url or so to both? I cant seem to find either one after a
 couple of basic web searches.

 Also, when you say JobControl is coming to Hadoop - I can already see the
 Java JobControl classes that lets one express dependancies between jobs. So
 I assume this already works in Java - does it not? I was asking if this
 functionality is exposed via Pipes in some way.

 Roshan


 On Wed, Jun 17, 2009 at 10:59 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Job control is coming with the Hadoop WorkFlow manager, in the mean time
  there is cascade by chris wensel. I do not have any personal experience
  with
  either. I do not know how pipes interacts with either.
 
  On Wed, Jun 17, 2009 at 12:43 PM, Roshan James 
  roshan.james.subscript...@gmail.com wrote:
 
   Hello, Is there any way to express dependencies between map-reduce jobs
   (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs?  The
   provided header Pipes.hh does not seem to reflect any such
 capabilities.
  
   best,
   Roshan
  
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Getting Task ID inside a Mapper

2009-06-18 Thread jason hadoop
The task id is readily available, if you override the configure method.
The MapReduceBase class in the Pro Hadoop Book examples does this and makes
the taskId available as a class field.
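
A minimal sketch of that pattern (not the book's actual code, just the idea):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class TaskAwareBase extends MapReduceBase {
    protected String taskId;

    @Override
    public void configure(JobConf job) {
        // The framework puts the task attempt id into the task's configuration
        // under "mapred.task.id" in the old (0.18/0.20) API.
        taskId = job.get("mapred.task.id");
    }
}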

On Thu, Jun 18, 2009 at 7:33 AM, Mark Desnoyer mdesno...@gmail.com wrote:

 Thanks! I'll try that.

 -Mark

 On Thu, Jun 18, 2009 at 10:27 AM, Jingkei Ly jingkei...@detica.com
 wrote:

  I think you can use job.getInt(mapred.task.partition,-1) to get the
  mapper ID, which should be the same for the mapper across task reruns.
 
  -Original Message-
  From: Piotr Praczyk [mailto:piotr.prac...@gmail.com]
  Sent: 18 June 2009 15:19
  To: core-user@hadoop.apache.org
  Subject: Re: Getting Task ID inside a Mapper
 
  Hi
  Why don't you provide a seed of random generator generated outside the
  task
  ? Then when the task fails, you can provide the same value stored
  somewhere
  outside.
  You could use the task configuration to do so.
 
  I don't know anything about obtaining the task ID from within.
 
 
  regards
  Piotr
 
  2009/6/18 Mark Desnoyer mdesno...@gmail.com
 
   Hi,
  
   I was wondering if it's possible to get a hold of the task id inside a
   mapper? I cant' seem to find a way by trolling through the API
  reference.
   I'm trying to implement a Map Reduce version of Latent Dirichlet
  Allocation
   and I need to be able to initialize a random number generator in a
  task
   specific way so that if the task fails and is rerun elsewhere, the
  results
   are the same. Thanks in advance.
  
   Cheers,
   Mark Desnoyer
  
 
 
 
  This message should be regarded as confidential. If you have received
 this
  email in error please notify the sender and destroy it immediately.
  Statements of intent shall only become binding when confirmed in hard
 copy
  by an authorised signatory.  The contents of this email may relate to
  dealings with other companies within the Detica Group plc group of
  companies.
 
  Detica Limited is registered in England under No: 1337451.
 
  Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
  England.
 
 
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Need help

2009-06-18 Thread ashish pareek
Does that mean Hadoop is not scalable w.r.t. heterogeneous environments? And
one more question: can we run different applications on the same Hadoop
cluster?

Thanks.
Regards,
Ashish

On Thu, Jun 18, 2009 at 8:30 PM, jason hadoop jason.had...@gmail.comwrote:

 Hadoop has always been reasonably agnostic wrt hardware and homogeneity.
 There are optimizations in configuration for near homogeneous machines.



 On Thu, Jun 18, 2009 at 7:46 AM, ashish pareek pareek...@gmail.com
 wrote:

  Hello,
 I am doing my master my final year project is on Hadoop ...so
 I
  would like to know some thing about Hadoop cluster i.e, Do new version of
  Hadoop are able to handle heterogeneous hardware.If you have any
  informantion regarding these please mail me as my project is in
  heterogenous
  environment.
 
 
  Thanks!
 
  Reagrds,
  Ashish Pareek
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello,

I wasn't able to find this anywhere, so I'm sorry if this has been asked before.

I am wondering whether there is a practical limit on the number of bytes that
an emitted Map/Reduce value can be. Other than the obvious drawbacks of
emitting huge values, such as performance issues, I would like to know whether
there are any hard constraints; I can imagine that a value can never be larger
than dfs.block.size.

Does anyone have any idea, or can provide me with some pointers where to look ?

Thanks in advance!

Regards,

Leon Mergen


Re: Need help

2009-06-18 Thread ashish pareek
Can you tell me a few of the challenges in configuring a heterogeneous
cluster, or pass on a link where I can get some information regarding the
challenges of running Hadoop on heterogeneous hardware?

One more thing: how about running different applications on the same
Hadoop cluster, and what challenges are involved in that?

Thanks,
Regards,
Ashish


On Thu, Jun 18, 2009 at 8:53 PM, jason hadoop jason.had...@gmail.comwrote:

 I don't know anyone who has a completely homogeneous cluster.

 So hadoop is scalable across heterogeneous environments.

 I stated that configuration is simpler if the machines are similar (There
 are optimizations in configuration for near homogeneous machines.)

 On Thu, Jun 18, 2009 at 8:10 AM, ashish pareek pareek...@gmail.com
 wrote:

  Does that mean hadoop is not scalable wrt heterogeneous environment? and
  one
  more question is can we run different application on the same hadoop
  cluster
  .
 
  Thanks.
  Regards,
  Ashish
 
  On Thu, Jun 18, 2009 at 8:30 PM, jason hadoop jason.had...@gmail.com
  wrote:
 
   Hadoop has always been reasonably agnostic wrt hardware and
 homogeneity.
   There are optimizations in configuration for near homogeneous machines.
  
  
  
   On Thu, Jun 18, 2009 at 7:46 AM, ashish pareek pareek...@gmail.com
   wrote:
  
Hello,
   I am doing my master my final year project is on Hadoop
  ...so
   I
would like to know some thing about Hadoop cluster i.e, Do new
 version
  of
Hadoop are able to handle heterogeneous hardware.If you have any
informantion regarding these please mail me as my project is in
heterogenous
environment.
   
   
Thanks!
   
Reagrds,
Ashish Pareek
   
  
  
  
   --
   Pro Hadoop, a book to guide you from beginner to hadoop mastery,
   http://www.amazon.com/dp/1430219424?tag=jewlerymall
   www.prohadoopbook.com a community for Hadoop Professionals
  
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Practical limit on emitted map/reduce values

2009-06-18 Thread Owen O'Malley
Keys and values can be large. They are certainly capped above by
Java's 2GB limit on byte arrays. More practically, you will have
problems running out of memory with keys or values of 100 MB. There is
no restriction that a key/value pair fits in a single hdfs block, but
performance would suffer. (In particular, the FileInputFormats split
at block sized chunks, which means you will have maps that scan an
entire block without processing anything.)

-- Owen


RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Owen,

 Keys and values can be large. They are certainly capped above by
 Java's 2GB limit on byte arrays. More practically, you will have
 problems running out of memory with keys or values of 100 MB. There is
 no restriction that a key/value pair fits in a single hdfs block, but
 performance would suffer. (In particular, the FileInputFormats split
 at block sized chunks, which means you will have maps that scan an
 entire block without processing anything.)

Thanks for the quick reply.

Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit that 
is caused by the Java VM heap size ? If so, could that, for example, be 
increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ?

Regards,

Leon Mergen



Re: Practical limit on emitted map/reduce values

2009-06-18 Thread jason hadoop
In general, if the values become very large, it becomes simpler to store them
out of line in HDFS, and just pass the HDFS path for the item as the value in
the map reduce task.
This greatly reduces the amount of IO done, and doesn't blow up the sort
space on the reducer.
You lose the magic of data locality, but given the item size, you gain
the IO back by not having to pass the full values to the reducer, or handle
them when sorting the map outputs.
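
A rough sketch of what that looks like in an old-API mapper (the side-file
directory and key/value types are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OutOfLineValueMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    private FileSystem fs;
    private String taskId;

    @Override
    public void configure(JobConf job) {
        try {
            fs = FileSystem.get(job);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        taskId = job.get("mapred.task.id");
    }

    public void map(Text key, Text value, OutputCollector<Text, Text> output,
                    Reporter reporter) throws IOException {
        // Write the (potentially huge) value to its own HDFS file and emit
        // only the path; the reducer opens the file when it needs the bytes.
        Path side = new Path("/tmp/large-values/" + taskId + "-" + key.toString());
        FSDataOutputStream out = fs.create(side);
        out.write(value.getBytes(), 0, value.getLength());
        out.close();
        output.collect(key, new Text(side.toString()));
    }
}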

On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen l.p.mer...@solatis.com wrote:

 Hello Owen,

  Keys and values can be large. They are certainly capped above by
  Java's 2GB limit on byte arrays. More practically, you will have
  problems running out of memory with keys or values of 100 MB. There is
  no restriction that a key/value pair fits in a single hdfs block, but
  performance would suffer. (In particular, the FileInputFormats split
  at block sized chunks, which means you will have maps that scan an
  entire block without processing anything.)

 Thanks for the quick reply.

 Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit
 that is caused by the Java VM heap size ? If so, could that, for example, be
 increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ?

 Regards,

 Leon Mergen




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Jason,

 In general if the values become very large, it becomes simpler to store
 them
 outline in hdfs, and just pass the hdfs path for the item as the value
 in
 the map reduce task.
 This greatly reduces the amount of IO done, and doesn't blow up the
 sort
 space on the reducer.
 You loose the magic of data locality, but given the item size, and you
 gain
 the IO back by not having to pass the full values to the reducer, or
 handle
 them when sorting the map outputs.

Ah, that actually sounds like a nice idea; instead of having the reducer emit
the huge value, it can create a temporary file and emit the filename instead.

I wasn't really planning on having huge values anyway (values above 1MB will be 
the exception rather than the rule), but since it's theoretically possible for 
our software to generate them, it seemed like a good idea to investigate any 
real constraints that we might run into.

Your idea sounds like a good workaround for this. Thanks!


Regards,

Leon Mergen












Re: Need help

2009-06-18 Thread ashish pareek
Hello Everybody,

                              How can we handle different applications with
different requirements being run on the same Hadoop cluster? What are the
various approaches to solving such a problem? If possible, please mention
some of those ideas.

Does such an implementation exist?

Thanks ,

Regards,
Ashish

On Thu, Jun 18, 2009 at 9:36 PM, jason hadoop jason.had...@gmail.comwrote:

 For me, I like to have one configuration file that I distribute to all of
 the machines in my cluster via rsync.

 In there are things like the number of tasks per node to run, and where to
 store dfs data and local temporary data, and the limits to storage for the
 machines.

 If the machines are very different, it becomes important to tailor the
 configuration file per machine or type of machine.

 At this point, you are pretty much going to have to spend the time, reading
 through the details of configuring a hadoop cluster.


 On Thu, Jun 18, 2009 at 8:33 AM, ashish pareek pareek...@gmail.com
 wrote:

  Can you tell few of the challenges in configuring heterogeneous
  cluster...or
  can pass on some link where I would get some information regarding
  challenges in running Hadoop on heterogeneous hardware
 
  One more things is How about running different applications on the same
  Hadoop cluster?and what challenges are involved in it ?
 
  Thanks,
  Regards,
  Ashish
 
 
  On Thu, Jun 18, 2009 at 8:53 PM, jason hadoop jason.had...@gmail.com
  wrote:
 
   I don't know anyone who has a completely homogeneous cluster.
  
   So hadoop is scalable across heterogeneous environments.
  
   I stated that configuration is simpler if the machines are similar
 (There
   are optimizations in configuration for near homogeneous machines.)
  
   On Thu, Jun 18, 2009 at 8:10 AM, ashish pareek pareek...@gmail.com
   wrote:
  
Does that mean hadoop is not scalable wrt heterogeneous environment?
  and
one
more question is can we run different application on the same hadoop
cluster
.
   
Thanks.
Regards,
Ashish
   
On Thu, Jun 18, 2009 at 8:30 PM, jason hadoop 
 jason.had...@gmail.com
wrote:
   
 Hadoop has always been reasonably agnostic wrt hardware and
   homogeneity.
 There are optimizations in configuration for near homogeneous
  machines.



 On Thu, Jun 18, 2009 at 7:46 AM, ashish pareek 
 pareek...@gmail.com
 wrote:

  Hello,
 I am doing my master my final year project is on
 Hadoop
...so
 I
  would like to know some thing about Hadoop cluster i.e, Do new
   version
of
  Hadoop are able to handle heterogeneous hardware.If you have any
  informantion regarding these please mail me as my project is in
  heterogenous
  environment.
 
 
  Thanks!
 
  Reagrds,
  Ashish Pareek
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals

   
  
  
  
   --
   Pro Hadoop, a book to guide you from beginner to hadoop mastery,
   http://www.amazon.com/dp/1430219424?tag=jewlerymall
   www.prohadoopbook.com a community for Hadoop Professionals
  
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Restrict output of mappers to reducers running on same node?

2009-06-18 Thread Tarandeep Singh
Jason, correct me if I am wrong-

Opening a SequenceFile in configure (or the setup method in 0.20) and writing
to it is the same as doing output.collect(), unless you mean I should make the
sequence file writer a static variable and set the reuse-JVM flag to -1. In
that case the subsequent mappers might run in the same JVM and can use the
same writer, and hence produce one file. But in that case I need to add a hook
to close the writer - maybe use a shutdown hook.

Jothi, the idea of a combine input format is good. But I guess I have to write
something of my own to make it work in my case.

Thanks guys for the suggestions... but I feel we should have some support
from the framework to merge the output of a mapper-only job so that we don't
get a large number of smaller files. Sometimes you just don't want to run
reducers and unnecessarily transfer a whole lot of data across the network.
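
For reference, the simpler one-file-per-map-task variant Jason describes (no
JVM reuse, one writer opened in configure and closed in close) would look
roughly like this; the output directory and key/value types are made up:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlToSequenceFileMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

    private SequenceFile.Writer writer;

    @Override
    public void configure(JobConf job) {
        try {
            FileSystem fs = FileSystem.get(job);
            // One output file per map task, named after the task attempt.
            Path out = new Path("/converted/" + job.get("mapred.task.id") + ".seq");
            writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public void map(Text key, Text value, OutputCollector<Text, Text> ignored,
                    Reporter reporter) throws IOException {
        writer.append(key, value);   // bypass the normal output path entirely
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }
}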

Thanks,
Tarandeep

On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop jason.had...@gmail.comwrote:

 You can open your sequence file in the mapper configure method, write to it
 in your map, and close it in the mapper close method.
 Then you end up with 1 sequence file per map. I am making an assumption
 that
 each key,value to your map some how represents a single xml file/item.

 On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan joth...@yahoo-inc.com
 wrote:

  You could look at CombineFileInputFormat to generate a single split out
 of
  several files.
 
  Your partitioner would be able to assign keys to specific reducers, but
 you
  would not have control on which node a given reduce task will run.
 
  Jothi
 
 
  On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote:
 
   Hi,
  
   Can I restrict the output of mappers running on a node to go to
  reducer(s)
   running on the same node?
  
   Let me explain why I want to do this-
  
   I am converting huge number of XML files into SequenceFiles. So
   theoretically I don't even need reducers, mappers would read xml files
  and
   output Sequencefiles. But the problem with this approach is I will end
 up
   getting huge number of small output files.
  
   To avoid generating large number of smaller files, I can Identity
  reducers.
   But by running reducers, I am unnecessarily transfering data over
  network. I
   ran some test case using a small subset of my data (~90GB). With map
 only
   jobs, my cluster finished conversion in only 6 minutes. But with map
 and
   Identity reducers job, it takes around 38 minutes.
  
   I have to process close to a terabyte of data. So I was thinking of a
  faster
   alternatives-
  
   * Writing a custom OutputFormat
   * Somehow restrict output of mappers running on a node to go to
 reducers
   running on the same node. May be I can write my own partitioner
 (simple)
  but
   not sure how Hadoop's framework assigns partitions to reduce tasks.
  
   Any pointers ?
  
   Or this is not possible at all ?
  
   Thanks,
   Tarandeep
 
 


 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Practical limit on emitted map/reduce values

2009-06-18 Thread Owen O'Malley

On Jun 18, 2009, at 8:45 AM, Leon Mergen wrote:

Could you perhaps elaborate on that 100 MB limit ? Is that due to a  
limit that is caused by the Java VM heap size ? If so, could that,  
for example, be increased to 512MB by setting mapred.child.java.opts  
to '-Xmx512m' ?


A couple of points:
  1. The 100MB was just for ballpark calculations. Of course if you  
have a large heap, you can fit larger values. Don't forget that the  
framework is allocating big chunks of the heap for its own buffers,  
when figuring out how big to make your heaps.
  2. Having large keys is much harder than large values. When doing a  
N-way merge, the framework has N+1 keys and 1 value in memory at a time.


-- Owen
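
For reference, the child heap can be raised either in hadoop-site.xml or per
job from the driver; a minimal sketch of the latter (the 512 MB figure is
just the example discussed above):

import org.apache.hadoop.mapred.JobConf;

public class BigValueJobDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BigValueJobDriver.class);
        // Give each map/reduce child JVM a 512 MB heap. Keep io.sort.mb well
        // below this, since the framework's sort buffer comes out of the same heap.
        conf.set("mapred.child.java.opts", "-Xmx512m");
        conf.setInt("io.sort.mb", 100);
        // ... configure mapper/reducer, input/output, then JobClient.runJob(conf);
    }
}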


Re: Need help

2009-06-18 Thread Matt Massie
Hadoop can be run on a hardware heterogeneous cluster.  Currently,  
Hadoop clusters really only run well on Linux although you can run a  
Hadoop client on non-Linux machines.


You will need to have a special configuration for each of the machine  
in your cluster based on their hardware profile.  Ideally, you'll be  
able to group the machines in your cluster into classes of machines  
(e.g. machines with 1GB of RAM and 2 core versus 4GB of RAM and 4  
core) to reduce the burden of managing multiple configurations.  If  
you are talking about a Hadoop cluster that is completely  
heterogeneous (each machine is completely different), the management  
overhead could be high.


Configuration variables like mapred.tasktracker.map.tasks.maximum  
and mapred.tasktracker.reduce.tasks.maximum should be set based on  
the number of cores/memory in each machine.  Variables like  
mapred.child.java.opts need to be set differently based on the  
amount of memory the machine has (e.g. -Xmx250m).  You should have  
at least 250MB of memory dedicated to each task although more is  
better.  It's also wise to make sure that each task has the same  
amount of memory regardless of the machine it's scheduled on;  
otherwise, tasks might succeed or fail based on which machine gets the  
task.  This asymmetry will make debugging harder.


You can use our online configurator (http://www.cloudera.com/configurator/)
to generate optimized configurations for each class of machines in
your cluster.  It will ask simple questions about your configuration
and then produce a hadoop-site.xml file.


Good luck!
-Matt

On Jun 18, 2009, at 8:33 AM, ashish pareek wrote:

Can you tell me a few of the challenges in configuring a heterogeneous
cluster, or pass on a link where I can get some information regarding the
challenges of running Hadoop on heterogeneous hardware?

One more thing: how about running different applications on the same
Hadoop cluster, and what challenges are involved in that?

Thanks,
Regards,
Ashish

[...]

Upgrading from .19 to .20 problems

2009-06-18 Thread llpind

Hey All,

I'm able to start my master server, but none of the slave nodes come up
(unless I list the master as the slave).  After searching a bit, it seems
people have this problem when they forget to set fs.default.name, but I've
got it set in core-site.xml (listed below).  They all have the error below
on startup:

STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = slave1/192.168.0.234
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.0
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
/
2009-06-18 09:06:49,369 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.lang.NullPointerException
        at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:134)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:156)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:160)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

2009-06-18 09:06:49,370 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/
SHUTDOWN_MSG: Shutting down DataNode at slave1/192.168.0.234
/

==
core-site.xml
==

<property>
   <name>fs.default.name</name>
   <value>hdfs://master:54310</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a filesystem.</description>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-0.20.0-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
-- 
View this message in context: 
http://www.nabble.com/Upgrading-from-.19-to-.20-problems-tp24095348p24095348.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



RE: Practical limit on emitted map/reduce values

2009-06-18 Thread Leon Mergen
Hello Owen,

  Could you perhaps elaborate on that 100 MB limit ? Is that due to a
  limit that is caused by the Java VM heap size ? If so, could that,
  for example, be increased to 512MB by setting mapred.child.java.opts
  to '-Xmx512m' ?
 
 A couple of points:
1. The 100MB was just for ballpark calculations. Of course if you
 have a large heap, you can fit larger values. Don't forget that the
 framework is allocating big chunks of the heap for its own buffers,
 when figuring out how big to make your heaps.
2. Having large keys is much harder than large values. When doing a
 N-way merge, the framework has N+1 keys and 1 value in memory at a
 time.

Ok, that makes sense. Thanks for the information!


Regards,

Leon Mergen




Re: Hadoop Eclipse Plugin

2009-06-18 Thread Praveen Yarlagadda
Hi,

Thank you for your response. The netstat and telnet commands executed fine,
but I couldn't connect through Eclipse. I suspect something is wrong with the
Eclipse plugin.

Do you know any other development environments people use to develop
map/reduce applications?

Regards,
Praveen

On Thu, Jun 18, 2009 at 2:49 AM, Steve Loughran ste...@apache.org wrote:

 Praveen Yarlagadda wrote:

 Hi,

 I have a problem configuring Hadoop Map/Reduce plugin with Eclipse.

 Setup Details:

 I have a namenode, a jobtracker and two data nodes, all running on ubuntu.
 My set up works fine with example programs. I want to connect to this
 setup
 from eclipse.

 namenode - 10.20.104.62 - 54310(port)
 jobtracker - 10.20.104.53 - 54311(port)

 I run eclipse on a different windows m/c. I want to configure map/reduce
 plugin
 with eclipse, so that I can access HDFS from windows.

 Map/Reduce master
 Host - With jobtracker IP, it did not work
 Port - With jobtracker port, it did not work

 DFS master
 Host - With namenode IP, It did not work
 Port - With namenode port, it did not work

 I tried other combination too by giving namenode details for Map/Reduce
 master
 and jobtracker details for DFS master. It did not work either.


 1. check the ports really are open by doing a netstat -a -p on the namenode
 and job tracker ,
 netstat -a -p | grep 54310 on the NN
 netstat -a -p | grep 54311 on the JT

 2. Then, from the Windows machine, see if you can connect to them outside
 Eclipse

 telnet 10.20.104.62 54310
 telnet 10.20.104.53 54311

 If you can't connect, then firewalls are interfering

 If everything works, the problem is in the eclipse plugin (which I don't
 use, and cannot assist with)

 --
 Steve Loughran  http://www.1060.org/blogxter/publish/5
 Author: Ant in Action   http://antbook.org/




-- 
Regards,
Praveen


slaves registers on JobTracker as localhost!!!

2009-06-18 Thread b


The slaves register on the JobTracker as localhost,
and the JobTracker tries to fetch data from localhost when it wants to
fetch data from the slaves.
But the word localhost does not appear in /etc/hosts or in any other
Linux network or Hadoop config file.

For example, my slave's /etc/hosts:
---
192.168.2.22    gentoo1
---
and the JobTracker tries to connect to it as localhost.
(The word does appear many times in Hadoop's Java sources, though.)
Thank you.


Re: Read/write dependency wrt total data size on hdfs

2009-06-18 Thread Alex Loddengaard
I'm a little confused about what your question is.  Are you asking why HDFS
has consistent read/write speeds even as your cluster gets more and more data?

If so, two HDFS bottlenecks that would change read/write performance as used
capacity changes are name node (NN) RAM and the amount of data each of your
data nodes (DNs) are storing.  If you have so much meta data (lots of files,
blocks, etc.) that the NN java process uses most of your NN's memory, then
you'll see a big decrease in performance.  This bottleneck usually only
shows itself on large clusters with tons of metadata, though a small cluster
with a wimpy NN machine will have the same bottleneck.  Similarly, if each
of your DNs are storing close to their capacity, then reads/writes will
begin to slow down, as each node will be responsible for streaming more and
more data in and out.  Does that make sense?

You should fill your cluster up to 80-90%.  I imagine you'd probably see a
decrease in read/write performance depending on the tests you're running,
though I can't say I've done this performance test before.  I'm merely
speculating.

Hope this clears things up.

Alex

On Thu, Jun 18, 2009 at 9:30 AM, Wasim Bari wasimb...@msn.com wrote:

 Hi,
 I am storing data on an HDFS cluster (4 machines).  I have seen that
 read/write is not affected very much by the size of the data on HDFS (the
 total data size of HDFS). I have used

 20-30% of the cluster and didn't completely fill it.  Can someone explain to
 me why this is so, and whether HDFS promises such a feature, or am I missing
 something?

 Thanks,

 wasim


multiple file input

2009-06-18 Thread pmg

I am evaluating Hadoop for a problem that does a Cartesian product of the
input from one file of 600K lines (FileA) with another set of files (FileB1,
FileB2, FileB3) with 2 million lines in total.

Each line from FileA gets compared with every line from FileB1, FileB2, etc.
FileB1, FileB2, etc. are in a different input directory.

So

Two input directories

1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2 million records -
FileB1, FileB2, etc.

How can I have a map that reads a line from FileA in directory input1 and
compares the line with each line from input2?

What is the best way forward? I have seen plenty of examples that map each
record from a single input file and reduce into an output.

thanks
-- 
View this message in context: 
http://www.nabble.com/multiple-file-input-tp24095358p24095358.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Read/write dependency wrt total data size on hdfs

2009-06-18 Thread Todd Lipcon
On Thu, Jun 18, 2009 at 10:55 AM, Alex Loddengaard a...@cloudera.comwrote:

 I'm a little confused what you're question is.  Are you asking why HDFS has
 consistent read/write speeds even as your cluster gets more and more data?

 If so, two HDFS bottlenecks that would change read/write performance as
 used
 capacity changes are name node (NN) RAM and the amount of data each of your
 data nodes (DNs) are storing.  If you have so much meta data (lots of
 files,
 blocks, etc.) that the NN java process uses most of your NN's memory, then
 you'll see a big decrease in performance.


To avoid this issue, simply watch swap usage on your NN. If your NN starts
swapping you will likely run into problems with your metadata operation
speed. This won't affect throughput of read/writes within a block, though.


  This bottleneck usually only
 shows itself on large clusters with tons of metadata, though a small
 cluster
 with a wimpy NN machine will have the same bottleneck.  Similarly, if each
 of your DNs are storing close to their capacity, then reads/writes will
 begin to slow down, as each node will be responsible for streaming more and
 more data in and out.  Does that make sense?

 You should fill your cluster up to 80-90%.  I imagine you'd probably see a
 decrease in read/write performance depending on the tests you're running,
 though I can't say I've done this performance test before.  I'm merely
 speculating.


Another thing to keep in mind is that local filesystem performance begins to
suffer once a disk is more than 80% or so full. This is due to the ways that
filesystems endeavour to keep file fragmentation low. When there is little
extra space on the drive, the file system has fewer options for relocating
blocks and fighting fragmentation, so sequential writes and reads will
actually incur seeks on the local disk. Since the datanodes store their
blocks on the local file system, this is a factor worth considering.

-Todd


Need help. 0.18.3. pipes. thanks.

2009-06-18 Thread pavel kolodin


2 nodes: ibmT43, gentoo1.

ibmT43 = NameNode + JobTracker   +   TaskTracker + DataNode
gentoo1 = TaskTracker + DataNode

===conf/hadoop-site.xml=== - identical on the 2 hosts:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://ibmT43:9000/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>ibmT43:9001</value>
</property>
<property>
<name>dfs.replication</name>
<value>2</value>
</property>
<property>
<name>mapred.task.timeout</name>
<value>2</value>
</property>

<property>
<name>hadoop.pipes.executable</name>
<value>/bin/logparser</value>
</property>

<property>
  <name>mapred.map.tasks</name>
  <value>10</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>1</value>
</property>
</configuration>

---

/etc/hosts on both hosts is also identical:

192.168.2.1ibmT43
192.168.2.22   gentoo1
# 127.0.0.1 localhost
# it will have no effect if i will uncomment localhost
--

The binary hdfs://bin/logparser is correct - it has been working in the past.

--

bin/hadoop pipes -input /input -output /output1 -conf 123test.xml

===123test.xml===:
<?xml version="1.0"?>
<configuration>

<!--
<property>
  <name>mapred.reduce.tasks</name>
  <value>2</value>
</property>
-->

<property>
  <name>hadoop.pipes.java.recordreader</name>
  <value>true</value>
</property>

<property>
  <name>hadoop.pipes.java.recordwriter</name>
  <value>true</value>
</property>

</configuration>

-

Running job:
had...@ibmt43 ~/hadoop-0.18.3 $ bin/hadoop pipes -input /input -output  
/output1 -conf 123test.xml
09/06/18 22:38:50 WARN mapred.JobClient: Use GenericOptionsParser for  
parsing the arguments. Applications should implement Tool for the same.
09/06/18 22:38:50 WARN mapred.JobClient: No job jar file set.  User  
classes may not be found. See JobConf(Class) or JobConf#setJar(String).
09/06/18 22:38:50 INFO mapred.FileInputFormat: Total input paths to  
process : 4
09/06/18 22:38:50 INFO mapred.FileInputFormat: Total input paths to  
process : 4

09/06/18 22:38:51 INFO mapred.JobClient: Running job: job_200906182236_0001
09/06/18 22:38:52 INFO mapred.JobClient:  map 0% reduce 0%
09/06/18 22:39:03 INFO mapred.JobClient:  map 5% reduce 0%
09/06/18 22:39:12 INFO mapred.JobClient:  map 11% reduce 0%
09/06/18 22:39:23 INFO mapred.JobClient:  map 5% reduce 0%
09/06/18 22:39:23 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_02_0, Status : FAILED
Task attempt_200906182236_0001_m_02_0 failed to report status for 22  
seconds. Killing!
09/06/18 22:39:24 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_03_0, Status : FAILED
Task attempt_200906182236_0001_m_03_0 failed to report status for 22  
seconds. Killing!

09/06/18 22:39:33 INFO mapred.JobClient:  map 0% reduce 0%
09/06/18 22:39:33 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_00_0, Status : FAILED
Task attempt_200906182236_0001_m_00_0 failed to report status for 23  
seconds. Killing!
attempt_200906182236_0001_m_00_0: Hadoop Pipes Exception: write error  
to file: Connection reset by peer at SerialUtils.cc:129 in virtual void  
HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:39:33 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_01_0, Status : FAILED
Task attempt_200906182236_0001_m_01_0 failed to report status for 23  
seconds. Killing!

09/06/18 22:39:34 INFO mapred.JobClient:  map 2% reduce 0%
09/06/18 22:39:38 INFO mapred.JobClient:  map 5% reduce 0%
09/06/18 22:39:43 INFO mapred.JobClient:  map 8% reduce 0%
09/06/18 22:39:48 INFO mapred.JobClient:  map 12% reduce 0%
09/06/18 22:39:54 INFO mapred.JobClient:  map 9% reduce 0%
09/06/18 22:39:54 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_04_0, Status : FAILED
Task attempt_200906182236_0001_m_04_0 failed to report status for 22  
seconds. Killing!
attempt_200906182236_0001_m_04_0: Hadoop Pipes Exception: write error  
to file: Connection reset by peer at SerialUtils.cc:129 in virtual void  
HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:39:59 INFO mapred.JobClient:  map 6% reduce 0%
09/06/18 22:39:59 INFO mapred.JobClient: Task Id :  
attempt_200906182236_0001_m_05_0, Status : FAILED
Task attempt_200906182236_0001_m_05_0 failed to report status for 22  
seconds. Killing!
attempt_200906182236_0001_m_05_0: Hadoop Pipes Exception: write error  
to file: Connection reset by peer at SerialUtils.cc:129 in virtual void  
HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:40:03 INFO 

HDFS is not loading evenly across all nodes.

2009-06-18 Thread openresearch

Hi all

I dfs -put a large dataset onto a 10-node cluster.

When I observe the Hadoop progress (via web:50070) and each local file
system (via df -k), I notice that my master node is hit 5-10 times harder
than the others, so its hard drive fills up more quickly. During last
night's load it actually crashed when the hard drive was full.

To my understanding, data should be spread across all nodes evenly (in a
round-robin fashion using 64MB blocks as the unit).

Is this expected behavior for Hadoop? Can anyone suggest a good way to
troubleshoot it?

Thanks


-- 
View this message in context: 
http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Aaron Kimball
Did you run the dfs put commands from the master node?  If you're inserting
into HDFS from a machine running a DataNode, the local datanode will always
be chosen as one of the three replica targets. For more balanced loading,
you should use an off-cluster machine as the point of origin.

If you experience uneven block distribution, you should also periodically
rebalance your cluster by running bin/start-balancer.sh every so often. It
will work in the background to move blocks from heavily-laden nodes to
underutilized ones.
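
For example, a minimal sketch of kicking off the balancer with an explicit
threshold (the 10 percent figure is arbitrary, and as far as I know the
argument is simply passed straight through to the balancer):

  bin/start-balancer.sh -threshold 10
  # or run it in the foreground and watch its progress:
  bin/hadoop balancer -threshold 10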

- Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
qiming...@openresearchinc.com wrote:


 Hi all

 I dfs put a large dataset onto a 10-node cluster.

 When I observe the Hadoop progress (via web:50070) and each local file
 system (via df -k),
 I notice that my master node is hit 5-10 times harder than others, so hard
 drive is get full quicker than others. Last night load, it actually crash
 when hard drive was full.

 To my understand,  data should wrap around all nodes evenly (in a
 round-robin fashion using 64M as a unit).

 Is it expected behavior of Hadoop? Can anyone suggest a good
 troubleshooting
 way?

 Thanks


 --
 View this message in context:
 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Aaron Kimball
As an addendum, running a DataNode on the same machine as a NameNode is
generally considered a bad idea because it hurts the NameNode's ability to
maintain high throughput.

- Aaron

On Thu, Jun 18, 2009 at 1:26 PM, Aaron Kimball aa...@cloudera.com wrote:

 Did you run the dfs put commands from the master node?  If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets. For more balanced loading,
 you should use an off-cluster machine as the point of origin.

 If you experience uneven block distribution, you should also periodically
 rebalance your cluster by running bin/start-balancer.sh every so often. It
 will work in the background to move blocks from heavily-laden nodes to
 underutilized ones.

 - Aaron


 On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
 qiming...@openresearchinc.com wrote:


 Hi all

 I dfs put a large dataset onto a 10-node cluster.

 When I observe the Hadoop progress (via web:50070) and each local file
 system (via df -k),
 I notice that my master node is hit 5-10 times harder than others, so hard
 drive is get full quicker than others. Last night load, it actually crash
 when hard drive was full.

 To my understand,  data should wrap around all nodes evenly (in a
 round-robin fashion using 64M as a unit).

 Is it expected behavior of Hadoop? Can anyone suggest a good
 troubleshooting
 way?

 Thanks


 --
 View this message in context:
 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: Trying to setup Cluster

2009-06-18 Thread Aaron Kimball
Are you encountering specific problems?

I don't think that hadoop's config files will evaluate environment
variables. So $HADOOP_HOME won't be interpreted correctly.
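
For example, a minimal sketch of one of the affected properties with the
variable replaced by an absolute path (the /usr/local/hadoop prefix is only a
placeholder for wherever Hadoop actually lives on your machines); the other
$HADOOP_HOME-based values would need the same treatment:

<property>
  <name>dfs.data.dir</name>
  <value>/usr/local/hadoop/dfs-data</value>
  <final>true</final>
</property>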

For passwordless ssh, see
http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html or
just check the manpage for ssh-keygen.
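
For example, a minimal sketch of the usual key setup (assuming the same user
account exists on every node and the default key location is used):

  ssh-keygen -t rsa -P ""                           # accept the default ~/.ssh/id_rsa
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # allows passwordless login to localhost
  # then append the same public key to ~/.ssh/authorized_keys on each slave
  # (ssh-copy-id does this if it is installed) and verify with: ssh <slave-hostname>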

- Aaron

On Wed, Jun 17, 2009 at 9:30 AM, Divij Durve divij.t...@gmail.com wrote:

 I'm trying to set up a cluster with 3 different machines running Fedora. I
 can't get them to log into localhost without a password, but that's the
 least of my worries at the moment.

 I am posting my config files and the master and slave files; let me know if
 anyone can spot a problem with the configs...


 Hadoop-site.xml
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>
   <property>
     <name>dfs.data.dir</name>
     <value>$HADOOP_HOME/dfs-data</value>
     <final>true</final>
   </property>

   <property>
     <name>dfs.name.dir</name>
     <value>$HADOOP_HOME/dfs-name</value>
     <final>true</final>
   </property>

   <property>
     <name>hadoop.tmp.dir</name>
     <value>$HADOOP_HOME/hadoop-tmp</value>
     <description>A base for other temporary directories.</description>
   </property>

   <property>
     <name>fs.default.name</name>
     <value>hdfs://gobi.something.something:54310</value>
     <description>The name of the default file system.  A URI whose
     scheme and authority determine the FileSystem implementation.  The
     uri's scheme determines the config property (fs.SCHEME.impl) naming
     the FileSystem implementation class.  The uri's authority is used to
     determine the host, port, etc. for a FileSystem.</description>
   </property>

   <property>
     <name>mapred.job.tracker</name>
     <value>kalahari.something.something:54311</value>
     <description>The host and port that the MapReduce job tracker runs
     at.  If "local", then jobs are run in-process as a single map
     and reduce task.
     </description>
   </property>

   <property>
     <name>mapred.system.dir</name>
     <value>$HADOOP_HOME/mapred-system</value>
     <final>true</final>
   </property>

   <property>
     <name>dfs.replication</name>
     <value>1</value>
     <description>Default block replication.
     The actual number of replications can be specified when the file is
     created.  The default is used if replication is not specified in
     create time.
     </description>
   </property>

   <property>
     <name>mapred.local.dir</name>
     <value>$HADOOP_HOME/mapred-local</value>
     <name>dfs.replication</name>
     <value>1</value>
   </property>

 </configuration>


 Slave:
 kongur.something.something

 master:
 kalahari.something.something

 I execute the dfs-start.sh command from gobi.something.something.

 Is there any other info that I should provide to help? Also, Kongur is
 where I'm running the data node - the masters file on Kongur should have
 localhost in it, right? Thanks for the help.

 Divij



Re: multiple file input

2009-06-18 Thread Owen O'Malley

On Jun 18, 2009, at 10:56 AM, pmg wrote:

Each line from FileA gets compared with every line from FileB1,  
FileB2 etc.

etc. FileB1, FileB2 etc. are in a different input directory


In the general case, I'd define an InputFormat that takes two  
directories, computes the input splits for each directory and  
generates a new list of InputSplits that is the cross-product of the  
two lists. So instead of FileSplit, it would use a FileSplitPair that  
gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
reader would return a TextPair with left and right records (ie.  
lines). Clearly, you read the first line of split1 and cross it by  
each line from split2, then move to the second line of split1 and  
process each line from split2, etc.


You'll need to ensure that you don't overwhelm the system with too many
input splits (i.e. maps). Also don't forget that N^2/M grows much faster
with the size of the input (N) than the M machines can handle in a fixed
amount of time.



Two input directories

1. input1 directory with a single file of 600K records - FileA
2. input2 directory segmented into different files with 2Million  
records -

FileB1, FileB2 etc.


In this particular case, it would be right to load all of FileA into  
memory and process the chunks of FileB/part-*. Then it would be much  
faster than needing to re-read the file over and over again, but  
otherwise it would be the same.
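
A minimal sketch of that second approach against the old mapred API is below;
the path /input1/FileA and the equality test in map() are placeholders, and
error handling is kept to the bare minimum:

import java.io.*;
import java.util.*;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class CrossCompareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> fileALines = new ArrayList<String>();

  // Load the small side (FileA, ~600K records) once per task.
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path("/input1/FileA"))));
      String line;
      while ((line = in.readLine()) != null) {
        fileALines.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to load FileA", e);
    }
  }

  // Each call gets one line of FileB*; compare it against every FileA line.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    String b = value.toString();
    for (String a : fileALines) {
      if (a.equals(b)) {                  // placeholder comparison
        output.collect(new Text(a), value);
      }
    }
    reporter.progress();                  // keep the task from timing out
  }
}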


-- Owen


Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-18 Thread akhil1988

Hi Jason!

I finally found out that there was a problem with reserving the HEAPSIZE,
which I have now resolved. We cannot change HADOOP_HEAPSIZE using export
from a user account after Hadoop has been started; it has to be changed by
root.

I have a user account on the cluster and I was trying to change
HADOOP_HEAPSIZE from my user account using 'export', which had no effect.
So I had to ask my cluster administrator to increase HADOOP_HEAPSIZE in
hadoop-env.sh and then restart Hadoop. Now the program is running
absolutely fine. Thanks for your help.

One thing I would like to ask: can we use DistributedCache for
transferring directories to the local cache of the tasks?

Thanks,
Akhil
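
A minimal sketch of one way a directory can be shipped via the
DistributedCache - by packing it into an archive first - is below. The
Data.zip name and the HDFS path are placeholders, and the archive has to be
copied into HDFS beforehand (e.g. bin/hadoop fs -put Data.zip /user/akhil1988/):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheDirSketch {
  public static void addDataDir(JobConf conf) throws Exception {
    // Archives added this way are unpacked into the task's local cache.
    // The #Data fragment plus createSymlink makes the unpacked directory
    // show up as ./Data in the task working directory.
    DistributedCache.addCacheArchive(new URI("/user/akhil1988/Data.zip#Data"), conf);
    DistributedCache.createSymlink(conf);
  }
}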



akhil1988 wrote:
 
 Hi Jason!
 
 Thanks for going with me to solve my problem.
 
 To restate things and make it more easier to understand: I am working in
 local mode in the directory which contains the job jar and also the Config
 and Data directories.
 
 I just removed the following three statements from my code:
 DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Data/), conf);
 DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Config/), conf);
 DistributedCache.createSymlink(conf);
 
 The program executes till the same point as before now also and
 terminates. That means the above three statements are of no use while
 working in local mode. In local mode, the working directory for the
 mapreduce tasks becomes the current woking direcotry in which you started
 the hadoop command to execute the job.
 
 Since I have removed the DistributedCache.add. statements there should
 be no issue whether I am giving a file name or a directory name as
 argument to it. Now it seems to me that there is some problem in reading
 the binary file using binaryRead.
 
 Please let me know if I am going wrong anywhere.
 
 Thanks,
 Akhil
  
 
 
 
 
 jason hadoop wrote:
 
 I have only ever used the distributed cache to add files, including
 binary
 files such as shared libraries.
 It looks like you are adding a directory.
 
 The DistributedCache is not generally used for passing data, but for
 passing
 file names.
 The files must be stored in a shared file system (hdfs for simplicity)
 already.
 
 The distributed cache makes the names available to the tasks, and the the
 files are extracted from hdfs and stored in the task local work area on
 each
 task tracker node.
 It looks like you may be storing the contents of your files in the
 distributed cache.
 
 On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote:
 

 Thanks Jason.

 I went inside the code of the statement and found out that it eventually
 makes some binaryRead function call to read a binary file and there it
 strucks.

 Do you know whether there is any problem in giving a binary file for
 addition to the distributed cache.
 In the statement DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory
 which contains some text as well as some binary files. In the statement
 Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I
 can
 see(in the output messages) that it is able to read the text files but
 it
 gets struck at the binary files.

 So, I think here the problem is: it is not able to read the binary files
 which either have not been transferred to the cache or a binary file
 cannot
 be read.

 Do you know the solution to this?

 Thanks,
 Akhil


 jason hadoop wrote:
 
  Something is happening inside of your (Parameters.
  readConfigAndLoadExternalData(Config/allLayer1.config);)
  code, and the framework is killing the job for not heartbeating for
 600
  seconds
 
  On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  One more thing, finally it terminates there (after some time) by
 giving
  the
  final Exception:
 
  java.io.IOException: Job failed!
 at
 org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LbjTagger.NerTagger.main(NerTagger.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
 
  akhil1988 wrote:
  
   Thank you Jason for your reply.
  
   My Map class is an inner class and it is a static class. Here is
 the
   structure of my code.
  
   public class NerTagger {
  
   public static class Map extends 

Re: Data replication and moving computation

2009-06-18 Thread Roshan James
Further, look at the namenode file system browser for your cluster to see
the chunking in action.

http://wiki.apache.org/hadoop/WebApp%20URLs

Roshan

On Thu, Jun 18, 2009 at 6:28 AM, Harish Mallipeddi 
harish.mallipe...@gmail.com wrote:

 On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta graj1...@yahoo.com wrote:

 
  I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS
  cluster and replication factor is 1. A large file is there on one of
 those
  three cluster machines in its local file system. If I put that file in
 HDFS
  will it be divided and distributed across all three machines? I had this
  doubt as HDFS moving computation is cheaper than moving data.
 
  If file is distributed across all three machines, lots of data transfer
  will be there, whereas, if file is NOT distributed then compute power of
  other machine will be unused. Am I missing something here?
 
  -Raj
 
 
 
 Irrespective of what you set as the replication factor, large files will
 always be split into chunks (chunk size is what you set as your HDFS
 block-size) and they'll be distributed across your entire cluster.


 --
 Harish Mallipeddi
 http://blog.poundbang.in
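
The chunk size mentioned above is itself a setting; a minimal sketch of the
hadoop-site.xml property that controls it (the 128 MB value is only an
example - the stock default in this era is 64 MB):

<property>
  <name>dfs.block.size</name>
  <value>134217728</value>
  <description>The default block size for new files, in bytes.</description>
</property>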



Hadoop error help- file system closed, could only be replicated to 0 nodes, instead of 1

2009-06-18 Thread terrianne.erickson
Hi,
 
I am extremely new to Hadoop and have come across a few errors that I'm not
sure how to fix. I am running Hadoop version 0.19.0 from an image through
ElasticFox and S3. I am on Windows and use PuTTY as my SSH client. I am
trying to run a wordcount with 5 slaves. This is what I do so far:
 
1. boot up the instance through ElasticFox
2. cd /usr/local/hadoop-0.19.0
3. bin/hadoop namenode -format
4. bin/start-all.sh
5. jps -- (shows jps, jobtracker, secondarynamenode)
6. bin/stop-all.sh
7. ant examples
8. bin/start-all.sh
9. bin/hadoop jar build/hadoop-0.19.0-examples.jar pi 0 100
 
Then I get this error trace:
 
Number of Maps = 0 Samples per Map = 100
Starting Job
09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File 
/mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be 
replicated to 0 nodes, instead of 1
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
 
at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
 
09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping 
/mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 4
09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: 
java.io.IOException: File 
/mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be 
replicated to 0 nodes, instead of 1
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
at 
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
 
at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
at 
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
 
09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping 
/mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 3
09/06/18 

Can a hadoop pipes job be given multiple input directories?

2009-06-18 Thread Roshan James
In the documentation for Hadoop Streaming it says that the -input option
can be specified multiple times for multiple input directories. The same
does not seem to work with Pipes.

Is there some way to specify multiple input directories for pipes jobs?

Roshan

P.S. With multiple input dirs this is what happens (i.e. there is no clear
error message of any sort).

*+ bin/hadoop pipes -conf pipes.xml -input /in-dir-har/test.har -input
/in-dir -output /out-dir
bin/hadoop pipes
  [-input path] // Input directory
  [-output path] // Output directory
  [-jar jar file // jar filename
  [-inputformat class] // InputFormat class
  [-map class] // Java Map class
  [-partitioner class] // Java Partitioner
  [-reduce class] // Java Reduce class
  [-writer class] // Java RecordWriter
  [-program executable] // executable URI
  [-reduces num] // number of reduces

Generic options supported are
-conf configuration file specify an application configuration file
-D property=valueuse value for given property
-fs local|namenode:port  specify a namenode
-jt local|jobtracker:portspecify a job tracker
-files comma separated list of filesspecify comma separated files to
be copied to the map reduce cluster
-libjars comma separated list of jarsspecify comma separated jar files
to include in the classpath.
-archives comma separated list of archivesspecify comma separated
archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]
*
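
One workaround that might be worth trying - I have not verified it, so treat
it as a sketch - is to skip -input and set the underlying comma-separated
input property through the generic -D option listed in the usage text above:

bin/hadoop pipes -conf pipes.xml \
    -D mapred.input.dir=/in-dir-har/test.har,/in-dir \
    -output /out-dir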


Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Rajeev Gupta
If you're inserting
into HDFS from a machine running a DataNode, the local datanode will always
be chosen as one of the three replica targets.
Does that mean that if the replication factor is 1, the whole file will be
kept on one node only?

Thanks and regards.
-Rajeev Gupta



   
From: Aaron Kimball <aa...@cloudera.com>
To: core-user@hadoop.apache.org
Date: 06/19/2009 01:56 AM
Subject: Re: HDFS is not loading evenly across all nodes.
Reply-To: core-u...@hadoop.apache.org
   
   
   
   




Did you run the dfs put commands from the master node?  If you're inserting
into HDFS from a machine running a DataNode, the local datanode will always
be chosen as one of the three replica targets. For more balanced loading,
you should use an off-cluster machine as the point of origin.

If you experience uneven block distribution, you should also periodically
rebalance your cluster by running bin/start-balancer.sh every so often. It
will work in the background to move blocks from heavily-laden nodes to
underutilized ones.

- Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
qiming...@openresearchinc.com wrote:


 Hi all

 I dfs put a large dataset onto a 10-node cluster.

 When I observe the Hadoop progress (via web:50070) and each local file
 system (via df -k),
 I notice that my master node is hit 5-10 times harder than others, so
hard
 drive is get full quicker than others. Last night load, it actually crash
 when hard drive was full.

 To my understand,  data should wrap around all nodes evenly (in a
 round-robin fashion using 64M as a unit).

 Is it expected behavior of Hadoop? Can anyone suggest a good
 troubleshooting
 way?

 Thanks


 --
 View this message in context:

http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html

 Sent from the Hadoop core-user mailing list archive at Nabble.com.






Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

2009-06-18 Thread Jianmin Woo
Hi, Roshan ,
Thanks a lot for your information about the InputSplit between Java and pipes.

-Jianmin





From: Roshan James roshan.james.subscript...@gmail.com
To: core-user@hadoop.apache.org
Sent: Thursday, June 18, 2009 9:11:41 PM
Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from input  
splits

I did get this working. InputSplit information is not returned clearly. You
may want to look at this thread -
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cee216d470906121602k7f914179u5d9555e7bb080...@mail.gmail.com%3e


On Thu, Jun 18, 2009 at 12:49 AM, Jianmin Woo jianmin_...@yahoo.com wrote:


 I tried this example and it seems that the input/output should only be in
 file:///... format to get correct results.

 - Jianmin




 
 From: Viral K khaju...@yahoo-inc.com
 To: core-user@hadoop.apache.org
 Sent: Thursday, June 18, 2009 8:57:47 AM
 Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from
 input splits


 Does anybody have any updates on this?

 How can we have our own RecordReader in Hadoop pipes?  When I try to print
 the context.getInputSplit, I get the filenames along with some junk
 characters.  As a result the file open fails.

 Anybody got it working?

 Viral.



 11 Nov. wrote:
 
  I traced into the c++ recordreader code:
    WordCountReader(HadoopPipes::MapContext& context) {
      std::string filename;
      HadoopUtils::StringInStream stream(context.getInputSplit());
      HadoopUtils::deserializeString(filename, stream);
      struct stat statResult;
      stat(filename.c_str(), &statResult);
      bytesTotal = statResult.st_size;
      bytesRead = 0;
      cout << filename << endl;
      file = fopen(filename.c_str(), "rt");
      HADOOP_ASSERT(file != NULL, "failed to open " + filename);
}
 
  I got nothing for the filename variable, which showed the InputSplit is
  empty.
 
  2008/3/4, 11 Nov. nov.eleve...@gmail.com:
 
  hi colleagues,
 I have set up the single node cluster to test pipes examples.
  wordcount-simple and wordcount-part work just fine, but
  wordcount-nopipe can't run. Here is my command line:
 
   bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input
  input/ -output out-dir-nopipe1
 
  and here is the error message printed on my console:
 
  08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set.  User
  classes may not be found. See JobConf(Class) or JobConf#setJar(String).
  08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to
  process : 1
  08/03/03 23:23:07 INFO mapred.JobClient: Running job:
  job_200803032218_0004
  08/03/03 23:23:08 INFO mapred.JobClient:  map 0% reduce 0%
  08/03/03 23:23:11 INFO mapred.JobClient: Task Id :
  task_200803032218_0004_m_00_0, Status : FAILED
  java.io.IOException: pipe child exception
  at org.apache.hadoop.mapred.pipes.Application.abort(
  Application.java:138)
  at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(
  PipesMapRunner.java:83)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
  at org.apache.hadoop.mapred.TaskTracker$Child.main(
  TaskTracker.java:1787)
  Caused by: java.io.EOFException
  at java.io.DataInputStream.readByte(DataInputStream.java:250)
  at
  org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java
  :313)
  at
 org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java
  :335)
  at
  org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(
  BinaryProtocol.java:112)
 
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0:
  task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to
 open
  at /home/hadoop/hadoop-0.15.2-single-cluster
  /src/examples/pipes/impl/wordcount-nopipe.cc:67 in
  WordCountReader::WordCountReader(HadoopPipes::MapContext)
 
 
  Could anybody tell me how to fix this? That will be appreciated.
  Thanks a lot!
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Pipes-example-wordcount-nopipe.cc-failed-when-reading-from-input-splits-tp15807856p24084734.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.







  

Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Taeho Kang
Yes, it will be kept on the machine where you issue the dfs -put command if
that machine has a datanode running. Otherwise, a random datanode will be
chosen to store the data blocks.
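
If you want to confirm where the blocks actually ended up, fsck can list the
locations; a minimal sketch (the path is a placeholder for your dataset):

bin/hadoop fsck /user/hadoop/mydataset -files -blocks -locations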


On Fri, Jun 19, 2009 at 10:41 AM, Rajeev Gupta graj...@in.ibm.com wrote:

 If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets.
 Does that mean that if replication factor is 1, whole file will be kept on
 one node only?

 Thanks and regards.
 -Rajeev Gupta




 From: Aaron Kimball <aa...@cloudera.com>
 To: core-user@hadoop.apache.org
 Date: 06/19/2009 01:56 AM
 Subject: Re: HDFS is not loading evenly across all nodes.
 Reply-To: core-u...@hadoop.apache.org








 Did you run the dfs put commands from the master node?  If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets. For more balanced loading,
 you should use an off-cluster machine as the point of origin.

 If you experience uneven block distribution, you should also periodically
 rebalance your cluster by running bin/start-balancer.sh every so often. It
 will work in the background to move blocks from heavily-laden nodes to
 underutilized ones.

 - Aaron

 On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
 qiming...@openresearchinc.com wrote:

 
  Hi all
 
  I dfs put a large dataset onto a 10-node cluster.
 
  When I observe the Hadoop progress (via web:50070) and each local file
  system (via df -k),
  I notice that my master node is hit 5-10 times harder than others, so
 hard
  drive is get full quicker than others. Last night load, it actually crash
  when hard drive was full.
 
  To my understand,  data should wrap around all nodes evenly (in a
  round-robin fashion using 64M as a unit).
 
  Is it expected behavior of Hadoop? Can anyone suggest a good
  troubleshooting
  way?
 
  Thanks
 
 
  --
  View this message in context:
 

 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html

  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 





Re: Hadoop error help- file system closed, could only be replicated to 0 nodes, instead of 1

2009-06-18 Thread ashish pareek
Hi,
     From your details it seems that the datanode is not running. Can
you run *bin/hadoop dfsadmin -report* and find out whether your datanodes
are up? Then post your observations; it would also be better if you post
your hadoop-site.xml file details as well.

Regards,
Ashish.
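
For example, a quick sequence for checking whether the datanodes came up (the
log file name pattern below is approximate and depends on your install):

bin/hadoop dfsadmin -report             # should report one live node per datanode
jps                                     # on each slave, look for a DataNode process
tail -100 logs/hadoop-*-datanode-*.log  # on a slave, look for bind/connect errors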

On Fri, Jun 19, 2009 at 3:16 AM, terrianne.erick...@accenture.com wrote:

 Hi,

 I am extremely new to Hadoop and have come across a few errors that I'm not
 sure how to fix. I am running Hadoop version 0.19.0 from an image through
 Elasticfox and S3. I am on windows and use puTTY as my ssh. I am trying to
 run a wordcount with 5 slaves. This is what I do so far:

 1. boot up the instance through ElasticFox
 2. cd /usr/local/hadoop-0.19.0
 3. bin/hadoop namenode -format
 4. bin/start-all.sh
 5. jps --( shows jps, jobtracker, secondarynamenode)
 6.bin/stop-all.sh
 7. ant examples
 8. bin/start-all.sh
 9. bin/hadoop jar build/hadoop-0.19.0-examples.jar pi 0 100

 Then I get this error trace:

 Number of Maps = 0 Samples per Map = 100
 Starting Job
 09/06/18 17:31:25 INFO hdfs.DFSClient:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be
 replicated to 0 nodes, instead of 1
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)

 09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping
 /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 4
 09/06/18 17:31:25 INFO hdfs.DFSClient:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be
 replicated to 0 nodes, instead of 1
at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
at
 

A simple performance benchmark for Hadoop, Hive and Pig

2009-06-18 Thread Zheng Shao
Hi all,

Yuntao Jia, our intern this summer, did a simple performance benchmark for
Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper "A
Comparison of Approaches to Large-Scale Data Analysis".

The report and the performance test kit are both attached here:
http://issues.apache.org/jira/browse/HIVE-396


We tried our best to get good performance out of Hive and Pig, and we kept
the Hadoop programs as close as possible to those in the SIGMOD paper.  We
welcome all suggestions on how we can improve the performance further,
either by changing the configuration or by improving the code.


While we tried our best to be fair, system settings and environments do
affect the results a lot.  So we encourage everybody to try out the
performance test kit on their own cluster, and we would appreciate it if
everybody shared their results.


Here is the summary.  The details are in the report 
hive_benchmark_2009-06-18.pdf from the link above.

Query: GREP SELECT
Hadoop: 136.1s
Hive:   125.4s
Pig:    247.8s

Query: RANKINGS SELECT
Hadoop: 26.1s
Hive:   31.0s
Pig:    38.4s

Query: USERVISITS AGGREGATION
Hadoop: 533.8s
Hive:   768.8s
Pig:    855.4s

Query: RANKINGS USERVISITS JOIN
Hadoop: 470.0s
Hive:   471.3s
Pig:    763.9s

Please take a look at hive_benchmark_2009-06-18.pdf from the link above for 
details. Let's keep discussions on 
http://issues.apache.org/jira/browse/HIVE-396 so it's easier to keep track.


Zheng