RE: not a SequenceFile?
Thanks for your suggestion, Eason. I think the sequence file issue is resolved; however, success is still far off!

Below is the command I executed in the terminal:

# bin/hadoop jar AggregateWordCount.jar org.apache.hadoop.examples.AggregateWordCount words/* aggregatewordcount_output 2 textinputformat

Error:

09/06/18 11:51:42 INFO mapred.JobClient: Task Id : attempt_200906181145_0005_r_01_1, Status : FAILED
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass
    at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createInstance(UserDefinedValueAggregatorDescriptor.java:57)
    at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createAggregator(UserDefinedValueAggregatorDescriptor.java:64)
    at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.init(UserDefinedValueAggregatorDescriptor.java:76)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.getValueAggregatorDescriptor(ValueAggregatorJobBase.java:54)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.getAggregatorDescriptors(ValueAggregatorJobBase.java:65)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.initializeMySpec(ValueAggregatorJobBase.java:74)
    at org.apache.hadoop.mapred.lib.aggregate.ValueAggregatorJobBase.configure(ValueAggregatorJobBase.java:42)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:240)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:242)
    at org.apache.hadoop.mapred.lib.aggregate.UserDefinedValueAggregatorDescriptor.createInstance(UserDefinedValueAggregatorDescriptor.java:52)
    ... 10 more

However, I can see that the org.apache.hadoop.examples.AggregateWordCount$WordCountPlugInClass class is available in AggregateWordCount.jar; below is the listing of this jar:

# jar -tvf AggregateWordCount.jar
     0 Tue Jun 09 18:11:42 IST 2009 META-INF/
    71 Tue Jun 09 18:11:42 IST 2009 META-INF/MANIFEST.MF
     0 Tue Jun 09 18:10:58 IST 2009 org/
     0 Tue Jun 09 18:10:58 IST 2009 org/apache/
     0 Tue Jun 09 18:10:58 IST 2009 org/apache/hadoop/
     0 Tue Jun 09 18:10:58 IST 2009 org/apache/hadoop/examples/
  1298 Tue Jun 09 18:11:32 IST 2009 org/apache/hadoop/examples/AggregateWordCount$WordCountPlugInClass.class
   846 Tue Jun 09 18:11:32 IST 2009 org/apache/hadoop/examples/AggregateWordCount.class

Could you please advise what the problem could be here?

Thank You,
Shravan Kumar. M
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
-----Original Message-----
From: Eason.Lee [mailto:leongf...@gmail.com]
Sent: Thursday, June 18, 2009 11:27 AM
To: core-user@hadoop.apache.org; shravan.mahank...@catalytic.com
Subject: Re: not a SequenceFile?

You'd better run it like this:

bin/hadoop jar hadoop-*-examples.jar aggregatewordcount input output numOfReducers textinputformat

Passing textinputformat as the last argument makes the example use TextInputFormat instead of SequenceFile input. Hope it is helpful!

2009/6/18 Shravan Mahankali shravan.mahank...@catalytic.com

Hi Nick, Thanks for your response. I am very new to Hadoop. I was trying to execute the AggregateWordCount example program provided in the Hadoop distribution from Linux, but was having the issue stated in my earlier email. I have also tried executing the MultiFetch example with no success. If this example program needs the input file to be a sequence file, how should I create one? Please advise.

Thank You,
Shravan Kumar. M
Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
Re: not a SequenceFile?
http://www.nabble.com/ClassNotFoundException-td23441528.html

You must make all of the required jars available to all of your tasks. You can either install them on all the tasktracker machines and set up the tasktracker classpath to include them, or distribute them via the distributed cache.

2009/6/18 Shravan Mahankali shravan.mahank...@catalytic.com
...
Data replication and moving computation
I have a doubt regarding HDFS. Suppose I have 3 machines in my HDFS cluster and the replication factor is 1. A large file sits on one of those three cluster machines, in its local file system. If I put that file into HDFS, will it be divided and distributed across all three machines? I ask because HDFS's premise is that moving computation is cheaper than moving data. If the file is distributed across all three machines, there will be a lot of data transfer; whereas if the file is NOT distributed, the compute power of the other machines will be unused. Am I missing something here? -Raj
Re: Hadoop Eclipse Plugin
You need to give:
1) Map/Reduce Master Host: the host where start-mapred.sh was run.
2) Map/Reduce Master Port: 19001 (see the hadoop-site.xml file).
3) DFS Master: the host where your start-dfs.sh was run.
4) DFS Master Port: 19000.
These parameters will be sufficient to access HDFS. You may need to set up some advanced parameters to give permissions to the Windows user on the hosts where Hadoop is running. Thanks and regards. -Rajeev Gupta

From: Praveen Yarlagadda praveen.yarlagadd...@gmail.com
To: core-user@hadoop.apache.org
Date: 06/18/2009 08:39 AM
Subject: Hadoop Eclipse Plugin

Hi, I have a problem configuring the Hadoop Map/Reduce plugin with Eclipse. Setup details: I have a namenode, a jobtracker and two data nodes, all running on Ubuntu. My setup works fine with the example programs. I want to connect to this setup from Eclipse. namenode - 10.20.104.62 - 54310 (port); jobtracker - 10.20.104.53 - 54311 (port). I run Eclipse on a different Windows machine. I want to configure the Map/Reduce plugin with Eclipse so that I can access HDFS from Windows. Map/Reduce Master: Host - with the jobtracker IP, it did not work; Port - with the jobtracker port, it did not work. DFS Master: Host - with the namenode IP, it did not work; Port - with the namenode port, it did not work. I tried the other combination too, giving namenode details for the Map/Reduce Master and jobtracker details for the DFS Master. It did not work either. If anyone has configured the plugin with Eclipse, please let me know. Even pointers on how to configure it would be highly appreciated. Thanks, Praveen
Re: Data replication and moving computation
On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta graj1...@yahoo.com wrote:
...

Irrespective of what you set as the replication factor, large files will always be split into chunks (the chunk size is what you set as your HDFS block size), and those chunks will be distributed across your entire cluster. -- Harish Mallipeddi http://blog.poundbang.in
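To see this distribution for yourself, a small client along the lines of the sketch below (my own illustration, not from the thread; the file path is just a placeholder) asks the namenode which datanodes hold each block of a file:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up the cluster settings on the classpath
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path(args.length > 0 ? args[0] : "/user/raj/bigfile");  // placeholder path
    FileStatus status = fs.getFileStatus(file);
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      // one line per HDFS block: its offset, its length, and the datanodes holding a replica
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}

With a replication factor of 1 you should see exactly one host per block, spread over the three machines.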
Re: Pipes example wordcount-nopipe.cc failed when reading from input splits
I did get this working. The InputSplit information is not returned cleanly. You may want to look at this thread - http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cee216d470906121602k7f914179u5d9555e7bb080...@mail.gmail.com%3e

On Thu, Jun 18, 2009 at 12:49 AM, Jianmin Woo jianmin_...@yahoo.com wrote:

I tried this example and it seems that the input/output should only be in file:///... format to get correct results. - Jianmin

From: Viral K khaju...@yahoo-inc.com
To: core-user@hadoop.apache.org
Sent: Thursday, June 18, 2009 8:57:47 AM
Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

Does anybody have any updates on this? How can we have our own RecordReader in Hadoop Pipes? When I try to print context.getInputSplit(), I get the filename along with some junk characters. As a result the file open fails. Has anybody got it working? Viral.

11 Nov. wrote:

I traced into the C++ record reader code:

WordCountReader(HadoopPipes::MapContext& context) {
  std::string filename;
  HadoopUtils::StringInStream stream(context.getInputSplit());
  HadoopUtils::deserializeString(filename, stream);
  struct stat statResult;
  stat(filename.c_str(), &statResult);
  bytesTotal = statResult.st_size;
  bytesRead = 0;
  cout << filename << endl;
  file = fopen(filename.c_str(), "rt");
  HADOOP_ASSERT(file != NULL, "failed to open " + filename);
}

I got nothing in the filename variable, which showed the InputSplit is empty.

2008/3/4, 11 Nov. nov.eleve...@gmail.com:

Hi colleagues, I have set up a single-node cluster to test the pipes examples. wordcount-simple and wordcount-part work just fine, but wordcount-nopipe can't run. Here is my command line:

bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input input/ -output out-dir-nopipe1

and here is the error message printed on my console:

08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to process : 1
08/03/03 23:23:07 INFO mapred.JobClient: Running job: job_200803032218_0004
08/03/03 23:23:08 INFO mapred.JobClient: map 0% reduce 0%
08/03/03 23:23:11 INFO mapred.JobClient: Task Id : task_200803032218_0004_m_00_0, Status : FAILED
java.io.IOException: pipe child exception
    at org.apache.hadoop.mapred.pipes.Application.abort(Application.java:138)
    at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:250)
    at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:313)
    at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:335)
    at org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:112)
task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to open at /home/hadoop/hadoop-0.15.2-single-cluster/src/examples/pipes/impl/wordcount-nopipe.cc:67 in WordCountReader::WordCountReader(HadoopPipes::MapContext)

Could anybody tell me how to fix this? That would be appreciated. Thanks a lot!
Re: JobControl for Pipes?
Can you give me a URL or so for both? I can't seem to find either one after a couple of basic web searches. Also, when you say JobControl is coming to Hadoop - I can already see the Java JobControl classes that let one express dependencies between jobs. So I assume this already works in Java - does it not? I was asking whether this functionality is exposed via Pipes in some way. Roshan

On Wed, Jun 17, 2009 at 10:59 PM, jason hadoop jason.had...@gmail.com wrote:

Job control is coming with the Hadoop workflow manager; in the meantime there is Cascading by Chris Wensel. I do not have any personal experience with either, and I do not know how Pipes interacts with either.

On Wed, Jun 17, 2009 at 12:43 PM, Roshan James roshan.james.subscript...@gmail.com wrote:

Hello, Is there any way to express dependencies between map-reduce jobs (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs? The provided header Pipes.hh does not seem to reflect any such capabilities. best, Roshan

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Getting Task ID inside a Mapper
Hi, I was wondering if it's possible to get hold of the task ID inside a mapper? I can't seem to find a way by trawling through the API reference. I'm trying to implement a MapReduce version of Latent Dirichlet Allocation, and I need to be able to initialize a random number generator in a task-specific way so that if the task fails and is rerun elsewhere, the results are the same. Thanks in advance. Cheers, Mark Desnoyer
Re: Getting Task ID inside a Mapper
Hi

Why not provide the random generator's seed from outside the task? Then, when the task fails, you can provide the same value, stored somewhere outside. You could use the task configuration to do so. I don't know anything about obtaining the task ID from within.

regards
Piotr

2009/6/18 Mark Desnoyer mdesno...@gmail.com
...
Re: Getting Task ID inside a Mapper
Thanks! I'll try that. -Mark

On Thu, Jun 18, 2009 at 10:27 AM, Jingkei Ly jingkei...@detica.com wrote:

I think you can use job.getInt("mapred.task.partition", -1) to get the mapper ID, which should be the same for the mapper across task reruns.

-----Original Message-----
From: Piotr Praczyk [mailto:piotr.prac...@gmail.com]
Sent: 18 June 2009 15:19
To: core-user@hadoop.apache.org
Subject: Re: Getting Task ID inside a Mapper
...
Fwd: Need help
Hello, I am doing my master's, and my final-year project is on Hadoop, so I would like to know something about Hadoop clusters, i.e., are the new versions of Hadoop able to handle heterogeneous hardware? If you have any information regarding this, please mail me, as my project is in a heterogeneous environment. Thanks! Regards, Ashish Pareek
Re: JobControl for Pipes?
http://www.cascading.org/
https://issues.apache.org/jira/browse/HADOOP-5303 (oozie)

On Thu, Jun 18, 2009 at 6:19 AM, Roshan James roshan.james.subscript...@gmail.com wrote:
...
Re: Getting Task ID inside a Mapper
The task id is readily available if you override the configure method. The MapReduceBase class in the Pro Hadoop book examples does this and makes the taskId available as a class field.

On Thu, Jun 18, 2009 at 7:33 AM, Mark Desnoyer mdesno...@gmail.com wrote:
...

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
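Putting the two suggestions from this thread together, a minimal sketch (my own illustration, not code from the thread) of a mapper that derives a deterministic seed in configure() might look like this; the base-seed property name "lda.random.seed" is an assumption, something the driver would set:

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SeededMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private Random rng;

  @Override
  public void configure(JobConf job) {
    // Attempt id: changes on every re-execution, useful mainly for logging.
    String attemptId = job.get("mapred.task.id");
    // Partition number of this map task: stays the same across re-runs of the same task.
    int partition = job.getInt("mapred.task.partition", -1);
    // Job-wide base seed set by the driver (assumed property name).
    long baseSeed = job.getLong("lda.random.seed", 42L);
    rng = new Random(baseSeed + partition);
    System.err.println("Seeding " + attemptId + " with partition " + partition);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // ... use rng here and call output.collect(...) as needed ...
  }
}

Because mapred.task.partition is stable across attempts while mapred.task.id is not, a re-executed task sees the same seed and therefore the same random stream, which is what the LDA use case needs.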
Re: Need help
Does that mean Hadoop is not scalable in a heterogeneous environment? And one more question: can we run different applications on the same Hadoop cluster? Thanks. Regards, Ashish

On Thu, Jun 18, 2009 at 8:30 PM, jason hadoop jason.had...@gmail.com wrote:

Hadoop has always been reasonably agnostic wrt hardware and homogeneity. There are optimizations in configuration for near-homogeneous machines.

On Thu, Jun 18, 2009 at 7:46 AM, ashish pareek pareek...@gmail.com wrote:
...

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Practical limit on emitted map/reduce values
Hello, I wasn't able to find this anywhere, so I'm sorry if it has been asked before. I am wondering whether there is a practical limit on the number of bytes an emitted map/reduce value can be. Other than the obvious drawbacks of emitting huge values, such as performance issues, I would like to know whether there are any hard constraints; I can imagine that a value can never be larger than dfs.block.size. Does anyone have any idea, or can anyone provide me with some pointers on where to look? Thanks in advance! Regards, Leon Mergen
Re: Need help
Can you tell me a few of the challenges in configuring a heterogeneous cluster, or pass on some link where I could get information about the challenges of running Hadoop on heterogeneous hardware? One more thing: how about running different applications on the same Hadoop cluster, and what challenges are involved in that? Thanks, Regards, Ashish

On Thu, Jun 18, 2009 at 8:53 PM, jason hadoop jason.had...@gmail.com wrote:

I don't know anyone who has a completely homogeneous cluster, so hadoop is scalable across heterogeneous environments. I stated that configuration is simpler if the machines are similar (there are optimizations in configuration for near-homogeneous machines).

On Thu, Jun 18, 2009 at 8:10 AM, ashish pareek pareek...@gmail.com wrote:
...

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Practical limit on emitted map/reduce values
Keys and values can be large. They are certainly capped above by Java's 2GB limit on byte arrays. More practically, you will have problems running out of memory with keys or values of 100 MB. There is no restriction that a key/value pair fits in a single hdfs block, but performance would suffer. (In particular, the FileInputFormats split at block sized chunks, which means you will have maps that scan an entire block without processing anything.) -- Owen
RE: Practical limit on emitted map/reduce values
Hello Owen,

Thanks for the quick reply. Could you perhaps elaborate on that 100 MB limit? Is it due to a limit caused by the Java VM heap size? If so, could it, for example, be increased to 512 MB by setting mapred.child.java.opts to '-Xmx512m'?

Regards, Leon Mergen
Re: Practical limit on emitted map/reduce values
In general, if the values become very large, it becomes simpler to store them out of line in HDFS and just pass the HDFS path of the item as the value in the map/reduce task. This greatly reduces the amount of IO done and doesn't blow up the sort space on the reducer. You lose the magic of data locality, but given the item size you gain the IO back by not having to pass the full values to the reducer or handle them when sorting the map outputs.

On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen l.p.mer...@solatis.com wrote:
...

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
RE: Practical limit on emitted map/reduce values
Hello Jason,

Ah, that actually sounds like a nice idea: instead of having the reducer emit the huge value, it can create a temporary file and emit the filename instead. I wasn't really planning on having huge values anyway (values above 1MB will be the exception rather than the rule), but since it's theoretically possible for our software to generate them, it seemed like a good idea to investigate any real constraints we might run into. Your idea sounds like a good workaround for this. Thanks!

Regards, Leon Mergen
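For reference, a rough sketch of that workaround on the map side (my own code, not from the thread; the side-directory property "large.value.dir" and the 1 MB threshold are illustrative assumptions, and a real job would also tag each value so the reducer knows whether it received data or a path):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PathEmittingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, LongWritable, Text> {

  private FileSystem fs;
  private Path sideDir;
  private String attemptId;
  private long counter = 0;

  @Override
  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
      sideDir = new Path(job.get("large.value.dir", "/tmp/large-values")); // assumed property
      attemptId = job.get("mapred.task.id");   // makes side-file names unique per attempt
    } catch (IOException e) {
      throw new RuntimeException("Cannot get FileSystem", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<LongWritable, Text> output, Reporter reporter)
      throws IOException {
    if (value.getLength() < 1024 * 1024) {
      // small values travel inline through the shuffle as usual
      output.collect(key, value);
    } else {
      // large values are written to their own HDFS file; only the path is shuffled
      Path side = new Path(sideDir, attemptId + "-" + (counter++));
      FSDataOutputStream out = fs.create(side);
      out.write(value.getBytes(), 0, value.getLength());
      out.close();
      output.collect(key, new Text(side.toString()));
    }
  }
}

Because the attempt id is part of the side-file name, a re-executed or speculative attempt writes different files rather than clobbering the originals.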
Re: Need help
Hello everybody, How can we handle different applications with different requirements being run on the same Hadoop cluster? What are the various approaches to such a problem? If possible, please mention some of those ideas. Does such an implementation exist? Thanks, Regards, Ashish

On Thu, Jun 18, 2009 at 9:36 PM, jason hadoop jason.had...@gmail.com wrote:

For me, I like to have one configuration file that I distribute to all of the machines in my cluster via rsync. In there are things like the number of tasks per node to run, where to store dfs data and local temporary data, and the limits to storage for the machines. If the machines are very different, it becomes important to tailor the configuration file per machine or type of machine. At this point, you are pretty much going to have to spend the time reading through the details of configuring a hadoop cluster.

On Thu, Jun 18, 2009 at 8:33 AM, ashish pareek pareek...@gmail.com wrote:
...

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Restrict output of mappers to reducers running on same node?
Jason, correct me if I am wrong: opening a SequenceFile in the configure method (or the setup method in 0.20) and writing to it is the same as doing output.collect(), unless you mean I should make the SequenceFile writer a static variable and set the reuse-JVM flag to -1. In that case the subsequent mappers might run in the same JVM and can use the same writer, and hence produce one file, but then I need to add a hook to close the writer - maybe use a shutdown hook.

Jothi, the idea of a combine input format is good, but I guess I would have to write something of my own to make it work in my case.

Thanks guys for the suggestions... but I feel we should have some support from the framework to merge the output of a map-only job so that we don't get a large number of smaller files. Sometimes you just don't want to run reducers and unnecessarily transfer a whole lot of data across the network.

Thanks, Tarandeep

On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop jason.had...@gmail.com wrote:

You can open your sequence file in the mapper configure method, write to it in your map, and close it in the mapper close method. Then you end up with 1 sequence file per map. I am making an assumption that each key,value pair passed to your map somehow represents a single xml file/item.

On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan joth...@yahoo-inc.com wrote:

You could look at CombineFileInputFormat to generate a single split out of several files. Your partitioner would be able to assign keys to specific reducers, but you would not have control over which node a given reduce task runs on. Jothi

On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote:

Hi, Can I restrict the output of mappers running on a node to go to reducer(s) running on the same node? Let me explain why I want to do this: I am converting a huge number of XML files into SequenceFiles. So theoretically I don't even need reducers; mappers would read XML files and output SequenceFiles. But the problem with this approach is that I will end up with a huge number of small output files. To avoid generating a large number of smaller files, I can run identity reducers. But by running reducers, I am unnecessarily transferring data over the network. I ran a test case using a small subset of my data (~90GB). With a map-only job, my cluster finished the conversion in only 6 minutes, but with a map plus identity-reducer job it takes around 38 minutes. I have to process close to a terabyte of data, so I was thinking of faster alternatives: * writing a custom OutputFormat * somehow restricting the output of mappers running on a node to go to reducers running on the same node. Maybe I can write my own partitioner (simple), but I am not sure how Hadoop's framework assigns partitions to reduce tasks. Any pointers? Or is this not possible at all? Thanks, Tarandeep

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
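A bare-bones sketch of jason's suggestion, assuming the old (0.18/0.20 mapred) API and a map-only job; this is illustrative code of mine, not Tarandeep's:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class XmlToSequenceFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private SequenceFile.Writer writer;

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // one output file per map task, named after the task's partition number
      Path out = new Path(FileOutputFormat.getWorkOutputPath(job),
          "part-" + job.getInt("mapred.task.partition", 0));
      writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException("Could not create SequenceFile writer", e);
    }
  }

  public void map(LongWritable key, Text xmlRecord,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // here the raw record is appended as-is; real code would convert the XML first
    writer.append(new Text("doc-" + key.get()), xmlRecord);
  }

  @Override
  public void close() throws IOException {
    writer.close();
  }
}

The driver would set the number of reduce tasks to 0 and still configure a FileOutputFormat output directory so that getWorkOutputPath has somewhere to point; writing under the work output path keeps speculative or re-executed attempts from clobbering each other's files.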
Re: Practical limit on emitted map/reduce values
On Jun 18, 2009, at 8:45 AM, Leon Mergen wrote: Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit that is caused by the Java VM heap size ? If so, could that, for example, be increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ? A couple of points: 1. The 100MB was just for ballpark calculations. Of course if you have a large heap, you can fit larger values. Don't forget that the framework is allocating big chunks of the heap for its own buffers, when figuring out how big to make your heaps. 2. Having large keys is much harder than large values. When doing a N-way merge, the framework has N+1 keys and 1 value in memory at a time. -- Owen
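On Leon's -Xmx question above, the heap of each task JVM is indeed controlled by mapred.child.java.opts; a small driver-side sketch (the 512 MB value is purely illustrative) looks like this:

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapExample {
  public static JobConf configureJob() {
    JobConf conf = new JobConf();
    // Each map/reduce child JVM will be launched with a 512 MB heap.
    // Remember the framework's own buffers (e.g. io.sort.mb) come out of this same heap.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    return conf;
  }
}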
Re: Need help
Hadoop can be run on a hardware-heterogeneous cluster. Currently, Hadoop clusters really only run well on Linux, although you can run a Hadoop client on non-Linux machines. You will need a special configuration for each of the machines in your cluster based on its hardware profile. Ideally, you'll be able to group the machines in your cluster into classes of machines (e.g. machines with 1GB of RAM and 2 cores versus 4GB of RAM and 4 cores) to reduce the burden of managing multiple configurations. If you are talking about a Hadoop cluster that is completely heterogeneous (each machine is completely different), the management overhead could be high. Configuration variables like mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum should be set based on the number of cores and the amount of memory in each machine. Variables like mapred.child.java.opts need to be set differently based on the amount of memory the machine has (e.g. -Xmx250m). You should have at least 250MB of memory dedicated to each task, although more is better. It's also wise to make sure that each task has the same amount of memory regardless of the machine it's scheduled on; otherwise, tasks might succeed or fail based on which machine gets the task, and this asymmetry will make debugging harder. You can use our online configurator (http://www.cloudera.com/configurator/) to generate optimized configurations for each class of machines in your cluster. It will ask simple questions about your configuration and then produce a hadoop-site.xml file. Good luck! -Matt

On Jun 18, 2009, at 8:33 AM, ashish pareek wrote:
...
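To make the per-class settings Matt lists concrete, each class of machine would carry its own hadoop-site.xml fragment along these lines; the numbers are purely illustrative, for a hypothetical 4-core / 4 GB class:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx512m</value>
</property>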
Upgrading from .19 to .20 problems
Hey All, I'm able to start my master server, but none of the slave nodes come up (unless I list the master as the slave). After searching a bit, it seems people have this problem when they forget to set fs.default.name, but I've got it set in core-site.xml (listed below). The slaves all show the error below on start-up:

STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = slave1/192.168.0.234
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009

2009-06-18 09:06:49,369 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.lang.NullPointerException
    at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:134)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:156)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:160)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
2009-06-18 09:06:49,370 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DataNode at slave1/192.168.0.234

== core-site.xml ==

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-0.20.0-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>
RE: Practical limit on emitted map/reduce values
Hello Owen,

Ok, that makes sense. Thanks for the information!

Regards, Leon Mergen
Re: Hadoop Eclipse Plugin
Hi, Thank you for your response. The netstat and telnet commands executed fine, but I still couldn't connect through Eclipse. I suspect something is wrong with the Eclipse plugin. Do you know any other development environments people use to develop map/reduce applications? Regards, Praveen

On Thu, Jun 18, 2009 at 2:49 AM, Steve Loughran ste...@apache.org wrote:

Praveen Yarlagadda wrote:
...

1. Check the ports really are open by doing a netstat -a -p on the namenode and job tracker:
netstat -a -p | grep 54310 on the NN
netstat -a -p | grep 54311 on the JT
2. Then, from the Windows machine, see if you can connect to them outside Eclipse:
telnet 10.20.104.62 54310
telnet 10.20.104.53 54311
If you can't connect, then firewalls are interfering. If everything works, the problem is in the eclipse plugin (which I don't use, and cannot assist with).

-- Steve Loughran http://www.1060.org/blogxter/publish/5 Author: Ant in Action http://antbook.org/
slaves registers on JobTracker as localhost!!!
The slaves register on the JobTracker as localhost, and the JobTracker tries to fetch data from localhost when it wants to fetch data from the slaves. But the word localhost appears neither in /etc/hosts nor in any other Linux network or Hadoop config file. For example, my slave's /etc/hosts:
---
192.168.2.22    gentoo1
---
and the JobTracker is trying to connect to it as localhost. (The word does appear many times in Hadoop's Java sources, though.) Thank you.
Re: Read/write dependency wrt total data size on hdfs
I'm a little confused about what your question is. Are you asking why HDFS has consistent read/write speeds even as your cluster gets more and more data? If so, two HDFS bottlenecks that would change read/write performance as used capacity changes are namenode (NN) RAM and the amount of data each of your datanodes (DNs) is storing. If you have so much metadata (lots of files, blocks, etc.) that the NN java process uses most of your NN's memory, then you'll see a big decrease in performance. This bottleneck usually only shows itself on large clusters with tons of metadata, though a small cluster with a wimpy NN machine will have the same bottleneck. Similarly, if each of your DNs is storing close to its capacity, then reads/writes will begin to slow down, as each node will be responsible for streaming more and more data in and out. Does that make sense? Should you fill your cluster up to 80-90%, I imagine you'd probably see a decrease in read/write performance depending on the tests you're running, though I can't say I've done this performance test before. I'm merely speculating. Hope this clears things up. Alex

On Thu, Jun 18, 2009 at 9:30 AM, Wasim Bari wasimb...@msn.com wrote:

Hi, I am storing data on an HDFS cluster (4 machines). I have seen that read/write speed is not much affected by the amount of data on HDFS (the total data size of HDFS). I have used 20-30% of the cluster and have not completely filled it. Can someone explain why this is so, and whether HDFS promises such a feature, or am I missing something? Thanks, wasim
multiple file input
I am evaluating Hadoop for a problem that does a Cartesian product of input from one file of 600K records (FileA) with another set of files (FileB1, FileB2, FileB3) containing 2 million lines in total. Each line from FileA gets compared with every line from FileB1, FileB2, etc. FileB1, FileB2, etc. are in a different input directory. So there are two input directories: 1. input1, with a single file of 600K records - FileA; 2. input2, segmented into different files with 2 million records - FileB1, FileB2, etc. How can I have a map that reads a line from FileA in directory input1 and compares that line with each line from input2? What is the best way forward? I have only seen examples that map each record from a single input file and reduce into an output. thanks
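One possible shape for this (a sketch under my own assumptions, not a definitive answer): put FileA in the DistributedCache, make input2 the job input, load FileA once per task in configure(), and compare every incoming FileB line against it. The simple equality check below stands in for whatever comparison is actually needed.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrossProductMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private List<String> fileALines = new ArrayList<String>();

  @Override
  public void configure(JobConf job) {
    try {
      // FileA was added with DistributedCache.addCacheFile(...) in the driver
      Path[] cached = DistributedCache.getLocalCacheFiles(job);
      BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = reader.readLine()) != null) {
        fileALines.add(line);
      }
      reader.close();
    } catch (IOException e) {
      throw new RuntimeException("Failed to load FileA from the distributed cache", e);
    }
  }

  public void map(LongWritable key, Text fileBLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    for (String fileALine : fileALines) {
      // placeholder comparison: emit a pair whenever the two lines match exactly
      if (fileALine.equals(fileBLine.toString())) {
        output.collect(new Text(fileALine), fileBLine);
      }
    }
  }
}

The driver would call DistributedCache.addCacheFile(...) with FileA's HDFS URI and point FileInputFormat at input2; this only works because FileA (600K records) is small enough to hold in each task's memory.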
Re: Read/write dependency wrt total data size on hdfs
On Thu, Jun 18, 2009 at 10:55 AM, Alex Loddengaard a...@cloudera.com wrote:

If you have so much metadata (lots of files, blocks, etc.) that the NN java process uses most of your NN's memory, then you'll see a big decrease in performance.

To avoid this issue, simply watch swap usage on your NN. If your NN starts swapping you will likely run into problems with your metadata operation speed. This won't affect throughput of read/writes within a block, though.

Similarly, if each of your DNs are storing close to their capacity, then reads/writes will begin to slow down, as each node will be responsible for streaming more and more data in and out.

Another thing to keep in mind is that local filesystem performance begins to suffer once a disk is more than 80% or so full. This is due to the ways that filesystems endeavour to keep file fragmentation low. When there is little extra space on the drive, the file system has fewer options for relocating blocks and fighting fragmentation, so sequential writes and reads will actually incur seeks on the local disk. Since the datanodes store their blocks on the local file system, this is a factor worth considering. -Todd
Need help. 0.18.3. pipes. thanks.
2 nodes: ibmT43, gentoo1.
ibmT43 = NameNode + JobTracker + TaskTracker + DataNode
gentoo1 = TaskTracker + DataNode

=== conf/hadoop-site.xml === (identical on the 2 hosts):

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://ibmT43:9000/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>ibmT43:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.task.timeout</name>
    <value>2</value>
  </property>
  <property>
    <name>hadoop.pipes.executable</name>
    <value>/bin/logparser</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>10</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
</configuration>

/etc/hosts, also identical on the two hosts:

192.168.2.1    ibmT43
192.168.2.22   gentoo1
# 127.0.0.1 localhost   # it makes no difference whether I uncomment localhost

The binary hdfs://bin/logparser is correct - it has worked in the past.

bin/hadoop pipes -input /input -output /output1 -conf 123test.xml

=== 123test.xml ===

<?xml version="1.0"?>
<configuration>
  <!--
  <property>
    <name>mapred.reduce.tasks</name>
    <value>2</value>
  </property>
  -->
  <property>
    <name>hadoop.pipes.java.recordreader</name>
    <value>true</value>
  </property>
  <property>
    <name>hadoop.pipes.java.recordwriter</name>
    <value>true</value>
  </property>
</configuration>

Running the job:

had...@ibmt43 ~/hadoop-0.18.3 $ bin/hadoop pipes -input /input -output /output1 -conf 123test.xml
09/06/18 22:38:50 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
09/06/18 22:38:50 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
09/06/18 22:38:50 INFO mapred.FileInputFormat: Total input paths to process : 4
09/06/18 22:38:50 INFO mapred.FileInputFormat: Total input paths to process : 4
09/06/18 22:38:51 INFO mapred.JobClient: Running job: job_200906182236_0001
09/06/18 22:38:52 INFO mapred.JobClient: map 0% reduce 0%
09/06/18 22:39:03 INFO mapred.JobClient: map 5% reduce 0%
09/06/18 22:39:12 INFO mapred.JobClient: map 11% reduce 0%
09/06/18 22:39:23 INFO mapred.JobClient: map 5% reduce 0%
09/06/18 22:39:23 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_02_0, Status : FAILED
Task attempt_200906182236_0001_m_02_0 failed to report status for 22 seconds. Killing!
09/06/18 22:39:24 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_03_0, Status : FAILED
Task attempt_200906182236_0001_m_03_0 failed to report status for 22 seconds. Killing!
09/06/18 22:39:33 INFO mapred.JobClient: map 0% reduce 0%
09/06/18 22:39:33 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_00_0, Status : FAILED
Task attempt_200906182236_0001_m_00_0 failed to report status for 23 seconds. Killing!
attempt_200906182236_0001_m_00_0: Hadoop Pipes Exception: write error to file: Connection reset by peer at SerialUtils.cc:129 in virtual void HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:39:33 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_01_0, Status : FAILED
Task attempt_200906182236_0001_m_01_0 failed to report status for 23 seconds. Killing!
09/06/18 22:39:34 INFO mapred.JobClient: map 2% reduce 0%
09/06/18 22:39:38 INFO mapred.JobClient: map 5% reduce 0%
09/06/18 22:39:43 INFO mapred.JobClient: map 8% reduce 0%
09/06/18 22:39:48 INFO mapred.JobClient: map 12% reduce 0%
09/06/18 22:39:54 INFO mapred.JobClient: map 9% reduce 0%
09/06/18 22:39:54 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_04_0, Status : FAILED
Task attempt_200906182236_0001_m_04_0 failed to report status for 22 seconds. Killing!
attempt_200906182236_0001_m_04_0: Hadoop Pipes Exception: write error to file: Connection reset by peer at SerialUtils.cc:129 in virtual void HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:39:59 INFO mapred.JobClient: map 6% reduce 0%
09/06/18 22:39:59 INFO mapred.JobClient: Task Id : attempt_200906182236_0001_m_05_0, Status : FAILED
Task attempt_200906182236_0001_m_05_0 failed to report status for 22 seconds. Killing!
attempt_200906182236_0001_m_05_0: Hadoop Pipes Exception: write error to file: Connection reset by peer at SerialUtils.cc:129 in virtual void HadoopUtils::FileOutStream::write(const void*, size_t)
09/06/18 22:40:03 INFO
HDFS is not loading evenly across all nodes.
Hi all, I did a dfs put of a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than the others, so its hard drive fills up more quickly than the others'. During last night's load it actually crashed when the hard drive was full. To my understanding, data should wrap around all nodes evenly (in a round-robin fashion, using 64M as the unit). Is this the expected behavior of Hadoop? Can anyone suggest a good way to troubleshoot this? Thanks
Re: HDFS is not loading evenly across all nodes.
Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you experience uneven block distribution, you should also periodically rebalance your cluster by running bin/start-balancer.sh every so often. It will work in the background to move blocks from heavily-laden nodes to underutilized ones. - Aaron On Thu, Jun 18, 2009 at 12:57 PM, openresearch qiming...@openresearchinc.com wrote: Hi all I dfs put a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than others, so hard drive is get full quicker than others. Last night load, it actually crash when hard drive was full. To my understand, data should wrap around all nodes evenly (in a round-robin fashion using 64M as a unit). Is it expected behavior of Hadoop? Can anyone suggest a good troubleshooting way? Thanks -- View this message in context: http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
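For reference, here is a minimal sketch of what loading from an off-cluster client can look like with the FileSystem API; the NameNode address and the source/destination paths are placeholders, not taken from this thread. Because the client runs no DataNode, the NameNode picks replica targets across the cluster instead of always favoring the local node.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OffClusterPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster's NameNode (placeholder host/port).
    conf.set("fs.default.name", "hdfs://namenode-host:9000/");
    FileSystem fs = FileSystem.get(conf);
    // Copy a local file into HDFS; args[0] = local path, args[1] = HDFS path.
    fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
    fs.close();
  }
}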
Re: HDFS is not loading evenly across all nodes.
As an addendum, running a DataNode on the same machine as a NameNode is generally considered a bad idea because it hurts the NameNode's ability to maintain high throughput. - Aaron On Thu, Jun 18, 2009 at 1:26 PM, Aaron Kimball aa...@cloudera.com wrote: Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you experience uneven block distribution, you should also periodically rebalance your cluster by running bin/start-balancer.sh every so often. It will work in the background to move blocks from heavily-laden nodes to underutilized ones. - Aaron On Thu, Jun 18, 2009 at 12:57 PM, openresearch qiming...@openresearchinc.com wrote: Hi all I dfs put a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than others, so hard drive is get full quicker than others. Last night load, it actually crash when hard drive was full. To my understand, data should wrap around all nodes evenly (in a round-robin fashion using 64M as a unit). Is it expected behavior of Hadoop? Can anyone suggest a good troubleshooting way? Thanks -- View this message in context: http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Trying to setup Cluster
Are you encountering specific problems? I don't think that Hadoop's config files will evaluate environment variables, so $HADOOP_HOME won't be interpreted correctly. For passwordless ssh, see http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html or just check the manpage for ssh-keygen. - Aaron

On Wed, Jun 17, 2009 at 9:30 AM, Divij Durve divij.t...@gmail.com wrote:

I'm trying to set up a cluster with 3 different machines running Fedora. I can't get them to log into localhost without a password, but that's the least of my worries at the moment. I am posting my config files and the master and slave files; let me know if anyone can spot a problem with the configs...

Hadoop-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>$HADOOP_HOME/dfs-data</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>$HADOOP_HOME/dfs-name</value>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>$HADOOP_HOME/hadoop-tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://gobi.something.something:54310</value>
    <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a FileSystem.</description>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>kalahari.something.something:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>$HADOOP_HOME/mapred-system</value>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>$HADOOP_HOME/mapred-local</value>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Slave: kongur.something.something
Master: kalahari.something.something

I execute the dfs-start.sh command from gobi.something.something. Is there any other info that I should provide in order to help? Also, Kongur is where I'm running the data node; the master file on Kongur should have localhost in it, right?

Thanks for the help
Divij
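To illustrate Aaron's point, a quick check along these lines should show the problem (the hadoop-site.xml path is assumed to be in the working directory). As far as I know, Configuration expands ${...} references to other config properties and Java system properties, but never shell variables such as $HADOOP_HOME, so the daemons would try to use the literal path.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfigCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource(new Path("hadoop-site.xml"));  // assumed location of the file above
    // Expected to print the literal string "$HADOOP_HOME/dfs-data",
    // i.e. the shell variable is never expanded.
    System.out.println(conf.get("dfs.data.dir"));
  }
}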
Re: multiple file input
On Jun 18, 2009, at 10:56 AM, pmg wrote: Each line from FileA gets compared with every line from FileB1, FileB2 etc. etc. FileB1, FileB2 etc. are in a different input directory In the general case, I'd define an InputFormat that takes two directories, computes the input splits for each directory and generates a new list of InputSplits that is the cross-product of the two lists. So instead of FileSplit, it would use a FileSplitPair that gives the FileSplit for dir1 and the FileSplit for dir2 and the record reader would return a TextPair with left and right records (ie. lines). Clearly, you read the first line of split1 and cross it by each line from split2, then move to the second line of split1 and process each line from split2, etc. You'll need to ensure that you don't overwhelm the system with either too many input splits (ie. maps). Also don't forget that N^2/M grows much faster with the size of the input (N) than the M machines can handle in a fixed amount of time. Two input directories 1. input1 directory with a single file of 600K records - FileA 2. input2 directory segmented into different files with 2Million records - FileB1, FileB2 etc. In this particular case, it would be right to load all of FileA into memory and process the chunks of FileB/part-*. Then it would be much faster than needing to re-read the file over and over again, but otherwise it would be the same. -- Owen
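A rough sketch of that second approach (cache FileA in memory, stream FileB through the mapper), written against the 0.18/0.19-style API; the filea.path property name is made up for the example, and the collect call stands in for whatever comparison logic is actually needed.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CrossJoinMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> fileALines = new ArrayList<String>();

  public void configure(JobConf job) {
    try {
      // Load all of FileA (the 600K-record side) into memory once per task.
      FileSystem fs = FileSystem.get(job);
      BufferedReader in = new BufferedReader(
          new InputStreamReader(fs.open(new Path(job.get("filea.path")))));
      String line;
      while ((line = in.readLine()) != null) {
        fileALines.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("could not load FileA", e);
    }
  }

  public void map(LongWritable key, Text fileBLine,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Compare this FileB line against every cached FileA line.
    for (String fileALine : fileALines) {
      output.collect(new Text(fileALine), fileBLine);  // real comparison logic goes here
    }
    reporter.progress();  // keep long cross-products from hitting the task timeout
  }
}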
Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded
Hi Jason! I finally found out that there was some problem in reserving the heap size, which I have resolved now. We cannot change HADOOP_HEAPSIZE using export from a user account after Hadoop has been started; it has to be changed by root. I have a user account on the cluster and was trying to change the Hadoop heap size from my account using 'export', which had no effect. So I had to ask my cluster administrator to increase HADOOP_HEAPSIZE in hadoop-env.sh and then restart Hadoop. Now the program is running absolutely fine. Thanks for your help. One thing I would like to ask: can we use DistributedCache for transferring directories to the local cache of the tasks? Thanks, Akhil

akhil1988 wrote: Hi Jason! Thanks for going with me to solve my problem. To restate things and make them easier to understand: I am working in local mode in the directory which contains the job jar and also the Config and Data directories. I just removed the following three statements from my code:

DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
DistributedCache.createSymlink(conf);

The program executes up to the same point as before and then terminates. That means the above three statements are of no use while working in local mode. In local mode, the working directory for the map-reduce tasks is the current working directory in which you started the hadoop command to execute the job. Since I have removed the DistributedCache.addCacheFile statements, there should be no issue whether I am giving a file name or a directory name as the argument. Now it seems to me that there is some problem in reading the binary file using binaryRead. Please let me know if I am going wrong anywhere. Thanks, Akhil

jason hadoop wrote: I have only ever used the distributed cache to add files, including binary files such as shared libraries. It looks like you are adding a directory. The DistributedCache is not generally used for passing data, but for passing file names. The files must already be stored in a shared file system (hdfs for simplicity). The distributed cache makes the names available to the tasks, and the files are extracted from hdfs and stored in the task-local work area on each tasktracker node. It looks like you may be storing the contents of your files in the distributed cache.

On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote: Thanks Jason. I went inside the code of the statement and found out that it eventually makes a binaryRead function call to read a binary file, and there it gets stuck. Do you know whether there is any problem in giving a binary file for addition to the distributed cache? In the statement DistributedCache.addCacheFile(new URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf); Data is a directory which contains some text as well as some binary files. In the statement Parameters.readConfigAndLoadExternalData("Config/allLayer1.config"); I can see (in the output messages) that it is able to read the text files, but it gets stuck at the binary files. So I think the problem here is that it is not able to read the binary files, which either have not been transferred to the cache or cannot be read as binary. Do you know the solution to this? Thanks, Akhil

jason hadoop wrote: Something is happening inside of your (Parameters.
readConfigAndLoadExternalData(Config/allLayer1.config);) code, and the framework is killing the job for not heartbeating for 600 seconds On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote: One more thing, finally it terminates there (after some time) by giving the final Exception: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LbjTagger.NerTagger.main(NerTagger.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) akhil1988 wrote: Thank you Jason for your reply. My Map class is an inner class and it is a static class. Here is the structure of my code. public class NerTagger { public static class Map extends
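Regarding the directory question above: along the lines of what jason describes, a hedged sketch of the file/archive style of usage looks like the following. The HDFS paths and the Data.zip archive name are assumptions for illustration; the usual way to ship a whole directory is to package it as an archive in HDFS and let addCacheArchive unpack it on each node.

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheSetup {
  public static void configureCache(JobConf conf) throws Exception {
    // Individual files already stored in HDFS; the "#name" fragment plus
    // createSymlink makes them appear under that name in the task's working dir.
    DistributedCache.addCacheFile(
        new URI("/user/akhil1988/Ner/Config/allLayer1.config#allLayer1.config"), conf);
    // A directory packaged as a zip/tar in HDFS, unpacked on each task node.
    DistributedCache.addCacheArchive(
        new URI("/user/akhil1988/Ner/Data.zip#Data"), conf);
    DistributedCache.createSymlink(conf);
  }
}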
Re: Data replication and moving computation
Further, look at the namenode file system browser for your cluster to see the chunking in action. http://wiki.apache.org/hadoop/WebApp%20URLs Roshan On Thu, Jun 18, 2009 at 6:28 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Jun 18, 2009 at 3:43 PM, rajeev gupta graj1...@yahoo.com wrote: I have this doubt regarding HDFS. Suppose I have 3 machines in my HDFS cluster and replication factor is 1. A large file is there on one of those three cluster machines in its local file system. If I put that file in HDFS will it be divided and distributed across all three machines? I had this doubt as HDFS moving computation is cheaper than moving data. If file is distributed across all three machines, lots of data transfer will be there, whereas, if file is NOT distributed then compute power of other machine will be unused. Am I missing something here? -Raj Irrespective of what you set as the replication factor, large files will always be split into chunks (chunk size is what you set as your HDFS block-size) and they'll be distributed across your entire cluster. -- Harish Mallipeddi http://blog.poundbang.in
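If you would rather see the chunking programmatically than through the web UI, a small hedged sketch using the FileSystem API (0.19-era calls; the path is a placeholder) lists which hosts hold each block of a file:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/user/raj/bigfile");  // placeholder path
    FileStatus status = fs.getFileStatus(file);
    // One entry per block, listing the hosts that hold a replica of it.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}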
Hadoop error help- file system closed, could only be replicated to 0 nodes, instead of 1
Hi, I am extremely new to Hadoop and have come across a few errors that I'm not sure how to fix. I am running Hadoop version 0.19.0 from an image through Elasticfox and S3. I am on windows and use puTTY as my ssh. I am trying to run a wordcount with 5 slaves. This is what I do so far: 1. boot up the instance through ElasticFox 2. cd /usr/local/hadoop-0.19.0 3. bin/hadoop namenode -format 4. bin/start-all.sh 5. jps --( shows jps, jobtracker, secondarynamenode) 6.bin/stop-all.sh 7. ant examples 8. bin/start-all.sh 9. bin/hadoop jar build/hadoop-0.19.0-examples.jar pi 0 100 Then I get this error trace: Number of Maps = 0 Samples per Map = 100 Starting Job 09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 4 09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflec,t.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) 
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 3 09/06/18
Can a hadoop pipes job be given multiple input directories?
In the documentation for Hadoop Streaming it says that the -input option can be specified multiple times for multiple input directories. The same does not seem to work with Pipes. Is there some way to specify multiple input directories for pipes jobs? Roshan

P.S. With multiple input dirs this is what happens (i.e. there is no clear error message of any sort; the command just prints the usage):

+ bin/hadoop pipes -conf pipes.xml -input /in-dir-har/test.har -input /in-dir -output /out-dir
bin/hadoop pipes
  [-input <path>]           // Input directory
  [-output <path>]          // Output directory
  [-jar <jar file>]         // jar filename
  [-inputformat <class>]    // InputFormat class
  [-map <class>]            // Java Map class
  [-partitioner <class>]    // Java Partitioner
  [-reduce <class>]         // Java Reduce class
  [-writer <class>]         // Java RecordWriter
  [-program <executable>]   // executable URI
  [-reduces <num>]          // number of reduces

Generic options supported are:
  -conf <configuration file>                     specify an application configuration file
  -D <property=value>                            use value for given property
  -fs <local|namenode:port>                      specify a namenode
  -jt <local|jobtracker:port>                    specify a job tracker
  -files <comma separated list of files>         specify comma separated files to be copied to the map reduce cluster
  -libjars <comma separated list of jars>        specify comma separated jar files to include in the classpath
  -archives <comma separated list of archives>   specify comma separated archives to be unarchived on the compute machines

The general command line syntax is: bin/hadoop command [genericOptions] [commandOptions]
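One possible (untested) workaround is to bypass the pipes command line for input handling and submit from a small Java driver, since FileInputFormat accepts any number of input paths. The executable path below and the property names mirror the usual pipes settings, but the specifics are assumptions rather than something confirmed in this thread.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.pipes.Submitter;

public class MultiInputPipesDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    conf.set("hadoop.pipes.executable", "/bin/wordcount");  // your C++ binary in HDFS (assumption)
    conf.setBoolean("hadoop.pipes.java.recordreader", true);
    conf.setBoolean("hadoop.pipes.java.recordwriter", true);
    // Add as many input directories as needed.
    FileInputFormat.addInputPath(conf, new Path("/in-dir-har/test.har"));
    FileInputFormat.addInputPath(conf, new Path("/in-dir"));
    FileOutputFormat.setOutputPath(conf, new Path("/out-dir"));
    Submitter.runJob(conf);
  }
}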
Re: HDFS is not loading evenly across all nodes.
If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. Does that mean that if the replication factor is 1, the whole file will be kept on one node only? Thanks and regards. -Rajeev Gupta

Aaron Kimball <aa...@cloudera.com> wrote on 06/19/2009 01:56 AM (to core-user@hadoop.apache.org, subject: Re: HDFS is not loading evenly across all nodes.):

Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you experience uneven block distribution, you should also periodically rebalance your cluster by running bin/start-balancer.sh every so often. It will work in the background to move blocks from heavily-laden nodes to underutilized ones. - Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch qiming...@openresearchinc.com wrote: Hi all I dfs put a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than others, so hard drive is get full quicker than others. Last night load, it actually crash when hard drive was full. To my understand, data should wrap around all nodes evenly (in a round-robin fashion using 64M as a unit). Is it expected behavior of Hadoop? Can anyone suggest a good troubleshooting way? Thanks -- View this message in context: http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Pipes example wordcount-nopipe.cc failed when reading from input splits
Hi Roshan, Thanks a lot for your information about the InputSplit between Java and pipes. -Jianmin

From: Roshan James roshan.james.subscript...@gmail.com To: core-user@hadoop.apache.org Sent: Thursday, June 18, 2009 9:11:41 PM Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

I did get this working. InputSplit information is not returned clearly. You may want to look at this thread - http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200906.mbox/%3cee216d470906121602k7f914179u5d9555e7bb080...@mail.gmail.com%3e

On Thu, Jun 18, 2009 at 12:49 AM, Jianmin Woo jianmin_...@yahoo.com wrote: I tried this example and it seems that the input/output should only be in file:///... format to get correct results. - Jianmin

From: Viral K khaju...@yahoo-inc.com To: core-user@hadoop.apache.org Sent: Thursday, June 18, 2009 8:57:47 AM Subject: Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

Does anybody have any updates on this? How can we have our own RecordReader in Hadoop pipes? When I try to print context.getInputSplit(), I get the filenames along with some junk characters. As a result the file open fails. Anybody got it working? Viral.

11 Nov. wrote: I traced into the C++ recordreader code:

  WordCountReader(HadoopPipes::MapContext& context) {
    std::string filename;
    HadoopUtils::StringInStream stream(context.getInputSplit());
    HadoopUtils::deserializeString(filename, stream);
    struct stat statResult;
    stat(filename.c_str(), &statResult);
    bytesTotal = statResult.st_size;
    bytesRead = 0;
    cout << filename << endl;
    file = fopen(filename.c_str(), "rt");
    HADOOP_ASSERT(file != NULL, "failed to open " + filename);
  }

I got nothing for the filename variable, which showed the InputSplit is empty.

2008/3/4, 11 Nov. nov.eleve...@gmail.com: hi colleagues, I have set up a single-node cluster to test the pipes examples. wordcount-simple and wordcount-part work just fine, but wordcount-nopipe can't run. Here is my command line:

bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input input/ -output out-dir-nopipe1

and here is the error message printed on my console:

08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to process : 1 08/03/03 23:23:07 INFO mapred.JobClient: Running job: job_200803032218_0004 08/03/03 23:23:08 INFO mapred.JobClient: map 0% reduce 0% 08/03/03 23:23:11 INFO mapred.JobClient: Task Id : task_200803032218_0004_m_00_0, Status : FAILED java.io.IOException: pipe child exception at org.apache.hadoop.mapred.pipes.Application.abort( Application.java:138) at org.apache.hadoop.mapred.pipes.PipesMapRunner.run( PipesMapRunner.java:83) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192) at org.apache.hadoop.mapred.TaskTracker$Child.main( TaskTracker.java:1787) Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:250) at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java :313) at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java :335) at org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run( BinaryProtocol.java:112) task_200803032218_0004_m_00_0: task_200803032218_0004_m_00_0: task_200803032218_0004_m_00_0: task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to open at /home/hadoop/hadoop-0.15.2-single-cluster /src/examples/pipes/impl/wordcount-nopipe.cc:67 in WordCountReader::WordCountReader(HadoopPipes::MapContext) Could anybody tell me how to fix this? That will be appreciated. Thanks a lot! -- View this message in context: http://www.nabble.com/Pipes-example-wordcount-nopipe.cc-failed-when-reading-from-input-splits-tp15807856p24084734.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: HDFS is not loading evenly across all nodes.
Yes, it will be kept on the machine where you issue the dfs -put command, if that machine is running a datanode. Otherwise, a random datanode will be chosen to store the data blocks.

On Fri, Jun 19, 2009 at 10:41 AM, Rajeev Gupta graj...@in.ibm.com wrote:

If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. Does that mean that if replication factor is 1, whole file will be kept on one node only? Thanks and regards. -Rajeev Gupta

Aaron Kimball <aa...@cloudera.com> wrote on 06/19/2009 01:56 AM (to core-user@hadoop.apache.org, subject: Re: HDFS is not loading evenly across all nodes.):

Did you run the dfs put commands from the master node? If you're inserting into HDFS from a machine running a DataNode, the local datanode will always be chosen as one of the three replica targets. For more balanced loading, you should use an off-cluster machine as the point of origin. If you experience uneven block distribution, you should also periodically rebalance your cluster by running bin/start-balancer.sh every so often. It will work in the background to move blocks from heavily-laden nodes to underutilized ones. - Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch qiming...@openresearchinc.com wrote: Hi all I dfs put a large dataset onto a 10-node cluster. When I observe the Hadoop progress (via web:50070) and each local file system (via df -k), I notice that my master node is hit 5-10 times harder than others, so hard drive is get full quicker than others. Last night load, it actually crash when hard drive was full. To my understand, data should wrap around all nodes evenly (in a round-robin fashion using 64M as a unit). Is it expected behavior of Hadoop? Can anyone suggest a good troubleshooting way? Thanks -- View this message in context: http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Hadoop error help- file system closed, could only be replicated to 0 nodes, instead of 1
HI , What seems from your details is that datanode is not running.can you run *bin/hadoop dfsadmin -report* and find out whether your datanodes are up ? then post your observation and it would be better if you post even your hadoop-site.xml file deatils also. Regards, Ashish. On Fri, Jun 19, 2009 at 3:16 AM, terrianne.erick...@accenture.com wrote: Hi, I am extremely new to Hadoop and have come across a few errors that I'm not sure how to fix. I am running Hadoop version 0.19.0 from an image through Elasticfox and S3. I am on windows and use puTTY as my ssh. I am trying to run a wordcount with 5 slaves. This is what I do so far: 1. boot up the instance through ElasticFox 2. cd /usr/local/hadoop-0.19.0 3. bin/hadoop namenode -format 4. bin/start-all.sh 5. jps --( shows jps, jobtracker, secondarynamenode) 6.bin/stop-all.sh 7. ant examples 8. bin/start-all.sh 9. bin/hadoop jar build/hadoop-0.19.0-examples.jar pi 0 100 Then I get this error trace: Number of Maps = 0 Samples per Map = 100 Starting Job 09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183) 09/06/18 17:31:25 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar retries left 4 09/06/18 17:31:25 INFO hdfs.DFSClient: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /mnt/hadoop/mapred/system/job_200906181730_0001/job.jar could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351) at sun.reflec,t.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at $Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy0.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815) at
A simple performance benchmark for Hadoop, Hive and Pig
Hi all,

Yuntao Jia, our intern this summer, did a simple performance benchmark for Hadoop, Hive and Pig based on the queries in the SIGMOD 2009 paper "A Comparison of Approaches to Large-Scale Data Analysis". The report and the performance test kit are both attached here: http://issues.apache.org/jira/browse/HIVE-396

We tried our best to get good performance out of Hive and Pig, and we kept the Hadoop program as close as possible to the one from the SIGMOD paper. We welcome all suggestions on how we can improve the performance further, whether by changing the configuration or by improving the code.

While we tried our best to be fair, system settings and environments do affect the results a lot, so we encourage everybody to try out the performance test kit on their own cluster, and we will appreciate it if everybody can share their results.

Here is the summary. The details are in the report hive_benchmark_2009-06-18.pdf from the link above.

Query: GREP SELECT                Hadoop: 136.1s   Hive: 125.4s   Pig: 247.8s
Query: RANKINGS SELECT            Hadoop: 26.1s    Hive: 31.0s    Pig: 38.4s
Query: USERVISITS AGGREGATION     Hadoop: 533.8s   Hive: 768.8s   Pig: 855.4s
Query: RANKINGS USERVISITS JOIN   Hadoop: 470.0s   Hive: 471.3s   Pig: 763.9s

Please take a look at hive_benchmark_2009-06-18.pdf from the link above for details. Let's keep discussions on http://issues.apache.org/jira/browse/HIVE-396 so it's easier to keep track.

Zheng