Intermittent DataStreamer Exception while appending to file inside HDFS
Hi there,

I have the following exception while appending to an existing file in my HDFS. The error appears intermittently: when it does not show up, I can append the file successfully; when it appears, I cannot append the file. Here is the error: https://gist.github.com/arinto/d37a56f449c61c9d1d9c

For your convenience, here it is:

13/10/10 14:17:30 WARN hdfs.DFSClient: DataStreamer Exception
java.io.IOException: Failed to add a datanode. User may turn off this feature by setting dfs.client.block.write.replace-datanode-on-failure.policy in configuration, where the current policy is DEFAULT. (Nodes: current=[10.0.106.82:50010, 10.0.106.81:50010], original=[10.0.106.82:50010, 10.0.106.81:50010])
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.findNewDatanode(DFSOutputStream.java:778)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.addDatanode2ExistingPipeline(DFSOutputStream.java:838)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.setupPipelineForAppendOrRecovery(DFSOutputStream.java:934)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:461)

Some configuration files:
1. hdfs-site.xml: https://gist.github.com/arinto/f5f1522a6f6994ddfc17#file-hdfs-append-datastream-exception-hdfs-site-xml
2. core-site.xml: https://gist.github.com/arinto/0c6f40872181fe26f8b1#file-hdfs-append-datastream-exception-core-site-xml

So, any idea how to solve this issue?

Some links that I've found (but unfortunately they do not help):
1. StackOverflow: http://stackoverflow.com/questions/15347799/java-io-ioexception-failed-to-add-a-datanode-hdfs-hadoop — our replication factor is 3 and we've never changed it since we set up the cluster.
2. Impala-User mailing list: https://groups.google.com/a/cloudera.org/forum/#!searchin/impala-user/DataStreamer$20exception/impala-user/u2CN163Cyfc/_OcRqBYL2B4J — the error there is due to a replication factor set to 1. In our case, we're using replication factor = 3.

Best regards,
Arinto
www.otnira.com
TestHDFSCLI error
I use CDH4.3.1 and run the TestHDFSCLI unit test, but I get the errors below:

2013-10-10 13:05:39,671 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(156)) - ---
2013-10-10 13:05:39,671 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(157)) - Test ID: [1]
2013-10-10 13:05:39,671 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(158)) - Test Description: [ls: file using absolute path]
2013-10-10 13:05:39,671 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(159)) -
2013-10-10 13:05:39,671 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs hdfs://localhost.localdomain:41053 -touchz /file1]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(163)) - Test Commands: [-fs hdfs://localhost.localdomain:41053 -ls /file1]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(167)) -
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(170)) - Cleanup Commands: [-fs hdfs://localhost.localdomain:41053 -rm /file1]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(174)) -
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(178)) - Comparator: [TokenComparator]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(180)) - Comparision result: [pass]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(182)) - Expected output: [Found 1 items]
2013-10-10 13:05:39,672 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(184)) - Actual output: [Found 1 items
-rw-r--r-- 1 musa.ll supergroup 0 2013-10-10 13:04 /file1
]
2013-10-10 13:05:39,673 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(178)) - Comparator: [RegexpComparator]
2013-10-10 13:05:39,673 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(180)) - Comparision result: [fail]
2013-10-10 13:05:39,673 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(182)) - Expected output: [^-rw-r--r--( )*1( )*[a-z]*( )*supergroup( )*0( )*[0-9]{4,}-[0-9]{2,}-[0-9]{2,} [0-9]{2,}:[0-9]{2,}( )*/file1]
2013-10-10 13:05:39,673 INFO cli.CLITestHelper (CLITestHelper.java:displayResults(184)) - Actual output: [Found 1 items
-rw-r--r-- 1 musa.ll supergroup 0 2013-10-10 13:04 /file1
]

How can I handle the error?

Thanks,
LiuLei
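A likely cause, given the output above: the expected pattern allows only [a-z]* for the file owner, while the actual owner musa.ll contains dots, so the RegexpComparator can never match. This can be checked with java.util.regex (the relaxed pattern below only illustrates the mismatch; it is not the project's official fix):

```java
import java.util.regex.Pattern;

public class LsRegexCheck {
    // Pattern from the failing TestHDFSCLI case: owner must match [a-z]*
    static final Pattern STRICT = Pattern.compile(
        "^-rw-r--r--( )*1( )*[a-z]*( )*supergroup( )*0( )*"
        + "[0-9]{4,}-[0-9]{2,}-[0-9]{2,} [0-9]{2,}:[0-9]{2,}( )*/file1");
    // Hypothetical variant that also admits '.' in the owner name
    static final Pattern RELAXED = Pattern.compile(
        "^-rw-r--r--( )*1( )*[a-z.]*( )*supergroup( )*0( )*"
        + "[0-9]{4,}-[0-9]{2,}-[0-9]{2,} [0-9]{2,}:[0-9]{2,}( )*/file1");

    public static void main(String[] args) {
        String actual = "-rw-r--r-- 1 musa.ll supergroup 0 2013-10-10 13:04 /file1";
        System.out.println(STRICT.matcher(actual).find());  // false: '.' breaks [a-z]*
        System.out.println(RELAXED.matcher(actual).find()); // true
    }
}
```

If that is the cause, running the test as a user whose name contains only lowercase letters, or adjusting the expected regex in the test configuration, should make the comparator pass.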
RE: Intermittent DataStreamer Exception while appending to file inside HDFS
Hi Arinto,

Please disable this feature on smaller clusters:

dfs.client.block.write.replace-datanode-on-failure.policy

The reason for this exception is that you have replication set to 3, and from the logs it looks like you have only 2 nodes in the cluster. When the pipeline is first created, we do not do any verification of whether the pipeline DNs meet the replication factor. The property above only controls replacing a DN on failure; additionally, however, we take advantage of it to verify this condition when we reopen the pipeline for append. So here, unfortunately, the existing DNs cannot meet the replication factor, and the client will try to add another node. Since you do not have any extra nodes in the cluster beyond the already-selected ones, it will fail. With the current configuration you cannot append.

Also please take a look at the default configuration description:

<name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
<value>true</value>
<description>
  If there is a datanode/network failure in the write pipeline, DFSClient will try to remove the failed datanode from the pipeline and then continue writing with the remaining datanodes. As a result, the number of datanodes in the pipeline is decreased. The feature is to add new datanodes to the pipeline.

  This is a site-wide property to enable/disable the feature.

  When the cluster size is extremely small, e.g. 3 nodes or less, cluster administrators may want to set the policy to NEVER in the default configuration file or disable this feature. Otherwise, users may experience an unusually high rate of pipeline failures since it is impossible to find new datanodes for replacement.

  See also dfs.client.block.write.replace-datanode-on-failure.policy
</description>

Make this configuration false at your client side.
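For reference, the client-side setting might look like this in hdfs-site.xml (a sketch based on the description above; either disable the feature outright or keep it enabled with the policy set to NEVER):

```xml
<!-- Client-side hdfs-site.xml sketch for small clusters (2-3 datanodes):
     disable datanode replacement on pipeline failure entirely... -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.enable</name>
  <value>false</value>
</property>

<!-- ...or, alternatively, keep the feature enabled but never replace: -->
<property>
  <name>dfs.client.block.write.replace-datanode-on-failure.policy</name>
  <value>NEVER</value>
</property>
```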
Regards,
Uma
Re: Problem with streaming exact binary chunks
Hello,

Thanks a lot for the information. It helped me figure out the solution to this problem. I posted a sketch of the solution on StackOverflow (http://stackoverflow.com/a/19295610/337194) for anybody who is interested.

Best regards,
Youssef Hatem

On Oct 9, 2013, at 14:08, Peter Marron wrote:

Hi,

The only way that I could find was to override the various InputWriter and OutputWriter classes, as defined by the configuration settings:

stream.map.input.writer.class
stream.map.output.reader.class
stream.reduce.input.writer.class
stream.reduce.output.reader.class

which was painful. Hopefully someone will tell you the _correct_ way to do this. If not I will provide more details.

Regards,
Peter Marron
Trillium Software UK Limited
Tel: +44 (0) 118 940 7609
Fax: +44 (0) 118 940 7699
E: peter.mar...@trilliumsoftware.com

-----Original Message-----
From: Youssef Hatem [mailto:youssef.ha...@rwth-aachen.de]
Sent: 09 October 2013 12:14
To: user@hadoop.apache.org
Subject: Problem with streaming exact binary chunks

Hello,

I wrote a very simple InputFormat and RecordReader to send binary data to mappers. The binary data can contain anything (including \n, \t, \r); here is what next() may actually send:

public class MyRecordReader implements RecordReader<BytesWritable, BytesWritable> {
  ...
  public boolean next(BytesWritable key, BytesWritable ignore) throws IOException {
    ...
    byte[] result = new byte[8];
    for (int i = 0; i < result.length; ++i)
      result[i] = (byte)(i+1);
    result[3] = (byte)'\n';
    result[4] = (byte)'\n';
    key.set(result, 0, result.length);
    return true;
  }
}

As you can see, I am using BytesWritable to send eight bytes: 01 02 03 0a 0a 06 07 08. I also use HADOOP-1722 typed bytes (by setting -D stream.map.input=typedbytes).

According to the documentation of typed bytes, the mapper should receive the following byte sequence:

00 00 00 08 01 02 03 0a 0a 06 07 08

However, the bytes are somehow modified and I get the following sequence instead:

00 00 00 08 01 02 03 09 0a 09 0a 06 07 08

(0a = '\n', 09 = '\t')

It seems that Hadoop (streaming?) parsed the newline character as a separator and inserted '\t', which I assume is the key/value separator for streaming. Is there any workaround to send *exactly* the same byte sequence no matter what characters are in the sequence?

Thanks in advance.

Best regards,
Youssef Hatem
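As a reference point for the framing described above (a 4-byte big-endian length prefix followed by the raw payload, with no translation of 0x0a or 0x09), here is a plain java.io sketch. It only illustrates the byte sequence the mapper should receive; it is not Hadoop's actual typed-bytes writer:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FrameDemo {
    // Length-prefixed framing: a 4-byte big-endian length followed by the
    // raw payload bytes, with no text-oriented translation of '\n' or '\t'.
    static byte[] frame(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length); // 00 00 00 08 for an 8-byte payload
        out.write(payload);           // payload bytes pass through unmodified
        out.flush();
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = {1, 2, 3, '\n', '\n', 6, 7, 8};
        for (byte b : FrameDemo.frame(payload)) System.out.printf("%02x ", b);
        System.out.println(); // 00 00 00 08 01 02 03 0a 0a 06 07 08
    }
}
```

Any extra 09 bytes in the stream, as in the observed output, mean something between the RecordReader and the mapper is still treating the data as text, which is what the InputWriter/OutputWriter overrides avoid.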
Read Avro schema automatically?
Hi,

We are working on building a MapReduce program that takes Avro input from HDFS, gets the timestamp, and counts the number of events written on any given day. We would like the program not to need the Avro schema declared in advance; ideally, it would read the schema, determine the data types from it, and assign them for the mapper. Is this possible?

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
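This should be possible: Avro container files embed the writer's schema in the file header, so a job can recover it at setup time without any prior declaration. A minimal sketch with the Avro Java library (the file name and the "timestamp" field name are assumptions for illustration):

```java
import java.io.File;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class ReadAvroSchema {
    public static void main(String[] args) throws IOException {
        // The container file header carries the writer's schema, so no
        // schema needs to be compiled into the job ahead of time.
        File avroFile = new File("events.avro"); // hypothetical input file
        DataFileReader<GenericRecord> reader =
            new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>());
        Schema schema = reader.getSchema();

        // Look up the field to group on; "timestamp" is an assumed name.
        Schema.Field ts = schema.getField("timestamp");
        System.out.println(ts.schema().getType());
        reader.close();
    }
}
```

Once the field and its type are known, the mapper can read GenericRecord values, truncate the timestamp to a day, and emit day/1 pairs for the count.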
Hadoop-2.0.1 log files deletion
Hi there,

I was running some mapreduce jobs on hadoop-2.1.0-beta. These are multiple unit tests that can take more than a day to finish running. However, I realized the logs for the jobs are being deleted more quickly than the default 24-hour setting of the mapreduce.job.userlog.retain.hours property in mapred-site.xml; some of the job logs were deleted after 4 hours. Could this be a bug, or if not, is there another property that overrides this?

Thank you.
Reyane OUKPEDJO
Re: Hadoop-2.0.1 log files deletion
Hi Reyane,

Did you try yarn.nodemanager.log.retain-seconds? Increasing that might help. The default value is 10800 seconds, which means 3 hours.

Thanks,
Kishore
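To match the 24 hours you expected, a yarn-site.xml fragment might look like this (86400 = 24 * 3600; this property applies when log aggregation is not enabled):

```xml
<!-- yarn-site.xml: keep NodeManager logs for 24 hours instead of the
     default 10800 seconds (3 hours). -->
<property>
  <name>yarn.nodemanager.log.retain-seconds</name>
  <value>86400</value>
</property>
```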
Re: Java version with Hadoop 2.0
We recently switched all our production clusters to JDK7, off the EOL'd JDK6. The one big gotcha (and this was *not* specifically a problem with the Hadoop framework, but you may have issues with your own applications or clients) is with the Java 7 bytecode verifier, which can be disabled with -XX:-UseSplitVerifier: http://chrononsystems.com/blog/java-7-design-flaw-leads-to-huge-backward-step-for-the-jvm

-JR

On Wed, Oct 9, 2013 at 5:57 PM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote:

Does that mean for the new cluster, we should probably start to aim to test/use/deploy on Java 7?

On Oct 9, 2013, at 3:05 PM, Andre Kelpe ake...@concurrentinc.com wrote:

Also keep in mind that Java 6 no longer gets public updates from Oracle: http://www.oracle.com/technetwork/java/eol-135779.html

- André

On Wed, Oct 9, 2013 at 11:48 PM, SF Hadoop sfhad...@gmail.com wrote:

I hadn't. Thank you!!! Very helpful.

Andy

On Wed, Oct 9, 2013 at 2:25 PM, Patai Sangbutsarakum patai.sangbutsara...@turn.com wrote:

Maybe you've already seen this: http://wiki.apache.org/hadoop/HadoopJavaVersions

On Oct 9, 2013, at 2:16 PM, SF Hadoop sfhad...@gmail.com wrote:

I am preparing to deploy multiple clusters/distros of Hadoop for testing/benchmarking. In my research I have noticed discrepancies in the version of the JDK that various groups are using. For example: Hortonworks suggests JDK6u31, CDH recommends either 6 or 7 provided you stick to some guidelines for each, and Apache Hadoop seems to be somewhat of a no-man's land, with a lot of people using a lot of different versions. Does anyone have any insight they could share about how to approach choosing the best JDK release? (I'm a total Java newb, so any info / further reading you guys can provide is appreciated.)

Thanks.
sf

--
André Kelpe
an...@concurrentinc.com
http://concurrentinc.com
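For client-side JVMs, the -XX:-UseSplitVerifier flag mentioned above can be passed through the usual Hadoop environment hook; a sketch (whether your own bytecode actually trips the stricter Java 7 verification is application-specific):

```shell
# Hypothetical sketch: disable the Java 7 split verifier for Hadoop
# client JVMs, e.g. in hadoop-env.sh or the invoking shell.
export HADOOP_CLIENT_OPTS="-XX:-UseSplitVerifier ${HADOOP_CLIENT_OPTS:-}"
echo "$HADOOP_CLIENT_OPTS"
```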
Re: Hadoop-2.0.1 log files deletion
Thanks, problem solved.

Reyane OUKPEDJO
Improving MR job disk IO
Hi,

I have a simple Grep job (from the bundled examples) that I am running on an 11-node cluster. Each node has 2x 8-core Intel Xeons (shows 32 CPUs with HT on), 64 GB RAM and 8 x 1 TB disks. I have mappers set to 20 per node. When I run the Grep job, I notice that CPU gets pegged at 100% on multiple cores, but disk throughput remains a dismal 1-2 MB/s on a single disk on each node. So I guess the cluster is performing poorly in terms of disk IO. Running Terasort, I see each disk put out 25-35 MB/s, with a total cluster throughput above 1.5 GB/s.

How do I go about re-configuring or re-writing the job to utilize maximum disk IO?

TIA,
Xuri
Re: Improving MR job disk IO
Actually... I believe that is expected behavior. Since your CPU is pegged at 100%, you're not going to be IO bound. Jobs typically tend to be either CPU bound or IO bound: if you're CPU bound, you expect to see low IO throughput; if you're IO bound, you expect to see low CPU usage.
Re: Improving MR job disk IO
Thanks Pradeep. Does it mean this job is a bad candidate for MR? Interestingly, running the command-line /bin/grep under a streaming job provides (1) much better disk throughput and (2) CPU load that is spread almost evenly across all cores/threads (no CPU gets pegged at 100%).
Re: Improving MR job disk IO
I don't think it necessarily means that the job is a bad candidate for MR; it's a different type of workload. Hortonworks has a great article on the different types of workloads you might see and how they affect your provisioning choices at http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

I have not looked at the Grep code, so I'm not sure why it's behaving the way it is. It is still curious that streaming has higher IO throughput and lower CPU usage. It may have to do with the fact that /bin/grep is a native implementation, while Grep (Hadoop) is probably using the Java Pattern/Matcher API.
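The Pattern/Matcher point can be made concrete: a Grep-style job is essentially a regex scan per record, which keeps the CPU busy long before the disks saturate. A minimal stand-alone sketch of that shape (the sample records are made up):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GrepSketch {
    // Scan records with a precompiled Pattern, the general shape of the
    // bundled Grep example's map phase; the cost is in the matcher, not IO.
    static long count(String[] records, String regex) {
        Pattern p = Pattern.compile(regex); // compile once, not per record
        long hits = 0;
        for (String r : records) {
            Matcher m = p.matcher(r);
            while (m.find()) hits++;
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] records = {"error: disk full", "ok", "error: timeout"};
        System.out.println(count(records, "error")); // 2
    }
}
```

Profiling where the map task spends its time (regex matching vs. decompression vs. reads) would confirm whether the job is genuinely CPU bound.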
Re: Improving MR job disk IO
On Thu, Oct 10, 2013 at 1:27 PM, Pradeep Gollakota pradeep...@gmail.com wrote:

> I don't think it necessarily means that the job is a bad candidate for MR. It's a different type of a workload. Hortonworks has a great article on the different types of workloads you might see and how that affects your provisioning choices at http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.3.2/bk_cluster-planning-guide/content/ch_hardware-recommendations.html

One statement that stood out to me in the link above is: "For these reasons, Hortonworks recommends that you either use the Balanced workload configuration or invest in a pilot Hadoop cluster and plan to evolve as you analyze the workload patterns in your environment." Now, this is not a critique/concern of HW but rather of Hadoop. What if my workloads can be both CPU and IO intensive? Do I take the approach of throw-enough-excess-hardware-just-in-case?

> I have not looked at the Grep code so I'm not sure why it's behaving the way it is. Still curious that streaming has a higher IO throughput and lower CPU usage. It may have to do with the fact that /bin/grep is a native implementation and Grep (Hadoop) is probably using Java Pattern/Matcher api.

The Grep code is from the bundled examples in CDH. I made a one-line modification for it to read Sequence files. The streaming job probably does not have lower CPU utilization, but I see that it does even out the CPU utilization among all the available processors. I guess the native grep binary threads better than the Java MR job? Which brings me to ask: if you have the mapper/reducer functionality built into a platform-specific binary, won't it always be more efficient than a Java MR job? And in such cases, am I better off with streaming than with Java MR?

Thanks for your responses.
Conflicting dependency versions
Hi,

I have a yarn application that launches a mapreduce job with a mapper that uses a newer version of Guava than the one Hadoop is using. Because of this, the mapper fails with a NoSuchMethod exception. Is there a way to indicate that application dependencies should be used over Hadoop's dependencies?

Thanks,
Albert
Re: Conflicting dependency versions
Hi Albert,

If you are using the distributed cache to push the newer version of the Guava jars, you can try setting mapreduce.job.user.classpath.first to true. If not, you can try overriding the value of mapreduce.application.classpath to ensure that the dir where the newer Guava jars are present is referenced first in the classpath.

-- Hitesh
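A sketch of the first suggestion as a job configuration fragment (shipping the jar itself via the distributed cache, e.g. with -libjars, is a separate step):

```xml
<!-- Hypothetical job configuration: prefer the user's jars (such as a
     newer Guava pushed through the distributed cache) over the versions
     already on Hadoop's task classpath. -->
<property>
  <name>mapreduce.job.user.classpath.first</name>
  <value>true</value>
</property>
```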
Re: Intermittent DataStreamer Exception while appending to file inside HDFS
Thank you for the comprehensive answer.

When I inspect our NameNode UI, I see that 3 datanodes are up. However, as you mentioned, the log only shows 2 datanodes. Does that mean one of the datanodes was unreachable when we tried to append to the file?

Best regards,
Arinto
www.otnira.com
State of Art in Hadoop Log aggregation
Hi Guys,

We have a fairly decent-sized Hadoop cluster of about 200 nodes, and I was wondering what the state of the art is if I want to aggregate and visualize Hadoop ecosystem logs, particularly:

1. Tasktracker logs
2. Datanode logs
3. HBase RegionServer logs

One way is to use something like Flume on each node to aggregate the logs, and then something like Kibana (http://www.elasticsearch.org/overview/kibana/) to visualize the logs and make them searchable. However, I don't want to write another ETL for the hadoop/hbase logs themselves.

We currently log in to each machine individually to 'tail -F' the logs when there is a Hadoop problem on a particular node. We want a better way to look at the Hadoop logs in a centralized way when there is an issue, without having to log in to 100 different machines, and I was wondering what the state of the art is in this regard.

Suggestions/pointers are very welcome!!

Sagar
Re: State of Art in Hadoop Log aggregation
You can try Chukwa, which is one of the incubating projects under Apache. I tried it before and liked it for aggregating logs.