ignoring map task failure
Hi All, I am running a job with between 1300 and 1400 map tasks. Some map tasks fail due to errors, and when 4 such maps fail the job naturally gets killed. How can I ignore the failed tasks and carry on executing the other map tasks? I am okay with losing some data for the failed tasks. Thanks, Parnab
Re: Data cleansing in modern data architecture
Hi Bob, the answer to your original question depends entirely on the procedures and conventions set forth for your data warehouse, so only you can answer it. If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you better off writing an exception file?
- How do you make sure it is always respected?
- etc.
Best regards, Jens
Re: ignoring map task failure
Check the parameter yarn.app.mapreduce.client.max-retries. On 8/18/14, parnab kumar parnab.2...@gmail.com wrote: [original question quoted above]
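For the specific goal of tolerating a few failed maps rather than retrying the client, the job-level knob in Hadoop 2.x is usually mapreduce.map.failures.maxpercent (a sketch, not from the thread; the value 5 is an illustrative choice — data for the failed splits is simply lost, which matches Parnab's stated trade-off):

```xml
<!-- mapred-site.xml or per-job configuration: let up to 5% of the
     job's map tasks fail permanently without killing the job. -->
<property>
  <name>mapreduce.map.failures.maxpercent</name>
  <value>5</value>
</property>
```

The same setting can be passed per job on the command line with -D mapreduce.map.failures.maxpercent=5.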
Re: Top K words problem
Google for streaming algorithms, and also stream processing, to get ideas. Best regards, Jens
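As one illustration of the streaming approach Jens alludes to (a sketch, not from the thread): the Misra-Gries summary finds candidate top-k words in a single pass using only k counters, after which a second pass gives exact counts for the survivors.

```python
from collections import Counter

def misra_gries(stream, k):
    """One-pass heavy-hitter candidates using at most k counters.

    Any word occurring more than len(stream)/(k+1) times is
    guaranteed to survive in the returned candidate set.
    """
    counters = {}
    for word in stream:
        if word in counters:
            counters[word] += 1
        elif len(counters) < k:
            counters[word] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            for w in list(counters):
                counters[w] -= 1
                if counters[w] == 0:
                    del counters[w]
    return set(counters)

words = "a b a c a b d a b a".split()
candidates = misra_gries(words, 2)
# Second pass: exact counts for the surviving candidates only.
exact = {w: Counter(words)[w] for w in candidates}
```

In a MapReduce setting the first pass maps naturally onto mappers and the exact-count pass onto a second job over the small candidate set.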
Passing VM arguments while submitting a job in Hadoop YARN.
Hi All, when we submit a job using the bin/hadoop script on the resource manager node, what if we need a VM argument to be passed to our job, like 'target-env' in my case? How do I do that, and will this argument be passed to all the node managers on the different nodes?
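One common way to do this (a sketch; it assumes the property is needed by the map/reduce task JVMs, which the node managers launch on each node, rather than by the client process) is to put it into the task JVM options via mapreduce.map.java.opts and mapreduce.reduce.java.opts. The -Xmx value below is only an illustrative placeholder:

```xml
<!-- Every task JVM on every node manager is started with these
     options, so -Dtarget-env=prod reaches all tasks. Can also be
     passed per job: -D mapreduce.map.java.opts="-Dtarget-env=prod" -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m -Dtarget-env=prod</value>
</property>
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-Xmx1024m -Dtarget-env=prod</value>
</property>
```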
Re: hadoop/yarn and task parallelization on non-hdfs filesystems
OK, I figured out exactly what was happening. I had set the configuration value yarn.nodemanager.vmem-pmem-ratio to 10. Since there is no swap space available, every task requesting 2 GB of memory was also requesting an additional 20 GB of virtual memory. This 20 GB isn't represented in the Memory Used column on the YARN applications status page, so it seemed like I was underutilizing the YARN cluster, when in actuality I had allocated all the available memory. (The underutilization occurs regardless of using HDFS or LocalFileSystem; I must have made this configuration change after testing HDFS and before testing the local filesystem.)

The solution is to set yarn.nodemanager.vmem-pmem-ratio to 1 (since I have no swap) *and* yarn.nodemanager.vmem-check.enabled to false. Part of the reason I had set such a high ratio was that containers were being killed for virtual memory usage. The Cloudera folks have a good blog post [1] on this topic (see #6) and I wish I had read it sooner. With the above configuration values, I can now utilize the cluster at 100%. Thanks for everyone's input! Calvin

[1] http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

On Fri, Aug 15, 2014 at 2:11 PM, java8964 java8...@hotmail.com wrote:

Interesting to know. I also want to know what underlying logic is holding it to only 25-35 parallelized containers, instead of up to 1300. Another suggestion I can give is the following:
1) In your driver, generate a text file listing all 1300 bz2 file names with absolute paths.
2) In your MR job, use NLineInputFormat; with the default setting, each line of content triggers one mapper task.
3) In your mapper, the key/value pair will be byte offset/line content; just start processing the file, as it should be available from the mount path on the local data nodes.
4) I assume that you are using YARN. In this case, at least 1300 container requests will be issued to the cluster. You generate 1300 parallelized requests; it is then up to the cluster to decide how many containers can run in parallel. Yong

Date: Fri, 15 Aug 2014 12:30:09 -0600 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems From: iphcal...@gmail.com To: user@hadoop.apache.org

Thanks for the responses! To clarify, I'm not using any special FileSystem implementation. An example input parameter to a MapReduce job would be something like -input file:///scratch/data, so I think (any clarification would be helpful) Hadoop is then using LocalFileSystem (org.apache.hadoop.fs.LocalFileSystem). The input data is large enough and splittable (1300 .bz2 files, 274 MB each, 350 GB total). Even if the input data weren't splittable, Hadoop should be able to parallelize up to 1300 map tasks if capacity is available; in my case, I find the cluster is not fully utilized when not using HDFS (~25-35 containers running when it can scale up to ~80), while achieving maximum use when using HDFS. I'm wondering if Hadoop is holding back or throttling the I/O when LocalFileSystem is used, and what changes I can make to have the Hadoop tasks scale. In the meantime, I'll take a look at the API calls that Harsh mentioned.

On Fri, Aug 15, 2014 at 10:15 AM, Harsh J ha...@cloudera.com wrote:

The split configurations in FIF mentioned earlier would work for local files as well. They aren't deemed unsplittable, just considered as one single block. If the FS in use has its own advantages, it's better to implement a proper interface to it making use of them than to rely on the LFS by mounting it. This is what we do with HDFS.

On Aug 15, 2014 8:52 PM, java8964 java8...@hotmail.com wrote:

I believe Calvin mentioned before that this parallel file system is mounted into the local file system. In this case, will Hadoop just use java.io.File as the local file system, treat them as local files, and not split the file? I just want to know the logic in Hadoop for handling local files. One suggestion I can think of is to split the files manually outside of Hadoop, for example generating lots of small files of 128 MB or 256 MB. Then each mapper will process one small file, so you can get good utilization of your cluster, assuming you have a lot of small files. Yong

From: ha...@cloudera.com Date: Fri, 15 Aug 2014 16:45:02 +0530 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems To: user@hadoop.apache.org

Does your non-HDFS filesystem implement a getBlockLocations API, which MR relies on to know how to split files? The API is at http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus, long, long), and MR calls it at
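Calvin's arithmetic can be checked directly (a sketch using the numbers from the thread; YARN's virtual-memory check caps each container at its physical request times yarn.nodemanager.vmem-pmem-ratio):

```python
def vmem_limit_mb(pmem_request_mb, vmem_pmem_ratio):
    """Virtual-memory ceiling YARN enforces per container:
    the physical request scaled by yarn.nodemanager.vmem-pmem-ratio."""
    return pmem_request_mb * vmem_pmem_ratio

# A 2 GB container under ratio 10 may address up to 20 GB of
# virtual memory -- demand that a swapless node cannot back.
assert vmem_limit_mb(2048, 10) == 20480

# With ratio 1 (Calvin's fix for a no-swap cluster), the ceiling
# collapses back to the physical request itself.
assert vmem_limit_mb(2048, 1) == 2048
```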
Re: hadoop/yarn and task parallelization on non-hdfs filesystems
Oops, one of the settings should read yarn.nodemanager.vmem-check-enabled. The blog post has a typo, and a comment pointed that out as well. Thanks, Calvin

On Mon, Aug 18, 2014 at 4:45 PM, Calvin iphcal...@gmail.com wrote: [original message quoted above]
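Put together, with the property name corrected, the resulting yarn-site.xml fragment would look like this (a sketch of the settings described in the thread; the values assume a swapless cluster like Calvin's):

```xml
<!-- No swap available: cap virtual memory at the physical request
     and disable the virtual-memory check entirely. -->
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>1</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```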
Re: 100% CPU consumption by Resource Manager process
Hi Krishna, regarding "4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in your configuration? 50" — I think this config is problematic; too small a heartbeat interval will cause the NMs to contact the RM too often. I would suggest setting this value higher, e.g. 1000. Thanks, Wangda

On Wed, Aug 13, 2014 at 4:42 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:

Hi Wangda, thanks for the reply; here are the details, please see if you could suggest anything.
1) Number of nodes and running apps in the cluster: 2 nodes, and I am running my own application that keeps asking for containers, then a) runs something on the containers, b) releases the containers, c) asks for more containers with an incremented priority value, and repeats the same process.
2) What's the version of your Hadoop? Apache hadoop-2.4.0
3) Have you set yarn.scheduler.capacity.schedule-asynchronously.enable=true? No
4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in your configuration? 50

On Tue, Aug 12, 2014 at 12:44 PM, Wangda Tan wheele...@gmail.com wrote:

Hi Krishna, to get a better understanding of the problem, could you please share the following information:
1) Number of nodes and running apps in the cluster
2) What's the version of your Hadoop?
3) Have you set yarn.scheduler.capacity.schedule-asynchronously.enable=true?
4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in your configuration?
Thanks, Wangda Tan

On Sun, Aug 10, 2014 at 11:29 PM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:

Hi, my YARN resource manager is consuming 100% CPU when I run an application that runs for about 10 hours, requesting as many as 27000 containers. The CPU consumption was very low at the start of my application and gradually went up to over 100%. Is this a known issue, or are we doing something wrong? Every dump of the Event Processor thread is running LeafQueue::assignContainers(), specifically the for loop below from LeafQueue.java, and seems to be looping through some priority list.

// Try to assign containers to applications in order
for (FiCaSchedulerApp application : activeApplications) {
  ...
  // Schedule in priority order
  for (Priority priority : application.getPriorities()) {

3XMTHREADINFO ResourceManager Event Processor J9VMThread:0x01D08600, j9thread_t:0x7F032D2FAA00, java/lang/Thread:0x8341D9A0, state:CW, prio=5
3XMJAVALTHREAD (java/lang/Thread getId:0x1E, isDaemon:false)
3XMTHREADINFO1 (native thread ID:0x4B64, native priority:0x5, native policy:UNKNOWN)
3XMTHREADINFO2 (native stack address range from:0x7F0313DF8000, to:0x7F0313E39000, size:0x41000)
3XMCPUTIME *CPU usage total: 42334.614623696 secs*
3XMHEAPALLOC Heap bytes allocated since last GC cycle=20456 (0x4FE8)
3XMTHREADINFO3 Java callstack:
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.assignContainers(LeafQueue.java:850(Compiled Code))
5XESTACKTRACE (entered lock: org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp@0x8360DFE0, entry count: 1)
5XESTACKTRACE (entered lock: org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue@0x833B9280, entry count: 1)
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainersToChildQueues(ParentQueue.java:655(Compiled Code))
5XESTACKTRACE (entered lock: org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80, entry count: 2)
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainers(ParentQueue.java:569(Compiled Code))
5XESTACKTRACE (entered lock: org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80, entry count: 1)
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:831(Compiled Code))
5XESTACKTRACE (entered lock: org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler@0x834037C8, entry count: 1)
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:878(Compiled Code))
4XESTACKTRACE at org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:100(Compiled Code))
4XESTACKTRACE at
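Wangda's point is easy to quantify (a sketch with the numbers from the thread): the RM's event processor must absorb one heartbeat per node per interval, so shrinking the interval multiplies scheduler work for the same cluster.

```python
def heartbeats_per_second(num_nodes, interval_ms):
    """NM -> RM heartbeat events the scheduler must process per second."""
    return num_nodes * 1000 / interval_ms

# 2 nodes at the reported 50 ms interval: 40 scheduling events/s ...
assert heartbeats_per_second(2, 50) == 40
# ... versus 2 events/s at the suggested 1000 ms interval.
assert heartbeats_per_second(2, 1000) == 2
```

Combined with a long-running app accumulating many priorities for assignContainers() to scan, each of those events gets progressively more expensive, which matches the gradual rise to 100% CPU.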
Re: 100% CPU consumption by Resource Manager process
Thanks Wangda, I think I had reduced this when I was trying to reduce the container allocation time. -Kishore

On Tue, Aug 19, 2014 at 7:39 AM, Wangda Tan wheele...@gmail.com wrote: [original message quoted above]
mismatch layout version.
I installed version 2.4.1 and found the following information in the datanode log file. But the cluster is healthy.

*** datanode log
2014-07-21 15:43:27,365 INFO org.apache.hadoop.hdfs.server.common.Storage: Data-node version: -55 and name-node layout version: -56