ignoring map task failure

2014-08-18 Thread parnab kumar
Hi All,

   I am running a job with between 1300-1400 map tasks. Some
map tasks fail due to errors. When 4 such maps fail, the job naturally
gets killed. How can I ignore the failed tasks and go on executing the
other map tasks? I am okay with losing some data for the failed tasks.

Thanks,
Parnab


Re: Data cleansing in modern data architecture

2014-08-18 Thread Jens Scheidtmann
Hi Bob,

the answer to your original question depends entirely on the procedures and
conventions set forth for your data warehouse. So only you can answer it.

If you're asking for best practices, it still depends:
- How large are your files?
- Have you enough free space for recoding?
- Are you better off writing an exception file?
- How do you make sure it is always respected?
- etc.

Best regards,

Jens


Re: ignoring map task failure

2014-08-18 Thread Susheel Kumar Gadalay
Check the parameter yarn.app.mapreduce.client.max-retries.
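If the goal is to let the job succeed despite a handful of failed maps, another knob worth checking is mapreduce.map.failures.maxpercent, the percentage of map tasks allowed to fail before the whole job is failed. A hedged sketch for the job configuration (verify the property name against mapred-default.xml for your Hadoop version):

```xml
<!-- Sketch only: allow up to 5% of map tasks to fail without failing
     the job. Check mapred-default.xml for your Hadoop version. -->
<property>
  <name>mapreduce.map.failures.maxpercent</name>
  <value>5</value>
</property>
```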

On 8/18/14, parnab kumar parnab.2...@gmail.com wrote:
 Hi All,

I am running a job with between 1300-1400 map tasks. Some
 map tasks fail due to errors. When 4 such maps fail, the job naturally
 gets killed. How can I ignore the failed tasks and go on executing the
 other map tasks? I am okay with losing some data for the failed tasks.

 Thanks,
 Parnab



Re: Top K words problem

2014-08-18 Thread Jens Scheidtmann
Google for "streaming algorithms" and also "stream processing" to get ideas.

Best regards, Jens
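As a hedged, non-streaming baseline (an exact count, not a bounded-memory streaming sketch such as Misra-Gries or Count-Min, which the streaming-algorithms literature covers), top-K by word frequency might look like this in plain Java:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class TopK {
    // Count each word's frequency, then keep the k most frequent words,
    // most frequent first. Exact, but needs memory for all distinct words.
    static List<String> topK(Stream<String> words, int k) {
        Map<String, Long> counts = words.collect(
                Collectors.groupingBy(w -> w, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a" appears 3 times, "b" twice, "c" once.
        System.out.println(topK(Stream.of("a", "b", "a", "c", "b", "a"), 2));
        // prints [a, b]
    }
}
```

For inputs too large to count exactly, a streaming sketch trades a bounded error for O(k) memory, which is the idea the search terms above lead to.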


Passing VM arguments while submitting a job in Hadoop YARN.

2014-08-18 Thread S.L
Hi All,

We submit a job using the bin/hadoop script on the resource manager node.
What if we need to pass a VM argument to our job, in my case
'target-env'? How do I do that, and will this argument be passed to all
the node managers on different nodes?
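One common approach (a hedged sketch; the property names are from mapred-default.xml, and 'target-env' with a placeholder value is the asker's own property) is to put the JVM option into the task options so every container JVM receives it:

```xml
<!-- Sketch: pass -Dtarget-env to every map task JVM. Reduce tasks and
     the AM have analogous properties: mapreduce.reduce.java.opts and
     yarn.app.mapreduce.am.command-opts. -->
<property>
  <name>mapreduce.map.java.opts</name>
  <value>-Xmx1024m -Dtarget-env=staging</value>
</property>
```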


Re: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-18 Thread Calvin
OK, I figured out exactly what was happening.

I had set the configuration value yarn.nodemanager.vmem-pmem-ratio
to 10. Since there is no swap space available for use, every task
which is requesting 2 GB of memory is also requesting an additional 20
GB of memory. This 20 GB isn't represented in the Memory Used column
on the YARN applications status page and thus it seemed like I was
underutilizing the YARN cluster (when in actuality I had allocated all
the memory available).

(The cluster underutilization occurs regardless of using HDFS or
LocalFileSystem; I must have made this configuration change after
testing HDFS and before testing the local filesystem.)

The solution is to set  yarn.nodemanager.vmem-pmem-ratio to 1 (since
I have no swap) *and* yarn.nodemanager.vmem-check.enabled to false.
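In yarn-site.xml, the two settings would look something like this (note that yarn-default.xml spells the second property yarn.nodemanager.vmem-check-enabled, as the follow-up in this thread points out):

```xml
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>1</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
```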

Part of the reason why I had set such a high setting was due to
containers being killed because of virtual memory usage. The Cloudera
folks have a good blog post [1] on this topic (see #6) and I wish I
had read that sooner.

With the above configuration values, I can now utilize the cluster at 100%.

Thanks for everyone's input!

Calvin

[1] 
http://blog.cloudera.com/blog/2014/04/apache-hadoop-yarn-avoiding-6-time-consuming-gotchas/

On Fri, Aug 15, 2014 at 2:11 PM, java8964 java8...@hotmail.com wrote:
 Interesting to know that.

 I also want to know what underlying logic limits this to only 25-35
 parallel containers, instead of up to 1300.

 Another suggestion I can give is the following:

 1) In your driver, generate a text file including all your 1300 bz2 file
 names with absolute paths.
 2) In your MR job, use NLineInputFormat; with the default setting, each
 line of content will trigger one mapper task.
 3) In your mapper, the key/value pair will be byte offset/line content;
 just start processing the file, as it should be available from the mount
 path on the local data nodes.
 4) I assume that you are using YARN. In this case, at least 1300 container
 requests will be issued to the cluster. You generate 1300 parallel
 requests; it is then up to the cluster to decide how many containers can
 run in parallel.

 Yong
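Step 1 of the suggestion above (generating the listing file for NLineInputFormat) could be sketched in plain Java; the directory and file names here are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListBz2Files {
    // Write the absolute path of every .bz2 file under dir into listFile,
    // one per line, so NLineInputFormat can hand one line to each mapper.
    // Returns the number of paths written.
    static long writeFileList(Path dir, Path listFile) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            List<String> lines = files
                    .filter(p -> p.toString().endsWith(".bz2"))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
            Files.write(listFile, lines);
            return lines.size();
        }
    }

    public static void main(String[] args) throws IOException {
        // Demo with a temporary directory standing in for the mount path.
        Path dir = Files.createTempDirectory("scratch");
        Files.createFile(dir.resolve("part-0001.bz2"));
        Files.createFile(dir.resolve("part-0002.bz2"));
        System.out.println(writeFileList(dir, dir.resolve("file-list.txt"))
                + " paths written");
    }
}
```

The resulting file is then the job input, with NLineInputFormat configured for one line per split as described in step 2.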

 Date: Fri, 15 Aug 2014 12:30:09 -0600

 Subject: Re: hadoop/yarn and task parallelization on non-hdfs filesystems
 From: iphcal...@gmail.com
 To: user@hadoop.apache.org


 Thanks for the responses!

 To clarify, I'm not using any special FileSystem implementation. An
 example input parameter to a MapReduce job would be something like
 -input file:///scratch/data. Thus I think (any clarification would
 be helpful) Hadoop is then utilizing LocalFileSystem
 (org.apache.hadoop.fs.LocalFileSystem).

 The input data is large enough and splittable (1300 .bz2 files, 274MB
 each, 350GB total). Thus even if the input data weren't splittable,
 Hadoop should be able to parallelize up to 1300 map tasks if capacity
 is available; in my case, I find that the Hadoop cluster is not fully
 utilized (i.e., ~25-35 containers running when it can scale up to ~80
 containers) when not using HDFS, while achieving maximum use when
 using HDFS.

 I'm wondering if Hadoop is holding back or throttling the I/O if
 LocalFileSystem is being used, and what changes I can make to have the
 Hadoop tasks scale.

 In the meantime, I'll take a look at the API calls that Harsh mentioned.


 On Fri, Aug 15, 2014 at 10:15 AM, Harsh J ha...@cloudera.com wrote:
  The split configurations in FIF mentioned earlier would work for local
  files as well. They aren't deemed unsplittable, just considered as one
  single block.

  If the FS in use has its own advantages, it's better to implement a
  proper interface to it that makes use of them than to rely on the LFS
  by mounting it. This is what we do with HDFS.
 
  On Aug 15, 2014 8:52 PM, java8964 java8...@hotmail.com wrote:
 
  I believe that Calvin mentioned before that this parallel file system is
  mounted into the local file system.

  In this case, will Hadoop just use java.io.File as the local file
  system, treat them as local files, and not split the files?

  I just want to know the logic in Hadoop for handling local files.

  One suggestion I can think of is to split the files manually outside of
  Hadoop. For example, generate lots of small files of 128M or 256M in
  size.

  In this case, each mapper will process one small file, so you can get
  good utilization of your cluster, assuming you have a lot of small
  files.
 
  Yong
 
   From: ha...@cloudera.com
   Date: Fri, 15 Aug 2014 16:45:02 +0530
   Subject: Re: hadoop/yarn and task parallelization on non-hdfs
   filesystems
   To: user@hadoop.apache.org
  
   Does your non-HDFS filesystem implement a getBlockLocations API, that
   MR relies on to know how to split files?
  
   The API is at
  
   http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,
   long, long), and MR calls it at
  
  
   

Re: hadoop/yarn and task parallelization on non-hdfs filesystems

2014-08-18 Thread Calvin
Oops, one of the settings should read
yarn.nodemanager.vmem-check-enabled. The blog post has a typo and a
comment pointed that out as well.

Thanks,
Calvin

On Mon, Aug 18, 2014 at 4:45 PM, Calvin iphcal...@gmail.com wrote:
 OK, I figured out exactly what was happening.

 I had set the configuration value yarn.nodemanager.vmem-pmem-ratio
 to 10. Since there is no swap space available for use, every task
 which is requesting 2 GB of memory is also requesting an additional 20
 GB of memory. This 20 GB isn't represented in the Memory Used column
 on the YARN applications status page and thus it seemed like I was
 underutilizing the YARN cluster (when in actuality I had allocated all
 the memory available).

 (The cluster underutilization occurs regardless of using HDFS or
 LocalFileSystem; I must have made this configuration change after
 testing HDFS and before testing the local filesystem.)

 The solution is to set  yarn.nodemanager.vmem-pmem-ratio to 1 (since
 I have no swap) *and* yarn.nodemanager.vmem-check.enabled to false.

 [...]

Re: 100% CPU consumption by Resource Manager process

2014-08-18 Thread Wangda Tan
Hi Krishna,

4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
your configuration?
50

I think this config is problematic; too small a heartbeat interval will
cause the NMs to contact the RM too often. I would suggest setting this
value larger, e.g. 1000.
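As a sketch, in yarn-site.xml on the resource manager:

```xml
<!-- Raise the NM heartbeat interval from 50 ms back toward the
     1000 ms default. -->
<property>
  <name>yarn.resourcemanager.nodemanagers.heartbeat-interval-ms</name>
  <value>1000</value>
</property>
```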

Thanks,
Wangda



On Wed, Aug 13, 2014 at 4:42 PM, Krishna Kishore Bonagiri 
write2kish...@gmail.com wrote:

 Hi Wangda,
   Thanks for the reply, here are the details, please see if you could
 suggest anything.

 1) Number of nodes and running app in the cluster
 2 nodes, and I am running my own application that keeps asking for
 containers,
 a) running something on the containers,
 b) releasing the containers,
 c) ask for more containers with incremented priority value, and repeat the
 same process

 2) What's the version of your Hadoop?
 apache hadoop-2.4.0

 3) Have you set
 yarn.scheduler.capacity.schedule-asynchronously.enable=true?
 No

 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
 your configuration?
 50




 On Tue, Aug 12, 2014 at 12:44 PM, Wangda Tan wheele...@gmail.com wrote:

 Hi Krishna,
 To get more understanding about the problem, could you please share
 following information:
 1) Number of nodes and running app in the cluster
 2) What's the version of your Hadoop?
 3) Have you set
 yarn.scheduler.capacity.schedule-asynchronously.enable=true?
 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms
 in your configuration?

 Thanks,
 Wangda Tan



 On Sun, Aug 10, 2014 at 11:29 PM, Krishna Kishore Bonagiri 
 write2kish...@gmail.com wrote:

 Hi,
   My YARN resource manager is consuming 100% CPU when I am running an
 application that is running for about 10 hours, requesting as many as 27000
 containers. The CPU consumption was very low at the starting of my
 application, and it gradually went high to over 100%. Is this a known issue
 or are we doing something wrong?

 Every dump of the Event Processor thread shows it running
 LeafQueue::assignContainers(), specifically the for loop below from
 LeafQueue.java, and it seems to be looping through some priority list.

 // Try to assign containers to applications in order
 for (FiCaSchedulerApp application : activeApplications) {
 ...
 // Schedule in priority order
 for (Priority priority : application.getPriorities()) {

 3XMTHREADINFO  ResourceManager Event Processor
 J9VMThread:0x01D08600, j9thread_t:0x7F032D2FAA00,
 java/lang/Thread:0x8341D9A0, state:CW, prio=5
 3XMJAVALTHREAD(java/lang/Thread getId:0x1E, isDaemon:false)
 3XMTHREADINFO1(native thread ID:0x4B64, native priority:0x5,
 native policy:UNKNOWN)
 3XMTHREADINFO2(native stack address range
 from:0x7F0313DF8000, to:0x7F0313E39000, size:0x41000)
 3XMCPUTIME   *CPU usage total: 42334.614623696 secs*
 3XMHEAPALLOC Heap bytes allocated since last GC cycle=20456
 (0x4FE8)
 3XMTHREADINFO3   Java callstack:
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue.assignContainers(LeafQueue.java:850(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/common/fica/FiCaSchedulerApp@0x8360DFE0,
 entry count: 1)
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/LeafQueue@0x833B9280,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainersToChildQueues(ParentQueue.java:655(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
 entry count: 2)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue.assignContainers(ParentQueue.java:569(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/ParentQueue@0x83360A80,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:831(Compiled
 Code))
 5XESTACKTRACE   (entered lock:
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler@0x834037C8,
 entry count: 1)
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:878(Compiled
 Code))
 4XESTACKTRACEat
 org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/CapacityScheduler.handle(CapacityScheduler.java:100(Compiled
 Code))
 4XESTACKTRACEat
 

Re: 100% CPU consumption by Resource Manager process

2014-08-18 Thread Krishna Kishore Bonagiri
Thanks Wangda, I think I have reduced this when I was trying to reduce the
container allocation time.

-Kishore


On Tue, Aug 19, 2014 at 7:39 AM, Wangda Tan wheele...@gmail.com wrote:

 Hi Krishna,

 4) What's the yarn.resourcemanager.nodemanagers.heartbeat-interval-ms in
 your configuration?
 50

 I think this config is problematic; too small a heartbeat interval will
 cause the NMs to contact the RM too often. I would suggest setting this
 value larger, e.g. 1000.

 Thanks,
 Wangda



 [...]


mismatch layout version.

2014-08-18 Thread juil cho
 
I installed version 2.4.1.

I found the following information in the datanode log file.

But the cluster is healthy.
*** datanode log
2014-07-21 15:43:27,365 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Data-node version: -55 and name-node layout version: -56