[jira] [Updated] (HADOOP-15898) 1 - 1.5 TB Data size fails to run with the following error

Srinivas (JIRA) Fri, 04 Jan 2019 13:03:28 -0800


     [ https://issues.apache.org/jira/browse/HADOOP-15898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Srinivas updated HADOOP-15898:
------------------------------
    Description: 
There is a business-impacting MR job that runs every day at 2:00 PM PST on 
about 1 - 1.5 TB of data (depending on the business day). The ideal elapsed 
time of this job is 4 hours, but multiple mappers of the job fail 
simultaneously with the error below, so the job sometimes takes 11 or even 
13 hours.

Steps taken so far to prevent this problem:

1. Migrated the environment to YARN.
2. Increased the ulimit.
3. Added extra nodes to the cluster.
4. Replaced disks regularly.
5. Monitored the cluster and terminated other jobs that impact this job.

Values that we tried increasing:

1. open files limit
2. dfs.datanode.handler.count
3. dfs.datanode.max.xcievers
4. dfs.datanode.max.transfer.threads

None of these changes helped.
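
When tuning these settings, it can help to confirm which values are actually present in the hdfs-site.xml that the DataNodes read. The following is a minimal sketch, not part of the original report: it parses the standard Hadoop `<configuration>/<property>` layout and reports the three dfs.datanode.* properties named above. The sample XML and its values are hypothetical placeholders, not the reporter's settings. As far as I know, dfs.datanode.max.xcievers is the deprecated name of dfs.datanode.max.transfer.threads in Hadoop 2.x, so both names are expected to control the same limit.

```python
import xml.etree.ElementTree as ET

# Property names taken from the list in the description above.
SETTINGS = [
    "dfs.datanode.handler.count",
    "dfs.datanode.max.xcievers",
    "dfs.datanode.max.transfer.threads",
]


def read_settings(xml_text, names=SETTINGS):
    """Return {name: value or None} for the requested Hadoop properties."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name in names:
            found[name] = (prop.findtext("value") or "").strip()
    # Properties missing from the file are reported as None.
    return {name: found.get(name) for name in names}


# Hypothetical sample file content; the values are placeholders only.
SAMPLE = """<configuration>
  <property><name>dfs.datanode.handler.count</name><value>64</value></property>
  <property><name>dfs.datanode.max.transfer.threads</name><value>8192</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, value in read_settings(SAMPLE).items():
        print(name, "=", value if value is not None else "(not set in this file)")
```

To check a real cluster, the text of the deployed hdfs-site.xml would be passed in place of SAMPLE; a property reported as not set falls back to the Hadoop default.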

org.apache.hadoop.hdfs.DFSClient: Error Recovery for block 
BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in 
pipeline 
DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK],
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK],
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]:
bad datanode 
DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK]

org.apache.hadoop.hdfs.DFSClient: Error Recovery for block 
BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in 
pipeline 
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK],
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]:
bad datanode 
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK]

org.apache.hadoop.mapred.YarnChild: Exception running child : 
java.io.IOException: java.io.IOException: All datanodes 
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]
are bad. Aborting... at 
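
Read in order, the excerpt above appears to show the write pipeline for one block shrinking: 10.0.1.37 is dropped first, then 74.120.143.19, and the write aborts when the last remaining node, 74.120.143.6, is also reported bad. To see whether the same DataNodes recur across many failed mappers, the "bad datanode" reports in the collected task logs can be tallied. The following is a rough sketch, not part of the original report; the function name and the regular expression are assumptions based only on the log format shown here.

```python
import re
from collections import Counter

# Matches the host:port that follows "bad datanode" in DFSClient recovery
# messages. \s+ also spans line breaks, since the log lines may be wrapped.
BAD_NODE = re.compile(r"bad datanode\s+DatanodeInfoWithStorage\[([^,\]]+)")


def count_bad_datanodes(log_text):
    """Count how often each DataNode address is reported as bad."""
    return Counter(BAD_NODE.findall(log_text))


# The two recovery messages from the description, shortened to the parts
# that the pattern inspects.
SAMPLE_LOG = """
org.apache.hadoop.hdfs.DFSClient: Error Recovery for block ... in pipeline ...:
bad datanode
DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK]
org.apache.hadoop.hdfs.DFSClient: Error Recovery for block ... in pipeline ...:
bad datanode
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK]
"""

if __name__ == "__main__":
    for node, hits in count_bad_datanodes(SAMPLE_LOG).most_common():
        print(node, hits)
```

If a small set of addresses dominates the counts over a full run, that would point at specific nodes or network paths rather than at cluster-wide limits.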
  

 

  was:
There is a business-impacting MR job that runs every day at 2:00 PM PST on 
about 1 - 1.5 TB of data (depending on the business day). The ideal elapsed 
time of this job is 4 hours, but multiple mappers of the job fail 
simultaneously with the error below, so the job sometimes takes 11 or even 
13 hours.

Steps taken so far to prevent this problem:

1. Migrated the environment to YARN.
2. Increased the ulimit.
3. Added extra nodes to the cluster.
4. Replaced disks regularly.
5. Monitored the cluster and terminated other jobs that impact this job.

But no luck.

org.apache.hadoop.hdfs.DFSClient: Error Recovery for block 
BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in 
pipeline 
DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK],
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK],
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]:
bad datanode 
DatanodeInfoWithStorage[10.0.1.37:50010,DS-ed333d2e-839a-4029-a1c9-b6615c322ed2,DISK]

org.apache.hadoop.hdfs.DFSClient: Error Recovery for block 
BP-854530680-69.194.253.58-1430267558563:blk_4683766046_1108754130089 in 
pipeline 
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK],
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]:
bad datanode 
DatanodeInfoWithStorage[74.120.143.19:50010,DS-5d10576e-adc3-474f-bc9d-f0d6fb3ae4c3,DISK]

org.apache.hadoop.mapred.YarnChild: Exception running child : 
java.io.IOException: java.io.IOException: All datanodes 
DatanodeInfoWithStorage[74.120.143.6:50010,DS-a5299d68-2858-46c3-8e37-d2559895f979,DISK]
are bad. Aborting... at 
 


> 1 - 1.5 TB Data size fails to run with the following error 
> -----------------------------------------------------------
>
>                 Key: HADOOP-15898
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15898
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: performance
>    Affects Versions: 2.6.0
>         Environment: Hadoop 2.6.0-cdh5.5.1 Express edition.
>  
>  
>            Reporter: Srinivas
>            Priority: Major
>              Labels: performance
>             Fix For: 2.6.0
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
