[ 
https://issues.apache.org/jira/browse/HDDS-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648620#comment-16648620
 ] 

Xiaoyu Yao commented on HDDS-637:
---------------------------------

Thanks [~nmaheshwari] for reporting the issue. Do you restart the DN while the 
MR result file is being written? 
I tried a few times but can't repro the issue. 

By reviewing the code and the log message, it seems this can happen when the 
pipeline creation and the closing executed (stale DN handler) by different 
threads can have a race, leading to the invalidate state transition. This will 
be improved with HDDS-587 and follow up JIRAs. 

> Not able to access the part-r-00000 file after the MR job succeeds
> ------------------------------------------------------------------
>
>                 Key: HDDS-637
>                 URL: https://issues.apache.org/jira/browse/HDDS-637
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>    Affects Versions: 0.3.0
>            Reporter: Namit Maheshwari
>            Assignee: Xiaoyu Yao
>            Priority: Major
>
> Run a MR job
> {code:java}
> -bash-4.2$ /usr/hdp/current/hadoop-client/bin/hadoop jar 
> /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar 
> wordcount /tmp/mr_jobs/input/ o3://bucket2.volume2/mr_jobDD
> 18/10/12 01:00:23 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:24 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:24 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:25 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:25 INFO client.ConfiguredRMFailoverProxyProvider: Failing over 
> to rm2
> 18/10/12 01:00:25 INFO mapreduce.JobResourceUploader: Disabling Erasure 
> Coding for path: /user/hdfs/.staging/job_1539295307098_0003
> 18/10/12 01:00:27 INFO input.FileInputFormat: Total input files to process : 1
> 18/10/12 01:00:27 INFO lzo.GPLNativeCodeLoader: Loaded native gpl library
> 18/10/12 01:00:27 INFO lzo.LzoCodec: Successfully loaded & initialized 
> native-lzo library [hadoop-lzo rev 5d6248d8d690f8456469979213ab2e9993bfa2e9]
> 18/10/12 01:00:27 INFO mapreduce.JobSubmitter: number of splits:1
> 18/10/12 01:00:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: 
> job_1539295307098_0003
> 18/10/12 01:00:28 INFO mapreduce.JobSubmitter: Executing with tokens: []
> 18/10/12 01:00:28 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:28 INFO conf.Configuration: found resource resource-types.xml 
> at file:/etc/hadoop/3.0.3.0-63/0/resource-types.xml
> 18/10/12 01:00:28 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:28 INFO impl.YarnClientImpl: Submitted application 
> application_1539295307098_0003
> 18/10/12 01:00:29 INFO mapreduce.Job: The url to track the job: 
> http://ctr-e138-1518143905142-510793-01-000005.hwx.site:8088/proxy/application_1539295307098_0003/
> 18/10/12 01:00:29 INFO mapreduce.Job: Running job: job_1539295307098_0003
> 18/10/12 01:00:35 INFO mapreduce.Job: Job job_1539295307098_0003 running in 
> uber mode : false
> 18/10/12 01:00:35 INFO mapreduce.Job: map 0% reduce 0%
> 18/10/12 01:00:44 INFO mapreduce.Job: map 100% reduce 0%
> 18/10/12 01:00:57 INFO mapreduce.Job: map 100% reduce 67%
> 18/10/12 01:00:59 INFO mapreduce.Job: map 100% reduce 100%
> 18/10/12 01:00:59 INFO mapreduce.Job: Job job_1539295307098_0003 completed 
> successfully
> 18/10/12 01:00:59 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:00:59 INFO mapreduce.Job: Counters: 58
> File System Counters
> FILE: Number of bytes read=6332
> FILE: Number of bytes written=532585
> FILE: Number of read operations=0
> FILE: Number of large read operations=0
> FILE: Number of write operations=0
> HDFS: Number of bytes read=215876
> HDFS: Number of bytes written=0
> HDFS: Number of read operations=2
> HDFS: Number of large read operations=0
> HDFS: Number of write operations=0
> O3: Number of bytes read=0
> O3: Number of bytes written=0
> O3: Number of read operations=0
> O3: Number of large read operations=0
> O3: Number of write operations=0
> Job Counters
> Launched map tasks=1
> Launched reduce tasks=1
> Rack-local map tasks=1
> Total time spent by all maps in occupied slots (ms)=25392
> Total time spent by all reduces in occupied slots (ms)=103584
> Total time spent by all map tasks (ms)=6348
> Total time spent by all reduce tasks (ms)=12948
> Total vcore-milliseconds taken by all map tasks=6348
> Total vcore-milliseconds taken by all reduce tasks=12948
> Total megabyte-milliseconds taken by all map tasks=26001408
> Total megabyte-milliseconds taken by all reduce tasks=106070016
> Map-Reduce Framework
> Map input records=716
> Map output records=32019
> Map output bytes=343475
> Map output materialized bytes=6332
> Input split bytes=121
> Combine input records=32019
> Combine output records=461
> Reduce input groups=461
> Reduce shuffle bytes=6332
> Reduce input records=461
> Reduce output records=461
> Spilled Records=922
> Shuffled Maps =1
> Failed Shuffles=0
> Merged Map outputs=1
> GC time elapsed (ms)=359
> CPU time spent (ms)=11800
> Physical memory (bytes) snapshot=3018502144
> Virtual memory (bytes) snapshot=14470242304
> Total committed heap usage (bytes)=3521642496
> Peak Map Physical memory (bytes)=2518896640
> Peak Map Virtual memory (bytes)=5397549056
> Peak Reduce Physical memory (bytes)=499605504
> Peak Reduce Virtual memory (bytes)=9072693248
> Shuffle Errors
> BAD_ID=0
> CONNECTION=0
> IO_ERROR=0
> WRONG_LENGTH=0
> WRONG_MAP=0
> WRONG_REDUCE=0
> File Input Format Counters
> Bytes Read=215755
> File Output Format Counters
> Bytes Written=0
> 18/10/12 01:00:59 INFO conf.Configuration: Removed undeclared tags:
> -bash-4.2$
> {code}
> Below exception is seen in SCM logs
> {code:java}
> 2018-10-12 01:00:51,142 INFO 
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl: Allocating a new 
> ratis pipeline of size: 3 id: pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192
> 2018-10-12 01:00:51,988 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector: closing pipeline 
> pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192.
> 2018-10-12 01:00:51,988 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector: Finalizing pipeline. 
> pipelineID: pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192
> 2018-10-12 01:00:51,989 ERROR 
> org.apache.hadoop.hdds.scm.pipelines.PipelineStateManager: Failed to update 
> pipeline state pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192, reason: 
> invalid state transition from state: CREATING upon event: FINALIZE.
> 2018-10-12 01:00:51,991 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineCloseHandler: failed to close 
> pipeline:pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192
> org.apache.hadoop.hdds.scm.exceptions.SCMException: Failed to update pipeline 
> state pipelineId=85cfe1ff-a47c-4ee7-aadb-9c7a42d80192, reason: invalid state 
> transition from state: CREATING upon event: FINALIZE.
> at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineStateManager.updatePipelineState(PipelineStateManager.java:128)
> at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.updatePipelineState(PipelineSelector.java:464)
> at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector.finalizePipeline(PipelineSelector.java:310)
> at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineCloseHandler.onMessage(PipelineCloseHandler.java:44)
> at 
> org.apache.hadoop.hdds.scm.pipelines.PipelineCloseHandler.onMessage(PipelineCloseHandler.java:30)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:85)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 2018-10-12 01:00:53,297 INFO 
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl: Allocating a new 
> ratis pipeline of size: 3 id: pipelineId=68051c76-53c3-47e5-b9c6-d7dc2a3e44c1
> 2018-10-12 01:00:53,847 INFO 
> org.apache.hadoop.hdds.scm.pipelines.ratis.RatisManagerImpl: Allocating a new 
> ratis pipeline of size: 3 id: pipelineId=25f738e7-692e-4201-882e-2dcf08ef9340
> 2018-10-12 01:00:56,959 INFO 
> org.apache.hadoop.hdds.scm.server.SCMClientProtocolServer: Object type 
> container id 2 op create new stage complete
> 2018-10-12 01:01:09,990 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector: closing pipeline 
> pipelineId=68051c76-53c3-47e5-b9c6-d7dc2a3e44c1.
> 2018-10-12 01:01:09,990 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector: Finalizing pipeline. 
> pipelineID: pipelineId=68051c76-53c3-47e5-b9c6-d7dc2a3e44c1
> 2018-10-12 01:01:09,991 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 7
> 2018-10-12 01:01:09,992 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 10
> 2018-10-12 01:01:09,993 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 13
> 2018-10-12 01:01:09,993 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 16
> 2018-10-12 01:01:09,993 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 2
> 2018-10-12 01:01:09,997 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 19
> 2018-10-12 01:01:09,997 INFO 
> org.apache.hadoop.hdds.scm.container.CloseContainerEventHandler: Close 
> container Event triggered for container : 4
> 2018-10-12 01:02:10,114 INFO 
> org.apache.hadoop.hdds.scm.pipelines.PipelineSelector: Closing pipeline. 
> pipelineID: pipelineId=68051c76-53c3-47e5-b9c6-d7dc2a3e44c1{code}
> Not able to access the part-r-00000 file
> {code:java}
> -bash-4.2$ hdfs dfs -ls o3://bucket2.volume2/mr_jobDD
> 18/10/12 01:01:44 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:01:45 INFO conf.Configuration: Removed undeclared tags:
> Found 2 items
> rw-rw-rw 1 hdfs hdfs 0 2018-10-12 01:00 o3://bucket2.volume2/mr_jobDD/_SUCCESS
> rw-rw-rw 1 hdfs hdfs 5017 1970-07-19 09:05 
> o3://bucket2.volume2/mr_jobDD/part-r-00000
> 18/10/12 01:01:45 INFO conf.Configuration: Removed undeclared tags:
> -bash-4.2$ hdfs dfs -cat o3://bucket2.volume2/mr_jobDD/part-r-00000
> 18/10/12 01:02:09 INFO conf.Configuration: Removed undeclared tags:
> 18/10/12 01:02:10 INFO conf.Configuration: Removed undeclared tags:
> cat: ContainerID 2 does not exist
> 18/10/12 01:02:12 INFO conf.Configuration: Removed undeclared tags:
> -bash-4.2$ {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to