[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-04-04  Colin Patrick McCabe (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-8496:
---------------------------------------
      Resolution: Fixed
   Fix Version/s: 2.8.0
Target Version/s: 2.8.0  (was: 2.7.3, 2.6.5)
          Status: Resolved  (was: Patch Available)

> Calling stopWriter() with FSDatasetImpl lock held may block other threads
> --------------------------------------------------------------------------
>
> Key: HDFS-8496
> URL: https://issues.apache.org/jira/browse/HDFS-8496
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: zhouyingchao
>Assignee: Colin Patrick McCabe
> Fix For: 2.8.0
>
> Attachments: HDFS-8496-001.patch, HDFS-8496.002.patch, 
> HDFS-8496.003.patch, HDFS-8496.004.patch
>
>
> On a DataNode of an HDFS 2.6 cluster, we noticed that some DataXceiver threads 
> and heartbeat threads were blocked for quite a while on the FSDatasetImpl lock. 
> Looking at the stacks, we found that calling stopWriter() with the FSDatasetImpl 
> lock held blocked everything else.
> The heartbeat stack below shows, as an example, how threads are blocked on the 
> FSDatasetImpl lock:
> {code}
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getDfsUsed(FsVolumeImpl.java:152)
> - waiting to lock <0x0007701badc0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.getAvailable(FsVolumeImpl.java:191)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getStorageReports(FsDatasetImpl.java:144)
> - locked <0x000770465dc0> (a java.lang.Object)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:575)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:680)
> at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:850)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> The thread holding the FSDatasetImpl lock is simply waiting in stopWriter() for 
> another thread to exit. Its stack is:
> {code}
>java.lang.Thread.State: TIMED_WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Thread.join(Thread.java:1194)
> - locked <0x0007636953b8> (a org.apache.hadoop.util.Daemon)
> at 
> org.apache.hadoop.hdfs.server.datanode.ReplicaInPipeline.stopWriter(ReplicaInPipeline.java:183)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverCheck(FsDatasetImpl.java:982)
> at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.recoverClose(FsDatasetImpl.java:1026)
> - locked <0x0007701badc0> (a 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:624)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at 
> org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at 
> org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
> at java.lang.Thread.run(Thread.java:662)
> {code}
> In this case, quite a few other workloads were deployed on the DataNode, so the 
> local file system and disks were very busy. We suspect this is why stopWriter() 
> took so long.
> In any case, it is not reasonable to call stopWriter() with the FSDatasetImpl 
> lock held. In HDFS-7999, createTemporary() was changed to call stopWriter() 
> without the FSDatasetImpl lock; we should do the same in the other three 
> methods: recoverClose(), recoverAppend(), and recoverRbw() (see the sketch 
> below).
> I'll try to finish a patch for this today. 
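A minimal, hypothetical sketch of the locking pattern proposed in the description. The class, field, and method names below are placeholders, not the actual FsDatasetImpl or ReplicaInPipeline code: the idea is to stop the old writer outside the dataset monitor, then re-acquire the lock and re-validate the replica, since its state may have changed while the lock was released.

{code}
// Hypothetical, simplified sketch of the proposed pattern; names are
// placeholders, not the real FsDatasetImpl / ReplicaInPipeline code.
class DatasetLockingSketch {
  private final Object datasetLock = new Object();  // stands in for the FsDatasetImpl monitor
  private volatile Thread writer;                    // stands in for the replica's writer thread

  // Current behavior: the writer is joined while the dataset lock is held,
  // so heartbeats and DataXceivers that need the lock block until it exits.
  void recoverHoldingLock() throws InterruptedException {
    synchronized (datasetLock) {
      Thread w = writer;
      if (w != null && w != Thread.currentThread()) {
        w.interrupt();
        w.join();                 // may take a long time on a busy disk
      }
      // ... check and recover the replica state ...
    }
  }

  // Proposed behavior (as HDFS-7999 did for createTemporary): stop the writer
  // with the dataset lock released, then re-take the lock and re-check the
  // replica, because its state may have changed in the meantime.
  void recoverWithoutHoldingLock() throws InterruptedException {
    Thread w;
    synchronized (datasetLock) {
      w = writer;                 // look up the current writer under the lock
    }
    if (w != null && w != Thread.currentThread()) {
      w.interrupt();
      w.join();                   // wait while other threads can still take the lock
    }
    synchronized (datasetLock) {
      // ... re-validate the replica and finish recovery under the lock ...
    }
  }
}
{code}

The committed change may differ in detail; see the attached HDFS-8496.004.patch for what was actually applied.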



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-03-31  Colin Patrick McCabe (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-8496:
---
Attachment: HDFS-8496.004.patch

Fix unit test hang



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-03-31  Colin Patrick McCabe (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-8496:
---
Attachment: HDFS-8496.003.patch

* Add a unit test verifying that we are no longer blocked inside stopWriter() in 
the init recovery function
* Fix a bug where we were not interrupting the existing writer before 
attempting to join it
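For illustration, a hypothetical sketch of the interrupt-before-join behavior described in the second bullet. The class, method, and timeout parameter are placeholders, not the actual ReplicaInPipeline code:

{code}
import java.io.IOException;

// Hypothetical sketch: interrupt the writer before joining it, so a writer
// blocked in I/O or wait() is asked to stop rather than being joined forever.
final class StopWriterSketch {
  static void stopWriter(Thread writer, long stopTimeoutMs)
      throws IOException, InterruptedException {
    if (writer == null || writer == Thread.currentThread()) {
      return;                      // nothing to stop, or we are the writer
    }
    writer.interrupt();            // ask the writer to exit before joining it
    writer.join(stopTimeoutMs);    // bounded wait instead of joining forever
    if (writer.isAlive()) {
      throw new IOException("Join on writer thread " + writer + " timed out");
    }
  }
}
{code}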



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-03-30  Andrew Wang (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-8496:
--
Target Version/s: 2.7.3, 2.6.5



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-03-30  Colin Patrick McCabe (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Colin Patrick McCabe updated HDFS-8496:
---
Attachment: HDFS-8496.002.patch



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2016-03-23  John Zhuge (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated HDFS-8496:
-
Summary: Calling stopWriter() with FSDatasetImpl lock held may block other 
threads  (was: Should not call stopWriter() with FSDatasetImpl lock held)



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2015-05-29  zhouyingchao (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhouyingchao updated HDFS-8496:
---
Status: Patch Available  (was: Open)

Ran all HDFS unit tests without introducing new failures.



[jira] [Updated] (HDFS-8496) Calling stopWriter() with FSDatasetImpl lock held may block other threads

2015-05-29  zhouyingchao (JIRA)

 [ https://issues.apache.org/jira/browse/HDFS-8496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhouyingchao updated HDFS-8496:
---
Attachment: HDFS-8496-001.patch
