[jira] [Commented] (HDDS-1532) Ozone: Freon: Improve the concurrent testing framework.

2019-07-01 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-1532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16876624#comment-16876624
 ] 

Junping Du commented on HDDS-1532:
--

bq. In this environment, writing the same amount of data, by improving the 
concurrent framework, test performance is 8 times faster.
Amazing patch! Great work, [~xudongcao]!

> Ozone: Freon: Improve the concurrent testing framework.
> ---
>
> Key: HDDS-1532
> URL: https://issues.apache.org/jira/browse/HDDS-1532
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.4.0
>Reporter: Xudong Cao
>Assignee: Xudong Cao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, Freon's concurrency framework works only at the volume level, but in 
> actual testing, users are likely to provide a small volume number (typically 
> 1) and larger bucket and key numbers, in which case the existing 
> concurrency framework cannot make good use of the thread pool.
> We need to improve the concurrency policy so that the volume creation, 
> bucket creation, and key creation tasks can all be submitted equally to 
> the thread pool as general tasks. 
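A minimal sketch of the proposed task-submission pattern, not the actual Freon patch; the create* helpers and the count fields below are hypothetical stand-ins for the real Ozone client calls and CLI options:

{code}
// Hedged sketch only: the create* helpers and counts are hypothetical stand-ins
// for the real Freon client calls and CLI options, not the committed patch.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class FreonStyleGenerator {
  private final int numVolumes, numBuckets, numKeys, threadCount;

  FreonStyleGenerator(int numVolumes, int numBuckets, int numKeys, int threadCount) {
    this.numVolumes = numVolumes;
    this.numBuckets = numBuckets;
    this.numKeys = numKeys;
    this.threadCount = threadCount;
  }

  void run() throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(threadCount);

    // Phase 1: volume creations are ordinary pool tasks.
    List<Callable<Void>> volumeTasks = new ArrayList<>();
    for (int v = 0; v < numVolumes; v++) {
      String volume = "vol-" + v;
      volumeTasks.add(() -> { createVolume(volume); return null; });
    }
    pool.invokeAll(volumeTasks);

    // Phase 2: bucket creations fan out the same way, even with one volume.
    List<Callable<Void>> bucketTasks = new ArrayList<>();
    for (int v = 0; v < numVolumes; v++) {
      for (int b = 0; b < numBuckets; b++) {
        String volume = "vol-" + v;
        String bucket = "bucket-" + b;
        bucketTasks.add(() -> { createBucket(volume, bucket); return null; });
      }
    }
    pool.invokeAll(bucketTasks);

    // Phase 3: key creations dominate the workload and keep every thread busy.
    List<Callable<Void>> keyTasks = new ArrayList<>();
    for (int v = 0; v < numVolumes; v++) {
      for (int b = 0; b < numBuckets; b++) {
        for (int k = 0; k < numKeys; k++) {
          String volume = "vol-" + v;
          String bucket = "bucket-" + b;
          String key = "key-" + k;
          keyTasks.add(() -> { createKey(volume, bucket, key); return null; });
        }
      }
    }
    pool.invokeAll(keyTasks);
    pool.shutdown();
  }

  // Hypothetical client calls -- stand-ins for the real Ozone client API.
  void createVolume(String volume) { }
  void createBucket(String volume, String bucket) { }
  void createKey(String volume, String bucket, String key) { }
}
{code}

Because keys vastly outnumber volumes in the scenario described, treating every creation as a pool task keeps the pool saturated even when only one volume is requested.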



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-698) Support Topology Awareness for Ozone

2019-01-31 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757215#comment-16757215
 ] 

Junping Du commented on HDDS-698:
-

Thanks [~szetszwo], and sorry for the late response; the design document is attached 
to this umbrella JIRA. For the patches, as Sammi mentioned, please help review 
them in the sub-JIRAs. Thanks!

> Support Topology Awareness for Ozone
> 
>
> Key: HDDS-698
> URL: https://issues.apache.org/jira/browse/HDDS-698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Xiaoyu Yao
>Assignee: Sammi Chen
>Priority: Major
> Attachments: HDDS-698.000.patch, network-topology-default.xml, 
> network-topology-nodegroup.xml
>
>
> This is an umbrella JIRA to add topology-aware support for Ozone Pipelines, 
> Containers and Blocks. HDFS has long provided rack/nodegroup awareness for 
> reliability and high-performance data access. Ozone needs a similar 
> mechanism, and it can be made more flexible for cloud scenarios. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-698) Support Topology Awareness for Ozone

2019-01-10 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16740089#comment-16740089
 ] 

Junping Du commented on HDDS-698:
-

Not enough bandwidth in the near term; reassigning to [~Sammi].

> Support Topology Awareness for Ozone
> 
>
> Key: HDDS-698
> URL: https://issues.apache.org/jira/browse/HDDS-698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Xiaoyu Yao
>Assignee: Sammi Chen
>Priority: Major
> Attachments: HDDS-698.000.patch, network-topology-default.xml
>
>
> This is an umbrella JIRA to add topology-aware support for Ozone Pipelines, 
> Containers and Blocks. HDFS has long provided rack/nodegroup awareness for 
> reliability and high-performance data access. Ozone needs a similar 
> mechanism, and it can be made more flexible for cloud scenarios. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-698) Support Topology Awareness for Ozone

2019-01-10 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned HDDS-698:
---

Assignee: Sammi Chen  (was: Junping Du)

> Support Topology Awareness for Ozone
> 
>
> Key: HDDS-698
> URL: https://issues.apache.org/jira/browse/HDDS-698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Xiaoyu Yao
>Assignee: Sammi Chen
>Priority: Major
> Attachments: HDDS-698.000.patch, network-topology-default.xml
>
>
> This is an umbrella JIRA to add topology-aware support for Ozone Pipelines, 
> Containers and Blocks. HDFS has long provided rack/nodegroup awareness for 
> reliability and high-performance data access. Ozone needs a similar 
> mechanism, and it can be made more flexible for cloud scenarios. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-698) Support Topology Awareness for Ozone

2018-12-13 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDDS-698:

Description: This is an umbrella JIRA to add topology-aware support for 
Ozone Pipelines, Containers and Blocks. HDFS has long provided rack/nodegroup 
awareness for reliability and high-performance data access. Ozone needs a 
similar mechanism, and it can be made more flexible for cloud 
scenarios.   (was: This is an umbrella JIRA to add topology-aware support for 
Ozone Pipelines, Containers and Blocks.)

> Support Topology Awareness for Ozone
> 
>
> Key: HDDS-698
> URL: https://issues.apache.org/jira/browse/HDDS-698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Xiaoyu Yao
>Assignee: Junping Du
>Priority: Major
>
> This is an umbrella JIRA to add topology-aware support for Ozone Pipelines, 
> Containers and Blocks. HDFS has long provided rack/nodegroup awareness for 
> reliability and high-performance data access. Ozone needs a similar 
> mechanism, and it can be made more flexible for cloud scenarios. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12914) Block report leases cause missing blocks until next report

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12914:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and 
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected due to an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues, possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DN's 
> next FBR is sent and/or forced.
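Purely illustrative, not the actual HDFS API: the root of the confusion is a boolean that means both "report rejected" and "no stale storages". A hedged sketch of an explicit result type that would force the RPC handler to treat a rejected FBR differently (all names below are hypothetical):

{code}
// Hedged illustration only -- not the real BlockReportLeaseManager or
// NameNodeRpcServer code.  A boolean return conflates "lease rejected" with
// "processed, no stale storages"; an explicit result keeps them apart.
enum LeaseCheckResult {
  ACCEPTED,               // process the FBR and update storage state
  REJECTED_INVALID_LEASE  // drop the FBR and have the DN retry/re-register
}

LeaseCheckResult result = checkLease(datanode, leaseId);    // hypothetical signature
switch (result) {
  case ACCEPTED:
    processFullBlockReport(datanode, report);               // hypothetical helper
    break;
  case REJECTED_INVALID_LEASE:
    requestNewFullBlockReport(datanode);                    // instead of silently
    break;                                                  // reporting "no stale storages"
}
{code}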



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12136:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately, it holds the exclusive dataset lock to open and read 
> the metafile multiple times, so block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high 
> xceiver activity, e.g. lost-node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.
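A generic, hedged illustration of the locking problem (not the BlockSender code itself): any metafile read performed while the exclusive dataset lock is held serializes every concurrent constructor, so the remedy is to snapshot replica state under the lock and do the disk I/O outside it, or to use a value cached when the block was finalized.

{code}
// Generic illustration, not the actual BlockSender/FsDatasetImpl code.
class SenderSetup {
  private final Object datasetLock = new Object();   // stands in for the dataset lock

  // Anti-pattern: slow metafile I/O under the exclusive lock.  With many
  // xceiver threads, every block sender construction queues up behind it.
  long lastChecksumUnderLock(java.io.File metaFile) throws java.io.IOException {
    synchronized (datasetLock) {
      return readLastChecksum(metaFile);
    }
  }

  // Better: hold the lock only long enough to copy in-memory replica state,
  // then read the metafile (or a cached checksum) without blocking others.
  long lastChecksumOutsideLock(java.io.File metaFile) throws java.io.IOException {
    synchronized (datasetLock) {
      // snapshot whatever replica metadata is needed, then release the lock
    }
    return readLastChecksum(metaFile);
  }

  private long readLastChecksum(java.io.File metaFile) throws java.io.IOException {
    // placeholder for the real metafile read
    return metaFile.length();
  }
}
{code}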



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7967) Reduce the performance impact of the balancer

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-7967:
-
Target Version/s: 2.8.6  (was: 2.8.5)

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch, 
> HDFS-7967.branch-2-1.patch, HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.
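A hedged sketch of the round-robin direction the last sentence proposes, not the committed balancer change; the Storage holder and its fields are illustrative only:

{code}
// Hedged sketch of the suggested direction, not the actual balancer patch.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class RoundRobinBlockPicker {

  /** Illustrative holder for one storage's block iterator and free space. */
  static class Storage<B> {
    final Iterator<B> blocks;
    final long remainingCapacity;
    Storage(Iterator<B> blocks, long remainingCapacity) {
      this.blocks = blocks;
      this.remainingCapacity = remainingCapacity;
    }
  }

  /**
   * Pick up to n candidate blocks by cycling through the storages instead of
   * skipping a random number of blocks through one merged iterator, visiting
   * the fullest (least remaining capacity) storages first so imbalance inside
   * the node is considered, not just imbalance between nodes.
   */
  static <B> List<B> pick(List<Storage<B>> storages, int n) {
    storages.sort(Comparator.comparingLong(s -> s.remainingCapacity));
    List<B> picked = new ArrayList<>();
    boolean progress = true;
    while (picked.size() < n && progress) {
      progress = false;
      for (Storage<B> s : storages) {          // one block per storage per round
        if (picked.size() >= n) {
          break;
        }
        if (s.blocks.hasNext()) {
          picked.add(s.blocks.next());
          progress = true;
        }
      }
    }
    return picked;
  }
}
{code}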



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-13111:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> Close recovery may incorrectly mark blocks corrupt
> --
>
> Key: HDFS-13111
> URL: https://issues.apache.org/jira/browse/HDFS-13111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives 
> from one of the DNs.  The reason is unclear, but it has happened multiple times 
> when a DN has I/O-saturated disks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12704) FBR may corrupt block state

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12704:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> FBR may corrupt block state
> ---
>
> Key: HDFS-12704
> URL: https://issues.apache.org/jira/browse/HDFS-12704
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> If FBR processing generates a runtime exception it is believed to foul the 
> block state and lead to unpredictable behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12703) Exceptions are fatal to decommissioning monitor

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12703:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> Exceptions are fatal to decommissioning monitor
> ---
>
> Key: HDFS-12703
> URL: https://issues.apache.org/jira/browse/HDFS-12703
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If 
> an exception occurs, all decommissioning ceases until the NN is restarted.  
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
> task encounters an exception, subsequent executions are suppressed*.  The 
> monitor thread is alive but blocked waiting for an executor task that will 
> never come.  The code currently disposes of the future so the actual 
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.  
> Replication queue init after the transition to active will fix the under 
> replication of blocks on currently decommissioning nodes but future nodes 
> never decommission.  The standby must be bounced prior to failover – and 
> hopefully the error condition does not reoccur.
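The suppression behavior cited from the javadoc is standard {{java.util.concurrent}} semantics; below is a minimal, self-contained sketch (not NameNode code) demonstrating it, together with the usual defensive pattern of catching {{Throwable}} inside the periodic task so one failed cycle is logged instead of silently stopping all future cycles.

{code}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MonitorSuppressionDemo {
  public static void main(String[] args) throws InterruptedException {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();

    // Unprotected periodic task: once run() throws, scheduleAtFixedRate
    // suppresses every subsequent execution, and the exception is only
    // reachable through the (discarded) Future -- the failure mode above.
    executor.scheduleAtFixedRate(new Runnable() {
      private int runs = 0;
      @Override public void run() {
        runs++;
        System.out.println("unprotected run " + runs);
        if (runs == 3) {
          throw new IllegalStateException("monitor blew up");   // runs 4+ never happen
        }
      }
    }, 0, 100, TimeUnit.MILLISECONDS);

    // Defensive variant: catch Throwable inside run() so one bad cycle is
    // logged and the schedule keeps firing.
    executor.scheduleAtFixedRate(() -> {
      try {
        System.out.println("protected cycle ran");
        // ... monitor work would go here ...
      } catch (Throwable t) {
        System.err.println("monitor cycle failed, will retry next period: " + t);
      }
    }, 0, 100, TimeUnit.MILLISECONDS);

    Thread.sleep(1000);
    executor.shutdownNow();
  }
}
{code}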



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12747) Lease monitor may infinitely loop on the same lease

2018-09-08 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12747:
--
Target Version/s: 2.8.6  (was: 2.8.5)

> Lease monitor may infinitely loop on the same lease
> ---
>
> Key: HDFS-12747
> URL: https://issues.apache.org/jira/browse/HDFS-12747
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Lease recovery incorrectly handles UC files if the last block is complete but 
> the penultimate block is committed.  "Incorrectly handles" is a euphemism for: 
> it infinitely loops for days and leaves all abandoned streams open until 
> customers complain.
> The problem may manifest when:
> # Block1 is committed but seemingly never completed
> # Block2 is allocated
> # Lease recovery is initiated for block2
> # Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, 
> causing:
> #* {{commitOrCompleteLastBlock}} to mark block2 as complete
> #* 
> {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} 
> to throw {{IllegalStateException}} because the penultimate block1 is 
> "COMMITTED but not COMPLETE"
> # The next lease recovery results in an infinite loop.
> The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will 
> either init recovery and renew the lease, or remove the lease.  In the 
> described state it does neither.  The switch case will break out if the last 
> block is complete.  (The case statement ironically contains an assert).  
> Since nothing changed, the lease is still the “next” lease to be processed.  
> The lease monitor loops for 25ms on the same lease, sleeps for 2s, loops on 
> it again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12747) Lease monitor may infinitely loop on the same lease

2018-09-08 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16607955#comment-16607955
 ] 

Junping Du commented on HDFS-12747:
---

Moving out to 2.8.6 given there has been no progress for a while.

> Lease monitor may infinitely loop on the same lease
> ---
>
> Key: HDFS-12747
> URL: https://issues.apache.org/jira/browse/HDFS-12747
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Lease recovery incorrectly handles UC files if the last block is complete but 
> the penultimate block is committed.  "Incorrectly handles" is a euphemism for: 
> it infinitely loops for days and leaves all abandoned streams open until 
> customers complain.
> The problem may manifest when:
> # Block1 is committed but seemingly never completed
> # Block2 is allocated
> # Lease recovery is initiated for block2
> # Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, 
> causing:
> #* {{commitOrCompleteLastBlock}} to mark block2 as complete
> #* 
> {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} 
> to throw {{IllegalStateException}} because the penultimate block1 is 
> "COMMITTED but not COMPLETE"
> # The next lease recovery results in an infinite loop.
> The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will 
> either init recovery and renew the lease, or remove the lease.  In the 
> described state it does neither.  The switch case will break out if the last 
> block is complete.  (The case statement ironically contains an assert).  
> Since nothing changed, the lease is still the “next” lease to be processed.  
> The lease monitor loops for 25ms on the same lease, sleeps for 2s, loops on 
> it again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-286) Fix NodeReportPublisher.getReport NPE

2018-07-24 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned HDDS-286:
---

Assignee: Junjie Chen

> Fix NodeReportPublisher.getReport NPE
> -
>
> Key: HDDS-286
> URL: https://issues.apache.org/jira/browse/HDDS-286
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Xiaoyu Yao
>Assignee: Junjie Chen
>Priority: Major
>  Labels: newbie
> Fix For: 0.2.1
>
>
> This can be reproduced with TestKeys#testPutKey
> {code}
> 2018-07-23 21:33:55,598 WARN  concurrent.ExecutorHelper 
> (ExecutorHelper.java:logThrowableFromAfterExecute(63)) - Caught exception in 
> thread Datanode ReportManager Thread - 0: 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.ozone.container.common.volume.VolumeInfo.getScmUsed(VolumeInfo.java:107)
>   at 
> org.apache.hadoop.ozone.container.common.volume.VolumeSet.getNodeReport(VolumeSet.java:350)
>   at 
> org.apache.hadoop.ozone.container.ozoneimpl.OzoneContainer.getNodeReport(OzoneContainer.java:260)
>   at 
> org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:64)
>   at 
> org.apache.hadoop.ozone.container.common.report.NodeReportPublisher.getReport(NodeReportPublisher.java:39)
>   at 
> org.apache.hadoop.ozone.container.common.report.ReportPublisher.publishReport(ReportPublisher.java:86)
>   at 
> org.apache.hadoop.ozone.container.common.report.ReportPublisher.run(ReportPublisher.java:73)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
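The trace shows {{VolumeInfo.getScmUsed}} dereferencing something that is not yet (or no longer) initialized. A hedged, fragment-level sketch of the kind of guard that avoids the NPE; the {{usage}} field name and the choice to throw {{IOException}} are assumptions, not the actual fix:

{code}
// Hedged sketch only -- field/exception choices are assumptions, not the real patch.
public long getScmUsed() throws IOException {
  VolumeUsage currentUsage = usage;   // assumed stats holder, per the stack trace
  if (currentUsage == null) {
    // Fail with a clear message (or return a default) instead of an NPE when
    // the volume's usage stats are not initialized yet or already shut down.
    throw new IOException("SCM usage is not yet available for this volume");
  }
  return currentUsage.getScmUsed();
}
{code}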



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-188) TestOmMetrics should not use the deprecated WhiteBox class

2018-07-23 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDDS-188:

Description: TestOmMetrics should stop using 
{{org.apache.hadoop.test.Whitebox}}.  (was: TestKSMMetrics (also needs to be 
renamed) should stop using {{org.apache.hadoop.test.Whitebox}}.)

> TestOmMetrics should not use the deprecated WhiteBox class
> --
>
> Key: HDDS-188
> URL: https://issues.apache.org/jira/browse/HDDS-188
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
>  Labels: newbie
> Fix For: 0.2.1
>
>
> TestOmMetrics should stop using {{org.apache.hadoop.test.Whitebox}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-188) TestKSMMetrics should not use the deprecated WhiteBox class

2018-07-23 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDDS-188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16553826#comment-16553826
 ] 

Junping Du commented on HDDS-188:
-

Looks like TestKSMMetrics got renamed to TestOmMetrics; updating the title and 
description to reflect the change.

> TestKSMMetrics should not use the deprecated WhiteBox class
> ---
>
> Key: HDDS-188
> URL: https://issues.apache.org/jira/browse/HDDS-188
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
>  Labels: newbie
> Fix For: 0.2.1
>
>
> TestKSMMetrics (also needs to be renamed) should stop using 
> {{org.apache.hadoop.test.Whitebox}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-188) TestOmMetrics should not use the deprecated WhiteBox class

2018-07-23 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDDS-188:

Summary: TestOmMetrics should not use the deprecated WhiteBox class  (was: 
TestKSMMetrics should not use the deprecated WhiteBox class)

> TestOmMetrics should not use the deprecated WhiteBox class
> --
>
> Key: HDDS-188
> URL: https://issues.apache.org/jira/browse/HDDS-188
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Arpit Agarwal
>Assignee: Arpit Agarwal
>Priority: Major
>  Labels: newbie
> Fix For: 0.2.1
>
>
> TestKSMMetrics (also needs to be renamed) should stop using 
> {{org.apache.hadoop.test.Whitebox}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-117) Wrapper for set/get Standalone, Ratis and Rest Ports in DatanodeDetails.

2018-07-23 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned HDDS-117:
---

Assignee: chao.wu  (was: Junping Du)

> Wrapper for set/get Standalone, Ratis and Rest Ports in DatanodeDetails.
> 
>
> Key: HDDS-117
> URL: https://issues.apache.org/jira/browse/HDDS-117
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Datanode
>Reporter: Nanda kumar
>Assignee: chao.wu
>Priority: Major
>  Labels: newbie
>
> It will be very helpful to have wrappers to set/get the Standalone, Ratis and 
> Rest ports in DatanodeDetails.
> Search and replace direct usages of DatanodeDetails#newPort in the current code. 
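A hedged, fragment-level sketch of what such wrappers could look like inside {{DatanodeDetails}}; the exact method and enum names here are assumptions for illustration, not the committed API:

{code}
// Hedged sketch only -- names are assumptions, not the committed HDDS API.
public void setStandalonePort(int value) {
  setPort(newPort(Port.Name.STANDALONE, value));
}

public Port getStandalonePort() {
  return getPort(Port.Name.STANDALONE);
}

// ...and the same set/get pair for Port.Name.RATIS and Port.Name.REST, so
// callers no longer construct ports with DatanodeDetails#newPort directly.
{code}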



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-164) Add unit test for HddsDatanodeService

2018-07-20 Thread Junping Du (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDDS-164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du reassigned HDDS-164:
---

Assignee: Junjie Chen

> Add unit test for HddsDatanodeService
> -
>
> Key: HDDS-164
> URL: https://issues.apache.org/jira/browse/HDDS-164
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone Datanode
>Reporter: Nanda kumar
>Assignee: Junjie Chen
>Priority: Major
>  Labels: newbie, test
> Fix For: 0.2.1
>
>
> We have to add a unit test for the {{HddsDatanodeService}} class.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12914) Block report leases cause missing blocks until next report

2018-07-18 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547561#comment-16547561
 ] 

Junping Du commented on HDFS-12914:
---

[~daryn] Do we still hit this issue? Any clue on how to fix it?

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and 
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected due to an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues, possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DN's 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10664) layoutVersion mismatch between Namenode VERSION file and Journalnode VERSION file after cluster upgrade

2018-06-26 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16524639#comment-16524639
 ] 

Junping Du commented on HDFS-10664:
---

The failed tests are not related to the patch, and I verified that they pass 
locally. However, I found that the added UT passes no matter whether the fix is 
applied or not. [~yuanbo], am I missing something?

> layoutVersion mismatch between Namenode VERSION file and Journalnode VERSION 
> file after cluster upgrade
> ---
>
> Key: HDFS-10664
> URL: https://issues.apache.org/jira/browse/HDFS-10664
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, hdfs
>Affects Versions: 2.7.1
>Reporter: Amit Anand
>Assignee: Yuanbo Liu
>Priority: Major
> Attachments: HDFS-10664.001.patch
>
>
> After a cluster is upgraded I see a mismatch in {{layoutVersion}} between NN 
> VERSION file and JN VERSION file.
> Here is what I see:
> Before cluster upgrade:
> ==
> {code}
> ## Version file from NN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=NAME_NODE
> blockpoolID=BP-786201894-10.0.100.11-1466026941507
> layoutVersion=-60
> {code}
> {code}
> ## Version file from JN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=JOURNAL_NODE
> layoutVersion=-60
> {code}
> After cluster upgrade:
> =
> {code}
> ## Version file from NN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=NAME_NODE
> blockpoolID=BP-786201894-10.0.100.11-1466026941507
> layoutVersion=-63
> {code}
> {code}
> ## Version file from JN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=JOURNAL_NODE
> layoutVersion=-60
> {code}
> Since the {{Namenode}} is what creates the {{Journalnode}} {{VERSION}} file during 
> {{initializeSharedEdits}}, it should also update the file with the correct 
> information after the cluster is upgraded and {{hdfs dfsadmin 
> -finalizeUpgrade}} has been executed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10664) layoutVersion mismatch between Namenode VERSION file and Journalnode VERSION file after cluster upgrade

2018-06-26 Thread Junping Du (JIRA)


[ 
https://issues.apache.org/jira/browse/HDFS-10664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16523615#comment-16523615
 ] 

Junping Du commented on HDFS-10664:
---

This patch seems to correct a harmless but silly mistake in the JN version 
number, and I see very low risk in the patch. Kicking off Jenkins again.

> layoutVersion mismatch between Namenode VERSION file and Journalnode VERSION 
> file after cluster upgrade
> ---
>
> Key: HDFS-10664
> URL: https://issues.apache.org/jira/browse/HDFS-10664
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha, hdfs
>Affects Versions: 2.7.1
>Reporter: Amit Anand
>Assignee: Yuanbo Liu
>Priority: Major
> Attachments: HDFS-10664.001.patch
>
>
> After a cluster is upgraded I see a mismatch in {{layoutVersion}} between NN 
> VERSION file and JN VERSION file.
> Here is what I see:
> Before cluster upgrade:
> ==
> {code}
> ## Version file from NN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=NAME_NODE
> blockpoolID=BP-786201894-10.0.100.11-1466026941507
> layoutVersion=-60
> {code}
> {code}
> ## Version file from JN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=JOURNAL_NODE
> layoutVersion=-60
> {code}
> After cluster upgrade:
> =
> {code}
> ## Version file from NN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=NAME_NODE
> blockpoolID=BP-786201894-10.0.100.11-1466026941507
> layoutVersion=-63
> {code}
> {code}
> ## Version file from JN current directory
> namespaceID=109645726
> clusterID=CID-edcb62c5-bc1f-49f5-addb-37827340b5de
> cTime=0
> storageType=JOURNAL_NODE
> layoutVersion=-60
> {code}
> Since the {{Namenode}} is what creates the {{Journalnode}} {{VERSION}} file during 
> {{initializeSharedEdits}}, it should also update the file with the correct 
> information after the cluster is upgraded and {{hdfs dfsadmin 
> -finalizeUpgrade}} has been executed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-05-07 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466776#comment-16466776
 ] 

Junping Du commented on HDFS-12136:
---

Looks like HDFS-11187 may not cover the full fix here. [~jojochuang], do you have 
further comments?
Moving it to 2.8.5 as we need more discussion and 2.8.4 is in the RC stage.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately, it holds the exclusive dataset lock to open and read 
> the metafile multiple times, so block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high 
> xceiver activity, e.g. lost-node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-05-07 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12136:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately, it holds the exclusive dataset lock to open and read 
> the metafile multiple times, so block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high 
> xceiver activity, e.g. lost-node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457439#comment-16457439
 ] 

Junping Du commented on HDFS-12136:
---

Thanks [~jojochuang] for the comments. I agree with you that the previous lock 
issue got resolved in HDFS-11187. [~daryn], I will go ahead and resolve this as 
a duplicate of HDFS-11187 if you also agree.

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately, it holds the exclusive dataset lock to open and read 
> the metafile multiple times, so block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk I/O utilization or high 
> xceiver activity, e.g. lost-node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7967) Reduce the performance impact of the balancer

2018-04-26 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455663#comment-16455663
 ] 

Junping Du commented on HDFS-7967:
--

Moving it to 2.8.5 given there has been no response for a while.

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch, 
> HDFS-7967.branch-2-1.patch, HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7967) Reduce the performance impact of the balancer

2018-04-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-7967:
-
Target Version/s: 2.8.5  (was: 2.8.4)

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch, 
> HDFS-7967.branch-2-1.patch, HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12703) Exceptions are fatal to decommissioning monitor

2018-04-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12703:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> Exceptions are fatal to decommissioning monitor
> ---
>
> Key: HDFS-12703
> URL: https://issues.apache.org/jira/browse/HDFS-12703
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If 
> an exception occurs, all decommissioning ceases until the NN is restarted.  
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
> task encounters an exception, subsequent executions are suppressed*.  The 
> monitor thread is alive but blocked waiting for an executor task that will 
> never come.  The code currently disposes of the future so the actual 
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.  
> Replication queue init after the transition to active will fix the under 
> replication of blocks on currently decommissioning nodes but future nodes 
> never decommission.  The standby must be bounced prior to failover – and 
> hopefully the error condition does not reoccur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt

2018-04-26 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16455659#comment-16455659
 ] 

Junping Du commented on HDFS-13111:
---

Moving to 2.8.5 given this JIRA has been quiet for a while.

> Close recovery may incorrectly mark blocks corrupt
> --
>
> Key: HDFS-13111
> URL: https://issues.apache.org/jira/browse/HDFS-13111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives 
> from one of the DNs.  The reason is unclear, but it has happened multiple times 
> when a DN has I/O-saturated disks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt

2018-04-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-13111:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> Close recovery may incorrectly mark blocks corrupt
> --
>
> Key: HDFS-13111
> URL: https://issues.apache.org/jira/browse/HDFS-13111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives 
> from one of the DNs.  The reason is unclear, but it has happened multiple times 
> when a DN has I/O-saturated disks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12914) Block report leases cause missing blocks until next report

2018-04-26 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12914:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> Block report leases cause missing blocks until next report
> --
>
> Key: HDFS-12914
> URL: https://issues.apache.org/jira/browse/HDFS-12914
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> {{BlockReportLeaseManager#checkLease}} will reject FBRs from DNs for 
> conditions such as "unknown datanode", "not in pending set", "lease has 
> expired", wrong lease id, etc.  Lease rejection does not throw an exception.  
> It returns false, which bubbles up to {{NameNodeRpcServer#blockReport}} and 
> is interpreted as {{noStaleStorages}}.
> A re-registering node whose FBR is rejected due to an invalid lease becomes 
> active with _no blocks_.  A replication storm ensues, possibly causing DNs to 
> temporarily go dead (HDFS-12645), leading to more FBR lease rejections on 
> re-registration.  The cluster will have many "missing blocks" until the DN's 
> next FBR is sent and/or forced.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12747) Lease monitor may infinitely loop on the same lease

2018-04-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12747:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> Lease monitor may infinitely loop on the same lease
> ---
>
> Key: HDFS-12747
> URL: https://issues.apache.org/jira/browse/HDFS-12747
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Lease recovery incorrectly handles UC files if the last block is complete but 
> the penultimate block is committed.  "Incorrectly handles" is a euphemism for: 
> it infinitely loops for days and leaves all abandoned streams open until 
> customers complain.
> The problem may manifest when:
> # Block1 is committed but seemingly never completed
> # Block2 is allocated
> # Lease recovery is initiated for block2
> # Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, 
> causing:
> #* {{commitOrCompleteLastBlock}} to mark block2 as complete
> #* 
> {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} 
> to throw {{IllegalStateException}} because the penultimate block1 is 
> "COMMITTED but not COMPLETE"
> # The next lease recovery results in an infinite loop.
> The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will 
> either init recovery and renew the lease, or remove the lease.  In the 
> described state it does neither.  The switch case will break out if the last 
> block is complete.  (The case statement ironically contains an assert).  
> Since nothing changed, the lease is still the “next” lease to be processed.  
> The lease monitor loops for 25ms on the same lease, sleeps for 2s, loops on 
> it again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12704) FBR may corrupt block state

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437151#comment-16437151
 ] 

Junping Du commented on HDFS-12704:
---

Moving to 2.8.5 given this issue has been quiet for a while.

> FBR may corrupt block state
> ---
>
> Key: HDFS-12704
> URL: https://issues.apache.org/jira/browse/HDFS-12704
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> If FBR processing generates a runtime exception it is believed to foul the 
> block state and lead to unpredictable behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12704) FBR may corrupt block state

2018-04-13 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12704:
--
Target Version/s: 2.8.5  (was: 2.8.4)

> FBR may corrupt block state
> ---
>
> Key: HDFS-12704
> URL: https://issues.apache.org/jira/browse/HDFS-12704
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> If FBR processing generates a runtime exception it is believed to foul the 
> block state and lead to unpredictable behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7967) Reduce the performance impact of the balancer

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437131#comment-16437131
 ] 

Junping Du commented on HDFS-7967:
--

[~kihwal] and [~daryn], do we need this patch on trunk/hadoop-3.x? If not, 
shall we make progress on branch-2.x only?

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch, 
> HDFS-7967.branch-2-1.patch, HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13281) Namenode#createFile should be /.reserved/raw/ aware.

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437122#comment-16437122
 ] 

Junping Du commented on HDFS-13281:
---

Hi [~shahrs87] and [~xiaochen], do we have a plan to get this patch in shortly? 
If not, I will move it to 2.8.5 to unblock the 2.8.4 release.

> Namenode#createFile should be /.reserved/raw/ aware.
> 
>
> Key: HDFS-13281
> URL: https://issues.apache.org/jira/browse/HDFS-13281
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: encryption
>Affects Versions: 2.8.3
>Reporter: Rushabh S Shah
>Assignee: Rushabh S Shah
>Priority: Critical
> Attachments: HDFS-13281.001.patch, HDFS-13281.002.branch-2.patch, 
> HDFS-13281.002.patch
>
>
> If I want to write to /.reserved/raw/ and that directory happens to 
> be in an EZ, then the namenode *should not* create an EDEK and should just copy 
> the raw bytes from the source.
>  Namenode#startFileInt should be /.reserved/raw/ aware.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12703) Exceptions are fatal to decommissioning monitor

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437110#comment-16437110
 ] 

Junping Du commented on HDFS-12703:
---

Hi Daryn, I see this issue has been quiet for a while. Can we move it to the next 
release given 2.8.4 is on the release track?

> Exceptions are fatal to decommissioning monitor
> ---
>
> Key: HDFS-12703
> URL: https://issues.apache.org/jira/browse/HDFS-12703
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If 
> an exception occurs, all decommissioning ceases until the NN is restarted.  
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
> task encounters an exception, subsequent executions are suppressed*.  The 
> monitor thread is alive but blocked waiting for an executor task that will 
> never come.  The code currently disposes of the future so the actual 
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.  
> Replication queue init after the transition to active will fix the under 
> replication of blocks on currently decommissioning nodes but future nodes 
> never decommission.  The standby must be bounced prior to failover – and 
> hopefully the error condition does not reoccur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437109#comment-16437109
 ] 

Junping Du commented on HDFS-12136:
---

Hi [~daryn], I see this issue has been quiet for a while. Can we move it to the next 
release given 2.8.4 is on the release track?

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times.  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.
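As a hedged illustration of the locking pattern being described (class, field and method names are assumptions, not the actual BlockSender code), the essential difference is whether the metafile I/O happens inside or outside the exclusive lock:
{code:java}
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.locks.ReentrantLock;

class LastChecksumReader {
  private final ReentrantLock datasetLock = new ReentrantLock();

  /** Anti-pattern: metafile I/O performed while holding the shared lock. */
  byte[] readUnderLock(String metaPath, int len) throws IOException {
    datasetLock.lock();
    try {
      return readTail(metaPath, len);   // every caller now serializes on disk I/O
    } finally {
      datasetLock.unlock();
    }
  }

  /** Preferred shape: resolve state under the lock, read after releasing it. */
  byte[] readOutsideLock(String metaPath, int len) throws IOException {
    final String resolved;
    datasetLock.lock();
    try {
      resolved = metaPath;              // only cheap in-memory lookups here
    } finally {
      datasetLock.unlock();
    }
    return readTail(resolved, len);     // disk I/O without the lock
  }

  private byte[] readTail(String path, int len) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
      int n = (int) Math.min(len, raf.length());
      byte[] buf = new byte[n];
      raf.seek(raf.length() - n);
      raf.readFully(buf);
      return buf;
    }
  }
}
{code}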



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12136) BlockSender performance regression due to volume scanner edge case

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12136:
--
Target Version/s: 2.8.4  (was: 2.8.3)

> BlockSender performance regression due to volume scanner edge case
> --
>
> Key: HDFS-12136
> URL: https://issues.apache.org/jira/browse/HDFS-12136
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-12136.branch-2.patch, HDFS-12136.trunk.patch
>
>
> HDFS-11160 attempted to fix a volume scan race for a file appended mid-scan 
> by reading the last checksum of finalized blocks within the {{BlockSender}} 
> ctor.  Unfortunately it's holding the exclusive dataset lock to open and read 
> the metafile multiple times.  Block sender instantiation becomes serialized.
> Performance completely collapses under heavy disk i/o utilization or high 
> xceiver activity.  Ex. lost node replication, balancing, or decommissioning.  
> The xceiver threads congest creating block senders and impair the heartbeat 
> processing that is contending for the same lock.  Combined with other lock 
> contention issues, pipelines break and nodes sporadically go dead.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13111) Close recovery may incorrectly mark blocks corrupt

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-13111:
--
Target Version/s: 2.8.4  (was: 2.8.3)

> Close recovery may incorrectly mark blocks corrupt
> --
>
> Key: HDFS-13111
> URL: https://issues.apache.org/jira/browse/HDFS-13111
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> Close recovery can leave a block marked corrupt until the next FBR arrives 
> from one of the DNs.  The reason is unclear, but it has happened multiple times 
> when a DN has I/O-saturated disks.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-7967) Reduce the performance impact of the balancer

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-7967:
-
Target Version/s: 2.8.4  (was: 2.8.3)

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch, 
> HDFS-7967.branch-2-1.patch, HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the node's storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.
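A hedged sketch of the round-robin idea in the last sentence (illustrative types only, not the balancer's real data structures):
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

class RoundRobinBlockPicker {
  static class Storage {
    final long remaining;           // bytes still free on this storage
    final Iterator<Long> blocks;    // block ids hosted on this storage
    Storage(long remaining, Iterator<Long> blocks) {
      this.remaining = remaining;
      this.blocks = blocks;
    }
  }

  /** Take up to n candidates, one block per storage per pass, fullest storages first. */
  static List<Long> pick(List<Storage> storages, int n) {
    List<Storage> ordered = new ArrayList<>(storages);
    ordered.sort(Comparator.comparingLong(s -> s.remaining));  // least remaining first
    List<Long> out = new ArrayList<>(n);
    boolean progress = true;
    while (out.size() < n && progress) {
      progress = false;
      for (Storage s : ordered) {
        if (out.size() == n) {
          break;
        }
        if (s.blocks.hasNext()) {
          out.add(s.blocks.next());   // no random skipping, no triplet scans
          progress = true;
        }
      }
    }
    return out;
  }
}
{code}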



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12703) Exceptions are fatal to decommissioning monitor

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12703:
--
Target Version/s: 2.8.4  (was: 2.8.3)

> Exceptions are fatal to decommissioning monitor
> ---
>
> Key: HDFS-12703
> URL: https://issues.apache.org/jira/browse/HDFS-12703
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> The {{DecommissionManager.Monitor}} runs as an executor scheduled task.  If 
> an exception occurs, all decommissioning ceases until the NN is restarted.  
> Per javadoc for {{executor#scheduleAtFixedRate}}: *If any execution of the 
> task encounters an exception, subsequent executions are suppressed*.  The 
> monitor thread is alive but blocked waiting for an executor task that will 
> never come.  The code currently disposes of the future so the actual 
> exception that aborted the task is gone.
> Failover is insufficient since the task is also likely dead on the standby.  
> Replication queue init after the transition to active will fix the under 
> replication of blocks on currently decommissioning nodes but future nodes 
> never decommission.  The standby must be bounced prior to failover – and 
> hopefully the error condition does not reoccur.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12704) FBR may corrupt block state

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12704:
--
Target Version/s: 2.8.4  (was: 2.8.3)

> FBR may corrupt block state
> ---
>
> Key: HDFS-12704
> URL: https://issues.apache.org/jira/browse/HDFS-12704
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> If FBR processing generates a runtime exception it is believed to foul the 
> block state and lead to unpredictable behavior.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12747) Lease monitor may infinitely loop on the same lease

2018-04-06 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16428151#comment-16428151
 ] 

Junping Du commented on HDFS-12747:
---

Moving to 2.8.4 as 2.8.3 has been released.

> Lease monitor may infinitely loop on the same lease
> ---
>
> Key: HDFS-12747
> URL: https://issues.apache.org/jira/browse/HDFS-12747
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Lease recovery incorrectly handles UC files if the last block is complete but 
> the penultimate block is committed.  "Incorrectly handles" is the euphemism for 
> "infinitely loops for days and leaves all abandoned streams open until 
> customers complain".
> The problem may manifest when:
> # Block1 is committed but seemingly never completed
> # Block2 is allocated
> # Lease recovery is initiated for block2
> # Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, 
> causing:
> #* {{commitOrCompleteLastBlock}} to mark block2 as complete
> #* 
> {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} 
> to throw {{IllegalStateException}} because the penultimate block1 is 
> "COMMITTED but not COMPLETE"
> # The next lease recovery results in an infinite loop.
> The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will 
> either init recovery and renew the lease, or remove the lease.  In the 
> described state it does neither.  The switch case will break out if the last 
> block is complete.  (The case statement ironically contains an assert).  
> Since nothing changed, the lease is still the “next” lease to be processed.  
> The lease monitor loops for 25ms on the same lease, sleeps for 2s, loops on 
> it again.
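A toy reproduction of the "no progress" case (plain Java with assumed names, not LeaseManager code) makes the failure mode concrete:
{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

public class LeaseLoopSketch {
  /** Stand-in for internalReleaseLease() in the broken state: the last block is
   *  complete, so it breaks out without renewing or removing the lease. */
  static boolean releaseLease(String lease) {
    return false;   // no state change
  }

  public static void main(String[] args) {
    Deque<String> expiredLeases = new ArrayDeque<>();
    expiredLeases.add("lease-with-committed-penultimate-block");

    long deadline = System.currentTimeMillis() + 25;   // per-pass time budget
    int attempts = 0;
    while (!expiredLeases.isEmpty() && System.currentTimeMillis() < deadline) {
      String lease = expiredLeases.peekFirst();
      if (releaseLease(lease)) {
        expiredLeases.removeFirst();   // normal path: progress is made
      } else {
        attempts++;                    // broken path: the same lease stays at the head
      }
    }
    System.out.println("spun " + attempts + " times on "
        + expiredLeases.peekFirst());
  }
}
{code}
In the real monitor the pass budget is the 25ms mentioned above, followed by the 2s sleep, so the same lease is revisited forever.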



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12747) Lease monitor may infinitely loop on the same lease

2018-04-06 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12747:
--
Target Version/s: 2.8.4  (was: 2.8.3)

> Lease monitor may infinitely loop on the same lease
> ---
>
> Key: HDFS-12747
> URL: https://issues.apache.org/jira/browse/HDFS-12747
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Lease recovery incorrectly handles UC files if the last block is complete but 
> the penultimate block is committed.  "Incorrectly handles" is the euphemism for 
> "infinitely loops for days and leaves all abandoned streams open until 
> customers complain".
> The problem may manifest when:
> # Block1 is committed but seemingly never completed
> # Block2 is allocated
> # Lease recovery is initiated for block2
> # Commit block synchronization invokes {{FSNamesystem#closeFileCommitBlocks}}, 
> causing:
> #* {{commitOrCompleteLastBlock}} to mark block2 as complete
> #* 
> {{finalizeINodeFileUnderConstruction}}/{{INodeFile.assertAllBlocksComplete}} 
> to throw {{IllegalStateException}} because the penultimate block1 is 
> "COMMITTED but not COMPLETE"
> # The next lease recovery results in an infinite loop.
> The {{LeaseManager}} expects that {{FSNamesystem#internalReleaseLease}} will 
> either init recovery and renew the lease, or remove the lease.  In the 
> described state it does neither.  The switch case will break out if the last 
> block is complete.  (The case statement ironically contains an assert).  
> Since nothing changed, the lease is still the “next” lease to be processed.  
> The lease monitor loops for 25ms on the same lease, sleeps for 2s, loops on 
> it again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-13174) hdfs mover -p /path times out after 20 min

2018-03-19 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16405746#comment-16405746
 ] 

Junping Du commented on HDFS-13174:
---

Dropping the fix version as no patch has been committed yet.

> hdfs mover -p /path times out after 20 min
> --
>
> Key: HDFS-13174
> URL: https://issues.apache.org/jira/browse/HDFS-13174
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.8.0, 2.7.4, 3.0.0-alpha2
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> HDFS-11015 introduced an iteration timeout in the Dispatcher.Source class 
> that is checked while dispatching the moves that the Balancer and the 
> Mover do. This timeout is hard-wired to 20 minutes.
> The Balancer works in iterations, so even if an iteration times out the 
> Balancer keeps running and does another iteration, and only fails if no 
> moves happened in a few iterations.
> The Mover, on the other hand, does not have iterations, so if moving a path 
> runs for more than 20 minutes the Mover stops with the following exception 
> reported to the console (line numbers might differ, as this exception came 
> from a CDH 5.12.1 installation):
> java.io.IOException: Block move timed out
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.receiveResponse(Dispatcher.java:382)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:328)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:186)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:956)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
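A minimal sketch of making the hard-wired value configurable (the property key below is hypothetical, not an existing Hadoop setting):
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

class BlockMoveTimeout {
  // Hypothetical key, for illustration only.
  static final String MOVE_TIMEOUT_KEY = "dfs.mover.block-move.timeout";
  static final long DEFAULT_TIMEOUT_MS = TimeUnit.MINUTES.toMillis(20);

  /** Read the timeout from configuration instead of hard-coding 20 minutes. */
  static long getTimeoutMs(Configuration conf) {
    return conf.getTimeDuration(MOVE_TIMEOUT_KEY, DEFAULT_TIMEOUT_MS,
        TimeUnit.MILLISECONDS);
  }
}
{code}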



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-13174) hdfs mover -p /path times out after 20 min

2018-03-19 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-13174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-13174:
--
Target Version/s: 3.1.0, 2.9.1, 3.0.1, 2.8.4, 2.7.6
   Fix Version/s: (was: 2.7.6)
  (was: 2.8.4)
  (was: 3.0.1)
  (was: 3.1.0)

> hdfs mover -p /path times out after 20 min
> --
>
> Key: HDFS-13174
> URL: https://issues.apache.org/jira/browse/HDFS-13174
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: balancer & mover
>Affects Versions: 2.8.0, 2.7.4, 3.0.0-alpha2
>Reporter: Istvan Fajth
>Assignee: Istvan Fajth
>Priority: Major
>
> HDFS-11015 introduced an iteration timeout in the Dispatcher.Source class 
> that is checked while dispatching the moves that the Balancer and the 
> Mover do. This timeout is hard-wired to 20 minutes.
> The Balancer works in iterations, so even if an iteration times out the 
> Balancer keeps running and does another iteration, and only fails if no 
> moves happened in a few iterations.
> The Mover, on the other hand, does not have iterations, so if moving a path 
> runs for more than 20 minutes the Mover stops with the following exception 
> reported to the console (line numbers might differ, as this exception came 
> from a CDH 5.12.1 installation):
> java.io.IOException: Block move timed out
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.receiveResponse(Dispatcher.java:382)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.dispatch(Dispatcher.java:328)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$PendingMove.access$2500(Dispatcher.java:186)
> at 
> org.apache.hadoop.hdfs.server.balancer.Dispatcher$1.run(Dispatcher.java:956)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-15 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293555#comment-16293555
 ] 

Junping Du commented on HDFS-12920:
---

bq. An alternative to reverting the change is to deprecate the old property and 
create a new one that understands time units, as was raised in that JIRA. If 
specifying units breaks rolling upgrades, then what is the point of adding 
units, ever?
That is also a possible approach. We can either keep the default values for 
existing properties or start using new properties and deprecate the previous 
ones.

bq. So another workaround is to have at least two tarballs on HDFS, one that 
uses 3.x and one that uses 2.x. The 3.x site configs request the 3.x tarball 
and the 2.x site configs request the 2.x tarball. When the job submitter client 
upgrades to use 3.x jars, it can also upgrade to 3.x configs to start running 
the job with 3.x as well.
As we discussed offline, if we explicitly package these configs into the tarball, 
then we may not hit this issue, since each version of the tar ball and its 
configuration will match each other in the end. However, some users may not have 
followed this practice before and may not afterwards. Also, managing 
configurations in different places (cluster setup, MR tar ball, job submission, 
etc.) is complicated. Maybe it is easier to fix the issue here instead of in the 
tarball configuration?

bq. Junping Du, does presence of any unit-suffixed values in the config file 
cause this failure?
Hi [~arpitagarwal], the unit-suffixed values are now the defaults (in 
hdfs-default.xml) in 3.x. A job submitted against an old version MR tar ball will 
load the new default values provided by the new hadoop deployment and will get 
stuck with the exception I put above.
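To make the incompatibility concrete, a small sketch of the two read paths follows; the property name is one of the defaults that gained a unit suffix and is used here for illustration only:
{code:java}
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.conf.Configuration;

public class TimeUnitDefaultSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration(false);
    conf.set("dfs.client.datanode-restart.timeout", "30s");   // 3.x style default

    // A 3.x reader parses the suffix fine:
    long seconds = conf.getTimeDuration(
        "dfs.client.datanode-restart.timeout", 30, TimeUnit.SECONDS);
    System.out.println("3.x reader sees " + seconds + " seconds");

    // A 2.x reader from the old MR tar ball still calls getLong() and dies:
    try {
      conf.getLong("dfs.client.datanode-restart.timeout", 30);
    } catch (NumberFormatException e) {
      System.out.println("2.x reader fails: " + e.getMessage());
    }
  }
}
{code}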

> HDFS default value change (with adding time unit) breaks old version MR 
> tarball work with new version (3.0) of hadoop
> -
>
> Key: HDFS-12920
> URL: https://issues.apache.org/jira/browse/HDFS-12920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Junping Du
>Priority: Blocker
>
> After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
> RC1 and ran the job with the following errors:
> {noformat}
> 2017-12-12 13:29:06,824 INFO [main] 
> org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
> {noformat}
> This is because of HDFS-10845: we added time units to hdfs-default.xml, but 
> they cannot be recognized by old version MR jars. 
> This breaks our rolling upgrade story, so it should be marked as a blocker.
> A quick workaround is to add values in hdfs-site.xml with all time units 
> removed. But the right way may be to revert HDFS-10845 (and get rid of the 
> noisy warnings).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-13 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288948#comment-16288948
 ] 

Junping Du commented on HDFS-12920:
---

bq. So the right way should be to revert HDFS-10845 and get rid of noisy 
warnings as you suggested. If we are all agreed on on this way, I will attach 
the patch to make this changed.
Thanks [~linyiqun] for the quick response. I am +1 on going this way.

bq. If we're not going to change the config properties, the client needs to 
load 3.x config files during rolling upgrades, then this isn't worth the hassle.
Actually, it could be a serious issue for an old client (provided by the MR tarball 
via the distributed cache) to work with a new 3.x config that includes incompatible 
default values. The old client cannot recognize the new value for a known property 
(ending with "s", etc.) and throws an exception that ends the job, which means the 
app cannot run successfully during an upgrade from 2.9 to 3.0.0.

> HDFS default value change (with adding time unit) breaks old version MR 
> tarball work with new version (3.0) of hadoop
> -
>
> Key: HDFS-12920
> URL: https://issues.apache.org/jira/browse/HDFS-12920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Junping Du
>Priority: Blocker
>
> After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
> RC1 and ran the job with the following errors:
> {noformat}
> 2017-12-12 13:29:06,824 INFO [main] 
> org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
> {noformat}
> This is because of HDFS-10845: we added time units to hdfs-default.xml, but 
> they cannot be recognized by old version MR jars. 
> This breaks our rolling upgrade story, so it should be marked as a blocker.
> A quick workaround is to add values in hdfs-site.xml with all time units 
> removed. But the right way may be to revert HDFS-10845 (and get rid of the 
> noisy warnings).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-12 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16288568#comment-16288568
 ] 

Junping Du commented on HDFS-12920:
---

CC [~andrew.wang], [~linyiqun], [~chris.douglas].

> HDFS default value change (with adding time unit) breaks old version MR 
> tarball work with new version (3.0) of hadoop
> -
>
> Key: HDFS-12920
> URL: https://issues.apache.org/jira/browse/HDFS-12920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Junping Du
>Priority: Blocker
>
> After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
> RC1 and ran the job with the following errors:
> {noformat}
> 2017-12-12 13:29:06,824 INFO [main] 
> org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
>   at 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
> {noformat}
> This is because of HDFS-10845: we added time units to hdfs-default.xml, but 
> they cannot be recognized by old version MR jars. 
> This breaks our rolling upgrade story, so it should be marked as a blocker.
> A quick workaround is to add values in hdfs-site.xml with all time units 
> removed. But the right way may be to revert HDFS-10845 (and get rid of the 
> noisy warnings).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12920:
--
Description: 
After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
RC1 and ran the job with the following errors:
{noformat}
2017-12-12 13:29:06,824 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; 
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
{noformat}
This is because of HDFS-10845: we added time units to hdfs-default.xml, but they 
cannot be recognized by old version MR jars. 
This breaks our rolling upgrade story, so it should be marked as a blocker.
A quick workaround is to add values in hdfs-site.xml with all time units removed. 
But the right way may be to revert HDFS-10845 (and get rid of the noisy warnings).

  was:
After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
RC1 and ran the job with the following errors:
{noformat}
2017-12-12 13:29:06,824 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; 
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
{noformat}
This is because of HDFS-9821: we added time units to hdfs-default.xml, but they 
cannot be recognized by old version MR jars. 
This breaks our rolling upgrade story, so it should be marked as a blocker.
A quick workaround is to add values in hdfs-site.xml with all time units removed. 
But the right way may be to revert HDFS-9821.


> HDFS default value change (with adding time unit) breaks old version MR 
> tarball work with new version (3.0) of hadoop
> -
>
> Key: HDFS-12920
> URL: https://issues.apache.org/jira/browse/HDFS-12920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Junping Du
>Priority: Blocker
>
> After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
> RC1 and ran the job with the following errors:
> {noformat}
> 2017-12-12 13:29:06,824 INFO [main] 
> org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lan

[jira] [Updated] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-12 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12920:
--
Description: 
After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
RC1 and ran the job with the following errors:
{noformat}
2017-12-12 13:29:06,824 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; 
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
{noformat}
This is because of HDFS-9821: we added time units to hdfs-default.xml, but they 
cannot be recognized by old version MR jars. 
This breaks our rolling upgrade story, so it should be marked as a blocker.
A quick workaround is to add values in hdfs-site.xml with all time units removed. 
But the right way may be to revert HDFS-9821.

  was:
I tried to deploy the 2.9.0 tar ball with 3.0.0 RC1 and ran the job with the 
following errors:
{noformat}
2017-12-12 13:29:06,824 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; 
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
{noformat}
This is because of HDFS-9821: we added time units to hdfs-default.xml, but they 
cannot be recognized by old version MR jars. 
This breaks our rolling upgrade story, so it should be marked as a blocker.
A quick workaround is to add values in hdfs-site.xml with all time units removed. 
But the right way may be to revert HDFS-9821.


> HDFS default value change (with adding time unit) breaks old version MR 
> tarball work with new version (3.0) of hadoop
> -
>
> Key: HDFS-12920
> URL: https://issues.apache.org/jira/browse/HDFS-12920
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: Junping Du
>Priority: Blocker
>
> After HADOOP-15059 got resolved, I tried to deploy the 2.9.0 tar ball with 3.0.0 
> RC1 and ran the job with the following errors:
> {noformat}
> 2017-12-12 13:29:06,824 INFO [main] 
> org.apache.hadoop.service.AbstractService: Service 
> org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; cause: 
> org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
> java.lang.NumberFormatException: For input string: "30s"
> org.apache.hadoop

[jira] [Created] (HDFS-12920) HDFS default value change (with adding time unit) breaks old version MR tarball work with new version (3.0) of hadoop

2017-12-12 Thread Junping Du (JIRA)
Junping Du created HDFS-12920:
-

 Summary: HDFS default value change (with adding time unit) breaks 
old version MR tarball work with new version (3.0) of hadoop
 Key: HDFS-12920
 URL: https://issues.apache.org/jira/browse/HDFS-12920
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Junping Du
Priority: Blocker


I tried to deploy the 2.9.0 tar ball with 3.0.0 RC1 and ran the job with the 
following errors:
{noformat}
2017-12-12 13:29:06,824 INFO [main] org.apache.hadoop.service.AbstractService: 
Service org.apache.hadoop.mapreduce.v2.app.MRAppMaster failed in state INITED; 
cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: 
java.lang.NumberFormatException: For input string: "30s"
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:542)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$2.call(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.callWithJobClassLoader(MRAppMaster.java:1764)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.createOutputCommitter(MRAppMaster.java:522)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.serviceInit(MRAppMaster.java:308)
at 
org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster$5.run(MRAppMaster.java:1722)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.initAndStartAppMaster(MRAppMaster.java:1719)
at 
org.apache.hadoop.mapreduce.v2.app.MRAppMaster.main(MRAppMaster.java:1650)
{noformat}
This is because of HDFS-9821: we added time units to hdfs-default.xml, but they 
cannot be recognized by old version MR jars. 
This breaks our rolling upgrade story, so it should be marked as a blocker.
A quick workaround is to add values in hdfs-site.xml with all time units removed. 
But the right way may be to revert HDFS-9821.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12832) INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to NameNode exit

2017-12-04 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12832:
--
Fix Version/s: (was: 2.8.4)
   2.8.3

> INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to 
> NameNode exit
> 
>
> Key: HDFS-12832
> URL: https://issues.apache.org/jira/browse/HDFS-12832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.4, 3.0.0-beta1
>Reporter: DENG FEI
>Assignee: Konstantin Shvachko
>Priority: Critical
> Fix For: 2.8.3, 2.7.5, 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: HDFS-12832-branch-2.002.patch, 
> HDFS-12832-branch-2.7.002.patch, HDFS-12832-trunk-001.patch, 
> HDFS-12832.002.patch, exception.log
>
>
> {code:title=INode.java|borderStyle=solid}
> public String getFullPathName() {
> // Get the full path name of this inode.
> if (isRoot()) {
>   return Path.SEPARATOR;
> }
> // compute size of needed bytes for the path
> int idx = 0;
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   // add component + delimiter (if not tail component)
>   idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
> }
> byte[] path = new byte[idx];
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   if (inode != this) {
> path[--idx] = Path.SEPARATOR_CHAR;
>   }
>   byte[] name = inode.getLocalNameBytes();
>   idx -= name.length;
>   System.arraycopy(name, 0, path, idx, name.length);
> }
> return DFSUtil.bytes2String(path);
>   }
> {code}
> We found ArrayIndexOutOfBoundsException at 
> _{color:#707070}System.arraycopy(name, 0, path, idx, name.length){color}_ 
> while the ReplicaMonitor was working, and the NameNode quit.
> It seems the two loops are not synchronized and the path's length changed in between.
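For illustration only (this is not the committed fix), one way to avoid the out-of-bounds write is to snapshot the parent chain once and build the path from that snapshot, so there is no separately computed length that can go stale:
{code:java}
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PathNameSketch {
  interface Node {
    Node getParent();
    byte[] getLocalNameBytes();
  }

  static String fullPathName(Node inode) {
    // Snapshot the name of every ancestor exactly once.
    List<byte[]> names = new ArrayList<>();
    for (Node n = inode; n != null; n = n.getParent()) {
      names.add(n.getLocalNameBytes());
    }
    // Build from the snapshot; a concurrent rename/move can no longer change
    // the total length between the "measure" and "copy" phases.
    StringBuilder sb = new StringBuilder();
    for (int i = names.size() - 1; i >= 0; i--) {
      byte[] name = names.get(i);
      if (name.length == 0) {
        continue;                       // the root inode has an empty name
      }
      sb.append('/').append(new String(name, StandardCharsets.UTF_8));
    }
    return sb.length() == 0 ? "/" : sb.toString();
  }
}
{code}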



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12832) INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to NameNode exit

2017-12-04 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277839#comment-16277839
 ] 

Junping Du commented on HDFS-12832:
---

Merge to branch-2.8.3 as well.

> INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to 
> NameNode exit
> 
>
> Key: HDFS-12832
> URL: https://issues.apache.org/jira/browse/HDFS-12832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.4, 3.0.0-beta1
>Reporter: DENG FEI
>Assignee: Konstantin Shvachko
>Priority: Critical
> Fix For: 2.8.3, 2.7.5, 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: HDFS-12832-branch-2.002.patch, 
> HDFS-12832-branch-2.7.002.patch, HDFS-12832-trunk-001.patch, 
> HDFS-12832.002.patch, exception.log
>
>
> {code:title=INode.java|borderStyle=solid}
> public String getFullPathName() {
> // Get the full path name of this inode.
> if (isRoot()) {
>   return Path.SEPARATOR;
> }
> // compute size of needed bytes for the path
> int idx = 0;
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   // add component + delimiter (if not tail component)
>   idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
> }
> byte[] path = new byte[idx];
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   if (inode != this) {
> path[--idx] = Path.SEPARATOR_CHAR;
>   }
>   byte[] name = inode.getLocalNameBytes();
>   idx -= name.length;
>   System.arraycopy(name, 0, path, idx, name.length);
> }
> return DFSUtil.bytes2String(path);
>   }
> {code}
> We found ArrayIndexOutOfBoundsException at 
> _{color:#707070}System.arraycopy(name, 0, path, idx, name.length){color}_ 
> while the ReplicaMonitor was working, and the NameNode quit.
> It seems the two loops are not synchronized and the path's length changed in between.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) Delete copy-on-truncate block along with the original block, when deleting a file being truncated

2017-12-04 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277838#comment-16277838
 ] 

Junping Du commented on HDFS-12638:
---

Merge to branch-2.8.3 as well.

> Delete copy-on-truncate block along with the original block, when deleting a 
> file being truncated
> -
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Assignee: Konstantin Shvachko
>Priority: Blocker
> Fix For: 2.8.3, 2.7.5, 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> HDFS-12638.003.patch, HDFS-12638.004.patch, OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection 
> passed in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. Looking through the history, I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check on 
> whether BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}
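A hedged sketch of the kind of guard the report points at (illustrative interfaces, not the actual patch):
{code:java}
class ReplicationGuardSketch {
  interface BlockInfo {
    Object getBlockCollection();   // null once the owning file is gone
  }

  /** Skip scheduling replication work for blocks whose file was deleted. */
  static boolean isStillReplicable(BlockInfo block) {
    // HDFS-9754 dropped a null check like this; with orphaned blocks the
    // collection can be null and chooseTargets() then dies with an NPE,
    // killing the ReplicationMonitor thread.
    return block != null && block.getBlockCollection() != null;
  }
}
{code}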



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12638) Delete copy-on-truncate block along with the original block, when deleting a file being truncated

2017-12-04 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12638:
--
Fix Version/s: (was: 2.8.4)
   2.8.3

> Delete copy-on-truncate block along with the original block, when deleting a 
> file being truncated
> -
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Assignee: Konstantin Shvachko
>Priority: Blocker
> Fix For: 2.8.3, 2.7.5, 3.1.0, 2.10.0, 2.9.1, 3.0.1
>
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> HDFS-12638.003.patch, HDFS-12638.004.patch, OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection 
> passed in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. Looking through the history, I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check on 
> whether BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269808#comment-16269808
 ] 

Junping Du commented on HDFS-12638:
---

Ping... [~szetszwo] and [~jingzhao], do you agree with [~shv]'s proposal above 
to revert HDFS-9754? Or do you have some other proposal here? The 2.8.3 release 
is pending on this.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exited due to an NPE. I can confirm that the BlockCollection 
> passed in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. Looking through the history, I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check on 
> whether BlockCollection is null.
> NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12832) INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to NameNode exit

2017-11-28 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12832:
--
Target Version/s: 2.8.3, 2.7.5, 3.0.0, 3.1.0, 2.9.1  (was: 2.8.3, 2.7.5, 
3.0.0, 2.9.1)

> INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to 
> NameNode exit
> 
>
> Key: HDFS-12832
> URL: https://issues.apache.org/jira/browse/HDFS-12832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.4, 3.0.0-beta1
>Reporter: DENG FEI
>Assignee: Konstantin Shvachko
>Priority: Critical
>  Labels: release-blocker
> Attachments: HDFS-12832-branch-2.002.patch, 
> HDFS-12832-branch-2.7.002.patch, HDFS-12832-trunk-001.patch, 
> HDFS-12832.002.patch, exception.log
>
>
> {code:title=INode.java|borderStyle=solid}
> public String getFullPathName() {
> // Get the full path name of this inode.
> if (isRoot()) {
>   return Path.SEPARATOR;
> }
> // compute size of needed bytes for the path
> int idx = 0;
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   // add component + delimiter (if not tail component)
>   idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
> }
> byte[] path = new byte[idx];
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   if (inode != this) {
> path[--idx] = Path.SEPARATOR_CHAR;
>   }
>   byte[] name = inode.getLocalNameBytes();
>   idx -= name.length;
>   System.arraycopy(name, 0, path, idx, name.length);
> }
> return DFSUtil.bytes2String(path);
>   }
> {code}
> We found ArrayIndexOutOfBoundsException at 
> _{color:#707070}System.arraycopy(name, 0, path, idx, name.length){color}_ 
> while the ReplicaMonitor was working, and the NameNode quit.
> It seems the two loops are not synchronized and the path's length changed in between.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12832) INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to NameNode exit

2017-11-28 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12832:
--
Target Version/s: 2.8.3, 2.7.5, 3.0.0, 2.9.1  (was: 2.7.5)

> INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to 
> NameNode exit
> 
>
> Key: HDFS-12832
> URL: https://issues.apache.org/jira/browse/HDFS-12832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.4, 3.0.0-beta1
>Reporter: DENG FEI
>Assignee: Konstantin Shvachko
>Priority: Critical
>  Labels: release-blocker
> Attachments: HDFS-12832-branch-2.002.patch, 
> HDFS-12832-branch-2.7.002.patch, HDFS-12832-trunk-001.patch, 
> HDFS-12832.002.patch, exception.log
>
>
> {code:title=INode.java|borderStyle=solid}
> public String getFullPathName() {
> // Get the full path name of this inode.
> if (isRoot()) {
>   return Path.SEPARATOR;
> }
> // compute size of needed bytes for the path
> int idx = 0;
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   // add component + delimiter (if not tail component)
>   idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
> }
> byte[] path = new byte[idx];
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   if (inode != this) {
> path[--idx] = Path.SEPARATOR_CHAR;
>   }
>   byte[] name = inode.getLocalNameBytes();
>   idx -= name.length;
>   System.arraycopy(name, 0, path, idx, name.length);
> }
> return DFSUtil.bytes2String(path);
>   }
> {code}
> We found ArrayIndexOutOfBoundsException at 
> _{color:#707070}System.arraycopy(name, 0, path, idx, name.length){color}_ 
> while the ReplicaMonitor was working, and the NameNode quit.
> It seems the two loops are not synchronized and the path's length changed in between.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12832) INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to NameNode exit

2017-11-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269792#comment-16269792
 ] 

Junping Du commented on HDFS-12832:
---

Thanks [~shv] for the patch and [~zhz] for the review. I think we also need this 
fix in branch-2.8 (branch-2.8.3) and branch-2.9. Adding 2.8.3, 2.9.1 and 3.0.0 
to the target versions.

> INode.getFullPathName may throw ArrayIndexOutOfBoundsException lead to 
> NameNode exit
> 
>
> Key: HDFS-12832
> URL: https://issues.apache.org/jira/browse/HDFS-12832
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.7.4, 3.0.0-beta1
>Reporter: DENG FEI
>Assignee: Konstantin Shvachko
>Priority: Critical
>  Labels: release-blocker
> Attachments: HDFS-12832-branch-2.002.patch, 
> HDFS-12832-branch-2.7.002.patch, HDFS-12832-trunk-001.patch, 
> HDFS-12832.002.patch, exception.log
>
>
> {code:title=INode.java|borderStyle=solid}
> public String getFullPathName() {
> // Get the full path name of this inode.
> if (isRoot()) {
>   return Path.SEPARATOR;
> }
> // compute size of needed bytes for the path
> int idx = 0;
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   // add component + delimiter (if not tail component)
>   idx += inode.getLocalNameBytes().length + (inode != this ? 1 : 0);
> }
> byte[] path = new byte[idx];
> for (INode inode = this; inode != null; inode = inode.getParent()) {
>   if (inode != this) {
> path[--idx] = Path.SEPARATOR_CHAR;
>   }
>   byte[] name = inode.getLocalNameBytes();
>   idx -= name.length;
>   System.arraycopy(name, 0, path, idx, name.length);
> }
> return DFSUtil.bytes2String(path);
>   }
> {code}
> We found ArrayIndexOutOfBoundsException at 
> _{color:#707070}System.arraycopy(name, 0, path, idx, name.length){color}_ 
> when the ReplicaMonitor works, and the NameNode will quit.
> It seems the two loops are not synchronized, so the path's length changes between them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12638:
--
Target Version/s: 2.8.3, 3.0.0, 2.9.1  (was: 2.8.3)

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. By reviewing the history I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for 
> whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}
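
The NPE comes from dereferencing a BlockCollection that has already been removed (for example by a delete or truncate) by the time the queued replication work runs. A hedged, self-contained sketch of the kind of null guard being discussed is below; the types are simplified stand-ins, not the actual BlockManager/ReplicationWork code.

{code:java}
// Simplified stand-ins; the real BlockManager/ReplicationWork types are not reproduced here.
class ReplicationWorkSketch {
  static class BlockCollection { String name; }
  static class Block { BlockCollection bc; }   // bc is null if the owning file was removed

  static boolean chooseTargets(Block block) {
    BlockCollection bc = block.bc;
    if (bc == null) {
      // The file no longer exists (deleted/truncated after the block was queued);
      // skip this work item instead of letting the ReplicationMonitor thread die on an NPE.
      return false;
    }
    // ... choose datanode targets for bc here ...
    return true;
  }

  public static void main(String[] args) {
    Block orphan = new Block();                  // simulates an orphaned queued block
    System.out.println(chooseTargets(orphan));   // prints false, no NPE
  }
}
{code}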



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263649#comment-16263649
 ] 

Junping Du commented on HDFS-12638:
---

This should be a blocker for 2.9.1 and 3.0.0 as well. Adding them to the target 
versions for tracking.

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. By reviewing the history I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for 
> whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12638) NameNode exits due to ReplicationMonitor thread received Runtime exception in ReplicationWork#chooseTargets

2017-11-22 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16263522#comment-16263522
 ] 

Junping Du commented on HDFS-12638:
---

[~jingzhao] and [~szetszwo], any comments on Konstantin's proposal above?

> NameNode exits due to ReplicationMonitor thread received Runtime exception in 
> ReplicationWork#chooseTargets
> ---
>
> Key: HDFS-12638
> URL: https://issues.apache.org/jira/browse/HDFS-12638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.2
>Reporter: Jiandan Yang 
>Priority: Blocker
> Attachments: HDFS-12638-branch-2.8.2.001.patch, HDFS-12638.002.patch, 
> OphanBlocksAfterTruncateDelete.jpg
>
>
> The active NameNode exits due to an NPE. I can confirm that the BlockCollection passed 
> in when creating ReplicationWork is null, but I do not know why 
> BlockCollection is null. By reviewing the history I found that 
> [HDFS-9754|https://issues.apache.org/jira/browse/HDFS-9754] removed the check for 
> whether BlockCollection is null.
> The NN logs are as follows:
> {code:java}
> 2017-10-11 16:29:06,161 ERROR [ReplicationMonitor] 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: 
> ReplicationMonitor thread received Runtime exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:55)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWorkForBlocks(BlockManager.java:1532)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReplicationWork(BlockManager.java:1491)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:3792)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$ReplicationMonitor.run(BlockManager.java:3744)
> at java.lang.Thread.run(Thread.java:834)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12681) Fold HdfsLocatedFileStatus into HdfsFileStatus

2017-11-15 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254332#comment-16254332
 ] 

Junping Du commented on HDFS-12681:
---

bq. We hope that clients would request locations in the first RPC call, rather 
than asking for a FileStatus and then requesting its block locations.
I am not sure whether this assumption holds elsewhere, but at least our MR 
jobs are broken here with an NPE. Do we have a quick fix for this problem? 
Otherwise, we may need to consider reverting this (and maybe HDFS-7878).

I also notice this is marked as an incompatible change - any reason it did not get into 
3.0 but instead went to 3.1, besides the time window of the 3.0 release?

> Fold HdfsLocatedFileStatus into HdfsFileStatus
> --
>
> Key: HDFS-12681
> URL: https://issues.apache.org/jira/browse/HDFS-12681
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Chris Douglas
>Assignee: Chris Douglas
>Priority: Minor
> Fix For: 3.1.0
>
> Attachments: HDFS-12681.00.patch, HDFS-12681.01.patch, 
> HDFS-12681.02.patch, HDFS-12681.03.patch, HDFS-12681.04.patch, 
> HDFS-12681.05.patch, HDFS-12681.06.patch, HDFS-12681.07.patch, 
> HDFS-12681.08.patch, HDFS-12681.09.patch, HDFS-12681.10.patch
>
>
> {{HdfsLocatedFileStatus}} is a subtype of {{HdfsFileStatus}}, but not of 
> {{LocatedFileStatus}}. Conversion requires copying common fields and shedding 
> unknown data. It would be cleaner and sufficient for {{HdfsFileStatus}} to 
> extend {{LocatedFileStatus}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12420) Add an option to disallow 'namenode format -force'

2017-10-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12420:
--
Fix Version/s: (was: 2.8.2)
   2.8.3

> Add an option to disallow 'namenode format -force'
> --
>
> Key: HDFS-12420
> URL: https://issues.apache.org/jira/browse/HDFS-12420
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ajay Kumar
>Assignee: Ajay Kumar
> Fix For: 2.9.0, 2.8.3, 2.7.5, 3.0.0, 3.1.0
>
> Attachments: HDFS-12420.01.patch, HDFS-12420.02.patch, 
> HDFS-12420.03.patch, HDFS-12420.04.patch, HDFS-12420.05.patch, 
> HDFS-12420.06.patch, HDFS-12420.07.patch, HDFS-12420.08.patch, 
> HDFS-12420.09.patch, HDFS-12420.10.patch, HDFS-12420.11.patch, 
> HDFS-12420.12.patch
>
>
> Support for disabling NameNode format to avoid accidental formatting of the 
> NameNode in a production cluster. If someone really wants to delete the 
> complete fsImage, they can first delete the metadata dir and then run {code} 
> hdfs namenode -format{code} manually.
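
As a rough illustration of what such an option could look like on the server side, here is a hedged sketch of a boolean guard checked before formatting. The configuration key name and the plain java.util.Properties lookup are assumptions for illustration only; the actual key and plumbing added by the patch may differ.

{code:java}
import java.util.Properties;

// Illustration only: the key name "dfs.namenode.support.allow.format" and the use of
// java.util.Properties are assumptions for this sketch, not the patch's actual code.
class FormatGuardSketch {
  static void checkFormatAllowed(Properties conf) {
    boolean allowed = Boolean.parseBoolean(
        conf.getProperty("dfs.namenode.support.allow.format", "true"));
    if (!allowed) {
      throw new IllegalStateException(
          "NameNode format is disabled by configuration; delete the metadata "
          + "directories manually if you really intend to reformat.");
    }
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("dfs.namenode.support.allow.format", "false");
    checkFormatAllowed(conf);   // throws, simulating a protected production cluster
  }
}
{code}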



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12420) Add an option to disallow 'namenode format -force'

2017-10-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210386#comment-16210386
 ] 

Junping Du commented on HDFS-12420:
---

Dropping 2.8.2 from the fix versions as the patch was only committed to branch-2.8, 
not branch-2.8.2.

> Add an option to disallow 'namenode format -force'
> --
>
> Key: HDFS-12420
> URL: https://issues.apache.org/jira/browse/HDFS-12420
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ajay Kumar
>Assignee: Ajay Kumar
> Fix For: 2.9.0, 2.8.3, 2.7.5, 3.0.0, 3.1.0
>
> Attachments: HDFS-12420.01.patch, HDFS-12420.02.patch, 
> HDFS-12420.03.patch, HDFS-12420.04.patch, HDFS-12420.05.patch, 
> HDFS-12420.06.patch, HDFS-12420.07.patch, HDFS-12420.08.patch, 
> HDFS-12420.09.patch, HDFS-12420.10.patch, HDFS-12420.11.patch, 
> HDFS-12420.12.patch
>
>
> Support for disabling NameNode format to avoid accidental formatting of the 
> NameNode in a production cluster. If someone really wants to delete the 
> complete fsImage, they can first delete the metadata dir and then run {code} 
> hdfs namenode -format{code} manually.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12495) TestPendingInvalidateBlock#testPendingDeleteUnknownBlocks fails intermittently

2017-10-18 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210288#comment-16210288
 ] 

Junping Du commented on HDFS-12495:
---

I don't see the patch landed in branch-2.8.2. Dropping 2.8.2 from the fix versions.

> TestPendingInvalidateBlock#testPendingDeleteUnknownBlocks fails intermittently
> --
>
> Key: HDFS-12495
> URL: https://issues.apache.org/jira/browse/HDFS-12495
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Badger
>Assignee: Eric Badger
>  Labels: flaky-test
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.0.0, 3.1.0
>
> Attachments: HDFS-12495.001.patch, HDFS-12495.002.patch
>
>
> {noformat}
> java.net.BindException: Problem binding to [localhost:36701] 
> java.net.BindException: Address already in use; For more details see:  
> http://wiki.apache.org/hadoop/BindException
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:546)
>   at org.apache.hadoop.ipc.Server$Listener.(Server.java:955)
>   at org.apache.hadoop.ipc.Server.(Server.java:2655)
>   at org.apache.hadoop.ipc.RPC$Server.(RPC.java:968)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:367)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:342)
>   at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:810)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initIpcServer(DataNode.java:954)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1314)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:481)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2611)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2499)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2546)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartDataNode(MiniDFSCluster.java:2152)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeleteUnknownBlocks(TestPendingInvalidateBlock.java:175)
> {noformat}
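
The flakiness here is the classic re-bind race: the test restarts a DataNode on a fixed port that may still be in use. A common test-side mitigation, shown below only as an illustration and not as the fix in the attached patches, is to bind to port 0 and let the OS hand out a free ephemeral port.

{code:java}
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Illustration only: port 0 asks the OS for a free ephemeral port, avoiding
// "Address already in use" flakiness when tests restart servers quickly.
class EphemeralPortExample {
  public static void main(String[] args) throws Exception {
    try (ServerSocket ss = new ServerSocket()) {
      ss.bind(new InetSocketAddress("localhost", 0));   // 0 = ephemeral port
      System.out.println("bound to free port " + ss.getLocalPort());
    }
  }
}
{code}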



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12495) TestPendingInvalidateBlock#testPendingDeleteUnknownBlocks fails intermittently

2017-10-18 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12495:
--
Fix Version/s: (was: 2.8.2)

> TestPendingInvalidateBlock#testPendingDeleteUnknownBlocks fails intermittently
> --
>
> Key: HDFS-12495
> URL: https://issues.apache.org/jira/browse/HDFS-12495
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
>Reporter: Eric Badger
>Assignee: Eric Badger
>  Labels: flaky-test
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.3, 3.0.0, 3.1.0
>
> Attachments: HDFS-12495.001.patch, HDFS-12495.002.patch
>
>
> {noformat}
> java.net.BindException: Problem binding to [localhost:36701] 
> java.net.BindException: Address already in use; For more details see:  
> http://wiki.apache.org/hadoop/BindException
>   at sun.nio.ch.Net.bind0(Native Method)
>   at sun.nio.ch.Net.bind(Net.java:433)
>   at sun.nio.ch.Net.bind(Net.java:425)
>   at 
> sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223)
>   at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
>   at org.apache.hadoop.ipc.Server.bind(Server.java:546)
>   at org.apache.hadoop.ipc.Server$Listener.(Server.java:955)
>   at org.apache.hadoop.ipc.Server.(Server.java:2655)
>   at org.apache.hadoop.ipc.RPC$Server.(RPC.java:968)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server.(ProtobufRpcEngine.java:367)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:342)
>   at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:810)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.initIpcServer(DataNode.java:954)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:1314)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.(DataNode.java:481)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2611)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2499)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:2546)
>   at 
> org.apache.hadoop.hdfs.MiniDFSCluster.restartDataNode(MiniDFSCluster.java:2152)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.TestPendingInvalidateBlock.testPendingDeleteUnknownBlocks(TestPendingInvalidateBlock.java:175)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12473) Change hosts JSON file format

2017-09-21 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12473:
--
Fix Version/s: (was: 2.8.3)

> Change hosts JSON file format
> -
>
> Key: HDFS-12473
> URL: https://issues.apache.org/jira/browse/HDFS-12473
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.2, 3.1.0
>
> Attachments: HDFS-12473-2.patch, HDFS-12473-3.patch, 
> HDFS-12473-4.patch, HDFS-12473-5.patch, HDFS-12473-6.patch, 
> HDFS-12473-branch-2.patch, HDFS-12473.patch
>
>
> The existing host JSON file format doesn't have a top-level token.
> {noformat}
>   {"hostName": "host1"}
>   {"hostName": "host2", "upgradeDomain": "ud0"}
>   {"hostName": "host3", "adminState": "DECOMMISSIONED"}
>   {"hostName": "host4", "upgradeDomain": "ud2", "adminState": 
> "DECOMMISSIONED"}
>   {"hostName": "host5", "port": 8090}
>   {"hostName": "host6", "adminState": "IN_MAINTENANCE"}
>   {"hostName": "host7", "adminState": "IN_MAINTENANCE", 
> "maintenanceExpireTimeInMS": "112233"}
> {noformat}
> Instead, to conform with the JSON standard it should be like
> {noformat}
> [
>   {"hostName": "host1"},
>   {"hostName": "host2", "upgradeDomain": "ud0"},
>   {"hostName": "host3", "adminState": "DECOMMISSIONED"},
>   {"hostName": "host4", "upgradeDomain": "ud2", "adminState": 
> "DECOMMISSIONED"},
>   {"hostName": "host5", "port": 8090},
>   {"hostName": "host6", "adminState": "IN_MAINTENANCE"},
>   {"hostName": "host7", "adminState": "IN_MAINTENANCE", 
> "maintenanceExpireTimeInMS": "112233"}
> ]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12372) Document the impact of HDFS-11069 (Tighten the authorization of datanode RPC)

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12372:
--
Target Version/s: 2.9.0, 2.8.3, 3.0.0  (was: 2.9.0, 2.8.2, 3.0.0)

> Document the impact of HDFS-11069 (Tighten the authorization of datanode RPC)
> -
>
> Key: HDFS-12372
> URL: https://issues.apache.org/jira/browse/HDFS-12372
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 2.8.0, 2.9.0, 2.7.4, 3.0.0-alpha2
>Reporter: Wei-Chiu Chuang
>Assignee: Wei-Chiu Chuang
>
> The idea of HDFS-11069 is good. But it seems to cause confusion for 
> administrators when they issue commands like hdfs diskbalancer, or hdfs 
> dfsadmin, because this change of behavior is not documented properly.
> I suggest we document a recommended way to kinit (e.g. kinit as 
> hdfs/ho...@host1.example.com, rather than h...@example.com), as well as 
> documenting a notice for running privileged DataNode commands in a Kerberized 
> cluster.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11885) createEncryptionZone should not block on initializing EDEK cache

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11885:
--
Target Version/s: 2.9.0, 3.0.0-beta1, 2.8.3  (was: 2.9.0, 3.0.0-beta1, 
2.8.2)

> createEncryptionZone should not block on initializing EDEK cache
> 
>
> Key: HDFS-11885
> URL: https://issues.apache.org/jira/browse/HDFS-11885
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: encryption
>Affects Versions: 2.6.5
>Reporter: Andrew Wang
>Assignee: Andrew Wang
> Attachments: HDFS-11885.001.patch, HDFS-11885.002.patch, 
> HDFS-11885.003.patch, HDFS-11885.004.patch
>
>
> When creating an encryption zone, we call {{ensureKeyIsInitialized}}, which 
> calls {{provider.warmUpEncryptedKeys(keyName)}}. This is a blocking call, 
> which attempts to fill the key cache up to the low watermark.
> If the KMS is down or slow, this can take a very long time, and cause the 
> createZone RPC to fail with a timeout.
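
A hedged sketch of the non-blocking shape being asked for is below: the warm-up is handed to a background executor so the createZone RPC does not wait on the KMS. The method names and the single-thread executor are assumptions for illustration; warmUpKeys() stands in for provider.warmUpEncryptedKeys(keyName) and this is not the patch's actual code.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hedged sketch: move the EDEK cache warm-up off the RPC path onto a background thread.
class EdekWarmupSketch {
  static final ExecutorService warmUpPool = Executors.newSingleThreadExecutor();

  static void createEncryptionZone(String keyName) {
    warmUpPool.submit(() -> warmUpKeys(keyName));   // do not block the RPC on this
    System.out.println("zone created with key " + keyName);
  }

  static void warmUpKeys(String keyName) {
    try { TimeUnit.SECONDS.sleep(2); } catch (InterruptedException ignored) { }  // simulated slow KMS
    System.out.println("EDEK cache warmed for " + keyName);
  }

  public static void main(String[] args) throws Exception {
    createEncryptionZone("key1");
    warmUpPool.shutdown();
    warmUpPool.awaitTermination(5, TimeUnit.SECONDS);
  }
}
{code}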



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12142) Files may be closed before streamer is done

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12142:
--
Target Version/s: 2.8.3  (was: 2.8.2)

> Files may be closed before streamer is done
> ---
>
> Key: HDFS-12142
> URL: https://issues.apache.org/jira/browse/HDFS-12142
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.8.0
>Reporter: Daryn Sharp
>
> We're encountering multiple cases of clients calling updateBlockForPipeline 
> on completed blocks.  The initial analysis is that the client closes a file, 
> completeFile succeeds, and then it immediately attempts recovery.  The exception 
> is swallowed on the client, only logged on the NN by checkUCBlock.
> The problem "appears" to be benign (no data loss) but it's unproven if the 
> issue always occurs for successfully closed files.  There appears to be very 
> poor coordination between the dfs output stream's threads which leads to 
> races that confuse the streamer thread – which probably should have been 
> joined before returning from close.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11616) Namenode doesn't mark the block as non-corrupt if the reason for corruption was INVALID_STATE

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11616:
--
Target Version/s: 2.8.3  (was: 2.8.1)

> Namenode doesn't mark the block as non-corrupt if the reason for corruption 
> was INVALID_STATE
> -
>
> Key: HDFS-11616
> URL: https://issues.apache.org/jira/browse/HDFS-11616
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 2.7.3
>Reporter: Rushabh S Shah
>
> Due to power failure event, we hit HDFS-5042.
> We lost many racks across the cluster.
> There were couple of missing blocks.
> For a  given missing block, following is the output of fsck.
> {noformat}
> [hdfs@XXX rushabhs]$ hdfs fsck -blockId blk_8566436445
> Connecting to namenode via 
> http://nn1:50070/fsck?ugi=hdfs&blockId=blk_8566436445+&path=%2F
> FSCK started by hdfs (auth:KERBEROS_SSL) from XXX at Mon Apr 03 16:22:48 UTC 
> 2017
> Block Id: blk_8566436445
> Block belongs to: 
> No. of Expected Replica: 3
> No. of live Replica: 0
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 3
> Block replica on datanode/rack: datanodeA is CORRUPT   ReasonCode: 
> INVALID_STATE
> Block replica on datanode/rack: datanodeB is CORRUPT   ReasonCode: 
> INVALID_STATE
> Block replica on datanode/rack: datanodeC is CORRUPT   ReasonCode: 
> INVALID_STATE
> {noformat}
> After the power event, when we restarted the datanode, the blocks were in rbw 
> directory.
> When the full block report is sent to the namenode, all the blocks from the rbw directory 
> get converted into the RWR state and the namenode marks them as corrupt with 
> reason Reason.INVALID_STATE.
> After some time (in this case after 31 hours), when I went to recover the missing 
> blocks, I noticed the following things.
> All the datanodes has their copy of the block in rbw directory but the file 
> was complete according to namenode.
> All the replicas had the right size and correct genstamp and {{hdfs debug 
> verify}} command also succeeded.
> I went to dnA and moved the block from rbw directory to finalized directory.
> Restarted the datanode (making sure the replicas file was not present during 
> startup).
> I forced a FBR and made sure the datanode block reported to namenode.
> After waiting for some time, the block was still missing.
> I expected the missing block to go away since the replica is in FINALIZED 
> directory.
> On investigating more, I found out that namenode will remove the replica from 
> corrupt map only if the reason for corruption was {{GENSTAMP_MISMATCH}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11602) Enable HttpFS Tomcat access logging

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11602:
--
Target Version/s: 2.8.3  (was: 2.8.1)

> Enable HttpFS Tomcat access logging
> ---
>
> Key: HDFS-11602
> URL: https://issues.apache.org/jira/browse/HDFS-11602
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: httpfs
>Affects Versions: 2.6.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>
> Use Tomcat {{org.apache.catalina.valves.AccessLogValve}} to enable access 
> logging. Verify the solution works with LB or proxy.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12008) Improve the available-space block placement policy

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12008:
--
Target Version/s: 2.8.3  (was: 2.8.2)

> Improve the available-space block placement policy
> --
>
> Key: HDFS-12008
> URL: https://issues.apache.org/jira/browse/HDFS-12008
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: block placement
>Affects Versions: 2.8.1
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
> Attachments: HDFS-12008.patch, HDFS-12008.v2.branch-2.patch, 
> HDFS-12008.v2.trunk.patch, HDFS-12008.v2.trunk.patch, 
> RandomAllocationPolicy.png
>
>
> AvailableSpaceBlockPlacementPolicy currently picks two nodes unconditionally, 
> then picks one node. It could avoid picking the second node when not 
> necessary.
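
A hedged sketch of the idea is below: decide up front, via the usual preference-fraction coin flip, whether the available-space comparison will be used at all, and only then sample a second node. The types and the fraction value are simplified stand-ins, not the real policy implementation.

{code:java}
import java.util.List;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hedged sketch: skip the second node lookup when the policy would not compare anyway.
class AvailableSpaceSketch {
  static class Node {
    String name; long freeBytes;
    Node(String n, long f) { name = n; freeBytes = f; }
  }

  static Node choose(List<Node> candidates, float preferMoreSpaceFraction) {
    Random rnd = ThreadLocalRandom.current();
    Node first = candidates.get(rnd.nextInt(candidates.size()));
    if (rnd.nextFloat() >= preferMoreSpaceFraction) {
      return first;                        // no comparison needed: skip the second lookup
    }
    Node second = candidates.get(rnd.nextInt(candidates.size()));
    return second.freeBytes > first.freeBytes ? second : first;
  }

  public static void main(String[] args) {
    List<Node> dns = List.of(new Node("dn1", 10L << 30), new Node("dn2", 40L << 30));
    System.out.println(choose(dns, 0.6f).name);
  }
}
{code}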



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12070) Failed block recovery leaves files open indefinitely and at risk for data loss

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12070:
--
Target Version/s: 2.8.3  (was: 2.8.2)

> Failed block recovery leaves files open indefinitely and at risk for data loss
> --
>
> Key: HDFS-12070
> URL: https://issues.apache.org/jira/browse/HDFS-12070
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>
> Files will remain open indefinitely if block recovery fails which creates a 
> high risk of data loss.  The replication monitor will not replicate these 
> blocks.
> The NN provides the primary node a list of candidate nodes for recovery which 
> involves a 2-stage process. The primary node removes any candidates that 
> cannot init replica recovery (essentially alive and knows about the block) to 
> create a sync list.  Stage 2 issues updates to the sync list – _but fails if 
> any node fails_ unlike the first stage.  The NN should be informed of nodes 
> that did succeed.
> Manual recovery will also fail until the problematic node is temporarily 
> stopped so a connection refused will induce the bad node to be pruned from 
> the candidates.  Recovery succeeds, the lease is released, under replication 
> is fixed, and block is invalidated from the bad node.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11617) Datanode should delete the block from rbw directory when it finds duplicate in finalized directory.

2017-09-10 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11617:
--
Target Version/s: 2.8.3  (was: 2.8.1)

> Datanode should delete the block from rbw directory when it finds duplicate 
> in finalized directory.
> ---
>
> Key: HDFS-11617
> URL: https://issues.apache.org/jira/browse/HDFS-11617
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Affects Versions: 2.7.3
>Reporter: Rushabh S Shah
>
> Recently we had power failure event and we hit HDFS-5042.
> There were missing blocks but datanode had the copy of the block (and meta 
> file) in rbw directory.
> I manually copied the block and meta file to finalized directory and 
> restarted the datanode.
> But after restart, the block somehow got deleted from the finalized directory.
> So I think the datanode tried to resolve duplicate replicas and in process of 
> resolving it deleted the replica from finalized directory.
> In my opinion, if we have to choose between rbw replica and finalized replica 
> (assuming size and genstamp are same), we should delete rbw replica,  not 
> finalized replica.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-9696) Garbage snapshot records lingering forever

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149858#comment-16149858
 ] 

Junping Du commented on HDFS-9696:
--

Adding 2.8.0 and 2.9.0 to the fix versions given the patch landed in those branches.

> Garbage snapshot records lingering forever
> --
>
> Key: HDFS-9696
> URL: https://issues.apache.org/jira/browse/HDFS-9696
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.9.0, 2.6.5, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-9696.branch-2.6.patch, HDFS-9696-branch-2.7.patch, 
> HDFS-9696.patch, HDFS-9696.v2.patch
>
>
> We have a cluster where the snapshot feature might have been tested years 
> ago. The HDFS does not have any snapshot now, but I see filediff records 
> persisted in its fsimage.  Since it has been restarted many times and 
> checkpointed over 100 times since then, the records must have been persisted and 
> carried over since then.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-9696) Garbage snapshot records lingering forever

2017-08-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-9696:
-
Fix Version/s: 2.8.0
   2.9.0

> Garbage snapshot records lingering forever
> --
>
> Key: HDFS-9696
> URL: https://issues.apache.org/jira/browse/HDFS-9696
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.9.0, 2.6.5, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-9696.branch-2.6.patch, HDFS-9696-branch-2.7.patch, 
> HDFS-9696.patch, HDFS-9696.v2.patch
>
>
> We have a cluster where the snapshot feature might have been tested years 
> ago. The HDFS does not have any snapshot now, but I see filediff records 
> persisted in its fsimage.  Since it has been restarted many times and 
> checkpointed over 100 times since then, the records must have been persisted and 
> carried over since then.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10763) Open files can leak permanently due to inconsistent lease update

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149856#comment-16149856
 ] 

Junping Du commented on HDFS-10763:
---

Adding 2.8.0 and 2.9.0 to the fix versions given the patch landed in those branches.

> Open files can leak permanently due to inconsistent lease update
> 
>
> Key: HDFS-10763
> URL: https://issues.apache.org/jira/browse/HDFS-10763
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3, 2.6.4
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.9.0, 2.6.5, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-10763.br27.patch, 
> HDFS-10763.branch-2.7.supplement.patch, HDFS-10763.branch-2.7.v2.patch, 
> HDFS-10763.patch
>
>
> This can happen during {{commitBlockSynchronization()}} or when a client gives up 
> on closing a file after retries.
> From {{finalizeINodeFileUnderConstruction()}}, the lease is removed first and 
> then the inode is turned into the closed state. But if any block is not in 
> COMPLETE state, 
> {{INodeFile#assertAllBlocksComplete()}} will throw an exception. This 
> causes the lease to be removed from the lease manager, but not from the inode. 
> Since the lease manager does not have a lease for the file, no lease recovery 
> will happen for this file. Moreover, this broken state is persisted and 
> reconstructed through saving and loading of fsimage. Since no replication is 
> scheduled for the blocks for the file, this can cause a data loss and also 
> block decommissioning of datanode.
> The lease cannot be manually recovered either. It fails with
> {noformat}
> ...AlreadyBeingCreatedException): Failed to RECOVER_LEASE /xyz/xyz for user1 
> on
>  0.0.0.1 because the file is under construction but no leases found.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2950)
> ...
> {noformat}
> When a client retries {{close()}}, the same inconsistent state is created, 
> but it can succeed the next time since {{checkLease()}} only looks at the 
> inode, not the lease manager in this case. The close behavior is different if 
> HDFS-8999 is activated by setting 
> {{dfs.namenode.file.close.num-committed-allowed}} to 1 (unlikely) or 2 
> (never). 
> In principle, the under-construction feature of an inode and the lease in the 
> lease manager should never go out of sync. The fix involves two parts.
> 1) Prevent inconsistent lease updates. We can achieve this by calling 
> {{removeLease()}} after checking the block state. 
> 2) Avoid reconstructing inconsistent lease states from a fsimage. 1) alone 
> does not correct the existing inconsistencies surviving through fsimages.  
> This can be done during fsimage loading time by making sure a corresponding 
> lease exists for each inode that has the under-construction feature. 
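
A hedged, self-contained sketch of part 1) of the fix is below: the block-state check happens before the lease manager is touched, so a failed close cannot strand an inode without a lease. All types here are simplified stand-ins for the real FSNamesystem/LeaseManager code.

{code:java}
import java.util.List;

// Hedged sketch: validate block state *before* removing the lease, so an exception
// cannot leave the lease manager and the inode out of sync.
class LeaseCloseSketch {
  enum BlockState { COMPLETE, COMMITTED, UNDER_CONSTRUCTION }

  static class LeaseManager {
    void removeLease(String path) { System.out.println("lease removed for " + path); }
  }

  static void finalizeFile(String path, List<BlockState> blocks, LeaseManager leases) {
    // Check first; throw while both structures are still consistent.
    for (BlockState s : blocks) {
      if (s != BlockState.COMPLETE) {
        throw new IllegalStateException(path + " has a block in state " + s);
      }
    }
    leases.removeLease(path);   // only reached when the close is guaranteed to succeed
    // ... convert the inode to the closed state ...
  }

  public static void main(String[] args) {
    finalizeFile("/xyz/xyz",
        List.of(BlockState.COMPLETE, BlockState.COMPLETE), new LeaseManager());
  }
}
{code}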



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-10763) Open files can leak permanently due to inconsistent lease update

2017-08-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-10763:
--
Fix Version/s: 2.8.0
   2.9.0

> Open files can leak permanently due to inconsistent lease update
> 
>
> Key: HDFS-10763
> URL: https://issues.apache.org/jira/browse/HDFS-10763
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3, 2.6.4
>Reporter: Kihwal Lee
>Assignee: Kihwal Lee
>Priority: Critical
> Fix For: 2.8.0, 2.9.0, 2.6.5, 2.7.4, 3.0.0-alpha1
>
> Attachments: HDFS-10763.br27.patch, 
> HDFS-10763.branch-2.7.supplement.patch, HDFS-10763.branch-2.7.v2.patch, 
> HDFS-10763.patch
>
>
> This can happen during {{commitBlockSynchronization()}} or when a client gives up 
> on closing a file after retries.
> From {{finalizeINodeFileUnderConstruction()}}, the lease is removed first and 
> then the inode is turned into the closed state. But if any block is not in 
> COMPLETE state, 
> {{INodeFile#assertAllBlocksComplete()}} will throw an exception. This 
> causes the lease to be removed from the lease manager, but not from the inode. 
> Since the lease manager does not have a lease for the file, no lease recovery 
> will happen for this file. Moreover, this broken state is persisted and 
> reconstructed through saving and loading of fsimage. Since no replication is 
> scheduled for the blocks for the file, this can cause a data loss and also 
> block decommissioning of datanode.
> The lease cannot be manually recovered either. It fails with
> {noformat}
> ...AlreadyBeingCreatedException): Failed to RECOVER_LEASE /xyz/xyz for user1 
> on
>  0.0.0.1 because the file is under construction but no leases found.
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:2950)
> ...
> {noformat}
> When a client retries {{close()}}, the same inconsistent state is created, 
> but it can succeed the next time since {{checkLease()}} only looks at the 
> inode, not the lease manager in this case. The close behavior is different if 
> HDFS-8999 is activated by setting 
> {{dfs.namenode.file.close.num-committed-allowed}} to 1 (unlikely) or 2 
> (never). 
> In principle, the under-construction feature of an inode and the lease in the 
> lease manager should never go out of sync. The fix involves two parts.
> 1) Prevent inconsistent lease updates. We can achieve this by calling 
> {{removeLease()}} after checking the block state. 
> 2) Avoid reconstructing inconsistent lease states from a fsimage. 1) alone 
> does not correct the existing inconsistencies surviving through fsimages.  
> This can be done during fsimage loading time by making sure a corresponding 
> lease exists for each inode that has the under-construction feature. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11947) When constructing a thread name, BPOfferService may print a bogus warning message

2017-08-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11947:
--
Fix Version/s: (was: 2.8.2)

> When constructing a thread name, BPOfferService may print a bogus warning 
> message 
> --
>
> Key: HDFS-11947
> URL: https://issues.apache.org/jira/browse/HDFS-11947
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Weiwei Yang
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha4
>
> Attachments: HDFS-11947.001.patch, HDFS-11947.002.patch, 
> HDFS-11947.003.patch
>
>
> HDFS-11558 tries to get Block pool ID for constructing thread names.  When 
> the service is not yet registered with NN, it prints the bogus warning "Block 
> pool ID needed, but service not yet registered with NN" with stack trace.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11947) When constructing a thread name, BPOfferService may print a bogus warning message

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149749#comment-16149749
 ] 

Junping Du commented on HDFS-11947:
---

It sounds like we only committed this to trunk and branch-2. Dropping 2.8.2 from the fix versions.

> When constructing a thread name, BPOfferService may print a bogus warning 
> message 
> --
>
> Key: HDFS-11947
> URL: https://issues.apache.org/jira/browse/HDFS-11947
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Weiwei Yang
>Priority: Minor
> Fix For: 2.9.0, 3.0.0-alpha4
>
> Attachments: HDFS-11947.001.patch, HDFS-11947.002.patch, 
> HDFS-11947.003.patch
>
>
> HDFS-11558 tries to get Block pool ID for constructing thread names.  When 
> the service is not yet registered with NN, it prints the bogus warning "Block 
> pool ID needed, but service not yet registered with NN" with stack trace.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-10326) Disable setting tcp socket send/receive buffers for write pipelines

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-10326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149744#comment-16149744
 ] 

Junping Du commented on HDFS-10326:
---

It sounds like we only committed the patch to branch-2.8 but forgot to commit it to 
branch-2.8.2. I have just committed it there.
Also, adding 2.9 and 3.0 to the fix versions.

> Disable setting tcp socket send/receive buffers for write pipelines
> ---
>
> Key: HDFS-10326
> URL: https://issues.apache.org/jira/browse/HDFS-10326
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, hdfs
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
> Fix For: 2.9.0, 3.0.0-beta1, 2.8.2
>
> Attachments: HDFS-10326.000.patch, HDFS-10326.001.patch, 
> HDFS-10326.001.patch
>
>
> The DataStreamer and the Datanode use a hardcoded 
> DEFAULT_DATA_SOCKET_SIZE=128K for the send and receive buffers of a write 
> pipeline.  Explicitly setting tcp buffer sizes disables tcp stack 
> auto-tuning.  
> The hardcoded value will saturate a 1Gb link with a 1ms RTT; 105Mbs at 10ms; a 
> paltry 11Mbs over a 100ms long haul.  10Gb networks are underutilized.
> There should either be a configuration to completely disable setting the 
> buffers, or the setReceiveBuffer and setSendBuffer calls should be removed 
> entirely.
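
A hedged sketch of the "configurable, zero means leave auto-tuning alone" behaviour is below; the configuration key and the plumbing in the real DataStreamer/DataNode are not shown, and treating a non-positive value as "do not set" is an assumption for illustration.

{code:java}
import java.net.Socket;
import java.net.SocketException;

// Hedged sketch: only set explicit socket buffer sizes when a positive size is configured;
// otherwise leave the sockets untouched so the kernel's TCP auto-tuning stays in effect.
class SocketBufferConfigSketch {
  static void applyBufferSize(Socket socket, int configuredBytes) throws SocketException {
    if (configuredBytes > 0) {
      socket.setSendBufferSize(configuredBytes);
      socket.setReceiveBufferSize(configuredBytes);
    }
    // configuredBytes <= 0: do not touch the buffers, so the TCP stack can auto-tune.
  }

  public static void main(String[] args) throws Exception {
    Socket s = new Socket();
    applyBufferSize(s, 0);                     // auto-tuning preserved
    System.out.println("send buffer = " + s.getSendBufferSize() + " bytes (kernel default)");
    s.close();
  }
}
{code}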



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11078) Fix NPE in LazyPersistFileScrubber

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149714#comment-16149714
 ] 

Junping Du commented on HDFS-11078:
---

OK. As [~brahmareddy] mentioned above, the JIRA ID is missing from the commit message 
here, so the patch actually did land.

> Fix NPE in LazyPersistFileScrubber
> --
>
> Key: HDFS-11078
> URL: https://issues.apache.org/jira/browse/HDFS-11078
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HDFS-11078.000.patch, HDFS-11078.001.patch, 
> HDFS-11078-branch-2.7.patch
>
>
> If a block is removed, it will be removed from the block map. When the 
> clearCorruptLazyPersistFiles() tries to delete the block, it may already be 
> deleted and generate a null pointer exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:3820)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:3851)
> at java.lang.Thread.run(Thread.java:745)
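
A hedged sketch of the guard implied by the stack trace is below: the scrubber looks each queued block up again and skips entries already removed from the block map, rather than dereferencing a null result. The types are simplified stand-ins, not the real FSNamesystem code.

{code:java}
import java.util.Map;

// Hedged sketch: a block queued for the scrubber may already be gone from the block map.
class ScrubberGuardSketch {
  static int clearCorrupt(Iterable<Long> queuedBlockIds, Map<Long, Object> blockMap) {
    int cleared = 0;
    for (long id : queuedBlockIds) {
      Object info = blockMap.get(id);
      if (info == null) {
        continue;            // block already deleted; nothing to scrub, no NPE
      }
      // ... delete the corrupt lazy-persist replica for this block ...
      cleared++;
    }
    return cleared;
  }

  public static void main(String[] args) {
    System.out.println(
        clearCorrupt(java.util.List.of(1L, 2L), Map.of(1L, new Object())));  // prints 1
  }
}
{code}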



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11078) Fix NPE in LazyPersistFileScrubber

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149709#comment-16149709
 ] 

Junping Du commented on HDFS-11078:
---

I don't see this fix landed in branch-2.8 or branch-2.8.2. [~barnaul], which 
branches did you commit the patch to? We should mention the branches explicitly 
when committing a patch.

> Fix NPE in LazyPersistFileScrubber
> --
>
> Key: HDFS-11078
> URL: https://issues.apache.org/jira/browse/HDFS-11078
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Íñigo Goiri
>Assignee: Íñigo Goiri
> Fix For: 2.9.0, 2.7.4, 3.0.0-alpha4, 2.8.2
>
> Attachments: HDFS-11078.000.patch, HDFS-11078.001.patch, 
> HDFS-11078-branch-2.7.patch
>
>
> If a block is removed, it will be removed from the block map. When the 
> clearCorruptLazyPersistFiles() tries to delete the block, it may already be 
> deleted and generate a null pointer exception.
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:3820)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:3851)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-08-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16149706#comment-16149706
 ] 

Junping Du commented on HDFS-12102:
---

Dropping 2.8.2 from the fix versions given the patch has not been committed there.

> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Assignee: Ashwin Ramesh
>Priority: Minor
> Attachments: HDFS-12102-001.patch, HDFS-12102-002.patch, 
> HDFS-12102-003.patch
>
>
> When the Volume scanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks since a corrupt block means 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-12102) VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt block

2017-08-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12102:
--
Fix Version/s: (was: 2.8.2)

> VolumeScanner throttle dropped (fast scan enabled) when there is a corrupt 
> block
> 
>
> Key: HDFS-12102
> URL: https://issues.apache.org/jira/browse/HDFS-12102
> Project: Hadoop HDFS
>  Issue Type: New Feature
>  Components: datanode, hdfs
>Affects Versions: 2.8.2
>Reporter: Ashwin Ramesh
>Assignee: Ashwin Ramesh
>Priority: Minor
> Attachments: HDFS-12102-001.patch, HDFS-12102-002.patch, 
> HDFS-12102-003.patch
>
>
> When the Volume scanner sees a corrupt block, it restarts the scan and scans 
> the blocks at a much faster rate with a negligible scan period. This is so that 
> it doesn't take 3 weeks to report blocks since a corrupt block means 
> increased likelihood that there are more corrupt blocks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-11896) Non-dfsUsed will be doubled on dead node re-registration

2017-08-28 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16144632#comment-16144632
 ] 

Junping Du commented on HDFS-11896:
---

Thanks [~brahmareddy] for identifying that this did not land on branch-2.8.2. Dropping 2.8.3 
from the fix versions.

> Non-dfsUsed will be doubled on dead node re-registration
> 
>
> Key: HDFS-11896
> URL: https://issues.apache.org/jira/browse/HDFS-11896
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Fix For: 2.9.0, 2.7.4, 3.0.0-beta1, 2.8.2
>
> Attachments: HDFS-11896-002.patch, HDFS-11896-003.patch, 
> HDFS-11896-004.patch, HDFS-11896-005.patch, HDFS-11896-006.patch, 
> HDFS-11896-007.patch, HDFS-11896-008.patch, HDFS-11896-branch-2.7-001.patch, 
> HDFS-11896-branch-2.7-002.patch, HDFS-11896-branch-2.7-003.patch, 
> HDFS-11896-branch-2.7-004.patch, HDFS-11896-branch-2.7-005.patch, 
> HDFS-11896-branch-2.7-006.patch, HDFS-11896-branch-2.7-008.patch, 
> HDFS-11896.patch
>
>
>  *Scenario:* 
> i) Make sure you have non-dfs data.
> ii) Stop the Datanode.
> iii) Wait until it becomes dead.
> iv) Now restart and check the non-dfs data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-11896) Non-dfsUsed will be doubled on dead node re-registration

2017-08-28 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-11896:
--
Fix Version/s: (was: 2.8.3)

> Non-dfsUsed will be doubled on dead node re-registration
> 
>
> Key: HDFS-11896
> URL: https://issues.apache.org/jira/browse/HDFS-11896
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.3
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Fix For: 2.9.0, 2.7.4, 3.0.0-beta1, 2.8.2
>
> Attachments: HDFS-11896-002.patch, HDFS-11896-003.patch, 
> HDFS-11896-004.patch, HDFS-11896-005.patch, HDFS-11896-006.patch, 
> HDFS-11896-007.patch, HDFS-11896-008.patch, HDFS-11896-branch-2.7-001.patch, 
> HDFS-11896-branch-2.7-002.patch, HDFS-11896-branch-2.7-003.patch, 
> HDFS-11896-branch-2.7-004.patch, HDFS-11896-branch-2.7-005.patch, 
> HDFS-11896-branch-2.7-006.patch, HDFS-11896-branch-2.7-008.patch, 
> HDFS-11896.patch
>
>
>  *Scenario:* 
> i) Make sure you have non-dfs data.
> ii) Stop the Datanode.
> iii) Wait until it becomes dead.
> iv) Now restart and check the non-dfs data.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12319) DirectoryScanner will throw IllegalStateException when Multiple BP's are present

2017-08-24 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140603#comment-16140603
 ] 

Junping Du commented on HDFS-12319:
---

Got it. Thanks [~arpitagarwal] for confirming. I will hold 2.8.2 RC0 for this.

> DirectoryScanner will throw IllegalStateException when Multiple BP's are 
> present
> 
>
> Key: HDFS-12319
> URL: https://issues.apache.org/jira/browse/HDFS-12319
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Attachments: HDFS-12319-001.patch, HDFS-12319-002.patch, 
> TestCase_to_Reproduce.patch
>
>
> *Scenario:*
> Configure "*dfs.datanode.directoryscan.interval*" as *60* and start a federated 
> cluster with at least two nameservices.
> {noformat}
> 2017-08-18 19:06:37,150 
> [java.util.concurrent.ThreadPoolExecutor$Worker@37d68b4e[State = -1, empty 
> queue]] ERROR datanode.DirectoryScanner 
> (DirectoryScanner.java:getDiskReport(551)) - Error compiling report for the 
> volume, StorageId: DS-258b5e16-caa3-48c8-a0c8-b16934eb8a0c
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> StopWatch is already running
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:542)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:392)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:373)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:318)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalStateException: StopWatch is already running
>   at org.apache.hadoop.util.StopWatch.start(StopWatch.java:49)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:612)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:579)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> {noformat}
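
A hedged sketch of why the exception appears, and how per-scan instances avoid it, is below. The tiny StopWatch stand-in mirrors only the "cannot be started twice" behaviour of org.apache.hadoop.util.StopWatch; the per-block-pool loop is an illustration, not the patch itself.

{code:java}
// Hedged sketch: a StopWatch instance must not be start()ed while it is already running,
// so shared report-compiler state across block pools triggers exactly this exception.
class StopWatchSketch {
  static class StopWatch {
    private boolean running;
    private long startNanos;
    StopWatch start() {
      if (running) throw new IllegalStateException("StopWatch is already running");
      running = true; startNanos = System.nanoTime(); return this;
    }
    long stopNanos() { running = false; return System.nanoTime() - startNanos; }
  }

  public static void main(String[] args) {
    // Fix illustrated: use one StopWatch per compilation, not one shared across block pools.
    for (String bp : new String[] {"BP-1", "BP-2"}) {
      StopWatch perScan = new StopWatch().start();
      // ... compile the directory report for this block pool ...
      System.out.println(bp + " scanned in " + perScan.stopNanos() + " ns");
    }
  }
}
{code}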






[jira] [Updated] (HDFS-7967) Reduce the performance impact of the balancer

2017-08-23 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-7967:
-
Target Version/s: 2.8.3  (was: 2.8.1)

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.
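
To make the round-robin idea in the last sentence concrete, here is a small illustrative sketch that draws candidate blocks from per-storage iterators ordered by remaining capacity. It is not the Balancer's code; every name in it is invented for the example.

{code:java}
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

class StorageBlocks {
  final String storageId;
  final long remaining;                  // bytes still free on this storage
  final Iterator<Long> blockIds;         // blocks hosted on this storage

  StorageBlocks(String storageId, long remaining, Iterator<Long> blockIds) {
    this.storageId = storageId;
    this.remaining = remaining;
    this.blockIds = blockIds;
  }
}

public class RoundRobinBlockPicker {
  /** Pick up to 'count' candidate block ids, visiting storages round-robin. */
  static List<Long> pickCandidates(List<StorageBlocks> storages, int count) {
    // Start from the storage with the least remaining capacity (the fullest one).
    List<StorageBlocks> ordered = new ArrayList<>(storages);
    ordered.sort(Comparator.comparingLong(s -> s.remaining));

    Deque<StorageBlocks> rotation = new ArrayDeque<>(ordered);
    List<Long> picked = new ArrayList<>();
    while (picked.size() < count && !rotation.isEmpty()) {
      StorageBlocks s = rotation.pollFirst();
      if (s.blockIds.hasNext()) {
        picked.add(s.blockIds.next());
        rotation.addLast(s);             // round-robin: back of the queue for the next pass
      }                                  // exhausted storages simply drop out of the rotation
    }
    return picked;
  }
}
{code}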






[jira] [Commented] (HDFS-7967) Reduce the performance impact of the balancer

2017-08-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139432#comment-16139432
 ] 

Junping Du commented on HDFS-7967:
--

Got it, [~kihwal]. Let's move it to 2.8.3.

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.






[jira] [Commented] (HDFS-12319) DirectoryScanner will throw IllegalStateException when Multiple BP's are present

2017-08-23 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139386#comment-16139386
 ] 

Junping Du commented on HDFS-12319:
---

Hi [~arpitagarwal], can you confirm whether this should be a blocker for 2.8.2? I 
am planning to kick off an RC within this week.

> DirectoryScanner will throw IllegalStateException when Multiple BP's are 
> present
> 
>
> Key: HDFS-12319
> URL: https://issues.apache.org/jira/browse/HDFS-12319
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Brahma Reddy Battula
>Assignee: Brahma Reddy Battula
>Priority: Blocker
> Attachments: HDFS-12319-001.patch, TestCase_to_Reproduce.patch
>
>
> *Scenario:*
> Configure "*dfs.datanode.directoryscan.interval*" as *60* and start a federated 
> cluster with at least two nameservices.
> {noformat}
> 2017-08-18 19:06:37,150 
> [java.util.concurrent.ThreadPoolExecutor$Worker@37d68b4e[State = -1, empty 
> queue]] ERROR datanode.DirectoryScanner 
> (DirectoryScanner.java:getDiskReport(551)) - Error compiling report for the 
> volume, StorageId: DS-258b5e16-caa3-48c8-a0c8-b16934eb8a0c
> java.util.concurrent.ExecutionException: java.lang.IllegalStateException: 
> StopWatch is already running
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:192)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.getDiskReport(DirectoryScanner.java:542)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:392)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:373)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:318)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.IllegalStateException: StopWatch is already running
>   at org.apache.hadoop.util.StopWatch.start(StopWatch.java:49)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:612)
>   at 
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner$ReportCompiler.call(DirectoryScanner.java:579)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   ... 3 more
> {noformat}






[jira] [Commented] (HDFS-7967) Reduce the performance impact of the balancer

2017-08-20 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-7967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16134707#comment-16134707
 ] 

Junping Du commented on HDFS-7967:
--

Hi [~daryn] and [~kihwal], I am planning to kick off RC0 for 2.8.2 soon. 
What's the current status of this JIRA? Do we plan to merge the patch to 
branch-2/branch-2.8 in the short term?

> Reduce the performance impact of the balancer
> -
>
> Key: HDFS-7967
> URL: https://issues.apache.org/jira/browse/HDFS-7967
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
> Attachments: HDFS-7967.branch-2.001.patch, 
> HDFS-7967.branch-2.002.patch, HDFS-7967.branch-2-1.patch, 
> HDFS-7967.branch-2.8.001.patch, HDFS-7967.branch-2.8.002.patch, 
> HDFS-7967.branch-2.8.003.patch, HDFS-7967.branch-2.8-1.patch, 
> HDFS-7967-branch-2.8.patch, HDFS-7967-branch-2.patch
>
>
> The balancer needs to query for blocks to move from overly full DNs.  The 
> block lookup is extremely inefficient.  An iterator of the node's blocks is 
> created from the iterators of its storages' blocks.  A random number is 
> chosen corresponding to how many blocks will be skipped via the iterator.  
> Each skip requires costly scanning of triplets.
> The current design also only considers node imbalances while ignoring 
> imbalances within the nodes' storages.  A more efficient and intelligent 
> design may eliminate the costly skipping of blocks via round-robin selection 
> of blocks from the storages based on remaining capacity.






[jira] [Updated] (HDFS-12049) Recommissioning live nodes stalls the NN

2017-07-31 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-12049:
--
Target Version/s: 2.9.0  (was: 2.8.2)

> Recommissioning live nodes stalls the NN
> 
>
> Key: HDFS-12049
> URL: https://issues.apache.org/jira/browse/HDFS-12049
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> A node refresh will recommission included nodes that are alive and in 
> decommissioning or decommissioned state.  The recommission will scan all 
> blocks on the node, find over-replicated blocks, choose an excess replica, and queue an 
> invalidation.
> The process is expensive and worsened by overhead of storage types (even when 
> not in use).  It can be especially devastating because the write lock is held 
> for the entire node refresh.  _Recommissioning 67 nodes with ~500k 
> blocks/node stalled rpc services for over 4 mins._
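
As an illustration of why holding the write lock across the whole refresh hurts, the hedged sketch below processes a node's blocks in bounded batches and releases the lock between batches. It is a stand-alone example with invented names, not NameNode code.

{code:java}
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class BatchedRecommissionSketch {
  private final ReentrantReadWriteLock nsLock = new ReentrantReadWriteLock();

  void recommission(List<String> nodes, int blocksPerBatch) {
    for (String node : nodes) {
      List<Long> blocks = blocksOn(node);          // hypothetical per-node block list
      for (int from = 0; from < blocks.size(); from += blocksPerBatch) {
        int to = Math.min(from + blocksPerBatch, blocks.size());
        nsLock.writeLock().lock();
        try {
          for (Long blockId : blocks.subList(from, to)) {
            // check replication, pick an excess replica, queue an invalidation ...
          }
        } finally {
          // Releasing between batches lets queued RPC handlers make progress,
          // instead of stalling them for the whole multi-minute refresh.
          nsLock.writeLock().unlock();
        }
      }
    }
  }

  private List<Long> blocksOn(String node) {
    return List.of();                              // placeholder for the real block list
  }
}
{code}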






[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN

2017-07-31 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108222#comment-16108222
 ] 

Junping Du commented on HDFS-12049:
---

Moved. Thanks Daryn.

> Recommissioning live nodes stalls the NN
> 
>
> Key: HDFS-12049
> URL: https://issues.apache.org/jira/browse/HDFS-12049
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> A node refresh will recommission included nodes that are alive and in 
> decommissioning or decommissioned state.  The recommission will scan all 
> blocks on the node, find over-replicated blocks, choose an excess replica, and queue an 
> invalidation.
> The process is expensive and worsened by overhead of storage types (even when 
> not in use).  It can be especially devastating because the write lock is held 
> for the entire node refresh.  _Recommissioning 67 nodes with ~500k 
> blocks/node stalled rpc services for over 4 mins._






[jira] [Commented] (HDFS-12049) Recommissioning live nodes stalls the NN

2017-07-20 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-12049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095329#comment-16095329
 ] 

Junping Du commented on HDFS-12049:
---

This seems to be a long-standing issue and a bit risky/complicated to fix 
in a maintenance release. 
[~daryn], I am inclined to move it to the 2.9 release if you agree.

> Recommissioning live nodes stalls the NN
> 
>
> Key: HDFS-12049
> URL: https://issues.apache.org/jira/browse/HDFS-12049
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Daryn Sharp
>Priority: Critical
>
> A node refresh will recommission included nodes that are alive and in 
> decommissioning or decommissioned state.  The recommission will scan all 
> blocks on the node, find over-replicated blocks, choose an excess replica, and queue an 
> invalidation.
> The process is expensive and worsened by overhead of storage types (even when 
> not in use).  It can be especially devastating because the write lock is held 
> for the entire node refresh.  _Recommissioning 67 nodes with ~500k 
> blocks/node stalled rpc services for over 4 mins._






[jira] [Commented] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-04-21 Thread Junping Du (JIRA)

[ 
https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979508#comment-15979508
 ] 

Junping Du commented on HDFS-11661:
---

Thanks [~nroberts] and all for reporting the issue. +1 on reverting HDFS-10797 
if we don't have a quick fix here. Just pinged the patch authors of HDFS-10797 
to see if any magic can happen in the short term. :)

> GetContentSummary uses excessive amounts of memory
> --
>
> Key: HDFS-11661
> URL: https://issues.apache.org/jira/browse/HDFS-11661
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 3.0.0-alpha2
>Reporter: Nathan Roberts
>Priority: Blocker
> Attachments: Heap growth.png
>
>
> ContentSummaryComputationContext::nodeIncluded() is being used to keep track 
> of all INodes visited during the current content summary calculation. This 
> can be all of the INodes in the filesystem, making for a VERY large hash 
> table. This simply won't work on large filesystems. 
> We noticed this after upgrading: a namenode with ~100 million filesystem 
> objects was spending significantly more time in GC. Fortunately, this system 
> had some memory breathing room; other clusters we have will not run with this 
> additional demand on memory.
> This was added as part of HDFS-10797 as a way of keeping track of INodes that 
> have already been accounted for - to avoid double counting.
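
A back-of-the-envelope sketch (assumed per-entry sizes, not measurements) of why remembering every visited INode id in a hash set does not scale to ~100 million filesystem objects:

{code:java}
import java.util.HashSet;
import java.util.Set;

public class VisitedInodeCostSketch {
  public static void main(String[] args) {
    // Assumed sizes: ~32 bytes per HashMap node plus ~16 bytes per boxed Long key.
    long inodes = 100_000_000L;
    long bytesPerEntry = 48;
    System.out.printf("~%.1f GB of heap just to remember visited INode ids%n",
        inodes * bytesPerEntry / 1e9);

    // The pattern that grows without bound during a single getContentSummary call:
    Set<Long> visited = new HashSet<>();
    for (long id = 0; id < 1_000; id++) {          // tiny demo loop, not 100M entries
      visited.add(id);
    }
    System.out.println("tracked " + visited.size() + " INode ids");
  }
}
{code}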





