[jira] [Updated] (HDDS-1572) Implement a Pipeline scrubber to maintain healthy number of pipelines in a cluster

2019-11-18 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-1572:
---
Description: 
The design document lists the initial requirements for the pipeline scrubber; a 
rough sketch of one scrubber pass follows the list.
 - Scan the pipelines that the nodes are a part of to select candidates for 
teardown.
 - Scan pipelines that have no open containers currently in use and whose 
datanodes are in violation.
 - Schedule a teardown operation if a candidate pipeline is found.
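A minimal sketch of what one scrubber pass could look like. The Pipeline, 
PipelineStore and TeardownScheduler types below are illustrative stand-ins, not 
the actual SCM classes:

{code:java}
// Hypothetical sketch only: these interfaces are stand-ins, not the SCM API.
import java.util.List;

interface Pipeline {
  String getId();
  boolean hasOpenContainers();        // any containers on it still open/in use?
  boolean hasDatanodeInViolation();   // a member datanode exceeds the soft pipeline bound?
}

interface PipelineStore {
  List<Pipeline> listPipelines();
}

interface TeardownScheduler {
  void scheduleTeardown(Pipeline pipeline);
}

/** One scrubber pass: scan pipelines and schedule teardown for candidates. */
final class PipelineScrubber {
  private final PipelineStore store;
  private final TeardownScheduler scheduler;

  PipelineScrubber(PipelineStore store, TeardownScheduler scheduler) {
    this.store = store;
    this.scheduler = scheduler;
  }

  void run() {
    for (Pipeline pipeline : store.listPipelines()) {
      // Candidate for teardown: no open containers in use and at least one
      // member datanode violates the pipeline-membership soft upper bound.
      if (!pipeline.hasOpenContainers() && pipeline.hasDatanodeInViolation()) {
        scheduler.scheduleTeardown(pipeline);
      }
    }
  }
}
{code}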

  was:
The design document talks about initial requirements for the pipeline scrubber.

- Maintain a datastructure for datanodes violating the pipeline membership soft 
upper bound.
- Scan the pipelines that the nodes are a part of to select candidates for 
teardown.
- Scan pipelines that do not have open containers currently in use and 
datanodes are in violation.
- Schedule tear down operation if a candidate pipeline is found.



> Implement a Pipeline scrubber to maintain healthy number of pipelines in a 
> cluster
> --
>
> Key: HDDS-1572
> URL: https://issues.apache.org/jira/browse/HDDS-1572
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> The design document talks about initial requirements for the pipeline 
> scrubber.
>  - Scan the pipelines that the nodes are a part of to select candidates for 
> teardown.
>  - Scan pipelines that do not have open containers currently in use and 
> datanodes are in violation.
>  - Schedule tear down operation if a candidate pipeline is found.






[jira] [Assigned] (HDDS-1572) Implement a Pipeline scrubber to maintain healthy number of pipelines in a cluster

2019-11-18 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1572:
--

Assignee: Li Cheng

> Implement a Pipeline scrubber to maintain healthy number of pipelines in a 
> cluster
> --
>
> Key: HDDS-1572
> URL: https://issues.apache.org/jira/browse/HDDS-1572
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> The design document talks about initial requirements for the pipeline 
> scrubber.
> - Maintain a datastructure for datanodes violating the pipeline membership 
> soft upper bound.
> - Scan the pipelines that the nodes are a part of to select candidates for 
> teardown.
> - Scan pipelines that do not have open containers currently in use and 
> datanodes are in violation.
> - Schedule tear down operation if a candidate pipeline is found.






[jira] [Assigned] (HDDS-1573) Add scrubber metrics and pipeline metrics

2019-11-17 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1573:
--

Assignee: Li Cheng

> Add scrubber metrics and pipeline metrics
> -
>
> Key: HDDS-1573
> URL: https://issues.apache.org/jira/browse/HDDS-1573
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> - Add metrics for how many pipelines per datanode
> - Add metrics for pipelines that are chosen by the scrubber
> - Add metrics for pipelines that are in violation
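A rough sketch of how such metrics could be exposed through Hadoop's metrics2 
annotations. The class name, metric names and the single per-datanode gauge are 
assumptions for illustration, not the actual HDDS-1573 implementation:

{code:java}
// Illustrative sketch only; names are assumptions, not the HDDS-1573 code.
import org.apache.hadoop.metrics2.MetricsSystem;
import org.apache.hadoop.metrics2.annotation.Metric;
import org.apache.hadoop.metrics2.annotation.Metrics;
import org.apache.hadoop.metrics2.lib.DefaultMetricsSystem;
import org.apache.hadoop.metrics2.lib.MutableCounterLong;
import org.apache.hadoop.metrics2.lib.MutableGaugeLong;

@Metrics(about = "Pipeline scrubber metrics", context = "SCM")
public final class ScrubberMetrics {
  public static final String SOURCE_NAME = ScrubberMetrics.class.getSimpleName();

  @Metric("Pipelines chosen by the scrubber for teardown")
  private MutableCounterLong pipelinesChosenByScrubber;

  @Metric("Pipelines currently in violation")
  private MutableGaugeLong pipelinesInViolation;

  @Metric("Average number of pipelines per datanode (simplified to one gauge)")
  private MutableGaugeLong avgPipelinesPerDatanode;

  private ScrubberMetrics() { }

  // Registering the source lets metrics2 inject the @Metric fields.
  public static ScrubberMetrics create() {
    MetricsSystem ms = DefaultMetricsSystem.instance();
    return ms.register(SOURCE_NAME, "Pipeline scrubber metrics",
        new ScrubberMetrics());
  }

  public void incrPipelinesChosen() { pipelinesChosenByScrubber.incr(); }
  public void setPipelinesInViolation(long count) { pipelinesInViolation.set(count); }
  public void setAvgPipelinesPerDatanode(long avg) { avgPipelinesPerDatanode.set(avg); }
}
{code}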






[jira] [Resolved] (HDDS-2396) OM rocksdb core dump during writing

2019-11-17 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng resolved HDDS-2396.

  Assignee: Li Cheng
Resolution: Fixed

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
>  
> A core dump occasionally happens in rocksdb. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb:
> :ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11






[jira] [Commented] (HDDS-2396) OM rocksdb core dump during writing

2019-11-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976315#comment-16976315
 ] 

Li Cheng commented on HDDS-2396:


This is resolved in [https://github.com/apache/hadoop-ozone/pull/100]. Thanks 
to [~bharat] for the fix.

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
>  
> A core dump occasionally happens in rocksdb. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb:
> :ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11






[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976258#comment-16976258
 ] 

Li Cheng commented on HDDS-2356:


I checked out [https://github.com/apache/hadoop-ozone/pull/163], compiled a jar 
and deployed it onto my cluster. [~bharat]

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2018-11-15-OM-logs.txt, 2019-11-06_18_13_57_422_ERROR, 
> hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png, 
> om-audit-VM_50_210_centos.log, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976255#comment-16976255
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] Makes sense. I couldn't find any info related to 
Key:plc_1570869510243_5542 in the OM audit logs; it might have been rotated. I 
can try again today to collect some fresh logs.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2018-11-15-OM-logs.txt, 2019-11-06_18_13_57_422_ERROR, 
> hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png, 
> om-audit-VM_50_210_centos.log, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976237#comment-16976237
 ] 

Li Cheng commented on HDDS-2356:


Tried running with [https://github.com/apache/hadoop-ozone/pull/163]. It did 
last longer, which is a good thing, but eventually it still failed with a 
NO_SUCH_MULTIPART_UPLOAD_ERROR. The logs are in the attachment. Interestingly, 
the first error happened earlier but didn't prevent writing. After a few hours, 
it failed, and below is the last error log. 

 

2019-11-15 22:13:56,493 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
 MultipartUpload Commit is failed for Key:plc_1570869510243_5542 in 
Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test
NO_SUCH_MULTIPART_UPLOAD_ERROR 
org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload is 
with specified uploadId 69162f8b-a923-4247-bb67-b1d6f9fa0d9d-103141824303150377
 at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:159)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:217)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2018-11-15-OM-logs.txt, 2019-11-06_18_13_57_422_ERROR, 
> hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png, 
> om-audit-VM_50_210_centos.log, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-17 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: 2018-11-15-OM-logs.txt

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2018-11-15-OM-logs.txt, 2019-11-06_18_13_57_422_ERROR, 
> hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png, 
> om-audit-VM_50_210_centos.log, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown 

[jira] [Resolved] (HDDS-2492) Fix test clean up issue in TestSCMPipelineManager

2019-11-15 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng resolved HDDS-2492.

Fix Version/s: 0.4.1
   Resolution: Fixed

> Fix test clean up issue in TestSCMPipelineManager
> -
>
> Key: HDDS-2492
> URL: https://issues.apache.org/jira/browse/HDDS-2492
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.4.1
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> This was opened based on [~sammichen]'s investigation on HDDS-2034.
>  
> {quote}Failure is caused by newly introduced function 
> TestSCMPipelineManager#testPipelineOpenOnlyWhenLeaderReported which doesn't 
> close pipelineManager at the end. It's better to fix it in a new JIRA.
> {quote}
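A minimal sketch of the cleanup pattern described in the quote, using a 
stand-in class instead of the real pipeline manager (names are assumptions, not 
the actual TestSCMPipelineManager code):

{code:java}
// Sketch of the fix pattern only; PipelineManagerStub stands in for the real
// pipeline manager used by TestSCMPipelineManager.
import org.junit.Test;

public class TestPipelineManagerCleanupSketch {

  /** Stand-in for the real pipeline manager; only close() matters here. */
  static class PipelineManagerStub implements AutoCloseable {
    @Override
    public void close() {
      // Release background threads, DB handles, etc., so later tests start clean.
    }
  }

  @Test
  public void testPipelineOpenOnlyWhenLeaderReported() throws Exception {
    PipelineManagerStub pipelineManager = new PipelineManagerStub();
    try {
      // ... exercise the pipeline-open-only-when-leader-reported behaviour ...
    } finally {
      // The missing step called out above: always close the manager at the end.
      pipelineManager.close();
    }
  }
}
{code}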






[jira] [Work started] (HDDS-2492) Fix test clean up issue in TestSCMPipelineManager

2019-11-14 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDDS-2492 started by Li Cheng.
--
> Fix test clean up issue in TestSCMPipelineManager
> -
>
> Key: HDDS-2492
> URL: https://issues.apache.org/jira/browse/HDDS-2492
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Li Cheng
>Priority: Major
>
> This was opened based on [~sammichen]'s investigation on HDDS-2034.
>  
> {quote}Failure is caused by newly introduced function 
> TestSCMPipelineManager#testPipelineOpenOnlyWhenLeaderReported which doesn't 
> close pipelineManager at the end. It's better to fix it in a new JIRA.
> {quote}






[jira] [Commented] (HDDS-2492) Fix test clean up issue in TestSCMPipelineManager

2019-11-14 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974837#comment-16974837
 ] 

Li Cheng commented on HDDS-2492:


[https://github.com/apache/hadoop-ozone/pull/179]

> Fix test clean up issue in TestSCMPipelineManager
> -
>
> Key: HDDS-2492
> URL: https://issues.apache.org/jira/browse/HDDS-2492
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Li Cheng
>Priority: Major
>
> This was opened based on [~sammichen]'s investigation on HDDS-2034.
>  
> {quote}Failure is caused by newly introduced function 
> TestSCMPipelineManager#testPipelineOpenOnlyWhenLeaderReported which doesn't 
> close pipelineManager at the end. It's better to fix it in a new JIRA.
> {quote}






[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-13 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: om-audit-VM_50_210_centos.log

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png, om-audit-VM_50_210_centos.log, 
> om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-13 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16973103#comment-16973103
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] I ran the test again and now see the error in the audit log:

]} | ret=FAILURE | NO_SUCH_MULTIPART_UPLOAD_ERROR 
org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload is 
with specified uploadId bd6579f4-22ce-4c15-8402-e979f1251b13-103129369238110745
 at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:156)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:217)
...skipping...
 uuid: "79bf7bdf-ed29-49d4-bf7c-e88fdbd2ce03"
 ipAddress: "9.134.51.215"
 hostName: "9.134.51.215"
 ports {
 name: "RATIS"
 value: 9858
 }
 ports {
 name: "STANDALONE"
 value: 9859
 }
 networkName: "79bf7bdf-ed29-49d4-bf7c-e88fdbd2ce03"
 networkLocation: "/default-rack"
 }
 state: PIPELINE_OPEN
 type: RATIS
 factor: THREE
 id {
 id: "ec6b06c5-193f-4c30-879b-5a12284dc4f8"
 }
}
]} | ret=FAILURE | NO_SUCH_MULTIPART_UPLOAD_ERROR 
org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload is 
with specified uploadId bd6579f4-22ce-4c15-8402-e979f1251b13-103129369238110745
 at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:156)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:217)

 

I'm uploading the entire OM audit logs in the attachment.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at 

[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-10 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971290#comment-16971290
 ] 

Li Cheng edited comment on HDDS-2356 at 11/11/19 7:21 AM:
--

[~bharat] In terms of the key in the last stacktrace:

2019-11-08 20:08:24,832 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest:
 MultipartUpload Complete request failed for Key: plc_1570863541668_9278 in 
Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test
 INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete 
Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: 
ozone-testkey: plc_1570863541668_9278

 

OM audit logs show a bunch of entries for key plc_1570863541668_9278 with 
different clientID, for instance:

2019-11-08 20:19:56,241 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_BLOCK \{volume=s325d55ad283aa400af464c76d713c07ad, 
bucket=ozone-test, key=plc_1570863541668_9278, dataSize=5242880, 
replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], 
clientID=103102209394803336} | ret=SUCCESS |

 

I'm uploading the entire OM audit logs about this key plc_1570863541668_9278.

This file size:

[root@VM_50_210_centos /data/idex_data/zip]# ls -altr -h 
./20191012/plc_1570863541668_9278
-rw-r--r-- 1 1003 users 1.4G 10月 22 10:33 ./20191012/plc_1570863541668_9278

 

You can try creating a file of a similar size on your own and see if it 
reproduces the issue. Please refer to the description for env details.


was (Author: timmylicheng):
[~bharat] In terms of the key in the last stacktrace:

2019-11-08 20:08:24,832 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest:
 MultipartUpload Complete request failed for Key: plc_1570863541668_9278 in 
Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test
INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete 
Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: 
ozone-testkey: plc_1570863541668_9278

 

OM audit logs show a bunch of entries for key plc_1570863541668_9278 with 
different clientID, for instance:

2019-11-08 20:19:56,241 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_BLOCK \{volume=s325d55ad283aa400af464c76d713c07ad, 
bucket=ozone-test, key=plc_1570863541668_9278, dataSize=5242880, 
replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], 
clientID=103102209394803336} | ret=SUCCESS |

 

I'm uploading the entire OM audit logs about this key plc_1570863541668_9278.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-10 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: om_audit_log_plc_1570863541668_9278.txt

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png, om_audit_log_plc_1570863541668_9278.txt
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-10 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971290#comment-16971290
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] In terms of the key in the last stacktrace:

2019-11-08 20:08:24,832 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest:
 MultipartUpload Complete request failed for Key: plc_1570863541668_9278 in 
Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test
INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete 
Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: 
ozone-testkey: plc_1570863541668_9278

 

OM audit logs show a bunch of entries for key plc_1570863541668_9278 with 
different clientID, for instance:

2019-11-08 20:19:56,241 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_BLOCK \{volume=s325d55ad283aa400af464c76d713c07ad, 
bucket=ozone-test, key=plc_1570863541668_9278, dataSize=5242880, 
replicationType=RATIS, replicationFactor=THREE, keyLocationInfo=[], 
clientID=103102209394803336} | ret=SUCCESS |

 

I'm uploading the entire OM audit logs about this key plc_1570863541668_9278.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> 

[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-10 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971289#comment-16971289
 ] 

Li Cheng edited comment on HDDS-2356 at 11/11/19 3:14 AM:
--

[~bharat] As I said, debug logs for Goofys don't work since goofys debug mode 
does everything in a single thread. Is the same error not reproduced on your side? 
Are you modeling some sample dataset and environment to test the issue? I can try 
reproducing from my side and upload the OM log, s3g log as well as the audit log 
here. Does that work?

 

Also, do you see 2019-11-06_18_13_57_422_ERROR in the attachment? Does it help?


was (Author: timmylicheng):
[~bharat] As I said, debug logs for Goofys don't work since goofys debug mode 
does everything in a single thread. Is the same error not reproduced on your side? 
Are you modeling some sample dataset and environment to test the issue? I can try 
reproducing from my side and upload the OM log, s3g log as well as the audit log 
here. Does that work?

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-10 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16971289#comment-16971289
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] As I said, debug logs for Goofys don't work since goofys debug mode 
does everything in a single thread. Is the same error not reproduced on your side? 
Are you modeling some sample dataset and environment to test the issue? I can try 
reproducing from my side and upload the OM log, s3g log as well as the audit log 
here. Does that work?

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-08 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970119#comment-16970119
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] A new error shows up using today's master branch.

 

2019-11-08 20:08:24,832 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest:
 MultipartUpload Complete request failed for Key: plc_1570863541668_9278 in 
Volume/Bucket s325d55ad283aa400af464c76d713c07ad/ozone-test
INVALID_PART org.apache.hadoop.ozone.om.exceptions.OMException: Complete 
Multipart Upload Failed: volume: s325d55ad283aa400af464c76d713c07adbucket: 
ozone-testkey: plc_1570863541668_9278
 at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCompleteRequest.validateAndUpdateCache(S3MultipartUploadCompleteRequest.java:187)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:217)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> 

[jira] [Updated] (HDDS-2443) Python client/interface for Ozone

2019-11-08 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2443:
---
Attachment: (was: OzoneS3.py)

> Python client/interface for Ozone
> -
>
> Key: HDDS-2443
> URL: https://issues.apache.org/jira/browse/HDDS-2443
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Client
>Reporter: Li Cheng
>Priority: Major
> Attachments: OzoneS3.py
>
>
> Original ideas: item#25 in 
> [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses  libhdfs
>  
> Path to try:
> # s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
> # python native RPC
> # pyarrow + libhdfs, which use the Java client under the hood.
> # python + C interface of go / rust ozone library. I created POC go / rust 
> clients earlier which can be improved if the libhdfs interface is not good 
> enough. [By [~elek]]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2443) Python client/interface for Ozone

2019-11-08 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2443:
---
Attachment: OzoneS3.py

> Python client/interface for Ozone
> -
>
> Key: HDDS-2443
> URL: https://issues.apache.org/jira/browse/HDDS-2443
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Client
>Reporter: Li Cheng
>Priority: Major
> Attachments: OzoneS3.py
>
>
> Original ideas: item#25 in 
> [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses  libhdfs
>  
> Path to try:
> # s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
> # python native RPC
> # pyarrow + libhdfs, which use the Java client under the hood.
> # python + C interface of go / rust ozone library. I created POC go / rust 
> clients earlier which can be improved if the libhdfs interface is not good 
> enough. [By [~elek]]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2443) Python client/interface for Ozone

2019-11-08 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2443:
---
Attachment: OzoneS3.py

> Python client/interface for Ozone
> -
>
> Key: HDDS-2443
> URL: https://issues.apache.org/jira/browse/HDDS-2443
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Client
>Reporter: Li Cheng
>Priority: Major
> Attachments: OzoneS3.py
>
>
> Original ideas: item#25 in 
> [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses  libhdfs
>  
> Path to try:
> # s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
> # python native RPC
> # pyarrow + libhdfs, which use the Java client under the hood.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2443) Python client/interface for Ozone

2019-11-08 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969974#comment-16969974
 ] 

Li Cheng commented on HDDS-2443:


[~elek] That would be great! I've uploaded my naive version of the interface. :)

> Python client/interface for Ozone
> -
>
> Key: HDDS-2443
> URL: https://issues.apache.org/jira/browse/HDDS-2443
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Client
>Reporter: Li Cheng
>Priority: Major
> Attachments: OzoneS3.py
>
>
> Original ideas: item#25 in 
> [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses  libhdfs
>  
> Path to try:
> # s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
> # python native RPC
> # pyarrow + libhdfs, which use the Java client under the hood.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-07 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: 2019-11-06_18_13_57_422_ERROR

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-07 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969888#comment-16969888
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] Please check the attachment one more time. I re-uploaded the logs.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 2019-11-06_18_13_57_422_ERROR, hs_err_pid9340.log, 
> image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> Updated on 11/06/2019:
> See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs 
> are in the attachment.
>  2019-11-05 18:12:37,766 ERROR 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
>  MultipartUpload Commit is failed for Key:./2
> 0191012/plc_1570863541668_9278 in Volume/Bucket 
> s325d55ad283aa400af464c76d713c07ad/ozone-test
> NO_SUCH_MULTIPART_UPLOAD_ERROR 
> org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload 
> is with specified uploadId fcda8608-b431-48b7-8386-
> 0a332f1a709a-103084683261641950
> at 
> org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
> 56)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
> java:217)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
> at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
> at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
> Updated on 10/28/2019:
> See MISMATCH_MULTIPART_LIST error.
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> 

[jira] [Commented] (HDDS-2443) Python client/interface for Ozone

2019-11-07 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16969771#comment-16969771
 ] 

Li Cheng commented on HDDS-2443:


Prototyping with the S3 gateway + boto3 now. Reads, writes and deletes can be done. 
Large object reads may need some tweaking.

The only concern is that, even when it is only uploading files to S3, it shows a 
read timeout towards the Ozone endpoint:

ReadTimeoutError: Read timeout on endpoint URL: 
"http://localhost:9878/ozone-test/./20191011/plc_1570784946653_2774"

> Python client/interface for Ozone
> -
>
> Key: HDDS-2443
> URL: https://issues.apache.org/jira/browse/HDDS-2443
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Client
>Reporter: Li Cheng
>Priority: Major
>
> Original ideas:
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses                      libhdfs
>  # How Jupyter iPython work: 
> [https://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html]
>  # Eco, 
> Architecture:[https://ipython-books.github.io/chapter-3-mastering-the-jupyter-notebook/]
>  
> Path to try:
> 1. s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
> 2. python native RPC
> 3. pyarrow + libhdfs, which use the Java client under the hood.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2443) Python client/interface for Ozone

2019-11-07 Thread Li Cheng (Jira)
Li Cheng created HDDS-2443:
--

 Summary: Python client/interface for Ozone
 Key: HDDS-2443
 URL: https://issues.apache.org/jira/browse/HDDS-2443
 Project: Hadoop Distributed Data Store
  Issue Type: New Feature
  Components: Ozone Client
Reporter: Li Cheng


Original ideas:

Ozone Client(Python) for Data Science Notebook such as Jupyter.
 # Size: Large
 # PyArrow: [https://pypi.org/project/pyarrow/]
 # Python -> libhdfs HDFS JNI library (HDFS, S3, ...) -> Java client API; Impala 
uses libhdfs
 # How Jupyter iPython work: 
[https://jupyter.readthedocs.io/en/latest/architecture/how_jupyter_ipython_work.html]
 # Eco, 
Architecture:[https://ipython-books.github.io/chapter-3-mastering-the-jupyter-notebook/]

 

Path to try:

1. s3 interface: Ozone s3 gateway(already supported) + AWS python client (boto3)
2. python native RPC
3. pyarrow + libhdfs, which use the Java client under the hood.
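
(A rough, untested sketch of path 3 above via the legacy pyarrow libhdfs wrapper; it assumes a local Hadoop client configured with the Ozone filesystem jar and fs.defaultFS pointing at an o3fs URI, all of which are placeholders here, not a verified setup.)

# Untested sketch of the pyarrow + libhdfs path. It relies on the local Hadoop
# config (core-site.xml) having fs.defaultFS set to an o3fs URI, e.g.
# o3fs://bucket.volume.om-host:9862, and the Ozone filesystem jar on the Hadoop
# classpath. All names here are placeholders / assumptions.
import pyarrow as pa

fs = pa.hdfs.connect(host='default')  # libhdfs picks up fs.defaultFS from the Hadoop config

with fs.open('/demo/hello.txt', 'wb') as f:
    f.write(b'hello ozone')

print(fs.ls('/demo'))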



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-06 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968978#comment-16968978
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] My program doesn't abort, but I see some errors in the s3g logs with a 
matching timeline. Not sure if it's related.

November 06, 2019 6:11:35 PM 
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference
 cleanQueue
SEVERE: *~*~*~ Channel ManagedChannelImpl\{logId=32225, target=9.134.51.232:9859} 
was not shutdown properly!!! ~*~*~*
 Make sure to call shutdown()/shutdownNow() and wait until awaitTermination() 
returns true.
java.lang.RuntimeException: ManagedChannel allocation site
 at 
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:103)
 at 
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:53)
 at 
org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:44)
 at 
org.apache.ratis.thirdparty.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:411)
 at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.connectToDatanode(XceiverClientGrpc.java:192)
 at 
org.apache.hadoop.hdds.scm.XceiverClientGrpc.connect(XceiverClientGrpc.java:139)
 at 
org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:242)
 at 
org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:226)
 at 
com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4767)
 at 
com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
 at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
 at 
com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
 at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
 at com.google.common.cache.LocalCache.get(LocalCache.java:3965)
 at 
com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4764)
 at 
org.apache.hadoop.hdds.scm.XceiverClientManager.getClient(XceiverClientManager.java:226)
 at 
org.apache.hadoop.hdds.scm.XceiverClientManager.acquireClient(XceiverClientManager.java:172)
 at 
org.apache.hadoop.hdds.scm.XceiverClientManager.acquireClientForReadData(XceiverClientManager.java:162)
 at 
org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:154)
 at 
org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:118)
 at 
org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:224)
 at 
org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:173)
 at 
org.apache.hadoop.ozone.client.io.OzoneInputStream.read(OzoneInputStream.java:47)
 at java.io.InputStream.read(InputStream.java:101)
 at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2146)
 at org.apache.commons.io.IOUtils.copy(IOUtils.java:2102)
 at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:2123)
 at org.apache.commons.io.IOUtils.copy(IOUtils.java:2078)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.lambda$get$0(ObjectEndpoint.java:252)
 at 
org.glassfish.jersey.message.internal.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:79)
 at 
org.glassfish.jersey.message.internal.StreamingOutputProvider.writeTo(StreamingOutputProvider.java:61)
 at 
org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.invokeWriteTo(WriterInterceptorExecutor.java:266)
 at 
org.glassfish.jersey.message.internal.WriterInterceptorExecutor$TerminalWriterInterceptor.aroundWriteTo(WriterInterceptorExecutor.java:251)
 at 
org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)
 at 
org.glassfish.jersey.server.internal.JsonWithPaddingInterceptor.aroundWriteTo(JsonWithPaddingInterceptor.java:109)
 at 
org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)
 at 
org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:85)
 at 
org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:163)
 at 
org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1135)
 at 
org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:662)
 at 
org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:395)
 at 
org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:385)
 at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:280)
 at org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)
 at 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-06 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Description: 
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, reading data from the VM0 local disk and writing to the mount path. 
The dataset contains files of various sizes, from 0 bytes to GB-level, roughly 
50,000 files in total.

Writing is slow (1 GB in ~10 minutes) and it stops after around 4 GB. Looking at 
the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors related to 
multipart upload. This error eventually causes the writing to terminate and OM to 
be shut down.

 

Updated on 11/06/2019:

See new multipart upload error NO_SUCH_MULTIPART_UPLOAD_ERROR and full logs are 
in the attachment.

 2019-11-05 18:12:37,766 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
 MultipartUpload Commit is failed for Key:./2
0191012/plc_1570863541668_9278 in Volume/Bucket 
s325d55ad283aa400af464c76d713c07ad/ozone-test
NO_SUCH_MULTIPART_UPLOAD_ERROR 
org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload is 
with specified uploadId fcda8608-b431-48b7-8386-
0a332f1a709a-103084683261641950
at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
56)
at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
java:217)
at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

 

Updated on 10/28/2019:

See MISMATCH_MULTIPART_LIST error.

 

2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete Multipart 
Upload Request for bucket: ozone-test, key: 20191012/plc_1570863541668_927
 8
 MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
Complete Multipart Upload Failed: volume: 
s3c89e813c80ffcea9543004d57b2a1239bucket:
 ozone-testkey: 20191012/plc_1570863541668_9278
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
 .java:1104)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
 at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
 at 
org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
 at 
org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
 at 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-06 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Description: 
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, reading data from the VM0 local disk and writing to the mount path. 
The dataset contains files of various sizes, from 0 bytes to GB-level, roughly 
50,000 files in total.

Writing is slow (1 GB in ~10 minutes) and it stops after around 4 GB. Looking at 
the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors related to 
multipart upload. This error eventually causes the writing to terminate and OM to 
be shut down.

 

Updated on 11/06/2019:

See new multipart upload error and full logs are in the attachment.

 

 

2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete Multipart 
Upload Request for bucket: ozone-test, key: 20191012/plc_1570863541668_927
 8
 MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
Complete Multipart Upload Failed: volume: 
s3c89e813c80ffcea9543004d57b2a1239bucket:
 ozone-testkey: 20191012/plc_1570863541668_9278
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
 .java:1104)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
 at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
 at 
org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
 at 
org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
 at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)

 

The following errors has been resolved in 
https://issues.apache.org/jira/browse/HDDS-2322. 

2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered 
Throwable error
 java.util.ConcurrentModificationException
 at java.util.TreeMap.forEach(TreeMap.java:1004)
 at 
org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
 at 
org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
 at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
 at 
org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
 at java.util.Iterator.forEachRemaining(Iterator.java:116)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
 at java.lang.Thread.run(Thread.java:745)
 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:

  was:
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a fuse and enable ozone S3 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-11-06 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968957#comment-16968957
 ] 

Li Cheng commented on HDDS-2356:


See a new multipart upload error in yesterday's master. [~bharat]

Check the newly attached log 2019-11-06 18:13:57,422 ERROR for more info. It 
seems to fail for multiple keys.

2019-11-05 18:12:37,766 ERROR 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest:
 MultipartUpload Commit is failed for Key:./2
0191012/plc_1570863541668_9278 in Volume/Bucket 
s325d55ad283aa400af464c76d713c07ad/ozone-test
NO_SUCH_MULTIPART_UPLOAD_ERROR 
org.apache.hadoop.ozone.om.exceptions.OMException: No such Multipart upload is 
with specified uploadId fcda8608-b431-48b7-8386-
0a332f1a709a-103084683261641950
 at 
org.apache.hadoop.ozone.om.request.s3.multipart.S3MultipartUploadCommitPartRequest.validateAndUpdateCache(S3MultipartUploadCommitPartRequest.java:1
56)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.
java:217)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:132)
 at 
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
 at 
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:100)
 at 
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
 at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)

 

 

 

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> 

[jira] [Commented] (HDDS-2396) OM rocksdb core dump during writing

2019-11-03 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966372#comment-16966372
 ] 

Li Cheng commented on HDDS-2396:


[~bharat] Probably not up to date if HDDS-2379 was resolved within the last week. 
I can try the latest master again. 

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
>  
> There happens core dump in rocksdb while it's occasional. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb:
> :ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964554#comment-16964554
 ] 

Li Cheng edited comment on HDDS-2356 at 11/1/19 3:45 AM:
-

Also saw a core dump in RocksDB during last night's testing. Please check the 
attachment for the entire log.

 

At first glance, it looks like the STL memory error happens while RocksDB is 
iterating the write_batch to insert entries into the memtable, during the memory 
move. It might not be related to Ozone, but it does cause the RocksDB failure. 
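
For reference, here is a minimal, self-contained Java sketch (illustrative only, 
not the OM code; the class name and DB path are made up) of the write path that 
the stack below shows: the double buffer batches puts into a RocksDB WriteBatch 
and db.write() then crosses the JNI boundary, where the memtable inserter copies 
each value in.

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBatch;
import org.rocksdb.WriteOptions;

public class DoubleBufferFlushSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         RocksDB db = RocksDB.open(options, "/tmp/om-sketch-db");
         WriteBatch batch = new WriteBatch();
         WriteOptions writeOptions = new WriteOptions()) {
      // Each buffered OM transaction becomes a put in the batch.
      batch.put("key1".getBytes(), "value1".getBytes());
      batch.put("key2".getBytes(), "value2".getBytes());
      // db.write() goes through Java_org_rocksdb_RocksDB_write0 and
      // WriteBatch::Iterate -> MemTableInserter::PutCF, which memmoves the
      // value into the memtable -- the frame where the crash is reported.
      db.write(writeOptions, batch);
    }
  }
}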

 

Created https://issues.apache.org/jira/browse/HDDS-2396 to track the core dump 
in OM rocksdb.

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
 C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb:
 :ValueType)+0x51c
 C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
 C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
 C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
 C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
 C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
 C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
 J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
 J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
 j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4


was (Author: timmylicheng):
Also saw a core dump in RocksDB during last night's testing. Please check the 
attachment for the entire log.

 

At first glance, it looks like the STL memory error happens while RocksDB is 
iterating the write_batch to insert entries into the memtable, during the memory 
move. It might not be related to Ozone, but it does cause the RocksDB failure. 

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb:
:ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: 

[jira] [Commented] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964578#comment-16964578
 ] 

Li Cheng commented on HDDS-2396:


Attached the entire log for the core dump. Will try to turn on ulimit and 
reproduce this, but it happens only occasionally. 

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
>  
> There happens core dump in rocksdb while it's occasional. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb:
> :ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2396:
---
Attachment: hs_err_pid9340.log

> OM rocksdb core dump during writing
> ---
>
> Key: HDDS-2396
> URL: https://issues.apache.org/jira/browse/HDDS-2396
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
>Reporter: Li Cheng
>Priority: Major
> Attachments: hs_err_pid9340.log
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
>  
> There happens core dump in rocksdb while it's occasional. 
>  
> Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
> space=1018k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
> code)
> C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
> C [librocksdbjni3192271038586903156.so+0x358fec] 
> rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&, rocksdb:
> :ValueType)+0x51c
> C [librocksdbjni3192271038586903156.so+0x359d17] 
> rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
> rocksdb::Slice const&)+0x17
> C [librocksdbjni3192271038586903156.so+0x3513bc] 
> rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
> C [librocksdbjni3192271038586903156.so+0x354df9] 
> rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
> unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, 
> bool, unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
> C [librocksdbjni3192271038586903156.so+0x29fd79] 
> rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, 
> bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
> C [librocksdbjni3192271038586903156.so+0x2a0431] 
> rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, 
> rocksdb::WriteBatch*)+0x21
> C [librocksdbjni3192271038586903156.so+0x1a064c] 
> Java_org_rocksdb_RocksDB_write0+0xcc
> J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
> [0x7f58f1872d00+0xbe]
> J 10093% C1 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V
>  (400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
> j 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
> j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2396) OM rocksdb core dump during writing

2019-10-31 Thread Li Cheng (Jira)
Li Cheng created HDDS-2396:
--

 Summary: OM rocksdb core dump during writing
 Key: HDDS-2396
 URL: https://issues.apache.org/jira/browse/HDDS-2396
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Manager
Affects Versions: 0.4.1
Reporter: Li Cheng
 Attachments: hs_err_pid9340.log

Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, then read data from VM0's local disk and write it to the mount path. 
The dataset contains ~50,000 files ranging in size from 0 bytes to GB-level. 

 

An occasional core dump happens in RocksDB during the writes. 

 

Stack: [0x7f5891a23000,0x7f5891b24000], sp=0x7f5891b21bb8, free 
space=1018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb:
:ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4
j java.lang.Thread.run()V+11



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2395) Handle Ozone S3 completeMPU to match with aws s3 behavior.

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964573#comment-16964573
 ] 

Li Cheng commented on HDDS-2395:


Also please note the exclude list issue.

 

2019-11-01 11:25:24,047 [qtp1383524016-27648] INFO - Allocating block with 
ExcludeList {datanodes = [], containerIds = [], pipelineIds = 
[PipelineID=20d1830a-a77d-498e-a4a1-ba656ead3d97, ... (the same pipeline ID 
repeated about 75 times)]}
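
For illustration only (this is not Ozone's actual ExcludeList implementation; the 
class and method names are made up), a small Java sketch of how tracking excluded 
pipeline IDs in a Set would keep repeated additions of the same pipeline, as in 
the log above, from growing the list:

import java.util.LinkedHashSet;
import java.util.Set;
import java.util.UUID;

public class ExcludeListSketch {
  // A Set makes re-adding the same failed pipeline on every retry a no-op
  // instead of appending another duplicate entry.
  private final Set<UUID> pipelineIds = new LinkedHashSet<>();

  public void addPipeline(UUID pipelineId) {
    pipelineIds.add(pipelineId);
  }

  public int distinctPipelineCount() {
    return pipelineIds.size();
  }

  public static void main(String[] args) {
    ExcludeListSketch excludeList = new ExcludeListSketch();
    UUID failedPipeline = UUID.fromString("20d1830a-a77d-498e-a4a1-ba656ead3d97");
    for (int retry = 0; retry < 75; retry++) {
      excludeList.addPipeline(failedPipeline); // same ID on every retry
    }
    System.out.println("Distinct excluded pipelines: "
        + excludeList.distinctPipelineCount()); // prints 1
  }
}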

> Handle Ozone S3 completeMPU to match with aws s3 behavior.
> --
>
> Key: HDDS-2395
> URL: https://issues.apache.org/jira/browse/HDDS-2395
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> # When uploaded 2 parts, and when complete upload 1 part no error
>  # During complete multipart upload name/part number not matching with 
> uploaded part and part number then InvalidPart error
>  # When parts are not specified in sorted order InvalidPartOrder
>  # During complete multipart upload when no 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964554#comment-16964554
 ] 

Li Cheng commented on HDDS-2356:


Also saw a core dump in RocksDB during last night's testing. Please check the 
attachment for the entire log.

 

At first glance, it looks like the STL memory error happens while RocksDB is 
iterating the write_batch to insert entries into the memtable, during the memory 
move. It might not be related to Ozone, but it does cause the RocksDB failure. 

Below is some part of the stack:

C [libc.so.6+0x151d60] __memmove_ssse3_back+0x1ae0
C [librocksdbjni3192271038586903156.so+0x358fec] 
rocksdb::MemTableInserter::PutCFImpl(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&, rocksdb:
:ValueType)+0x51c
C [librocksdbjni3192271038586903156.so+0x359d17] 
rocksdb::MemTableInserter::PutCF(unsigned int, rocksdb::Slice const&, 
rocksdb::Slice const&)+0x17
C [librocksdbjni3192271038586903156.so+0x3513bc] 
rocksdb::WriteBatch::Iterate(rocksdb::WriteBatch::Handler*) const+0x45c
C [librocksdbjni3192271038586903156.so+0x354df9] 
rocksdb::WriteBatchInternal::InsertInto(rocksdb::WriteThread::WriteGroup&, 
unsigned long, rocksdb::ColumnFamilyMemTables*, rocksdb::FlushScheduler*, bool, 
unsigned long, rocksdb::DB*, bool, bool, bool)+0x1f9
C [librocksdbjni3192271038586903156.so+0x29fd79] 
rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, 
rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, 
unsigned long, rocksdb::PreReleaseCallback*)+0x24b9
C [librocksdbjni3192271038586903156.so+0x2a0431] 
rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x21
C [librocksdbjni3192271038586903156.so+0x1a064c] 
Java_org_rocksdb_RocksDB_write0+0xcc
J 7899 org.rocksdb.RocksDB.write0(JJJ)V (0 bytes) @ 0x7f58f1872dbe 
[0x7f58f1872d00+0xbe]
J 10093% C1 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions()V 
(400 bytes) @ 0x7f58f2308b0c [0x7f58f2307a40+0x10cc]
j org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer$$Lambda$29.run()V+4

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: hs_err_pid9340.log

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: hs_err_pid9340.log, image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following errors has been resolved in 
> https://issues.apache.org/jira/browse/HDDS-2322. 
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16964138#comment-16964138
 ] 

Li Cheng commented on HDDS-2356:


[~msingh] I checked the OM audit log and it looks like this one: dataSize=215547

2019-10-31 17:44:05,645 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:10,725 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:15,915 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:21,108 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:28,319 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:32,951 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:38,360 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:43,390 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:48,498 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {
2019-10-31 17:44:55,654 | INFO | OMAudit | user=root | ip=9.134.50.210 | 
op=ALLOCATE_KEY {volume=s325d55ad283aa400af464c76d713c07ad, bucket=ozone-test, 
key=./20191012/plc_1570860558836_9937, dataSize=215547, replicationType=RATIS, 
replicationFactor=THREE, keyLocationInfo=[blockID {

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963875#comment-16963875
 ] 

Li Cheng commented on HDDS-2356:


I tried multiprocessing in Python to fetch data locally and write to the remote 
Ozone cluster through the S3 gateway using boto3 (the AWS Python client). It shows an RPC error:

 

2019-10-31 17:44:33,123 [qtp1383524016-25] ERROR 
(OzoneManagerProtocolClientSideTranslatorPB.java:268) - Failed to connect to 
OM. Attempted 10 retries a
nd 10 failovers
2019-10-31 17:44:33,124 [qtp1383524016-25] ERROR (ObjectEndpoint.java:186) - 
Exception occurred in PutObject
java.io.IOException: Failed on local exception: 
org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data length; 
Host Details : local host is:
"VM_50_210_centos/127.0.0.1"; destination host is: "9.134.50.210":9862;
 at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:816)
 at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1515)
 at org.apache.hadoop.ipc.Client.call(Client.java:1457)
 at org.apache.hadoop.ipc.Client.call(Client.java:1367)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
 at com.sun.proxy.$Proxy81.submitRequest(Unknown Source)
 at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
 at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
 at com.sun.proxy.$Proxy81.submitRequest(Unknown Source)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:338)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.openKey(OzoneManagerProtocolClientSideTranslatorPB.java:723)
 at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)

at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
 at com.sun.proxy.$Proxy82.openKey(Unknown Source)
 at org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:614)
 at org.apache.hadoop.ozone.client.OzoneBucket.createKey(OzoneBucket.java:325)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.put(ObjectEndpoint.java:173)
 at sun.reflect.GeneratedMethodAccessor26.invoke(Unknown Source)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
 at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:415)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:104)
 at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
 at org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)
 at org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:268)
 at 
org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)
 at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:256)
 at 
org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)
 at 
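
For comparison, here is a rough Java analogue (AWS SDK for Java v1) of what the 
Python/boto3 writer does; the endpoint address, port 9878, credentials, bucket, 
key and file path below are all assumptions for illustration, not taken from the 
test setup.

import java.io.File;

import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class OzoneS3PutSketch {
  public static void main(String[] args) {
    // Point the S3 client at the Ozone S3 gateway instead of AWS.
    AmazonS3 s3 = AmazonS3ClientBuilder.standard()
        .withEndpointConfiguration(new AwsClientBuilder.EndpointConfiguration(
            "http://9.134.50.210:9878", "us-east-1"))
        .withPathStyleAccessEnabled(true)
        .withCredentials(new AWSStaticCredentialsProvider(
            new BasicAWSCredentials("anyAccessKey", "anySecretKey")))
        .build();
    // A single PUT per object; the goofys path discussed in this thread goes
    // through multipart upload instead, which is where the
    // completeMultipartUpload errors show up.
    s3.putObject("ozone-test", "20191012/example-key", new File("/data/example-file"));
    s3.shutdown();
  }
}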

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Attachment: image-2019-10-31-18-56-56-177.png

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
> Attachments: image-2019-10-31-18-56-56-177.png
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete 
> Multipart Upload Request for bucket: ozone-test, key: 
> 20191012/plc_1570863541668_927
>  8
>  MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
> Complete Multipart Upload Failed: volume: 
> s3c89e813c80ffcea9543004d57b2a1239bucket:
>  ozone-testkey: 20191012/plc_1570863541668_9278
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
>  at 
> org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
>  .java:1104)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
>  at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
>  at 
> org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
>  at 
> org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
>  at 
> org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>  at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  at java.lang.reflect.Method.invoke(Method.java:497)
>  at 
> org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
>  at 
> org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
>  at 
> org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
>  at 
> org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)
>  
> The following errors has been resolved in 
> https://issues.apache.org/jira/browse/HDDS-2322. 
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
>  java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Description: 
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, then read data from VM0's local disk and write it to the mount path. 
The dataset contains ~50,000 files ranging in size from 0 bytes to GB-level. 

The writing is slow (1 GB in ~10 minutes) and it stops after around 4 GB. Looking 
at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors related 
to multipart upload. This error eventually causes the writing to terminate and OM 
to shut down. 

 

2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete Multipart 
Upload Request for bucket: ozone-test, key: 20191012/plc_1570863541668_927
 8
 MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
Complete Multipart Upload Failed: volume: 
s3c89e813c80ffcea9543004d57b2a1239bucket:
 ozone-testkey: 20191012/plc_1570863541668_9278
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
 .java:1104)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
 at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
 at 
org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
 at 
org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
 at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)

 

The following errors have been resolved in 
https://issues.apache.org/jira/browse/HDDS-2322. 

2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered 
Throwable error
 java.util.ConcurrentModificationException
 at java.util.TreeMap.forEach(TreeMap.java:1004)
 at 
org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
 at 
org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
 at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
 at 
org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
 at java.util.Iterator.forEachRemaining(Iterator.java:116)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
 at java.lang.Thread.run(Thread.java:745)
 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
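
For illustration, a self-contained Java sketch (not the OM code; the names are 
made up) of the failure mode in the stack above: one thread serializes a TreeMap 
with forEach while another thread keeps adding parts, which typically ends in 
ConcurrentModificationException.

import java.util.ConcurrentModificationException;
import java.util.TreeMap;

public class TreeMapRaceSketch {
  public static void main(String[] args) throws InterruptedException {
    TreeMap<Integer, String> partKeyInfoMap = new TreeMap<>();
    for (int part = 1; part <= 100_000; part++) {
      partKeyInfoMap.put(part, "part-" + part);
    }

    // Writer mimics a commit-part request mutating the same multipart key info
    // while the double-buffer flush thread is serializing it.
    Thread writer = new Thread(() -> {
      for (int part = 100_001; part <= 200_000; part++) {
        partKeyInfoMap.put(part, "part-" + part);
      }
    });
    writer.start();

    try {
      // Like OmMultipartKeyInfo.getProto(): iterate the map to build the proto.
      partKeyInfoMap.forEach((partNumber, info) -> { /* build proto entry */ });
      System.out.println("Iteration finished without detecting the race");
    } catch (ConcurrentModificationException e) {
      System.out.println("Flush thread failed: " + e);
    }
    writer.join();
  }
}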

  was:
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path on 
VM0, while reading data from VM0 local disk and write to mount 

[jira] [Updated] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2356:
---
Description: 
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, then read data from VM0's local disk and write it to the mount path. 
The dataset contains ~50,000 files ranging in size from 0 bytes to GB-level. 

The writing is slow (1 GB in ~10 minutes) and it stops after around 4 GB. Looking 
at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors related 
to multipart upload. This error eventually causes the writing to terminate and OM 
to shut down. 

 

2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete Multipart 
Upload Request for bucket: ozone-test, key: 20191012/plc_1570863541668_927
8
MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
Complete Multipart Upload Failed: volume: 
s3c89e813c80ffcea9543004d57b2a1239bucket:
ozone-testkey: 20191012/plc_1570863541668_9278
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
.java:1104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
at 
org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
at 
org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)

 

The following errors have been resolved in 

2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered 
Throwable error
 java.util.ConcurrentModificationException
 at java.util.TreeMap.forEach(TreeMap.java:1004)
 at 
org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
 at 
org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
 at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
 at 
org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
 at java.util.Iterator.forEachRemaining(Iterator.java:116)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
 at java.lang.Thread.run(Thread.java:745)
 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:

  was:
Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path on 
VM0, while reading data from VM0 local disk and write to mount path. The 
dataset has various sizes of files from 0 byte to GB-level and it 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-31 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16963684#comment-16963684
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] I tried debug_s3 and debug_fuse in goofys, but I believe that turns 
goofys into single-threaded mode, so we won't see a reproduction. What are you 
looking for in the audit logs?

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-30 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Comment: was deleted

(was: https://github.com/apache/hadoop-ozone/pull/28)

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-30 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16962740#comment-16962740
 ] 

Li Cheng commented on HDDS-2186:


The OOM issue was resolved in https://github.com/apache/hadoop-ozone/pull/28.

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: 0.5.0
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> 

[jira] [Resolved] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-30 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng resolved HDDS-2186.

   Fix Version/s: (was: 0.5.0)
  HDDS-1564
Target Version/s: HDDS-1564
  Resolution: Fixed

https://github.com/apache/hadoop-ozone/pull/28

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> 

[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-29 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961604#comment-16961604
 ] 

Li Cheng edited comment on HDDS-2356 at 10/29/19 6:53 AM:
--

[~bharat] I'm using Python to write to an OS path, something like:
{code:python}
import os

# dir is the local source directory; dest_dir points at the goofys mount (both set elsewhere)
sub_files = os.listdir(dir)
num = 0
output = ""
for sub_file in sub_files:
    sub_file_path = dir + sub_file
    with open(dest_dir + sub_file_path, "w") as fw:
        with open(sub_file_path, "r") as fr:
            # buffer roughly 2000 lines at a time before flushing through the mount
            line = fr.readline()
            while line:
                num += 1
                output += line
                if num >= 2000:
                    fw.write(output)
                    output = ""
                    num = 0
                line = fr.readline()
        # flush whatever is left in the buffer for this file
        fw.write(output)
{code}
 

Also, I'm using the S3 gateway to connect to Ozone and mounting a local path via FUSE 
(goofys). Have you tested the S3 gateway? Most unit tests go through RPC.
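
To exercise the same multipart path without the FUSE layer, here is a minimal sketch of a 
direct multipart upload against an S3-compatible endpoint (the endpoint URL, credentials, 
bucket and key are placeholders rather than the values from this setup):
{code:python}
# Sketch only: drive one multipart upload straight at the S3 gateway, bypassing goofys.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9878",  # assumed S3 gateway address
    region_name="us-east-1",               # placeholder region for request signing
    aws_access_key_id="testuser",          # placeholder credentials
    aws_secret_access_key="testsecret",
)

bucket, key = "test-bucket", "multipart-test-object"
upload = s3.create_multipart_upload(Bucket=bucket, Key=key)

parts = []
part_size = 5 * 1024 * 1024  # parts before the last one must be at least 5 MB
for part_number in range(1, 4):
    resp = s3.upload_part(
        Bucket=bucket, Key=key, PartNumber=part_number,
        UploadId=upload["UploadId"], Body=b"x" * part_size,
    )
    parts.append({"ETag": resp["ETag"], "PartNumber": part_number})

s3.complete_multipart_upload(
    Bucket=bucket, Key=key, UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)
{code}
goofys uses multipart uploads for larger files, so driving the sequence directly should 
help isolate whether the gateway/OM path alone reproduces the error.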


was (Author: timmylicheng):
[~bharat] I'm using python to write to a OS path. sth like:
{code:java}
// code placeholder
sub_files = os.listdir(dir)
num = 0 
output = "" 
for sub_file in sub_files:
 sub_file_path = dir + sub_file
 with open(dest_dir + sub_file_path, "w") as fw:
   with open(sub_file_path, "r") as fr:
line = fr.readline()
while line:
  num += 1
  output += line
  if (num >= 2000):
fw.write(output)
output = ""
num = 0
  line = fr.readline()
   fw.write(output)
{code}

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-2356:
--

Assignee: Bharat Viswanadham

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961607#comment-16961607
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] The repeated log lines only show up on your branch, though. The full log is too 
large to attach here; think of the same lines repeated over and over, adding up to several 
megabytes.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961604#comment-16961604
 ] 

Li Cheng edited comment on HDDS-2356 at 10/29/19 2:29 AM:
--

[~bharat] I'm using Python to write to an OS path, something like:
{code:python}
import os

# dir is the local source directory; dest_dir points at the goofys mount (both set elsewhere)
sub_files = os.listdir(dir)
num = 0
output = ""
for sub_file in sub_files:
    sub_file_path = dir + sub_file
    with open(dest_dir + sub_file_path, "w") as fw:
        with open(sub_file_path, "r") as fr:
            # buffer roughly 2000 lines at a time before flushing through the mount
            line = fr.readline()
            while line:
                num += 1
                output += line
                if num >= 2000:
                    fw.write(output)
                    output = ""
                    num = 0
                line = fr.readline()
        # flush whatever is left in the buffer for this file
        fw.write(output)
{code}


was (Author: timmylicheng):
[~bharat] I'm using python to write to a OS path. sth like:
{code:java}
// code placeholder
 for sub_file in sub_files: sub_file_path = dir + sub_file with open(dest_dir + 
sub_file_path, "w") as fw: with open(sub_file_path, "r") as fr: line = 
fr.readline() while line: num += 1 output += line if (num >= 2000): 
fw.write(output) output = "" num = 0 line = fr.readline() fw.write(output)
{code}

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961604#comment-16961604
 ] 

Li Cheng edited comment on HDDS-2356 at 10/29/19 2:25 AM:
--

[~bharat] I'm using Python to write to an OS path, something like:
{code:python}
# sub_files, num and output are set up earlier in the script
for sub_file in sub_files:
    sub_file_path = dir + sub_file
    with open(dest_dir + sub_file_path, "w") as fw:
        with open(sub_file_path, "r") as fr:
            line = fr.readline()
            while line:
                num += 1
                output += line
                if num >= 2000:
                    fw.write(output)
                    output = ""
                    num = 0
                line = fr.readline()
        fw.write(output)
{code}


was (Author: timmylicheng):
[~bharat] I'm using python to write to a OS path. sth like:
{code:java}
// code placeholder
{code}
for sub_file in sub_files: sub_file_path = dir + sub_file with open(dest_dir + 
sub_file_path, "w") as fw: with open(sub_file_path, "r") as fr: line = 
fr.readline() while line: num += 1 output += line if (num >= 2000): 
fw.write(output) output = "" num = 0 line = fr.readline() fw.write(output)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961604#comment-16961604
 ] 

Li Cheng edited comment on HDDS-2356 at 10/29/19 2:24 AM:
--

[~bharat] I'm using Python to write to an OS path, something like:
{code:python}
# sub_files, num and output are set up earlier in the script
for sub_file in sub_files:
    sub_file_path = dir + sub_file
    with open(dest_dir + sub_file_path, "w") as fw:
        with open(sub_file_path, "r") as fr:
            line = fr.readline()
            while line:
                num += 1
                output += line
                if num >= 2000:
                    fw.write(output)
                    output = ""
                    num = 0
                line = fr.readline()
        fw.write(output)
{code}


was (Author: timmylicheng):
[~bharat] I'm using python to write to a OS path. sth like:

for sub_file in sub_files:
 sub_file_path = dir + sub_file
 with open(dest_dir + sub_file_path, "w") as fw:
 with open(sub_file_path, "r") as fr:
 line = fr.readline()
 while line:
 num += 1
 output += line
 if (num >= 2000):
 fw.write(output)
 output = ""
 num = 0
 line = fr.readline()
 fw.write(output)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961604#comment-16961604
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] I'm using Python to write to an OS path, something like:

for sub_file in sub_files:
    sub_file_path = dir + sub_file
    with open(dest_dir + sub_file_path, "w") as fw:
        with open(sub_file_path, "r") as fr:
            line = fr.readline()
            while line:
                num += 1
                output += line
                if num >= 2000:
                    fw.write(output)
                    output = ""
                    num = 0
                line = fr.readline()
        fw.write(output)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Reopened] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reopened HDDS-2356:

  Assignee: (was: Bharat Viswanadham)

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960870#comment-16960870
 ] 

Li Cheng commented on HDDS-2356:


I'm taking this Jira to track the issues that appear to be related to multipart upload in 
my testing. Reopening it.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960866#comment-16960866
 ] 

Li Cheng commented on HDDS-2356:


MISMATCH_MULTIPART_LIST seems to be a recurring error; the upload never manages to finish.

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2322) DoubleBuffer flush termination and OM shutdown's after that.

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960802#comment-16960802
 ] 

Li Cheng commented on HDDS-2322:


https://issues.apache.org/jira/browse/HDDS-2356 is still hitting issues. Do you mean to 
track them here? [~bharat]

> DoubleBuffer flush termination and OM shutdown's after that.
> 
>
> Key: HDDS-2322
> URL: https://issues.apache.org/jira/browse/HDDS-2322
> Project: Hadoop Distributed Data Store
>  Issue Type: Task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> om1_1       | 2019-10-18 00:34:45,317 [OMDoubleBufferFlushThread] ERROR      
> - Terminating with exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> om1_1       | java.util.ConcurrentModificationException
> om1_1       | at 
> java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1660)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyLocationInfoGroup.getProtobuf(OmKeyLocationInfoGroup.java:65)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
> om1_1       | at 
> java.base/java.util.Collections$2.tryAdvance(Collections.java:4745)
> om1_1       | at 
> java.base/java.util.Collections$2.forEachRemaining(Collections.java:4753)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
> om1_1       | at 
> java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
> om1_1       | at 
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
> om1_1       | at 
> java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
> om1_1       | at 
> org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getProtobuf(OmKeyInfo.java:362)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:37)
> om1_1       | at 
> org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.toPersistedFormat(OmKeyInfoCodec.java:31)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
> om1_1       | at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
> om1_1       | at 
> org.apache.hadoop.ozone.om.response.key.OMKeyCreateResponse.addToDBBatch(OMKeyCreateResponse.java:58)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:139)
> om1_1       | at 
> java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
> om1_1       | at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:137)
> om1_1       | at java.base/java.lang.Thread.run(Thread.java:834)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960799#comment-16960799
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] In terms of reproduction: I have a dataset with both small and large files, and 
I use the Ozone S3 gateway plus goofys to mount the Ozone cluster at a local path. All 
the data is written recursively to the mount path, which ultimately lands in the Ozone 
cluster. The cluster is deployed on a 3-node VM environment, and each VM has only one 
disk for Ozone data. I think it's a fairly simple scenario to reproduce; the only 
operation is writing to the Ozone cluster through FUSE, roughly as sketched below.
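
A minimal sketch of that write pattern, assuming a local dataset directory and a goofys 
mount point (both paths are placeholders; the actual run used the Python snippet posted 
earlier in this thread):
{code:python}
# Sketch only: recursively copy a mixed-size dataset into the goofys mount.
import os
import shutil

SRC_DIR = "/data/dataset"  # placeholder: local dataset (0-byte up to GB-sized files)
MOUNT_DIR = "/mnt/ozone"   # placeholder: goofys mount backed by the S3 gateway

for root, _dirs, files in os.walk(SRC_DIR):
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(MOUNT_DIR, os.path.relpath(src, SRC_DIR))
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        # every file streams through FUSE, the S3 gateway and on into Ozone
        shutil.copyfile(src, dst)
{code}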

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a fuse and enable ozone S3 gateway to mount ozone to a path 
> on VM0, while reading data from VM0 local disk and write to mount path. The 
> dataset has various sizes of files from 0 byte to GB-level and it has a 
> number of ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related with Multipart upload. This error eventually causes the  writing to 
> terminate and OM to be closed. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-28 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960796#comment-16960796
 ] 

Li Cheng commented on HDDS-2356:


Also, the s3g logs print the same pipeline ID over and over. I wonder whether that's 
expected. [~bharat]

2019-10-28 11:43:08,912 [qtp1383524016-24] INFO - Allocating block with 
ExcludeList \{datanodes = [], containerIds = [], pipelineIds = []}
...skipping...
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a, 
... (the same PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a repeated dozens of times) ...
PipelineID=3c94d3f5-3c0e-4994-9c63-dc487071be1a

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-27 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16960744#comment-16960744
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] Another error shows up. See the stack:

 

2019-10-28 11:44:34,079 [qtp1383524016-70] ERROR - Error in Complete Multipart 
Upload Request for bucket: ozone-test, key: 20191012/plc_1570863541668_927
8
MISMATCH_MULTIPART_LIST org.apache.hadoop.ozone.om.exceptions.OMException: 
Complete Multipart Upload Failed: volume: 
s3c89e813c80ffcea9543004d57b2a1239bucket:
 ozone-testkey: 20191012/plc_1570863541668_9278
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.handleError(OzoneManagerProtocolClientSideTranslatorPB.java:732)
 at 
org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.completeMultipartUpload(OzoneManagerProtocolClientSideTranslatorPB
.java:1104)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at org.apache.hadoop.hdds.tracing.TraceAllMethod.invoke(TraceAllMethod.java:66)
 at com.sun.proxy.$Proxy82.completeMultipartUpload(Unknown Source)
 at 
org.apache.hadoop.ozone.client.rpc.RpcClient.completeMultipartUpload(RpcClient.java:883)
 at 
org.apache.hadoop.ozone.client.OzoneBucket.completeMultipartUpload(OzoneBucket.java:445)
 at 
org.apache.hadoop.ozone.s3.endpoint.ObjectEndpoint.completeMultipartUpload(ObjectEndpoint.java:498)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at 
org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory.lambda$static$0(ResourceMethodInvocationHandlerFactory.java:76)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:148)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:191)
 at 
org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:200)
 at 
org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:103)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:493)

at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:415)
 at 
org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:104)
 at org.glassfish.jersey.server.ServerRuntime$1.run(ServerRuntime.java:277)
 at org.glassfish.jersey.internal.Errors$1.call(Errors.java:272)
 at org.glassfish.jersey.internal.Errors$1.call(Errors.java:268)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:316)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:298)
 at org.glassfish.jersey.internal.Errors.process(Errors.java:268)
 at 
org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:289)
 at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:256)
 at 
org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:703)
 at org.glassfish.jersey.servlet.WebComponent.serviceImpl(WebComponent.java:416)
 at org.glassfish.jersey.servlet.WebComponent.service(WebComponent.java:370)
 at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:389)
 at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:342)
 at 
org.glassfish.jersey.servlet.ServletContainer.service(ServletContainer.java:229)
 at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:840)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1780)
 at 
org.apache.hadoop.http.HttpServer2$QuotingInputFilter.doFilter(HttpServer2.java:1609)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
 at org.apache.hadoop.http.NoCacheFilter.doFilter(NoCacheFilter.java:45)
 at 
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767)
 at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:583)
 at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
 at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548)
 at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)

at 

[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-25 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959468#comment-16959468
 ] 

Li Cheng commented on HDDS-2356:


[~bharat] Yeah, you are right. It does happen randomly; I'm seeing it again. When 
will HDDS-2322 be merged into master?

> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to 
> a path on VM0, reading data from the VM0 local disk and writing to the mount 
> path. The dataset has files of various sizes, from 0 bytes up to GB-level, and 
> contains ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-24 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16959407#comment-16959407
 ] 

Li Cheng commented on HDDS-2356:


Quick update. I tried giving Ozone more handlers (roughly 10+ times more) and no 
longer see this error. See the attached properties. However, writing now fails 
because no more blocks are allocated. I guess my cluster cannot keep up with 
the writing. 

 


<property>
  <name>ozone.scm.handler.count.key</name>
  <value>128</value>
  <tag>OZONE, MANAGEMENT, PERFORMANCE</tag>
  <description>
    The number of RPC handler threads for each SCM service endpoint.

    The default is appropriate for small clusters (tens of nodes).

    Set a value that is appropriate for the cluster size. Generally, HDFS
    recommends RPC handler count is set to 20 * log2(Cluster Size) with an
    upper limit of 200. However, SCM will not have the same amount of
    traffic as Namenode, so a value much smaller than that will work well too.
  </description>
</property>
<property>
  <name>ozone.om.handler.count.key</name>
  <value>256</value>
  <tag>OM, PERFORMANCE</tag>
  <description>
    The number of RPC handler threads for OM service endpoints.
  </description>
</property>
<property>
  <name>dfs.container.ratis.num.container.op.executors</name>
  <value>128</value>
  <tag>OZONE, RATIS, PERFORMANCE</tag>
  <description>Number of executors that will be used by Ratis to execute
    container ops (10 by default).
  </description>
</property>
<property>
  <name>dfs.container.ratis.num.write.chunk.threads</name>
  <value>512</value>
  <tag>OZONE, RATIS, PERFORMANCE</tag>
  <description>Maximum number of threads in the thread pool that Ratis
    will use for writing chunks (60 by default).
  </description>
</property>

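As a rough sanity check of the 20 * log2(Cluster Size) rule quoted in the first 
property description above, here is an illustrative sketch (the class and method 
names are made up for this example; the rule is only a guideline, and SCM needs 
far fewer handlers than a Namenode):

// Rough sketch: estimate an RPC handler count from the
// 20 * log2(clusterSize) rule of thumb, capped at 200.
public class HandlerCountEstimate {

  static int suggestedHandlerCount(int clusterSize) {
    double log2 = Math.log(clusterSize) / Math.log(2);
    return (int) Math.min(200, Math.ceil(20 * log2));
  }

  public static void main(String[] args) {
    // A small 4-VM cluster comes out around 40, so the 128/256 values
    // above are already well beyond the guideline.
    System.out.println(suggestedHandlerCount(4));    // ~40
    System.out.println(suggestedHandlerCount(100));  // ~133
  }
}
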
> Multipart upload report errors while writing to ozone Ratis pipeline
> 
>
> Key: HDDS-2356
> URL: https://issues.apache.org/jira/browse/HDDS-2356
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Affects Versions: 0.4.1
> Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
> on a separate VM
>Reporter: Li Cheng
>Assignee: Bharat Viswanadham
>Priority: Blocker
> Fix For: 0.5.0
>
>
> Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
> it's VM0.
> I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to 
> a path on VM0, reading data from the VM0 local disk and writing to the mount 
> path. The dataset has files of various sizes, from 0 bytes up to GB-level, and 
> contains ~50,000 files. 
> The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I 
> look at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors 
> related to Multipart upload. This error eventually causes the writing to 
> terminate and the OM to shut down. 
>  
> 2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
> exit status 2: OMDoubleBuffer flush 
> threadOMDoubleBufferFlushThreadencountered Throwable error
> java.util.ConcurrentModificationException
>  at java.util.TreeMap.forEach(TreeMap.java:1004)
>  at 
> org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
>  at 
> org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
>  at 
> org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
>  at 
> org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
>  at 
> org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
>  at java.util.Iterator.forEachRemaining(Iterator.java:116)
>  at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
>  at java.lang.Thread.run(Thread.java:745)
> 2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2356) Multipart upload report errors while writing to ozone Ratis pipeline

2019-10-24 Thread Li Cheng (Jira)
Li Cheng created HDDS-2356:
--

 Summary: Multipart upload report errors while writing to ozone 
Ratis pipeline
 Key: HDDS-2356
 URL: https://issues.apache.org/jira/browse/HDDS-2356
 Project: Hadoop Distributed Data Store
  Issue Type: Bug
  Components: Ozone Manager
Affects Versions: 0.4.1
 Environment: Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM 
on a separate VM
Reporter: Li Cheng


Env: 4 VMs in total: 3 Datanodes on 3 VMs, 1 OM & 1 SCM on a separate VM, say 
it's VM0.

I use goofys as a FUSE client and enable the Ozone S3 gateway to mount Ozone to a 
path on VM0, reading data from the VM0 local disk and writing to the mount path. 
The dataset has files of various sizes, from 0 bytes up to GB-level, and contains 
~50,000 files. 

The writing is slow (1GB for ~10 mins) and it stops after around 4GB. As I look 
at the hadoop-root-om-VM_50_210_centos.out log, I see OM throwing errors related 
to Multipart upload. This error eventually causes the writing to terminate and 
the OM to shut down. 

 

2019-10-24 16:01:59,527 [OMDoubleBufferFlushThread] ERROR - Terminating with 
exit status 2: OMDoubleBuffer flush threadOMDoubleBufferFlushThreadencountered 
Throwable error
java.util.ConcurrentModificationException
 at java.util.TreeMap.forEach(TreeMap.java:1004)
 at 
org.apache.hadoop.ozone.om.helpers.OmMultipartKeyInfo.getProto(OmMultipartKeyInfo.java:111)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:38)
 at 
org.apache.hadoop.ozone.om.codec.OmMultipartKeyInfoCodec.toPersistedFormat(OmMultipartKeyInfoCodec.java:31)
 at 
org.apache.hadoop.hdds.utils.db.CodecRegistry.asRawData(CodecRegistry.java:68)
 at org.apache.hadoop.hdds.utils.db.TypedTable.putWithBatch(TypedTable.java:125)
 at 
org.apache.hadoop.ozone.om.response.s3.multipart.S3MultipartUploadCommitPartResponse.addToDBBatch(S3MultipartUploadCommitPartResponse.java:112)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.lambda$flushTransactions$0(OzoneManagerDoubleBuffer.java:137)
 at java.util.Iterator.forEachRemaining(Iterator.java:116)
 at 
org.apache.hadoop.ozone.om.ratis.OzoneManagerDoubleBuffer.flushTransactions(OzoneManagerDoubleBuffer.java:135)
 at java.lang.Thread.run(Thread.java:745)
2019-10-24 16:01:59,629 [shutdown-hook-0] INFO - SHUTDOWN_MSG:
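
The top frames above (TreeMap.forEach called from OmMultipartKeyInfo.getProto) are 
the classic symptom of one thread iterating a TreeMap while another thread is 
modifying it. A minimal, self-contained sketch of that failure mode (illustrative 
only, not the actual OM code path):

import java.util.TreeMap;

// One thread iterates a TreeMap with forEach() while another thread keeps
// inserting into the same map; forEach() detects the structural change and
// will usually throw ConcurrentModificationException, as in the stack above.
public class TreeMapCmeDemo {
  public static void main(String[] args) throws Exception {
    final TreeMap<Integer, String> parts = new TreeMap<>();
    for (int i = 0; i < 1_000_000; i++) {
      parts.put(i, "part-" + i);
    }

    Thread writer = new Thread(() -> {
      for (int i = 1_000_000; i < 2_000_000; i++) {
        parts.put(i, "part-" + i);   // concurrent structural modification
      }
    });
    writer.start();

    // Comparable to serializing the multipart key's part list while
    // commit-part responses keep updating it.
    parts.forEach((k, v) -> { });    // usually throws ConcurrentModificationException
    writer.join();
  }
}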



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1576) Ensure constraint of one raft log per disk is met unless fast media

2019-10-21 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1576:
--

Assignee: Li Cheng

> Ensure constraint of one raft log per disk is met unless fast media
> ---
>
> Key: HDDS-1576
> URL: https://issues.apache.org/jira/browse/HDDS-1576
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone Datanode, SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> SCM should not try to create a raft group by placing the raft log on a disk 
> that is already used by existing Ratis ring for an open pipeline.
> This constraint would have to be applied by either throwing an exception 
> during pipeline creation or by looking at configs on the SCM side.
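
For illustration, a rough sketch of such a check with made-up helper types (not 
the actual SCM classes): before forming a new Ratis group, reject any candidate 
datanode whose proposed raft-log disk already hosts a raft log for an open 
pipeline.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: the raftLogDisksInUse map is a hypothetical
// stand-in for state SCM would track from pipeline reports or configs.
class RaftLogDiskConstraint {

  // datanode id -> disks that already host a raft log for an open pipeline
  private final Map<String, Set<String>> raftLogDisksInUse;

  RaftLogDiskConstraint(Map<String, Set<String>> raftLogDisksInUse) {
    this.raftLogDisksInUse = raftLogDisksInUse;
  }

  /** Throws if any datanode would place a second raft log on a busy disk. */
  void validate(List<String> datanodes, Map<String, String> proposedDiskPerNode) {
    for (String dn : datanodes) {
      String disk = proposedDiskPerNode.get(dn);
      Set<String> busy = raftLogDisksInUse.getOrDefault(dn, Collections.<String>emptySet());
      if (busy.contains(disk)) {
        throw new IllegalStateException("Raft log disk " + disk + " on " + dn
            + " is already used by an open pipeline");
      }
    }
  }
}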



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1574) Ensure same datanodes are not a part of multiple pipelines

2019-10-21 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955811#comment-16955811
 ] 

Li Cheng commented on HDDS-1574:


[~swagle] Could you elaborate a bit more to help me understand this task?

Do you mean we don't want 2 pipelines to share exactly the same group of 
datanodes as members?

> Ensure same datanodes are not a part of multiple pipelines
> --
>
> Key: HDDS-1574
> URL: https://issues.apache.org/jira/browse/HDDS-1574
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> Details in design doc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1570) Refactor heartbeat reports to report all the pipelines that are open

2019-10-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953672#comment-16953672
 ] 

Li Cheng commented on HDDS-1570:


As I look into this JIRA, I would like to discuss a few points about it:

1. the SCMHeartbeatRequestProto has included a list of PipelineReport 
([https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/proto/StorageContainerDatanodeProtocol.proto#L128])

2. HeartbeatEndpointTask is adding all pipeline reports into every heartbeat 
request 
([https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/states/endpoint/HeartbeatEndpointTask.java#L182])

3. The XceiverServerRatis seems to handle all raft groups together, and every 
group has one pipelineId. Hence, every Xceiver server already has all the 
pipeline reports. 
([https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/common/transport/server/ratis/XceiverServerRatis.java#L599])
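
For illustration, a simplified sketch of the flow in points 1-3 (the types below 
are made-up stand-ins, not the real protobuf-generated classes): the datanode 
builds one report per raft group it serves and ships the whole list in a single 
heartbeat.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch: a datanode-side Ratis server knows every raft group
// (= pipeline) it is a member of, so one heartbeat can carry a report per group.
class HeartbeatSketch {

  static class PipelineReport {
    final String pipelineId;
    PipelineReport(String pipelineId) { this.pipelineId = pipelineId; }
  }

  static class Heartbeat {
    final String datanodeId;
    final List<PipelineReport> pipelineReports;
    Heartbeat(String datanodeId, List<PipelineReport> pipelineReports) {
      this.datanodeId = datanodeId;
      this.pipelineReports = pipelineReports;
    }
  }

  static Heartbeat buildHeartbeat(String datanodeId, List<String> raftGroupIds) {
    List<PipelineReport> reports = new ArrayList<>();
    for (String groupId : raftGroupIds) {
      reports.add(new PipelineReport(groupId));   // one report per open raft group
    }
    return new Heartbeat(datanodeId, reports);
  }
}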

So it seems that the heartbeat already considers a whole batch of pipelineIds on 
both the sender and receiver sides. Do we have to make any changes to heartbeat 
reports then?

Also, what's the benefit of only reporting open pipelines?

 

[~swagle] [~sammichen] [~xyao]

 

> Refactor heartbeat reports to report all the pipelines that are open
> 
>
> Key: HDDS-1570
> URL: https://issues.apache.org/jira/browse/HDDS-1570
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone Datanode
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> Presently the pipeline report only reports a single pipeline id.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1569) Add ability to SCM for creating multiple pipelines with same datanode

2019-10-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953570#comment-16953570
 ] 

Li Cheng commented on HDDS-1569:


PR [https://github.com/apache/hadoop-ozone/pull/28] is under review.

> Add ability to SCM for creating multiple pipelines with same datanode
> -
>
> Key: HDDS-1569
> URL: https://issues.apache.org/jira/browse/HDDS-1569
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 8h 10m
>  Remaining Estimate: 0h
>
> - Refactor _RatisPipelineProvider.create()_ to be able to create pipelines 
> with datanodes that are not a part of sufficient pipelines
> - Define soft and hard upper bounds for pipeline membership
> - Create SCMAllocationManager that can be leveraged to get a candidate set of 
> datanodes based on placement policies
> - Add the datanodes to internal datastructures
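
For illustration, a rough sketch of the candidate selection implied by the 
soft/hard membership bounds above (hypothetical names, not the actual 
SCMAllocationManager API): prefer datanodes under the soft bound, and never pick 
ones at or over the hard bound.

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only: the limits and the per-datanode pipeline counts
// are hypothetical stand-ins for SCM state.
class PipelineCandidateSelector {

  private final int softLimit;
  private final int hardLimit;

  PipelineCandidateSelector(int softLimit, int hardLimit) {
    this.softLimit = softLimit;
    this.hardLimit = hardLimit;
  }

  List<String> selectCandidates(Map<String, Integer> pipelineCountByDatanode) {
    // Prefer datanodes still under the soft bound...
    List<String> preferred = pipelineCountByDatanode.entrySet().stream()
        .filter(e -> e.getValue() < softLimit)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
    if (!preferred.isEmpty()) {
      return preferred;
    }
    // ...otherwise fall back to anything still under the hard bound.
    return pipelineCountByDatanode.entrySet().stream()
        .filter(e -> e.getValue() < hardLimit)
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }
}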



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Issue Comment Deleted] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-17 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Comment: was deleted

(was: [https://github.com/apache/hadoop/pull/1431])

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: 0.5.0
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Issue Comment Deleted] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-17 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Comment: was deleted

(was: [https://github.com/apache/hadoop/pull/1431])

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: 0.5.0
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16953568#comment-16953568
 ] 

Li Cheng commented on HDDS-2186:


Trying to resolve this in [https://github.com/apache/hadoop-ozone/pull/28]

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: 0.5.0
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> 

[jira] [Work started] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on HDDS-2186 started by Li Cheng.
--
> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949933#comment-16949933
 ] 

Li Cheng commented on HDDS-2186:


[https://github.com/apache/hadoop/pull/1431]

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949932#comment-16949932
 ] 

Li Cheng commented on HDDS-2186:


[https://github.com/apache/hadoop/pull/1431]

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)

[jira] [Updated] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Fix Version/s: HDDS-1564

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Updated] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Component/s: test

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
> Fix For: HDDS-1564
>
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-12 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16949930#comment-16949930
 ] 

Li Cheng commented on HDDS-2186:


It turns out that MiniOzoneCluster running out of memory is triggered by endless 
pipeline creation. Added logic to restrict endless pipeline creation in 
[https://github.com/apache/hadoop/pull/1431]. 
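
For illustration only, here is a minimal sketch of the kind of guard such a change 
needs: stop creating pipelines once the healthy datanodes have all reached a 
per-datanode pipeline limit. The class, the DatanodeView interface and the 
pipelineCount() helper are hypothetical names for this sketch, not the code in the 
actual PR:

{code:java}
import java.util.List;

// Sketch: refuse to create another pipeline unless enough healthy datanodes
// are still below an assumed per-datanode pipeline limit.
final class PipelineCreationGuard {

  private final int maxPipelinesPerDatanode;

  PipelineCreationGuard(int maxPipelinesPerDatanode) {
    this.maxPipelinesPerDatanode = maxPipelinesPerDatanode;
  }

  /** Minimal stand-in for the per-datanode view NodeManager would provide. */
  interface DatanodeView {
    int pipelineCount();
  }

  /** True only if at least nodesRequired datanodes can still join one more pipeline. */
  boolean canCreatePipeline(List<? extends DatanodeView> healthyNodes, int nodesRequired) {
    long available = healthyNodes.stream()
        .filter(dn -> dn.pipelineCount() < maxPipelinesPerDatanode)
        .count();
    return available >= nodesRequired;
  }
}
{code}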

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> 

[jira] [Assigned] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-10-11 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-2186:
--

Assignee: Li Cheng

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>  Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
>  at 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-09-27 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16939296#comment-16939296
 ] 

Li Cheng commented on HDDS-2186:


[~ljain] After some investigation, it turned out that MiniOzoneCluster is abusing 
resources to create pipelines. The reason it didn't have problems before is that 
every datanode could only be assigned to one pipeline, so the quota ran out fast. 
Now that limit is gone and there is no effective bound to stop the cluster from 
creating pipelines until ratis reports that a resource such as memory is 
exhausted. I'm adding logic to prevent this, but unfortunately factor ONE and 
factor THREE pipelines need to be handled differently, so the logic grows more 
and more complex. 
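
To make the factor-specific handling concrete, here is a rough sketch under assumed 
rules (factor ONE pipelines left uncapped because they are single-datanode raft 
groups, factor THREE pipelines capped per datanode). The class name, field and 
threshold are illustrative only, not the logic that eventually landed:

{code:java}
// Sketch: a per-datanode cap that treats replication factors differently.
enum ReplicationFactor { ONE, THREE }

final class FactorAwarePipelineLimit {

  // Assumed cap on how many factor-THREE pipelines one datanode may join.
  private final int factorThreeLimit;

  FactorAwarePipelineLimit(int factorThreeLimit) {
    this.factorThreeLimit = factorThreeLimit;
  }

  boolean mayJoinNewPipeline(ReplicationFactor factor, int currentFactorThreeCount) {
    if (factor == ReplicationFactor.ONE) {
      // Single-datanode groups are cheap in this sketch, so they are not counted.
      return true;
    }
    return currentFactorThreeCount < factorThreeLimit;
  }
}
{code}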

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Priority: Major
>  Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> 

[jira] [Commented] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-09-26 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16938670#comment-16938670
 ] 

Li Cheng commented on HDDS-2186:


Note that CI is also affected and it cannot print out the correct output due to 
memory issues.

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Priority: Major
>  Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> 

[jira] [Updated] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-09-26 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Labels: flaky-test  (was: )

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Priority: Major
>  Labels: flaky-test
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
>  at 

[jira] [Updated] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-09-26 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng updated HDDS-2186:
---
Affects Version/s: HDDS-1564

> Fix tests using MiniOzoneCluster for its memory related exceptions
> --
>
> Key: HDDS-2186
> URL: https://issues.apache.org/jira/browse/HDDS-2186
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Affects Versions: HDDS-1564
>Reporter: Li Cheng
>Priority: Major
>
> After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a 
> bunch of 'out of memory' exceptions in ratis. Attached sample stacks.
>  
> 2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exception2019-09-26 15:12:22,824 
> [2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
>  ERROR segmented.SegmentedRaftLogWorker 
> (SegmentedRaftLogWorker.java:run(323)) - 
> 2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker
>  hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
> java.nio.Bits.reserveMemory(Bits.java:694) at 
> java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
> java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
> org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
>  at 
> org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
>  at java.lang.Thread.run(Thread.java:748)
>  
> which leads to:
> 2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
> pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
> [RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
> (RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
>  for 
> c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
>  deadline exceeded after 2999881264ns at 
> org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
> org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
>  at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
>  at 
> org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
> at 
> org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
>  at 
> org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
>  at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
>  at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) 
> at 
> java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382)
>  at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
> java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
> java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
> java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
> java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
> java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) 
> at 
> java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
>  at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
> java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
> java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) 
> at 
> org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
>  at 
> java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
>  at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
> 

[jira] [Created] (HDDS-2186) Fix tests using MiniOzoneCluster for its memory related exceptions

2019-09-26 Thread Li Cheng (Jira)
Li Cheng created HDDS-2186:
--

 Summary: Fix tests using MiniOzoneCluster for its memory related 
exceptions
 Key: HDDS-2186
 URL: https://issues.apache.org/jira/browse/HDDS-2186
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
Reporter: Li Cheng


After multi-raft usage, MiniOzoneCluster seems to be fishy and reports a bunch 
of 'out of memory' exceptions in ratis. Attached sample stacks.

 

2019-09-26 15:12:22,824 
[2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
 ERROR segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:run(323)) 
- 
2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker 
hit exception2019-09-26 15:12:22,824 
[2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker]
 ERROR segmented.SegmentedRaftLogWorker (SegmentedRaftLogWorker.java:run(323)) 
- 
2e1e11ca-833a-4fbc-b948-3d93fc8e7288@group-218F3868CEA9-SegmentedRaftLogWorker 
hit exceptionjava.lang.OutOfMemoryError: Direct buffer memory at 
java.nio.Bits.reserveMemory(Bits.java:694) at 
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123) at 
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311) at 
org.apache.ratis.server.raftlog.segmented.BufferedWriteChannel.<init>(BufferedWriteChannel.java:41)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogOutputStream.<init>(SegmentedRaftLogOutputStream.java:72)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker$StartLogSegment.execute(SegmentedRaftLogWorker.java:566)
 at 
org.apache.ratis.server.raftlog.segmented.SegmentedRaftLogWorker.run(SegmentedRaftLogWorker.java:289)
 at java.lang.Thread.run(Thread.java:748)

 

which leads to:

2019-09-26 15:12:23,029 [RATISCREATEPIPELINE1] ERROR 
pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$null$2(181)) 
- Failed invoke Ratis rpc 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
 for c1f4d375-683b-42fe-983b-428a63aa88032019-09-26 15:12:23,029 
[RATISCREATEPIPELINE1] ERROR pipeline.RatisPipelineProvider 
(RatisPipelineProvider.java:lambda$null$2(181)) - Failed invoke Ratis rpc 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider$$Lambda$297/1222454951@55d1e990
 for 
c1f4d375-683b-42fe-983b-428a63aa8803org.apache.ratis.protocol.TimeoutIOException:
 deadline exceeded after 2999881264ns at 
org.apache.ratis.grpc.GrpcUtil.tryUnwrapException(GrpcUtil.java:82) at 
org.apache.ratis.grpc.GrpcUtil.unwrapException(GrpcUtil.java:75) at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.blockingCall(GrpcClientProtocolClient.java:178)
 at 
org.apache.ratis.grpc.client.GrpcClientProtocolClient.groupAdd(GrpcClientProtocolClient.java:147)
 at 
org.apache.ratis.grpc.client.GrpcClientRpc.sendRequest(GrpcClientRpc.java:94) 
at 
org.apache.ratis.client.impl.RaftClientImpl.sendRequest(RaftClientImpl.java:278)
 at 
org.apache.ratis.client.impl.RaftClientImpl.groupAdd(RaftClientImpl.java:205) 
at 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$initializePipeline$1(RatisPipelineProvider.java:142)
 at 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$null$2(RatisPipelineProvider.java:177)
 at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184) at 
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1382) 
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) at 
java.util.stream.ForEachOps$ForEachTask.compute(ForEachOps.java:291) at 
java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731) at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
java.util.concurrent.ForkJoinTask.doInvoke(ForkJoinTask.java:401) at 
java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:734) at 
java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:160) at 
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:174)
 at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233) at 
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) at 
java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:583) at 
org.apache.hadoop.hdds.scm.pipeline.RatisPipelineProvider.lambda$callRatisRpc$3(RatisPipelineProvider.java:171)
 at 
java.util.concurrent.ForkJoinTask$AdaptedRunnableAction.exec(ForkJoinTask.java:1386)
 at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) at 
java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) at 
java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) at 
java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)Caused
 by: org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: 
DEADLINE_EXCEEDED: deadline exceeded after 2999881264ns at 

[jira] [Assigned] (HDDS-1574) Ensure same datanodes are not a part of multiple pipelines

2019-09-19 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1574:
--

Assignee: Li Cheng  (was: Siddharth Wagle)

> Ensure same datanodes are not a part of multiple pipelines
> --
>
> Key: HDDS-1574
> URL: https://issues.apache.org/jira/browse/HDDS-1574
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> Details in design doc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1570) Refactor heartbeat reports to report all the pipelines that are open

2019-09-19 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1570:
--

Assignee: Li Cheng  (was: Siddharth Wagle)

> Refactor heartbeat reports to report all the pipelines that are open
> 
>
> Key: HDDS-1570
> URL: https://issues.apache.org/jira/browse/HDDS-1570
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: Ozone Datanode
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> Presently the pipeline report only reports a single pipeline id.
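
For illustration, a hedged sketch of the refactoring direction: a report object that 
carries every open pipeline instead of a single id. The class and field names are 
hypothetical, not the actual heartbeat protobuf messages:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.UUID;

// Sketch: a heartbeat-style report listing all open pipelines on one datanode.
final class OpenPipelinesReport {

  private final UUID datanodeId;
  private final List<UUID> openPipelineIds = new ArrayList<>();

  OpenPipelinesReport(UUID datanodeId) {
    this.datanodeId = datanodeId;
  }

  void addOpenPipeline(UUID pipelineId) {
    openPipelineIds.add(pipelineId);
  }

  UUID getDatanodeId() {
    return datanodeId;
  }

  List<UUID> getOpenPipelineIds() {
    return Collections.unmodifiableList(openPipelineIds);
  }
}
{code}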



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1933) Datanode should use hostname in place of ip addresses to allow DN's to work when ipaddress change

2019-09-19 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933136#comment-16933136
 ] 

Li Cheng commented on HDDS-1933:


After discussion, we think the change can remain as it is now. [~msingh], please 
try setting DFS_DATANODE_USE_DN_HOSTNAME_DEFAULT = true in the Kubernetes env and 
see whether it works. 
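
For reference, a minimal sketch of toggling that behaviour in a test or deployment 
configuration. It assumes the constant resolves to the usual HDFS-style key 
dfs.datanode.use.datanode.hostname; double-check the actual config-keys constant 
before relying on it:

{code:java}
import org.apache.hadoop.conf.Configuration;

public final class UseHostnameConfigSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Assumed key behind DFS_DATANODE_USE_DN_HOSTNAME; verify against the real constant.
    conf.setBoolean("dfs.datanode.use.datanode.hostname", true);
    System.out.println("use datanode hostname = "
        + conf.getBoolean("dfs.datanode.use.datanode.hostname", false));
  }
}
{code}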

> Datanode should use hostname in place of ip addresses to allow DN's to work 
> when ipaddress change
> -
>
> Key: HDDS-1933
> URL: https://issues.apache.org/jira/browse/HDDS-1933
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode, SCM
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Priority: Blocker
>
> This was noticed by [~elek] while deploying Ozone on Kubernetes based 
> environment.
> When the datanode IP address changes on restart, the Datanode details cease to 
> be correct for the datanode, and this prevents the cluster from functioning 
> after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1933) Datanode should use hostname in place of ip addresses to allow DN's to work when ipaddress change

2019-09-18 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932172#comment-16932172
 ] 

Li Cheng commented on HDDS-1933:


[~swagle] I was talking with [~Sammi] about this. How do we decide whether to 
use IP or hostname? Or rather, in which environments should we use one or the 
other? Currently users can configure it in the configuration XML; shall we keep 
it this way?

> Datanode should use hostname in place of ip addresses to allow DN's to work 
> when ipaddress change
> -
>
> Key: HDDS-1933
> URL: https://issues.apache.org/jira/browse/HDDS-1933
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode, SCM
>Affects Versions: 0.4.0
>Reporter: Mukul Kumar Singh
>Priority: Blocker
>
> This was noticed by [~elek] while deploying Ozone on Kubernetes based 
> environment.
> When the datanode IP address changes on restart, the Datanode details cease to 
> be correct for the datanode, and this prevents the cluster from functioning 
> after a restart.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2116) Create SCMPipelineAllocationManager as background thread for pipeline creation

2019-09-18 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16932146#comment-16932146
 ] 

Li Cheng commented on HDDS-2116:


[~swagle] I think for now that role is taken by PipelinePlacementPolicy, which 
is instantiated with the config as well. It retrieves viable datanodes from 
NodeManager and applies complex filter rules, which is much like what the 
AllocationManager should do.

> Create SCMPipelineAllocationManager as background thread for pipeline creation
> --
>
> Key: HDDS-2116
> URL: https://issues.apache.org/jira/browse/HDDS-2116
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Li Cheng
>Priority: Major
>
> SCMAllocationManager can be leveraged to get a candidate set of datanodes 
> based on placement policies, and it should make the pipeline creation process 
> async and multi-threaded. This should be done when we encounter a performance 
> bottleneck in pipeline creation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-2116) Create SCMAllocationManager as background thread for pipeline creation

2019-09-17 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931329#comment-16931329
 ] 

Li Cheng commented on HDDS-2116:


I agree on the multi-thread part. [~swagle], what other functions do you think 
SCMAllocationManager should have? Or is SCMAllocationManager still needed at all?

> Create SCMAllocationManager as background thread for pipeline creation
> --
>
> Key: HDDS-2116
> URL: https://issues.apache.org/jira/browse/HDDS-2116
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Li Cheng
>Priority: Major
>
> SCMAllocationManager can be leveraged to get a candidate set of datanodes 
> based on placement policies, and it should make the pipeline creation process 
> async and multi-threaded. This should be done when we encounter a performance 
> bottleneck in pipeline creation.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1569) Add ability to SCM for creating multiple pipelines with same datanode

2019-09-12 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16928362#comment-16928362
 ] 

Li Cheng commented on HDDS-1569:


I separated SCMAllocationManager into another Jira: 
https://issues.apache.org/jira/browse/HDDS-2116.

I think the current BackgroundCreator creates pipelines synchronously, and we can 
start from there. As [~swagle] mentioned, since we don't expect many ad-hoc 
pipeline creation operations for now, the current sync call from the background 
thread should be able to handle this. Once we hit a performance bottleneck for 
pipeline creation, we can invest in a SCMAllocationManager with a self-managed 
thread model.

> Add ability to SCM for creating multiple pipelines with same datanode
> -
>
> Key: HDDS-1569
> URL: https://issues.apache.org/jira/browse/HDDS-1569
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>
> - Refactor _RatisPipelineProvider.create()_ to be able to create pipelines 
> with datanodes that are not a part of sufficient pipelines
> - Define soft and hard upper bounds for pipeline membership
> - Create SCMAllocationManager that can be leveraged to get a candidate set of 
> datanodes based on placement policies
> - Add the datanodes to internal datastructures



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2116) Create SCMAllocationManager as background thread for pipeline creation

2019-09-12 Thread Li Cheng (Jira)
Li Cheng created HDDS-2116:
--

 Summary: Create SCMAllocationManager as background thread for 
pipeline creation
 Key: HDDS-2116
 URL: https://issues.apache.org/jira/browse/HDDS-2116
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
Reporter: Li Cheng


SCMAllocationManager can be leveraged to get a candidate set of datanodes based 
on placement policies, and it should make the pipeline creation process async 
and multi-threaded. This should be done when we encounter a performance 
bottleneck in pipeline creation.
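
As a sketch of the async, multi-threaded shape this describes, the following pairs a 
pluggable candidate selector (standing in for the placement policy) with a background 
executor. All names and types here are assumptions for illustration, not actual SCM 
classes:

{code:java}
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch: choose candidate datanodes, then create the pipeline off the caller's thread.
final class AsyncPipelineAllocatorSketch<N, P> {

  private final ExecutorService executor = Executors.newFixedThreadPool(4);
  private final Supplier<List<N>> candidateSelector;   // placement-policy stand-in
  private final Function<List<N>, P> pipelineCreator;  // existing synchronous create call

  AsyncPipelineAllocatorSketch(Supplier<List<N>> candidateSelector,
                               Function<List<N>, P> pipelineCreator) {
    this.candidateSelector = candidateSelector;
    this.pipelineCreator = pipelineCreator;
  }

  CompletableFuture<P> allocatePipelineAsync() {
    return CompletableFuture
        .supplyAsync(candidateSelector, executor)   // pick datanodes
        .thenApplyAsync(pipelineCreator, executor); // build the pipeline from them
  }

  void shutdown() {
    executor.shutdown();
  }
}
{code}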



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-2115) Add smoke/acceptance test for Pipeline related CLI

2019-09-12 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-2115:
--

Assignee: Li Cheng

> Add smoke/acceptance test for Pipeline related CLI 
> ---
>
> Key: HDDS-2115
> URL: https://issues.apache.org/jira/browse/HDDS-2115
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Major
>
> Currently there is no smoke/acceptance test for the createPipeline or listPipeline CLI.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2115) Add smoke/acceptance test for Pipeline related CLI

2019-09-12 Thread Li Cheng (Jira)
Li Cheng created HDDS-2115:
--

 Summary: Add smoke/acceptance test for Pipeline related CLI 
 Key: HDDS-2115
 URL: https://issues.apache.org/jira/browse/HDDS-2115
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
Reporter: Li Cheng


Currently there is no smoke/acceptance test for the createPipeline or listPipeline CLI.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-1571) Create an interface for pipeline placement policy to support network topologies

2019-09-10 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng resolved HDDS-1571.

Fix Version/s: 0.5.0
   Resolution: Fixed

> Create an interface for pipeline placement policy to support network 
> topologies
> ---
>
> Key: HDDS-1571
> URL: https://issues.apache.org/jira/browse/HDDS-1571
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Leverage the work done in HDDS-700 for pipeline creation for open containers.
> Create an interface that can provide different policy implementations for 
> pipeline creation. The default implementation should handle the case where no 
> topology information is configured.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDDS-1564) Ozone multi-raft support

2019-09-09 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng reassigned HDDS-1564:
--

Assignee: Li Cheng  (was: Siddharth Wagle)

> Ozone multi-raft support
> 
>
> Key: HDDS-1564
> URL: https://issues.apache.org/jira/browse/HDDS-1564
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Datanode, SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
> Attachments: Ozone Multi-Raft Support.pdf
>
>
> Apache Ratis supports multi-raft by allowing the same node to be a part of 
> multiple raft groups. The proposal is to allow datanodes to be a part of 
> multiple raft groups. The attached design doc explains the reasons for doing 
> this as well as a few initial design decisions. 
> Some of the work in this feature is also related to HDDS-700, which implements 
> rack-aware container placement for closed containers.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDDS-2089) Add CLI createPipeline

2019-09-05 Thread Li Cheng (Jira)
Li Cheng created HDDS-2089:
--

 Summary: Add CLI createPipeline
 Key: HDDS-2089
 URL: https://issues.apache.org/jira/browse/HDDS-2089
 Project: Hadoop Distributed Data Store
  Issue Type: Sub-task
  Components: Ozone CLI
Affects Versions: 0.5.0
Reporter: Li Cheng
Assignee: Li Cheng


Add an SCMCLI command to create a pipeline for Ozone.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDDS-1571) Create an interface for pipeline placement policy to support network topologies

2019-09-05 Thread Li Cheng (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16923227#comment-16923227
 ] 

Li Cheng commented on HDDS-1571:


Put up PlacementPolicy as the basic interface and ScmCommonPlacementPolicy as the 
abstract class shared by both pipeline placement and container placement. 
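
Roughly, that shape is a small interface with a single choose-nodes entry point plus 
an abstract base holding shared filtering. A hedged sketch with placeholder types (a 
String id stands in for DatanodeDetails); these are not the committed signatures:

{code:java}
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the shared placement abstraction described above.
interface PlacementPolicySketch {
  List<String> chooseDatanodes(List<String> excludedNodes, int nodesRequired);
}

abstract class CommonPlacementPolicySketch implements PlacementPolicySketch {

  /** Shared step: drop excluded nodes before a concrete policy makes its pick. */
  protected List<String> filterViable(List<String> healthyNodes,
                                      List<String> excludedNodes) {
    return healthyNodes.stream()
        .filter(node -> !excludedNodes.contains(node))
        .collect(Collectors.toList());
  }
}
{code}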

> Create an interface for pipeline placement policy to support network 
> topologies
> ---
>
> Key: HDDS-1571
> URL: https://issues.apache.org/jira/browse/HDDS-1571
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Leverage the work done in HDDS-700 for pipeline creation for open containers.
> Create an interface that can provide different policy implementations for 
> pipeline creation. The default implementation should handle the case where no 
> topology information is configured.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDDS-1577) Add default pipeline placement policy implementation

2019-09-04 Thread Li Cheng (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-1577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Li Cheng resolved HDDS-1577.

Fix Version/s: 0.5.0
   Resolution: Fixed

> Add default pipeline placement policy implementation
> 
>
> Key: HDDS-1577
> URL: https://issues.apache.org/jira/browse/HDDS-1577
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Reporter: Siddharth Wagle
>Assignee: Li Cheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.5.0
>
>  Time Spent: 5h 20m
>  Remaining Estimate: 0h
>
> This is a simpler implementation of the PipelinePlacementPolicy that can be 
> utilized if no network topology is defined for the cluster. We try to form 
> pipelines from existing HEALTHY datanodes randomly, as long as they satisfy 
> PipelinePlacementCriteria.
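
Conceptually the default policy reduces to sampling healthy nodes that still satisfy 
the criteria. A toy sketch under assumed types, with a String node id and a plain 
predicate standing in for PipelinePlacementCriteria:

{code:java}
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

// Sketch: shuffle the viable healthy nodes and take the first N for a pipeline.
final class RandomHealthyPlacementSketch {

  static List<String> choose(List<String> healthyNodes,
                             Predicate<String> placementCriteria,
                             int nodesRequired) {
    List<String> viable = new ArrayList<>();
    for (String node : healthyNodes) {
      if (placementCriteria.test(node)) {
        viable.add(node);
      }
    }
    if (viable.size() < nodesRequired) {
      throw new IllegalStateException("Not enough viable datanodes for a pipeline");
    }
    Collections.shuffle(viable);
    return viable.subList(0, nodesRequired);
  }
}
{code}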



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



  1   2   >