[jira] [Updated] (HDDS-3936) OM client failover ignores suggested leader info

2020-07-08 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3936:
---
Target Version/s: 0.7.0  (was: 0.6.0)

> OM client failover ignores suggested leader info
> 
>
> Key: HDDS-3936
> URL: https://issues.apache.org/jira/browse/HDDS-3936
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: OM HA
>Affects Versions: 0.6.0
>Reporter: Attila Doroszlai
>Assignee: Attila Doroszlai
>Priority: Major
>
> If OM client hits follower OM, failover is performed sequentially, ignoring 
> suggested leader info:
> {code}
> 2020-07-08 17:20:05,249 [main] DEBUG Hadoop3OmTransport:140 - RetryProxy: 
> OM:om1 is not the leader. Suggested leader is OM:om3.
> 2020-07-08 17:20:05,277 [main] DEBUG Hadoop3OmTransport:140 - RetryProxy: 
> OM:om2 is not the leader. Suggested leader is OM:om3.
> {code}
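A failover strategy that honors the hint would jump straight to the suggested leader instead of walking the OM list sequentially. A minimal sketch of that idea (the class and method names below are hypothetical, not the actual Hadoop3OmTransport API):

```java
import java.util.List;

// Hypothetical sketch: pick the next OM to try, preferring the leader
// suggested by the previous "not the leader" reply over round-robin order.
public class FailoverSelector {
    private final List<String> omNodes;
    private int nextIndex = 0;

    public FailoverSelector(List<String> omNodes) {
        this.omNodes = omNodes;
    }

    /** @param suggestedLeader leader hinted by the last rejected OM, or null */
    public String nextOm(String suggestedLeader) {
        if (suggestedLeader != null && omNodes.contains(suggestedLeader)) {
            // Honor the hint: fail over directly to the suggested leader.
            nextIndex = (omNodes.indexOf(suggestedLeader) + 1) % omNodes.size();
            return suggestedLeader;
        }
        // No hint: fall back to sequential round-robin failover.
        String om = omNodes.get(nextIndex);
        nextIndex = (nextIndex + 1) % omNodes.size();
        return om;
    }
}
```

With the log above, the second attempt would go to om3 directly instead of om2.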



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-3878) Make OMHA serviceID optional if one (but only one) is defined in the config

2020-07-08 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3878:
---
Target Version/s: 0.7.0  (was: 0.6.0)

> Make OMHA serviceID optional if one (but only one) is defined in the config 
> 
>
> Key: HDDS-3878
> URL: https://issues.apache.org/jira/browse/HDDS-3878
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>  Labels: pull-request-available
>
> om.serviceId is required in all client parameters in case of OM HA, even 
> if there is only one om.serviceId and it could be chosen automatically.
> My goals are:
>  1. Provide better usability
>  2. Simplify the documentation task ;-)
> Use the om.serviceId from the config if:
>  1. the config is available
>  2. OM HA is configured
>  3. only one service is configured
> It also makes it easier to run the same tests with/without HA
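The proposed defaulting rule can be sketched as a small resolver. This is an illustration only: plain collections stand in for the real OzoneConfiguration lookup, and the names are hypothetical:

```java
import java.util.Collection;

// Sketch of the proposed rule: if the caller did not name an OM service id,
// use the single one defined in the configuration; fail only when the
// choice is ambiguous.
public class ServiceIdResolver {
    public static String resolve(String explicitId, Collection<String> configuredIds) {
        if (explicitId != null) {
            return explicitId;                        // caller chose one explicitly
        }
        if (configuredIds.size() == 1) {
            return configuredIds.iterator().next();   // unambiguous: pick it
        }
        throw new IllegalArgumentException(
            "om.serviceId is required: " + configuredIds.size()
            + " service ids are configured");
    }
}
```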






[jira] [Updated] (HDDS-3765) Fluentd writing to secure Ozone S3 API fails with 500 Error.

2020-07-08 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3765:
---
Priority: Blocker  (was: Critical)

> Fluentd writing to secure Ozone S3 API fails with 500 Error.
> 
>
> Key: HDDS-3765
> URL: https://issues.apache.org/jira/browse/HDDS-3765
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Aravindan Vijayan
>Assignee: Aravindan Vijayan
>Priority: Blocker
>
> Error on fluentd side.
> {code}
> opened
> starting SSL for host1:9879...
> SSL established
> <- "PUT /logs-bucket-1/logs/mytag/2020/06/05//20200605_190411.gz 
> HTTP/1.1\r\nContent-Type: application/x-gzip\r\nAccept-Encoding: 
> \r\nUser-Agent: aws-sdk-ruby3/3.94.0 ruby/2.4.10 x86_64-l
> inux aws-sdk-s3/1.63.0\r\nX-Amz-Storage-Class: STANDARD\r\nExpect: 
> 100-continue\r\nContent-Md5: zGKVGGaD/U5WUK3cdWQiSA==\r\nHost: 
> host1:9879\r\nX-Amz-Content-Sha256:
>  
> 277fe97f57a1127ee1a0765979ffd3270a6c18c6f75ff6a0f843e7163a338de2\r\nContent-Length:
>  44726\r\nX-Amz-Date: 20200608T190412Z\r\nAuthorization: AWS4-HMAC-SHA256 
> Credential=h...@root.hwx.site/202
> 00608/us-east-1/s3/aws4_request, 
> SignedHeaders=content-md5;content-type;expect;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-storage-class,
>  Signature=11c1d0a43325d3f7b9d25dbd02023cef2
> 69b66f6a93fa4e1c935b52e3e372f70\r\nAccept: */*\r\n\r\n"
> -> "HTTP/1.1 100 Continue\r\n"
> -> "\r\n"
> -> "HTTP/1.1 500 Server Error\r\n"
> -> "Pragma: no-cache\r\n"
> -> "X-Content-Type-Options: nosniff\r\n"
> -> "X-FRAME-OPTIONS: SAMEORIGIN\r\n"
> -> "X-XSS-Protection: 1; mode=block\r\n"
> -> "Connection: close\r\n"
> -> "\r\n"
> reading all...
> -> ""
> {code}






[jira] [Updated] (HDDS-3932) Hide jOOQ logo message from the log output on compile

2020-07-07 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3932:
---
   Fix Version/s: (was: 0.6.0)
Target Version/s: 0.7.0  (was: 0.6.0)

> Hide jOOQ logo message from the log output on compile
> -
>
> Key: HDDS-3932
> URL: https://issues.apache.org/jira/browse/HDDS-3932
> Project: Hadoop Distributed Data Store
>  Issue Type: Wish
>  Components: Ozone Recon
>Reporter: Neo Yang
>Assignee: Neo Yang
>Priority: Minor
>  Labels: pull-request-available
>
> When Ozone Recon _(org.apache.hadoop:hadoop-ozone-recon)_ compiles, it prints 
> this self-advertising banner:
> {code:java}
> 2020-07-07 15:39:05,719 INFO  jooq.Constants (JooqLogger.java:info(338)) - 
> [ASCII-art jOOQ logo omitted]
> @@  Thank you for using jOOQ 3.11.9
> {code}






[jira] [Assigned] (HDDS-3918) ConcurrentModificationException in ContainerReportHandler.onMessage

2020-07-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-3918:
--

Assignee: Nanda kumar

> ConcurrentModificationException in ContainerReportHandler.onMessage
> ---
>
> Key: HDDS-3918
> URL: https://issues.apache.org/jira/browse/HDDS-3918
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Assignee: Nanda kumar
>Priority: Major
>
> 2020-07-03 14:51:45,489 [EventQueue-ContainerReportForContainerReportHandler] 
> ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on 
> execution message 
> org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher$ContainerReportFromDatanode@8f6e7cb
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
> at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1044)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
> at java.util.HashSet.(HashSet.java:120)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:127)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2020-07-03 14:51:45,648 [EventQueue-ContainerReportForContainerReportHandler] 
> ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on 
> execution message 
> org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher$ContainerReportFromDatanode@49d2b84b
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
> at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1044)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
> at java.util.HashSet.(HashSet.java:120)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:127)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
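The trace shows `new HashSet<>(...)` at ContainerReportHandler.java:127 copying a collection view while another thread mutates the backing HashMap; HashMap iterators are fail-fast, so the copy throws mid-iteration. One common remedy (a sketch of the failure mode and a fix, not the actual SCM patch) is to back the map with ConcurrentHashMap, whose iterators are weakly consistent and never throw ConcurrentModificationException:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: a replica map that can be safely snapshot-copied while another
// thread inserts or removes entries. The field names are illustrative.
public class ReplicaMapSketch {
    // ConcurrentHashMap iterators reflect some (possibly stale) state of
    // the map instead of failing fast like HashMap's iterators.
    private final Map<Long, String> replicas = new ConcurrentHashMap<>();

    public void addReplica(long containerId, String datanode) {
        replicas.put(containerId, datanode);
    }

    public Set<Long> snapshotIds() {
        // Safe even if addReplica runs concurrently on another thread;
        // with a plain HashMap this copy is what throws the CME above.
        return Set.copyOf(replicas.keySet());
    }
}
```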






[jira] [Commented] (HDDS-3918) ConcurrentModificationException in ContainerReportHandler.onMessage

2020-07-06 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152471#comment-17152471
 ] 

Jitendra Nath Pandey commented on HDDS-3918:


[~Sammi], is this an intermittent issue?

> ConcurrentModificationException in ContainerReportHandler.onMessage
> ---
>
> Key: HDDS-3918
> URL: https://issues.apache.org/jira/browse/HDDS-3918
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Sammi Chen
>Priority: Major
>
> 2020-07-03 14:51:45,489 [EventQueue-ContainerReportForContainerReportHandler] 
> ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on 
> execution message 
> org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher$ContainerReportFromDatanode@8f6e7cb
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
> at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1044)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
> at java.util.HashSet.(HashSet.java:120)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:127)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2020-07-03 14:51:45,648 [EventQueue-ContainerReportForContainerReportHandler] 
> ERROR org.apache.hadoop.hdds.server.events.SingleThreadExecutor: Error on 
> execution message 
> org.apache.hadoop.hdds.scm.server.SCMDatanodeHeartbeatDispatcher$ContainerReportFromDatanode@49d2b84b
> java.util.ConcurrentModificationException
> at java.util.HashMap$HashIterator.nextNode(HashMap.java:1445)
> at java.util.HashMap$KeyIterator.next(HashMap.java:1469)
> at 
> java.util.Collections$UnmodifiableCollection$1.next(Collections.java:1044)
> at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
> at java.util.HashSet.(HashSet.java:120)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:127)
> at 
> org.apache.hadoop.hdds.scm.container.ContainerReportHandler.onMessage(ContainerReportHandler.java:50)
> at 
> org.apache.hadoop.hdds.server.events.SingleThreadExecutor.lambda$onMessage$1(SingleThreadExecutor.java:81)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)






[jira] [Updated] (HDDS-3741) Reload old OM state if Install Snapshot from Leader fails

2020-07-06 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3741:
---
Status: Patch Available  (was: Open)

> Reload old OM state if Install Snapshot from Leader fails
> -
>
> Key: HDDS-3741
> URL: https://issues.apache.org/jira/browse/HDDS-3741
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: OM HA
>Reporter: Hanisha Koneru
>Assignee: Hanisha Koneru
>Priority: Critical
>  Labels: pull-request-available
>
> The follower OM pauses its services before installing a new checkpoint 
> from the leader OM (Install Snapshot). If this installation fails for some 
> reason, the OM stays in the paused state. Instead, it should be unpaused 
> and the old state should be reloaded.
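The recovery rule described above amounts to a try/finally around checkpoint installation: on failure, keep the old state, and always unpause. A toy sketch (the class, method, and field names are illustrative, not the real OzoneManager API; the boolean flag stands in for a real install failure):

```java
// Sketch: a follower that never stays wedged in the paused state when a
// checkpoint install fails; it falls back to its previous state.
public class SnapshotInstaller {
    private boolean paused = false;
    private String activeState = "old-db";

    public void installCheckpoint(String checkpoint, boolean installSucceeds) {
        paused = true;                    // pause services before swapping state
        try {
            if (!installSucceeds) {
                throw new IllegalStateException("checkpoint install failed");
            }
            activeState = checkpoint;     // swap in the new state
        } catch (IllegalStateException e) {
            // Failure path: keep (reload) the old state instead of wedging.
        } finally {
            paused = false;               // always unpause
        }
    }

    public boolean isPaused() { return paused; }
    public String activeState() { return activeState; }
}
```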






[jira] [Comment Edited] (HDDS-3851) Reduce lock contention in RaftServerImpl::replyPendingRequest

2020-06-23 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143498#comment-17143498
 ] 

Jitendra Nath Pandey edited comment on HDDS-3851 at 6/24/20, 4:10 AM:
--

cc [~szetszwo],
We probably need a Ratis jira for this.


was (Author: jnp):
cc [~szetszwo]

> Reduce lock contention in RaftServerImpl::replyPendingRequest
> -
>
> Key: HDDS-3851
> URL: https://issues.apache.org/jira/browse/HDDS-3851
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Rajesh Balamohan
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: image-2020-06-23-11-57-19-791.png
>
>
> [https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1341]
> Can be easily refactored to acquire the lock only for leader, to reduce the 
> lock contention.
> Observed this in terasort.
> !image-2020-06-23-11-57-19-791.png|width=760,height=206!
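The suggested refactor is to read the replica's role before locking, so the non-leader path replies without contending on the server lock. A sketch under that assumption (names are illustrative, not the actual RaftServerImpl members):

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: take the lock only on the leader path instead of
// unconditionally for every reply.
public class ReplySketch {
    private final AtomicBoolean isLeader = new AtomicBoolean(false);
    private final AtomicInteger repliesHandled = new AtomicInteger();
    private final Object leaderStateLock = new Object();

    public void setLeader(boolean leader) { isLeader.set(leader); }

    public void replyPendingRequest() {
        if (!isLeader.get()) {
            // Follower path: reply without contending on the leader lock.
            repliesHandled.incrementAndGet();
            return;
        }
        synchronized (leaderStateLock) {
            // Only the leader mutates pending-request state under the lock.
            repliesHandled.incrementAndGet();
        }
    }

    public int repliesHandled() { return repliesHandled.get(); }
}
```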






[jira] [Commented] (HDDS-3851) Reduce lock contention in RaftServerImpl::replyPendingRequest

2020-06-23 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143498#comment-17143498
 ] 

Jitendra Nath Pandey commented on HDDS-3851:


cc [~szetszwo]

> Reduce lock contention in RaftServerImpl::replyPendingRequest
> -
>
> Key: HDDS-3851
> URL: https://issues.apache.org/jira/browse/HDDS-3851
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Tsz-wo Sze
>Priority: Major
> Attachments: image-2020-06-23-11-57-19-791.png
>
>
> [https://github.com/apache/incubator-ratis/blob/master/ratis-server/src/main/java/org/apache/ratis/server/impl/RaftServerImpl.java#L1341]
> Can be easily refactored to acquire the lock only for leader, to reduce the 
> lock contention.
> Observed this in terasort.
> !image-2020-06-23-11-57-19-791.png|width=760,height=206!






[jira] [Commented] (HDDS-3858) Remove support to start Ozone and HDFS datanodes in the same JVM

2020-06-23 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143111#comment-17143111
 ] 

Jitendra Nath Pandey commented on HDDS-3858:


The option to run them in the same JVM was allowed in order to support a future 
in-place upgrade from HDFS to Ozone. It was assumed that having the two 
Datanodes in the same JVM would simplify the HDFS-to-Ozone upgrade.
Some design work happened in the past, but there has not been much progress 
since then. However, that is still a valid goal, IMO.
Once users are comfortable switching to Ozone completely, an in-place upgrade 
option will be desirable to avoid copying data, and will be a much faster way 
to move to Ozone.

> Remove support to start Ozone and HDFS datanodes in the same JVM
> 
>
> Key: HDDS-3858
> URL: https://issues.apache.org/jira/browse/HDDS-3858
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Major
>  Labels: pull-request-available
>
> A few thousand issues ago, Ozone was an integral part of the Hadoop/HDFS 
> project. At that time there were two options to start the datanode:
>  1. Start in a separate JVM
>  2. Start in the same JVM as HDFS
> Today only option 1 is the standard way; it is tested and working. Option 2 
> no longer works but is still documented.
> I propose to drop support for this use case, as I can't see any benefit in 
> supporting it anymore:
>  1. I think 100% of users will run Ozone as a separate process, not as an 
> HDFS Datanode plugin
>  2. Fixing the classpath issues is a significant effort, as the classpaths 
> of HDFS and Ozone have diverged.






[jira] [Comment Edited] (HDDS-3630) Merge rocksdb in datanode

2020-06-15 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136010#comment-17136010
 ] 

Jitendra Nath Pandey edited comment on HDDS-3630 at 6/15/20, 4:52 PM:
--

It is expected that the majority of the containers on a datanode would be 
closed, idle, and not being accessed.

The following considerations are relevant:
* What are the in-memory and thread-count characteristics of closed 
containers vs open containers? We might be able to optimize the closed 
containers; they need only read caches/threads.
* We could attempt to shut down a rocksdb instance, or tune it down (if 
possible), when its container is not being accessed.
* Since closed containers are immutable, we could consider implementing a 
combined cache for closed-container rocksDBs, and not rely on the rocksDB 
cache.

 


was (Author: jnp):
 It is expected that majority of the containers on a datanode would be closed, 
idle and not being accessed.

Following considerations are relevant:
* What are the in-memory and thread count characteristics for closed container 
vs open containers? We might be able to optimize the closed containers, they 
need only read cache/threads.
* We could attempt to shut down rocksdb instance or tune it down (if possible) 
when container is not being accessed.


 

> Merge rocksdb in datanode
> -
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in 
> Datanode-v2.pdf
>
>
> Currently there is one rocksdb per container, and one container has 5GB 
> capacity, so 10TB of data needs more than 2000 rocksdb instances on one 
> datanode. It's difficult to limit the memory of 2000 rocksdb instances, so 
> maybe we should limit the number of rocksdb instances per disk.
> The design of the improvement is in the following link, but it is still a draft.
> TODO: 
>  1. compatibility with current logic i.e. one rocksdb for each container
>  2. measure the memory usage before and after improvement
>  3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#






[jira] [Commented] (HDDS-3630) Merge rocksdb in datanode

2020-06-15 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136014#comment-17136014
 ] 

Jitendra Nath Pandey commented on HDDS-3630:


bq. The capacity of a container is 5GB. If the container is full, the off-heap 
memory of RocksDB is about 15MB.

The off-heap memory requirement will depend on the number of entries in the 
rocksdb, i.e. the number of blocks. Suppose we have 10MB blocks; then we will 
have only about 500 entries in rocksdb. How much off-heap memory is needed for 
500 entries? Of course it will vary as the average block size goes up and down.
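The arithmetic here can be made explicit: with binary units, 5GB / 10MB gives 512 entries (roughly the 500 quoted). A trivial sketch:

```java
// Back-of-envelope: the rocksdb entry count per container scales inversely
// with average block size, so per-instance off-heap memory depends on block
// size, not just on container capacity.
public class EntryEstimate {
    static final long CONTAINER_BYTES = 5L * 1024 * 1024 * 1024; // 5 GB

    public static long entries(long avgBlockBytes) {
        return CONTAINER_BYTES / avgBlockBytes; // e.g. 10MB blocks -> 512
    }
}
```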



> Merge rocksdb in datanode
> -
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in 
> Datanode-v2.pdf
>
>
> Currently there is one rocksdb per container, and one container has 5GB 
> capacity, so 10TB of data needs more than 2000 rocksdb instances on one 
> datanode. It's difficult to limit the memory of 2000 rocksdb instances, so 
> maybe we should limit the number of rocksdb instances per disk.
> The design of the improvement is in the following link, but it is still a draft.
> TODO: 
>  1. compatibility with current logic i.e. one rocksdb for each container
>  2. measure the memory usage before and after improvement
>  3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#






[jira] [Commented] (HDDS-3630) Merge rocksdb in datanode

2020-06-15 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17136010#comment-17136010
 ] 

Jitendra Nath Pandey commented on HDDS-3630:


It is expected that the majority of the containers on a datanode would be 
closed, idle, and not being accessed.

The following considerations are relevant:
* What are the in-memory and thread-count characteristics of closed 
containers vs open containers? We might be able to optimize the closed 
containers; they need only read caches/threads.
* We could attempt to shut down a rocksdb instance, or tune it down (if 
possible), when its container is not being accessed.


 

> Merge rocksdb in datanode
> -
>
> Key: HDDS-3630
> URL: https://issues.apache.org/jira/browse/HDDS-3630
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Major
> Attachments: Merge RocksDB in Datanode-v1.pdf, Merge RocksDB in 
> Datanode-v2.pdf
>
>
> Currently there is one rocksdb per container, and one container has 5GB 
> capacity, so 10TB of data needs more than 2000 rocksdb instances on one 
> datanode. It's difficult to limit the memory of 2000 rocksdb instances, so 
> maybe we should limit the number of rocksdb instances per disk.
> The design of the improvement is in the following link, but it is still a draft.
> TODO: 
>  1. compatibility with current logic i.e. one rocksdb for each container
>  2. measure the memory usage before and after improvement
>  3. effect on efficiency of read and write.
> https://docs.google.com/document/d/18Ybg-NjyU602c-MYXaJHP6yrg-dVMZKGyoK5C_pp1mM/edit#






[jira] [Commented] (HDDS-2465) S3 Multipart upload failing

2020-06-03 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124724#comment-17124724
 ] 

Jitendra Nath Pandey commented on HDDS-2465:


Is this still a problem?

> S3 Multipart upload failing
> ---
>
> Key: HDDS-2465
> URL: https://issues.apache.org/jira/browse/HDDS-2465
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Priority: Critical
>  Labels: TriagePending
> Attachments: MPU.java
>
>
> When I run the attached Java program, I get the below error during 
> completeMultipartUpload.
> {code:java}
> ERROR StatusLogger No Log4j 2 configuration file found. Using default 
> configuration (logging only errors to the console), or user programmatically 
> provided configurations. Set system property 'log4j2.debug' to show Log4j 2 
> internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2ERROR StatusLogger No Log4j 2 
> configuration file found. Using default configuration (logging only errors to 
> the console), or user programmatically provided configurations. Set system 
> property 'log4j2.debug' to show Log4j 2 internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2Exception in thread "main" 
> com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: 
> Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 
> c7b87393-955b-4c93-85f6-b02945e293ca; S3 Extended Request ID: 7tnVbqgc4bgb), 
> S3 Extended Request ID: 7tnVbqgc4bgb at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867) at 
> com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3464)
>  at org.apache.hadoop.ozone.freon.MPU.main(MPU.java:96){code}
> When I debug, it appears the request has not been received by S3Gateway, and 
> I don't see any trace of it in the audit log.
>  
>  






[jira] [Updated] (HDDS-2328) Support large-scale listing

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2328:
---
Priority: Critical  (was: Major)

> Support large-scale listing 
> 
>
> Key: HDDS-2328
> URL: https://issues.apache.org/jira/browse/HDDS-2328
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Rajesh Balamohan
>Assignee: Hanisha Koneru
>Priority: Critical
>  Labels: TriagePending, performance
>
> Large-scale listing of directory contents takes a very long time and also 
> has the potential to run into OOM. I have > 1 million entries at the same 
> level, and listing took a very long time with {{RemoteIterator}} (it didn't 
> complete, as it was stuck in RDB::seek).
> S3A batches it with 5K listings per fetch, IIRC. It would be good to have 
> this feature in Ozone as well.
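An S3A-style batched listing can be sketched as an iterator that pulls one bounded page per round trip and resumes from the last returned key, so memory stays flat regardless of how many entries share a level. The `fetch` function below stands in for a hypothetical OM listKeys RPC; this is an illustration, not the Ozone client API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.BiFunction;

// Sketch: paged key listing with a bounded page size and a resume token.
public class PagedKeyIterator implements Iterator<String> {
    static final int BATCH = 5000;  // S3A-style page size
    private final BiFunction<String, Integer, List<String>> fetch;
    private List<String> page = new ArrayList<>();
    private int pos = 0;
    private String startAfter = "";   // resume token: last key served
    private boolean exhausted = false;

    public PagedKeyIterator(BiFunction<String, Integer, List<String>> fetch) {
        this.fetch = fetch;
    }

    @Override
    public boolean hasNext() {
        if (pos < page.size()) return true;
        if (exhausted) return false;
        page = fetch.apply(startAfter, BATCH);   // one bounded round trip
        pos = 0;
        if (page.size() < BATCH) exhausted = true;  // short page: last one
        if (!page.isEmpty()) startAfter = page.get(page.size() - 1);
        return pos < page.size();
    }

    @Override
    public String next() {
        if (!hasNext()) throw new NoSuchElementException();
        return page.get(pos++);
    }
}
```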






[jira] [Updated] (HDDS-2465) S3 Multipart upload failing

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2465:
---
Priority: Critical  (was: Major)

> S3 Multipart upload failing
> ---
>
> Key: HDDS-2465
> URL: https://issues.apache.org/jira/browse/HDDS-2465
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Bharat Viswanadham
>Priority: Critical
>  Labels: TriagePending
> Attachments: MPU.java
>
>
> When I run the attached Java program, I get the below error during 
> completeMultipartUpload.
> {code:java}
> ERROR StatusLogger No Log4j 2 configuration file found. Using default 
> configuration (logging only errors to the console), or user programmatically 
> provided configurations. Set system property 'log4j2.debug' to show Log4j 2 
> internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2ERROR StatusLogger No Log4j 2 
> configuration file found. Using default configuration (logging only errors to 
> the console), or user programmatically provided configurations. Set system 
> property 'log4j2.debug' to show Log4j 2 internal initialization logging. See 
> https://logging.apache.org/log4j/2.x/manual/configuration.html for 
> instructions on how to configure Log4j 2Exception in thread "main" 
> com.amazonaws.services.s3.model.AmazonS3Exception: Bad Request (Service: 
> Amazon S3; Status Code: 400; Error Code: 400 Bad Request; Request ID: 
> c7b87393-955b-4c93-85f6-b02945e293ca; S3 Extended Request ID: 7tnVbqgc4bgb), 
> S3 Extended Request ID: 7tnVbqgc4bgb at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
>  at 
> com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
>  at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532) at 
> com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4921) at 
> com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4867) at 
> com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3464)
>  at org.apache.hadoop.ozone.freon.MPU.main(MPU.java:96){code}
> When I debug, the request does not appear to be received by S3Gateway, and I 
> don't see any trace of it in the audit log.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: ozone-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: ozone-issues-h...@hadoop.apache.org



[jira] [Updated] (HDDS-2991) Support ACL in OFS

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2991:
---
Priority: Blocker  (was: Major)

> Support ACL in OFS
> --
>
> Key: HDDS-2991
> URL: https://issues.apache.org/jira/browse/HDDS-2991
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Xiaoyu Yao
>Priority: Blocker
>  Labels: TriagePending
>







[jira] [Updated] (HDDS-2665) Implement new Ozone Filesystem scheme ofs://

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2665:
---
Priority: Blocker  (was: Major)

> Implement new Ozone Filesystem scheme ofs://
> 
>
> Key: HDDS-2665
> URL: https://issues.apache.org/jira/browse/HDDS-2665
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Blocker
>  Labels: Triaged
> Attachments: Design ofs v1.pdf
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Implement a new scheme for Ozone Filesystem where all volumes (and buckets) 
> can be accessed from a single root.
> Alias: Rooted Ozone Filesystem.






[jira] [Updated] (HDDS-3022) Datanode unable to close Pipeline after disk out of space

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3022:
---
Priority: Critical  (was: Major)

> Datanode unable to close Pipeline after disk out of space
> -
>
> Key: HDDS-3022
> URL: https://issues.apache.org/jira/browse/HDDS-3022
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Shashikant Banerjee
>Priority: Critical
>  Labels: TriagePending
> Attachments: ozone_logs.zip
>
>
> The datanode gets into a loop and keeps throwing errors while trying to 
> close the pipeline:
> {code:java}
> 2020-02-14 00:25:10,208 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07: changes role from  
> FOLLOWER to CANDIDATE at term 6240 for changeToCandidate
> 2020-02-14 00:25:10,208 ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE  on pipeline 
> PipelineID=02e7e10e-2d50-4ace-a18b-701265ec9f07.Reason : 
> 285cac09-7622-45e6-be02-b3c68ebf8b10 is in candidate state for 31898494ms
> 2020-02-14 00:25:10,208 INFO org.apache.ratis.server.impl.RoleInfo: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10: start LeaderElection
> 2020-02-14 00:25:10,223 INFO org.apache.ratis.server.impl.LeaderElection: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07-LeaderElection37032: 
> begin an election at term 6241 for 0: 
> [d432c890-5ec4-4cf1-9078-28497a08ab85:10.65.6.227:9858, 
> 285cac09-7622-45e6-be02-b3c68ebf8b10:10.65.24.80:9858, 
> cabbdef8-ed6c-4fc7-b7b2-d1ddd07da47e:10.65.8.165:9858], old=null
> 2020-02-14 00:25:10,259 INFO org.apache.ratis.server.impl.LeaderElection: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07-LeaderElection37032 
> got exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> d432c890-5ec4-4cf1-9078-28497a08ab85: group-701265EC9F07 not found.
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.LeaderElection: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07-LeaderElection37032 
> got exception when requesting votes: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> cabbdef8-ed6c-4fc7-b7b2-d1ddd07da47e: group-701265EC9F07 not found.
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.LeaderElection: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07-LeaderElection37032: 
> Election REJECTED; received 0 response(s) [] and 2 exception(s); 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07:t6241, leader=null, 
> voted=285cac09-7622-45e6-be02-b3c68ebf8b10, 
> raftlog=285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07-SegmentedRaftLog:OPENED:c4,f4,i14,
>  conf=0: [d432c890-5ec4-4cf1-9078-28497a08ab85:10.65.6.227:9858, 
> 285cac09-7622-45e6-be02-b3c68ebf8b10:10.65.24.80:9858, 
> cabbdef8-ed6c-4fc7-b7b2-d1ddd07da47e:10.65.8.165:9858], old=null
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.LeaderElection:   
> Exception 0: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> d432c890-5ec4-4cf1-9078-28497a08ab85: group-701265EC9F07 not found.
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.LeaderElection:   
> Exception 1: java.util.concurrent.ExecutionException: 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: INTERNAL: 
> cabbdef8-ed6c-4fc7-b7b2-d1ddd07da47e: group-701265EC9F07 not found.
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.RaftServerImpl: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-701265EC9F07: changes role from 
> CANDIDATE to FOLLOWER at term 6241 for DISCOVERED_A_NEW_TERM
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.RoleInfo: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10: shutdown LeaderElection
> 2020-02-14 00:25:10,270 INFO org.apache.ratis.server.impl.RoleInfo: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10: start FollowerState
> 2020-02-14 00:25:10,680 WARN org.apache.ratis.grpc.server.GrpcLogAppender: 
> 285cac09-7622-45e6-be02-b3c68ebf8b10@group-DD847EC75388->d432c890-5ec4-4cf1-9078-28497a08ab85-GrpcLogAppender:
>  HEARTBEAT appendEntries Timeout, 
> request=AppendEntriesRequest:cid=12669,entriesCount=0,lastEntry=null
> 2020-02-14 00:25:10,752 ERROR 
> org.apache.hadoop.ozone.container.common.transport.server.ratis.XceiverServerRatis:
>  pipeline Action CLOSE  on pipeline 
> PipelineID=7ad5ce51-d3fa-4e71-99f2-dd847ec75388.Reason : 
> 285cac09-7622-45e6-be02-b3c68ebf8b10 has not seen follower/s 
> 

[jira] [Updated] (HDDS-3133) Add export objectIds in Ozone as FileIds to allow LLAP to cache the files

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3133:
---
Priority: Critical  (was: Major)

> Add export objectIds in Ozone as FileIds to allow LLAP to cache the files
> -
>
> Key: HDDS-3133
> URL: https://issues.apache.org/jira/browse/HDDS-3133
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Filesystem, Ozone Manager
>Reporter: Mukul Kumar Singh
>Priority: Critical
>  Labels: Triaged
>
> Hive LLAP uses fileIds to cache file data. Ozone's objectIds need to be 
> exported as fileIds so that this caching can work effectively.
> https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/HdfsUtils.java#L65






[jira] [Updated] (HDDS-3103) Have multi-raft pipeline calculator to recommend best pipeline number per datanode

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3103:
---
Priority: Critical  (was: Major)

> Have multi-raft pipeline calculator to recommend best pipeline number per 
> datanode
> --
>
> Key: HDDS-3103
> URL: https://issues.apache.org/jira/browse/HDDS-3103
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: SCM
>Affects Versions: 0.5.0
>Reporter: Li Cheng
>Assignee: Li Cheng
>Priority: Critical
>  Labels: TriagePending
>
> PipelinePlacementPolicy should have a calculator method to recommend a 
> better pipeline count per node. The number has so far come from the 
> ozone.datanode.pipeline.limit config property. SCM should be able to take 
> the number of Ratis dirs and the Ratis retry timeout into account to 
> recommend the best pipeline count for every node.






[jira] [Updated] (HDDS-3096) SCM does not exit Safe mode

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3096:
---
Priority: Critical  (was: Major)

> SCM does not exit Safe mode
> ---
>
> Key: HDDS-3096
> URL: https://issues.apache.org/jira/browse/HDDS-3096
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Critical
>  Labels: Triaged
>
> In a few scenarios, e.g. when disks are gone or a datanode is not up, we may 
> try to close pipelines.
> If we close pipelines after an SCM restart, SCM will not come out of safe 
> mode. This is because the current implementation reads the pipeline count 
> from the DB when creating the SafeMode rule object. If any pipeline is 
> closed/removed from the DB after that, the rule does not know about it, so 
> the PipelineSafeMode rule is never met and we never come out of safe mode.
>  
>  
>  
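A minimal sketch of the failure mode described above: a rule whose threshold is captured once at creation and never updated as pipelines are removed. The class and method names below are hypothetical illustrations, not Ozone's actual SafeMode implementation.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch of the stale-threshold problem; names are hypothetical.
class PipelineSafeModeRuleSketch {
    private final int requiredPipelines;              // snapshot taken at startup
    private final Set<String> reportedPipelines = new HashSet<>();

    PipelineSafeModeRuleSketch(int pipelineCountInDb) {
        // Bug pattern: the threshold is frozen here. If pipelines are later
        // closed/removed from the DB, the rule still waits for the old count.
        this.requiredPipelines = pipelineCountInDb;
    }

    void onPipelineReport(String pipelineId) {
        reportedPipelines.add(pipelineId);
    }

    boolean isMet() {
        return reportedPipelines.size() >= requiredPipelines;
    }

    public static void main(String[] args) {
        // 3 pipelines existed in the DB at restart, but one was closed
        // afterwards, so only 2 can ever report: the rule is never satisfied.
        PipelineSafeModeRuleSketch rule = new PipelineSafeModeRuleSketch(3);
        rule.onPipelineReport("p1");
        rule.onPipelineReport("p2");
        System.out.println(rule.isMet()); // prints "false"
    }
}
```

Once one of the three pipelines is removed from the DB, only two can ever report, so the rule stays unmet and SCM remains in safe mode.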






[jira] [Updated] (HDDS-3081) Replication manager should detect and correct containers which don't meet the replication policy

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3081:
---
Fix Version/s: 0.6.0

> Replication manager should detect and correct containers which don't meet the 
> replication policy
> 
>
> Key: HDDS-3081
> URL: https://issues.apache.org/jira/browse/HDDS-3081
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Affects Versions: 0.6.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.6.0
>
>
> In the current implementation, the replication manager does not consider 
> container placement when checking if a container is healthy; only the number 
> of replicas is checked.
> Now that network topology is available, we should consider whether the 
> replication manager should detect and correct mis-replicated containers.
> In HDFS, the namenode does not automatically correct mis-replicated blocks, 
> except at startup when all blocks are checked.






[jira] [Resolved] (HDDS-3081) Replication manager should detect and correct containers which don't meet the replication policy

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey resolved HDDS-3081.

Resolution: Fixed

The pull request is merged, this can be resolved.

> Replication manager should detect and correct containers which don't meet the 
> replication policy
> 
>
> Key: HDDS-3081
> URL: https://issues.apache.org/jira/browse/HDDS-3081
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: SCM
>Affects Versions: 0.6.0
>Reporter: Stephen O'Donnell
>Assignee: Stephen O'Donnell
>Priority: Major
>  Labels: pull-request-available
>
> In the current implementation, the replication manager does not consider 
> container placement when checking if a container is healthy; only the number 
> of replicas is checked.
> Now that network topology is available, we should consider whether the 
> replication manager should detect and correct mis-replicated containers.
> In HDFS, the namenode does not automatically correct mis-replicated blocks, 
> except at startup when all blocks are checked.






[jira] [Updated] (HDDS-698) Support Topology Awareness for Ozone

2020-06-03 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-698:
--
Priority: Blocker  (was: Major)

> Support Topology Awareness for Ozone
> 
>
> Key: HDDS-698
> URL: https://issues.apache.org/jira/browse/HDDS-698
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>Reporter: Xiaoyu Yao
>Assignee: Sammi Chen
>Priority: Blocker
>  Labels: TriagePending
> Attachments: HDDS-698.000.patch, network-topology-default.xml, 
> network-topology-nodegroup.xml
>
>
> This is an umbrella JIRA to add topology-aware support for Ozone Pipelines, 
> Containers and Blocks. Ever since HDFS was created, it has provided 
> rack/nodegroup awareness for reliability and high-performance data access. 
> Ozone needs a similar mechanism, and it can be made more flexible for cloud 
> scenarios.






[jira] [Updated] (HDDS-3350) Ozone Retry Policy Improvements

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3350:
---
Priority: Blocker  (was: Major)

> Ozone Retry Policy Improvements
> ---
>
> Key: HDDS-3350
> URL: https://issues.apache.org/jira/browse/HDDS-3350
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: Ozone Client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Blocker
>  Labels: Triaged, pull-request-available
> Attachments: Retry Behaviour in Ozone Client.pdf, Retry Behaviour in 
> Ozone Client_Updated.pdf, Retry Behaviour in Ozone Client_Updated_2.pdf, 
> Retry Policy Results - Teragen 100GB.pdf
>
>
> Currently any Ozone client request can spend a huge amount of time in 
> retries, and the Ozone client can retry its requests very aggressively. The 
> waiting time can thus be very high before a client request fails. Further, 
> aggressive retries by the Ratis client used by Ozone can bog down a Ratis 
> pipeline leader. This Jira aims to change the current retry behavior in the 
> Ozone client.






[jira] [Commented] (HDDS-3402) Use proper acls for sub directories created during CreateDirectory operation

2020-06-02 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124463#comment-17124463
 ] 

Jitendra Nath Pandey commented on HDDS-3402:


cc [~xyao]

> Use proper acls for sub directories created during CreateDirectory operation
> 
>
> Key: HDDS-3402
> URL: https://issues.apache.org/jira/browse/HDDS-3402
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Bharat Viswanadham
>Assignee: Rakesh Radhakrishnan
>Priority: Blocker
>  Labels: TriagePending
>
> Use proper ACLs for subdirectories created during the create directory 
> operation.
> All subdirectories/missing directories should inherit the ACLs from the 
> bucket if their ancestors are not present in the key table; if an ancestor 
> is present, they should inherit the ACLs from that ancestor.
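The inheritance rule described above (take the ACLs of the nearest ancestor present in the key table, otherwise fall back to the bucket's ACLs) can be sketched as follows. The types and lookup helper are hypothetical stand-ins, not the Ozone Manager API.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch of ACL inheritance for missing sub-directories;
// all names here are hypothetical, not OM code.
class AclInheritanceSketch {

    /**
     * Returns the ACLs a newly created sub-directory should inherit: the
     * nearest existing ancestor's ACLs if one is present in the key table,
     * otherwise the bucket's ACLs.
     */
    static List<String> inheritedAcls(String keyPath,
                                      Map<String, List<String>> keyTable,
                                      List<String> bucketAcls) {
        String path = keyPath;
        int slash;
        while ((slash = path.lastIndexOf('/')) > 0) {
            path = path.substring(0, slash);   // walk up: a/b/c -> a/b -> a
            List<String> acls = keyTable.get(path);
            if (acls != null) {
                return acls;                   // nearest existing ancestor wins
            }
        }
        return bucketAcls;                     // no ancestor in the key table
    }

    public static void main(String[] args) {
        Map<String, List<String>> keyTable =
            Map.of("vol/buck/a", List.of("user:alice:rw"));
        // "vol/buck/a" is the nearest existing ancestor of "vol/buck/a/b/c".
        System.out.println(
            inheritedAcls("vol/buck/a/b/c", keyTable, List.of("group:all:r")));
    }
}
```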






[jira] [Updated] (HDDS-3402) Use proper acls for sub directories created during CreateDirectory operation

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3402:
---
Priority: Blocker  (was: Major)

> Use proper acls for sub directories created during CreateDirectory operation
> 
>
> Key: HDDS-3402
> URL: https://issues.apache.org/jira/browse/HDDS-3402
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Manager
>Reporter: Bharat Viswanadham
>Assignee: Rakesh Radhakrishnan
>Priority: Blocker
>  Labels: TriagePending
>
> Use proper ACLs for subdirectories created during the create directory 
> operation.
> All subdirectories/missing directories should inherit the ACLs from the 
> bucket if their ancestors are not present in the key table; if an ancestor 
> is present, they should inherit the ACLs from that ancestor.






[jira] [Updated] (HDDS-3436) Enable TestOzoneContainerRatis test cases

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3436:
---
Priority: Critical  (was: Major)

> Enable TestOzoneContainerRatis test cases
> -
>
> Key: HDDS-3436
> URL: https://issues.apache.org/jira/browse/HDDS-3436
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Nanda kumar
>Priority: Critical
>
> Fix and enable TestOzoneContainerRatis test cases






[jira] [Updated] (HDDS-3439) Enable TestSecureContainerServer test cases

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3439:
---
Priority: Blocker  (was: Major)

> Enable TestSecureContainerServer test cases
> ---
>
> Key: HDDS-3439
> URL: https://issues.apache.org/jira/browse/HDDS-3439
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>  Components: test
>Affects Versions: 0.5.0
>Reporter: Nanda kumar
>Priority: Blocker
>
> Fix and enable TestSecureContainerServer test cases






[jira] [Updated] (HDDS-3458) Support Hadoop 2.x with build-time classpath separation instead of isolated classloader

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3458:
---
Priority: Blocker  (was: Major)

> Support Hadoop 2.x with build-time classpath separation instead of isolated 
> classloader
> ---
>
> Key: HDDS-3458
> URL: https://issues.apache.org/jira/browse/HDDS-3458
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>  Components: build
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Blocker
>  Labels: Triaged
> Attachments: classpath.pdf
>
>
> Apache Hadoop Ozone is a Hadoop subproject. It depends on the released Hadoop 
> 3.2. But as Hadoop 3.2 is very rare in production, older versions should be 
> supported so that Ozone can work together with Spark, Hive, HBase and older 
> clusters.
> Our current approach uses classloader-based separation (the ozonefs "legacy" 
> jar), which has multiple problems:
>  1. It's quite complex and hard to debug
>  2. It doesn't work together with security
> This issue proposes a different approach:
>  1. Reduce the dependency on Hadoop (including replacing Hadoop metrics and 
> cleaning up the usage of configuration)
>  2. Create multiple versions of ozonefs-client with different compile-time 
> dependencies.






[jira] [Updated] (HDDS-3476) Use persisted transaction info during OM startup in OM StateMachine

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3476:
---
Priority: Critical  (was: Major)

> Use persisted transaction info during OM startup in OM StateMachine
> ---
>
> Key: HDDS-3476
> URL: https://issues.apache.org/jira/browse/HDDS-3476
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Bharat Viswanadham
>Assignee: Bharat Viswanadham
>Priority: Critical
>  Labels: Triaged, pull-request-available
>
> HDDS-3475 persisted transaction info into DB. This Jira is to use 
> transactionInfo persisted to DB during OM startup. 






[jira] [Updated] (HDDS-3481) SCM ask 31 datanodes to replicate the same container

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3481:
---
Priority: Critical  (was: Major)

> SCM ask 31 datanodes to replicate the same container
> 
>
> Key: HDDS-3481
> URL: https://issues.apache.org/jira/browse/HDDS-3481
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
>  Labels: TriagePending
> Attachments: screenshot-1.png, screenshot-2.png, screenshot-3.png, 
> screenshot-4.png
>
>
> *What's the problem?*
> As the image shows, SCM asked 31 datanodes to replicate container 2037, one 
> every 10 minutes starting from 2020-04-17 23:38:51. At 2020-04-18 08:58:52 
> SCM found that the replica count of container 2037 was 12, so it asked 11 
> datanodes to delete container 2037.
>  !screenshot-1.png! 
>  !screenshot-2.png! 
> *What's the reason?*
> SCM checks whether (container replica num + 
> inflightReplication.get(containerId).size() - 
> inflightDeletion.get(containerId).size()) is less than 3. If it is, SCM asks 
> some datanode to replicate the container and adds the action to 
> inflightReplication.get(containerId). The replicate action timeout is 10 
> minutes; if the action times out, SCM deletes it from 
> inflightReplication.get(containerId), as the image shows. The expression 
> above then drops below 3 again, and SCM asks yet another datanode to 
> replicate the container.
> Because replicating a container takes a long time, it sometimes cannot 
> finish within 10 minutes, so a new datanode ended up being asked to 
> replicate the container every 10 minutes, 31 in total. 19 of the 31 
> datanodes replicated the container from the same source datanode, which also 
> put heavy pressure on the source datanode and made replication even slower. 
> It actually took 4 hours to finish the first replication.
>  !screenshot-4.png! 
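The check described above can be written out as a small arithmetic sketch. The method name and signature are illustrative, not SCM's actual ReplicationManager code; the sketch only shows why evicting a timed-out in-flight entry immediately re-triggers another replication command.

```java
// Illustrative sketch of the under-replication check described in this issue.
class ReplicationCheckSketch {
    static final int REPLICATION_FACTOR = 3;

    /** Number of additional replicas SCM would schedule, per the check above. */
    static int additionalReplicasNeeded(int replicas,
                                        int inflightReplication,
                                        int inflightDeletion) {
        int effective = replicas + inflightReplication - inflightDeletion;
        return Math.max(0, REPLICATION_FACTOR - effective);
    }

    public static void main(String[] args) {
        // One healthy replica, two replications already in flight:
        // effective count is 3, so no new command is issued.
        System.out.println(additionalReplicasNeeded(1, 2, 0));
        // After 10 minutes the in-flight entries time out and are dropped,
        // so the same container looks under-replicated again and SCM asks
        // more datanodes -- the loop described in this issue.
        System.out.println(additionalReplicasNeeded(1, 0, 0));
    }
}
```

The loop stops only once a replication actually completes before the 10-minute timeout, which is why slow copies kept the cycle going for hours here.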






[jira] [Comment Edited] (HDDS-3508) container replicas are replicated to all available datanodes

2020-06-02 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124309#comment-17124309
 ] 

Jitendra Nath Pandey edited comment on HDDS-3508 at 6/2/20, 11:24 PM:
--

Duplicate of HDDS-3481?


was (Author: arpitagarwal):
Duplicate of HDFS-3481?

> container replicas are replicated to all available datanodes
> 
>
> Key: HDDS-3508
> URL: https://issues.apache.org/jira/browse/HDDS-3508
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: SCM
>Reporter: Nilotpal Nandi
>Assignee: Nanda kumar
>Priority: Major
>  Labels: TriagePending
>
> Steps taken:
> ---
> 1. Write data.
> 2. Delete the hdds datanode dir from one of the container replica nodes 
> (this node is the leader).
> 3. Wait for a few hours.
> 4. Stop the datanode where the hdds datanode dir was deleted.
>  
> The container got replicated on all available DNs:
> {noformat}
> ozone admin container info 25 | egrep 'Container|Datanodes'
> Wed Apr 29 07:28:13 UTC 2020
> Container id: 25
> Container State: CLOSED
> Datanodes: 
> [quasar-xotthq-6.quasar-xotthq.root.hwx.site,quasar-xotthq-7.quasar-xotthq.root.hwx.site,quasar-xotthq-4.quasar-xotthq.root.hwx.site,quasar-xotthq-8.quasar-xotthq.root.hwx.site,quasar-xotthq-3.quasar-xotthq.root.hwx.site,quasar-xotthq-5.quasar-xotthq.root.hwx.site,quasar-xotthq-2.quasar-xotthq.root.hwx.site]{noformat}






[jira] [Updated] (HDDS-3512) s3g multi-upload saved content incorrect when client uses aws java sdk 1.11.* jar

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3512:
---
Priority: Blocker  (was: Major)

> s3g multi-upload saved content incorrect when client uses aws java sdk 1.11.* 
> jar
> -
>
> Key: HDDS-3512
> URL: https://issues.apache.org/jira/browse/HDDS-3512
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: s3g
>Reporter: Sammi Chen
>Priority: Blocker
>  Labels: TriagePending
>
> The default multi-part size is 5MB, i.e. 5242880 bytes, while all the chunks 
> saved by s3g are 5246566 bytes, which is greater than 5MB.
> Looking into ObjectEndpoint.java, it seems the chunk size is retrieved from 
> the "Content-Length" header.
>  
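One plausible cause, stated here as an assumption rather than a confirmed diagnosis: if the AWS SDK signs part uploads with streaming signature v4 (aws-chunked content encoding), Content-Length covers the chunk framing and signatures in addition to the payload, and the payload length is carried in the x-amz-decoded-content-length header instead. A minimal sketch of header selection under that assumption (the helper itself is hypothetical, not s3g code):

```java
import java.util.Map;

// Hypothetical helper: prefer the decoded payload length when a client
// used aws-chunked streaming signing. The header names are the real S3
// ones; the surrounding code is an illustrative sketch only.
class PayloadLengthSketch {

    static long effectiveLength(Map<String, String> headers) {
        String decoded = headers.get("x-amz-decoded-content-length");
        if (decoded != null) {
            return Long.parseLong(decoded);  // actual payload bytes
        }
        return Long.parseLong(headers.get("Content-Length"));
    }

    public static void main(String[] args) {
        // With the header pair from this issue's numbers, the decoded
        // length (5242880 = 5MB) would be chosen over Content-Length.
        System.out.println(effectiveLength(Map.of(
            "Content-Length", "5246566",
            "x-amz-decoded-content-length", "5242880")));
    }
}
```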






[jira] [Updated] (HDDS-3554) Multipart Upload Failed because partName mismatch

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3554:
---
Priority: Critical  (was: Major)

> Multipart Upload Failed because partName mismatch
> -
>
> Key: HDDS-3554
> URL: https://issues.apache.org/jira/browse/HDDS-3554
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: s3g
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
>  Labels: TriagePending
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 






[jira] [Updated] (HDDS-3562) Datanodes should send ICR when a container replica deletion is successful

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3562:
---
Priority: Blocker  (was: Major)

> Datanodes should send ICR when a container replica deletion is successful
> -
>
> Key: HDDS-3562
> URL: https://issues.apache.org/jira/browse/HDDS-3562
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Nanda kumar
>Assignee: Nanda kumar
>Priority: Blocker
>  Labels: Triaged, pull-request-available
>
> Whenever a datanode executes the delete container command and deletes the 
> container replica, it has to immediately send an ICR to update the container 
> replica state in SCM.






[jira] [Commented] (HDDS-3554) Multipart Upload Failed because partName mismatch

2020-06-02 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124457#comment-17124457
 ] 

Jitendra Nath Pandey commented on HDDS-3554:


cc [~bharat] [~elek]

> Multipart Upload Failed because partName mismatch
> -
>
> Key: HDDS-3554
> URL: https://issues.apache.org/jira/browse/HDDS-3554
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: s3g
>Reporter: runzhiwang
>Assignee: runzhiwang
>Priority: Critical
>  Labels: TriagePending
> Attachments: screenshot-1.png
>
>
>  !screenshot-1.png! 






[jira] [Commented] (HDDS-3594) ManagedChannels are leaked in XceiverClientGrpc manager

2020-06-02 Thread Jitendra Nath Pandey (Jira)


[ 
https://issues.apache.org/jira/browse/HDDS-3594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17124455#comment-17124455
 ] 

Jitendra Nath Pandey commented on HDDS-3594:


Is it a duplicate of HDDS-3600 ?

> ManagedChannels are leaked in XceiverClientGrpc manager
> ---
>
> Key: HDDS-3594
> URL: https://issues.apache.org/jira/browse/HDDS-3594
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.6.0
>Reporter: Rakesh Radhakrishnan
>Priority: Major
>  Labels: TriagePending
>
> XceiverClientGrpc#ManagedChannels are leaked when running {{Hadoop Synthetic 
> Load Generator}} pointing to OzoneFS.
> *Stacktrace:*
> {code:java}
> SEVERE: *~*~*~ Channel ManagedChannelImpl{logId=99, target=10.17.248.31:9859} 
> was not shutdown properly!!! ~*~*~*
> Make sure to call shutdown()/shutdownNow() and wait until 
> awaitTermination() returns true.
> java.lang.RuntimeException: ManagedChannel allocation site
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper$ManagedChannelReference.(ManagedChannelOrphanWrapper.java:94)
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:52)
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.ManagedChannelOrphanWrapper.(ManagedChannelOrphanWrapper.java:43)
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.AbstractManagedChannelImplBuilder.build(AbstractManagedChannelImplBuilder.java:518)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.connectToDatanode(XceiverClientGrpc.java:191)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientGrpc.connect(XceiverClientGrpc.java:140)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:244)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientManager$2.call(XceiverClientManager.java:228)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4876)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3529)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2278)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2155)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2045)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache.get(LocalCache.java:3951)
> at 
> org.apache.hadoop.ozone.shaded.com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4871)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientManager.getClient(XceiverClientManager.java:228)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientManager.acquireClient(XceiverClientManager.java:174)
> at 
> org.apache.hadoop.hdds.scm.XceiverClientManager.acquireClientForReadData(XceiverClientManager.java:164)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.getChunkInfos(BlockInputStream.java:184)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.initialize(BlockInputStream.java:133)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockInputStream.read(BlockInputStream.java:254)
> at 
> org.apache.hadoop.ozone.client.io.KeyInputStream.read(KeyInputStream.java:199)
> at 
> org.apache.hadoop.fs.ozone.OzoneFSInputStream.read(OzoneFSInputStream.java:63)
> at java.io.DataInputStream.read(DataInputStream.java:100)
> at 
> org.apache.hadoop.fs.loadGenerator.LoadGenerator$DFSClientThread.read(LoadGenerator.java:284)
> at 
> org.apache.hadoop.fs.loadGenerator.LoadGenerator$DFSClientThread.nextOp(LoadGenerator.java:268)
> at 
> org.apache.hadoop.fs.loadGenerator.LoadGenerator$DFSClientThread.run(LoadGenerator.java:235)
> {code}
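The orphan warning above means a channel was garbage-collected without `shutdown()`/`awaitTermination()` being called. A minimal sketch of the missing cleanup step, using a hypothetical `FakeChannel` stand-in rather than the real gRPC or XceiverClientGrpc classes (names are illustrative, not the actual API):

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a gRPC ManagedChannel; the real class lives in
 *  org.apache.ratis.thirdparty.io.grpc and is shut down via
 *  shutdown()/awaitTermination(). */
class FakeChannel implements AutoCloseable {
  boolean shutdown = false;

  @Override
  public void close() {
    shutdown = true; // stands in for channel.shutdown(); channel.awaitTermination(...)
  }
}

/** Sketch of a client cache that closes its channel whenever an entry is
 *  dropped -- the cleanup whose absence triggers the
 *  "was not shutdown properly" warning. */
class ChannelCache {
  private final Map<String, FakeChannel> channels = new HashMap<>();

  FakeChannel acquire(String target) {
    return channels.computeIfAbsent(target, t -> new FakeChannel());
  }

  void release(String target) {
    FakeChannel ch = channels.remove(target);
    if (ch != null) {
      ch.close(); // never drop the reference without closing it
    }
  }
}

public class Main {
  public static void main(String[] args) {
    ChannelCache cache = new ChannelCache();
    FakeChannel ch = cache.acquire("10.17.248.31:9859");
    cache.release("10.17.248.31:9859");
    System.out.println(ch.shutdown);
  }
}
```

The same pattern applies to a Guava cache via a removal listener that closes the evicted client.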






[jira] [Updated] (HDDS-3596) Clean up unused code after HDDS-2940 and HDDS-2942

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3596:
---
Priority: Blocker  (was: Major)

> Clean up unused code after HDDS-2940 and HDDS-2942
> --
>
> Key: HDDS-3596
> URL: https://issues.apache.org/jira/browse/HDDS-3596
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Blocker
>  Labels: Triaged, pull-request-available
>
> It seems some snippets of code should be removed now that HDDS-2940 is committed. 
> Update: pending the HDDS-2942 commit before this can be committed.
> For example 
> [this|https://github.com/apache/hadoop-ozone/blob/ffb340e32460ccaa2eae557f0bb71fb90d7ebc7a/hadoop-ozone/ozonefs/src/main/java/org/apache/hadoop/fs/ozone/BasicOzoneFileSystem.java#L495-L499]:
> {code:java|title=BasicOzoneFileSystem#delete}
> if (result) {
>   // If this delete operation removes all files/directories from the
>   // parent directory, then an empty parent directory must be created.
>   createFakeParentDirectory(f);
> }
> {code}
> (Found at 
> https://github.com/apache/hadoop-ozone/pull/906#discussion_r424873030)






[jira] [Updated] (HDDS-3599) Implement ofs://: Add contract test for HA

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3599:
---
Priority: Blocker  (was: Major)

> Implement ofs://: Add contract test for HA
> --
>
> Key: HDDS-3599
> URL: https://issues.apache.org/jira/browse/HDDS-3599
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Blocker
>  Labels: Triaged, pull-request-available
>
> Add contract tests for HA as well.
> Since adding HA contract tests means another ~10 new classes, [~xyao] and I 
> decided to put the HA OFS contract tests in another jira.






[jira] [Updated] (HDDS-3600) ManagedChannels leaked on ratis pipeline when there are many connection retries

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3600:
---
Priority: Critical  (was: Major)

> ManagedChannels leaked on ratis pipeline when there are many connection 
> retries
> ---
>
> Key: HDDS-3600
> URL: https://issues.apache.org/jira/browse/HDDS-3600
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Affects Versions: 0.6.0
>Reporter: Rakesh Radhakrishnan
>Priority: Critical
>  Labels: TriagePending
> Attachments: HeapHistogram-Snapshot-ManagedChannel-Leaked-001.png, 
> outloggenerator-ozonefs-003.log
>
>
> ManagedChannels leaked on ratis pipeline when there are many connection 
> retries
> Observed that too many ManagedChannels were opened while running the Synthetic 
> Hadoop load generator.
>  Ran the benchmark with only one pipeline in the cluster, and again with only 
> two pipelines in the cluster. 
>  Both runs failed with too many open files; many TCP connections stayed open 
> for a long time, suggesting channel leaks.
> More details below:
>  *1)* Execute NNloadGenerator
> {code:java}
> [rakeshr@ve1320 loadOutput]$ ps -ef | grep load
> hdfs 362822  1 19 05:24 pts/000:03:16 
> /usr/java/jdk1.8.0_232-cloudera/bin/java -Dproc_jar -Xmx825955249 
> -Djava.net.preferIPv4Stack=true -Djava.net.preferIPv4Stack=true 
> -Dyarn.log.dir=/var/log/hadoop-yarn -Dyarn.log.file=hadoop.log 
> -Dyarn.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/libexec/../../hadoop-yarn
>  -Dyarn.root.logger=INFO,console 
> -Djava.library.path=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop/lib/native
>  -Dhadoop.log.dir=/var/log/hadoop-yarn -Dhadoop.log.file=hadoop.log 
> -Dhadoop.home.dir=/opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/lib/hadoop
>  -Dhadoop.id.str=hdfs -Dhadoop.root.logger=INFO,console 
> -Dhadoop.policy.file=hadoop-policy.xml 
> -Dhadoop.security.logger=INFO,NullAppender org.apache.hadoop.util.RunJar 
> /opt/cloudera/parcels/CDH-7.2.0-1.cdh7.2.0.p0.2982244/jars/hadoop-mapreduce-client-jobclient-3.1.1.7.2.0.0-141-tests.jar
>  NNloadGenerator -root o3fs://bucket2.vol2/
> rakeshr  368739 354174  0 05:41 pts/000:00:00 grep --color=auto load
> {code}
> *2)* Active TCP connections to port 9858 (the default ratis pipeline port) 
> during the run.
> {code:java}
> [rakeshr@ve1320 loadOutput]$ sudo lsof -a -p 362822 | grep "9858" | wc
>3229   32290  494080
> [rakeshr@ve1320 loadOutput]$ vi tcp_log
> 
> java440633 hdfs 4090u IPv4  271141987   0t0TCP 
> ve1320.halxg.cloudera.com:35190->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java440633 hdfs 4091u IPv4  271127918   0t0TCP 
> ve1320.halxg.cloudera.com:35192->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java440633 hdfs 4092u IPv4  271038583   0t0TCP 
> ve1320.halxg.cloudera.com:59116->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java440633 hdfs 4093u IPv4  271038584   0t0TCP 
> ve1320.halxg.cloudera.com:59118->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> java440633 hdfs 4095u IPv4  271127920   0t0TCP 
> ve1320.halxg.cloudera.com:35196->ve1323.halxg.cloudera.com:9858 (ESTABLISHED)
> [rakeshr@ve1320 loadOutput]$ ^C
>  {code}
> *3)* The heap dump shows 9571 ManagedChannel objects. The heap dump is quite 
> large, so a snapshot is attached to this jira.
> *4)* Attached the output and thread dump of the SyntheticLoadGenerator benchmark 
> client process to show the exceptions printed to the console. FYI, this file 
> was quite large and a few repeated exception traces have been trimmed.






[jira] [Updated] (HDDS-3611) Ozone client should not consider closed container error as failure

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3611:
---
Priority: Critical  (was: Major)

> Ozone client should not consider closed container error as failure
> --
>
> Key: HDDS-3611
> URL: https://issues.apache.org/jira/browse/HDDS-3611
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Client
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Critical
>  Labels: TriagePending
>
> A ContainerNotOpenException is thrown by the datanode when a client writes to 
> a non-open container. Currently the ozone client treats this as a failure and 
> increments the retry count; if the client reaches the configured retry count, 
> it fails the write. MapReduce jobs were seen failing due to this error with 
> the default retry count of 5.
> The idea is to not count closed-container errors toward the retry limit. This 
> would ensure that ozone client writes do not fail due to closed-container 
> exceptions.
> {code:java}
> 2020-05-15 02:20:28,375 ERROR [main] 
> org.apache.hadoop.ozone.client.io.KeyOutputStream: Retry request failed. 
> retries get failed due to exceeded maximum allowed retries number: 5
> java.io.IOException: Unexpected Storage Container Exception: 
> java.util.concurrent.CompletionException: 
> java.util.concurrent.CompletionException: 
> org.apache.ratis.protocol.StateMachineException: 
> org.apache.hadoop.hdds.scm.container.common.helpers.ContainerNotOpenException 
> from Server e2eec12f-02c5-46e2-9c23-14d6445db219@group-A3BF3ABDC307: 
> Container 15 in CLOSED state
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.setIoException(BlockOutputStream.java:551)
> at 
> org.apache.hadoop.hdds.scm.storage.BlockOutputStream.lambda$writeChunkToContainer$3(BlockOutputStream.java:638)
> at 
> java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884)
> at 
> java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at 
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:99)
> at 
> org.apache.ratis.client.impl.OrderedAsync$PendingOrderedRequest.setReply(OrderedAsync.java:60)
> at 
> org.apache.ratis.util.SlidingWindow$RequestMap.setReply(SlidingWindow.java:143)
> at 
> org.apache.ratis.util.SlidingWindow$Client.receiveReply(SlidingWindow.java:314)
> at 
> org.apache.ratis.client.impl.OrderedAsync.lambda$sendRequest$9(OrderedAsync.java:242)
> at 
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> at 
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
> at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
> at 
> java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1975)
> at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.lambda$onNext$0(GrpcClientProtocolClient.java:284)
> at java.util.Optional.ifPresent(Optional.java:159)
> at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.handleReplyFuture(GrpcClientProtocolClient.java:340)
> at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers.access$100(GrpcClientProtocolClient.java:264)
> at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:284)
> at 
> org.apache.ratis.grpc.client.GrpcClientProtocolClient$AsyncStreamObservers$1.onNext(GrpcClientProtocolClient.java:267)
> at 
> org.apache.ratis.thirdparty.io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onMessage(ClientCalls.java:436)
> at 
> org.apache.ratis.thirdparty.io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1MessagesAvailable.runInternal(ClientCallImpl.java:658)
> ...{code}
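The proposed policy can be sketched as a retry loop whose budget is only consumed by real failures; a closed-container error just triggers another attempt against a fresh container. This is an illustrative model with a hypothetical marker exception, not the actual KeyOutputStream code:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical marker for the datanode's closed-container error; the real
 *  class is ContainerNotOpenException in the HDDS client. */
class ContainerNotOpenException extends RuntimeException {}

public class Main {
  /** A closed-container error is retried without consuming the retry budget;
   *  any other failure counts toward maxRetries. */
  static String writeWithRetries(Iterator<Object> outcomes, int maxRetries) {
    int retries = 0;
    while (true) {
      Object outcome = outcomes.next(); // null simulates a successful write
      if (outcome == null) {
        return "ok";
      }
      if (outcome instanceof ContainerNotOpenException) {
        continue; // excluded from the retry count
      }
      if (++retries > maxRetries) {
        return "failed";
      }
    }
  }

  public static void main(String[] args) {
    List<Object> outcomes = Arrays.asList(
        new ContainerNotOpenException(), // free retry
        new ContainerNotOpenException(), // free retry
        new RuntimeException(),          // counted against the budget
        null);                           // success
    System.out.println(writeWithRetries(outcomes.iterator(), 1));
  }
}
```

With the old behavior the two closed-container errors plus one real failure would already exceed a retry limit of 1; under the sketched policy the write still succeeds.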






[jira] [Updated] (HDDS-3612) Allow mounting bucket under other volume

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3612:
---
Priority: Critical  (was: Major)

> Allow mounting bucket under other volume
> 
>
> Key: HDDS-3612
> URL: https://issues.apache.org/jira/browse/HDDS-3612
> Project: Hadoop Distributed Data Store
>  Issue Type: New Feature
>  Components: Ozone Manager
>Reporter: Attila Doroszlai
>Assignee: Attila Doroszlai
>Priority: Critical
>  Labels: Triaged, pull-request-available
>
> Step 2 from S3 [volume mapping design 
> doc|https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/docs/content/design/ozone-volume-management.md#solving-the-mapping-problem-2-4-from-the-problem-listing]:
> Implement a bind-mount mechanism which makes it possible to mount any 
> volume/bucket under the specific "s3" volume.






[jira] [Updated] (HDDS-3619) OzoneManager fails with IllegalArgumentException for cmdType RenameKey

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3619:
---
Priority: Critical  (was: Major)

> OzoneManager fails with IllegalArgumentException for cmdType RenameKey
> --
>
> Key: HDDS-3619
> URL: https://issues.apache.org/jira/browse/HDDS-3619
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: HA, Ozone Manager
>Reporter: Lokesh Jain
>Priority: Critical
>  Labels: Triaged
>
> All Ozone Manager instances on startup fail with IllegalArgumentException for 
> command type RenameKey.
> {code:java}
> 2020-05-19 01:26:32,406 WARN 
> org.apache.ratis.grpc.server.GrpcServerProtocolService: om1: installSnapshot 
> onError, lastRequest: om2->om1#4-t34, previous=(t:34, i:44118), 
> leaderCommit=44118, initializing? false, entries: size=1, first=(t:34, 
> i:44119), METADATAENTRY(c:44118): 
> org.apache.ratis.thirdparty.io.grpc.StatusRuntimeException: CANCELLED: 
> cancelled before receiving half close
> 2020-05-19 01:26:33,521 ERROR 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine: Terminating with 
> exit status 1: Request cmdType: RenameKey
> traceID: ""
> clientId: "client-E7949F1158CC"
> userInfo {
>   userName: "h...@halxg.cloudera.com"
>   remoteAddress: "10.17.200.43"
>   hostName: "vb0933.halxg.cloudera.com"
> }
> renameKeyRequest {
>   keyArgs {
> volumeName: "vol1"
> bucketName: "bucket1"
> keyName: "teragen/100G-terasort-input/"
> dataSize: 0
> modificationTime: 1589872757030
>   }
>   toKeyName: "user/ljain/.Trash/Current/teragen/100G-terasort-input/"
> }
> failed with exception
> java.lang.IllegalArgumentException: Trying to set updateID to 35984 which is 
> not greater than the current value of 42661 for OMKeyInfo{volume='vol1', 
> bucket='bucket1', key='teragen/100G-terasort-input/', dataSize='0', 
> creationTime='1589876037688', type='RATIS', factor='ONE'}
> at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:142)
> at 
> org.apache.hadoop.ozone.om.helpers.WithObjectID.setUpdateID(WithObjectID.java:107)
> at 
> org.apache.hadoop.ozone.om.request.key.OMKeyRenameRequest.validateAndUpdateCache(OMKeyRenameRequest.java:213)
> at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:226)
> at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:428)
> at 
> org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$applyTransaction$1(OzoneManagerStateMachine.java:242)
> at 
> java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 2020-05-19 01:26:33,526 INFO org.apache.hadoop.ozone.om.OzoneManagerStarter: 
> SHUTDOWN_MSG:
> /
> {code}
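The exception comes from a monotonicity guard: a key's updateID may only move forward, so a replayed transaction carrying a stale index (35984 against a current value of 42661) is rejected and the state machine terminates. A simplified sketch of that guard (illustrative class, not the actual WithObjectID implementation):

```java
/** Simplified sketch of the monotonic-updateID guard behind
 *  WithObjectID#setUpdateID: a transaction with a stale index must not move
 *  the field backwards. */
class KeyInfo {
  private long updateID = -1;

  void setUpdateID(long id) {
    if (id <= updateID) {
      throw new IllegalArgumentException("Trying to set updateID to " + id
          + " which is not greater than the current value of " + updateID);
    }
    updateID = id;
  }

  long getUpdateID() {
    return updateID;
  }
}

public class Main {
  public static void main(String[] args) {
    KeyInfo key = new KeyInfo();
    key.setUpdateID(42661);   // state already applied up to this index
    try {
      key.setUpdateID(35984); // stale transaction index, as in the log
    } catch (IllegalArgumentException e) {
      System.out.println("rejected");
    }
    System.out.println(key.getUpdateID());
  }
}
```

The bug is therefore not the guard itself but that Ratis re-applies an already-applied transaction after the install-snapshot error, tripping the check.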






[jira] [Updated] (HDDS-3632) HddsDatanodeService cannot be started if HDFS datanode running in same machine with same user.

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3632:
---
Priority: Critical  (was: Major)

> HddsDatanodeService cannot be started if HDFS datanode running in same 
> machine with same user.
> --
>
> Key: HDDS-3632
> URL: https://issues.apache.org/jira/browse/HDDS-3632
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Uma Maheswara Rao G
>Priority: Critical
>  Labels: Triaged, newbie
>
> Since the service names are the same and both refer to the same location for 
> pid files, we cannot start both services at once.
> The workaround is to export HADOOP_PID_DIR to a different location after 
> starting one service, then start the other.
> It would be better to have different pid file names.
>  
>  
> {noformat}
> Umas-MacBook-Pro ozone-0.5.0-beta % bin/ozone --daemon start datanode
> datanode is running as process 25167.  Stop it first.
> {noformat}
>  






[jira] [Updated] (HDDS-3632) HddsDatanodeService cannot be started if HDFS datanode running in same machine with same user.

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3632:
---
Priority: Blocker  (was: Critical)

> HddsDatanodeService cannot be started if HDFS datanode running in same 
> machine with same user.
> --
>
> Key: HDDS-3632
> URL: https://issues.apache.org/jira/browse/HDDS-3632
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Affects Versions: 0.5.0
>Reporter: Uma Maheswara Rao G
>Priority: Blocker
>  Labels: Triaged, newbie
>
> Since the service names are the same and both refer to the same location for 
> pid files, we cannot start both services at once.
> The workaround is to export HADOOP_PID_DIR to a different location after 
> starting one service, then start the other.
> It would be better to have different pid file names.
>  
>  
> {noformat}
> Umas-MacBook-Pro ozone-0.5.0-beta % bin/ozone --daemon start datanode
> datanode is running as process 25167.  Stop it first.
> {noformat}
>  






[jira] [Updated] (HDDS-3627) Remove FilteredClassloader and replace with maven based hadoop2/hadoop3 ozonefs generation

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3627:
---
Priority: Blocker  (was: Major)

> Remove FilteredClassloader and replace with maven based hadoop2/hadoop3 
> ozonefs generation
> --
>
> Key: HDDS-3627
> URL: https://issues.apache.org/jira/browse/HDDS-3627
> Project: Hadoop Distributed Data Store
>  Issue Type: Sub-task
>Reporter: Marton Elek
>Assignee: Marton Elek
>Priority: Blocker
>  Labels: pull-request-available
>
> As described in the parent issue, the final step is to create a Hadoop 
> independent shaded client and hadoop2/hadoop3 related separated client jars.






[jira] [Updated] (HDDS-3694) Reduce dn-audit log

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3694:
---
Priority: Critical  (was: Minor)

> Reduce dn-audit log
> ---
>
> Key: HDDS-3694
> URL: https://issues.apache.org/jira/browse/HDDS-3694
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Dinesh Chitlangia
>Priority: Critical
>  Labels: Triaged, performance, pull-request-available
> Attachments: write_to_dn_audit_causing_high_disk_util.png
>
>
> Do we really need such a fine-grained audit log? It ends up creating too many 
> entries for chunks.
> {noformat}
> 2020-05-31 23:31:48,477 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324230275483 bcsId: 93943} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,482 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267323565871437 bcsId: 93940} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,487 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324230275483 bcsId: 93943} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,497 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267324172472725 bcsId: 93934} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,501 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267323675906396 bcsId: 93958} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,504 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324230275483 bcsId: 93943} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,509 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267323685343583 bcsId: 93974} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,512 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267324172472725 bcsId: 93934} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,516 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324332380586 bcsId: 0} | 
> ret=SUCCESS |
> 2020-05-31 23:31:48,726 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267324232634780 bcsId: 93964} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,733 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267323976323460 bcsId: 93967} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,740 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324131512723 bcsId: 93952} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,752 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267324230275483 bcsId: 93943} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,760 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 165 locID: 104267323675906396 bcsId: 93958} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,772 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 166 locID: 104267323685343583 bcsId: 93974} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,780 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 164 locID: 104267324304724389 bcsId: 0} | 
> ret=SUCCESS |
> 2020-05-31 23:31:48,787 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 164 locID: 104267323991724421 bcsId: 93970} 
> | ret=SUCCESS |
> 2020-05-31 23:31:48,794 | INFO  | DNAudit | user=null | ip=null | 
> op=WRITE_CHUNK {blockData=conID: 164 locID: 104267323725189479 bcsId: 93963} 
> | ret=SUCCESS |
>  {noformat}
> This ends up saturating the disk while achieving a lower write rate in MB/sec.
> Note the 100+ writes proceeding at 0.52 MB/sec while consuming the entire disk 
> utilization.
> !write_to_dn_audit_causing_high_disk_util.png|width=726,height=300!
>  
> Also, the username and IP are currently set to null. They should be filled in 
> using details from grpc.






[jira] [Updated] (HDDS-3703) Consider avoiding file lookup calls in writeChunk hotpath

2020-06-02 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3703:
---
Priority: Blocker  (was: Major)

> Consider avoiding file lookup calls in writeChunk hotpath
> -
>
> Key: HDDS-3703
> URL: https://issues.apache.org/jira/browse/HDDS-3703
> Project: Hadoop Distributed Data Store
>  Issue Type: Improvement
>Reporter: Rajesh Balamohan
>Assignee: Attila Doroszlai
>Priority: Blocker
>  Labels: performance
> Attachments: Screenshot 2020-06-02 at 5.28.59 PM.png
>
>
> Internally, getChunkFile invokes "verifyChunkDirExists", which performs a file 
> existence check for the directory and throws an IOException accordingly. Since 
> the file is going to be written anyway, it is better to handle the failure 
> later and throw the same exception then. This would avoid a file check for 
> every "writeChunk".
> [https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/FilePerBlockStrategy.java#L106]
>  
> File channels are cached anyway in "OpenFiles", so if we can avoid 
> "file.getAbsolutePath()", this could save memory and path resolution. 
> [https://github.com/apache/hadoop-ozone/blob/master/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/keyvalue/impl/FilePerBlockStrategy.java#L118]
>  
> Also, "validateChunkForOverwrite" can be optimised, since "isOverWritePermitted" 
> would be false most of the time.
>  
> !Screenshot 2020-06-02 at 5.28.59 PM.png|width=835,height=510!
>  
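The first suggestion amounts to deferring the directory probe to the failure path. A minimal sketch under illustrative names (not the actual FilePerBlockStrategy API): write first, and only if the write fails, check the directory so the same precise error can still be reported.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;

public class Main {
  /** Writes the chunk first; probes the directory only when the write fails,
   *  so the happy path skips one stat call per chunk. */
  static void writeChunk(File chunkFile, byte[] data) throws IOException {
    try (FileOutputStream out = new FileOutputStream(chunkFile)) {
      out.write(data);
    } catch (IOException e) {
      // Pay for the directory check only now, to produce the same error
      // message an up-front verifyChunkDirExists would have thrown.
      if (!chunkFile.getParentFile().isDirectory()) {
        throw new IOException(
            "Chunk directory does not exist: " + chunkFile.getParent(), e);
      }
      throw e;
    }
  }

  public static void main(String[] args) throws IOException {
    File dir = Files.createTempDirectory("chunks").toFile();
    File chunk = new File(dir, "chunk1");
    writeChunk(chunk, new byte[] {1, 2, 3});
    System.out.println(chunk.length());
  }
}
```

On the hot path this removes one filesystem metadata lookup per writeChunk; the error semantics for a missing directory are preserved.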






[jira] [Assigned] (HDDS-3323) Hive fails to work with OM HA

2020-04-27 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey reassigned HDDS-3323:
--

Assignee: Hanisha Koneru  (was: Bharat Viswanadham)

> Hive fails to work with OM HA
> -
>
> Key: HDDS-3323
> URL: https://issues.apache.org/jira/browse/HDDS-3323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Namit Maheshwari
>Assignee: Hanisha Koneru
>Priority: Major
>
> Hive fails to work with OM HA due to the below configuration issue:
> Please look at the below Hive client conf:
> {code:java}
>   
> ozone.om.address.ozone1.om1
> xxx-7.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om2
> xxx-8.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om3
> xxx-6.xxx.root.zzz.site:9862
>   
> {code}
>  
> OM Server has below conf:
> {code:java}
>   
> ozone.om.address.ozone1.om1
> xxx-6.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om2
> xxx-7.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om3
> xxx-8.xxx.root.zzz.site:9862
>   
> {code}
> Let's say om3 is leader which is host xxx-8.xxx.root.zzz.site:9862 as per the 
> OM server.
> Now let's say the client is talking to om1, and the server responds with om3 as 
> the suggested leader. The client then fails over to om3, which is host 
> xxx-6.xxx.root.zzz.site:9862 according to its conf, but according to the 
> server it is xxx-8.xxx.root.zzz.site:9862.
> So the client keeps talking to the follower OM at 
> xxx-6.xxx.root.zzz.site:9862, exhausts its retry count, and finally the 
> request fails.
> Thus Hive fails to work with OM HA. We need to address it.
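The failure mode above can be made concrete: the client resolves the suggested leader's nodeId against its own address map, which is ordered differently from the server's. A small sketch using the addresses from the description:

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrates the mismatch: the client maps "om3" to a different host than
 *  the server does, so failover lands on a follower. */
public class Main {
  public static void main(String[] args) {
    Map<String, String> clientConf = new HashMap<>();
    clientConf.put("om1", "xxx-7.xxx.root.zzz.site:9862");
    clientConf.put("om2", "xxx-8.xxx.root.zzz.site:9862");
    clientConf.put("om3", "xxx-6.xxx.root.zzz.site:9862");

    Map<String, String> serverConf = new HashMap<>();
    serverConf.put("om1", "xxx-6.xxx.root.zzz.site:9862");
    serverConf.put("om2", "xxx-7.xxx.root.zzz.site:9862");
    serverConf.put("om3", "xxx-8.xxx.root.zzz.site:9862");

    String suggestedLeader = "om3"; // nodeId returned by the follower

    // The client fails over to a follower while the real leader is elsewhere.
    System.out.println("client goes to: " + clientConf.get(suggestedLeader));
    System.out.println("leader is at:   " + serverConf.get(suggestedLeader));
  }
}
```

Because nodeIds are only meaningful relative to a consistent config, any fix must either make client and server configs agree per nodeId or have the server suggest the leader by address rather than by id.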






[jira] [Updated] (HDDS-3323) Hive fails to work with OM HA

2020-04-27 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3323:
---
Priority: Critical  (was: Major)

> Hive fails to work with OM HA
> -
>
> Key: HDDS-3323
> URL: https://issues.apache.org/jira/browse/HDDS-3323
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Namit Maheshwari
>Assignee: Hanisha Koneru
>Priority: Critical
>
> Hive fails to work with OM HA due to the below configuration issue:
> Please look at the below Hive client conf:
> {code:java}
>   
> ozone.om.address.ozone1.om1
> xxx-7.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om2
> xxx-8.xxx.root.zzz.site:9862
>   
>   
> ozone.om.address.ozone1.om3
> xxx-6.xxx.root.zzz.site:9862
>   
> {code}
>  
> OM Server has below conf:
> {code:xml}
> <property>
>   <name>ozone.om.address.ozone1.om1</name>
>   <value>xxx-6.xxx.root.zzz.site:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.ozone1.om2</name>
>   <value>xxx-7.xxx.root.zzz.site:9862</value>
> </property>
> <property>
>   <name>ozone.om.address.ozone1.om3</name>
>   <value>xxx-8.xxx.root.zzz.site:9862</value>
> </property>
> {code}
> Let's say om3 is the leader, which is host xxx-8.xxx.root.zzz.site:9862 as 
> per the OM server's conf.
> Now suppose the client is talking to om1 and the server responds with om3 as 
> the suggested leader. The client fails over to om3, which according to its 
> own conf is host xxx-6.xxx.root.zzz.site:9862, whereas according to the 
> server om3 is xxx-8.xxx.root.zzz.site:9862.
> As a result, the client keeps talking to a follower OM 
> (xxx-6.xxx.root.zzz.site:9862), exhausts its retry count, and the request 
> finally fails.
> Thus Hive fails to work with OM HA. We need to address it.
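The mismatch described above can be illustrated with a small, hypothetical sketch (plain Python, not Ozone code); the node IDs and host names are the placeholders from this report:

```python
# Hypothetical sketch: why mismatched nodeId -> address maps send the
# client to a follower even when the server suggests the right leader.

client_conf = {
    "om1": "xxx-7.xxx.root.zzz.site:9862",
    "om2": "xxx-8.xxx.root.zzz.site:9862",
    "om3": "xxx-6.xxx.root.zzz.site:9862",
}
server_conf = {
    "om1": "xxx-6.xxx.root.zzz.site:9862",
    "om2": "xxx-7.xxx.root.zzz.site:9862",
    "om3": "xxx-8.xxx.root.zzz.site:9862",
}

leader_id = "om3"  # actual leader, per the OM server

# The server suggests the leader by node id; the client resolves that id
# through its *own* map and lands on a different host.
client_target = client_conf[leader_id]   # resolves to xxx-6...:9862
actual_leader = server_conf[leader_id]   # really at   xxx-8...:9862

# The client ends up retrying against a follower until retries run out.
assert client_target != actual_leader
```

The sketch only demonstrates the address-resolution mismatch; the actual failover and retry logic lives in the OM client's RetryProxy.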






[jira] [Updated] (HDDS-3225) Trigger a container report on failed volume

2020-04-23 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3225:
---
Priority: Critical  (was: Major)

> Trigger a container report on failed volume
> ---
>
> Key: HDDS-3225
> URL: https://issues.apache.org/jira/browse/HDDS-3225
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>  Components: Ozone Datanode
>Reporter: Lokesh Jain
>Assignee: Lokesh Jain
>Priority: Critical
>
> The Datanode should trigger a container report on volume failure. This would 
> help SCM determine unhealthy containers faster.
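A minimal, hypothetical sketch (plain Python, not Ozone code) of the requested behavior: on volume failure, send an immediate container report instead of waiting for the next periodic one.

```python
# Hypothetical sketch of the requested Datanode behavior. Names and the
# report mechanism are illustrative placeholders, not Ozone APIs.

class Datanode:
    def __init__(self):
        self.reports_sent = 0
        self.failed_volumes = set()

    def send_container_report(self):
        # In Ozone this would ship container state to SCM; here we only
        # count the reports for illustration.
        self.reports_sent += 1

    def on_volume_failure(self, volume):
        self.failed_volumes.add(volume)
        # Trigger an immediate report so SCM learns about containers on
        # the failed volume without waiting for the periodic interval.
        self.send_container_report()

dn = Datanode()
dn.on_volume_failure("/data/disk1")
assert dn.reports_sent == 1
```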






[jira] [Updated] (HDDS-3282) ozone.http.filter.initializers can't be set properly for SPNEGO auth

2020-04-09 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-3282:
---
Priority: Blocker  (was: Major)

> ozone.http.filter.initializers can't be set properly for SPNEGO auth
> 
>
> Key: HDDS-3282
> URL: https://issues.apache.org/jira/browse/HDDS-3282
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Affects Versions: 0.5.0
>Reporter: Vivek Ratnavel Subramanian
>Assignee: Xiaoyu Yao
>Priority: Blocker
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> After HDDS-2950, we changed to using Ozone's own initializer, defined by 
> ozone.http.filter.initializers, instead of the one configured with 
> hadoop.http.filter.initializers.
> The FilterInitializer interface was also forked from hadoop-common, which 
> prevents us from using 
> org.apache.hadoop.security.AuthenticationFilterInitializer and fails with an 
> error.
> This ticket is opened to fix it.
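For reference, the failing setup looks like the usual Hadoop filter-initializer wiring pointed at Ozone's key (a sketch; exact deployment values may differ):

```xml
<!-- Hypothetical example: configuring the hadoop-common SPNEGO filter
     initializer through Ozone's key. After HDDS-2950 this fails because
     Ozone's forked FilterInitializer interface does not accept the
     hadoop-common class. -->
<property>
  <name>ozone.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
```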






[jira] [Updated] (HDDS-2891) Apache NiFi PutFile processor is failing with secure Ozone S3G

2020-01-15 Thread Jitendra Nath Pandey (Jira)


 [ 
https://issues.apache.org/jira/browse/HDDS-2891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jitendra Nath Pandey updated HDDS-2891:
---
Reporter: Ifigeneia Derekli  (was: Marton Elek)

> Apache NiFi PutFile processor is failing with secure Ozone S3G
> --
>
> Key: HDDS-2891
> URL: https://issues.apache.org/jira/browse/HDDS-2891
> Project: Hadoop Distributed Data Store
>  Issue Type: Bug
>Reporter: Ifigeneia Derekli
>Assignee: Marton Elek
>Priority: Critical
>
>  
> (1) Create a simple PutS3Object processor in NiFi
> (2) The request from NiFi to S3g will fail with HTTP 500
> (3) The exception in the s3g log:
>  
> {code:java}
>  s3g_1   | Caused by: java.io.IOException: Couldn't create RpcClient 
> protocol
> s3g_1   | at 
> org.apache.hadoop.ozone.client.OzoneClientFactory.getClientProtocol(OzoneClientFactory.java:197)
> s3g_1   | at 
> org.apache.hadoop.ozone.client.OzoneClientFactory.getClientProtocol(OzoneClientFactory.java:173)
> s3g_1   | at 
> org.apache.hadoop.ozone.client.OzoneClientFactory.getClient(OzoneClientFactory.java:74)
> s3g_1   | at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.getClient(OzoneClientProducer.java:114)
> s3g_1   | at 
> org.apache.hadoop.ozone.s3.OzoneClientProducer.createClient(OzoneClientProducer.java:71)
> s3g_1   | at 
> jdk.internal.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
> s3g_1   | at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> s3g_1   | at 
> java.base/java.lang.reflect.Method.invoke(Method.java:566)
> s3g_1   | at 
> org.jboss.weld.injection.StaticMethodInjectionPoint.invoke(StaticMethodInjectionPoint.java:88)
> s3g_1   | ... 92 more
> s3g_1   | Caused by: 
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken):
>  Invalid S3 identifier:OzoneToken owner=testuser/s...@example.com, renewer=, 
> realUser=, issueDate=0, maxDate=0, sequenceNumber=0, masterKeyId=0, 
> strToSign=AWS4-HMAC-SHA256
> s3g_1   | 20200115T101329Z
> s3g_1   | 20200115/us-east-1/s3/aws4_request
> s3g_1   | (hash), signature=(sign), 
> awsAccessKeyId=testuser/s...@example.com{code}
>  


