[ 
https://issues.apache.org/jira/browse/HDDS-2443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17033169#comment-17033169
 ] 

mingchao zhao commented on HDDS-2443:
-------------------------------------

Hi [~cxorm], any progress on the previous question? Here's what I found:

I had a look at how pyarrow's connect() executes. Pyarrow's connect() 
uses libhdfs's (JNI-based) 
[hdfsConnect|https://github.com/apache/arrow/blob/207b3507be82e92ebf29ec7d6d3b0bb86091c09a/python/pyarrow/hdfs.py#L206].
 Here are some issues I found:
The first time this method is called in a process, it takes a long time 
to load the library.
In my test, each operation started a separate process, which then connected and 
uploaded. Each connect costs about 1.5 seconds. If a user's scenario is 
the same as mine, their operations will be slow too. We also tested the AWS Python 
client (boto3), and boto3 performed much better under the same conditions.

*It would be much better if the user created the connection only once and then 
reused it.* I've tested reusing the connection, and performance improved 
tremendously:
Test cluster (using the pyarrow client): 9 physical machines, each with 10 HDD 
disks; 1 as master for OM and SCM, 8 as datanodes.

|Upload files|Total size|Multi Raft latency (s), reuse connect|Multi Raft latency (s), no reuse connect|
|100KB * 1000 files|100MB|151.858362913|2471.23463202|
|100KB * 20000 files|2GB|2482.97329998 (=~0.69h)|49398.845176 (=~13.7h)|
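
As a rough check on the table, the per-file latency and the speedup from reusing the connection can be derived directly from the measured totals. The sketch below only redoes that arithmetic; the names (runs, summarize, summary) are mine and are not part of the test scripts.

```python
# Total latencies in seconds, copied from the table above.
runs = {
    1000: {"reuse": 151.858362913, "no_reuse": 2471.23463202},
    20000: {"reuse": 2482.97329998, "no_reuse": 49398.845176},
}


def summarize(n_files, reuse_s, no_reuse_s):
    """Derive per-file latency, hours, and speedup from the total latencies."""
    return {
        "reuse_per_file_s": reuse_s / n_files,
        "no_reuse_per_file_s": no_reuse_s / n_files,
        "reuse_hours": reuse_s / 3600,
        "no_reuse_hours": no_reuse_s / 3600,
        "speedup": no_reuse_s / reuse_s,
    }


summary = {n: summarize(n, r["reuse"], r["no_reuse"]) for n, r in runs.items()}

for n, s in summary.items():
    print(
        f"{n} files: {s['reuse_per_file_s']:.3f} s/file with reuse, "
        f"{s['no_reuse_per_file_s']:.3f} s/file without, "
        f"speedup {s['speedup']:.1f}x"
    )
```

With these numbers, reusing the connection is roughly 16x faster for 1000 files and roughly 20x faster for 20000 files, and the cost without reuse stays at about 2.5 seconds per file in both runs.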

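The "connect once, then reuse" pattern from the test above could look roughly like the following. This is a minimal sketch, not the actual test script: ConnectionCache, fake_connect and upload_all are hypothetical names, and fake_connect is only a stand-in so the example runs without a cluster. In real use I would expect pyarrow's connect() to be passed in instead, assuming the pyarrow version in use still provides it.

```python
import time


class ConnectionCache:
    """Keeps one connection per (host, port) so the connect cost is paid once."""

    def __init__(self, connect_fn):
        self._connect_fn = connect_fn
        self._connections = {}

    def get(self, host, port):
        key = (host, port)
        if key not in self._connections:
            # Only the first call for this key pays the library-load/connect cost.
            self._connections[key] = self._connect_fn(host, port)
        return self._connections[key]


def fake_connect(host, port):
    """Hypothetical stand-in for a slow connect call (assumption, not pyarrow)."""
    fake_connect.calls += 1
    time.sleep(0.01)  # simulated cost; the real one was about 1.5 s in my test
    return {"host": host, "port": port}


fake_connect.calls = 0


def upload_all(cache, host, port, n_files):
    """Upload n_files through a single cached connection; returns the count."""
    uploaded = 0
    for _ in range(n_files):
        fs = cache.get(host, port)
        # Real code would write through fs here, e.g. open a path and write bytes.
        assert fs["host"] == host
        uploaded += 1
    return uploaded


cache = ConnectionCache(fake_connect)
uploaded = upload_all(cache, "om-host", 9862, 5)  # host and port are placeholders
print(f"uploaded {uploaded} files with {fake_connect.calls} connect call(s)")
```

Note that this only helps when the uploads run inside one long-lived process. If each operation starts a separate process, as in my first test, every process still pays the connect cost.
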
> Python client/interface for Ozone
> ---------------------------------
>
>                 Key: HDDS-2443
>                 URL: https://issues.apache.org/jira/browse/HDDS-2443
>             Project: Hadoop Distributed Data Store
>          Issue Type: New Feature
>          Components: Ozone Client
>            Reporter: Li Cheng
>            Priority: Major
>         Attachments: Ozone with pyarrow.html, OzoneS3.py
>
>
> This Jira will be used to track development for python client/interface of 
> Ozone.
> Original ideas: item#25 in 
> [https://cwiki.apache.org/confluence/display/HADOOP/Ozone+project+ideas+for+new+contributors]
> Ozone Client(Python) for Data Science Notebook such as Jupyter.
>  # Size: Large
>  # PyArrow: [https://pypi.org/project/pyarrow/]
>  # Python -> libhdfs HDFS JNI library (HDFS, S3,...) -> Java client API 
> Impala uses  libhdfs
> Path to try:
>  # s3 interface: Ozone s3 gateway(already supported) + AWS python client 
> (boto3)
>  # python native RPC
>  # pyarrow + libhdfs, which uses the Java client under the hood.
>  # python + C interface of go / rust ozone library. I created POC go / rust 
> clients earlier which can be improved if the libhdfs interface is not good 
> enough. [By [~elek]]



