[ 
https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Zhao updated HDFS-2656:
----------------------------

    Attachment: teragen_terasort_teravalidate_performance.png
                HDFS-2656.patch

Some updates on libwebhdfs. The main change is to keep the same write 
semantics as libhdfs (thanks to Zhanwei for pointing that out), i.e., once a 
client opens a file for writing/appending, other clients should not be able to 
open the same file for writing until that client closes it. This is achieved 
by maintaining a single HTTP connection to the corresponding datanode between 
the open and close operations. The patch also addresses part of Nicholas's 
comments by removing some of the unnecessary memory copying. 

For the performance measurement, before directly comparing libhdfs with 
libwebhdfs, we first ran teragen, terasort, and teravalidate on a 3-node mini 
cluster (data size 100,000,000) and compared the performance of DFSClient and 
WebHdfs. The measurement results are attached. The main performance 
bottleneck for webhdfs appears to be reading (teravalidate), which takes more 
than 3 times as long as with DFSClient. This may be because webhdfs currently 
uses a single datanode as a proxy for reading data, even when the read crosses 
block boundaries.

So in the next step, my work will focus on 1) testing, fixing, and improving 
the current code, 2) developing a smarter reading mechanism on the client side 
(i.e., identifying the block locations of a large file on the client side), 
and 3) improving client read performance by reducing the number of HTTP 
connections that have to be created.
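
As a rough illustration of item 2, the sketch below splits a read at block 
boundaries so that each piece could be requested separately. It is not taken 
from the attached patch: the names block_range, split_read_by_block and 
format_open_url are hypothetical, a fixed block size is assumed (a real client 
would use the actual block locations of the file), and the host and port are 
placeholders. The only WebHDFS feature relied on is the OPEN operation with 
its offset and length parameters.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* One contiguous piece of a read that lies entirely inside a single block. */
typedef struct {
    int64_t offset;  /* absolute file offset where the piece starts */
    int64_t length;  /* number of bytes in the piece */
} block_range;

/*
 * Split the read [offset, offset + length) at multiples of block_size.
 * ASSUMPTION: every block has the same size; real block boundaries would
 * come from the file's block-location metadata.
 * Returns the number of ranges written to out, or -1 if the arguments are
 * invalid or out cannot hold all ranges.
 */
int split_read_by_block(int64_t offset, int64_t length, int64_t block_size,
                        block_range *out, int max_ranges)
{
    if (offset < 0 || length < 0 || block_size <= 0 || max_ranges < 0)
        return -1;
    if (length > INT64_MAX - offset)
        return -1;  /* offset + length would overflow */

    int64_t pos = offset;
    int64_t end = offset + length;
    int count = 0;

    while (pos < end) {
        /* Bytes left before the next block boundary, and left in the read. */
        int64_t left_in_block = block_size - (pos % block_size);
        int64_t left_in_read = end - pos;
        int64_t piece = left_in_block < left_in_read ? left_in_block
                                                     : left_in_read;
        if (count == max_ranges)
            return -1;  /* caller's array is too small */
        out[count].offset = pos;
        out[count].length = piece;
        count++;
        pos += piece;
    }
    return count;
}

/*
 * Format a WebHDFS OPEN URL for one range. path must start with '/'.
 * host and port are placeholders chosen by the caller.
 * Returns the URL length, or -1 if buf is too small.
 */
int format_open_url(char *buf, size_t buf_size, const char *host, int port,
                    const char *path, block_range r)
{
    int n = snprintf(buf, buf_size,
                     "http://%s:%d/webhdfs/v1%s?op=OPEN"
                     "&offset=%" PRId64 "&length=%" PRId64,
                     host, port, path, r.offset, r.length);
    return (n >= 0 && (size_t)n < buf_size) ? n : -1;
}
```

For example, with a 128-byte block size a read of 200 bytes starting at offset 
100 is split into three pieces starting at offsets 100, 128 and 256, each of 
which can be turned into its own OPEN request.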

Looking forward to your comments!
                
> Implement a pure c client based on webhdfs
> ------------------------------------------
>
>                 Key: HDFS-2656
>                 URL: https://issues.apache.org/jira/browse/HDFS-2656
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: webhdfs
>            Reporter: Zhanwei.Wang
>         Attachments: HDFS-2656.patch, HDFS-2656.patch, 
> HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png
>
>
> Currently, libhdfs is implemented on top of JNI. The JVM overhead seems 
> rather large, and libhdfs cannot be used in an environment 
> without hdfs.
> It seems a good idea to implement a pure C client by wrapping webhdfs. It 
> could also be used to access different versions of hdfs.


        
