[ https://issues.apache.org/jira/browse/HDFS-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jing Zhao updated HDFS-2656:
----------------------------
    Attachment: teragen_terasort_teravalidate_performance.png
                HDFS-2656.patch

An update on libwebhdfs.

The main change is to keep the same write semantics as libhdfs (thanks to Zhanwei for pointing that out): once a client opens a file for writing or appending, other clients should not be able to open the same file for writing until that client closes it. This is achieved by maintaining a single HTTP connection to the corresponding datanode between the open and close operations. The patch also addresses part of Nicholas's comments by removing some of the unnecessary memory copying.

For the performance measurement, before directly comparing libhdfs and libwebhdfs, we first ran teragen, terasort, and teravalidate on a 3-node mini cluster (data size 100,000,000) and compared the performance of DFSClient and WebHdfs. The measurement result is attached. The main performance bottleneck for WebHdfs appears to be reading (teravalidate), which is more than 3 times slower than with DFSClient. This may be because WebHdfs currently uses a datanode as a proxy node for reading data, even when the read crosses block boundaries.

In the next step, my work will focus on:
1) testing, fixing, and improving the current code;
2) developing a smarter reading mechanism on the client side (i.e., identifying the block locations of a large file on the client side); and
3) improving client read performance by reducing the number of HTTP connections that are created.

Looking forward to your comments!
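To make item 2 more concrete, here is a minimal sketch (not code from the attached patch) of how a client might split one large read at block boundaries and issue a separate WebHDFS OPEN request, with its own offset and length, for each piece. The names read_piece, split_read_by_block, and format_open_url are hypothetical, and the sketch assumes a uniform block size; a real client would take the actual block boundaries and locations from the namenode.

```c
#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

/* One block-aligned piece of a larger read (hypothetical type). */
typedef struct {
    int64_t offset;  /* absolute file offset where this piece starts */
    int64_t length;  /* number of bytes in this piece */
} read_piece;

/*
 * Split the byte range [offset, offset + length) at multiples of block_size.
 * Stores at most max_pieces entries in out and returns the total number of
 * pieces the range needs, or -1 on invalid arguments.
 * Assumption: every block has the same size, which is a simplification.
 */
int split_read_by_block(int64_t offset, int64_t length, int64_t block_size,
                        read_piece *out, int max_pieces)
{
    if (offset < 0 || length < 0 || block_size <= 0 ||
        out == NULL || max_pieces < 0)
        return -1;

    int n = 0;
    int64_t pos = offset;
    int64_t end = offset + length;

    while (pos < end) {
        /* First byte of the block that follows the one containing pos. */
        int64_t block_end = (pos / block_size + 1) * block_size;
        int64_t piece_end = block_end < end ? block_end : end;

        if (n < max_pieces) {
            out[n].offset = pos;
            out[n].length = piece_end - pos;
        }
        n++;
        pos = piece_end;
    }
    return n;
}

/*
 * Format a WebHDFS OPEN URL for one piece. path must start with '/'.
 * Returns what snprintf returns: the length the full URL needs.
 */
int format_open_url(char *buf, size_t buflen, const char *host, int port,
                    const char *path, const read_piece *piece)
{
    return snprintf(buf, buflen,
                    "http://%s:%d/webhdfs/v1%s?op=OPEN&offset=%lld&length=%lld",
                    host, port, path,
                    (long long)piece->offset, (long long)piece->length);
}
```

For example, with an artificially small block size of 128 bytes, a 300-byte read starting at offset 100 splits into four pieces of 28, 128, 128, and 16 bytes. Each piece can then be requested on its own, ideally over a reused connection when consecutive pieces are served by the same host.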
> Implement a pure c client based on webhdfs
> ------------------------------------------
>
>                 Key: HDFS-2656
>                 URL: https://issues.apache.org/jira/browse/HDFS-2656
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: webhdfs
>            Reporter: Zhanwei.Wang
>         Attachments: HDFS-2656.patch, HDFS-2656.patch, HDFS-2656.unfinished.patch, teragen_terasort_teravalidate_performance.png
>
>
> Currently, the implementation of libhdfs is based on JNI. The overhead of JVM seems a little big, and libhdfs can also not be used in the environment without hdfs.
> It seems a good idea to implement a pure c client by wrapping webhdfs. It also can be used to access different version of hdfs.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira