Hi everyone,

My name is Sunyu Duan, and I am this year's GSoC student working on Parquet.

As most of the work is now done, I am writing this report to summarize
what I have done and the results.

My project is "Use the Zero-Copy read path in the new Hadoop API". The
goal is to exploit the Zero-Copy API introduced by Hadoop to improve the
read performance of Parquet tasks running locally. My contribution is to
replace the byte-array-based API with a ByteBuffer-based API on the read
path, avoiding byte array copies while staying compatible with the old
APIs. Here is the complete pull request:
https://github.com/apache/incubator-parquet-mr/pull/6

My work consists of two parts.

   1. Make the whole read path use ByteBuffer directly.


   - Introduce a ByteBuffer-based initFromPage method on ValuesReader and
   implement it in each concrete reader (a minimal sketch follows this
   list).
   - Introduce a ByteBufferInputStream.
   - Introduce a ByteBufferBytesInput.
   - Replace the unpack8Values method with a ByteBuffer version.
   - Use the new ByteBuffer-based methods throughout the read path.


   2. Introduce a compatibility layer to stay compatible with the old
   Hadoop API


   - Introduce a CompatibilityUtil.
   - Use CompatibilityUtil to perform the actual reads (sketched below).
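
For the first part, here is a minimal sketch of what the ByteBuffer-based
initFromPage interface looks like in spirit. The class and method names
below are illustrative only; the real signatures are in the pull request.

import java.nio.ByteBuffer;

// Illustrative sketch, not the actual parquet-mr code.
abstract class ValuesReaderSketch {
  // Old byte[] entry point, kept for compatibility: it wraps the array
  // so existing callers keep working without an extra copy.
  public void initFromPage(int valueCount, byte[] page, int offset) {
    initFromPage(valueCount, ByteBuffer.wrap(page), offset);
  }

  // New entry point: the buffer can wrap a memory region handed back by
  // an HDFS zero-copy read, so no byte array copy is needed.
  public abstract void initFromPage(int valueCount, ByteBuffer page,
      int offset);
}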

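For the second part, the idea of the compatibility layer is to probe once
for the ByteBuffer read method that newer Hadoop versions provide and to
fall back to byte[] reads otherwise. The sketch below is my own
illustration under that assumption, not the code from the patch:

import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative compatibility shim, not the actual CompatibilityUtil.
final class CompatibilityUtilSketch {
  private static final boolean HAS_BYTEBUFFER_READ = probe();

  private static boolean probe() {
    try {
      // read(ByteBuffer) exists only on newer Hadoop versions.
      FSDataInputStream.class.getMethod("read", ByteBuffer.class);
      return true;
    } catch (NoSuchMethodException e) {
      return false;
    }
  }

  static int read(FSDataInputStream in, ByteBuffer buf) throws IOException {
    if (HAS_BYTEBUFFER_READ) {
      return in.read(buf); // fills the buffer directly, no copy
    }
    // Old API: read into a temporary array, then copy into the buffer.
    byte[] tmp = new byte[buf.remaining()];
    int n = in.read(tmp, 0, tmp.length);
    if (n > 0) {
      buf.put(tmp, 0, n);
    }
    return n;
  }
}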


After coding, I benchmarked the improvement. After discussing with my
mentor, I modified the TestInputOutputFormat test to inherit from
ClusterMapReduceTestCase, which starts a MiniCluster for the unit test. In
the unit test, I enabled HDFS caching and short-circuit reads (a
configuration sketch follows this paragraph).
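
The keys below are standard HDFS client/DataNode settings; the socket
path and cache size are just example values, not the ones from my test:

import org.apache.hadoop.conf.Configuration;

// Illustrative settings; values are examples, not from the actual test.
static Configuration shortCircuitConf() {
  Configuration conf = new Configuration();
  // Let the client bypass the DataNode and read local blocks directly.
  conf.setBoolean("dfs.client.read.shortcircuit", true);
  conf.set("dfs.domain.socket.path", "/var/run/hdfs/dn_socket");
  // Let the DataNode lock memory for the HDFS centralized cache (64MB).
  conf.setLong("dfs.datanode.max.locked.memory", 64L * 1024 * 1024);
  return conf;
}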
I created a 500MB and a 1GB log file on my dev box for the test. The test
reads the log file and writes it to a temporary Parquet file using
MapReduce, then reads that Parquet file back and writes it to an output
file. I instrumented the second MapReduce job with a timer and used the
time it spent as the indicator (a sketch of the timing code follows). I
ran the unit test with and without the Zero-Copy API enabled on the 500MB
and 1GB log files and compared the time spent in each case. The results
are shown in the table below.
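
The timing itself is simple, roughly like this (the job setup is elided,
and the method name is my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Illustrative timing wrapper around the second (Parquet-reading) job.
static long timeReadJob(Configuration conf) throws Exception {
  Job readJob = Job.getInstance(conf, "parquet-read-benchmark");
  // ... set input/output formats, mapper, and paths here ...
  long start = System.currentTimeMillis();
  readJob.waitForCompletion(true); // run the read job to completion
  return (System.currentTimeMillis() - start) / 1000; // elapsed seconds
}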



                        File Size   Average Reading Time (s)   Improvement
Without Zero-Copy API   500MB       576                        -
Zero-Copy API           500MB       394                        46%
Without Zero-Copy API   1024MB      1080                       -
Zero-Copy API           1024MB      781                        38%



As the table shows, the Zero-Copy read path gives about a 38-46%
improvement in reading performance, which indicates the project has
reached its goal. The benchmark is still limited, though: my dev box has
very constrained resources, and a 1GB file is the largest I could test.
After GSoC, it would be good to invite more people to try it out on a
real cluster with larger files to benchmark its effect in realistic
conditions.


Best,

Sunyu
