Hi Julien and Jacques,

I saw there is another thread about ByteBuffer based reading at
https://issues.apache.org/jira/browse/PARQUET-77 which is similar to my
commits, and it also introduces a CompatibilityUtil.java in Parquet. I
think we can combine the code there with my pull request on GitHub.

Best,
Sunyu


On Fri, Aug 15, 2014 at 11:21 PM, sunyu duan <[email protected]> wrote:

> Thank you! I've updated the pull request to enable the enforcer plugin and
> added some compatibility interfaces so it won't break the compatibility
> tests. I think the pull request is now ready to merge.
> I'm waiting for comments on the code in case it still needs improvement
> before being merged. And I'd be really happy if you could try it out on a
> real cluster.
>
>
> On Thu, Aug 14, 2014 at 11:56 PM, Jacques Nadeau <[email protected]>
> wrote:
>
>> Hi Sunyu,
>>
>> Nice work!  We've been working with your patch and enhancing it for
>> incorporation into Apache Drill.  What do you think the timeline and steps
>> are to get this into master?  We'd be more than happy to help depending on
>> your time for this in the coming weeks.
>>
>> thanks,
>> Jacques
>>
>>
>>
>>
>> On Thu, Aug 14, 2014 at 8:30 AM, sunyu duan <[email protected]> wrote:
>>
>> > Hi everyone,
>> >
>> > My name is Sunyu Duan, and I am this year's GSoC student working on
>> > Parquet.
>> >
>> > As most of the work has been done, I wrote this report to summarize what
>> > I have done and the results.
>> >
>> > My project is Using the Zero-Copy read path in the new Hadoop API. The
>> > goal is to exploit the Zero-Copy API introduced by Hadoop to improve the
>> > read performance of Parquet tasks running locally. My contribution is to
>> > replace the byte array based API with a ByteBuffer based API in the read
>> > path, avoiding byte array copies while staying compatible with the old
>> > APIs. Here is the complete pull request:
>> > https://github.com/apache/incubator-parquet-mr/pull/6
>> >
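>> > For reference, this is roughly the Hadoop 2 zero-copy read call the
>> > project builds on (the enhanced ByteBuffer access API added in Hadoop 2);
>> > the file path and read size below are just illustrative, not taken from
>> > the patch:
>> >
>> > import java.nio.ByteBuffer;
>> > import java.util.EnumSet;
>> > import org.apache.hadoop.conf.Configuration;
>> > import org.apache.hadoop.fs.FSDataInputStream;
>> > import org.apache.hadoop.fs.FileSystem;
>> > import org.apache.hadoop.fs.Path;
>> > import org.apache.hadoop.fs.ReadOption;
>> > import org.apache.hadoop.io.ElasticByteBufferPool;
>> >
>> > public class ZeroCopyReadSketch {
>> >   public static void main(String[] args) throws Exception {
>> >     FileSystem fs = FileSystem.get(new Configuration());
>> >     try (FSDataInputStream in = fs.open(new Path("/data/test.parquet"))) {
>> >       ElasticByteBufferPool pool = new ElasticByteBufferPool();
>> >       // for cached local blocks this returns an mmap'ed region (no
>> >       // copy); it may return null at end of stream
>> >       ByteBuffer buf =
>> >           in.read(pool, 1 << 20, EnumSet.of(ReadOption.SKIP_CHECKSUMS));
>> >       if (buf != null) {
>> >         try {
>> >           // ... decode pages directly from buf ...
>> >         } finally {
>> >           in.releaseBuffer(buf); // must hand the buffer back to Hadoop
>> >         }
>> >       }
>> >     }
>> >   }
>> > }
>> >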
>> > My work includes two parts.
>> >
>> >    1. Make the whole read path use ByteBuffer directly (a sketch of the
>> >    ByteBufferInputStream idea follows this list).
>> >
>> >
>> >    - Introduce an initFromPage interface in ValuesReader and implement
>> >    it in each ValuesReader implementation.
>> >    - Introduce a ByteBufferInputStream.
>> >    - Introduce a ByteBufferBytesInput.
>> >    - Replace the unpack8Values method with a ByteBuffer version.
>> >    - Use the introduced ByteBuffer based methods in the read path.
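>> >
>> > A minimal sketch of the ByteBufferInputStream idea; the details here are
>> > illustrative rather than the exact code in the pull request:
>> >
>> > import java.io.IOException;
>> > import java.io.InputStream;
>> > import java.nio.ByteBuffer;
>> >
>> > public class ByteBufferInputStream extends InputStream {
>> >   private final ByteBuffer buffer; // wraps the page bytes, no copy
>> >
>> >   public ByteBufferInputStream(ByteBuffer buffer) {
>> >     this.buffer = buffer;
>> >   }
>> >
>> >   @Override
>> >   public int read() throws IOException {
>> >     return buffer.hasRemaining() ? (buffer.get() & 0xFF) : -1;
>> >   }
>> >
>> >   @Override
>> >   public int read(byte[] dst, int off, int len) throws IOException {
>> >     if (!buffer.hasRemaining()) return -1;
>> >     len = Math.min(len, buffer.remaining());
>> >     buffer.get(dst, off, len); // copies only what the caller asks for
>> >     return len;
>> >   }
>> > }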
>> >
>> >
>> >    2. Introduce a compatibility layer to stay compatible with the old
>> >    Hadoop API (a sketch follows this list).
>> >
>> >
>> >    - Introduce a CompatibilityUtil.
>> >    - Use the CompatibilityUtil to perform the read action.
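>> >
>> > The compatibility idea, roughly: probe by reflection for the Hadoop 2
>> > ByteBuffer read method and fall back to the byte array API on Hadoop 1.
>> > The method name getBuf below is an assumption for illustration, not
>> > necessarily what the patch uses:
>> >
>> > import java.io.IOException;
>> > import java.lang.reflect.Method;
>> > import java.nio.ByteBuffer;
>> > import org.apache.hadoop.fs.FSDataInputStream;
>> >
>> > public class CompatibilityUtil {
>> >   // null when running against Hadoop 1.x
>> >   private static final Method READ_BB;
>> >
>> >   static {
>> >     Method m = null;
>> >     try {
>> >       // read(ByteBuffer) only exists in the Hadoop 2 stream API
>> >       m = FSDataInputStream.class.getMethod("read", ByteBuffer.class);
>> >     } catch (NoSuchMethodException e) {
>> >       // old Hadoop: keep m == null and use the byte[] path
>> >     }
>> >     READ_BB = m;
>> >   }
>> >
>> >   public static int getBuf(FSDataInputStream in, ByteBuffer buf,
>> >       int maxSize) throws IOException {
>> >     if (READ_BB != null) {
>> >       try {
>> >         return (Integer) READ_BB.invoke(in, buf); // Hadoop 2 path
>> >       } catch (Exception e) {
>> >         throw new IOException(e);
>> >       }
>> >     }
>> >     // Hadoop 1 fallback: read into a byte[] and copy into the buffer
>> >     byte[] tmp = new byte[maxSize];
>> >     int n = in.read(tmp, 0, maxSize);
>> >     if (n > 0) buf.put(tmp, 0, n);
>> >     return n;
>> >   }
>> > }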
>> >
>> >
>> >
>> > After coding, I started to benchmark the improvement. After discussing
>> > with my mentor, I modified the TestInputOutputFormat test to inherit
>> > from ClusterMapReduceTestCase, which starts a MiniCluster for the unit
>> > test. In the unit test, I enabled caching and short-circuit reads; a
>> > sketch of the configuration follows this paragraph. I created a 500MB
>> > and a 1GB log file on my dev box for the test. The test reads in the log
>> > file and writes it to a temporary Parquet file using MapReduce, then
>> > reads from the temporary Parquet file and writes to an output file. I
>> > inserted a time counter into the latter MapReduce task and used the time
>> > spent on the second MapReduce job as the indicator. I ran the unit test
>> > with and without the Zero-Copy API enabled on the 500MB and the 1GB log
>> > file and compared the time spent in each case. The results are shown in
>> > the table below.
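>> >
>> > The client-side settings for the short-circuit read path were roughly
>> > the following; the property names are the standard HDFS keys, but the
>> > socket path is just an example:
>> >
>> > import org.apache.hadoop.conf.Configuration;
>> >
>> > public class BenchmarkConf {
>> >   public static Configuration create() {
>> >     Configuration conf = new Configuration();
>> >     // serve local reads directly in the client, bypassing the DataNode
>> >     conf.setBoolean("dfs.client.read.shortcircuit", true);
>> >     conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
>> >     return conf;
>> >   }
>> > }
>> >
>> > Caching of the input file is set up on the HDFS side, e.g. with
>> > hdfs cacheadmin -addPool / -addDirective.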
>> >
>> >
>> >
>> >                             File Size    Average Reading Time (s)    Improvement
>> > Without Zero-Copy API       500MB        576s                        -
>> > Zero-Copy API               500MB        394s                        46%
>> > Without Zero-Copy API       1024MB       1080s                       -
>> > Zero-Copy API               1024MB       781s                        38%
>> >
>> >
>> >
>> > As we can see, there is about a 30-50% improvement in reading
>> > performance, which shows the project has reached its goal. But the
>> > benchmark is limited: my dev box has very limited resources, and a 1GB
>> > file is the largest I can use. After GSoC, it would be good to have more
>> > people try it out on a real cluster with larger files to benchmark its
>> > effect in realistic conditions.
>> >
>> >
>> > Best,
>> >
>> > Sunyu
>> >
>>
>
>
