Rather than a patch, it would be great if you could link to a pull request based on Sunyu's. It has the added benefit of tracking Sunyu's commits and giving credit to the author of each part.
On Tue, Aug 26, 2014 at 10:25 PM, Parth Chandra <[email protected]> wrote:

> Hi Sunyu,
> I logged the JIRA, and the patch I attached is based off of your GSOC
> work. We would like to have it included in the same pull request, but
> there is some work that needs to be done on the patch I submitted.
> In particular, I think we need to solve the issue of backward
> compatibility for the direct decompressor implementation for the
> Snappy decompressor.
> Hoping for some feedback on the best way to address that.
> Thanks
>
> Parth
>
>
> On Tue, Aug 26, 2014 at 6:52 PM, sunyu duan <[email protected]> wrote:
>
> > Hi Julien and Jacques,
> >
> > I saw there seems to be another thread on ByteBuffer-based reading at
> > https://issues.apache.org/jira/browse/PARQUET-77, which is similar to
> > my commits, and it also introduces a CompatibilityUtil.java in
> > Parquet. I think we can combine the code there with my pull request
> > on GitHub.
> >
> > Best,
> > Sunyu
> >
> >
> > On Fri, Aug 15, 2014 at 11:21 PM, sunyu duan <[email protected]> wrote:
> >
> > > Thank you! I've updated the pull request to enable the enforcer
> > > plugin and added a compatibility interface so it won't break the
> > > compatibility test. I think the pull request is now ready to merge.
> > > I'm waiting for comments on the code in case it still needs some
> > > improvement before being merged. And I'd be really happy if you
> > > could try it out on a real cluster.
> > >
> > >
> > > On Thu, Aug 14, 2014 at 11:56 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > >> Hi Sunyu,
> > >>
> > >> Nice work! We've been working with your patch and enhancing it for
> > >> incorporation into Apache Drill. What do you think the timeline and
> > >> steps are to get this into master? We'd be more than happy to help,
> > >> depending on your time for this in the coming weeks.
> > >>
> > >> thanks,
> > >> Jacques
> > >>
> > >>
> > >> On Thu, Aug 14, 2014 at 8:30 AM, sunyu duan <[email protected]> wrote:
> > >>
> > >> > Hi everyone,
> > >> >
> > >> > My name is Sunyu Duan, this year's GSOC student working on
> > >> > Parquet.
> > >> >
> > >> > As most of the work has been done, I wrote this report to
> > >> > summarize what I have done and the results.
> > >> >
> > >> > My project is Using the Zero-Copy Read Path in the New Hadoop
> > >> > API. The goal is to exploit the Zero-Copy API introduced by
> > >> > Hadoop to improve the read performance of Parquet tasks running
> > >> > locally. My contribution is to replace the byte-array-based API
> > >> > with a ByteBuffer-based API in the read path, avoiding byte
> > >> > array copies while staying compatible with the old APIs. Here is
> > >> > the complete pull request:
> > >> > https://github.com/apache/incubator-parquet-mr/pull/6
> > >> >
> > >> > My work includes two parts (see the sketches after this list).
> > >> >
> > >> > 1. Make the whole read path use ByteBuffer directly.
> > >> >
> > >> >    - Introduce an initFromPage interface in ValueRead and
> > >> >      implement it in each ValueReader.
> > >> >    - Introduce a ByteBufferInputStream.
> > >> >    - Introduce a ByteBufferBytesInput.
> > >> >    - Replace the unpack8values method with a ByteBuffer version.
> > >> >    - Use the introduced ByteBuffer-based methods in the read
> > >> >      path.
> > >> >
> > >> > 2. Introduce a compatibility layer to stay compatible with the
> > >> >    old Hadoop API.
> > >> >
> > >> >    - Introduce a CompatibilityUtil.
> > >> >    - Use the CompatibilityUtil to perform the read.
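A minimal sketch of the kind of ByteBuffer plumbing part 1 describes (the
class name matches the report; the body is illustrative, not the actual
patch):

    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // An InputStream view over a ByteBuffer, so stream-based readers can
    // consume a page without first copying it into a byte[].
    public class ByteBufferInputStream extends InputStream {
      private final ByteBuffer buf;

      public ByteBufferInputStream(ByteBuffer buf) {
        this.buf = buf.slice(); // leave the caller's position/limit alone
      }

      @Override
      public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
      }

      @Override
      public int read(byte[] dst, int off, int len) {
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);
        return n;
      }

      @Override
      public int available() {
        return buf.remaining();
      }
    }

And a rough sketch of the idea behind part 2's CompatibilityUtil: probe
once, via reflection, for the Hadoop 2 zero-copy read
(FSDataInputStream.read(ByteBufferPool, int, EnumSet<ReadOption>)) and
fall back to a plain byte[] read when the API is absent. The method names
other than Hadoop's are hypothetical, and this assumes compiling against
Hadoop 2 while tolerating an older runtime:

    import java.io.IOException;
    import java.lang.reflect.Method;
    import java.nio.ByteBuffer;
    import java.util.EnumSet;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.ReadOption;

    public final class CompatibilityUtil {
      // Resolved once; null means the zero-copy API is missing at runtime.
      private static final Method ZERO_COPY_READ = probeZeroCopyRead();

      private static Method probeZeroCopyRead() {
        try {
          Class<?> poolCls =
              Class.forName("org.apache.hadoop.io.ByteBufferPool");
          return FSDataInputStream.class.getMethod(
              "read", poolCls, int.class, EnumSet.class);
        } catch (ReflectiveOperationException e) {
          return null; // old Hadoop
        }
      }

      // pool is typed Object so this class links on old Hadoop too;
      // callers pass an org.apache.hadoop.io.ByteBufferPool on Hadoop 2.
      public static ByteBuffer getBuf(FSDataInputStream in, Object pool,
          int maxSize) throws IOException {
        if (ZERO_COPY_READ != null) {
          try {
            // Zero-copy path: may hand back a slice of an mmapped block.
            return (ByteBuffer) ZERO_COPY_READ.invoke(in, pool, maxSize,
                EnumSet.noneOf(ReadOption.class));
          } catch (ReflectiveOperationException e) {
            throw new IOException("zero-copy read failed", e);
          }
        }
        // Fallback: one copy into a heap buffer, as before the patch.
        byte[] copy = new byte[maxSize];
        int n = in.read(copy, 0, maxSize);
        return n < 0 ? null : ByteBuffer.wrap(copy, 0, n);
      }

      private CompatibilityUtil() {}
    }

A buffer obtained through the zero-copy path must eventually be returned
via FSDataInputStream.releaseBuffer(ByteBuffer), which is one reason a
ByteBufferBytesInput that remembers its owning stream is useful.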
> > >> >
> > >> > After coding, I started to benchmark the improvement. After
> > >> > discussion with my mentor, I modified the TestInputOutputFormat
> > >> > test to inherit ClusterMapReduceTestCase, which starts a
> > >> > MiniCluster for the unit test. In the unit test, I enabled
> > >> > caching and read short-circuiting. I created a 500MB and a 1GB
> > >> > log file on my dev box for the test. The test reads in the log
> > >> > file and writes it to a temporary Parquet-format file using
> > >> > MapReduce. Then it reads from the temporary Parquet-format file
> > >> > and writes to an output file. I inserted a time counter into the
> > >> > latter MapReduce task and used the time spent on the second
> > >> > MapReduce job as the indicator. I ran the unit test with and
> > >> > without the Zero-Copy API enabled on the 500MB and 1GB log files
> > >> > and compared the time spent in each situation. The results are
> > >> > shown below.
> > >> >
> > >> >                          File Size   Average Reading Time (s)   Improvement
> > >> >   Without Zero-Copy API      500MB                        576
> > >> >   With Zero-Copy API         500MB                        394           46%
> > >> >   Without Zero-Copy API     1024MB                       1080
> > >> >   With Zero-Copy API        1024MB                        781           38%
> > >> >
> > >> > As we can see, there is about a 30~50% improvement in reading
> > >> > performance, which shows the project has reached its goal. But
> > >> > the benchmark is insufficient: my dev box has very limited
> > >> > resources, and a 1GB file is the largest I can fit. After GSOC,
> > >> > it would be good to invite more people to try it out on a real
> > >> > cluster with larger files to benchmark its effect in a real
> > >> > situation.
> > >> >
> > >> > Best,
> > >> >
> > >> > Sunyu
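A rough sketch of the benchmark harness the report describes: a test that
inherits Hadoop's ClusterMapReduceTestCase so a MiniDFS/MiniMR cluster is
brought up, with HDFS short-circuit local reads switched on. The two dfs.*
property names are the standard HDFS short-circuit settings; the class
name, test body, and socket path are illustrative, not the actual test:

    import java.util.Properties;

    import org.apache.hadoop.mapred.ClusterMapReduceTestCase;

    public class TestInputOutputFormatOnCluster
        extends ClusterMapReduceTestCase {

      @Override
      protected void setUp() throws Exception {
        Properties props = new Properties();
        // Let tasks read local HDFS blocks directly, bypassing the
        // DataNode; needed for the zero-copy/mmap read path to kick in.
        props.setProperty("dfs.client.read.shortcircuit", "true");
        props.setProperty("dfs.domain.socket.path",
            System.getProperty("java.io.tmpdir") + "/dn_socket");
        startCluster(true, props); // reformat DFS, apply the overrides
      }

      public void testReadTiming() throws Exception {
        long start = System.currentTimeMillis();
        // ... run the second job here: read the temporary Parquet file
        // and write the plain output, as in the report ...
        System.out.println("second job took "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }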
