Rather than a patch, it would be great if you could link to a pull request based on Sunyu's. It has the added benefit of tracking Sunyu's commits and giving credit to the author of each part.
On Tue, Aug 26, 2014 at 10:25 PM, Parth Chandra <[email protected]> wrote:

> Hi Sunyu,
> I logged the JIRA, and the patch I attached is based off of your GSOC
> work. We would like to have it included in the same pull request, but
> there is some work that needs to be done on the patch I submitted.
> In particular, I think we need to solve the issue of backward
> compatibility for the direct decompressor implementation for the
> Snappy decompressor.
> Hoping for some feedback on the best way to address that.
> Thanks
>
> Parth
>
>
> On Tue, Aug 26, 2014 at 6:52 PM, sunyu duan <[email protected]> wrote:
>
> > Hi Julien and Jacques,
> >
> > I saw there seems to be another thread on ByteBuffer-based reading at
> > https://issues.apache.org/jira/browse/PARQUET-77, which is similar to
> > my commits, and it also introduces a CompatibilityUtil.java in
> > Parquet. I think we can combine the code there with my pull request
> > on GitHub.
> >
> > Best,
> > Sunyu
> >
> >
> > On Fri, Aug 15, 2014 at 11:21 PM, sunyu duan <[email protected]> wrote:
> >
> > > Thank you! I've updated the pull request to enable the enforcer
> > > plugin and added a compatibility interface so it won't break the
> > > compatibility test. I think the pull request is now ready to merge.
> > > I'm waiting for comments on the code in case it still needs some
> > > improvement before being merged. And I'd be really happy if you
> > > could try it out on a real cluster.
> > >
> > >
> > > On Thu, Aug 14, 2014 at 11:56 PM, Jacques Nadeau <[email protected]>
> > > wrote:
> > >
> > >> Hi Sunyu,
> > >>
> > >> Nice work! We've been working with your patch and enhancing it for
> > >> incorporation into Apache Drill. What do you think the timeline and
> > >> steps are to get this into master? We'd be more than happy to help,
> > >> depending on your time for this in the coming weeks.
> > >>
> > >> thanks,
> > >> Jacques
> > >>
> > >>
> > >> On Thu, Aug 14, 2014 at 8:30 AM, sunyu duan <[email protected]> wrote:
> > >>
> > >> > Hi everyone,
> > >> >
> > >> > My name is Sunyu Duan, this year's GSOC student working on
> > >> > Parquet.
> > >> >
> > >> > As most of the work has been done, I wrote this report to
> > >> > summarize what I have done and the results.
> > >> >
> > >> > My project is Using the Zero-Copy Read Path in the New Hadoop
> > >> > API. The goal is to exploit the Zero-Copy API introduced by
> > >> > Hadoop to improve the read performance of Parquet tasks running
> > >> > locally. My contribution is to replace the byte-array-based API
> > >> > with a ByteBuffer-based API in the read path, avoiding byte
> > >> > array copies while staying compatible with the old APIs. Here is
> > >> > the complete pull request:
> > >> > https://github.com/apache/incubator-parquet-mr/pull/6
> > >> >
> > >> > My work includes two parts (see the sketches after this list).
> > >> >
> > >> > 1. Make the whole read path use ByteBuffer directly.
> > >> >
> > >> >    - Introduce an initFromPage interface in ValueRead and
> > >> >      implement it in each ValueReader.
> > >> >    - Introduce a ByteBufferInputStream.
> > >> >    - Introduce a ByteBufferBytesInput.
> > >> >    - Replace the unpack8values method with a ByteBuffer version.
> > >> >    - Use the introduced ByteBuffer-based methods in the read
> > >> >      path.
> > >> >
> > >> > 2. Introduce a compatibility layer to stay compatible with the
> > >> >    old Hadoop API.
> > >> >
> > >> >    - Introduce a CompatibilityUtil.
> > >> >    - Use the CompatibilityUtil to perform the read.
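A minimal sketch of the kind of ByteBuffer plumbing part 1 describes (the
class name matches the report; the body is illustrative, not the actual
patch):

    import java.io.InputStream;
    import java.nio.ByteBuffer;

    // An InputStream view over a ByteBuffer, so stream-based readers can
    // consume a page without first copying it into a byte[].
    public class ByteBufferInputStream extends InputStream {
      private final ByteBuffer buf;

      public ByteBufferInputStream(ByteBuffer buf) {
        this.buf = buf.slice(); // leave the caller's position/limit alone
      }

      @Override
      public int read() {
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
      }

      @Override
      public int read(byte[] dst, int off, int len) {
        if (!buf.hasRemaining()) return -1;
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n);
        return n;
      }

      @Override
      public int available() {
        return buf.remaining();
      }
    }

And a rough sketch of the idea behind part 2's CompatibilityUtil: probe
once, via reflection, for the Hadoop 2 zero-copy read
(FSDataInputStream.read(ByteBufferPool, int, EnumSet<ReadOption>)) and
fall back to a plain byte[] read when the API is absent. The method names
other than Hadoop's are hypothetical, and this assumes compiling against
Hadoop 2 while tolerating an older runtime:

    import java.io.IOException;
    import java.lang.reflect.Method;
    import java.nio.ByteBuffer;
    import java.util.EnumSet;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.ReadOption;

    public final class CompatibilityUtil {
      // Resolved once; null means the zero-copy API is missing at runtime.
      private static final Method ZERO_COPY_READ = probeZeroCopyRead();

      private static Method probeZeroCopyRead() {
        try {
          Class<?> poolCls =
              Class.forName("org.apache.hadoop.io.ByteBufferPool");
          return FSDataInputStream.class.getMethod(
              "read", poolCls, int.class, EnumSet.class);
        } catch (ReflectiveOperationException e) {
          return null; // old Hadoop
        }
      }

      // pool is typed Object so this class links on old Hadoop too;
      // callers pass an org.apache.hadoop.io.ByteBufferPool on Hadoop 2.
      public static ByteBuffer getBuf(FSDataInputStream in, Object pool,
          int maxSize) throws IOException {
        if (ZERO_COPY_READ != null) {
          try {
            // Zero-copy path: may hand back a slice of an mmapped block.
            return (ByteBuffer) ZERO_COPY_READ.invoke(in, pool, maxSize,
                EnumSet.noneOf(ReadOption.class));
          } catch (ReflectiveOperationException e) {
            throw new IOException("zero-copy read failed", e);
          }
        }
        // Fallback: one copy into a heap buffer, as before the patch.
        byte[] copy = new byte[maxSize];
        int n = in.read(copy, 0, maxSize);
        return n < 0 ? null : ByteBuffer.wrap(copy, 0, n);
      }

      private CompatibilityUtil() {}
    }

A buffer obtained through the zero-copy path must eventually be returned
via FSDataInputStream.releaseBuffer(ByteBuffer), which is one reason a
ByteBufferBytesInput that remembers its owning stream is useful.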
> > >> >
> > >> > After coding, I started to benchmark the improvement. After
> > >> > discussion with my mentor, I modified the TestInputOutputFormat
> > >> > test to inherit ClusterMapReduceTestCase, which starts a
> > >> > MiniCluster for the unit test. In the unit test, I enabled
> > >> > caching and read short-circuiting. I created a 500MB and a 1GB
> > >> > log file on my dev box for the test. The test reads in the log
> > >> > file and writes it to a temporary Parquet-format file using
> > >> > MapReduce. Then it reads from the temporary Parquet-format file
> > >> > and writes to an output file. I inserted a time counter into the
> > >> > latter MapReduce task and used the time spent on the second
> > >> > MapReduce job as the indicator. I ran the unit test with and
> > >> > without the Zero-Copy API enabled on the 500MB and 1GB log files
> > >> > and compared the time spent in each situation. The results are
> > >> > shown below.
> > >> >
> > >> >                          File Size   Average Reading Time (s)   Improvement
> > >> >   Without Zero-Copy API      500MB                        576
> > >> >   With Zero-Copy API         500MB                        394           46%
> > >> >   Without Zero-Copy API     1024MB                       1080
> > >> >   With Zero-Copy API        1024MB                        781           38%
> > >> >
> > >> > As we can see, there is about a 30~50% improvement in reading
> > >> > performance, which shows the project has reached its goal. But
> > >> > the benchmark is insufficient: my dev box has very limited
> > >> > resources, and a 1GB file is the largest I can fit. After GSOC,
> > >> > it would be good to invite more people to try it out on a real
> > >> > cluster with larger files to benchmark its effect in a real
> > >> > situation.
> > >> >
> > >> > Best,
> > >> >
> > >> > Sunyu
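A rough sketch of the benchmark harness the report describes: a test that
inherits Hadoop's ClusterMapReduceTestCase so a MiniDFS/MiniMR cluster is
brought up, with HDFS short-circuit local reads switched on. The two dfs.*
property names are the standard HDFS short-circuit settings; the class
name, test body, and socket path are illustrative, not the actual test:

    import java.util.Properties;

    import org.apache.hadoop.mapred.ClusterMapReduceTestCase;

    public class TestInputOutputFormatOnCluster
        extends ClusterMapReduceTestCase {

      @Override
      protected void setUp() throws Exception {
        Properties props = new Properties();
        // Let tasks read local HDFS blocks directly, bypassing the
        // DataNode; needed for the zero-copy/mmap read path to kick in.
        props.setProperty("dfs.client.read.shortcircuit", "true");
        props.setProperty("dfs.domain.socket.path",
            System.getProperty("java.io.tmpdir") + "/dn_socket");
        startCluster(true, props); // reformat DFS, apply the overrides
      }

      public void testReadTiming() throws Exception {
        long start = System.currentTimeMillis();
        // ... run the second job here: read the temporary Parquet file
        // and write the plain output, as in the report ...
        System.out.println("second job took "
            + (System.currentTimeMillis() - start) + " ms");
      }
    }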
