++ Adding Xinli and Gidon. We discussed this at ApacheCon NA. I will post the slide deck on SlideShare soon, after making a few changes.

On Tue, Sep 27, 2022 at 11:29 AM Mukund Madhav Thakur <mtha...@cloudera.com> wrote:

> Hi Team,
>
> We in the Hadoop project recently added a new feature in Hadoop, Vectored IO,
> which will be released in the upcoming 3.3.5 Hadoop release.
> This is a high-performance scatter/gather extension of the PositionedReadable
> API, optimized for reading columnar data in cloud storage.
> https://issues.apache.org/jira/browse/HADOOP-18103
> We observed really good performance improvements in the Hive TPC-H and TPC-DS
> benchmarks for ORC data stored in S3.
>
> We are now looking at Parquet integration as well.
> https://issues.apache.org/jira/browse/PARQUET-2171
> I have a draft patch which works locally through Spark's file reader.
> https://github.com/apache/parquet-mr/pull/999
>
> We know Parquet likes to support builds against older versions of
> Hadoop, so we are working on a solution to offer the API through a
> shim library.
> As I have never contributed to the Parquet codebase and it is totally new
> to me, I would really appreciate some help in implementing, testing and
> releasing this feature in the best possible way.
>
> I will be talking about all of this at the upcoming Apache Conference NA
> next week, Tuesday, October 04, 4:10 PM CDT. It would be really great to
> meet anyone who would be interested in getting involved in this.
>
> Thanks,
> Mukund