Hi, https://issues.apache.org/jira/browse/PARQUET-2450 is currently affecting many of our customers. https://github.com/apache/parquet-mr/pull/1300#issuecomment-2046590751 will fix the issue. Can 1.14.0 be expedited? Or can we do a 1.13.2 patch release to get this fix out faster?
Let me know if there's anything on my end that I can do to help.

On 2024/02/27 14:42:39 Fokko Driesprong wrote:
> Hey everyone,
>
> Thanks for the many responses.
>
> > We should check that parquet-mr implements everything introduced by the
> > new parquet-format release.
>
> Good call and I fully agree with that. Let's double check that before
> starting any releases.
>
> > We should check on every ongoing PR and Jira that seems to be targeting
> > the next parquet-mr release, and decide if we want to wait for them or
> > not.
>
> I'm happy to do a first pass on that.
>
> > I am currently doing some work related to direct memory. Not all the
> > related jiras are created. Will try to create them and set 1.14.0 as
> > target. Will try to finalize everything by the end of next week.
>
> Thanks, it is not my main area of expertise, but let me know if you need a
> review. I would not want to rush the release if there is still ongoing
> work, just wanted to get the ball rolling and collect expectations.
>
> For the new API, I feel like we're doing a 1.15 and then jumping to 2.0,
> which is also totally fine with me.
>
> For those who'll be there, see you at the sync!
>
> Kind regards,
> Fokko Driesprong
>
> On Thu, 22 Feb 2024 at 13:47, Steve Loughran <st...@cloudera.com.invalid>
> wrote:
>
> > Apologies for not making any progress - been too busy with releases.
> >
> > This week I am helping Hadoop 3.4.0 out the door. Hopefully we will only
> > need one more iteration to get the packaging right (essentially strip
> > out as many transitive JARs as we can). My release module does actually
> > build parquet as one stage in the validation, so I'm happy we aren't
> > breaking your build.
> >
> > Moving to 3.3+ would be absolutely wonderful; it has been out for years
> > and we have fixed many issues as well as done our best to move to less
> > insecure transitive dependencies - that is still ongoing. It is ongoing
> > forever, I suspect.
> > Unless you use a release with vector IO (3.3.5+) you'll still need to
> > use reflection there.
> >
> > What you will get as soon as you move to 3.3.0 is the openFile() API,
> > which lets you:
> > - explicitly declare the read/seek policy of a file. For parquet,
> >   "random" is what you want.
> > - pass in the FileStatus or file length when opening a file. For object
> >   stores, that can save the overhead of an HTTP HEAD request, as we can
> >   skip the probe for the existence and length of the file.
> >
> > Random IO is the biggest saving here; the s3a FS tries to guess your
> > read policy and switches to random on the first backwards seek, but it
> > isn't perfect.
> >
> > Regarding vectored read APIs, the Hadoop one maps trivially to the java
> > nio scatter/gather read API, which can deliver great speed-ups on native
> > storage, especially SSD - more from the ability to do parallel block
> > reads than anything else. What does that mean? Use the hadoop raw local
> > FS and you get it. It also means that any non-hadoop java code should
> > use the nio read API directly.
> >
> > Anyway: I do plan to get onto that PR as soon as I get a chance:
> > - add range overlap detection in the parquet code
> > - make sure all hadoop filesystems reject overlapping ranges too. s3a
> >   already does AFAIK, but I want consistency, contract tests and
> >   coverage in the specification.
> >
> > On Wed, 21 Feb 2024 at 15:30, Gang Wu <us...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for bringing this up!
> > >
> > > For the 1.14.0 release, I think it would be good to include some open
> > > PRs, e.g. [1].
> > >
> > > Thanks Gabor for the idea of new APIs! I agree that we need to clean
> > > up some misused APIs and remove the Hadoop dependencies. In the
> > > meantime, I actually have some concerns. For example, I have recently
> > > investigated how Apache Spark and Apache Iceberg support vectorized
> > > reading of parquet.
> > > I have seen much code duplication between them, but they have
> > > different high-level APIs. If we aim to support a similar vectorized
> > > reader based on Arrow vectors, I am not sure if these clients are
> > > willing to migrate due to the differences in type system, performance
> > > of vector conversion, etc. That said, this is worth doing and we need
> > > to collect sufficient feedback from different communities.
> > >
> > > [1] https://github.com/apache/parquet-mr/pull/1139
> > >
> > > Best,
> > > Gang
> > >
> > > On Wed, Feb 21, 2024 at 8:48 PM Gábor Szádovszky <ga...@apache.org>
> > > wrote:
> > >
> > > > Thanks for bringing this up, Fokko.
> > > > Unfortunately, I won't be able to join next week. (Hopefully I will
> > > > be there at the one after.)
> > > > So, let me write my thoughts here.
> > > >
> > > > I agree it is time to start preparing the next parquet-mr release.
> > > > I have some thoughts:
> > > > - We should check that parquet-mr implements everything introduced
> > > >   by the new parquet-format release
> > > > - We should check on every ongoing PR and jira that seems to be
> > > >   targeting the next parquet-mr release, and decide if we want to
> > > >   wait for them or not
> > > > - I am currently doing some work related to direct memory. Not all
> > > >   the related jiras are created. Will try to create them and set
> > > >   1.14.0 as target. Will try to finalize everything by the end of
> > > >   next week.
> > > >
> > > > About parquet-mr 2.0: we need to decide what we expect from it. The
> > > > java upgrade is just one thing, and it can even be done without a
> > > > major version (e.g. separate releases for different java versions).
> > > > My original thought about 2.0 was to provide a new API for our
> > > > clients:
> > > >
> > > > - We've had many issues because different API users started using
> > > > classes/methods that were originally implemented for internal use only.
> > > > Like reading the pages directly.
> > > > - We need to have different levels of APIs that support all current
> > > >   use-cases, e.g.:
> > > >   - easy-to-use, high-level row-wise reading/writing
> > > >   - vectorized reading/writing; probably native support of Arrow
> > > >     vectors
> > > > - We need to get rid of the Hadoop dependencies
> > > > - The goal is to have a well-defined public API that we share with
> > > >   our clients and hide everything else. It is much easier to keep
> > > >   backward compatibility for the public API only.
> > > > - The new API itself does not need a major release. We can start
> > > >   working on it in a separate module. We'll need some minor release
> > > >   cycles to build it. (We'll need our clients' feedback.) What we
> > > >   need a major release for is (after having finalized the new API)
> > > >   moving all current public classes to internal modules.
> > > >
> > > > Cheers,
> > > > Gabor
> > > >
> > > > Fokko Driesprong <fo...@apache.org> wrote (on Wed, 21 Feb 2024 at
> > > > 13:04):
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'm seeing some great progress on the Parquet side, and it was
> > > > > almost one year ago that I ran the last 1.13.1 release (May 2023).
> > > > > Are there any considerations of doing a 1.14.0 release?
> > > > >
> > > > > Looking forward, I would like to discuss a parquet-mr 2.0 release.
> > > > >
> > > > > - Looking at other projects in the space, there are more and more
> > > > >   that are moving to Java 11+, for example, Spark 4.0 (June 2024)
> > > > >   and Iceberg 2.0 (the first release after 1.5.0 that's being
> > > > >   voted on right now).
> > > > > - We currently have support for Hadoop 2.x, which is compiled
> > > > >   against Java 7.
> > > > >   I would suggest dropping everything below 3.3, as that's the
> > > > >   minimal version supporting Java 11
> > > > >   <https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+Java+Versions>.
> > > > >   Because some APIs changed, we also have to use reflection, which
> > > > >   is not great.
> > > > >
> > > > > I would also like to thank Xinli for updating the Parquet Sync
> > > > > invite. I was there on the 30th of January, but all by myself. The
> > > > > next sync, next week Tuesday, would be a great opportunity to go
> > > > > over this topic.
> > > > >
> > > > > Looking forward to your thoughts!
> > > > >
> > > > > Kind regards,
> > > > > Fokko Driesprong
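For reference, the openFile() builder Steve describes further up the thread can be sketched roughly as below. This is an illustrative example, not code from the thread: it assumes Hadoop 3.3.x on the classpath, and the standard `fs.option.openfile.read.policy` key is only recognized from Hadoop 3.3.5 on (unrecognized `opt()` keys are simply ignored, so it is safe to set on earlier 3.3.x releases too).

```java
// Sketch: opening a file for the seek-heavy access pattern Parquet needs,
// declaring the "random" read policy and passing the known FileStatus so
// object stores can skip the HEAD probe for the file's existence and length.
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

public class OpenFileSketch {
  public static FSDataInputStream openForRandomIO(FileSystem fs, FileStatus st)
      throws Exception {
    return fs.openFile(st.getPath())
        .opt("fs.option.openfile.read.policy", "random") // seek-heavy formats like Parquet
        .withFileStatus(st)                              // skips the existence/length probe
        .build()                                         // CompletableFuture<FSDataInputStream>
        .get();
  }
}
```

Without `withFileStatus()`, an object-store filesystem such as s3a typically issues a HEAD request before the first read, which is exactly the overhead described above.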
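And what "use the nio read API directly" can look like for a batch of (offset, length) ranges: a minimal sketch using only the JDK, relying on the fact that `FileChannel` positioned reads are thread-safe so the ranges can be fetched in parallel. The class and method names (`RangeReadSketch`, `readRanges`) are made up for illustration and are not parquet-mr or Hadoop APIs.

```java
// Sketch: parallel positioned reads of several byte ranges of one file.
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.CompletableFuture;

public class RangeReadSketch {
  // Reads ranges[i] = lengths[i] bytes starting at offsets[i]; each buffer
  // is returned flipped, ready for reading (short at EOF).
  public static ByteBuffer[] readRanges(Path file, long[] offsets, int[] lengths)
      throws Exception {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      @SuppressWarnings("unchecked")
      CompletableFuture<ByteBuffer>[] pending = new CompletableFuture[offsets.length];
      for (int i = 0; i < offsets.length; i++) {
        final long off = offsets[i];
        final int len = lengths[i];
        pending[i] = CompletableFuture.supplyAsync(() -> {
          try {
            ByteBuffer buf = ByteBuffer.allocate(len);
            while (buf.hasRemaining()) {
              // Positioned read: does not touch the channel's own position,
              // which is what makes the parallel reads safe.
              if (ch.read(buf, off + buf.position()) < 0) break; // EOF
            }
            buf.flip();
            return buf;
          } catch (Exception e) {
            throw new RuntimeException(e);
          }
        });
      }
      ByteBuffer[] out = new ByteBuffer[pending.length];
      for (int i = 0; i < out.length; i++) {
        out[i] = pending[i].join(); // join before the channel closes
      }
      return out;
    }
  }
}
```

On local SSD most of the benefit comes from issuing the block reads in parallel, matching Steve's observation about the raw local FS.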