Hi Gabor - that's a good point. When I say unstable, I don't mean the code itself; I mean the universal "compatibility" of the files produced (e.g. Prem's case of compatibility of Parquet files produced/consumed by Spark + Dremio). <vinoo.gan...@gmail.com>
In the last Parquet meeting, I brought up discussing and planning a parquet-mr 2.0 release, which I think should at least establish a parquet-mr release as the "formal implementation" of the standard (even if it's mostly a vanity release).

On Thu, Apr 25, 2024 at 9:36 AM Gábor Szádovszky <ga...@apache.org> wrote:

> Hey,
>
> I don't think we should call Parquet v2.x features unstable. Since they
> were released officially, we maintain backward compatibility. So, from
> the Parquet format point of view, these features are stable.
> Whether a Parquet implementation supports all of these features or only
> a subset of them is another question. I think parquet-mr and
> parquet-cpp (Arrow) are keeping up well with these features. Other
> implementations (e.g. Impala) might be lagging behind.
> I agree it is very hard for the implementations to implement everything or
> choose what is really required. There was an initiative a couple of years
> ago that I started but failed to finish. See
> https://github.com/apache/parquet-format/pull/164 for details.
>
> I think the main question is which systems you create your Parquet files
> for. If you can list these systems (e.g. Spark, Hive, etc.), you can
> validate whether the files work with them properly. In many cases
> parquet-mr or parquet-cpp is the actual implementation behind them. If you
> want to create Parquet files for any system to read, you should not use
> newer features. (The encodings we are talking about as v2 encodings are
> 10+ years old in the Parquet spec.) But keep in mind that in many cases it
> is not that simple. For example, a compression codec might or might not be
> supported in a system independently of the actual Parquet implementation.
> For parquet-mr, the related native libraries are expected to be installed
> for some codecs.
>
> Cheers,
> Gabor
>
> Prem Sahoo <prem.re...@gmail.com> wrote (on Wed, Apr 24, 2024, 20:10):
>
> > Hello Vinoo,
> > Thanks for your assistance.
> > Pyarrow folks are using Parquet V2 though it is not recommended. I don't
> > want to make any mess, so I am just checking with all the different groups.
> >
> > On Wed, Apr 24, 2024 at 12:31 PM Vinoo Ganesh <vinoo.gan...@gmail.com>
> > wrote:
> >
> > > I'm not sure what you're looking for. A few different folks (Ryan/Steve on
> > > the Spark list, Wes on the Arrow list, and Gang/me on the Parquet list)
> > > have said that they wouldn't recommend using the Parquet V2 encodings, but
> > > you're free to do whatever you want in your own data stack, as are the
> > > clients who are using Parquet V2. Again, I (and others) personally wouldn't
> > > recommend storing production data in an unstable format, and that's the
> > > reason we are warning against it.
> > >
> > > On Wed, Apr 24, 2024 at 11:47 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > >
> > >> Hello Vinoo,
> > >> Can you please share a link where it says Parquet V2 is not official or
> > >> not stable for use by third parties?
> > >>
> > >> On Wed, Apr 24, 2024 at 11:28 AM Vinoo Ganesh <vinoo.gan...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Prem, Wes' comment on the thread you posted on the arrow dev list
> > >>> should clear up your confusion:
> > >>> https://lists.apache.org/thread/72qwr66wf3xyrl5cozgojz88ct23qzxx. There
> > >>> is a difference between the "standard" itself (parquet-format) and the
> > >>> implementation (parquet-mr, etc...).
> > >>>
> > >>> Parquet-format (https://github.com/apache/parquet-format) contains
> > >>> mostly just the docs and thrift definition now that a PR to clean up the
> > >>> remaining deprecated code was just merged. Releases of just this format,
> > >>> which again, is mostly just docs, are what Gang was referring to in [2].
> > >>>
> > >>> We just started conversations about how a Parquet 2.0 release may look
> > >>> in the meeting yesterday. As these conversations progress, the dev list
> > >>> will be kept updated.
> > >>>
> > >>> On Wed, Apr 24, 2024 at 11:10 AM Prem Sahoo <prem.re...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hello Vinoo/Team,
> > >>>> As per the pyarrow team, they don't see any concern; please check below.
> > >>>> Please let us know *where it says Parquet V2 is not official*.
> > >>>>
> > >>>> "> *As per the Apache Parquet community, Parquet V2 is not final yet, so
> > >>>> > it is not official. They are advising not to use Parquet V2 for writing
> > >>>> > (though code is available).*
> > >>>>
> > >>>> This would be news to me. Parquet releases are listed (by the parquet
> > >>>> community) at [1]
> > >>>>
> > >>>> The vote to release parquet 2.10 is here: [2]
> > >>>>
> > >>>> *Neither of these links mentions anything about this being an
> > >>>> experimental, unofficial, or non-finalized release.*
> > >>>>
> > >>>> I understand your concern. I believe your quotes are coming from your
> > >>>> discussion on the parquet mailing list here [3]. This communication is
> > >>>> unfortunate and confusing to me as well.
> > >>>>
> > >>>> [1] https://parquet.apache.org/blog/
> > >>>> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > >>>> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3"
> > >>>>
> > >>>> On Mon, Apr 22, 2024 at 4:56 PM Prem Sahoo <prem.re...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hello Vinoo/Team,
> > >>>>> I was going through pyarrow, and they have started using V2 as the
> > >>>>> default. Shouldn't they avoid it, as it is not official?
> > >>>>>
> > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > >>>>>
> > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > >>>>>
> > >>>>> Determine which Parquet logical types are available for use, whether
> > >>>>> the reduced set from the Parquet 1.x.x format or the expanded logical
> > >>>>> types added in later format versions. Files written with version='2.4'
> > >>>>> or '2.6' may not be readable in all Parquet implementations, so
> > >>>>> version='1.0' is likely the choice that maximizes file compatibility.
> > >>>>> UINT32 and some logical types are only available with version '2.4'.
> > >>>>> Nanosecond timestamps are only available with version '2.6'. Other
> > >>>>> features such as compression algorithms or the new serialized data page
> > >>>>> format must be enabled separately (see 'compression' and
> > >>>>> 'data_page_version').