Hi All - Sorry I missed this email chain. I've been mostly responsible
for building the infrastructure around the new parquet-site website, but
have mostly left the existing content alone. I'm happy to just link to the
parquet-format repo, but that would mean the content is no longer
searchable from the website, and users would have to first find the link to
the parquet-format repo from the docs and then navigate there.

I could just embed the parquet-format README in an iframe on the spec docs.
Alternatively, as part of the release actions, we can add a task that opens
an issue on parquet-site to update the spec docs.

Do people have thoughts / opinions on these two?

On Thu, Jan 18, 2024 at 1:33 PM Kaili Zhang <[email protected]> wrote:

> Hi Gabor
>
> I am OK with that. As long as the information is up-to-date, whatever
> method most convenient for the devs will do.
>
> Kind regards
>
> Kaili
>
> ________________________________
> From: Gábor Szádovszky <[email protected]>
> Sent: Monday, January 15, 2024 12:25:39 AM
> To: [email protected] <[email protected]>
> Subject: Re: Discrepancy in parquet format documentation
>
> Hey Gang, Kaili,
>
> I think the easiest way to solve this issue is to completely remove the
> spec from the site and add a reference to the parquet-format repo instead.
> We should probably add the release tag links when we make a release of
> parquet-format with a "latest" link. This way we would also avoid potential
> issues when someone would make decisions based on un-released spec changes.
>
> Cheers,
> Gabor
>
> Kaili Zhang <[email protected]> wrote (on Sat, Jan 13, 2024,
> 20:53):
>
> > Hi Gang
> >
> > Thank you for looking into this. Updating the description on
> > parquet.apache.org will save everyone searching for this information a
> > few hours of head scratching. It is unfortunate that the slightly
> > out-of-date spec features more prominently in Google results.
> >
> > Kind regards
> >
> > Kaili
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Tuesday, January 9, 2024 5:56 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Discrepancy in parquet format documentation
> >
> > Hi Kaili,
> >
> > You're right. Please refer to the parquet-format repo for specs. The site
> > has unfortunately been out of sync for a long time and there isn't any
> > automatic process to update it. Let me update the site manually to be in
> > sync with the latest format release.
> >
> > Best,
> > Gang
> >
> > On Sun, Jan 7, 2024 at 8:03 AM Kaili Zhang <[email protected]> wrote:
> >
> > > Hi all
> > >
> > > I found this page via Google when searching for a description of the
> > > parquet binary format:
> > > https://parquet.apache.org/docs/file-format/data-pages/. This page
> > > suggests that definition levels are written before repetition levels.
> > >
> > > However, after experimenting with parquet files generated by pandas and
> > > pyarrow and perusing the arrow source code (especially
> > > InitializeLevelDecoders in
> > > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc),
> > > I strongly believe that repetition levels are written before definition
> > > levels. I also found this other documentation of parquet format that has
> > > repetition levels before definition levels:
> > > https://github.com/apache/parquet-format.
> > >
> > > The content of the parquet.apache.org/docs site appears to be tracked on
> > > Github under https://github.com/apache/parquet-site. Is the documentation
> > > content still being actively updated? Has there been an effort to
> > > synchronize the format descriptions under apache/parquet-site with those
> > > under apache/parquet-format?
> > >
> > > Kind regards
> > >
> > > Kaili
> > >
> > >
> >
>
