Hi All - Sorry I missed this email chain. I've been mainly responsible for building the infrastructure around the new parquet-site website, but have largely left the existing content alone. I'm happy to just link to the parquet-format repo, but that would mean the content is no longer searchable from the website, and users would first have to find the link to the parquet-format repo in the docs and then navigate there.
I could just embed the parquet-format README in an iframe on the spec docs. Alternatively, as part of the release actions, we could add a task that opens an issue on parquet-site requesting an update. Do people have thoughts or opinions on these two options?

On Thu, Jan 18, 2024 at 1:33 PM Kaili Zhang <[email protected]> wrote:

> Hi Gabor
>
> I am OK with that. As long as the information is up-to-date, whatever
> method most convenient for the devs will do.
>
> Kind regards
>
> Kaili
>
> ________________________________
> From: Gábor Szádovszky <[email protected]>
> Sent: Monday, January 15, 2024 12:25:39 AM
> To: [email protected] <[email protected]>
> Subject: Re: Discrepancy in parquet format documentation
>
> Hey Gang, Kaili,
>
> I think the easiest way to solve this issue is to completely remove the
> spec from the site and add a reference to the parquet-format repo instead.
> We should probably add the release tag links when we make a release of
> parquet-format with a "latest" link. This way we would also avoid potential
> issues when someone would make decisions based on un-released spec changes.
>
> Cheers,
> Gabor
>
> Kaili Zhang <[email protected]> wrote (on Sat, Jan 13, 2024, 20:53):
>
> > Hi Gang
> >
> > Thank you for looking into this. Updating the description on
> > parquet.apache.org will save everyone searching for this information a
> > few hours of head scratching. It is unfortunate that the slightly
> > out-of-date spec features more prominently in Google results.
> >
> > Kind regards
> >
> > Kaili
> > ________________________________
> > From: Gang Wu <[email protected]>
> > Sent: Tuesday, January 9, 2024 5:56 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Discrepancy in parquet format documentation
> >
> > Hi Kaili,
> >
> > You're right. Please refer to the parquet-format repo for specs. The site
> > is unfortunately out of sync for a long time and there isn't any automatic
> > process to update it. Let me update the site manually to be in sync with
> > the latest format release.
> >
> > Best,
> > Gang
> >
> > On Sun, Jan 7, 2024 at 8:03 AM Kaili Zhang <[email protected]> wrote:
> >
> > > Hi all
> > >
> > > I found this page via Google when searching for a description of the
> > > parquet binary format:
> > > https://parquet.apache.org/docs/file-format/data-pages/. This page
> > > suggests that definition levels are written before repetition levels.
> > >
> > > However, after experimenting with parquet files generated by pandas and
> > > pyarrow and perusing the arrow source code (especially
> > > InitializeLevelDecoders in
> > > https://github.com/apache/arrow/blob/main/cpp/src/parquet/column_reader.cc),
> > > I strongly believe that repetition levels are written before definition
> > > levels. I also found this other documentation of parquet format that has
> > > repetition levels before definition levels
> > > https://github.com/apache/parquet-format.
> > >
> > > The content of the parquet.apache.org/docs site appears to be tracked on
> > > Github under https://github.com/apache/parquet-site. Is the documentation
> > > content still being actively updated? Has there been an effort to
> > > synchronize the format descriptions under apache/parquet-site with those
> > > under apache/parquet-format?
> > >
> > > Kind regards
> > >
> > > Kaili
