You do want version control and a place to discuss spec changes for all
spec documents, so they need to be in *some* repo. The website is nice to
have, but it should just be derived from documents stored in a repo.
Whether that repo is parquet-format or parquet-mr isn't too significant.
Having said that, as someone who maintains a proprietary Parquet
implementation, I enjoy the fact that spec documents and PRs are not mixed
with countless Java implementation PRs. By checking the git log of
parquet-format, I can quickly see what spec changes were made in the last
year or so, which is very useful in determining whether there is something
new I should incorporate into our implementation.
The same could be achieved by somehow tagging spec commits in parquet-mr
(e.g. by giving their commit messages a specific prefix, so I can grep for
them), but then nothing guards against someone forgetting to tag their
commit. All in all, it would be harder to find spec-only changes; the lines
would get blurred.
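
To make the weakness of prefix tagging concrete, here is a minimal sketch
in Python. The "[SPEC]" prefix and the commit subjects are hypothetical,
made up purely for illustration; in practice the subjects would come from
something like `git log --since="1 year ago" --format=%s`.

```python
# Hypothetical convention: spec commits carry a "[SPEC]" subject prefix.
SPEC_PREFIX = "[SPEC]"


def spec_commits(subjects, prefix=SPEC_PREFIX):
    """Return only the commit subjects that start with the spec prefix."""
    return [s for s in subjects if s.startswith(prefix)]


# Made-up commit subjects, for illustration only.
subjects = [
    "[SPEC] Clarify wording in the format document",
    "Bump Java dependency versions",
    "Document a new logical type",  # a spec change whose author forgot the prefix
]

found = spec_commits(subjects)
print(found)  # the untagged spec change is silently missed
```

With a dedicated spec repo, no such filter is needed: every commit in the
log is a spec change by construction.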

So I have to say that I like the current state. These markdown documents
are very important for people who maintain Parquet implementations. So even
if parquet-format is "just" a repo with some markdown documents and one
thrift declaration, these documents are immensely valuable for
implementation maintainers.

Cheers,
Jan

On Wed, Mar 6, 2024 at 04:57, Vinoo Ganesh <
[email protected]> wrote:

> Hi Gang,
>     Thanks - the historical context definitely makes sense and I hear your
> concern about breaking existing links. One thing I observed though, is that
> this choice also makes Parquet a bit unique in this space.
>
> For example, Iceberg's Table spec (https://iceberg.apache.org/spec/) and
> Puffin (https://iceberg.apache.org/puffin-spec/) exist solely on the
> website and not in a separate repo. Avro's spec (
> https://avro.apache.org/docs/1.11.1/) is in the same situation. Arrow does
> the same: https://arrow.apache.org/docs/format/Columnar.html in a
> versioned way (last version:
> https://arrow.apache.org/docs/14.0/format/Columnar.html).
>
> ORC seems to have just recently (3 months ago) introduced an orc-format
> repo, though their specs are also published in a versioned way on the
> website: https://orc.apache.org/specification/ORCv0/,
> https://orc.apache.org/specification/ORCv1/, and even their draft one:
> https://orc.apache.org/specification/ORCv2/. It may be worth talking to
> them about why they chose to do this.
>
> Regarding parquet-format, I'm not suggesting that we outright remove it,
> but I think there may be value in archiving the repo (so that it's read
> only) and doing the work moving forward on the website, just as Iceberg and
> Avro seem to do. It could also be a personal bias, but I think the website
> offers a bit more flexibility and readability than navigating through
> individual markdown files on the repo. We're also using docsy as our
> template (as it seems Avro is) so it shouldn't be too crazy to adopt their
> model.
>
> Thanks, Vinoo
>
>
> <[email protected]>
>
>
> On Tue, Mar 5, 2024 at 10:08 PM Gang Wu <[email protected]> wrote:
>
> > Hi Vinoo,
> >
> > IMO, we cannot do this because the parquet-format repo serves as the
> > dedicated place to hold the parquet specs, which includes the thrift
> > definition file and a set of documents tagged for all versions. Some
> > projects also directly reference the link of the markdown files, which
> > will be broken if we remove the repo. Even for the deprecated Java code
> > you mentioned above, I remember that someone told me the code may still
> > be used by legacy projects. So it would not be easy to do such a move.
> >
> > Best,
> > Gang
> >
> > On Wed, Mar 6, 2024 at 10:31 AM Vinoo Ganesh <[email protected]>
> > wrote:
> >
> >> Hi Parquet Dev -
> >>
> >> There have been some conversations about content stored on the
> >> parquet-format github repo vs. the website. Doing a cursory pass of the
> >> parquet-format <https://github.com/apache/parquet-format> repo, it
> >> looks like, other than the markdown documentation stored in the repo,
> >> most of the core code was marked as deprecated here:
> >> https://github.com/apache/parquet-format/pull/105, content was moved to
> >> parquet-mr, and that entire repo really only exists to host this file:
> >> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift.
> >> It's possible I'm missing something, but is my understanding correct?
> >>
> >> If so, would it make sense to just deprecate parquet-format as a repo,
> >> move the content to be exclusively hosted on parquet-site
> >> <https://github.com/apache/parquet-site/tree/asf-site>, and host the
> >> thrift file elsewhere? This would solve the content duplication problem
> >> between parquet format and the website, and would cut down on having to
> >> manage a separate repo. I know there is benefit to having
> >> comments/discussions on PRs or issues on the repo, but we could also
> >> pretty easily port this to the site.
> >>
> >> I'm sure this proposal will elicit some strong responses, but wanted to
> >> see if anyone had insights here / if I'm missing anything.
> >>
> >> Thanks, Vinoo
> >>
> >>
> >> <[email protected]>
> >>
> >
>
