You do want version control and a place to discuss spec changes for all spec documents, so they need to be in *some* repo. The website is nice to have, but it should just be derived from documents stored in a repo. Whether that repo is parquet-format or parquet-mr isn't too significant. Having said that, as someone who maintains a proprietary Parquet implementation, I enjoy the fact that spec documents and PRs are not mixed with countless Java implementation PRs. By checking the git log of parquet-format, I can quickly see what spec changes were made in the last year or so, which is very useful in determining whether there is something new I should incorporate into our implementation. The same could be achieved by somehow tagging spec commits in parquet-mr (e.g. by giving their commit titles a specific prefix, so I can grep for them), but then nothing guards against someone forgetting to tag their commit. All in all, it would be harder to find spec-only changes; the lines would get blurred.
So I have to say that I like the current state. These markdown documents are very important for people who maintain Parquet implementations. Even if parquet-format is "just" a repo with some markdown documents and one thrift declaration, these documents are immensely valuable for implementation maintainers.

Cheers,
Jan

On Wed, 6 Mar 2024 at 04:57, Vinoo Ganesh <[email protected]> wrote:

> Hi Gang,
>
> Thanks - the historical context definitely makes sense and I hear your concern about breaking existing links. One thing I observed, though, is that this choice also makes Parquet a bit unique in this space.
>
> For example, Iceberg's Table spec (https://iceberg.apache.org/spec/) and Puffin (https://iceberg.apache.org/puffin-spec/) exist solely on the website and not in a separate repo. Avro's spec (https://avro.apache.org/docs/1.11.1/) is in the same situation. Arrow does the same: https://arrow.apache.org/docs/format/Columnar.html in a versioned way (last version: https://arrow.apache.org/docs/14.0/format/Columnar.html).
>
> Orc seems to have just recently (3 months ago) introduced an orc-format repo, though their specs are also published in a versioned way on the website: https://orc.apache.org/specification/ORCv0/, https://orc.apache.org/specification/ORCv1/, and even their draft one: https://orc.apache.org/specification/ORCv2/. It may be worth talking to them about why they chose to do this.
>
> Regarding parquet-format, I'm not suggesting that we outright remove it, but I think there may be value in archiving the repo (so that it's read only) and doing the work moving forward on the website, just as Iceberg and Avro seem to do. It could also be a personal bias, but I think the website offers a bit more flexibility and readability than navigating through individual markdown files on the repo. We're also using docsy as our template (as it seems Avro is) so it shouldn't be too crazy to adopt their model.
>
> Thanks, Vinoo
>
> <[email protected]>
>
> On Tue, Mar 5, 2024 at 10:08 PM Gang Wu <[email protected]> wrote:
>
> > Hi Vinoo,
> >
> > IMO, we cannot do this because the parquet-format repo serves as the dedicated place to hold the parquet specs, which includes the thrift definition file and a set of documents tagged for all versions. Some projects also directly reference the link of the markdown files, which will be broken if we remove the repo. Even for the deprecated Java code you mentioned above, I remember that someone told me the code may still be used by legacy projects. So it would not be easy to do such a move.
> >
> > Best,
> > Gang
> >
> > On Wed, Mar 6, 2024 at 10:31 AM Vinoo Ganesh <[email protected]> wrote:
> >
> >> Hi Parquet Dev -
> >>
> >> There have been some conversations about content stored on the parquet-format github repo vs. the website. Doing a cursory pass of the parquet-format <https://github.com/apache/parquet-format> repo, it looks like, other than the markdown documentation stored in the repo, most of the core code was marked as deprecated here: https://github.com/apache/parquet-format/pull/105, content was moved to parquet-mr, and that entire repo really only exists to host this file: https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift. It's possible I'm missing something, but is my understanding correct?
> >>
> >> If so, would it make sense to just deprecate parquet-format as a repo, move the content to be exclusively hosted on parquet-site <https://github.com/apache/parquet-site/tree/asf-site>, and host the thrift file elsewhere? This would solve the content duplication problem between parquet format and the website, and would cut down on having to manage a separate repo.
> >> I know there is benefit to having comments/discussions on PRs or issues on the repo, but we could also pretty easily port this to the site.
> >>
> >> I'm sure this proposal will elicit some strong responses, but wanted to see if anyone had insights here / if I'm missing anything.
> >>
> >> Thanks, Vinoo
> >>
> >> <[email protected]>
