xhochy commented on code in PR #53:
URL: https://github.com/apache/parquet-site/pull/53#discussion_r1582661355

##########
content/en/docs/Overview/_index.md:
##########
@@ -7,3 +7,30 @@ description: >
 ---
 Apache Parquet is a columnar storage format available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model or programming language.
+
+This documentation contains information about both the [parquet-mr](https://github.com/apache/parquet-mr) and [parquet-format](https://github.com/apache/parquet-format) projects.
+
+
+### Parquet Format
+
+The "Parquet Format" project hosts the official specification of the Parquet file format, defining how data is structured and stored. This specification, along with Thrift metadata definitions and other crucial components, is essential for developers to effectively read and write Parquet files. The parquet-format project specifically contains the format specifications needed to understand and properly utilize Parquet files.
+
+As a repository focused on specification, the parquet-format repository does not contain source code.
+
+
+### Parquet MR
+
+The parquet-mr GitHub repository is part of the Apache Parquet project and specifically focuses on providing Java tools for handling the Parquet file format within the Hadoop ecosystem. Essentially, this repository includes all the necessary Java libraries and modules that allow developers to read and write Parquet files.
+
+ Parquet MR can be thought of the a "reference" implementation of parquet-format. There are a number of other Parquet Format implementations, such as [parquet-cpp](https://github.com/apache/parquet-cpp) and [parquet rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md).
+
+
+* Java/Scala Implementation: It contains the core Java/Scala implementation of the Parquet format, making it possible to use Parquet files in Java applications, particularly those based on Hadoop.
+
+* Utilities and APIs: It provides various utilities and APIs for working with Parquet files, including tools for data import/export, schema management, and data conversion.
+
+
+### Other Clients / Libraries / Tools

Review Comment:
   I think it would make sense to cover the `parquet-cpp` / `arrow` situation in general, as it sometimes causes a bit of confusion. The `parquet` part of the Arrow repository is actually part of the Parquet PMC, but some time ago we decided to merge it into Arrow, because the Parquet C++ community was a subset of the Arrow C++ community and all development happened in the context of Arrow C++/Python.