(parquet-site) branch production updated: First draft of docs about parquet format vs mr (#53)

gangwu Wed, 15 May 2024 04:53:16 -0700

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch production
in repository https://gitbox.apache.org/repos/asf/parquet-site.git



The following commit(s) were added to refs/heads/production by this push:
     new de58c2d  First draft of docs about parquet format vs mr (#53)
de58c2d is described below

commit de58c2d227649707b5ea9b5024f1264623ace642
Author: Vinoo Ganesh <vinoogan...@users.noreply.github.com>
AuthorDate: Wed May 15 07:53:05 2024 -0400

    First draft of docs about parquet format vs mr (#53)
    
    
    Co-authored-by: Andrew Lamb <and...@nerdnetworks.org>
    Co-authored-by: Ed Seidl <etse...@users.noreply.github.com>
---
 content/en/docs/Overview/_index.md | 37 +++++++++++++++++++++++++++++++++++++
 1 file changed, 37 insertions(+)

diff --git a/content/en/docs/Overview/_index.md 
b/content/en/docs/Overview/_index.md
index ced1989..58e9e1d 100644
--- a/content/en/docs/Overview/_index.md
+++ b/content/en/docs/Overview/_index.md
@@ -7,3 +7,40 @@ description: >
 ---
 
 Apache Parquet is a columnar storage format available to any project in the 
Hadoop ecosystem, regardless of the choice of data processing framework, data 
model or programming language.
+
+This documentation contains information about both the 
[parquet-mr](https://github.com/apache/parquet-mr) and 
[parquet-format](https://github.com/apache/parquet-format) repositories. 
+
+
+### parquet-format
+
+The parquet-format repository hosts the official specification of the Apache 
Parquet file format, which defines how data is structured and stored. Together 
with the Thrift metadata definitions and other supporting components, this 
specification is what developers need to read and write Parquet files 
correctly.
+
+As a repository focused on specification, parquet-format does not contain 
implementation source code.
+
+
+### parquet-mr
+
+The parquet-mr repository is part of the Apache Parquet project and provides 
Java tools for working with the Parquet file format within the Hadoop 
ecosystem. The "mr" stands for MapReduce. The repository includes the Java 
libraries and modules that allow developers to read and write Apache Parquet 
files.
+
+The parquet-mr repository contains an implementation of the Apache Parquet 
format. There are a number of other Parquet format implementations, which are 
listed below. 
+
+Included in parquet-mr:
+* Java Implementation: It contains the core Java implementation of the Apache 
Parquet format, making it possible to use Parquet files in Java applications, 
particularly those based on Hadoop.
+
+* Utilities and APIs: It provides various utilities and APIs for working with 
Apache Parquet files, including tools for data import/export, schema 
management, and data conversion.
+
+
+### Other Clients / Libraries / Tools
+
+The Parquet ecosystem includes a wide range of tools, libraries, and clients, 
each offering a different level of feature support. Not all implementations 
support the same features of the Parquet format. When combining multiple 
Parquet implementations in one workflow, test thoroughly to confirm 
compatibility and performance across the platforms and tools involved.
+
+Here is a non-exhaustive list of Parquet implementations:
+
+* [parquet-mr](https://github.com/apache/parquet-mr)
+* [Parquet C++, a subproject of Arrow 
C++](https://github.com/apache/arrow/tree/main/cpp/src/parquet) 
([documentation](https://arrow.apache.org/docs/cpp/parquet.html))
+* [Parquet Go, a subproject of Arrow 
Go](https://github.com/apache/arrow/tree/main/go/parquet) 
([documentation](https://github.com/apache/arrow/tree/main/go))
+* [Parquet 
Rust](https://github.com/apache/arrow-rs/blob/master/parquet/README.md)
+* [cuDF](https://github.com/rapidsai/cudf)
+* [Apache Impala](https://github.com/apache/impala)
+* [DuckDB](https://github.com/duckdb/duckdb)
+* [fastparquet, a Python implementation of the Apache Parquet 
format](https://github.com/dask/fastparquet)
\ No newline at end of file
