ksuarez1423 commented on code in PR #14018:
URL: https://github.com/apache/arrow/pull/14018#discussion_r981398668


##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be read and written using a C++ input/output streams approach,
+one field at a time within each row.  This approach is offered for
+ease of use and type-safety.  It is of course also useful when data
+must be streamed as files are read and written incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good as that of
+:class:`arrow::FileReader`, due to the type checking and the fact that
+column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
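+For example, a reader with explicit properties might be constructed like
+this (a minimal sketch; it assumes an ``infile`` input file and a memory
+``pool`` are available):
+
+.. code-block:: cpp
+
+   parquet::arrow::FileReaderBuilder builder;
+   ARROW_RETURN_NOT_OK(builder.Open(infile));
+
+   builder.memory_pool(pool);
+   builder.properties(parquet::ArrowReaderProperties());
+
+   std::unique_ptr<parquet::arrow::FileReader> reader;
+   ARROW_RETURN_NOT_OK(builder.Build(&reader));
+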
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing (pre-buffering) to reduce the
+number of API calls:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   arrow_reader_props.set_pre_buffer(true);
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do not turn on read coalescing (pre-buffering).
+#. Read data in batches.
+#. Turn off ``use_buffered_stream``.
+
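+If all three points above are applied, a reader configuration might look
+like this (a minimal sketch; the property setters named here mirror the
+options discussed above):
+
+.. code-block:: cpp
+
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.disable_buffered_stream();
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   arrow_reader_props.set_pre_buffer(false);
+   arrow_reader_props.set_batch_size(64 * 1024);  // then read in batches
+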
+In addition, if you know certain columns contain many repeated values, you can
+read them as dictionary-encoded columns. This is enabled with the ``set_read_dictionary``
+setting on :class:`ArrowReaderProperties`. If the files were written with Arrow
+C++ and the ``store_schema`` was activated, then the original Arrow schema will
+be automatically read and will override this setting.
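+
+For example, to read the first column as a dictionary-encoded Arrow
+column (a sketch; the column is addressed by its index):
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties();
+   arrow_reader_props.set_read_dictionary(/*column_index=*/0, true);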
+
+StreamReader
+------------
+
+The :class:`StreamReader` allows for Parquet files to be read using
+standard C++ input operators, which ensures type safety.
+
+Please note that types must match the schema exactly, i.e. if the
+schema field is an unsigned 16-bit integer then you must supply a
+``uint16_t`` value.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to read a field by supplying the incorrect type.
+
+* Attempt to read beyond the end of the row.
+
+* Attempt to read beyond the end of the file.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_reader.h"
+
+   {
+      std::shared_ptr<arrow::io::ReadableFile> infile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         infile,
+         arrow::io::ReadableFile::Open("test.parquet"));
+
+      parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};
+
+      std::string article;
+      float price;
+      uint32_t quantity;
+
+      while ( !os.eof() )
+      {
+         os >> article >> price >> quantity >> parquet::EndRow;
+         // ...
+      }
+   }
+
+Writing Parquet files
+=====================
+
+WriteTable
+----------
+
+The :func:`arrow::WriteTable` function writes an entire
+:class:`::arrow::Table` to an output file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 8-9
+   :dedent: 2
+
+.. warning::
+
+   Column compression is off by default in C++. See below for how to choose a

Review Comment:
   It is unclear where below is without reviewing the page's table of contents 
-- could include an internal reference link?



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.

Review Comment:
   This happens a few times throughout the article. Why `::arrow::Table` 
instead of `arrow::Table`, or even including the Arrow namespace in the first 
place and just using `Table`?



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the

Review Comment:
   I'd suggest putting a sub-header here -- I was expecting linearity, and had 
to double-take when I realized the code example following this prose does not 
follow from the one above, but is instead another path to file reading. 



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing to reduce number of API calls:
+
+.. code-block:: cpp
+
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.enable_buffered_stream();
+   reader_properties.set_buffer_size(4096 * 4); // This is default value
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do not turn on read coalescing (pre-buffering).
+#. Read data in batches.
+#. Turn off ``use_buffered_stream``.
+
+In addition, if you know certain columns contain many repeated values, you can
+read them as dictionary encoded columns. This is enabled with the ``set_read_dictionary``
+setting on :class:`ArrowReaderProperties`. If the files were written with Arrow
+C++ and the ``store_schema`` was activated, then the original Arrow schema will
+be automatically read and will override this setting.
+
+StreamReader
+------------
+
+The :class:`StreamReader` allows for Parquet files to be read using
+standard C++ input operators which ensures type-safety.
+
+Please note that types must match the schema exactly i.e. if the
+schema field is an unsigned 16-bit integer then you must supply a
+uint16_t type.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to read field by supplying the incorrect type.
+
+* Attempt to read beyond end of row.
+
+* Attempt to read beyond end of file.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_reader.h"
+
+   {
+      std::shared_ptr<arrow::io::ReadableFile> infile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         infile,
+         arrow::io::ReadableFile::Open("test.parquet"));
+
+      parquet::StreamReader os{parquet::ParquetFileReader::Open(infile)};
+
+      std::string article;
+      float price;
+      uint32_t quantity;
+
+      while ( !os.eof() )
+      {
+         os >> article >> price >> quantity >> parquet::EndRow;
+         // ...
+      }
+   }
+
+Writing Parquet files
+=====================
+
+WriteTable
+----------
+
+The :func:`arrow::WriteTable` function writes an entire
+:class:`::arrow::Table` to an output file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 8-9
+   :dedent: 2
+
+.. warning::
+
+   Column compression is off by default in C++. See below for how to choose a
+   compression codec in the writer properties.
+
+To write out data batch-by-batch, use :class:`arrow::FileWriter`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status WriteInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 12-13,20,24
+   :dedent: 2
+
+.. seealso::
+
+   For reading multi-file datasets or pushing down filters to prune row groups,
+   see :ref:`Tabular Datasets<cpp-dataset>`.
+
+StreamWriter
+------------
+
+The :class:`StreamWriter` allows for Parquet files to be written using
+standard C++ output operators.  This type-safe approach also ensures
+that rows are written without omitting fields and allows for new row
+groups to be created automatically (after a certain volume of data) or
+explicitly by using the :type:`EndRowGroup` stream modifier.
+
+Exceptions are used to signal errors.  A :class:`ParquetException` is
+thrown in the following circumstances:
+
+* Attempt to write a field using an incorrect type.
+
+* Attempt to write too many fields in a row.
+
+* Attempt to skip a required field.
+
+.. code-block:: cpp
+
+   #include "arrow/io/file.h"
+   #include "parquet/stream_writer.h"
+
+   {
+      std::shared_ptr<arrow::io::FileOutputStream> outfile;
+
+      PARQUET_ASSIGN_OR_THROW(
+         outfile,
+         arrow::io::FileOutputStream::Open("test.parquet"));
+
+      parquet::WriterProperties::Builder builder;
+      std::shared_ptr<parquet::schema::GroupNode> schema;
+
+      // Set up builder with required compression type etc.
+      // Define schema.
+      // ...
+
+      parquet::StreamWriter os{
+         parquet::ParquetFileWriter::Open(outfile, schema, builder.build())};
+
+      // Loop over some data structure which provides the required
+      // fields to be written and write each row.
+      for (const auto& a : getArticles())
+      {
+         os << a.name() << a.price() << a.quantity() << parquet::EndRow;
+      }
+   }
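+
+A new row group can also be started explicitly (a small sketch, reusing
+the ``os`` writer above):
+
+.. code-block:: cpp
+
+   os << parquet::EndRowGroup;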
+
+Writer properties

Review Comment:
   It is not clear in this article where the `WriterProperties` could be used 
once it is built -- could this include a block that shows the use of the 
properties, like in the reading example? 
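   
   Something like this, perhaps (untested sketch; `SNAPPY` is only an example codec):
   
   ```cpp
   parquet::WriterProperties::Builder builder;
   builder.compression(parquet::Compression::SNAPPY);
   std::shared_ptr<parquet::WriterProperties> props = builder.build();
   
   // The built properties are then consumed when opening the writer:
   parquet::StreamWriter os{
      parquet::ParquetFileWriter::Open(outfile, schema, props)};
   ```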



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`

Review Comment:
   ```suggestion
   :class:`arrow::FileReaderBuilder` helper class, when paired with the :class:`ReaderProperties`
   ```
   
   It appears to me that you use the property classes in tandem with the 
`FileReaderBuilder`, so it seems worth being explicit about that.



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.

Review Comment:
   ```suggestion
   For reading as a stream of batches, use the :class:`arrow::RecordBatchReader`, which you can get via :func:`arrow::FileReader::GetRecordBatchReader`.
   ```
   
   



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing to reduce number of API calls:
+
+.. code-block:: cpp
+
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.enable_buffered_stream();
+   reader_properties.set_buffer_size(4096 * 4); // This is default value
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do not turn on read coalescing (pre-buffering).

Review Comment:
   Could drop this pre-buffering if you add the one above.



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is

Review Comment:
   Is this value by value, or a choice between reading full rows or full 
columns?



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the :class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the :func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing to reduce number of API calls:
+
+.. code-block:: cpp
+
+   auto reader_properties = parquet::ReaderProperties(pool);
+   reader_properties.enable_buffered_stream();
+   reader_properties.set_buffer_size(4096 * 4); // This is default value
+
+The defaults are generally tuned towards good performance, but parallel column
+decoding is off by default. Enable it in the constructor of :class:`ArrowReaderProperties`:
+
+.. code-block:: cpp
+
+   auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
+
+If memory efficiency is more important than performance, then:
+
+#. Do not turn on read coalescing (pre-buffering).
+#. Read data in batches.
+#. Turn off ``use_buffered_stream``.
+
+In addition, if you know certain columns contain many repeated values, you can
+read them as dictionary encoded columns. This is enabled with the ``set_read_dictionary``

Review Comment:
   Do dictionary-encoded columns come up before here in the Arrow 
documentation? I don't remember them off the top of my head.



##########
docs/source/cpp/parquet.rst:
##########
@@ -32,6 +32,298 @@ is a space-efficient columnar storage format for complex 
data.  The Parquet
 C++ implementation is part of the Apache Arrow project and benefits
 from tight integration with the Arrow C++ classes and facilities.
 
+Reading Parquet files
+=====================
+
+The :class:`arrow::FileReader` class reads data for an entire
+file or row group into an :class:`::arrow::Table`.
+
+The :class:`StreamReader` and :class:`StreamWriter` classes allow for
+data to be written using a C++ input/output streams approach to
+read/write fields column by column and row by row.  This approach is
+offered for ease of use and type-safety.  It is of course also useful
+when data must be streamed as files are read and written
+incrementally.
+
+Please note that the performance of the :class:`StreamReader` and
+:class:`StreamWriter` classes will not be as good due to the type
+checking and the fact that column values are processed one at a time.
+
+FileReader
+----------
+
+The Parquet :class:`arrow::FileReader` requires a
+:class:`::arrow::io::RandomAccessFile` instance representing the input
+file.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadFullFile(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 9-10,14
+   :dedent: 2
+
+Finer-grained options are available through the
+:class:`arrow::FileReaderBuilder` helper class, and the 
:class:`ReaderProperties`
+and :class:`ArrowReaderProperties` classes.
+
+For reading as a stream of batches, use the 
:func:`arrow::FileReader::GetRecordBatchReader`.
+It will use the batch size set in :class:`ArrowReaderProperties`.
+
+.. literalinclude:: ../../../cpp/examples/arrow/parquet_read_write.cc
+   :language: cpp
+   :start-after: arrow::Status ReadInBatches(
+   :end-before: return arrow::Status::OK();
+   :emphasize-lines: 25
+   :dedent: 2
+
+Performance and Memory Efficiency
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+For remote filesystems, use read coalescing to reduce number of API calls:

Review Comment:
   ```suggestion
   For remote filesystems, use read coalescing (pre-buffering) to reduce number of API calls:
   ```


