westonpace commented on code in PR #14452:
URL: https://github.com/apache/arrow/pull/14452#discussion_r1000008261


##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.

Review Comment:
   ```suggestion
   to create Arrow Tables or a stream of Arrow RecordBatches.
   ```
   
   This sentence feels slightly grammatically off to me.  It sounds like the 
input data is tables or batches but really it's the output.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.

Review Comment:
   Close enough I suppose.  The streaming reader does use some threads and will 
(hopefully) use more in the future.  But I don't know that we can meaningfully 
explain that here without going too far into the details.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }

Review Comment:
   ```suggestion
         while (true) {
           // Attempt to read the first RecordBatch
           arrow::Status status = reader->ReadNext(&batch);
   
           if (!status.ok()) {
             // Handle read error
           }
   
           if (batch == NULL) {
             // Handle end of file
             break;
           }
           
           // Do something with the batch
         }
   ```
   
   I know it is an example but I feel we should show actual streaming 
consumption.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.

Review Comment:
   Or else what?  It may not be clear to the user this would lead to an error 
like "XYZ is not a valid float32"



##########
docs/source/cpp/csv.rst:
##########
@@ -25,15 +25,26 @@ Reading and Writing CSV files
 =============================
 
 Arrow provides a fast CSV reader allowing ingestion of external data
-as Arrow tables.
+as Arrow Tables or streamed as Arrow RecordBatches.
 
 .. seealso::
    :ref:`CSV reader/writer API reference <cpp-api-csv>`.
 
-Basic usage
-===========
+Reading CSV files
+=================
+
+Data in a CSV file can either be read in as a single Arrow Table using
+:class:`~arrow::csv::TableReader` or streamed as RecordBatches using
+:class:`~arrow::csv::StreamingReader`. See :ref:`Tradeoffs 
<cpp-csv-tradeoffs>` for a
+discussion of the tradeoffs between the two methods.
+
+Both these readers require an :class:`arrow::io::InputStream` instance

Review Comment:
   Not a problem at the moment.  However, there is [a 
PR](https://github.com/apache/arrow/pull/14269) which would allow you to get 
slightly better performance if you use a random access file (typically if 
you're connected to cloud storage) instead of an input stream.  So we might 
need to update this text after that pull merges.



##########
docs/source/cpp/csv.rst:
##########
@@ -56,19 +67,84 @@ A CSV file is read from a :class:`~arrow::io::InputStream`.
                                       parse_options,
                                       convert_options);
       if (!maybe_reader.ok()) {
-         // Handle TableReader instantiation error...
+        // Handle TableReader instantiation error...
       }
       std::shared_ptr<arrow::csv::TableReader> reader = *maybe_reader;
 
       // Read table from CSV file
       auto maybe_table = reader->Read();
       if (!maybe_table.ok()) {
-         // Handle CSV read error
-         // (for example a CSV syntax error or failed type conversion)
+        // Handle CSV read error
+        // (for example a CSV syntax error or failed type conversion)
       }
       std::shared_ptr<arrow::Table> table = *maybe_table;
    }
 
+StreamingReader
+---------------
+
+.. code-block:: cpp
+
+   #include "arrow/csv/api.h"
+
+   {
+      // ...
+      arrow::io::IOContext io_context = arrow::io::default_io_context();
+      std::shared_ptr<arrow::io::InputStream> input = ...;
+
+      auto read_options = arrow::csv::ReadOptions::Defaults();
+      auto parse_options = arrow::csv::ParseOptions::Defaults();
+      auto convert_options = arrow::csv::ConvertOptions::Defaults();
+
+      // Instantiate StreamingReader from input stream and options
+      auto maybe_reader =
+        arrow::csv::StreamingReader::Make(io_context,
+                                          input,
+                                          read_options,
+                                          parse_options,
+                                          convert_options);
+      if (!maybe_reader.ok()) {
+        // Handle StreamingReader instantiation error...
+      }
+      std::shared_ptr<arrow::csv::StreamingReader> reader = *maybe_reader;
+
+      // Set aside a RecordBatch pointer for re-use while streaming
+      std::shared_ptr<RecordBatch> batch;
+
+      // Attempt to read the first RecordBatch
+      arrow::Status status = reader->ReadNext(&batch);
+
+      if (!status.ok()) {
+        // Handle read error
+      }
+
+      if (batch == NULL) {
+        // Handle end of file
+      }
+   }
+
+.. _cpp-csv-tradeoffs:
+
+Tradeoffs
+---------
+
+The choice between using :class:`~arrow::csv::TableReader` or
+:class:`~arrow::csv::StreamingReader` will depend on your use case but two
+caveats are worth pointing out:
+
+1. :class:`~arrow::csv::TableReader` is capable of using multiple threads (See
+   :ref:`Performance <cpp-csv-performance>`) whereas
+   :class:`~arrow::csv::StreamingReader` is always single-threaded and will
+   ignore :member:`ReadOptions::use_threads`.
+2. :class:`~arrow::csv::StreamingReader` performs type inference off the first
+   block that's read in, after which point the types are frozen. Either set
+   :member:`ReadOptions::block_size` to a large enough value or use
+   :member:`ConvertOptions::column_types` to set the desired data types
+   explicitly.
+

Review Comment:
   These are all cons :)
   
   The table reader requires loading all of the data into memory.  This is a 
pretty significant tradeoff we should point out here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to