wesm commented on a change in pull request #7534:
URL: https://github.com/apache/arrow/pull/7534#discussion_r453366215



##########
File path: rust/datafusion/README.md
##########
@@ -69,5 +69,15 @@ DataFusion includes a simple command-line interactive SQL utility. See the [CLI
 - [x] Parquet primitive types
 - [ ] Parquet nested types
 
-# Examples
+# Supported SQL
 
+This library currently supports the following SQL constructs:
+
+* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
+* `SELECT ... FROM ...` together with any expression
+* `ALIAS` to name an expression
+* `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
+* most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
+* `WHERE` to filter
+* `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
+* `ORDER BY` together with an expression and optional `DESC`

Review comment:
       This shouldn't be here

##########
File path: cpp/src/parquet/arrow/reader.h
##########
@@ -151,8 +151,10 @@ class PARQUET_EXPORT FileReader {
       int i, std::shared_ptr<::arrow::ChunkedArray>* out) = 0;
 
   /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
-  ///    ordering in row_group_indices matters.
-  /// \returns error Status if row_group_indices contains invalid index
+  ///     ordering in row_group_indices matters. Each row group will be held in memory until
+  ///     each slice has been yielded as a RecordBatch, at which point the next row group
+  ///     will be loaded. FileReaders must outlive their RecordBatchReaders.

Review comment:
       Probably want to add a TODO here that we want to fix this (this is a problem for memory use)
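
To make the lifetime requirement concrete, here is a rough usage sketch of reading selected row groups through `GetRecordBatchReader`. The helper name `ScanFirstRowGroups` is made up, and the exact out-parameter types (`shared_ptr` vs. `unique_ptr`) have varied between Arrow versions, so treat this as an illustration rather than the definitive API:

```cpp
#include <memory>
#include <string>
#include <vector>

#include <arrow/io/file.h>
#include <arrow/memory_pool.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <parquet/arrow/reader.h>

// Sketch: read the first two row groups of a Parquet file batch by batch.
// The FileReader owns the state backing the stream, so it must stay alive
// until the RecordBatchReader it handed out has been fully consumed.
arrow::Status ScanFirstRowGroups(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  std::unique_ptr<parquet::arrow::FileReader> file_reader;
  ARROW_RETURN_NOT_OK(parquet::arrow::OpenFile(file, arrow::default_memory_pool(),
                                               &file_reader));

  std::shared_ptr<arrow::RecordBatchReader> batch_reader;
  ARROW_RETURN_NOT_OK(file_reader->GetRecordBatchReader({0, 1}, &batch_reader));

  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(batch_reader->ReadNext(&batch));
    if (batch == nullptr) break;  // end of stream
    // ... process `batch`; the current row group stays fully in memory
    //     until all of its slices have been yielded here.
  }
  return arrow::Status::OK();
}
```

Destroying `file_reader` before the loop finishes would leave `batch_reader` dangling, which is exactly the lifetime constraint the doc comment above warns about.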

##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -794,8 +773,35 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_group_in
     END_PARQUET_CATCH_EXCEPTIONS
   }
 
-  return RowGroupRecordBatchReader::Make(row_group_indices, column_indices, this,
-                                         reader_properties_.batch_size(), out);
+  using ::arrow::RecordBatchIterator;
+
+  // NB: This lambda will be invoked lazily whenever a new row group must be
+  // scanned, so it must capture `column_indices` by value (it will not
+  // otherwise outlive the scope of `GetRecordBatchReader()`). `this` is a non-owning
+  // pointer so we are relying on the parent FileReader outliving this RecordBatchReader.
+  auto row_group_index_to_batch_iterator =
+      [column_indices, this](const int* i) -> ::arrow::Result<RecordBatchIterator> {
+    std::shared_ptr<::arrow::Table> table;
+    RETURN_NOT_OK(RowGroup(*i)->ReadTable(column_indices, &table));

Review comment:
       It isn't great to be reading the whole row group into memory before returning it in chunks. Is there an issue about materializing only a chunk at a time (or a background actor-type reader that will prepare/materialize the next chunk asynchronously in between calls to `Next`)?
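
As a sketch of the "background actor-type reader" idea, here is a small, self-contained C++ illustration that is independent of Arrow; the `PrefetchingReader` name and its producer callback are hypothetical, and a real implementation would also need cancellation and error propagation:

```cpp
#include <functional>
#include <future>
#include <optional>
#include <utility>

// Wraps a synchronous producer (e.g. "read and slice the next row group") and
// keeps one item in flight on a background task, so the next item is already
// being materialized while the caller processes the current one.
template <typename T>
class PrefetchingReader {
 public:
  using Producer = std::function<std::optional<T>()>;

  explicit PrefetchingReader(Producer produce) : produce_(std::move(produce)) {
    // Start materializing the first item immediately.
    pending_ = std::async(std::launch::async, produce_);
  }

  // Returns the next item, or std::nullopt at end of stream. Between calls,
  // the following item is being prepared asynchronously.
  std::optional<T> Next() {
    if (done_) return std::nullopt;
    std::optional<T> current = pending_.get();
    if (current.has_value()) {
      pending_ = std::async(std::launch::async, produce_);
    } else {
      done_ = true;
    }
    return current;
  }

 private:
  Producer produce_;
  std::future<std::optional<T>> pending_;
  bool done_ = false;
};

// Example: prefetch squares of 0..4 while the consumer does other work.
int main() {
  int i = 0;
  PrefetchingReader<int> reader([&i]() -> std::optional<int> {
    if (i >= 5) return std::nullopt;
    int v = i * i;
    ++i;
    return v;
  });
  while (auto v = reader.Next()) {
    // ... consume *v while the next value is computed in the background
  }
  return 0;
}
```

The key property is that `Next()` hands back the already-materialized item and immediately schedules materialization of the following one, so I/O and decoding overlap with the caller's processing.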

##########
File path: cpp/src/parquet/arrow/reader.h
##########
@@ -163,8 +165,11 @@ class PARQUET_EXPORT FileReader {
   /// \brief Return a RecordBatchReader of row groups selected from
   ///     row_group_indices, whose columns are selected by column_indices. The
   ///     ordering in row_group_indices and column_indices matter.
+  ///     Each row group will be held in memory until each slice has been yielded as a
+  ///     RecordBatch, at which point the next row group will be loaded. FileReaders must
+  ///     outlive their RecordBatchReaders.

Review comment:
       Add a TODO that this is considered to be a problem
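
For illustration, the requested TODO could be worded roughly as follows (no JIRA issue number is implied; the exact phrasing and placement are just a suggestion):

```cpp
  ///     outlive their RecordBatchReaders.
  ///
  /// TODO: reading a whole row group into memory at a time is a known
  /// memory-use problem; revisit this to materialize one batch at a time.
```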



