wesm commented on a change in pull request #8188:
URL: https://github.com/apache/arrow/pull/8188#discussion_r495174760



##########
File path: c_glib/test/dataset/test-scan-options.rb
##########
@@ -28,7 +28,7 @@ def test_schema
   end
 
   def test_batch_size
-    assert_equal(1<<15,
+    assert_equal(1<<20,

Review comment:
       Should probably be a constant, but doesn't need to be fixed here

##########
File path: cpp/src/arrow/dataset/scanner.h
##########
@@ -73,7 +73,7 @@ class ARROW_DS_EXPORT ScanOptions {
   RecordBatchProjector projector;
 
   // Maximum row count for scanned batches.
-  int64_t batch_size = 1 << 15;
+  int64_t batch_size = 1 << 20;

Review comment:
       Should this be a constant `kDefaultBatchSize`?
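A minimal sketch of what the reviewer is suggesting, assuming the constant lives alongside `ScanOptions`. The name `kDefaultBatchSize` is the reviewer's proposed name, and the `ScanOptions` struct here is a stripped-down stand-in, not the actual Arrow class:

```cpp
#include <cstdint>

// Hypothetical: hoist the magic number into a named constant so the
// C++ default and the language-binding tests can reference one value.
constexpr int64_t kDefaultBatchSize = 1 << 20;

// Stand-in for arrow::dataset::ScanOptions.
struct ScanOptions {
  // Maximum row count for scanned batches.
  int64_t batch_size = kDefaultBatchSize;
};
```

This would also address the c_glib test comment above: a binding-level constant could mirror `kDefaultBatchSize` instead of hard-coding `1<<20` in the test.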

##########
File path: ci/scripts/python_test.sh
##########
@@ -29,4 +29,4 @@ export LD_LIBRARY_PATH=${ARROW_HOME}/lib:${LD_LIBRARY_PATH}
 # Enable some checks inside Python itself
 export PYTHONDEVMODE=1
 
-pytest -r s --pyargs pyarrow
+pytest -r s -vvv --pyargs pyarrow

Review comment:
       Revert this?

##########
File path: cpp/src/parquet/arrow/reader.cc
##########
@@ -856,18 +856,32 @@ Status FileReaderImpl::GetRecordBatchReader(const std::vector<int>& row_groups,
     return Status::OK();
   }
 
+  int64_t num_rows = 0;
+  for (int row_group : row_groups) {
+    num_rows += parquet_reader()->metadata()->RowGroup(row_group)->num_rows();
+  }

Review comment:
       Should this be a helper method on the FileMetaData?
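A sketch of the helper the reviewer has in mind, using minimal stand-ins for the Parquet metadata classes. The method name `TotalRowsOf` is illustrative only; it is not an existing `parquet::FileMetaData` API:

```cpp
#include <cstdint>
#include <vector>

// Stand-in for parquet::RowGroupMetaData.
struct RowGroupMetaData {
  int64_t num_rows_;
  int64_t num_rows() const { return num_rows_; }
};

// Stand-in for parquet::FileMetaData with the suggested helper.
struct FileMetaData {
  std::vector<RowGroupMetaData> row_groups;

  RowGroupMetaData RowGroup(int i) const { return row_groups[i]; }

  // Hypothetical helper: total row count across the selected row groups,
  // replacing the loop currently inlined in GetRecordBatchReader.
  int64_t TotalRowsOf(const std::vector<int>& row_group_indices) const {
    int64_t num_rows = 0;
    for (int i : row_group_indices) {
      num_rows += RowGroup(i).num_rows();
    }
    return num_rows;
  }
};
```

With such a helper, the call site in `reader.cc` would reduce to a single expression over the selected row groups.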




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org