lidavidm commented on a change in pull request #9656:
URL: https://github.com/apache/arrow/pull/9656#discussion_r590544791



##########
File path: cpp/src/arrow/ipc/reader.cc
##########
@@ -1022,6 +1209,31 @@ class RecordBatchFileReaderImpl : public RecordBatchFileReader {
 
   ReadStats stats() const override { return stats_; }
 
+  Result<AsyncGenerator<std::shared_ptr<RecordBatch>>> GetRecordBatchGenerator(
+      int readahead_messages, const io::IOContext& io_context) override {
+    auto state = std::make_shared<IpcFileRecordBatchGeneratorState>();
+    state->num_dictionaries_ = num_dictionaries();
+    state->num_record_batches_ = num_record_batches();
+    state->file_ = file_;
+    state->options_ = options_;
+    state->owned_file_ = owned_file_;
+    state->footer_buffer_ = footer_buffer_;
+    state->footer_ = footer_;
+    // Must regenerate uncopyable DictionaryMemo
+    RETURN_NOT_OK(UnpackSchemaMessage(state->footer_->schema(), state->options_,
+                                      &state->dictionary_memo_, &state->schema_,
+                                      &state->out_schema_, &state->field_inclusion_mask_,
+                                      &state->swap_endian_));
+    AsyncGenerator<std::shared_ptr<Message>> message_generator =
+        IpcMessageGenerator(state, io_context);
+    if (readahead_messages > 0) {
+      message_generator =
+          MakeReadaheadGenerator(std::move(message_generator), readahead_messages);
+    }
+    return IpcFileRecordBatchGenerator(state, message_generator,
+                                       arrow::internal::GetCpuThreadPool());

Review comment:
       Looking at the ReadFile benchmark, it seems reading a 1 MiB batch takes about 1 ms once there are >= 1024 columns.
   
   ```
   ------------------------------------------------------------------------------------------
   Benchmark                                Time             CPU   Iterations UserCounters...
   ------------------------------------------------------------------------------------------
   ReadFile/1/real_time                  8130 ns         8130 ns        86111 bytes_per_second=120.115G/s
   ReadFile/4/real_time                 10734 ns        10734 ns        65153 bytes_per_second=90.9826G/s
   ReadFile/16/real_time                21779 ns        21779 ns        32081 bytes_per_second=44.8389G/s
   ReadFile/64/real_time                67087 ns        67086 ns        10189 bytes_per_second=14.5567G/s
   ReadFile/256/real_time              274905 ns       274901 ns         2543 bytes_per_second=3.55236G/s
   ReadFile/1024/real_time            1074018 ns      1074004 ns          650 bytes_per_second=931.083M/s
   ReadFile/4096/real_time            4307403 ns      4307316 ns          164 bytes_per_second=232.158M/s
   ReadFile/8192/real_time            8266500 ns      8266343 ns           84 bytes_per_second=120.97M/s
   ```
   
   So I'll change this to not use a separate thread pool by default. (I'd also like to evaluate this benchmark when compression is involved, though.)
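
   For reference, here is a minimal, hypothetical sketch of how a caller might drive the returned `AsyncGenerator<std::shared_ptr<RecordBatch>>` and collect the decoded batches, assuming the `GetRecordBatchGenerator(readahead_messages, io_context)` signature shown in this revision of the diff. The `ReadAllBatches` helper, the readahead value of 8, and the use of `default_io_context()` / `CollectAsyncGenerator` are illustrative assumptions, not part of the change:
   
   ```
   // Hypothetical usage sketch (not part of the PR): read an IPC file and
   // collect all record batches by draining the async generator.
   #include <memory>
   #include <string>
   #include <vector>
   
   #include "arrow/io/file.h"
   #include "arrow/ipc/reader.h"
   #include "arrow/record_batch.h"
   #include "arrow/result.h"
   #include "arrow/util/async_generator.h"
   
   arrow::Result<std::vector<std::shared_ptr<arrow::RecordBatch>>> ReadAllBatches(
       const std::string& path) {
     ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));
     ARROW_ASSIGN_OR_RAISE(auto reader,
                           arrow::ipc::RecordBatchFileReader::Open(file));
     // Signature as it appears in this revision of the PR; the readahead
     // count of 8 is an arbitrary placeholder.
     ARROW_ASSIGN_OR_RAISE(auto generator,
                           reader->GetRecordBatchGenerator(
                               /*readahead_messages=*/8,
                               arrow::io::default_io_context()));
     // CollectAsyncGenerator keeps pulling futures from the generator until
     // end-of-stream and resolves to a vector of all yielded batches.
     auto batches_fut = arrow::CollectAsyncGenerator(std::move(generator));
     ARROW_ASSIGN_OR_RAISE(auto batches, batches_fut.result());
     return batches;
   }
   ```
   
   Under the proposal above, the decode work triggered by pulling on the generator would not be handed off to a separate CPU thread pool by default.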



