pitrou commented on code in PR #45330:
URL: https://github.com/apache/arrow/pull/45330#discussion_r1950518354


##########
cpp/src/arrow/dataset/scanner.cc:
##########
@@ -318,10 +319,14 @@ Result<EnumeratedRecordBatchGenerator> FragmentToBatches(
       RecordBatch::Make(options->dataset_schema, /*num_rows=*/0, 
std::move(columns)));
   auto enumerated_batch_gen = MakeEnumeratedGenerator(std::move(batch_gen));
 
-  auto combine_fn =
-      [fragment](const Enumerated<std::shared_ptr<RecordBatch>>& record_batch) 
{
-        return EnumeratedRecordBatch{record_batch, fragment};
-      };
+  auto combine_fn = [fragment, cache_metadata = options->cache_metadata](
+                        const Enumerated<std::shared_ptr<RecordBatch>>& 
record_batch) {
+    if (!cache_metadata && record_batch.last) {
+      ARROW_WARN_NOT_OK(fragment.value->ClearCachedMetadata(),
+                        "Could not clear cached metadata on fragment");
+    }

Review Comment:
   Ok, if we want to do something like that, perhaps it would be even better to 
have a separate utility that calls a continuation when a generator reached its 
end. I'm lukewarm towards this because:
   1) it's much more complicated than just checking the `last` flag, which is 
there for a reason
   2) there may be a delay between receiving the item with the `last` flag, and 
receiving the iteration-end item that follows; this would go counter the goal 
of releasing memory as promptly as possible
   
   What do you think?



##########
cpp/src/arrow/dataset/scanner.cc:
##########
@@ -318,10 +319,14 @@ Result<EnumeratedRecordBatchGenerator> FragmentToBatches(
       RecordBatch::Make(options->dataset_schema, /*num_rows=*/0, 
std::move(columns)));
   auto enumerated_batch_gen = MakeEnumeratedGenerator(std::move(batch_gen));
 
-  auto combine_fn =
-      [fragment](const Enumerated<std::shared_ptr<RecordBatch>>& record_batch) 
{
-        return EnumeratedRecordBatch{record_batch, fragment};
-      };
+  auto combine_fn = [fragment, cache_metadata = options->cache_metadata](
+                        const Enumerated<std::shared_ptr<RecordBatch>>& 
record_batch) {
+    if (!cache_metadata && record_batch.last) {
+      ARROW_WARN_NOT_OK(fragment.value->ClearCachedMetadata(),
+                        "Could not clear cached metadata on fragment");
+    }

Review Comment:
   Ok, if we want to do something like that, perhaps it would be even better to 
have a separate utility that calls a continuation when a generator reaches its 
end. I'm lukewarm towards this because:
   1) it's much more complicated than just checking the `last` flag, which is 
there for a reason
   2) there may be a delay between receiving the item with the `last` flag, and 
receiving the iteration-end item that follows; this would go counter the goal 
of releasing memory as promptly as possible
   
   What do you think?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to