dramaticlly commented on code in PR #15426:
URL: https://github.com/apache/iceberg/pull/15426#discussion_r2849240473
##########
core/src/main/java/org/apache/iceberg/ManifestGroup.java:
##########
@@ -222,13 +222,23 @@ public <T extends ScanTask> CloseableIterable<T>
plan(CreateTasksFunction<T> cre
/**
* Returns an iterable for manifest entries in the set of manifests.
*
- * <p>Entries are not copied and it is the caller's responsibility to make
defensive copies if
- * adding these entries to a collection.
+ * <p>When parallel execution is enabled via {@link
#planWith(ExecutorService)}, entries are
+ * defensively copied since the underlying {@link ManifestReader} reuses
entry objects during
+ * iteration. Otherwise, entries are not copied and it is the caller's
responsibility to make
+ * defensive copies if adding these entries to a collection.
*
* @return a CloseableIterable of manifest entries.
*/
public CloseableIterable<ManifestEntry<DataFile>> entries() {
- return CloseableIterable.concat(entries((manifest, entries) -> entries));
+ if (executorService != null) {
+ // copy entries to avoid object reuse issues when scanning manifests in
parallel,
+ // as ManifestReader reuses entry objects during iteration
+ Iterable<CloseableIterable<ManifestEntry<DataFile>>> entryIterables =
+ entries((manifest, entries) -> CloseableIterable.transform(entries,
ManifestEntry::copy));
Review Comment:
Thanks Russell, I think you raised a good point as existing consumer of this
API might already apply the defensive copy so we dont want to double the
memory. I took the stab to follow what's existing
https://github.com/apache/iceberg/blob/9534c9b3adc29d127ecc541ce131f49fd72f1980/core/src/main/java/org/apache/iceberg/ManifestGroup.java#L180
is doing and add a new method which takes a function for transform.
So testManifestGroupEntriesWithParallelExecution() kind of illustrate that
if we only need filePath, we can skip defensive copy like how we collect
DataFiles in FindFiles::collect
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]