starpact commented on code in PR #799:
URL: https://github.com/apache/iceberg-go/pull/799#discussion_r2982710153


##########
table/arrow_utils.go:
##########
@@ -1343,72 +1344,84 @@ func computeStatsPlan(sc *iceberg.Schema, props 
iceberg.Properties) (map[int]tbl
        return result, nil
 }
 
-func filesToDataFiles(ctx context.Context, fileIO iceio.IO, meta 
*MetadataBuilder, paths iter.Seq[string]) iter.Seq2[iceberg.DataFile, error] {
-       return func(yield func(iceberg.DataFile, error) bool) {
-               defer func() {
-                       if r := recover(); r != nil {
-                               switch e := r.(type) {
-                               case string:
-                                       yield(nil, fmt.Errorf("error 
encountered during file conversion: %s", e))
-                               case error:
-                                       yield(nil, fmt.Errorf("error 
encountered during file conversion: %w", e))
-                               }
-                       }
-               }()
-
-               partitionSpec, err := meta.CurrentSpec()
-               if err != nil || partitionSpec == nil {
-                       yield(nil, fmt.Errorf("%w: cannot add files without a 
current spec", err))
-
-                       return
-               }
+func filesToDataFiles(ctx context.Context, fileIO iceio.IO, meta 
*MetadataBuilder, filePaths []string, concurrency int) (_ []iceberg.DataFile, 
err error) {

Review Comment:
   I think the returned `dataFiles` should keep the same order of input 
`filePaths`. It's possible to achieve it while maintaining an iterator 
interface but I feel the extra complexity isn't worth it.
   
   Generally I don't think implementing this as an iterator brings practical 
benefits:
   1. performance wise: it's lazy but all `DataFile`s are materialized as full 
slice `snapshotProducer.addedFiles` later anyway
   2. parallelizing array processing is more straightforward



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to