Re: [PR] fix(table/scanner): Fix nested field scan [iceberg-go]

via GitHub Wed, 26 Feb 2025 08:56:44 -0800


zeroshade commented on code in PR #311:
URL: https://github.com/apache/iceberg-go/pull/311#discussion_r1971976048



##########
table/arrow_utils.go:
##########
@@ -497,7 +497,10 @@ func (c convertToArrow) Field(field iceberg.NestedField, 
result arrow.Field) arr
 
 func (c convertToArrow) List(list iceberg.ListType, elemResult arrow.Field) 
arrow.Field {
        elemField := c.Field(list.ElementField(), elemResult)
-       return arrow.Field{Type: arrow.LargeListOfField(elemField)}
+       if c.useLargeTypes {

Review Comment:
   That's actually pretty difficult to do, and would make schema conversion 
inconsistent.
   
   Determining whether to use the large types cannot be determined until you 
have the data at which point, if you're streaming the data, it's too late to 
switch as you can't safely change the schema during a stream. For example:
   
   1. Start reading parquet file with string column, total column data in this 
file is only 100MB so we use a regular string (small type) and start streaming 
record batches
   2. one of the last files has 3GB of raw data in the string column so we use 
LargeString for that column from that file
   3. We can't cast the LargeString to String, we can't change the schema of 
the stream to change the column to LargeString, so now we're stuck.
   
   The same problem can occur for List/LargeList depending on the number of 
total elements in a given column among the lists.
   
   We also can't determine ahead of time by the stats in the iceberg metadata 
alone whether or not we should use Large types or not. The only way to know 
ahead of time is to read the parquet file metadata for every data file *before* 
we start producing record batches and then reconcile whether or not we should 
use large types before we start processing the files.
   
   It looks like in the pyiceberg PR you linked, if i'm reading it correctly, 
you just automatically push everything to large types for streaming to avoid 
the problems I mentioned above? which kinda defeats the benefits if the goal 
was to avoid using LargeTypes when they aren't needed (in most cases they 
aren't needed).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] fix(table/scanner): Fix nested field scan [iceberg-go]

Reply via email to