westonpace commented on PR #13938:
URL: https://github.com/apache/arrow/pull/13938#issuecomment-1224349673

   > Hmm, it would be nice if this could work even without disabling schema 
evolution. Perhaps a heuristic is possible?
   >
   >    if a field name is unique, do as usual
   >    if a field name is non-unique, require that it has the same number of 
occurrences in both schemas, and iterate on those pairs in order
   
   That would work.  I'm not really opposed to it.  Though it seems like it 
would be very rare that this is the correct behavior.  I think we would just be 
hiding the corner case rather than really resolving it.  Either a user is 
creating files with consistent column ordering, in which case duplicates are 
fine, or they are not creating files with consistent column ordering, in which 
case duplicates are a problem.  It would be rather odd for a user to have 
inconsistent column ordering except in the case of duplicate column names.
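
   For concreteness, the quoted heuristic could be sketched roughly as below. This is only an illustration in plain Python, not the datasets code: `pair_fields` and its arguments are hypothetical names, and schemas are reduced to lists of field names. A field missing from the file maps to `None`, on the assumption that schema evolution would fill it with nulls.

```python
from collections import defaultdict


def pair_fields(dataset_names, file_names):
    """Map each dataset field index to a file field index (or None).

    Hypothetical helper illustrating the proposed heuristic; it is not
    part of Arrow's API.
    """
    def positions(names):
        # Record every index at which each name occurs, in order.
        by_name = defaultdict(list)
        for i, name in enumerate(names):
            by_name[name].append(i)
        return by_name

    ds_pos = positions(dataset_names)
    file_pos = positions(file_names)
    mapping = [None] * len(dataset_names)

    for name, ds_idx in ds_pos.items():
        f_idx = file_pos.get(name, [])
        if len(ds_idx) == 1 and len(f_idx) <= 1:
            # Unique name: match by name as usual; absent stays None.
            if f_idx:
                mapping[ds_idx[0]] = f_idx[0]
        elif len(ds_idx) == len(f_idx):
            # Duplicated name with equal counts: pair occurrences in order.
            for d, f in zip(ds_idx, f_idx):
                mapping[d] = f
        else:
            raise ValueError(
                f"field {name!r} occurs {len(ds_idx)} time(s) in the dataset "
                f"schema but {len(f_idx)} time(s) in the file schema"
            )
    return mapping
```

   With `pair_fields(["a", "b", "a"], ["b", "a", "a"])` the two `a` columns are paired purely by their relative order, which is only correct if the files were written with a consistent ordering among the duplicates.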
   
   > Indeed, we shouldn't disallow this since Parquet itself allows duplicate 
field names. And our Parquet reader actually also can read such files (it's 
only the dataset code that fails):
   
   Yes, it is not at all a problem when you only have one file and I agree the 
datasets code should be updated to handle this single-file case correctly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
