westonpace commented on PR #13938: URL: https://github.com/apache/arrow/pull/13938#issuecomment-1224349673
> Hmm, it would be nice if this could work even without disabling schema evolution. Perhaps a heuristic is possible?
>
> - if a field name is unique, do as usual
> - if a field name is non-unique, require that it has the same number of occurrences in both schema, and iterate on those pairs in order

That would work, and I'm not really opposed to it. However, it seems like it would be very rare for this to be the correct behavior. I think we would just be hiding the corner case rather than really resolving it. Either a user is creating files with consistent column ordering, in which case duplicates are fine, or they are not creating files with consistent column ordering, in which case duplicates are a problem. It would be rather odd for a user to have inconsistent column ordering except in the case of duplicate column names.

> Indeed, we shouldn't disallow this since Parquet itself allows duplicate field names. And our Parquet reader actually also can read such files (it's only the dataset code that fails):

Yes, it is not at all a problem when you only have one file, and I agree the datasets code should be updated to handle this single-file case correctly.
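
For concreteness, the heuristic quoted above could be sketched roughly as follows. This is only an illustration in plain Python, not Arrow's dataset code: `pair_fields` is a hypothetical helper, schemas are represented as plain lists of field names, and I am assuming that "do as usual" means matching by name, with a missing column mapped to `None`.

```python
from collections import defaultdict


def pair_fields(dataset_names, fragment_names):
    """Pair field indices of a dataset schema with those of a fragment schema.

    Hypothetical helper for illustration only (not an Arrow API).
    Returns a list of (dataset_index, fragment_index) tuples, where
    fragment_index is None if a uniquely named field is absent from the
    fragment.
    """
    def positions(names):
        # Map each field name to the list of indices where it occurs.
        pos = defaultdict(list)
        for i, name in enumerate(names):
            pos[name].append(i)
        return pos

    ds_pos = positions(dataset_names)
    frag_pos = positions(fragment_names)

    pairs = []
    for name, ds_idx in ds_pos.items():
        frag_idx = frag_pos.get(name, [])
        if len(ds_idx) == 1 and len(frag_idx) <= 1:
            # Unique name: match by name (assumed "as usual" behavior).
            pairs.append((ds_idx[0], frag_idx[0] if frag_idx else None))
        elif len(ds_idx) == len(frag_idx):
            # Duplicated name with equal counts: pair occurrences in order.
            pairs.extend(zip(ds_idx, frag_idx))
        else:
            raise ValueError(
                f"field {name!r} occurs {len(ds_idx)} time(s) in the dataset "
                f"schema but {len(frag_idx)} time(s) in the fragment schema"
            )
    # Report the pairs in dataset column order.
    return sorted(pairs, key=lambda p: p[0])


# A fragment whose columns are reordered relative to the dataset schema.
print(pair_fields(["a", "x", "x"], ["x", "a", "x"]))
```

Note that the duplicate `x` columns are paired purely by order of occurrence. If the writer of the second file had swapped them, this sketch would pair them the same way without any error, which is the corner case I am worried about hiding.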
