[GitHub] [iceberg] kbendick commented on pull request #3530: Core: Introduce a CharSeq equals method for faster file path filtering


kbendick commented on pull request #3530:
URL: https://github.com/apache/iceberg/pull/3530#issuecomment-967725102



   > My only concern would be what's mentioned on the comment for 
Comparators.charSequences() implementation:
   > 
   > 
https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/api/src/main/java/org/apache/iceberg/types/Comparators.java#L335-L342
   > 
   > Can we run this with path sequences that contain a mixture of 3-byte and 
4-byte UTF-8 characters?
   > 
   > I'm admittedly not sure if it's a huge concern for path filtering, given 
the examples I could find. But I don't know any languages that use anything 
other than the extended Latin alphabet, so there might be some concerns if 
people have paths with certain Chinese characters or characters from other 
languages.
   > 
   > This page has some example elements you could possibly use to try (though 
cat emoji and exclamation point probably aren't in the path on any file systems 
so a more real world example would probably be better): 
https://stackoverflow.com/questions/6063148/java-unicode-where-to-find-example-n-byte-unicode-characters
   
   Reading through it again, my earlier comment shouldn't matter here. We're 
not sorting the strings / char sequences, just checking them for equality, and 
at least one of them is converted with `toString`.
   
   So even if the comparator's ordering is somewhat off lexicographically, I 
don't think it affects the correctness of equals: two sequences with the same 
mixture of 3-byte and 4-byte code points still compare as equal.
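
   To illustrate the point, here is a minimal sketch (not the code in this PR; 
the class and method names `CharSeqEqualsSketch`, `contentEquals`, 
`compareByCodeUnit` and `compareByCodePoint` are made up for the example, and 
the path is invented). It assumes the equality check walks both sequences 
`char` by `char`. A 4-byte UTF-8 character is a surrogate pair in Java, which 
can change ordering relative to a 3-byte character, but it does not change 
whether two identical sequences are equal.

```java
import java.nio.CharBuffer;

public class CharSeqEqualsSketch {

  // Hypothetical helper: content equality, one UTF-16 code unit at a time.
  public static boolean contentEquals(CharSequence a, CharSequence b) {
    if (a == b) return true;
    if (a == null || b == null) return false;
    int len = a.length();
    if (len != b.length()) return false;
    for (int i = 0; i < len; i++) {
      if (a.charAt(i) != b.charAt(i)) return false;
    }
    return true;
  }

  // Orders by UTF-16 code unit, the same way String.compareTo does.
  public static int compareByCodeUnit(CharSequence a, CharSequence b) {
    int n = Math.min(a.length(), b.length());
    for (int i = 0; i < n; i++) {
      int cmp = Character.compare(a.charAt(i), b.charAt(i));
      if (cmp != 0) return cmp;
    }
    return Integer.compare(a.length(), b.length());
  }

  // Orders by Unicode code point, which agrees with UTF-8 byte order.
  public static int compareByCodePoint(CharSequence a, CharSequence b) {
    int i = 0;
    int j = 0;
    while (i < a.length() && j < b.length()) {
      int cpA = Character.codePointAt(a, i);
      int cpB = Character.codePointAt(b, j);
      if (cpA != cpB) return Integer.compare(cpA, cpB);
      i += Character.charCount(cpA);
      j += Character.charCount(cpB);
    }
    return Integer.compare(a.length() - i, b.length() - j);
  }

  public static void main(String[] args) {
    String fullwidthTilde = "\uFF5E";   // U+FF5E, 3 bytes in UTF-8
    String cat = "\uD83D\uDC08";        // U+1F408, 4 bytes in UTF-8 (surrogate pair)
    String path = "s3://bucket/data/" + fullwidthTilde + cat + "/file.parquet";
    CharSequence samePath = CharBuffer.wrap(path.toCharArray());

    // Equality is unaffected by the surrogate pair.
    System.out.println(contentEquals(path, samePath));               // true
    // Ordering is where the two views disagree.
    System.out.println(compareByCodeUnit(cat, fullwidthTilde) < 0);  // true
    System.out.println(compareByCodePoint(cat, fullwidthTilde) > 0); // true
  }
}
```

   The two comparators disagree on which of the two characters sorts first, 
but `contentEquals` only asks whether every code unit matches, so the 
disagreement never comes into play.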


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


