kbendick commented on pull request #3530: URL: https://github.com/apache/iceberg/pull/3530#issuecomment-967725102
> My only concern would be what's mentioned in the comment on the Comparators.charSequences() implementation:
>
> https://github.com/apache/iceberg/blob/b6ce66112bea752513d5317c0813c77ea980643a/api/src/main/java/org/apache/iceberg/types/Comparators.java#L335-L342
>
> Can we run this with path sequences that contain a mixture of 4-byte and 3-byte UTF-8 characters?
>
> I'm admittedly not sure it's a huge concern for path filtering, given the examples I could find. But I don't know any languages that use anything other than the extended Latin alphabet, so there might be some concerns if people have, say, certain Chinese characters or characters from other languages in their paths?
>
> This page has some example elements you could use to try (though the cat emoji and exclamation point probably aren't in paths on any file system, so a more real-world example would probably be better): https://stackoverflow.com/questions/6063148/java-unicode-where-to-find-example-n-byte-unicode-characters

Reading through it again, this comment shouldn't matter. We're not sorting the strings / char sequences, just checking for equality, and you convert at least one of them with `toString`. So even if the ordering is lexicographically somewhat off, I don't think it would affect the correctness of the equality check, as long as both sides contain the same mixture of 3-byte and 4-byte code points.
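To make the equality-versus-ordering point concrete, here is a minimal sketch. `UNIT_ORDER` is a hypothetical char-by-char comparator written for this example; it is not Iceberg's `Comparators.charSequences()` and only assumes that comparator works on UTF-16 chars in a similar way. The class name, method names, and sample paths are made up as well.

```java
import java.util.Comparator;

final class CharSeqEqualityDemo {
  // Hypothetical stand-in: compares UTF-16 chars one by one, then lengths.
  static final Comparator<CharSequence> UNIT_ORDER = (a, b) -> {
    int len = Math.min(a.length(), b.length());
    for (int i = 0; i < len; i++) {
      int cmp = Character.compare(a.charAt(i), b.charAt(i));
      if (cmp != 0) {
        return cmp;
      }
    }
    return Integer.compare(a.length(), b.length());
  };

  // Equality check built on the comparator: only a result of 0 matters.
  static boolean sameSequence(CharSequence a, CharSequence b) {
    return UNIT_ORDER.compare(a, b) == 0;
  }

  public static void main(String[] args) {
    // U+FF5E takes 3 bytes in UTF-8 and a single char in UTF-16.
    String threeByte = "\uFF5E";
    // U+1F431 takes 4 bytes in UTF-8 and a surrogate pair in UTF-16.
    String fourByte = new String(Character.toChars(0x1F431));

    // Ordering: char-by-char comparison disagrees with code point order here,
    // because the high surrogate sorts below U+FF5E.
    int byCodePoint = Integer.compare(threeByte.codePointAt(0), fourByte.codePointAt(0));
    int byChar = UNIT_ORDER.compare(threeByte, fourByte);
    System.out.println("code point order sign: " + Integer.signum(byCodePoint));
    System.out.println("char order sign:       " + Integer.signum(byChar));

    // Equality: the same mixed content compares as equal regardless of the
    // CharSequence implementation that holds it.
    String path = "data/" + fourByte + "/" + threeByte + "/file.parquet";
    CharSequence samePath = new StringBuilder(path);
    System.out.println("same path equal:       " + sameSequence(path, samePath));
    System.out.println("different path equal:  " + sameSequence(path, "data/other/file.parquet"));
  }
}
```

The two orderings disagree on sign for this pair, yet a result of 0 is only possible when every char and the lengths match, which is why the ordering caveat does not affect an equality check. Note that this is char-level equality; it does not apply Unicode normalization.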
