Justin, I like your proposal. I just had a couple thoughts that I think are worth pointing out. I'm not 100% clear about this point, but I don't believe we currently handle the possibility of various external data sources with heterogeneous encoding schemes in our Kafka topics. As it stands, I believe it's up to users to ensure their external systems have the same encoding as the machines hosting Metron (since we use the default charset). It seems we haven't run into issues with this yet, but it should be noted that if we change to UTF-8 while the external systems continue to use a different default, e.g. ISO-8859-1, things will work fine for ASCII chars 0-127, and then break for 128-255 where ISO-8859-1 still uses a single-byte representation, but UTF-8 uses a 2-byte representation. https://en.wikipedia.org/wiki/ISO/IEC_8859-1 https://en.wikipedia.org/wiki/UTF-8
I don't believe this change breaks anything - HDFS/Hadoop already uses UTF-8 as its default encoding and we'd likely have seen or heard about issues, but it's worth mentioning that there may be some gotchas at the edges outside of Metron. I did a quick search in your PR and didn't see any README updates - the only thing I might recommend is a release notice that Metron is changing our encoding handling to be more explicitly opinionated and will use UTF-8 rather than the default charset. Best, Mike On Wed, Feb 20, 2019 at 8:19 AM Justin Leet <justinjl...@gmail.com> wrote: > Hey all, > > I wanted to bring a bit of attention to a change and its effects before I > push to master. > > PR#1341 <https://github.com/apache/metron/pull/1341> Removes all uses of > the default charset (which is platform dependent) and moves everything to > UTF-8. This PR currently has a +1, but obviously any new input is more than > welcome. > > Going forward, it will be a compiler error to use the default charset (e.g. > getBytes without StandardCharsets.UTF_8, or using APIs that don't allow an > encoding to be passed. I want to call attention to the fact that existing > PRs may break after this merge. > > Post merging this PR, I would send another email to the list (and Slack and > so on), to note that existing PRs should merge in master (or at least > determine if they need to). > > Does anyone have any objections or concerns? Should I be more proactive > about a heads up on PRs, or are we okay with what I'm proposing? >