[ https://issues.apache.org/jira/browse/FLINK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15726729#comment-15726729 ]
ASF GitHub Bot commented on FLINK-3921:
---------------------------------------

Github user fhueske commented on the issue:

    https://github.com/apache/flink/pull/2901

    The byte-level implementation, as in `IntParser`, was originally used to avoid creating String object instances. Such an implementation was not possible for Double because it led to very imprecise results, so we chose the String approach there.

    You are of course right that we would also need to use the String approach if a charset is used that is not compatible with the current byte-level parsing. IMO, it makes sense to open a JIRA for that and solve it as a follow-up to this issue.

    Feel free to merge this PR.


> StringParser not specifying encoding to use
> -------------------------------------------
>
>                 Key: FLINK-3921
>                 URL: https://issues.apache.org/jira/browse/FLINK-3921
>             Project: Flink
>          Issue Type: Improvement
>          Components: Core
>    Affects Versions: 1.0.3
>            Reporter: Tatu Saloranta
>            Assignee: Rekha Joshi
>            Priority: Trivial
>
> Class `flink.types.parser.StringParser` has javadocs indicating that contents
> are expected to be ASCII, similar to `StringValueParser`. That makes sense,
> but when constructing the actual instance, no encoding is specified; on line 66,
> for example:
>    this.result = new String(bytes, startPos+1, i - startPos - 2);
> which leads to using whatever the default platform encoding is. If contents
> really are always ASCII (I would not count on that, as the parser is used from
> the CSV reader), this is not a big deal, but it can lead to the usual
> Latin-1-vs-UTF-8 issues.
> So I think that the encoding should be explicitly specified, whatever is to be
> used: the javadocs claim ASCII, so it could be "us-ascii", but it could well be
> UTF-8 or even ISO-8859-1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
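
For illustration, the change requested in the issue description (passing an explicit charset to the `String` constructor instead of relying on the platform default) could look roughly like the sketch below. This is not the code from the PR: the class `ExplicitCharsetDemo` and the helper `decodeQuoted` are hypothetical, and the sketch assumes that `startPos` is the index of the opening quote and `i` is the index just past the closing quote, which is one reading of the offsets in the quoted line.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {

    /**
     * Hypothetical helper, not part of Flink. Decodes the bytes between two
     * enclosing quote characters using an explicitly supplied charset.
     * Assumes startPos points at the opening quote and i points just past
     * the closing quote.
     */
    public static String decodeQuoted(byte[] bytes, int startPos, int i, Charset charset) {
        // Same offsets as the line quoted in the issue, with the charset made explicit.
        return new String(bytes, startPos + 1, i - startPos - 2, charset);
    }

    public static void main(String[] args) {
        // A quoted field containing one non-ASCII character, encoded as UTF-8.
        byte[] field = "\"caf\u00e9\"".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decodeQuoted(field, 0, field.length, StandardCharsets.UTF_8);
        String asLatin1 = decodeQuoted(field, 0, field.length, StandardCharsets.ISO_8859_1);

        // Decoding with the charset the bytes were written in recovers the text.
        System.out.println(asUtf8.equals("caf\u00e9"));
        // Decoding the same bytes as Latin-1 yields one char per byte (mojibake).
        System.out.println(asLatin1.length());
    }
}
```

With the three-argument constructor `new String(bytes, offset, length)`, which of these two results you get depends on the JVM's default charset, which is the Latin-1-vs-UTF-8 problem the reporter describes. Pure ASCII input decodes identically under US-ASCII, ISO-8859-1, and UTF-8, so the difference only shows up once non-ASCII bytes reach the parser.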