Github user greghogan commented on the issue:
https://github.com/apache/flink/pull/2060
Apologies for the long delay. I'd like to attempt to summarize this ticket
and pull request to validate my understanding.
Previously `StringParser` used the system default encoding, while
`GenericCsvInputFormat` used UTF-8 for the delimiter and a UTF-8 default for
the comment prefix that could be overridden.
StringParser's quoteCharacter remains a `byte` with no encoding.
Now `GenericCsvInputFormat` can be configured with a charset, which is used
for the delimiter, the comment prefix, and the field parsers (where only
`StringParser` makes use of it).
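
To illustrate why the single-`byte` quote character matters once the charset is configurable, here is a minimal sketch using only the JDK. The class and method names (`DelimiterBytes`, `encode`) are made up for this comment and are not part of Flink.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DelimiterBytes {
    // Hypothetical helper: encodes a delimiter or quote string with the given charset.
    public static byte[] encode(String token, Charset charset) {
        return token.getBytes(charset);
    }

    public static void main(String[] args) {
        // In UTF-8 a comma and a double quote each fit in one byte.
        System.out.println(Arrays.toString(encode(",", StandardCharsets.UTF_8)));     // [44]
        System.out.println(Arrays.toString(encode("\"", StandardCharsets.UTF_8)));    // [34]

        // In UTF-16BE every such character takes two bytes, so a quote
        // character stored as a single byte cannot match it exactly.
        System.out.println(Arrays.toString(encode(",", StandardCharsets.UTF_16BE)));  // [0, 44]
        System.out.println(Arrays.toString(encode("\"", StandardCharsets.UTF_16BE))); // [0, 34]
    }
}
```

If the parser compares a single quote byte against UTF-16 input, it would only ever see one half of the encoded character.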
Should `setCommentPrefix(String commentPrefix, Charset charset)` and
`setCommentPrefix(String commentPrefix, String charsetName)` be removed from
`GenericCsvInputFormat`? Would different encodings be used on the same file?
Should the user be able to set the character encoding in `CsvReader`, which
would then be applied in `CsvReader.configureInputFormat`?
Are the new tests checking the encoding? The test strings use only
characters common to UTF-8 and ASCII. We could instead use one of the UTF-16
encodings from
https://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html
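
As a rough sketch of the concern, the following JDK-only snippet shows that an ASCII-only test string produces identical bytes in US-ASCII and UTF-8, so such a test cannot tell the two apart, whereas UTF-16LE yields different bytes. `EncodingCheck`, `sameBytes`, and the sample string are hypothetical and not taken from the pull request.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingCheck {
    // Returns true when the text encodes to identical bytes under both charsets,
    // meaning a test built on this text cannot distinguish them.
    public static boolean sameBytes(String text, Charset a, Charset b) {
        return Arrays.equals(text.getBytes(a), text.getBytes(b));
    }

    public static void main(String[] args) {
        String sample = "abc|def"; // ASCII-only, like the current test strings

        // Identical bytes: the test passes no matter which of the two is used.
        System.out.println(sameBytes(sample, StandardCharsets.US_ASCII, StandardCharsets.UTF_8));  // true

        // Different bytes: a parser using the wrong charset would be caught.
        System.out.println(sameBytes(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_16LE)); // false

        // Each of the 7 characters takes 2 bytes in UTF-16LE.
        System.out.println(sample.getBytes(StandardCharsets.UTF_16LE).length);                    // 14
    }
}
```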