[
https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150860#comment-16150860
]
Luke Hutchison commented on FLINK-6016:
---------------------------------------
Yes, that's what I'm suggesting. The data doesn't have to be read twice: it can
be emitted in the first pass. The efficiency of doing so, however, depends on
the bandwidth between the single reading thread and the worker threads for each
shard.
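As a rough illustration of the single-reader idea, here is a minimal Python
sketch (not Flink code; the name split_records is made up for this example). It
assumes double-quote quoting as in RFC 4180, where an escaped quote is written
as two quotes, and it emits each record as soon as an unquoted newline is seen,
so the input is read only once. Handing the records to worker threads is left
out.

```python
def split_records(text):
    """Yield CSV records, treating newlines inside double quotes as data."""
    in_quotes = False
    start = 0
    for i, ch in enumerate(text):
        if ch == '"':
            # An escaped quote ("") toggles twice, so the net state is unchanged.
            in_quotes = not in_quotes
        elif ch == '\n' and not in_quotes:
            # Unquoted newline: the record is complete and can be emitted now.
            yield text[start:i]
            start = i + 1
    if start < len(text):
        # Trailing record without a final newline.
        yield text[start:]
```

With the example from the issue description followed by a second line,
list(split_records('"3\n4",5\n6,7\n')) keeps the quoted newline inside the
first record instead of splitting on it.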
A more scalable approach, though more complex, would be to run a parser state
machine over each shard independently, recording the parser state at each input
character, and then let each shard's parser "run off the end" of its shard
boundary into the next shard until its state matches the state that the next
shard's parser recorded at the same character position. The "overrun" parser
state overwrites the next shard's recorded state until the two match. The
recorded states then show which newlines are unquoted, and those positions
determine the line breaks.
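The shard approach could be sketched as follows, again in Python and only as an
illustration under simplifying assumptions: the whole input is one in-memory
string, the parser state is reduced to a single "inside quotes" flag, and the
names step and unquoted_newlines are hypothetical. The first pass is written as
a plain loop, but each shard is processed independently there, so it is the
part that could run on separate workers.

```python
def step(in_quotes, ch):
    """Advance the two-state quote tracker by one character."""
    return (not in_quotes) if ch == '"' else in_quotes


def unquoted_newlines(text, shard_size):
    """Return the positions of newlines that lie outside quoted strings."""
    n = len(text)
    bounds = [(a, min(a + shard_size, n)) for a in range(0, n, shard_size)]
    state = [False] * n   # state[i] is the parser state just before text[i]
    end_state = []        # speculative state at the end of each shard

    # Pass 1: every shard guesses that it starts outside quotes.
    # Shards do not depend on each other here.
    for a, b in bounds:
        s = False
        for i in range(a, b):
            state[i] = s
            s = step(s, text[i])
        end_state.append(s)

    # Pass 2: carry the true state across each boundary and overwrite the
    # next shard's recorded states until they agree with the carried state.
    carry = False
    for k, (a, b) in enumerate(bounds):
        s, i = carry, a
        while i < b and state[i] != s:
            state[i] = s
            s = step(s, text[i])
            i += 1
        # If the states matched before the shard ended, the rest of the
        # speculative run was already correct, including its end state.
        carry = s if i == b else end_state[k]

    # Pass 3: a newline is a line break only if recorded as unquoted.
    return [i for i, ch in enumerate(text) if ch == '\n' and not state[i]]
```

One caveat with this reduced state: a quote flips both possible states, so a
wrong guess never converges inside a shard and the overrun continues to the
shard's end. The early-exit check only pays off when the guess was right, or
with a richer parser state that can resynchronize.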
> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
> Key: FLINK-6016
> URL: https://issues.apache.org/jira/browse/FLINK-6016
> Project: Flink
> Issue Type: Bug
> Components: Batch Connectors and Input/Output Formats
> Affects Versions: 1.2.0
> Reporter: Luke Hutchison
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)