[ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150827#comment-16150827 ]
Mikhail Lipkovich commented on FLINK-6016:
------------------------------------------
Thank you for the reply, Luke.
Currently FileInputFormat identifies splits using only block metadata; no
data is actually read. If I understand you correctly, the suggestion is to
modify this reader so that it downloads all blocks, parses them while tracking
quoted newline characters, and returns the split boundaries. The data would
therefore be traversed twice: once in a single thread to identify the splits,
and a second time for the actual data processing.
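To check that we mean the same thing, here is a rough sketch of the first
(single-threaded) pass as I read the suggestion. The class and method names are
made up for illustration and are not existing Flink API. It assumes '"' as the
quote character and '\n' as the record delimiter, and it works on an in-memory
byte array, whereas a real implementation would stream the file:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: finds safe split start offsets
// by scanning the data once and ignoring newlines inside quoted fields.
final class QuotedCsvSplitScanner {

    /**
     * Returns offset 0 plus, for each split, the offset just after the first
     * unquoted newline that lies at least targetSplitSize bytes past the
     * previous split start.
     */
    static List<Integer> splitStarts(byte[] data, int targetSplitSize) {
        List<Integer> starts = new ArrayList<>();
        if (data.length == 0) {
            return starts;
        }
        starts.add(0);
        boolean inQuotes = false;
        int nextTarget = targetSplitSize;
        for (int i = 0; i < data.length; i++) {
            byte b = data[i];
            if (b == '"') {
                // An escaped quote ("") toggles twice, so the state is unchanged.
                inQuotes = !inQuotes;
            } else if (b == '\n' && !inQuotes
                    && i + 1 >= nextTarget && i + 1 < data.length) {
                // Record boundary outside quotes: the next split may start here.
                starts.add(i + 1);
                nextTarget = i + 1 + targetSplitSize;
            }
        }
        return starts;
    }
}
```

With the example from the issue placed between two ordinary records, the
newline inside "3\n4" is never chosen as a boundary, so the quoted record always
stays within a single split.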
I can probably implement this, but I think it would be better for me to
take on a few easier tasks before diving into this one.
Please let me know if I have misunderstood your comment.
> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
> Key: FLINK-6016
> URL: https://issues.apache.org/jira/browse/FLINK-6016
> Project: Flink
> Issue Type: Bug
> Components: Batch Connectors and Input/Output Formats
> Affects Versions: 1.2.0
> Reporter: Luke Hutchison
>
> The RFC for the CSV format specifies that newlines are valid in quoted
> strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING
> Expect field types: class java.lang.String, class java.lang.String
> {noformat}