I think that's a fair assumption to make.

I'll open a JIRA for making quoted string parsing optional and for adding a
configurable quote character.
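
To make the proposal concrete, here is a minimal, standalone sketch of the behaviour discussed in this thread. It is not the Flink parser code. The class QuotedFieldSketch, the method parseField, and its two parameters (quotedStringParsing, quoteChar) are hypothetical names chosen for illustration only. It also assumes that anything following the closing quote counts as invalid trailing data, and it signals that with an IllegalArgumentException rather than Flink's ParseException.

```java
// Hypothetical illustration only; none of these names exist in the Flink API.
class QuotedFieldSketch {

    /**
     * Parses a single, already delimited field.
     *
     * @param field               raw field content between two delimiters
     * @param quotedStringParsing assumed switch: false means quotes are taken literally
     * @param quoteChar           assumed configurable quote character
     */
    static String parseField(String field, boolean quotedStringParsing, char quoteChar) {
        // Locate the first non-whitespace character of the field.
        int start = 0;
        while (start < field.length() && Character.isWhitespace(field.charAt(start))) {
            start++;
        }

        // Quoted parsing only kicks in if it is enabled AND the first
        // non-whitespace character is the quote character.
        boolean quoted = quotedStringParsing
                && start < field.length()
                && field.charAt(start) == quoteChar;
        if (!quoted) {
            return field; // literal interpretation, quotes included
        }

        int end = field.indexOf(quoteChar, start + 1);
        if (end < 0) {
            throw new IllegalArgumentException("Unterminated quoted string: " + field);
        }
        // Simplifying assumption: any data after the closing quote is invalid.
        if (end != field.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data: " + field);
        }
        return field.substring(start + 1, end);
    }

    public static void main(String[] args) {
        String fieldB = "\"hhh\" xx"; // second field of line B from the example file

        // Quoted parsing disabled: the field is kept as-is.
        System.out.println(parseField(fieldB, false, '"'));

        // Quoted parsing enabled: the trailing " xx" is rejected.
        try {
            parseField(fieldB, true, '"');
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

With the switch off, line B would yield the field value Malte expects. With it on, the same field is rejected, which matches the exception he reported. As noted further down in the thread, such a setting would apply to all String fields of a file.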

2014-12-09 18:51 GMT+01:00 Max Michels <[email protected]>:

> That sounds like a good idea. Just like setDelimeter("|"), one should be
> able to do a setParseDoubleQuotes(false) to disable the special handling of
> double quotes.
>
> You're right, Fabian, the current implementation treats all String fields
> alike. Maybe we can expect the user to provide a consistently formatted
> input file (i.e. with or without the use of double quotes as identifiers)?
>
> On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske <[email protected]> wrote:
>
>> With the current implementation, quoted string parsing kicks in if the
>> first non-whitespace character of a field is a double quote (just as in
>> Malte's case). I think this behaviour can be quite unexpected for users.
>> Wouldn't it be better to make the behaviour of the String parsing more
>> explicit, i.e., add a switch to dis/enable quoted string parsing? With the
>> current implementation, the configuration would affect all String fields in
>> a file, though...
>>
>> Cheers, Fabian
>>
>> 2014-12-09 12:17 GMT+01:00 Max Michels <[email protected]>:
>>
>>> Hi Malte,
>>>
>>> Typically, double quotes are used to identify strings and thus are not
>>> interpreted literally. Any data in a field after a double quoted string is
>>> regarded as invalid trailing data.
>>>
>>> You could replace double quotes with single quotes:
>>>
>>> A|ggg
>>> B|'hhh' xx
>>> C|xxx
>>>
>>> This results in the expected >'hhh' xx< for the second line.
>>>
>>> Best regards,
>>> Max
>>>
>>> On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <[email protected]> wrote:
>>>
>>>> Hi Stephan,
>>>>
>>>> The result should be >"hhh" xx< as the field value. Enclosures should be
>>>> disabled, but there seems to be no method to do that.
>>>>
>>>>
>>>> Malte
>>>>
>>>> From: Stephan Ewen <[email protected]>
>>>> Reply-To: <[email protected]>
>>>> Date: Friday, December 5, 2014 16:28
>>>> To: <[email protected]>
>>>> Subject: Re: Quotes in fields of CsvInputFormat
>>>>
>>>> Hi!
>>>>
>>>> The parser interprets the quotes as quotes for the field. That means
>>>> the second field (the string) stops after the "hhh" and the xx is
>>>> considered invalid trailing data.
>>>>
>>>> What do you expect as the result of parsing that line?
>>>>
>>>> Stephan
>>>>
>>>>
>>>> On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to import a CSV file, but the parser seems to have problems
>>>>> with quotes at the beginning of a field. Is there a way to set or disable
>>>>> enclosures for the CSV input?
>>>>>
>>>>> This is my code:
>>>>>
>>>>> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>>>>>                 .fieldDelimiter('|')
>>>>>                 .types(String.class, String.class);
>>>>>
>>>>> CSV:
>>>>>
>>>>> A|ggg
>>>>> B|"hhh" xx
>>>>> C|xxx
>>>>>
>>>>> As a result, I'm receiving a ParseException for line B:
>>>>>
>>>>> org.apache.flink.api.common.io.ParseException: Line could not be
>>>>> parsed: 'B|"hhh" xx'
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Malte
>>>>>
>>>>
>>>>
>>>
>>
>
