ok, thanks for creating the ticket.

just to be clear: my example was in Scala

On Thu, May 26, 2016 at 7:07 PM, Reynold Xin <r...@databricks.com> wrote:

> This is unfortunately due to the way we handle default values in
> Python. I agree it doesn't follow the principle of least astonishment.
>
> Maybe the best thing to do here is to put the actual default values in the
> Python API for csv (and json, parquet, etc.), rather than using None in
> Python. This would require us to define the default values in two places
> (once in the data source options, and again in the Python API), but that's
> probably OK given they shouldn't change often.
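>
> To illustrate with a minimal Scala sketch (a hypothetical helper, not the
> actual Spark source): a "quote" option that was never set and one that was
> explicitly set to null (e.g. from Python's None) both fall through to the
> default quote character, so null cannot disable quoting:
>
>   // hypothetical sketch of option resolution, not the actual Spark code
>   def resolveQuote(options: Map[String, String]): Char =
>     options.get("quote") match {
>       case None       => '"'          // option absent: use the default
>       case Some(null) => '"'          // set to null: ALSO the default
>       case Some(v)    => v.charAt(0)  // assumes a one-character value
>     }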
>
> Ticket https://issues.apache.org/jira/browse/SPARK-15585
>
>
>
>
> On Thu, May 26, 2016 at 3:35 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> in spark 1.6.1 we used:
>>  sqlContext.read
>>       .format("com.databricks.spark.csv")
>>       .option("delimiter", "~")
>>       .option("quote", null)
>>
>> this effectively turned off quoting, which is a necessity for certain
>> data formats where quoting is not supported and "\"" is itself a valid
>> character in the data.
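>>
>> for concreteness, a complete version of that read (the path and the
>> sample line are hypothetical):
>>
>>  // spark 1.6.1 + spark-csv: a null quote disables quoting entirely
>>  val df = sqlContext.read
>>       .format("com.databricks.spark.csv")
>>       .option("delimiter", "~")
>>       .option("quote", null)
>>       .load("/path/to/data")   // hypothetical path
>>  // a line like a~b"c~d parses as three fields: a | b"c | d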
>>
>> in spark 2.0.0-SNAPSHOT we did the same thing:
>>  sqlContext.read
>>       .format("csv")
>>       .option("delimiter", "~")
>>       .option("quote", null)
>>
>> but this did not work: we got weird blowups where spark tried to parse
>> thousands of lines as if they were one record. the reason was that a
>> (now valid) quote character ("\"") was present in the data, for example:
>> a~b"c~d
>> the parser treats that "\"" as opening a quoted field and keeps consuming
>> subsequent lines until it finds a closing quote, swallowing everything in
>> between into a single record.
>>
>> as it turns out, setting quote to null does not turn off quoting anymore;
>> instead it now means "use the default quote character".
>>
>> does anyone know how to turn off quoting now?
>>
>> our current workaround is:
>>  sqlContext.read
>>       .format("csv")
>>       .option("delimiter", "~")
>>       .option("quote", "☃")
>>
>> (we assume there are no unicode snowmen in our data...)
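>>
>> an alternative sketch along the same lines, assuming the data never
>> contains a NUL byte (whether "\u0000" is accepted as a one-character
>> quote here is an assumption worth verifying):
>>
>>  sqlContext.read
>>       .format("csv")
>>       .option("delimiter", "~")
>>       .option("quote", "\u0000")  // NUL: cannot appear in text data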
>>
>>
>>
>
