Yup - but the reason we did the null handling that way was for Python,
which also affects Scala.


On Thu, May 26, 2016 at 4:17 PM, Koert Kuipers <ko...@tresata.com> wrote:

> OK, thanks for creating the ticket.
>
> Just to be clear: my example was in Scala.
>
> On Thu, May 26, 2016 at 7:07 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> This is unfortunately due to the way we handle default values in
>> Python. I agree it doesn't follow the principle of least astonishment.
>>
>> Maybe the best thing to do here is to put the actual default values in
>> the Python API for csv (and json, parquet, etc.), rather than using None in
>> Python. This would require us to define the default values twice (once in
>> the data source options, and again in the Python API), but that's probably OK
>> given they shouldn't change often.
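>>
>> For illustration, a minimal sketch (not Spark's actual source; the name
>> is made up) of why a null option value currently collapses to the default:
>>
>>   // A null stored in the options map behaves like an absent key,
>>   // so there is no way to say "no quote character at all".
>>   def resolveQuote(options: Map[String, String]): Char =
>>     options.get("quote") match {
>>       case Some(null) | None => '"'         // falls back to the default
>>       case Some(q)           => q.charAt(0) // explicit single-char quote
>>     }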
>>
>> Ticket https://issues.apache.org/jira/browse/SPARK-15585
>>
>>
>> On Thu, May 26, 2016 at 3:35 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> In Spark 1.6.1 we used:
>>>  sqlContext.read
>>>       .format("com.databricks.spark.csv")
>>>       .option("delimiter", "~")
>>>       .option("quote", null)
>>>
>>> This effectively turned off quoting, which is a necessity for certain
>>> data formats where quoting is not supported and "\"" is a valid character
>>> in the data itself.
>>>
>>> In Spark 2.0.0-SNAPSHOT we did the same thing:
>>>  sqlContext.read
>>>       .format("csv")
>>>       .option("delimiter", "~")
>>>       .option("quote", null)
>>>
>>> But this did not work: we got weird blowups where Spark tried to parse
>>> thousands of lines as if they were one record. The reason was that a
>>> (valid) quote character ("\"") was present in the data, for example:
>>> a~b"c~d
>>> With quoting enabled, the parser treats that "\"" as opening a quoted
>>> field and keeps consuming input (newlines included) until it finds a
>>> closing quote.
>>>
>>> As it turns out, setting quote to null does not turn off quoting anymore;
>>> instead it means "use the default quote character".
>>>
>>> Does anyone know how to turn off quoting now?
>>>
>>> Our current workaround is:
>>>  sqlContext.read
>>>       .format("csv")
>>>       .option("delimiter", "~")
>>>       .option("quote", "☃")
>>>
>>> (We assume there are no Unicode snowmen in our data...)
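>>>
>>> (If the underlying parser treats the NUL character as "quoting disabled",
>>> which is an assumption we have not verified, a non-printable sentinel may
>>> be safer than a printable one:)
>>>  sqlContext.read
>>>       .format("csv")
>>>       .option("delimiter", "~")
>>>       .option("quote", "\u0000")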
>>>
>>>
>>
>
