Okay, so this is partially PEBKAC. I just noticed that there's a debugging
field at the end that is itself another case class with its own simple
fields - *that's* the struct that was showing up in the error, not the
entry itself.

This raises a different question, though: what changed so that this is no
longer possible? The pull request says it prints garbage. Was that a
regression in 2.0? The same code prints fine in 1.6.1, where the nested
field comes out as an array of its field values.
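
For concreteness, the shape is roughly this. Just a sketch with made-up
names, assuming a spark-shell style session where the spark session and
its implicits are in scope:

import java.sql.Date
import spark.implicits._

case class DebugInfo(source: String, loadedAt: Date)           // the nested case class
case class Entry(id: String, amount: String, debug: DebugInfo) // ~20 simple fields plus this one

val entries = Seq(
  Entry("1", "10.00", DebugInfo("batch", Date.valueOf("2016-08-18")))
).toDS()

// In 2.0 this throws something like
//   UnsupportedOperationException: CSV data source does not support
//   struct<...> data type
// because the debug field becomes a struct column; in 1.6.1 the same write
// succeeded and the struct came out as an array of its field values.
entries.write.csv("/tmp/entries")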

On Thu, Aug 18, 2016 at 5:56 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

> Ah, BTW, there is an issue, SPARK-16216, about printing dates and
> timestamps here, so please ignore the integer values for the dates.
>
> 2016-08-19 9:54 GMT+09:00 Hyukjin Kwon <gurwls...@gmail.com>:
>
>> Ah, sorry, I should have read this more carefully. Do you mind sharing
>> your code so I can test it?
>>
>> I would like to reproduce this.
>>
>>
>> I just tested this myself but couldn't reproduce it, as shown below (is
>> this what you're doing?):
>>
>> import java.sql.Date
>> import org.apache.spark.sql.Dataset
>> import spark.implicits._
>>
>> case class ClassData(a: String, b: Date)
>>
>> val ds: Dataset[ClassData] = Seq(
>>   ("a", Date.valueOf("1990-12-13")),
>>   ("a", Date.valueOf("1990-12-13")),
>>   ("a", Date.valueOf("1990-12-13"))
>> ).toDF("a", "b").as[ClassData]
>>
>> // write out as CSV, then read it back without a schema
>> ds.write.csv("/tmp/data.csv")
>> spark.read.csv("/tmp/data.csv").show()
>>
>> prints as below:
>>
>> +---+----+
>> |_c0| _c1|
>> +---+----+
>> |  a|7651|
>> |  a|7651|
>> |  a|7651|
>> +---+----+
>>
>>
>> 2016-08-19 9:27 GMT+09:00 Efe Selcuk <efema...@gmail.com>:
>>
>>> Thanks for the response. The problem with that thought is that I don't
>>> think I'm dealing with a complex nested type. It's just a dataset where
>>> every record is a case class whose fields are all simple types, strings
>>> and dates. There's no nesting.
>>>
>>> That's what confuses me about how it's interpreting the schema. The
>>> schema seems to be one complex field rather than a bunch of simple fields.
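>>>
>>> To double-check what Spark actually sees, I can dump the schema and look
>>> at whether the fields come out flat or as one wrapped struct, e.g.:
>>>
>>> someDataset.printSchema()
>>> someDataset.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}"))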
>>>
>>> On Thu, Aug 18, 2016, 5:07 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
>>>
>>>> Hi Efe,
>>>>
>>>> If my understanding is correct, writing and reading complex types is not
>>>> supported because the CSV format can't represent nested types.
>>>>
>>>> I guess the external CSV library allowing them on write was really a bug.
>>>>
>>>> I think it'd be great if we could write complex types out to CSV and read
>>>> them back, but I guess we can't.
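>>>>
>>>> One possible workaround might be to flatten the nested struct into
>>>> top-level columns before writing. Just a rough, untested sketch, where
>>>> debug and someSimpleField stand in for your nested column and one of
>>>> the simple ones:
>>>>
>>>> val flat = someDataset.select("someSimpleField", "debug.*")
>>>> flat.write.csv(outputPath)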
>>>>
>>>> Thanks!
>>>>
>>>> On 19 Aug 2016 6:33 a.m., "Efe Selcuk" <efema...@gmail.com> wrote:
>>>>
>>>>> We have an application working in Spark 1.6. It uses the databricks
>>>>> csv library for the output format when writing out.
>>>>>
>>>>> I'm attempting an upgrade to Spark 2. When writing with either the
>>>>> native DataFrameWriter#csv() method or by explicitly specifying the
>>>>> "com.databricks.spark.csv" format (I suspect the underlying format is
>>>>> the same, but I don't know how to verify that), I get the following
>>>>> error:
>>>>>
>>>>> java.lang.UnsupportedOperationException: CSV data source does not
>>>>> support struct<[bunch of field names and types]> data type
>>>>>
>>>>> There are 20 fields, mostly plain strings with a couple of dates. The
>>>>> source object is a Dataset[T] where T is a case class with various
>>>>> fields. The write itself just looks like: someDataset.write.csv(outputPath)
>>>>>
>>>>> Googling returned this fairly recent pull request:
>>>>> https://mail-archives.apache.org/mod_mbox/spark-commits/201605.mbox/%3c65d35a72bd05483392857098a2635...@git.apache.org%3E
>>>>>
>>>>> If I'm reading that correctly, the schema shows each record as having
>>>>> one field of this complex struct type, and the validation decides it's
>>>>> something it can't serialize? I would expect the schema to have a bunch
>>>>> of fields in it matching the case class, so maybe there's something I'm
>>>>> misunderstanding.
>>>>>
>>>>> Efe
>>>>>
>>>>
>>
>
