I just wanted to send out a quick note about a change in the handling of strings when loading/storing data using Parquet and Spark SQL. Previously, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as strings. 9fe693 <https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38> fixes this limitation by adding support for binary data.
However, data written out with a prior version of Spark SQL will be missing the annotation telling us to interpret a given column as a String, so old string data will now be loaded as binary data. If you would like to use the data as a string, you will need to add a CAST to convert the datatype.

New string data written out after this change will correctly be loaded as a string, since we now include an annotation about the desired type. Additionally, this should now interoperate correctly with other systems that write Parquet data (Hive, Thrift, etc.).

Michael
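For anyone hitting this with old data, here is a minimal sketch of the workaround (the table and column names `old_table` / `value` are just placeholders for illustration):

```scala
// Register the old Parquet data and cast the un-annotated binary
// column back to a string. Assumes an existing SQLContext.
val parquetFile = sqlContext.parquetFile("hdfs://.../old_table.parquet")
parquetFile.registerTempTable("old_table")

// The column loads as binary; CAST recovers the string interpretation.
val strings = sqlContext.sql(
  "SELECT CAST(value AS STRING) AS value FROM old_table")
```

Rewriting the casted result back out once will re-save the data with the string annotation, so the CAST is only needed for data written before this change.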