I just wanted to send out a quick note about a change in the handling of strings when loading/storing data using Parquet and Spark SQL. Previously, Spark SQL did not support binary data in Parquet, so all binary blobs were implicitly treated as strings. 9fe693 <https://github.com/apache/spark/commit/9fe693b5b6ed6af34ee1e800ab89c8a11991ea38> fixes this limitation by adding support for binary data.
However, data written out with a prior version of Spark SQL will be missing the annotation telling us to interpret a given column as a String, so old string data will now be loaded as binary data. If you would like to use the data as a string, you will need to add a CAST to convert the datatype.

New string data written out after this change will correctly be loaded as a string, since we now include an annotation about the desired type. Additionally, this should now interoperate correctly with other systems that write Parquet data (Hive, Thrift, etc.).

Michael
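For anyone hitting this with old data, here is a minimal sketch of the workaround (the table and column names `old_table` / `value` are just placeholders for illustration):

```scala
// Register the old Parquet data and cast the un-annotated binary
// column back to a string. Assumes an existing SQLContext.
val parquetFile = sqlContext.parquetFile("hdfs://.../old_table.parquet")
parquetFile.registerTempTable("old_table")

// The column loads as binary; CAST recovers the string interpretation.
val strings = sqlContext.sql(
  "SELECT CAST(value AS STRING) AS value FROM old_table")
```

Rewriting the casted result back out once will re-save the data with the string annotation, so the CAST is only needed for data written before this change.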