Hello Spark community, I'm reading from a MySQL database into a Spark dataframe using the JDBC connector functionality, and I'm experiencing some character encoding issues. The default encoding for MySQL strings is latin1, but the MySQL JDBC connector's implementation of "ResultSet.getString()" returns a mangled Unicode encoding of the data for certain characters, such as the "all rights reserved" character. Using "new String(ResultSet.getBytes())" instead returns the correctly encoded string. I've confirmed this behavior with the MySQL connector classes directly (i.e., without the Spark wrapper).
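For reference, this is roughly the standalone check I ran against the connector (connection URL, credentials, and the table/column names are just placeholders):

    import java.sql.DriverManager

    // Query the MySQL connector directly, with no Spark involved.
    Class.forName("com.mysql.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/mydb?characterEncoding=UTF-8", "user", "password")
    try {
      val rs = conn.createStatement().executeQuery("SELECT name FROM my_table LIMIT 10")
      while (rs.next()) {
        // What Spark's JdbcUtils does today: let the driver decode the string.
        val viaGetString = rs.getString(1)
        // What works for me: take the raw bytes and decode them in the JVM.
        val viaGetBytes  = new String(rs.getBytes(1))
        println(s"getString: $viaGetString | getBytes: $viaGetBytes")
      }
    } finally {
      conn.close()
    }

With the affected rows, viaGetString comes back mangled while viaGetBytes is correct.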
I can see that the Spark JDBC connector uses getString(), though there is a note to move to getBytes() for performance reasons: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389 For some special characters I can reverse the behavior with a UDF that applies new String(badString.getBytes("Cp1252"), "UTF-8"); however, for some foreign characters the underlying byte array is irreversibly changed and the data is corrupted. I can submit an issue/PR to fix this going forward if "new String(ResultSet.getBytes())" is the correct approach.

Meanwhile, can anyone offer any recommendations on how to correct this behavior before the data reaches a dataframe? I've tried every permutation of the relevant settings in the JDBC connection URL (characterSetResults, characterEncoding). I'm on Spark 1.6. Thanks!
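P.S. In case it helps anyone reproduce this, here is roughly the partial workaround I'm applying today on Spark 1.6 (table, column, and connection details are placeholders); as noted above, it only recovers characters whose bytes survive the Cp1252 round trip:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-encoding-test"))
    val sqlContext = new SQLContext(sc)

    // Read the table over JDBC; these connection-URL charset settings did not
    // fix the mangling for me, but I include them for completeness.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb?characterEncoding=UTF-8&characterSetResults=UTF-8")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Re-encode strings the driver effectively decoded as Cp1252. This is lossy:
    // characters whose bytes don't round-trip through Cp1252 stay corrupted.
    val fixEncoding = udf { (s: String) =>
      if (s == null) null else new String(s.getBytes("Cp1252"), "UTF-8")
    }

    val fixed = df.withColumn("name", fixEncoding(df("name")))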