Hello Spark community, I'm reading from a MySQL database into a Spark dataframe using the JDBC connector functionality, and I'm experiencing some character encoding issues. The default encoding for MySQL strings is latin1, but the MySQL JDBC connector's implementation of "ResultSet.getString()" returns a mangled Unicode encoding of the data for certain characters, such as the "all rights reserved" character. Using "new String(ResultSet.getBytes())" instead returns the correctly encoded string. I've confirmed this behavior with the MySQL connector classes directly (i.e., without the Spark wrapper).
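For reference, this is roughly the standalone check I ran against the connector (connection URL, credentials, and the table/column names are just placeholders):

    import java.sql.DriverManager

    // Query the MySQL connector directly, with no Spark involved.
    Class.forName("com.mysql.jdbc.Driver")
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/mydb?characterEncoding=UTF-8", "user", "password")
    try {
      val rs = conn.createStatement().executeQuery("SELECT name FROM my_table LIMIT 10")
      while (rs.next()) {
        // What Spark's JdbcUtils does today: let the driver decode the string.
        val viaGetString = rs.getString(1)
        // What works for me: take the raw bytes and decode them in the JVM.
        val viaGetBytes  = new String(rs.getBytes(1))
        println(s"getString: $viaGetString | getBytes: $viaGetBytes")
      }
    } finally {
      conn.close()
    }

With the affected rows, viaGetString comes back mangled while viaGetBytes is correct.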
I can see that the Spark JDBC connector uses getString(), though there is a note to move to getBytes() for performance reasons: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389 For some special characters I can reverse the behavior with a UDF that applies new String(badString.getBytes("Cp1252"), "UTF-8"); however, for some foreign characters the underlying byte array is irreversibly changed and the data is corrupted. I can submit an issue/PR to fix this going forward if "new String(ResultSet.getBytes())" is the correct approach.

Meanwhile, can anyone offer any recommendations on how to correct this behavior before the data reaches a dataframe? I've tried every permutation of the relevant settings in the JDBC connection URL (characterSetResults, characterEncoding). I'm on Spark 1.6. Thanks!
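P.S. In case it helps anyone reproduce this, here is roughly the partial workaround I'm applying today on Spark 1.6 (table, column, and connection details are placeholders); as noted above, it only recovers characters whose bytes survive the Cp1252 round trip:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    val sc = new SparkContext(new SparkConf().setAppName("jdbc-encoding-test"))
    val sqlContext = new SQLContext(sc)

    // Read the table over JDBC; these connection-URL charset settings did not
    // fix the mangling for me, but I include them for completeness.
    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/mydb?characterEncoding=UTF-8&characterSetResults=UTF-8")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Re-encode strings the driver effectively decoded as Cp1252. This is lossy:
    // characters whose bytes don't round-trip through Cp1252 stay corrupted.
    val fixEncoding = udf { (s: String) =>
      if (s == null) null else new String(s.getBytes("Cp1252"), "UTF-8")
    }

    val fixed = df.withColumn("name", fixEncoding(df("name")))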