Based on your description, this isn't a problem in Spark. It means
your JDBC connector isn't interpreting bytes from the database
according to the encoding in which they were written. It could be
Latin1, sure.

But if "new String(ResultSet.getBytes())" works, it's only because
your platform's default JVM encoding is Latin1 too. Really you need to
specify the encoding directly in that constructor, or else this will
not in general work on other platforms, no.
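
For illustration, a minimal sketch, assuming rs is a java.sql.ResultSet
positioned on a row and that the column really is Latin1 (that last
part is my assumption):

    import java.nio.charset.StandardCharsets

    val bytes  = rs.getBytes(1)     // raw bytes as stored in the column
    val broken = new String(bytes)  // decodes with the platform default charset
    val fixed  = new String(bytes, StandardCharsets.ISO_8859_1) // explicit Latin1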

That's not the solution though; ideally you find the setting that lets
the JDBC connector read the data as intended.
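
The kind of setting I mean is Connector/J's characterEncoding URL
parameter (I realize you said you tried it); for reference, in Spark
that would look roughly like this (host, db, and table are
placeholders):

    val df = sqlContext.read.format("jdbc")
      .option("url", "jdbc:mysql://host:3306/db?characterEncoding=ISO-8859-1")
      .option("dbtable", "mytable")
      .load()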

On Tue, Sep 13, 2016 at 8:02 PM, Mark Bittmann <mbittm...@gmail.com> wrote:
> Hello Spark community,
>
> I'm reading from a MySQL database into a Spark dataframe using the JDBC
> connector functionality, and I'm experiencing some character encoding
> issues. The default encoding for MySQL strings is latin1, but the mysql JDBC
> connector implementation of "ResultSet.getString()" will return a mangled
> Unicode encoding of the data for certain characters such as the "all rights
> reserved" char. Instead, you can use "new String(ResultSet.getBytes())"
> which will return the correctly encoded string. I've confirmed this behavior
> with the mysql connector classes (i.e., without using the Spark wrapper).
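>
> A minimal reproduction with the raw driver, as a sketch (the connection
> details and table/column names here are made up):
>
>     import java.sql.DriverManager
>
>     val conn = DriverManager.getConnection("jdbc:mysql://host:3306/db", "user", "pass")
>     val rs = conn.createStatement().executeQuery("SELECT name FROM t LIMIT 1")
>     rs.next()
>     println(rs.getString(1))             // mangled for some characters
>     println(new String(rs.getBytes(1)))  // correct, via the platform default charset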
>
> I can see here that the Spark JDBC connector uses getString(), though there
> is a note to move to getBytes() for performance reasons:
>
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L389
>
> For some special chars, I can reverse the behavior with a UDF that
> applies new String(badString.getBytes("Cp1252"), "UTF-8"); however, for
> some foreign characters the underlying byte array is irreversibly
> changed and the data is corrupted.
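>
> As a rough sketch, that workaround UDF looks like this (the column
> name is hypothetical):
>
>     import org.apache.spark.sql.functions.{col, udf}
>
>     // Round-trip the mangled string through Cp1252 bytes back to UTF-8.
>     val reEncode = udf { (s: String) =>
>       if (s == null) null else new String(s.getBytes("Cp1252"), "UTF-8")
>     }
>     val repaired = df.withColumn("name", reEncode(col("name")))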
>
> I can submit an issue/PR to fix it going forward if "new
> String(ResultSet.getBytes())" is the correct approach.
>
> Meanwhile, can anyone offer any recommendations on how to correct this
> behavior prior to it getting to a dataframe? I've tried every permutation of
> the settings in the JDBC connection url (characterSetResults,
> characterEncoding).
>
> I'm on Spark 1.6.
>
> Thanks!
