[ https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Patrick Wendell updated SPARK-1849:
-----------------------------------
    Fix Version/s:     (was: 0.9.1)
                       0.9.2

> Broken UTF-8 encoded data gets character replacements and thus can't be
> "fixed"
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-1849
>                 URL: https://issues.apache.org/jira/browse/SPARK-1849
>             Project: Spark
>          Issue Type: Bug
>            Reporter: Harry Brundage
>             Fix For: 1.0.0, 0.9.2
>
>         Attachments: encoding_test
>
>
> I'm trying to process a file which isn't valid UTF-8 data inside hadoop using
> Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug that
> we should fix? It looks like {{HadoopRDD}} uses
> {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I
> believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement
> character, \uFFFD. Some example code mimicking what {{sc.textFile}} does
> underneath:
> {code}
> scala> sc.textFile(path).collect()(0)
> res8: String = ?pple
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
> classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
> res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable],
> classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
> res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
> {code}
> In the above example, the first two snippets show the string representation
> and byte representation of the example line of text. The string shows a
> question mark for the replacement character and the bytes reveal the
> replacement character has been swapped in by {{Text.toString}}. The third
> snippet shows what happens if you call {{getBytes}} on the {{Text}} object
> which comes back from hadoop land: we get the real bytes in the file out.
> Now, I think this is a bug, though you may disagree.
> The text inside my file is perfectly valid ISO-8859-1 encoded bytes, which I
> would like to be able to rescue and re-encode into UTF-8, because I want my
> application to be smart like that. I think Spark should give me the raw
> broken string so I can re-encode, but I can't get at the original bytes in
> order to guess at what the source encoding might be, as they have already
> been replaced. I'm dealing with data from some CDN access logs which are, to
> put it nicely, diversely encoded, but I think this is a use case Spark should
> fully support. So, my suggested fix, on which I'd like some guidance, is to
> change {{textFile}} to spit out broken strings by not using {{Text}}'s UTF-8
> encoding.
> Further compounding this issue is that my application is actually in PySpark,
> but we can talk about how bytes fly through to Scala land after this if we
> agree that this is an issue at all.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
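
The byte arrays quoted in the report can be reproduced without Spark or Hadoop. The following is only a minimal sketch using plain JVM string decoding as a stand-in for {{Text.toString}}; the object name {{EncodingRescue}} and its helper methods are hypothetical and are not part of Spark or of any proposed fix. It shows why the UTF-8 decode is lossy and how the original bytes could be re-encoded if they were still available.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper, for illustration only (not a Spark or Hadoop API).
object EncodingRescue {
  // Raw bytes from the third snippet in the report: 0xC4 followed by "pple".
  val raw: Array[Byte] = Array(-60, 112, 112, 108, 101).map(_.toByte)

  // Decoding as UTF-8 replaces the malformed byte with U+FFFD,
  // so the original byte value cannot be recovered from the String.
  def decodeAsUtf8(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  // Decoding with the real source charset keeps every byte's meaning.
  def rescue(bytes: Array[Byte], source: Charset): String =
    new String(bytes, source)

  // Re-encode a correctly decoded string as valid UTF-8.
  def toUtf8(s: String): Array[Byte] =
    s.getBytes(StandardCharsets.UTF_8)

  def main(args: Array[String]): Unit = {
    val lossy = decodeAsUtf8(raw)
    val rescued = rescue(raw, StandardCharsets.ISO_8859_1)
    println(toUtf8(lossy).mkString(", "))   // bytes after the lossy round trip
    println(rescued)                        // text decoded as ISO-8859-1
    println(toUtf8(rescued).mkString(", ")) // valid UTF-8 bytes for the rescued text
  }
}
```

Inside a Spark job the raw bytes would have to come from the {{Text}} value itself, as in the third snippet of the report. One caveat, as far as the Hadoop API goes: {{Text.getBytes}} returns the backing array, of which only the first {{getLength}} bytes are valid, so a real implementation would likely need to slice to that length before decoding.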