[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be "fixed"

2014-10-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169549#comment-14169549
 ] 

Sean Owen commented on SPARK-1849:
--

Yes, I don't think there is a 'fix' here short of a quite different 
implementation. Hadoop's text support assumes UTF-8 fairly deeply (partly for 
speed), and the Spark implementation is just Hadoop's. A change here would have 
to justify rewriting all of that. I think you have to treat this as binary data 
for now.
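
A minimal sketch of that binary-data workaround, assuming the real encoding is 
ISO-8859-1 (the charset choice and the {{lines}} name are illustrative only, 
not an existing Spark API):

{code}
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat

// Read records as raw bytes via TextInputFormat, then decode them with the
// charset the data is actually in, instead of letting Text.toString assume UTF-8.
val lines = sc.hadoopFile(path, classOf[TextInputFormat],
                          classOf[LongWritable], classOf[Text])
  // Decode immediately inside the map: Hadoop reuses the Text object, and its
  // backing array can be longer than the record, hence the getLength slice.
  .map { case (_, text) =>
    new String(text.getBytes, 0, text.getLength, "ISO-8859-1")
  }
{code}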

> Broken UTF-8 encoded data gets character replacements and thus can't be 
> "fixed"
> ---
>
> Key: SPARK-1849
> URL: https://issues.apache.org/jira/browse/SPARK-1849
> Project: Spark
>  Issue Type: Bug
>  Reporter: Harry Brundage
> Attachments: encoding_test
>
>
> I'm trying to process a file in Hadoop whose contents aren't valid UTF-8, 
> using Spark via {{sc.textFile()}}. Is this possible, and if not, is this a bug 
> that we should fix? It looks like {{HadoopRDD}} uses 
> {{org.apache.hadoop.io.Text.toString}} on all the data it ever reads, which I 
> believe replaces invalid UTF-8 byte sequences with the UTF-8 replacement 
> character, \uFFFD. Some example code mimicking what {{sc.textFile}} does 
> underneath:
> {code}
> scala> sc.textFile(path).collect()(0)
> res8: String = ?pple
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
> classOf[Text]).map(pair => pair._2.toString).collect()(0).getBytes()
> res9: Array[Byte] = Array(-17, -65, -67, 112, 112, 108, 101)
> scala> sc.hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], 
> classOf[Text]).map(pair => pair._2.getBytes).collect()(0)
> res10: Array[Byte] = Array(-60, 112, 112, 108, 101)
> {code}
> In the above example, the first two snippets show the string representation 
> and byte representation of the example line of text. The string shows a 
> question mark for the replacement character, and the bytes reveal that the 
> replacement character has been swapped in by {{Text.toString}}. The third 
> snippet shows what happens if you call {{getBytes}} on the {{Text}} object 
> that comes back from Hadoop land: we get the real bytes in the file back out.
> Now, I think this is a bug, though you may disagree. The text inside my file 
> consists of perfectly valid iso-8859-1 encoded bytes, which I would like to be 
> able to rescue and re-encode into UTF-8, because I want my application to be 
> smart like that. I think Spark should give me the raw broken string so I can 
> re-encode it, but I can't get at the original bytes in order to guess at what 
> the source encoding might be, as they have already been replaced. I'm dealing 
> with data from some CDN access logs which are, to put it nicely, diversely 
> encoded, but I think this is a use case Spark should fully support. So my 
> suggested fix, on which I'd like some guidance, is to change {{textFile}} to 
> spit out the broken strings by not using {{Text}}'s UTF-8 encoding.
> Further compounding this issue is that my application is actually in PySpark, 
> but we can talk about how bytes fly through to Scala land after this if we 
> agree that this is an issue at all. 






[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be "fixed"

2014-05-16 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000580#comment-14000580
 ] 

Mridul Muralidharan commented on SPARK-1849:


You are missing my point ... textFile, for whatever reason, has been a
wrapper over TextInputFormat, which, as you observed, assumes UTF-8 (along
with quirks for handling badly encoded data so that it works reasonably well
at scale).

If you want to support specifying the encoding, there is no way around
writing your own InputFormat.

Whether that becomes a new API contributed back to SparkContext is a
different question - but there is no other way around the immediate problem
of unblocking your requirement.
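
For illustration, a rough sketch of what such an InputFormat could look like, 
using the new-API {{FileInputFormat}} so the values can be plain Strings (the 
class name and the hard-coded ISO-8859-1 charset are assumptions for the 
example, not an existing Spark or Hadoop class):

{code}
import org.apache.hadoop.io.LongWritable
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, LineRecordReader}

// Hypothetical InputFormat: splitting and line framing are delegated to
// Hadoop's LineRecordReader, but each line's raw bytes are decoded as
// ISO-8859-1 instead of going through Text.toString (UTF-8). Assumes
// uncompressed, splittable input.
class Latin1TextInputFormat extends FileInputFormat[LongWritable, String] {
  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, String] =
    new RecordReader[LongWritable, String] {
      private val lines = new LineRecordReader()

      override def initialize(split: InputSplit, context: TaskAttemptContext): Unit =
        lines.initialize(split, context)

      override def nextKeyValue(): Boolean = lines.nextKeyValue()

      override def getCurrentKey(): LongWritable = lines.getCurrentKey()

      override def getCurrentValue(): String = {
        val t = lines.getCurrentValue()
        // Slice with getLength: the Text backing array may be longer than the record.
        new String(t.getBytes, 0, t.getLength, "ISO-8859-1")
      }

      override def getProgress(): Float = lines.getProgress()

      override def close(): Unit = lines.close()
    }
}

// Used via the new-API entry point, e.g.:
// val rdd = sc.newAPIHadoopFile(path, classOf[Latin1TextInputFormat],
//                               classOf[LongWritable], classOf[String]).map(_._2)
{code}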



[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be "fixed"

2014-05-16 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000397#comment-14000397
 ] 

Mridul Muralidharan commented on SPARK-1849:


Looks like textFile is probably the wrong API to use.
You cannot recover from badly encoded data ... it would be better to write your 
own InputFormat that does what you need.



[jira] [Commented] (SPARK-1849) Broken UTF-8 encoded data gets character replacements and thus can't be "fixed"

2014-05-16 Thread Harry Brundage (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14000460#comment-14000460
 ] 

Harry Brundage commented on SPARK-1849:
---

I disagree - the data isn't badly encoded, just not UTF-8 encoded, which, when 
we're talking about data from the internet, really isn't all that uncommon. You 
could extend my specific problem - some lines in the source file being in a 
different encoding - to a file entirely encoded in iso-8859-1, which is likely 
something Spark should deal with considering all the effort put into supporting 
Windows. I don't think asking users to drop down to writing custom 
{{InputFormat}}s to deal with the realities of large data is a good move if 
Spark wants to become the fast and general data processing engine for 
large-scale data.

I could certainly use {{sc.hadoopFile}} to load my data and work with the 
{{org.apache.hadoop.io.Text}} objects myself, but A) why force everyone dealing 
with this issue to go through the pain of figuring that out, and B) I'm in 
PySpark, where I can't actually do that without fancy Py4J trickery. I think 
encoding issues should be in your face.
