[ https://issues.apache.org/jira/browse/SPARK-23725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528691#comment-16528691 ]
Maxim Gekk commented on SPARK-23725:
------------------------------------

[~hyukjin.kwon] I am working on the implementation and have run into a problem: I cannot identify lineSep uniquely if the encoding is not specified. For example, if a partitioned file contains:
{code}
65 00 31 00 0a 00 6c 00 69
{code}
I cannot say definitively what the lineSep is here. It could be *0x0a 0x00* if the encoding is UTF-16LE, as in:
{code}
00000000  6c 00 69 00 6e 00 65 00  31 00 0a 00 6c 00 69 00  |l.i.n.e.1...l.i.|
00000010  6e 00 65 00 32 00                                 |n.e.2.|
00000016
{code}
or *0x00 0x0a* if the encoding is UTF-16BE, as in:
{code}
00000000  00 6c 00 69 00 6e 00 65  00 31 00 0a 00 6c 00 69  |.l.i.n.e.1...l.i|
00000010  00 6e 00 65 00 32                                 |.n.e.2|
00000016
{code}
So, to detect lineSep automatically, we should require the encoding to be specified.

> Improve Hadoop's LineReader to support charsets different from UTF-8
> --------------------------------------------------------------------
>
>                 Key: SPARK-23725
>                 URL: https://issues.apache.org/jira/browse/SPARK-23725
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Maxim Gekk
>            Priority: Minor
>
> If the record delimiter is not specified, Hadoop LineReader splits
> lines/records by '\n', '\r', and/or '\r\n' in UTF-8 encoding:
> [https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/LineReader.java#L173-L177].
> The implementation should be improved to support any charset.
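
The ambiguity described in the comment can be reproduced with a short Python sketch. It is only an illustration, not part of Spark or Hadoop: the variable names are made up here, and the sample text "line1\nline2" is taken from the hex dumps in the comment. It checks that the 9-byte fragment occurs in both the UTF-16LE and the UTF-16BE encoding of that text, and that both candidate separators can be found inside the fragment.

```python
# Illustrative sketch only; names below are hypothetical, not Spark/Hadoop APIs.

# The bytes from the comment, as they might appear in a file split
# when the encoding is unknown.
fragment = bytes.fromhex("65 00 31 00 0a 00 6c 00 69")

# The sample text from the hex dumps, in both byte orders.
text = "line1\nline2"
le = text.encode("utf-16-le")
be = text.encode("utf-16-be")

# The same fragment is a substring of both encodings.
in_le = fragment in le
in_be = fragment in be

# Candidate line separators: '\n' encoded in each byte order.
sep_le = "\n".encode("utf-16-le")  # bytes 0a 00
sep_be = "\n".encode("utf-16-be")  # bytes 00 0a

# Both candidates match inside the fragment, at different offsets,
# so the bytes alone do not determine which one is the separator.
pos_le = fragment.find(sep_le)
pos_be = fragment.find(sep_be)

print(in_le, in_be, pos_le, pos_be)
```

Since both searches succeed, a reader that sees only these bytes has no basis for choosing between the two separators, which is why the encoding has to be supplied.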