Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20727#discussion_r172656702
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -42,7 +52,12 @@ class HadoopFileLinesReader(
           Array.empty)
         val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
         val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    -    val reader = new LineRecordReader()
    +    val reader = if (lineSeparator != "\n") {
    +      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
    --- End diff --
    
    My suggestion is to pass an `Array[Byte]` into the class. If charsets other 
than UTF-8 are supported in the future, this place will certainly have to 
change, so you can make the class tolerant of input charsets right now. As an 
example, the JSON reader (the Jackson JSON parser) can read JSON in any 
standard charset. To fix its per-line mode, we need to support `lineSep` in 
any charset and convert `lineSep` to an array of bytes before this class is 
used. If you restrict the charset of `lineSep` to UTF-8, you just put up a 
wall for other datasources.
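
    A minimal sketch of the idea (the class and method names below are 
hypothetical illustrations, not the actual Spark API): the reader accepts the 
separator as raw bytes, and the caller encodes the `lineSep` string in 
whatever charset its datasource uses before constructing the reader.

    ```scala
    import java.nio.charset.{Charset, StandardCharsets}

    // Hypothetical reader that is charset-agnostic: it only ever sees bytes.
    // None means "use the default separators"; Some(bytes) is a custom one.
    class CharsetAgnosticLinesReader(lineSeparator: Option[Array[Byte]]) {
      // ... record-reading logic would go here ...
    }

    object CharsetAgnosticLinesReader {
      // Encode a lineSep string in the datasource's charset before
      // handing it to the reader.
      def separatorBytes(lineSep: String, charset: Charset): Array[Byte] =
        lineSep.getBytes(charset)
    }

    // "\n" is one byte in UTF-8 but two bytes in UTF-16LE, which is exactly
    // why the conversion must happen outside the reader, per datasource.
    val utf8Sep  = CharsetAgnosticLinesReader.separatorBytes("\n", StandardCharsets.UTF_8)
    val utf16Sep = CharsetAgnosticLinesReader.separatorBytes("\n", StandardCharsets.UTF_16LE)
    println(utf8Sep.length)   // 1
    println(utf16Sep.length)  // 2
    ```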


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
