Github user MaxGekk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20727#discussion_r172656702

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala ---
    @@ -42,7 +52,12 @@ class HadoopFileLinesReader(
             Array.empty)
         val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
         val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
    -    val reader = new LineRecordReader()
    +    val reader = if (lineSeparator != "\n") {
    +      new LineRecordReader(lineSeparator.getBytes("UTF-8"))
    --- End diff --

    My suggestion is to pass an Array[Byte] into the class. If charsets other than UTF-8 are supported in the future, this place will have to change anyway, so you can make the class tolerant of input charsets right now. For example, the JSON reader (the Jackson JSON parser) can already read JSON in any standard charset; to fix its per-line mode, lineSep must be supported in any charset and converted to an array of bytes before this class is used. If you restrict the charset of lineSep to UTF-8, you just put up a wall for other datasources.
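    A minimal sketch of the suggestion, assuming the constructor gains a hypothetical `lineSeparator: Option[Array[Byte]]` parameter (the exact signature is an assumption, not the merged code). The caller converts its lineSep string to bytes in whatever charset it reads, so this class never hard-codes UTF-8. PartitionedFile and RecordReaderIterator are the existing helpers from the same Spark package:

        import java.io.Closeable
        import java.net.URI

        import org.apache.hadoop.conf.Configuration
        import org.apache.hadoop.fs.Path
        import org.apache.hadoop.io.Text
        import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}
        import org.apache.hadoop.mapreduce.lib.input.{FileSplit, LineRecordReader}
        import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

        // Hypothetical signature: the separator arrives as raw bytes, so the
        // charset decision stays with the datasource that produced them.
        class HadoopFileLinesReader(
            file: PartitionedFile,
            lineSeparator: Option[Array[Byte]],
            conf: Configuration) extends Iterator[Text] with Closeable {

          private val iterator = {
            val fileSplit = new FileSplit(
              new Path(new URI(file.filePath)), file.start, file.length, Array.empty)
            val attemptId = new TaskAttemptID(new TaskID(new JobID(), TaskType.MAP, 0), 0)
            val hadoopAttemptContext = new TaskAttemptContextImpl(conf, attemptId)
            val reader = lineSeparator match {
              // Custom separator: pass the bytes straight through to Hadoop.
              case Some(sep) => new LineRecordReader(sep)
              // No separator given: keep Hadoop's defaults ("\n", "\r", "\r\n").
              case None => new LineRecordReader()
            }
            reader.initialize(fileSplit, hadoopAttemptContext)
            new RecordReaderIterator(reader)
          }

          override def hasNext: Boolean = iterator.hasNext
          override def next(): Text = iterator.next()
          override def close(): Unit = iterator.close()
        }

    On the caller side a datasource would convert its own lineSep option before constructing the reader, e.g. (illustrative values, not real options):

        object CallerSketch {
          val charset = "UTF-16LE"
          val lineSep: Option[String] = Some("\r\n")
          val sepBytes: Option[Array[Byte]] = lineSep.map(_.getBytes(charset))
        }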