Reading CSV with multiLine option invalidates encoding option.

Han-Cheol Cho Tue, 15 Aug 2017 21:34:01 -0700

Dear Spark ML members,


I ran into a problem using the "multiLine" option to load CSV data with
Shift-JIS encoding.
When option("multiLine", true) is specified, option("encoding",
"encoding-name") no longer takes effect.
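
A minimal sketch of the kind of read that shows this behavior. The helper
name readSjisMultiLine, the "header" option, and the path passed in are
placeholders for illustration only; the reader options are the ones
described above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper: reads a Shift-JIS CSV file with multiLine enabled.
// `path` is a placeholder for a Shift-JIS encoded CSV file.
def readSjisMultiLine(spark: SparkSession, path: String): DataFrame =
  spark.read
    .option("header", true)
    .option("multiLine", true)        // switches to the multi-line reader
    .option("encoding", "Shift_JIS")  // the option that appears to be ignored
    .csv(path)
```

Running the same read without the "multiLine" option gives a baseline to
compare the decoded text against.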


In CSVDataSource.scala, I found that the MultiLineCSVDataSource.readFile()
method doesn't use parser.options.charset at all.

object MultiLineCSVDataSource extends CSVDataSource {
  override val isSplitable: Boolean = false

  override def readFile(
      conf: Configuration,
      file: PartitionedFile,
      parser: UnivocityParser,
      schema: StructType): Iterator[InternalRow] = {
    UnivocityParser.parseStream(
      CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
      parser.options.headerFlag,
      parser,
      schema)
  }
  ...

On the other hand, the TextInputCSVDataSource.readFile() method does use it:

  override def readFile(
      conf: Configuration,
      file: PartitionedFile,
      parser: UnivocityParser,
      schema: StructType): Iterator[InternalRow] = {
    val lines = {
      val linesReader = new HadoopFileLinesReader(file, conf)
      Option(TaskContext.get()).foreach(
        _.addTaskCompletionListener(_ => linesReader.close()))
      linesReader.map { line =>
        // <---- charset option is used here.
        new String(line.getBytes, 0, line.getLength, parser.options.charset)
      }
      }
    }

    val shouldDropHeader = parser.options.headerFlag && file.start == 0
    UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
  }
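
To illustrate why the explicit charset in the new String(...) call above
matters, here is a small standalone sketch using only the JDK. The object
name CharsetDemo and the sample text are made up for illustration, and the
UTF-8 fallback is an assumption about what happens when the charset is not
passed along, not something confirmed from the Spark source.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetDemo {
  private val sjis: Charset = Charset.forName("Shift_JIS")

  // Sample CSV header line (made up for this example).
  val original: String = "名前,備考"

  // Raw bytes as they would sit in a Shift-JIS encoded file.
  val bytes: Array[Byte] = original.getBytes(sjis)

  // Decoding with the configured charset, as the line-based reader does.
  val withCharset: String = new String(bytes, 0, bytes.length, sjis)

  // Decoding the same bytes as UTF-8 (assumed fallback): the text is garbled.
  val withoutCharset: String =
    new String(bytes, 0, bytes.length, StandardCharsets.UTF_8)
}
```

Only the first decode round-trips to the original text; the second yields
replacement characters, which matches the symptom of the encoding option
being ignored.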


This looks like a bug.
Has anyone else run into the same problem?


Best wishes,
Han-Cheol

-- 
==================================
Han-Cheol Cho, Ph.D.
Data scientist, Data Science Team, Data Laboratory
NHN Techorus Corp.

Homepage: https://sites.google.com/site/priancho/
==================================
