Dear Spark ML members,

I ran into a problem using the "multiLine" option to load CSV data encoded in Shift-JIS. When option("multiLine", true) is specified, option("encoding", "encoding-name") no longer has any effect.

In CSVDataSource.scala, I found that MultiLineCSVDataSource.readFile() does not use parser.options.charset at all:

    object MultiLineCSVDataSource extends CSVDataSource {
      override val isSplitable: Boolean = false

      override def readFile(
          conf: Configuration,
          file: PartitionedFile,
          parser: UnivocityParser,
          schema: StructType): Iterator[InternalRow] = {
        UnivocityParser.parseStream(
          CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
          parser.options.headerFlag,
          parser,
          schema)
      }
      ...

On the other hand, TextInputCSVDataSource.readFile() does use it:

    override def readFile(
        conf: Configuration,
        file: PartitionedFile,
        parser: UnivocityParser,
        schema: StructType): Iterator[InternalRow] = {
      val lines = {
        val linesReader = new HadoopFileLinesReader(file, conf)
        Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ => linesReader.close()))
        linesReader.map { line =>
          new String(line.getBytes, 0, line.getLength, parser.options.charset) // <---- charset option is used here.
        }
      }

      val shouldDropHeader = parser.options.headerFlag && file.start == 0
      UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
    }

This looks like a bug. Has anyone else run into the same problem?

Best wishes,
Han-Cheol

--
==================================
Han-Cheol Cho, Ph.D.
Data scientist, Data Science Team, Data Laboratory
NHN Techorus Corp.

Homepage: https://sites.google.com/site/priancho/
==================================
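
For illustration only, here is a minimal sketch of why the missing charset matters. It does not use Spark at all: it only repeats the new String(bytes, offset, length, charset) step quoted above on a Shift-JIS byte array, once with the intended charset and once with a different one. The names CharsetDemo, sample and decode are hypothetical, and using UTF-8 as the "wrong" charset is an assumption; the message does not establish which charset the multiLine code path actually falls back to.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object CharsetDemo {
  // The two kanji for "Tokyo", written with Unicode escapes so that the
  // encoding of this source file does not matter.
  val sample: String = "\u6771\u4eac"

  // Same decoding call as in the quoted TextInputCSVDataSource.readFile().
  def decode(bytes: Array[Byte], charset: Charset): String =
    new String(bytes, 0, bytes.length, charset)

  def main(args: Array[String]): Unit = {
    val sjis = Charset.forName("Shift_JIS")
    val bytes = sample.getBytes(sjis)

    // Decoding with the charset the data was written in restores the text.
    val withCharset = decode(bytes, sjis)
    // Decoding with another charset (UTF-8 is an assumption here) does not.
    val withOtherCharset = decode(bytes, StandardCharsets.UTF_8)

    println(s"Shift_JIS decoding matches the input: ${withCharset == sample}")
    println(s"UTF-8 decoding matches the input: ${withOtherCharset == sample}")
  }
}
```

If the multiLine path decodes the stream with anything other than the configured encoding, the loaded columns would be garbled in the same way as the second result here.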