Dear Spark ML members,
I ran into a problem when using the "multiLine" option to load CSV data
with Shift-JIS encoding.
When option("multiLine", true) is specified, option("encoding",
"encoding-name") no longer has any effect.
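For reference, here is a minimal sketch of how the problem can be reproduced. It assumes a local SparkSession; the object name, the sample data and the temporary file are made up for illustration and are not taken from my actual job.

```scala
import java.nio.charset.Charset
import java.nio.file.{Files, Path}

import org.apache.spark.sql.SparkSession

// Hypothetical reproduction; requires Spark SQL on the classpath and a JDK
// that ships the Shift_JIS charset.
object MultiLineEncodingRepro {
  // Write a small Shift-JIS CSV whose quoted field spans two lines.
  def writeSample(path: Path): Path = {
    val text = "id,name\n1,\"東京\n大阪\"\n"
    Files.write(path, text.getBytes(Charset.forName("Shift_JIS")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multiLine-encoding-repro")
      .getOrCreate()

    val path = writeSample(Files.createTempFile("sjis-sample", ".csv"))

    // Both options are set; with multiLine enabled, the encoding option
    // appears to be ignored and the Japanese text comes out garbled.
    val df = spark.read
      .option("header", true)
      .option("multiLine", true)
      .option("encoding", "Shift_JIS")
      .csv(path.toString)

    df.show(truncate = false)
    spark.stop()
  }
}
```

Removing the multiLine option should make the encoding option take effect again, but the quoted field that spans two lines is then no longer read as a single record.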
In CSVDataSource.scala, I found that the MultiLineCSVDataSource.readFile()
method doesn't use parser.options.charset at all:
    object MultiLineCSVDataSource extends CSVDataSource {
      override val isSplitable: Boolean = false

      override def readFile(
          conf: Configuration,
          file: PartitionedFile,
          parser: UnivocityParser,
          schema: StructType): Iterator[InternalRow] = {
        UnivocityParser.parseStream(
          CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
          parser.options.headerFlag,
          parser,
          schema)
      }
      ...
On the other hand, the TextInputCSVDataSource.readFile() method does use it:
    override def readFile(
        conf: Configuration,
        file: PartitionedFile,
        parser: UnivocityParser,
        schema: StructType): Iterator[InternalRow] = {
      val lines = {
        val linesReader = new HadoopFileLinesReader(file, conf)
        Option(TaskContext.get()).foreach(
          _.addTaskCompletionListener(_ => linesReader.close()))
        linesReader.map { line =>
          new String(line.getBytes, 0, line.getLength,
            parser.options.charset) // <---- charset option is used here.
        }
      }
      val shouldDropHeader = parser.options.headerFlag && file.start == 0
      UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
    }
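To illustrate why the missing charset matters on the stream path, here is a small sketch that uses only the JDK and no Spark classes. The object CharsetStreamSketch and its readAll helper are hypothetical names of mine. They show the general idea of decoding a raw InputStream with an explicit charset and are not meant as the actual fix for parseStream.

```scala
import java.io.{ByteArrayInputStream, InputStream, InputStreamReader}
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper, not part of Spark: decode a byte stream with an
// explicitly chosen charset instead of relying on a default.
object CharsetStreamSketch {
  def readAll(in: InputStream, charset: Charset): String = {
    val reader = new InputStreamReader(in, charset)
    try {
      val sb = new java.lang.StringBuilder
      val buf = new Array[Char](1024)
      var n = reader.read(buf)
      while (n != -1) {
        sb.append(buf, 0, n)
        n = reader.read(buf)
      }
      sb.toString
    } finally {
      reader.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Assumes the JDK provides the Shift_JIS charset.
    val sjis = Charset.forName("Shift_JIS")
    val original = "id,name\n1,\"東京\n大阪\"\n"
    val bytes = original.getBytes(sjis)

    // Decoding with the charset the data was written in restores the text.
    val good = readAll(new ByteArrayInputStream(bytes), sjis)
    // Decoding the same bytes as UTF-8 yields replacement characters.
    val bad = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8)

    println(good == original) // prints "true"
    println(bad == original)  // prints "false"
  }
}
```

The second case is what I believe happens in the multiLine path: the bytes reach the parser without the configured charset, so anything that is not valid in the default encoding gets corrupted.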
This looks like a bug to me.
Has anyone run into the same problem before?
Best wishes,
Han-Cheol
--
==================================
Han-Cheol Cho, Ph.D.
Data scientist, Data Science Team, Data Laboratory
NHN Techorus Corp.
Homepage: https://sites.google.com/site/priancho/
==================================