My apologies,
The problem turned out to be with our Hadoop cluster.
When we ran the same code on another (HDP-based) cluster, it worked
without any problem.
```shell
## make sjis text
cat a.txt
8月データだけでやってみよう
nkf -W -s a.txt > b.txt
cat b.txt
87n%G!<%?$@$1$G$d$C$F$_$h$&
nkf -s -w b.txt
8月データだけでやってみよう
hdfs dfs -put a.txt b.txt
```

```scala
// YARN mode test
spark.read.option("encoding", "utf-8").csv("a.txt").show(1)
+--------------+
|           _c0|
+--------------+
|8月データだけでやってみよう|
+--------------+

spark.read.option("encoding", "sjis").csv("b.txt").show(1)
+--------------+
|           _c0|
+--------------+
|8月データだけでやってみよう|
+--------------+

spark.read.option("encoding", "utf-8").option("multiLine", true).csv("a.txt").show(1)
+--------------+
|           _c0|
+--------------+
|8月データだけでやってみよう|
+--------------+

spark.read.option("encoding", "sjis").option("multiLine", true).csv("b.txt").show(1)
+--------------+
|           _c0|
+--------------+
|8月データだけでやってみよう|
+--------------+
```
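As a quick sanity check that the string itself survives a Shift-JIS round trip (the same thing the nkf commands above exercise), here is a minimal JVM-only sketch with no Spark involved:

```scala
import java.nio.charset.Charset

// Round-trip the test string through Shift_JIS,
// mirroring `nkf -W -s` (encode) and `nkf -s -w` (decode)
val sjis    = Charset.forName("Shift_JIS")
val text    = "8月データだけでやってみよう"
val encoded = text.getBytes(sjis)        // UTF-8 string -> SJIS bytes
val decoded = new String(encoded, sjis)  // SJIS bytes -> String again
assert(decoded == text)                  // lossless round trip
```

So the text itself is fine; the difference must come from how the stream is decoded on each cluster.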
I am still digging into the root cause and will share it once I find it :-)
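In the meantime, my working guess at the shape of a fix (an assumption, not an actual patch): the stream handed to UnivocityParser.parseStream would need to be decoded with parser.options.charset, the same way TextInputCSVDataSource decodes each line. A minimal Spark-free sketch of just that decoding step, with a ByteArrayInputStream standing in for the HDFS input stream:

```scala
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}
import java.nio.charset.Charset

// Stand-in for the raw byte stream that readFile() opens from HDFS
val charset = Charset.forName("Shift_JIS")
val stream  = new ByteArrayInputStream("8月データだけでやってみよう".getBytes(charset))

// Decode with the configured charset instead of the platform default
val reader = new BufferedReader(new InputStreamReader(stream, charset))
val line   = reader.readLine()
assert(line == "8月データだけでやってみよう")
```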
Best wishes,
Han-Choel
On Wed, Aug 16, 2017 at 1:32 PM, Han-Cheol Cho <[email protected]> wrote:
> Dear Spark ML members,
>
>
> I ran into trouble using the "multiLine" option to load CSV data with
> Shift-JIS encoding.
> When option("multiLine", true) is specified, option("encoding",
> "encoding-name") simply stops working.
>
>
> In the CSVDataSource.scala file, I found that MultiLineCSVDataSource.readFile()
> doesn't use parser.options.charset at all:
>
> object MultiLineCSVDataSource extends CSVDataSource {
>   override val isSplitable: Boolean = false
>
>   override def readFile(
>       conf: Configuration,
>       file: PartitionedFile,
>       parser: UnivocityParser,
>       schema: StructType): Iterator[InternalRow] = {
>     UnivocityParser.parseStream(
>       CodecStreams.createInputStreamWithCloseResource(conf, file.filePath),
>       parser.options.headerFlag,
>       parser,
>       schema)
>   }
>   ...
>
> On the other hand, TextInputCSVDataSource.readFile() does use it:
>
>   override def readFile(
>       conf: Configuration,
>       file: PartitionedFile,
>       parser: UnivocityParser,
>       schema: StructType): Iterator[InternalRow] = {
>     val lines = {
>       val linesReader = new HadoopFileLinesReader(file, conf)
>       Option(TaskContext.get()).foreach(_.addTaskCompletionListener(_ =>
>         linesReader.close()))
>       linesReader.map { line =>
>         new String(line.getBytes, 0, line.getLength,
>           parser.options.charset)  // <---- charset option is used here
>       }
>     }
>
>     val shouldDropHeader = parser.options.headerFlag && file.start == 0
>     UnivocityParser.parseIterator(lines, shouldDropHeader, parser, schema)
>   }
>
>
> This looks like a bug.
> Has anyone else run into the same problem?
>
>
> Best wishes,
> Han-Cheol
--
==================================
Han-Cheol Cho, Ph.D.
Data scientist, Data Science Team, Data Laboratory
NHN Techorus Corp.
Homepage: https://sites.google.com/site/priancho/
==================================