For when multiLine is not set, we currently only support ascii-compatible
encodings, up to my knowledge, mainly due to line separator and as I
investigated in the comment.
For when multiLine is set, it appears encoding is not considered. I
actually meant encoding does not work at all in this case
Hi,
Thank you for your response.
I finally found the cause of this
When multiLine option is set, input file is read by
UnivocityParser.parseStream() method.
This method, in turn, calls convertStream() that initializes tokenizer with
tokenizer.beginParsing(inputStream) and parses records using
Hi,
Since the csv source currently supports ascii-compatible charset, so I
guess shift-jis also works well.
You could check Hyukjin's comment in
https://issues.apache.org/jira/browse/SPARK-21289 for more info.
On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho wrote:
> My
My apologies,
It was a problem of our Hadoop cluster.
When we tested the same code on another cluster (HDP-based), it worked
without any problem.
```scala
## make sjis text
cat a.txt
8月データだけでやってみよう
nkf -W -s a.txt >b.txt
cat b.txt
87n%G!<%?$@$1$G$d$C$F$_$h$&
nkf -s -w b.txt
8月データだけでやってみよう
hdfs
Dear Spark ML members,
I experienced a trouble in using "multiLine" option to load CSV data with
Shift-JIS encoding.
When option("multiLine", true) is specified, option("encoding",
"encoding-name") just doesn't work anymore.
In CSVDataSource.scala file, I found that