Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Hyukjin Kwon
When multiLine is not set, we currently support only ASCII-compatible encodings, to my knowledge, mainly because of the line separator, as I investigated in the comment. When multiLine is set, it appears the encoding is not considered. I actually meant that encoding does not work at all in this case
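To illustrate the line-separator point, here is a minimal sketch in plain Python (not Spark's code). It assumes a reader that splits the raw bytes on the newline byte 0x0A before decoding each piece; the function name `split_then_decode` and the sample text are hypothetical. Shift-JIS survives this because 0x0A never occurs inside a multi-byte character, while UTF-16 does not, because its newline is two bytes wide.

```python
# Sample text with two records; purely illustrative.
text = "8月,データ\nだけ,やってみよう\n"


def split_then_decode(data, encoding):
    """Split raw bytes on the newline byte first, then decode each piece.

    This mimics a line-oriented reader that is unaware of the encoding
    when it looks for record boundaries (an assumption for illustration).
    """
    return [
        chunk.decode(encoding, errors="replace")
        for chunk in data.split(b"\n")
        if chunk
    ]


# Shift-JIS is ASCII-compatible: 0x0A only ever means "newline".
sjis_lines = split_then_decode(text.encode("shift_jis"), "shift_jis")

# UTF-16LE encodes the newline as 0x0A 0x00, so splitting on 0x0A leaves
# a stray 0x00 at the start of the next piece and misaligns the decoder.
utf16_lines = split_then_decode(text.encode("utf-16-le"), "utf-16-le")

print(sjis_lines)
print(utf16_lines)
```

The Shift-JIS result matches the original two records, whereas the UTF-16LE result does not.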

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-17 Thread Han-Cheol Cho
Hi, thank you for your response. I finally found the cause of this. When the multiLine option is set, the input file is read by the UnivocityParser.parseStream() method. This method, in turn, calls convertStream(), which initializes the tokenizer with tokenizer.beginParsing(inputStream) and parses records using
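The effect of handing a parser a byte stream without a charset can be sketched in plain Python. This is not the univocity or Spark code; `read_records` is a hypothetical stand-in, and the fallback to UTF-8 when no encoding is passed is an assumption made for the illustration.

```python
import io


def read_records(stream, encoding=None):
    """Hypothetical parser entry point.

    When no encoding is given, the bytes are decoded with a default
    charset (UTF-8 here, as an assumption), regardless of how the
    file was actually written.
    """
    reader = io.TextIOWrapper(
        stream,
        encoding=encoding or "utf-8",
        errors="replace",
        newline="",
    )
    return reader.read().splitlines()


# Shift-JIS bytes for the sample sentence used in this thread.
sjis_bytes = "8月データだけでやってみよう\n".encode("shift_jis")

# Without the charset, the Japanese text comes back garbled.
garbled = read_records(io.BytesIO(sjis_bytes))

# With the charset passed through, the record is read correctly.
correct = read_records(io.BytesIO(sjis_bytes), "shift_jis")

print(garbled)
print(correct)
```

In this sketch the only difference between the two calls is whether the encoding reaches the reader, which is the kind of gap described above.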

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-16 Thread Takeshi Yamamuro
Hi, since the CSV source currently supports ASCII-compatible charsets, I guess Shift-JIS also works well. You could check Hyukjin's comment in https://issues.apache.org/jira/browse/SPARK-21289 for more info. On Wed, Aug 16, 2017 at 2:54 PM, Han-Cheol Cho wrote: > My

Re: Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
My apologies, it was a problem with our Hadoop cluster. When we tested the same code on another cluster (HDP-based), it worked without any problem.

```scala
## make sjis text
cat a.txt
8月データだけでやってみよう
nkf -W -s a.txt >b.txt
cat b.txt
87n%G!<%?$@$1$G$d$C$F$_$h$&
nkf -s -w b.txt
8月データだけでやってみよう
hdfs
```

Reading CSV with multiLine option invalidates encoding option.

2017-08-15 Thread Han-Cheol Cho
Dear Spark ML members, I ran into trouble using the "multiLine" option to load CSV data with Shift-JIS encoding. When option("multiLine", true) is specified, option("encoding", "encoding-name") no longer works. In the CSVDataSource.scala file, I found that
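One possible workaround, offered only as a sketch and not something proposed in this thread, is to transcode the file to UTF-8 before loading it with multiLine enabled. The helper below uses only the Python standard library; the name `transcode_to_utf8`, its parameters, and the default source encoding are hypothetical.

```python
def transcode_to_utf8(src_path, dst_path, src_encoding="shift_jis",
                      chunk_chars=1 << 16):
    """Rewrite a text file as UTF-8, reading it in fixed-size chunks.

    newline="" disables newline translation on both sides, so line
    breaks embedded in quoted CSV fields are copied through unchanged.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)
            if not chunk:
                break
            dst.write(chunk)
```

For example, `transcode_to_utf8("b.txt", "b_utf8.txt")` would write a UTF-8 copy that can then be read without relying on the encoding option. This trades an extra pass over the data for independence from how the reader handles charsets.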