Github user MaxGekk commented on a diff in the pull request:

https://github.com/apache/spark/pull/21247#discussion_r187780271

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala ---
@@ -138,3 +121,40 @@ private[sql] class JSONOptions(
     factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
   }
 }
+
+private[sql] class JSONOptionsInRead(
+    @transient override val parameters: CaseInsensitiveMap[String],
+    defaultTimeZoneId: String,
+    defaultColumnNameOfCorruptRecord: String)
+  extends JSONOptions(parameters, defaultTimeZoneId, defaultColumnNameOfCorruptRecord) {
+
+  def this(
+    parameters: Map[String, String],
+    defaultTimeZoneId: String,
+    defaultColumnNameOfCorruptRecord: String = "") = {
+    this(
+      CaseInsensitiveMap(parameters),
+      defaultTimeZoneId,
+      defaultColumnNameOfCorruptRecord)
+  }
+
+  protected override def checkedEncoding(enc: String): String = {
+    // The following encodings are not supported in per-line mode (multiline is false)
+    // because they cause some problems in reading files with BOM which is supposed to
+    // present in the files with such encodings. After splitting input files by lines,
+    // only the first lines will have the BOM which leads to impossibility for reading
+    // the rest lines. Besides of that, the lineSep option must have the BOM in such
+    // encodings which can never present between lines.
+    val blacklist = Seq(Charset.forName("UTF-16"), Charset.forName("UTF-32"))
+    val isBlacklisted = blacklist.contains(Charset.forName(enc))
+    require(multiLine || !isBlacklisted,
--- End diff --

There is no reason to blacklist `UTF-16` and `UTF-32` on write. I checked the content of the JSON files written by @gatorsmile's [test](https://github.com/apache/spark/pull/21247/commits/97c4af76addc78a85ceb503a5db16f3285f18a5f).
For example, for `UTF-16`:

```
$ hexdump -C ...c000.json
00000000  fe ff 00 7b 00 22 00 5f  00 31 00 22 00 3a 00 22  |...{."._.1.".:."|
00000010  00 61 00 22 00 2c 00 22  00 5f 00 32 00 22 00 3a  |.a.".,."._.2.".:|
00000020  00 31 00 7d 00 0a 00 7b  00 22 00 5f 00 31 00 22  |.1.}...{."._.1."|
00000030  00 3a 00 22 00 63 00 22  00 2c 00 22 00 5f 00 32  |.:.".c.".,."._.2|
00000040  00 22 00 3a 00 33 00 7d  00 0a                    |.".:.3.}..|
0000004a
```

The file contains the BOM `fe ff` at the beginning, as expected, and the written line separator does not contain a BOM (look at offsets 0x24-0x25): `00 7d` **00 0a** `00 7b`.
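
The same byte layout can be reproduced with the JDK alone. The following is a minimal sketch, not Spark's actual writer path: `Utf16BomSketch`, `writeLines` and `hex` are hypothetical names introduced here, and the two records are simply the ones visible in the dump above. It assumes that all lines go through a single `OutputStreamWriter`, so the `UTF-16` encoder emits the BOM once per stream, whereas a separator encoded on its own gets its own BOM.

```scala
import java.io.{ByteArrayOutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets

// Hypothetical helper object, for illustration only (not part of Spark).
object Utf16BomSketch {
  // Encode every line plus "\n" through one writer, i.e. one encoder per stream.
  def writeLines(lines: Seq[String]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    val writer = new OutputStreamWriter(out, StandardCharsets.UTF_16)
    lines.foreach { line =>
      writer.write(line)
      writer.write("\n")
    }
    writer.close() // flushes the encoder into `out`
    out.toByteArray
  }

  // Render bytes the way hexdump does: two lowercase hex digits per byte.
  def hex(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString(" ")

  // The two records visible in the hexdump above.
  val bytes: Array[Byte] =
    writeLines(Seq("""{"_1":"a","_2":1}""", """{"_1":"c","_2":3}"""))

  // A separator encoded in isolation carries its own BOM.
  val standaloneSep: Array[Byte] = "\n".getBytes(StandardCharsets.UTF_16)

  def main(args: Array[String]): Unit = {
    println(hex(bytes.take(4)))           // fe ff 00 7b
    println(hex(bytes.slice(0x22, 0x28))) // 00 7d 00 0a 00 7b
    println(hex(standaloneSep))           // fe ff 00 0a
  }
}
```

The last line shows the read-side concern described in the code comment of the diff: a `lineSep` encoded separately in `UTF-16` starts with `fe ff`, a byte sequence that never appears between lines in a file written through one stream.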