[ https://issues.apache.org/jira/browse/SPARK-23194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16582025#comment-16582025 ]
Denis Bolshakov edited comment on SPARK-23194 at 8/16/18 6:33 AM:
------------------------------------------------------------------

[~cloud_fan], [~hyukjin.kwon], do you have any updates on this?

The Javadoc says:
{code:java}
@param options options to control how the json is parsed. accepts the same options and the
 *                json data source.
{code}
That is not exactly true: from_json does not support the `columnNameOfCorruptRecord` and `mode` options.

The `mode` option is not supported because it is overridden in the source code, so the user's value is simply ignored.
The `columnNameOfCorruptRecord` option is not supported because there is no way to set PERMISSIVE mode.

See:
http://apache-spark-user-list.1001560.n3.nabble.com/from-json-function-td33209.html
and
https://github.com/apache/spark/blob/e2ab7deae76d3b6f41b9ad4d0ece14ea28db40ce/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala#L568

It would be very nice to fix this, or at least to document clearly which options the from_json function supports.

The following snippet can be used to test this (I've checked it on Spark 2.0.2, 2.2.0, 2.3.0 and 2.3.1):
{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType
// Already in scope in spark-shell; elsewhere this assumes a SparkSession named `spark`.
import spark.implicits._

// One well-formed record and one malformed record.
val data = Seq(
  "{'number': 1}",
  "{'number': }"
)

val schema = new StructType()
  .add($"number".int)
  .add($"_corrupt_record".string)

val sourceDf = data.toDF("column")

// Pass the two options in question explicitly.
val jsonedDf = sourceDf
  .select(from_json(
    $"column",
    schema,
    Map("mode" -> "PERMISSIVE", "columnNameOfCorruptRecord" -> "_corrupt_record")
  ) as "data").selectExpr("data.number", "data._corrupt_record")

jsonedDf.show()
{code}

Kind regards,
Denis

> from_json in FAILFAST mode doesn't fail fast, instead it just returns nulls
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-23194
>                 URL: https://issues.apache.org/jira/browse/SPARK-23194
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Burak Yavuz
>            Priority: Major
>
> from_json accepts JSON parsing options such as being PERMISSIVE to parsing
> errors or failing fast. It seems from the code that even though the default
> option is to fail fast, we catch that exception and return nulls.
>
> In order to not change behavior, we should remove that try-catch block and
> change the default to permissive, but allow failfast mode to indeed fail.
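
To make the proposal in the issue description concrete, here is a minimal, Spark-free sketch of the two behaviours. Everything in it is hypothetical (`FromJsonModes`, `parseNumber`, `currentBehaviour`, `proposedBehaviour`, and the regex-based toy parser). It only models the semantics described in the issue, with `None` standing in for the null that from_json returns, and it is not Spark's implementation.

```scala
import scala.util.Try

object FromJsonModes {
  sealed trait ParseMode
  case object Permissive extends ParseMode
  case object FailFast extends ParseMode

  // Hypothetical stand-in for a JSON parser: accepts only {'number': <int>}.
  private val Record = """\{'number':\s*(-?\d+)\}""".r

  def parseNumber(json: String): Int = json match {
    case Record(n) => n.toInt
    case _         => throw new IllegalArgumentException(s"Malformed record: $json")
  }

  // Behaviour as reported: the parse error is always caught, so the
  // requested mode has no effect and a malformed record yields None.
  def currentBehaviour(json: String, mode: ParseMode): Option[Int] =
    Try(parseNumber(json)).toOption

  // Behaviour as proposed: PERMISSIVE is the default and yields None,
  // while FAILFAST lets the exception propagate to the caller.
  def proposedBehaviour(json: String, mode: ParseMode = Permissive): Option[Int] =
    mode match {
      case Permissive => Try(parseNumber(json)).toOption
      case FailFast   => Some(parseNumber(json))
    }

  def main(args: Array[String]): Unit = {
    val bad = "{'number': }"
    println(currentBehaviour(bad, FailFast))                  // mode is ignored
    println(proposedBehaviour(bad))                           // default: permissive
    println(Try(proposedBehaviour(bad, FailFast)).isFailure)  // failfast throws
  }
}
```

Making PERMISSIVE the default keeps existing callers seeing a null-like result, which is the "do not change behavior" part of the proposal; only callers who explicitly ask for FAILFAST would start seeing exceptions.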