[GitHub] spark pull request #15354: [SPARK-17764][SQL] Add `to_json` supporting to co...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15354

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r85630778

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```scala
@@ -494,3 +495,46 @@ case class JsonToStruct(schema: StructType, options: Map[String, String], child:
   override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
 }
+
+/**
+ * Converts a [[StructType]] to a json output string.
+ */
+case class StructToJson(options: Map[String, String], child: Expression)
+  extends Expression with CodegenFallback with ExpectsInputTypes {
+
+  override def nullable: Boolean = true
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val gen =
+    new JacksonGenerator(child.dataType.asInstanceOf[StructType], writer)
+
+  override def dataType: DataType = StringType
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (StructType.acceptsType(child.dataType)) {
+      try {
```

--- End diff ---

Ah, yes, that makes sense, but if `verifySchema` returns a boolean, we cannot tell which field and type are problematic. Maybe I can do one of the following:

- move this logic from `verifySchema` into `checkInputDataTypes`
- make `verifySchema` return the unsupported fields and types
- just fix the exception message, without the information about the unsupported fields and types

If you pick one, I will follow (or please let me know if there is a better way)!
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r85583061

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala ---

```scala
@@ -494,3 +495,46 @@ case class JsonToStruct(schema: StructType, options: Map[String, String], child:
   override def inputTypes: Seq[AbstractDataType] = StringType :: Nil
 }
+
+/**
+ * Converts a [[StructType]] to a json output string.
+ */
+case class StructToJson(options: Map[String, String], child: Expression)
+  extends Expression with CodegenFallback with ExpectsInputTypes {
+
+  override def nullable: Boolean = true
+
+  @transient
+  lazy val writer = new CharArrayWriter()
+
+  @transient
+  lazy val gen =
+    new JacksonGenerator(child.dataType.asInstanceOf[StructType], writer)
+
+  override def dataType: DataType = StringType
+  override def children: Seq[Expression] = child :: Nil
+
+  override def checkInputDataTypes(): TypeCheckResult = {
+    if (StructType.acceptsType(child.dataType)) {
+      try {
```

--- End diff ---

Sorry, one final comment as I'm looking at this more closely. I don't think we should use exceptions for control flow in the common case. Specifically, `verifySchema` should work the same way as `acceptsType` above and return a boolean.
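The distinction drawn here, a predicate that returns a boolean versus an exception thrown for control flow, can be sketched in plain Python. This is a toy model only: `accepts_type`, `check_input_data_types`, and the tuple-based schema encoding below are hypothetical stand-ins for Spark's `DataType` tree, not its actual API.

```python
# Toy schema encoding: ("atom", name), ("struct", [(field, type), ...]),
# or ("array", element_type). This is illustrative, not Spark's model.
SUPPORTED_ATOMS = {"int", "long", "double", "string", "boolean", "date"}

def accepts_type(data_type):
    """Boolean-returning check, in the spirit of `StructType.acceptsType`."""
    kind = data_type[0]
    if kind == "atom":
        return data_type[1] in SUPPORTED_ATOMS
    if kind == "struct":
        return all(accepts_type(field_type) for _, field_type in data_type[1])
    if kind == "array":
        return accepts_type(data_type[1])
    return False  # unknown kinds (e.g. calendar intervals) are unsupported

def check_input_data_types(data_type):
    """The caller branches on a boolean; no try/except in the common path."""
    if accepts_type(data_type):
        return "success"
    return "failure: unsupported type in schema"

ok = check_input_data_types(("struct", [("a", ("atom", "int"))]))
bad = check_input_data_types(("struct", [("a", ("atom", "interval"))]))
```

The trade-off, as the follow-up comment notes, is that a bare boolean cannot report which field was at fault; that information has to be recomputed or returned separately when building the error message.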
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r85583042

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/functions.scala ---

```scala
@@ -2936,6 +2936,51 @@ object functions {
   def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column =
     from_json(e, DataType.fromJson(schema).asInstanceOf[StructType], options)
 
+  /**
+   * (Scala-specific) Converts a column containing a [[StructType]] into a JSON string
+   * ([[http://jsonlines.org/ JSON Lines text format or newline-delimited JSON]]) with the
+   * specified schema. Throws an exception, in the case of an unsupported type.
+   *
+   * @param e a struct column.
+   * @param options options to control how the struct column is converted into a json string.
+   *                accepts the same options and the json data source.
+   *
+   * @group collection_funcs
+   * @since 2.1.0
+   */
+  def to_json(e: Column, options: Map[String, String]): Column = withExpr {
+    StructToJson(options, e.expr)
+  }
+
+  /**
+   * (Java-specific) Converts a column containing a [[StructType]] into a JSON string
+   * ([[http://jsonlines.org/ JSON Lines text format or newline-delimited JSON]]) with the
+   * specified schema. Throws an exception, in the case of an unsupported type.
+   *
+   * @param e a struct column.
+   * @param options options to control how the struct column is converted into a json string.
+   *                accepts the same options and the json data source.
+   *
+   * @group collection_funcs
+   * @since 2.1.0
+   */
+  def to_json(e: Column, options: java.util.Map[String, String]): Column =
+    to_json(e, options.asScala.toMap)
+
+  /**
+   * Converts a column containing a [[StructType]] into a JSON string
+   * ([[http://jsonlines.org/ JSON Lines text format or newline-delimited JSON]]) with the
```

--- End diff ---

I don't think that this case really follows "JSON lines". It is a string inside of a larger dataframe. There are no newlines involved.
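The "no newlines involved" point can be seen with the standard `json` module: the value produced for each row is a single compact string, analogous to what a JSON generator emits for one record. This is a plain-Python illustration, not Spark code.

```python
import json

# Each row's struct becomes one compact JSON string, i.e. the cell value of
# the result column. "JSON Lines" framing (one object per line, separated by
# newlines) would only arise later, if many such rows were written to a file.
row_struct = {"age": 2, "name": "Alice"}
cell = json.dumps(row_struct, separators=(",", ":"))  # compact, no spaces

assert "\n" not in cell  # the per-row value itself contains no newline
```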
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r85248697

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonGenerator.scala ---

```scala
@@ -123,8 +122,9 @@ private[sql] class JacksonGenerator(
     case _ =>
       (row: SpecializedGetters, ordinal: Int) =>
         val v = row.get(ordinal, dataType)
-        sys.error(s"Failed to convert value $v (class of ${v.getClass}}) " +
-          s"with the type of $dataType to JSON.")
+        throw new SparkSQLJsonProcessingException(
+          s"Failed to convert value $v (class of ${v.getClass}}) " +
+            s"with the type of $dataType to JSON.")
```

--- End diff ---

I would avoid this change since it's throwing a private exception type to the user now.
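One conventional way to address this kind of review comment, keeping private exception types from escaping a public API, is to translate them at the boundary. A hedged sketch in Python; the names here (`_InternalJsonProcessingError`, `_convert_value`, `to_json`) are invented for illustration and are not Spark's.

```python
class _InternalJsonProcessingError(Exception):
    """Private exception used inside the module for control flow."""

def _convert_value(value):
    # Internal converter: raises the private type on unsupported input.
    if isinstance(value, (int, str, bool)):
        return str(value)
    raise _InternalJsonProcessingError(
        "Failed to convert value %r to JSON." % (value,))

def to_json(value):
    """Public API: only public, documented exception types escape here."""
    try:
        return _convert_value(value)
    except _InternalJsonProcessingError as e:
        # Translate at the boundary so callers never see the private type.
        raise ValueError(str(e))
```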
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r85248522

--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonUtils.scala ---

```scala
@@ -29,4 +31,28 @@ object JacksonUtils {
       case x => x != stopOn
     }
   }
+
+  /**
+   * Verify if the schema is supported in JSON parsing.
+   */
+  def verifySchema(schema: StructType): Unit = {
+    def verifyType(dataType: DataType): Unit = dataType match {
+      case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
+           DoubleType | StringType | TimestampType | DateType | BinaryType | _: DecimalType =>
+
+      case st: StructType => st.map(_.dataType).foreach(verifyType)
+
+      case at: ArrayType => verifyType(at.elementType)
+
+      case mt: MapType => verifyType(mt.keyType)
+
+      case udt: UserDefinedType[_] => verifyType(udt.sqlType)
+
+      case _ =>
+        throw new UnsupportedOperationException(
+          s"JSON conversion does not support to process ${dataType.simpleString} type.")
```

--- End diff ---

`does not support to process` is a little hard to parse. Maybe `Unable to convert column ${name} of type ${dataType.simpleString} to JSON.`
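The shape of `verifySchema`, a recursive walk that passes silently or throws on the first unsupported type, can be mirrored in Python, here also adopting the reviewer's suggested message that names the offending column. This is a toy model: `verify_schema` and the tuple-encoded types are illustrative stand-ins, not Spark's API.

```python
SUPPORTED = {"null", "boolean", "byte", "short", "integer", "long", "float",
             "double", "string", "timestamp", "date", "binary", "decimal"}

def verify_schema(fields):
    """Raise early, naming the offending column.

    `fields` is a list of (name, type) pairs; a type is either a supported
    atom string, ("array", element_type), or ("struct", [(name, type), ...]).
    """
    def verify(name, data_type):
        if isinstance(data_type, str):
            if data_type not in SUPPORTED:
                raise ValueError(
                    "Unable to convert column %s of type %s to JSON."
                    % (name, data_type))
        elif data_type[0] == "array":
            verify(name, data_type[1])        # check the element type
        elif data_type[0] == "struct":
            for child_name, child_type in data_type[1]:
                verify(child_name, child_type)  # recurse into nested fields
    for name, data_type in fields:
        verify(name, data_type)

verify_schema([("a", "integer"), ("b", ("array", "string"))])  # passes
```

Raising eagerly like this (rather than emitting `null` at runtime) matches the earlier suggestion to fail with an `AnalysisException` before any data is processed.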
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r83350521

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---

```scala
@@ -343,4 +343,23 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       null
     )
   }
+
+  test("to_json") {
+    val schema = StructType(StructField("a", IntegerType) :: Nil)
+    val struct = Literal.create(create_row(1), schema)
+    checkEvaluation(
+      StructToJson(Map.empty, struct),
+      """{"a":1}"""
+    )
+  }
+
+  test("to_json - invalid type") {
+    val schema = StructType(StructField("a", CalendarIntervalType) :: Nil)
```

--- End diff ---

@marmbrus Do you mind if I create another JIRA and deal with this problem for the JSON/CSV reading/writing paths? It seems I should add this logic separately from the `JacksonGenerator` instance (it seems to be initiated in tasks, and it is used in `Dataset.toJSON`, `StructToJson` and `write.json`, so it seems I should add a separate test for each..)
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r83110773

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---

```scala
@@ -343,4 +343,23 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       null
     )
   }
+
+  test("to_json") {
+    val schema = StructType(StructField("a", IntegerType) :: Nil)
+    val struct = Literal.create(create_row(1), schema)
+    checkEvaluation(
+      StructToJson(Map.empty, struct),
+      """{"a":1}"""
+    )
+  }
+
+  test("to_json - invalid type") {
+    val schema = StructType(StructField("a", CalendarIntervalType) :: Nil)
```

--- End diff ---

Yeah, I think it makes more sense to add a static check for this case. We know all of the types that we are able to handle. For consistency, I would also add this to the `write.json` code path.
Github user marmbrus commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82443073

--- Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala ---

```scala
@@ -343,4 +343,23 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper {
       null
     )
  }
+
+  test("to_json") {
+    val schema = StructType(StructField("a", IntegerType) :: Nil)
+    val struct = Literal.create(create_row(1), schema)
+    checkEvaluation(
+      StructToJson(Map.empty, struct),
+      """{"a":1}"""
+    )
+  }
+
+  test("to_json - invalid type") {
+    val schema = StructType(StructField("a", CalendarIntervalType) :: Nil)
```

--- End diff ---

Hmm, I realize this is a little different from `from_json`, but it seems it would be better to eagerly throw an `AnalysisException` saying that the schema contains an unsupported type. We know that ahead of time, and otherwise it's kind of mysterious why all the values come out as `null`.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82322230

--- Diff: python/pyspark/sql/functions.py ---

```python
@@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}):
     return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.1)
+def to_json(col, options={}):
+    """
+    Converts a column containing a [[StructType]] into a JSON string. Returns `null`,
+    in the case of an unsupported type.
+
+    :param col: struct column
+    :param options: options to control converting. accepts the same options as the json datasource
+
+    >>> from pyspark.sql import Row
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, Row(name='Alice', age=2))]
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(to_json(df.value).alias("json")).collect()
+    [Row(json=u'{"age":2,"name":"Alice"}')]
+    """
+
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.to_json(_to_java_column(col), options)
```

--- End diff ---

Actually, never mind my original comment; the more I look at this file, the less consistent the pattern seems, and this same pattern is used elsewhere within the file.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82314175

--- Diff: python/pyspark/sql/functions.py ---

```python
@@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}):
     return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.1)
+def to_json(col, options={}):
+    """
+    Converts a column containing a [[StructType]] into a JSON string. Returns `null`,
+    in the case of an unsupported type.
+
+    :param col: struct column
+    :param options: options to control converting. accepts the same options as the json datasource
+
+    >>> from pyspark.sql import Row
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, Row(name='Alice', age=2))]
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(to_json(df.value).alias("json")).collect()
+    [Row(json=u'{"age":2,"name":"Alice"}')]
+    """
+
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.to_json(_to_java_column(col), options)
```

--- End diff ---

@holdenk Thank you for your comment. Could you please elaborate on it a bit? I am not quite sure what to fix.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82313950

--- Diff: python/pyspark/sql/functions.py ---

```python
@@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}):
     return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.1)
+def to_json(col, options={}):
+    """
+    Converts a column containing a [[StructType]] into a JSON string. Returns `null`,
+    in the case of an unsupported type.
+
+    :param col: struct column
```

--- End diff ---

Sure, let me double-check the other comments as well.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82230165

--- Diff: python/pyspark/sql/functions.py ---

```python
@@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}):
     return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.1)
+def to_json(col, options={}):
+    """
+    Converts a column containing a [[StructType]] into a JSON string. Returns `null`,
+    in the case of an unsupported type.
+
+    :param col: struct column
```

--- End diff ---

Would `:param col: name of column containing the struct` maybe be more consistent with the other pydocs for the functions? (I only skimmed a few though, so if it's the other way around, that's cool.)
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/15354#discussion_r82229892

--- Diff: python/pyspark/sql/functions.py ---

```python
@@ -1729,6 +1729,29 @@ def from_json(col, schema, options={}):
     return Column(jc)
 
 
+@ignore_unicode_prefix
+@since(2.1)
+def to_json(col, options={}):
+    """
+    Converts a column containing a [[StructType]] into a JSON string. Returns `null`,
+    in the case of an unsupported type.
+
+    :param col: struct column
+    :param options: options to control converting. accepts the same options as the json datasource
+
+    >>> from pyspark.sql import Row
+    >>> from pyspark.sql.types import *
+    >>> data = [(1, Row(name='Alice', age=2))]
+    >>> df = spark.createDataFrame(data, ("key", "value"))
+    >>> df.select(to_json(df.value).alias("json")).collect()
+    [Row(json=u'{"age":2,"name":"Alice"}')]
+    """
+
+    sc = SparkContext._active_spark_context
+    jc = sc._jvm.functions.to_json(_to_java_column(col), options)
```

--- End diff ---

This is super minor, but there is a pretty consistent pattern for all of the other functions here (including `from_json`); it might be good to follow that same pattern for consistency's sake, since there isn't an obvious reason why that wouldn't work here.
GitHub user HyukjinKwon opened a pull request: https://github.com/apache/spark/pull/15354

[SPARK-17764][SQL] Add `to_json` supporting to convert nested struct column to JSON string

## What changes were proposed in this pull request?

This PR proposes to add a `to_json` function, in contrast with `from_json`, in Scala, Java and Python. It'd be useful if we can convert the same column from/to JSON. Also, some data sources do not support nested types. If we are forced to save a dataframe into those data sources, we might be able to work around that with this function.

The usage is as below:

```scala
val df = Seq(Tuple1(Tuple1(1))).toDF("a")
df.select(to_json($"a").as("json")).show()
```

```
+--------+
|    json|
+--------+
|{"_1":1}|
+--------+
```

## How was this patch tested?

Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite`.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-17764

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15354.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15354

commit a33ac60902f233595158ed034b0fa49bcf9ac5ab
Author: hyukjinkwon
Date: 2016-10-05T00:56:08Z
Initial implementation

commit d382b6364fd80a90375573e3fb68b9db2bf3cffb
Author: hyukjinkwon
Date: 2016-10-05T02:42:00Z
Fix minor comment nits

commit eec0cd32bde8564a080da425be48986055523e8c
Author: hyukjinkwon
Date: 2016-10-05T02:50:45Z
Add a missing dot
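The `{"_1":1}` cell in the example output above can be reproduced with Python's standard `json` module, which also illustrates the from/to round trip the description motivates. This is a plain-Python analogy only, not Spark code.

```python
import json

# Seq(Tuple1(Tuple1(1))).toDF("a") yields a one-row DataFrame whose column
# "a" is a struct with a single auto-named field "_1"; to_json renders that
# struct as a compact JSON object, which is what show() displays.
struct_value = {"_1": 1}
cell = json.dumps(struct_value, separators=(",", ":"))  # to_json analogue

round_tripped = json.loads(cell)  # from_json analogue: back to the struct
```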