[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782961#comment-16782961 ]

Hyukjin Kwon commented on SPARK-26964:
--------------------------------------

I resolved it as Later, mainly due to no feedback. I think it's fine to reopen. You can try to open a PR and fix it if the change is small. Otherwise, I doubt it is worth the effort.

> to_json/from_json do not match JSON spec due to not supporting scalars
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26964
>                 URL: https://issues.apache.org/jira/browse/SPARK-26964
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Huon Wilson
>            Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and
> objects, but not the scalar/primitive types. This doesn't match the JSON
> spec on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]:
> a JSON document ({{json: element}}) consists of a value surrounded by
> whitespace ({{element: ws value ws}}), where a value is an object or array
> _or_ a number or string etc.:
> {code:none}
> value
>     object
>     array
>     string
>     number
>     "true"
>     "false"
>     "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them
> flexible enough for a library I'm working on, where an arbitrary
> (user-supplied) column needs to be turned into JSON.
> NB. these newer specs differ from the original
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete),
> which (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for
> arrays of scalars.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
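As a quick sanity check of the grammar quoted in the issue description: under RFC 8259, a bare scalar is a complete JSON document on its own. This can be demonstrated with Python's standard {{json}} module (used here only to illustrate the spec; nothing Spark-specific):

```python
import json

# Per RFC 8259, each of these strings is a complete, valid JSON document,
# even though none of them is an object or an array.
scalar_docs = ["5", "-2.5", '"hello"', "true", "false", "null"]
for doc in scalar_docs:
    json.loads(doc)  # parses without error

# Serializing a bare scalar also produces a valid document.
assert json.dumps(5) == "5"
assert json.loads('"hello"') == "hello"
assert json.loads("null") is None
```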
[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16782958#comment-16782958 ]

Huon Wilson commented on SPARK-26964:
-------------------------------------

I see. Could you say why you're resolving it as Later? I'm not quite sure I understand how the error handling for corrupt records differs between this and the existing functionality in {{from_json}}; e.g. the corrupt-record handling for decoding {{"x"}} as {{int}} seems to already exist (in the form of {{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and {{FailureSafeParser}} catching them), because the same error occurs when decoding {{\{"value":"x"\}}} as {{struct}}.

Along those lines, we're now using the following code to map arbitrary values to their JSON strings, and back. It involves wrapping the values in a struct, and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}

object JsonHacks {
  // FIXME: massive hack working around (a) the requirement to make an
  // explicit map for storage (it would be nicer to just dump columns in
  // directly), and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // Remove the prefix only when it is at the start of the string, and the
  // suffix only at the end.
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not null)
    // if col is null, while this function needs to preserve that null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)
    // Strip off the struct wrapping to pull out the JSON-ified `col`.
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(col: Column, dataType: DataType, nullable: Boolean): Column = {
    // from_json only works with a struct, so that's what we're going to be
    // parsing.
    val json_schema = new StructType().add(TempName, dataType, nullable)
    // To be able to parse into a struct, the JSON column needs to be wrapped
    // in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse ...
    val parsedStruct = from_json(structJson, json_schema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}
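For illustration outside of Spark, the same wrap/strip trick can be sketched with Python's standard {{json}} and {{re}} modules; {{value_to_json}}/{{value_from_json}} below are hypothetical analogues of the Scala code above, not Spark API:

```python
import json
import re

TEMP_NAME = "value"
PREFIX = '{"' + TEMP_NAME + '":'
SUFFIX = "}"
# Strip the prefix only at the start of the string, and the suffix only at
# the end -- mirroring the Scala StripRegexp above.
STRIP_REGEXP = "^" + re.escape(PREFIX) + "|" + re.escape(SUFFIX) + "$"

def value_to_json(value):
    # Preserve null-ness, like the `when(col.isNull, ...)` guard.
    if value is None:
        return None
    # Wrap in an object (so a struct-only serializer could handle it),
    # then strip the wrapping to leave the bare scalar JSON.
    wrapped = json.dumps({TEMP_NAME: value}, separators=(",", ":"))
    return re.sub(STRIP_REGEXP, "", wrapped)

def value_from_json(doc):
    # Re-wrap the scalar so a struct-only parser could handle it,
    # then extract the field to recover the actual value.
    wrapped = PREFIX + doc + SUFFIX
    return json.loads(wrapped)[TEMP_NAME]
```

Anchoring the prefix at the start and the suffix at the end is what makes this safe even when the value itself contains {{\{}} or {{\}}} characters.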
[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774859#comment-16774859 ]

Hyukjin Kwon commented on SPARK-26964:
--------------------------------------

I know that, practically, JSON is perfectly fine to store as a binary or string column. I want to be very sure that primitive support is something absolutely required and useful.

{quote}
Looking at the source code, it seems like all of these types have support in JacksonGenerator and JacksonParser, and so most of the work will be surfacing that, rather than entirely new code. Is there something you expect to be more intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm considering having a look at this myself, but if your intuition implies that this is going to be a dead end, I will not.
{quote}

The core logic itself can be reused, but surfacing it is the problem. When primitive arrays and maps were exposed, the community faced a lot of corner-case problems, for instance around how to handle corrupt records (Spark provides some options to handle those records). One PR had to be reverted recently; see https://github.com/apache/spark/pull/23665. I guess it would still need a considerable amount of code (see what was needed when MapType was added to one of these functions: https://github.com/apache/spark/pull/18875). One thing I am pretty sure of is that it would take some effort to write the code and get it into the codebase - so I am being cautious here.
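To make the corrupt-record concern concrete: a fail-safe parser catches the parse error and yields null instead of failing the whole job, which is roughly what Spark's PERMISSIVE mode does. A hypothetical sketch in plain Python ({{fail_safe_parse}} is illustrative only, not Spark's actual {{FailureSafeParser}} API):

```python
import json

def fail_safe_parse(doc):
    """Parse a JSON document, returning None for corrupt records
    instead of raising -- roughly the behaviour of PERMISSIVE mode."""
    try:
        return json.loads(doc)
    except (ValueError, TypeError):
        # Bad record: swallow the error and yield null.
        return None

# Mixed good and corrupt records; the corrupt ones become None
# rather than aborting the batch.
records = ['{"value": 1}', "{not json", "5", None]
parsed = [fail_safe_parse(r) for r in records]
```

Exposing scalars means every such error path (and the FAILFAST/DROPMALFORMED variants) has to be specified and tested for the new types as well, which is where much of the maintenance cost lies.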
[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774783#comment-16774783 ]

Huon Wilson commented on SPARK-26964:
-------------------------------------

We wish to store columns as columns within a {{binary}}-based database (HBase), meaning encoding individual fields. JSON is a non-horrible way of encoding values into the database: it can be handled from many languages/environments (and is even human-readable), and it is very convenient to handle with DataFrames. I don't know of another way to extract byte representations of individual columns that satisfies those constraints.

Looking at the source code, it seems like all of these types have support in JacksonGenerator and JacksonParser, and so most of the work will be surfacing that, rather than entirely new code. Is there something you expect to be more intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm considering having a look at this myself, but if your intuition implies that this is going to be a dead end, I will not.
[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars
[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774767#comment-16774767 ]

Hyukjin Kwon commented on SPARK-26964:
--------------------------------------

Can you describe the use case in more detail? Adding primitive types there requires a considerable amount of code to maintain. I want to see how much it's worth.