[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782958#comment-16782958 ]
Huon Wilson commented on SPARK-26964:
-------------------------------------

I see. Could you say why you're resolving it as Later? I'm not quite sure I understand how the error handling for corrupt records differs between this and the existing functionality in {{from_json}}, e.g. the corrupt record handling for decoding {{"x"}} as {{int}} seems to already exist (in the form of {{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and {{FailureSafeParser}} catching them) because the same error occurs when decoding {{\{"value":"x"\}}} as {{struct<value:int>}}.

Along those lines, we're now using the following code to map arbitrary values to their JSON strings, and back. It involves wrapping the values in a struct, and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}
// ...

object JsonHacks {
  // FIXME: massive hack working-around (a) the requirement to make an
  // explicit map<string, binary> for storage (would be nicer to just dump
  // columns in directly), and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not
    // null) if col is null, while this function needs to preserve that
    // null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)
    // Strip off the struct wrapping to pull out the JSON-ified `col`
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(
    col: Column,
    dataType: DataType,
    nullable: Boolean
  ): Column = {
    // from_json only works with a struct, so that's what we're going to be
    // parsing.
    val json_schema = new StructType().add(TempName, dataType, nullable)
    // To be able to parse into a struct, the JSON column needs to be wrapped
    // in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse
    val parsedStruct = from_json(structJson, json_schema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}

> to_json/from_json do not match JSON spec due to not supporting scalars
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26964
>                 URL: https://issues.apache.org/jira/browse/SPARK-26964
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Huon Wilson
>            Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and
> objects, but not the scalar/primitive types. This doesn't match the JSON spec
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a
> JSON document ({{json: element}}) consists of a value surrounded by
> whitespace ({{element: ws value ws}}), where a value is an object or array
> _or_ a number or string etc.:
> {code:none}
> value
>   object
>   array
>   string
>   number
>   "true"
>   "false"
>   "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible
> enough for a library I'm working on, where an arbitrary (user-supplied)
> column needs to be turned into JSON.
> NB. these newer specs differ from the original
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete) that
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for
> arrays of scalars.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
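
For illustration only, the wrap/strip string manipulation described in the comment can be exercised without Spark. This is a minimal sketch, assuming plain Scala strings stand in for columns and that {{String.replaceAll}} behaves like {{regexp_replace}} for this pattern; the object name {{WrapStripSketch}} and its {{wrap}}/{{strip}} helpers are hypothetical and not part of the code in the comment.

```scala
import java.util.regex.Pattern

// Hypothetical stand-in for the Column-based helpers: plain strings only.
object WrapStripSketch {
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // Same anchored pattern as in the comment: the prefix is matched only at
  // the start of the string and the suffix only at the end.
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  // Stands in for regexp_replace(structJson, StripRegexp, "")
  def strip(structJson: String): String =
    structJson.replaceAll(StripRegexp, "")

  // Stands in for concat(lit(Prefix), col, lit(Suffix))
  def wrap(json: String): String = Prefix + json + Suffix
}
```

Because the pattern is anchored, a closing brace inside the value (for example in the JSON string {{"x}"}}) is left alone; only the wrapper added around the value is removed.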