[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782958#comment-16782958 ]

Huon Wilson commented on SPARK-26964:
-------------------------------------

I see. Could you say why you're resolving it as Later? I'm not quite sure I 
understand how the error handling for corrupt records would differ between this 
and the existing functionality in {{from_json}}. For example, the corrupt 
record handling for decoding {{"x"}} as {{int}} seems to exist already 
({{JacksonParser.parse}} converts exceptions into {{BadRecordException}}s, and 
{{FailureSafeParser}} catches them), because the same error occurs when 
decoding {{\{"value":"x"\}}} as {{struct<value:int>}}.

Along those lines, we're now using the following code to map arbitrary values 
to their JSON strings, and back. It involves wrapping the values in a struct, 
and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}

object JsonHacks {
  // FIXME: massive hack working around (a) the requirement to make an
  // explicit map<string, binary> for storage (it would be nicer to just dump
  // columns in directly), and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // remove the prefix only when it is at the start of the string, and the
  // suffix only at the end
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not
    // null) if col is null, while this function needs to preserve that
    // null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)

    // Strip off the struct wrapping to pull out the JSON-ified `col`
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(
    col: Column,
    dataType: DataType,
    nullable: Boolean
  ): Column = {
    // from_json only works with a struct, so that's what we're going to be
    // parsing.
    val jsonSchema = new StructType().add(TempName, dataType, nullable)

    // To be able to parse into a struct, the JSON column needs to be wrapped
    // in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse
    val parsedStruct = from_json(structJson, jsonSchema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}
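
For reference, the call sites look roughly like this (a sketch only: the 
{{events}} DataFrame, the {{payload}} column and the {{IntegerType}} are just 
placeholders for whatever the user supplies, and {{import spark.implicits._}} 
is assumed for the {{$"..."}} syntax):

{code:scala}
import org.apache.spark.sql.types.IntegerType

// Arbitrary column -> JSON string column
val withJson = events.withColumn("payload_json", JsonHacks.valueToJson($"payload"))

// ... and later, JSON string column -> typed column again
val restored = withJson.withColumn(
  "payload_back",
  JsonHacks.valueFromJson($"payload_json", IntegerType, nullable = true)
)
{code}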

> to_json/from_json do not match JSON spec due to not supporting scalars
> ----------------------------------------------------------------------
>
>                 Key: SPARK-26964
>                 URL: https://issues.apache.org/jira/browse/SPARK-26964
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.2, 2.4.0
>            Reporter: Huon Wilson
>            Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
>     object
>     array
>     string
>     number
>     "true"
>     "false"
>     "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB. these newer specs differ from the original 
> [RFC4627|https://tools.ietf.org/html/rfc4627] (which is now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.
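
(For illustration of the quoted description, the gap looks like this as of 2.4, 
as far as I can tell; a bare scalar is rejected at analysis time because only 
structs, maps and arrays are accepted:)

{code:scala}
import org.apache.spark.sql.functions.{lit, struct, to_json}

to_json(struct(lit(1) as "value"))  // accepted today: a struct wrapper works
// to_json(lit(1))                  // rejected: bare scalars aren't accepted,
//                                  // even though `1` is a valid JSON document
{code}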


