[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782961#comment-16782961 ]

Hyukjin Kwon commented on SPARK-26964:
--

I resolved it as Later mainly due to the lack of feedback. I think it's fine to 
reopen. You can try opening a PR and fixing it if the change is small; 
otherwise, I doubt it's worth it.

> to_json/from_json do not match JSON spec due to not supporting scalars
> --
>
> Key: SPARK-26964
> URL: https://issues.apache.org/jira/browse/SPARK-26964
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.2, 2.4.0
> Reporter: Huon Wilson
> Priority: Major
>
> Spark SQL's {{to_json}} and {{from_json}} currently support arrays and 
> objects, but not the scalar/primitive types. This doesn't match the JSON spec 
> on https://www.json.org/ or [RFC8259|https://tools.ietf.org/html/rfc8259]: a 
> JSON document ({{json: element}}) consists of a value surrounded by 
> whitespace ({{element: ws value ws}}), where a value is an object or array 
> _or_ a number or string etc.:
> {code:none}
> value
> object
> array
> string
> number
> "true"
> "false"
> "null"
> {code}
> Having {{to_json}} and {{from_json}} support scalars would make them flexible 
> enough for a library I'm working on, where an arbitrary (user-supplied) 
> column needs to be turned into JSON.
> NB: these newer specs differ from the original 
> [RFC4627|https://tools.ietf.org/html/rfc4627] (now obsolete), which 
> (essentially) had {{value: object | array}}.
> This is related to SPARK-24391 and SPARK-25252, which added support for 
> arrays of scalars.






[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-03-03 Thread Huon Wilson (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782958#comment-16782958 ]

Huon Wilson commented on SPARK-26964:
-

I see. Could you say why you're resolving it as Later? I'm not quite sure I 
understand how the error handling for corrupt records differs between this 
proposal and the existing functionality in {{from_json}}: the corrupt-record 
handling for decoding {{"x"}} as {{int}} seems to already exist (in the form of 
{{JacksonParser.parse}} converting exceptions into {{BadRecordException}}s, and 
{{FailureSafeParser}} catching them), because the same error occurs when 
decoding {{\{"value":"x"\}}} as a struct with an {{int}} field.
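
For reference, here is a minimal sketch of the existing behaviour I mean 
(assuming a {{SparkSession}} is in scope as {{spark}}):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

// Decoding {"value":"x"} as a struct with an int field: the parse
// failure is caught (JacksonParser throws a BadRecordException and
// FailureSafeParser catches it) and surfaces as null, rather than
// failing the query.
val schema = new StructType().add("value", IntegerType)
val df = Seq("""{"value":"x"}""").toDF("json")
df.select(from_json($"json", schema).as("parsed")).show()
{code}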

Along those lines, we're now using the following code to map arbitrary values 
to their JSON strings, and back. It involves wrapping the values in a struct, 
and using string manipulation to pull out the true JSON string.

{code:scala}
import java.util.regex.Pattern

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DataType, StructType}

object JsonHacks {
  // FIXME: massive hack working around (a) the requirement to make an
  // explicit map for storage (it would be nicer to just dump columns in
  // directly), and (b) to_json/from_json not supporting scalars
  // (https://issues.apache.org/jira/browse/SPARK-26964)
  private val TempName = "value"
  private val Prefix = "{\"" + TempName + "\":"
  private val Suffix = "}"
  // Remove the prefix only when it is at the start of the string, and the
  // suffix only when it is at the end.
  private val StripRegexp =
    s"^${Pattern.quote(Prefix)}|${Pattern.quote(Suffix)}$$"

  def valueToJson(col: Column): Column = {
    // Nest the column in a struct so that to_json can work ...
    val structJson = to_json(struct(col as TempName))
    // ... but, because of this nesting, to_json(...) gives "{}" (not
    // null) if col is null, while this function needs to preserve that
    // null-ness.
    val nullOrStruct = when(col.isNull, null).otherwise(structJson)

    // Strip off the struct wrapping to pull out the JSON-ified `col`.
    regexp_replace(nullOrStruct, StripRegexp, "")
  }

  def valueFromJson(
      col: Column,
      dataType: DataType,
      nullable: Boolean
  ): Column = {
    // from_json only works with a struct, so that's what we're going to
    // be parsing.
    val jsonSchema = new StructType().add(TempName, dataType, nullable)

    // To be able to parse into a struct, the JSON column needs to be
    // wrapped in what was stripped off above.
    val structJson = concat(lit(Prefix), col, lit(Suffix))
    // Now we're finally ready to parse ...
    val parsedStruct = from_json(structJson, jsonSchema)
    // ... and extract the field to get the actual parsed column.
    parsedStruct(TempName)
  }
}
{code}
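
For illustration, a round trip with these helpers looks roughly like this (a 
sketch, again assuming {{spark.implicits._}} is imported):

{code:scala}
import org.apache.spark.sql.types.IntegerType

val df = Seq(Some(1), None, Some(42)).toDF("n")

// Scalar -> JSON string: "1", null, "42"
val asJson = df.select(JsonHacks.valueToJson($"n").as("json"))

// JSON string -> scalar again: 1, null, 42
val roundTripped = asJson.select(
  JsonHacks.valueFromJson($"json", IntegerType, nullable = true).as("n"))
{code}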




[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774859#comment-16774859 ]

Hyukjin Kwon commented on SPARK-26964:
--

I understand that, practically speaking, JSON stores fine as a binary or string 
column. I want to be very sure that primitive support is something absolutely 
required and useful.

{quote}
Looking at the source code, it seems like all of these types have support in 
JacksonGenerator and JacksonParser, and so most of the work will be surfacing 
that, rather than entirely new code. Is there something you expect to be more 
intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm 
considering having a look at this myself, but if your intuition implies that 
this is going to be a dead end, I will not.
{quote}

The core logic itself can be reused, but surfacing it is the problem. When we 
exposed primitive arrays and maps, the community faced a lot of corner-case 
problems, for instance around how to handle corrupt records (Spark provides 
some options for handling those records). One PR had to be reverted recently; 
see https://github.com/apache/spark/pull/23665. I guess it would still need a 
considerable amount of code (see https://github.com/apache/spark/pull/18875, 
where MapType was added to one of the two functions). One thing I am pretty 
sure of is that it would take some effort to write the code and get it into 
the codebase, so I am being cautious here.
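
To make that concrete, here is a minimal sketch of the parse modes 
{{from_json}} already exposes; every newly supported input type would have to 
behave sensibly under each of them (assuming {{spark.implicits._}}):

{code:scala}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}

val schema = new StructType().add("value", IntegerType)
val bad = Seq("""{"value":"x"}""").toDF("json")

// PERMISSIVE (the default): malformed records surface as nulls.
bad.select(from_json($"json", schema, Map("mode" -> "PERMISSIVE")))

// FAILFAST: malformed records make the query fail at evaluation time.
bad.select(from_json($"json", schema, Map("mode" -> "FAILFAST")))
{code}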






[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Huon Wilson (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774783#comment-16774783 ]

Huon Wilson commented on SPARK-26964:
-

We wish to store Spark columns as individual columns within a {{binary}}-based 
database (HBase), which means encoding each field on its own. JSON is a 
non-horrible way of encoding values into the database: it can be handled from 
many languages/environments (and is even human-readable), and it is very 
convenient to work with from DataFrames. I don't know of another way to 
extract byte representations of individual columns that satisfies those 
constraints.

Looking at the source code, it seems like all of these types have support in 
JacksonGenerator and JacksonParser, and so most of the work will be surfacing 
that, rather than entirely new code. Is there something you expect to be more 
intricate than additions to JsonToStructs and StructsToJson (and tests)? I'm 
considering having a look at this myself, but if your intuition implies that 
this is going to be a dead end, I will not.
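
To make the gap concrete, a quick sketch of what works today versus what this 
issue asks for (assuming {{spark.implicits._}}):

{code:scala}
import org.apache.spark.sql.functions.{struct, to_json}

val df = Seq(1, 2, 3).toDF("n")

// Works today: a struct column serialises to a JSON object.
df.select(to_json(struct($"n")))  // {"n":1}, {"n":2}, {"n":3}

// What this issue asks for: serialising the scalar column directly,
// to 1, 2, 3. Today this fails analysis because to_json only accepts
// structs, arrays and maps.
// df.select(to_json($"n"))
{code}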




[jira] [Commented] (SPARK-26964) to_json/from_json do not match JSON spec due to not supporting scalars

2019-02-21 Thread Hyukjin Kwon (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16774767#comment-16774767 ]

Hyukjin Kwon commented on SPARK-26964:
--

Can you describe the use case in more detail? Adding primitive types there 
requires a considerable amount of code to maintain. I want to see how much 
it's worth.
