[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046269#comment-15046269 ]

Nakul Jindal commented on SPARK-11046:
--------------------------------------

I am trying to understand the benefit of specifying the schema in JSON as opposed 
to the format it is currently in.

We have 3 cases:


Case 1 - Leave things the way they are.
Here is what we have currently. Let us say our type is:
array <map <string, struct<a:integer,b:long,c:string> >>

- The R function structField.character (in schema.R) is passed this exact string.
- In turn, it calls checkType to recursively validate the schema string.
- The Scala function SQLUtils.getSQLDataType (in SQLUtils.scala) then recursively 
converts this string to an object of type DataType.
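
For illustration only (this is hand-written Scala, not the actual output of 
getSQLDataType), the DataType that the string above resolves to on the JVM side 
looks roughly like this:

import org.apache.spark.sql.types._

// Hand-built equivalent of array<map<string, struct<a:integer,b:long,c:string>>>
val expected: DataType =
  ArrayType(
    MapType(
      StringType,
      StructType(Seq(
        StructField("a", IntegerType, nullable = true),
        StructField("b", LongType, nullable = true),
        StructField("c", StringType, nullable = true)))),
    containsNull = true)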

Case 2 - Expect the user to specify the input schema in JSON
If we converted the schema format to JSON, it would look like this:
{
  "type": "array",
  "elementType": {
    "type": "map",
    "keyType": "string",
    "valueType": {
      "type": "struct",
      "fields": [{
        "name": "a",
        "type": "integer",
        "nullable": true,
        "metadata": {}
      }, {
        "name": "b",
        "type": "long",
        "nullable": true,
        "metadata": {}
      }, {
        "name": "c",
        "type": "string",
        "nullable": true,
        "metadata": {}
      }]
    },
    "valueContainsNull": false
  },
  "containsNull": true
}
(based on what DataType.fromJson expects), which places far too much burden on 
the SparkR user.
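
For what it is worth, if the user did manage to hand-write such a JSON string, the 
backend side becomes a one-liner using the existing DataType.fromJson API (the 
wiring into SQLUtils is omitted here):

import org.apache.spark.sql.types._

// Parse the JSON schema string the R side would send under Case 2.
val json = """{"type":"array","containsNull":true,"elementType":{
  "type":"map","keyType":"string","valueContainsNull":false,"valueType":{
    "type":"struct","fields":[
      {"name":"a","type":"integer","nullable":true,"metadata":{}},
      {"name":"b","type":"long","nullable":true,"metadata":{}},
      {"name":"c","type":"string","nullable":true,"metadata":{}}]}}}"""

val dt = DataType.fromJson(json)
println(dt.simpleString)  // array<map<string,struct<a:int,b:bigint,c:string>>>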

- I am not entirely sure about this, but I think we do not want to, cannot, or 
simply have not implemented a way to communicate exceptions encountered in the 
Scala code back to R.
- We'd need to write a way to validate the JSON schema in R code (or use a JSON 
parsing library to do it in some way).
- The code in SQLUtils.getSQLDataType will now be significantly reduced as we 
can just call DataType.fromJson.
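
To make that last point concrete, here is a hedged sketch of what the helper could 
shrink to if R shipped JSON (createStructFieldFromJson is a hypothetical name, not 
the current SQLUtils API):

import org.apache.spark.sql.types.{DataType, Metadata, StructField}

// Hypothetical JSON-based variant of SQLUtils.createStructField: all the
// recursive string parsing is replaced by a single DataType.fromJson call.
def createStructFieldFromJson(name: String, jsonType: String, nullable: Boolean): StructField =
  StructField(name, DataType.fromJson(jsonType), nullable, Metadata.empty)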

Case 3 - Convert the schema to JSON in R code before calling the JVM function 
org.apache.spark.sql.api.r.SQLUtils.createStructField
- This essentially moves the work done in SQLUtils.getSQLDataType to R code, which 
IMHO is significantly more complicated to write and maintain.
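
To give a feel for how much logic Case 3 pushes to the R side, here is a rough 
sketch of the recursive conversion that would have to be rewritten and maintained 
there (shown in Scala for concreteness; whitespace handling, error reporting and 
most primitive types are omitted):

import org.apache.spark.sql.types._
import scala.collection.mutable.ArrayBuffer

// Split on commas that are not nested inside <...>.
def splitTopLevel(s: String): Array[String] = {
  val parts = ArrayBuffer(new StringBuilder)
  var depth = 0
  for (c <- s) c match {
    case '<'               => depth += 1; parts.last.append(c)
    case '>'               => depth -= 1; parts.last.append(c)
    case ',' if depth == 0 => parts += new StringBuilder
    case _                 => parts.last.append(c)
  }
  parts.map(_.toString.trim).toArray
}

// Recursively turn a type string into a DataType; .json on the result is then
// the JSON form that DataType.fromJson expects on the JVM side.
def parseType(s: String): DataType = s.trim match {
  case t if t.startsWith("array<") && t.endsWith(">") =>
    ArrayType(parseType(t.stripPrefix("array<").stripSuffix(">")))
  case t if t.startsWith("map<") && t.endsWith(">") =>
    val Array(k, v) = splitTopLevel(t.stripPrefix("map<").stripSuffix(">"))
    MapType(parseType(k), parseType(v))
  case t if t.startsWith("struct<") && t.endsWith(">") =>
    StructType(splitTopLevel(t.stripPrefix("struct<").stripSuffix(">")).map { f =>
      val Array(name, tpe) = f.split(":", 2)
      StructField(name.trim, parseType(tpe))
    })
  case "string"  => StringType
  case "integer" => IntegerType
  case "long"    => LongType
  case other     => throw new IllegalArgumentException("Unsupported type: " + other)
}

println(parseType("array<map<string,struct<a:integer,b:long,c:string>>>").json)

Rewriting the equivalent of this in R and keeping it in sync with the JVM type 
system is the maintenance cost I am worried about.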

TL;DR: At the cost of inconvenience to the SparkR user, we would change how the 
schema is specified from its current (IMHO simple) form to JSON.

[~shivaram], [~sunrui] - Any thoughts?


> Pass schema from R to JVM using JSON format
> -------------------------------------------
>
>                 Key: SPARK-11046
>                 URL: https://issues.apache.org/jira/browse/SPARK-11046
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Sun Rui
>            Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend using a 
> regular expression. However, Spark now supports schemas in JSON format, so 
> enhance SparkR to use schemas in JSON format.


