[ https://issues.apache.org/jira/browse/SPARK-11046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15046269#comment-15046269 ]
Nakul Jindal commented on SPARK-11046:
--------------------------------------

I am trying to understand the benefit of doing this with JSON as opposed to the format the schema is currently specified in. We have 3 cases:

Case 1 - Leave things the way they are.

Say our type is array<map<string, struct<a:integer,b:long,c:string>>>. Currently:
- The R function structField.character (in schema.R) is passed this exact string.
- It in turn calls checkType to recursively validate the schema string.
- The Scala function SQLUtils.getSQLDataType (in SQLUtils.scala) recursively converts the string to an object of type DataType (a simplified sketch of this conversion appears at the end of this message).

Case 2 - Expect the user to specify the input schema in JSON.

Converted to JSON (based on what DataType.fromJson expects), the same schema would look like this:

{
  "type": "array",
  "elementType": {
    "type": "map",
    "keyType": "string",
    "valueType": {
      "type": "struct",
      "fields": [
        {"name": "a", "type": "integer", "nullable": true, "metadata": {}},
        {"name": "b", "type": "long", "nullable": true, "metadata": {}},
        {"name": "c", "type": "string", "nullable": true, "metadata": {}}
      ]
    },
    "valueContainsNull": false
  },
  "containsNull": true
}

This places far too much burden on the SparkR user. Other consequences:
- I am not entirely sure about this, but I think we do not want to, cannot, or simply haven't implemented a way to communicate exceptions encountered in the Scala code back to R.
- We'd need a way to validate the JSON schema in R code (or use a JSON parsing library to do it in some way).
- The code in SQLUtils.getSQLDataType would shrink significantly, since we could just call DataType.fromJson (see the second sketch at the end of this message).

Case 3 - Convert the schema to JSON in R code before calling the JVM function org.apache.spark.sql.api.r.SQLUtils.createStructField.

This essentially moves the work done in SQLUtils.getSQLDataType to R code, which IMHO is significantly more complicated to write and maintain (the third sketch at the end of this message shows the canonical JSON such R code would have to emit).

TLDR: At the cost of inconvenience to the SparkR user, we would switch schema specification from its current (IMHO simple) form to JSON.

[~shivaram], [~sunrui] - Any thoughts?

> Pass schema from R to JVM using JSON format
> -------------------------------------------
>
>                 Key: SPARK-11046
>                 URL: https://issues.apache.org/jira/browse/SPARK-11046
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 1.5.1
>            Reporter: Sun Rui
>            Priority: Minor
>
> Currently, SparkR passes a DataFrame schema from R to the JVM backend as a
> string validated with regular expressions. However, Spark now supports
> schemas in JSON format, so SparkR should be enhanced to pass the schema as
> JSON.
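For reference, here is a simplified Scala sketch of the recursive string-to-DataType conversion described in Case 1. It is not the actual SQLUtils.getSQLDataType implementation (the naive comma splits only handle non-nested key and field lists, and only three primitive types are shown), but it does handle the example type above:

    import org.apache.spark.sql.types._

    // Simplified sketch of the recursive conversion done by SQLUtils.getSQLDataType;
    // the real code supports more primitive types and gives better error messages.
    def toDataType(s: String): DataType = s.trim match {
      case "string"  => StringType
      case "integer" => IntegerType
      case "long"    => LongType
      case t if t.startsWith("array<") && t.endsWith(">") =>
        ArrayType(toDataType(t.stripPrefix("array<").stripSuffix(">")))
      case t if t.startsWith("map<") && t.endsWith(">") =>
        // Split on the first comma only; fine for a simple key type like "string".
        val Array(k, v) = t.stripPrefix("map<").stripSuffix(">").split(",", 2)
        MapType(toDataType(k), toDataType(v))
      case t if t.startsWith("struct<") && t.endsWith(">") =>
        // Naive split: breaks if a field's type itself contains commas.
        val fields = t.stripPrefix("struct<").stripSuffix(">").split(",").map { f =>
          val Array(name, tpe) = f.split(":", 2)
          StructField(name.trim, toDataType(tpe))
        }
        StructType(fields)
      case other =>
        throw new IllegalArgumentException(s"Unsupported type: $other")
    }

    toDataType("array<map<string, struct<a:integer,b:long,c:string>>>")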
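For Case 2, the Scala side becomes almost trivial because DataType.fromJson already parses this format. A minimal sketch (error handling omitted):

    import org.apache.spark.sql.types.DataType

    // The JSON rendering of the same example schema, in the shape
    // that DataType.fromJson expects.
    val json = """
      {"type": "array",
       "elementType": {
         "type": "map",
         "keyType": "string",
         "valueType": {"type": "struct", "fields": [
           {"name": "a", "type": "integer", "nullable": true, "metadata": {}},
           {"name": "b", "type": "long", "nullable": true, "metadata": {}},
           {"name": "c", "type": "string", "nullable": true, "metadata": {}}]},
         "valueContainsNull": false},
       "containsNull": true}
    """

    val dt: DataType = DataType.fromJson(json)  // replaces the hand-written parsing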
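For Case 3, one way to see exactly what an R-side serializer would have to produce is to build the same schema natively and print its canonical JSON. A sketch using the standard DataType API:

    import org.apache.spark.sql.types._

    // The example schema built natively. prettyJson prints the canonical JSON
    // that a hypothetical R-side converter (Case 3) would have to reproduce.
    val schema = ArrayType(
      MapType(
        StringType,
        StructType(Seq(
          StructField("a", IntegerType),
          StructField("b", LongType),
          StructField("c", StringType))),
        valueContainsNull = false),
      containsNull = true)

    println(schema.prettyJson)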