[GitHub] [spark] davidrabinowitz commented on pull request #30071: [SPARK-33172][SQL] Adding support for UserDefinedType for Spark SQL Code generator

GitBox Tue, 03 Nov 2020 12:24:24 -0800


davidrabinowitz commented on pull request #30071:
URL: https://github.com/apache/spark/pull/30071#issuecomment-721355772



   @HyukjinKwon
   
   Should I create another PR aimed at master?
   
   To test it, first create a table in BigQuery as follows:
   ```
   bq load --source_format NEWLINE_DELIMITED_JSON <TABLE> vector_test.data.json vector_test.schema.json
   ```
   The files are:
   
   - vector_test.data.json:
   ```
   {"name":"row1","num":"1","vector":{"type":"1","indices":[],"values":[1,2,3]}}
   {"name":"row2","num":"2","vector":{"type":"1","indices":[],"values":[4,5,6]}}
   {"name":"row3","num":"3","vector":{"type":"1","indices":[],"values":[7,8,9]}}
   ```
   
   - vector_test.schema.json:
   ```
   [
     {
       "mode": "NULLABLE",
       "name": "name",
       "type": "STRING"
     },
     {
       "mode": "NULLABLE",
       "name": "num",
       "type": "INTEGER"
     },
     {
       "description": "{spark.type=vector}",
       "fields": [
         {
           "mode": "NULLABLE",
           "name": "type",
           "type": "INTEGER"
         },
         {
           "mode": "NULLABLE",
           "name": "size",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "indices",
           "type": "INTEGER"
         },
         {
           "mode": "REPEATED",
           "name": "values",
           "type": "FLOAT"
         }
       ],
       "mode": "NULLABLE",
       "name": "vector",
       "type": "RECORD"
     }
   ]
   ```
   A GCP account is needed for this, but the amount of data and the number of operations are well within the free tier.
   
   Run `spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3` and enter the following commands:
   ```
   val df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("<TABLE>")
   df.schema
   df.show()
   ```
   
   Note that when the format is changed to `bigquery`, a different code path is used that does not rely on the code generator and hence does not suffer from this issue.
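   
   For comparison, a minimal sketch of the same read through the `bigquery` format, assuming the same `spark-shell` session and the same `<TABLE>` placeholder as above (the name `dfV1` is only illustrative):
   ```scala
   // Same table, read through the connector's `bigquery` format instead of
   // the BigQueryDataSourceV2 class used in the commands above.
   val dfV1 = spark.read.format("bigquery").load("<TABLE>")
   dfV1.printSchema() // prints the schema as a tree
   dfV1.show()        // per the note above, this path should not go through the code generator
   ```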


