[ https://issues.apache.org/jira/browse/SPARK-24924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570880#comment-16570880 ]

Dongjoon Hyun edited comment on SPARK-24924 at 8/6/18 10:50 PM:
----------------------------------------------------------------

Yep. It will work once those 3rd-party packages are rebuilt against Apache Spark 2.4. 
So the fix applies to their next releases, not to the currently existing ones.

Spark hides Spark-generated metadata. You can see it via the `hive` CLI as 
follows.

1. Run the Apache Hive 1.2.2 CLI and check tables; this initializes the metastore, too.
{code:java}
hive> show tables;
OK
Time taken: 1.163 seconds
{code}
2. Apache Spark 2.3.1 Result (See `Provider` field)
{code:java}
scala> spark.version
res1: String = 2.3.1
scala> 
spark.range(10).write.format("com.databricks.spark.avro").saveAsTable("t")
scala> sql("desc formatted t").show(false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |null   |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database                    |default                                                  |       |
|Table                       |t                                                        |       |
|Owner                       |dongjoon                                                 |       |
|Created Time                |Mon Aug 06 15:41:40 PDT 2018                             |       |
|Last Access                 |Wed Dec 31 16:00:00 PST 1969                             |       |
|Created By                  |Spark 2.3.1                                              |       |
|Type                        |MANAGED                                                  |       |
|Provider                    |com.databricks.spark.avro                                |       |
|Table Properties            |[transient_lastDdlTime=1533595300]                       |       |
|Location                    |file:/user/hive/warehouse/t                              |       |
|Serde Library               |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat                 |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                 |       |
+----------------------------+---------------------------------------------------------+-------+
{code}
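The `Provider` value above is why a mapping is needed: the metastore records the legacy package name, so Spark 2.4 must alias it to the built-in source. A minimal sketch of such an alias table, in the spirit of the existing `com.databricks.spark.csv` mapping; the object and method names here are illustrative, not Spark's actual internals.

```scala
// Illustrative backward-compatibility mapping from legacy third-party
// provider names to built-in data source short names.
object ProviderMapping {
  private val backwardCompatibilityMap: Map[String, String] = Map(
    "com.databricks.spark.csv"  -> "csv",
    "com.databricks.spark.avro" -> "avro"
  )

  // Resolve a user-supplied provider string, falling back to the
  // input unchanged when no alias is registered.
  def resolve(provider: String): String =
    backwardCompatibilityMap.getOrElse(provider, provider)
}
```

With such a mapping in place, tables created by the old `spark-avro` package would load through the built-in Avro source without rewriting their metastore metadata.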
3. Apache Hive 1.2.2 CLI Result (See `Table Parameters`)
{code:java}
hive> describe formatted t;
OK
# col_name              data_type               comment

col                     array<string>           from deserializer

# Detailed Table Information
Database:               default
Owner:                  dongjoon
CreateTime:             Mon Aug 06 15:41:40 PDT 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               file:/Users/dongjoon/spark-release/spark-2.3.1-bin-hadoop2.7/spark-warehouse/t
Table Type:             MANAGED_TABLE
Table Parameters:
        spark.sql.create.version        2.3.1
        spark.sql.sources.provider      com.databricks.spark.avro
        spark.sql.sources.schema.numParts       1
        spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}
        transient_lastDdlTime   1533595300

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        path                    file:/user/hive/warehouse/t
        serialization.format    1
Time taken: 1.373 seconds, Fetched: 31 row(s)
{code}
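The `Table Parameters` above show how Spark persists the table schema as JSON split across `spark.sql.sources.schema.numParts` and `spark.sql.sources.schema.part.N` properties. A simplified sketch of reassembling those parts, using the property keys from the output above; Spark's real reader also validates part counts and ordering.

```scala
// Reassemble a schema JSON string that was split across metastore
// table parameters (schema.part.0, schema.part.1, ...).
object SchemaParts {
  private val Prefix = "spark.sql.sources.schema"

  // Returns None when the table carries no Spark schema parameters.
  def assemble(tableParams: Map[String, String]): Option[String] =
    tableParams.get(s"$Prefix.numParts").map { n =>
      (0 until n.toInt)
        .map(i => tableParams(s"$Prefix.part.$i"))
        .mkString
    }
}
```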


> Add mapping for built-in Avro data source
> -----------------------------------------
>
>                 Key: SPARK-24924
>                 URL: https://issues.apache.org/jira/browse/SPARK-24924
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: Dongjoon Hyun
>            Assignee: Dongjoon Hyun
>            Priority: Minor
>             Fix For: 2.4.0
>
>
> This issue aims at the following:
>  # Like the `com.databricks.spark.csv` mapping, we should map 
> `com.databricks.spark.avro` to the built-in Avro data source.
>  # Remove the incorrect error message, `Please find an Avro package at ...`.


