[ https://issues.apache.org/jira/browse/SPARK-31001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598397#comment-17598397 ]
Kevin Appel edited comment on SPARK-31001 at 8/31/22 2:14 PM:
--------------------------------------------------------------

It is defined here: [https://github.com/apache/spark/blob/55ee406df9933ca522bc98c2d2ccc0245e97ff67/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceUtils.scala]

/** The key to use for storing partitionBy columns as options. */
val PARTITIONING_COLUMNS_KEY = "__partition_columns"

I started off with something like this, just to see if it works:

df.write.partitionBy("id").option("path", "/user/kevin/ktest1").saveAsTable("kevin.ktest1")

It does: the DataFrame carries the partitionBy information along with its schema, passes it to saveAsTable, and the external table is created correctly.

Then inside [https://github.com/apache/spark/blob/36dd531a93af55ce5c2bfd8d275814ccb2846962/python/pyspark/sql/catalog.py#L705] there is an extra parameter: **options : dict, optional (extra options to specify in the table).

I started looking for partition options and found this in the Delta documentation:

.option("__partition_columns", """["join_dim_date_id"]""")

From there I built that into a dictionary and passed it into the function, and it declared the schema correctly with the partitioning. The second command then scans all the partitions, and after that it seems to be working. Something like this also works:

spark.catalog.createTable("kevin.ktest1", "/user/kevin/ktest1", __partition_columns="['id']")
spark.sql("alter table kevin.ktest1 recover partitions")

Whether or not this is the right go-forward solution, hopefully some of the Spark experts can chime in; the __ prefix in this option name is normally reserved for private variables.
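Since the value of "__partition_columns" is parsed as a JSON array of column names (the Delta example above, """["join_dim_date_id"]""", is valid JSON with double quotes), a safer way to build the options dictionary than hand-writing the string is json.dumps. A minimal sketch, reusing the table and path names from the examples above; the final createTable call is left as a comment because it needs a live SparkSession:

```python
import json

# Columns to partition by (names are illustrative).
partition_cols = ["id"]

# DataSourceUtils.PARTITIONING_COLUMNS_KEY expects a JSON-encoded array of
# column names, e.g. '["id"]'. json.dumps produces exactly that encoding,
# avoiding quoting mistakes in a hand-written string.
options = {"__partition_columns": json.dumps(partition_cols)}

# The dict would then be splatted into the catalog call, e.g.:
# spark.catalog.createTable("kevin.ktest1", "/user/kevin/ktest1", **options)
# spark.sql("alter table kevin.ktest1 recover partitions")
print(options["__partition_columns"])  # '["id"]'
```

This also scales to multiple partition columns (json.dumps(["year", "month"]) gives '["year", "month"]') without any manual quoting.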
> Add ability to create a partitioned table via catalog.createTable()
> -------------------------------------------------------------------
>
> Key: SPARK-31001
> URL: https://issues.apache.org/jira/browse/SPARK-31001
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Nicholas Chammas
> Priority: Minor
>
> There doesn't appear to be a way to create a partitioned table using the Catalog interface.
> In SQL, however, you can do this via {{CREATE TABLE ... PARTITIONED BY}}.
-- This message was sent by Atlassian Jira (v8.20.10#820010)