[jira] [Updated] (SPARK-39474) Streamline the options for the `.write` method of a Spark DataFrame

Chris Mahoney (Jira) Tue, 14 Jun 2022 19:53:04 -0700


     [ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Chris Mahoney updated SPARK-39474:
----------------------------------
    Description: 
Hi Team!

I'd like a much easier way to optimize my {{delta}} tables. 
Specifically, I am referring to the {{sql}} command {{OPTIMIZE <table>}}.

Let me show you the differences:

*Current:*

First, run:
{code:python}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1,2,3,4],
         'b': ['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder')
{code}
Then, once it's saved, run:
{code:sql}
CREATE TABLE df USING DELTA LOCATION './folder'
{code}
Then, once the table is created, run:
{code:sql}
OPTIMIZE df
-- or
OPTIMIZE df ZORDER BY (b)
{code}
As you can see, this takes three separate steps, split across Python and SQL.
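
In the meantime, the steps above can be wrapped in a small helper. This is only a sketch of a workaround, not an existing Spark or Delta API: the names {{build_optimize_sql}} and {{write_and_optimize}} are made up for illustration. It assumes a Delta Lake build that supports {{OPTIMIZE}} (and {{ZORDER BY}}) and that accepts the path-based form {{OPTIMIZE delta.`<path>`}}, which would make the {{CREATE TABLE}} step unnecessary.

```python
from typing import Optional, Sequence, Union


def build_optimize_sql(path: str,
                       zorder_by: Optional[Union[str, Sequence[str]]] = None) -> str:
    """Build an OPTIMIZE statement that addresses a Delta table by its path."""
    stmt = f"OPTIMIZE delta.`{path}`"
    if zorder_by:
        # Accept a single column name as well as a list/tuple of names.
        columns = [zorder_by] if isinstance(zorder_by, str) else list(zorder_by)
        stmt += " ZORDER BY (" + ", ".join(columns) + ")"
    return stmt


def write_and_optimize(spark, df, path: str, zorder_by=None) -> None:
    """Hypothetical helper: overwrite the Delta table at `path`, then optimize it.

    `spark` is an active SparkSession and `df` a Spark DataFrame.
    """
    df.write.mode("overwrite").format("delta").save(path)
    spark.sql(build_optimize_sql(path, zorder_by))
```

With this in place, the whole workflow would be one call, for example {{write_and_optimize(spark, df, './folder', zorder_by='b')}}. It still runs two separate operations under the hood, which is why a built-in option would be nicer.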

*Future:*
I'd like to be able to do something like this:
{code:python}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a':[1,2,3,4],
         'b':['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
# or
df.write.mode('overwrite').format('delta').options(optimize=True, zorder_by=('b')).save('./folder')
{code}
As you can see, it's much more streamlined, and it keeps the code at a 
higher level.
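
One detail about the proposed options: in Python, {{('b')}} is just the string {{'b'}}, not a tuple, and as far as I know PySpark converts option values to strings before handing them to the data source. A comma-separated string is therefore one plausible encoding for {{zorder_by}}. The sketch below shows how the two options might be interpreted on the receiving side. The option names {{optimize}} and {{zorder_by}} come from this proposal and do not exist today, and {{parse_optimize_options}} is a made-up name.

```python
from typing import Dict, List, Tuple


def parse_optimize_options(options: Dict[str, str]) -> Tuple[bool, List[str]]:
    """Hypothetical: interpret the proposed `optimize` / `zorder_by` writer options.

    Assumes every option value arrives as a string, e.g. "true" or "a, b".
    """
    optimize = options.get("optimize", "false").strip().lower() == "true"
    raw_columns = options.get("zorder_by", "")
    # Split the comma-separated column list and drop empty entries.
    columns = [c.strip() for c in raw_columns.split(",") if c.strip()]
    if columns and not optimize:
        raise ValueError("zorder_by only makes sense together with optimize=true")
    return optimize, columns
```

Rejecting {{zorder_by}} without {{optimize}} is a design choice in this sketch; silently ignoring it would also be possible but seems more error-prone.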

Thank you.

 
References:
 * [https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html]
 * [https://medium.com/@debusinha2009/cheatsheet-on-understanding-zorder-and-optimize-for-your-delta-tables-1556282221d3]
 * [https://www.cloudiqtech.com/partition-optimize-and-zorder-delta-tables-in-azure-databricks/]
 * [https://docs.databricks.com/delta/optimizations/file-mgmt.html]
 * [https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html]
 * [https://stackoverflow.com/questions/65320949/parquet-vs-delta-format-in-azure-data-lake-gen-2-store?_sm_au_=iVV4WjsV0q7WQktrJfsTkK7RqJB10]
 * [https://www.i-programmer.info/news/197-data-mining/12582-databricks-delta-adds-faster-parquet-import.html#:~:text=Databricks%20says%20Delta%20is%2010,data%20management%2C%20and%20query%20serving].

 

  was:
Hi Team!

I'd like to set up a much easier way to optimize my {{delta}} tables. 
Specifically, I am referring to the {{sql}} command {{{}OPTIMIZE <table>{}}}.

Let me show you the differences:

*Current:*

First, run:
{code:java}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1,2,3,4],
         'b': ['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder'){code}
 Then, once it's saved, run:
{code:java}
CREATE TABLE df USING DELTA LOCATION './folder' {code}
 Then, once the table is loaded, run:
{code:java}
OPTIMIZE df
--or
OPTIMIZE df ZORDER BY (b) {code}
As you can see, there are many steps needed.

*Future:*
I'd like to be able to do something like this:
{code:java}
import pandas as pd
df = spark.createDataFrame(
    pd.DataFrame(
        {'a':[1,2,3,4],
         'b':['a','b','c','d']}
    )
)
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
#or
df.write.mode('overwrite').format('delta').options(optimize=True,('b')).save('./folder')
 {code}
As you can see, it's much more streamlined, and keeps the code to a 
higher-level.

Thank you.

 
References:
 * https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html
 * 
https://medium.com/@debusinha2009/cheatsheet-on-understanding-zorder-and-optimize-for-your-delta-tables-1556282221d3
 * 
https://www.cloudiqtech.com/partition-optimize-and-zorder-delta-tables-in-azure-databricks/
 * https://docs.databricks.com/delta/optimizations/file-mgmt.html
 * 
https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html
 * 
https://stackoverflow.com/questions/65320949/parquet-vs-delta-format-in-azure-data-lake-gen-2-store?_sm_au_=iVV4WjsV0q7WQktrJfsTkK7RqJB10
 * 
https://www.i-programmer.info/news/197-data-mining/12582-databricks-delta-adds-faster-parquet-import.html#:~:text=Databricks%20says%20Delta%20is%2010,data%20management%2C%20and%20query%20serving.

 


> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
>                 Key: SPARK-39474
>                 URL: https://issues.apache.org/jira/browse/SPARK-39474
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Chris Mahoney
>            Priority: Minor
>             Fix For: 3.2.1
>



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
