[ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Mahoney updated SPARK-39474:
----------------------------------
Description:

Hi Team!

I'd like a much easier way to optimize my Delta tables. Specifically, I am referring to the SQL command {{OPTIMIZE <table>}}. Let me show you the difference:

*Current:*

First, run:
{code:python}
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder')
{code}
Then, once it is saved, run:
{code:sql}
CREATE TABLE df USING DELTA LOCATION './folder'
{code}
Then, once the table is registered, run:
{code:sql}
OPTIMIZE df
-- or
OPTIMIZE df ZORDER BY (b)
{code}
As you can see, many steps are needed.

*Future:*

I'd like to be able to do something like this:
{code:python}
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
# or, with a hypothetical option naming the Z-order column:
df.write.mode('overwrite').format('delta').options(optimize=True, zorderBy='b').save('./folder')
{code}
As you can see, this is much more streamlined, and it keeps the code at a higher level.

Thank you.

> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
>                 Key: SPARK-39474
>                 URL: https://issues.apache.org/jira/browse/SPARK-39474
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Chris Mahoney
>            Priority: Minor
>             Fix For: 3.2.1
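For what it's worth, the three "Current" steps can already be collapsed somewhat without any new `.write` options, using the DeltaTable Python API (a sketch, assuming delta-spark 2.0+ where {{optimize()}} is available, an active {{spark}} session, and the './folder' path written above):

{code:python}
from delta.tables import DeltaTable

# Load the Delta table directly from its storage path -- no CREATE TABLE step needed.
dt = DeltaTable.forPath(spark, './folder')

# Compact small files, equivalent to: OPTIMIZE df
dt.optimize().executeCompaction()

# ...or Z-order by column 'b', equivalent to: OPTIMIZE df ZORDER BY (b)
dt.optimize().executeZOrderBy('b')
{code}

This still leaves the write and the optimize as two calls, so the single-call {{.options(optimize=True)}} form proposed above would remain an improvement.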
--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org