[ https://issues.apache.org/jira/browse/SPARK-39474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Mahoney updated SPARK-39474:
----------------------------------
    Description: 
Hi Team!

I'd like to set up a much easier way to optimize my {{delta}} tables. Specifically, I am referring to the {{sql}} command {{OPTIMIZE <table>}}.

Let me show you the differences:

*Current:*

First, run:
{code:python}
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
df.write.mode('overwrite').format('delta').save('./folder')
{code}
Then, once it's saved, run:
{code:sql}
CREATE TABLE df USING DELTA LOCATION './folder'
{code}
Then, once the table is loaded, run:
{code:sql}
OPTIMIZE df
-- or
OPTIMIZE df ZORDER BY (b)
{code}
As you can see, there are many steps needed.

*Future:*

I'd like to be able to do something like this:
{code:python}
import pandas as pd

df = spark.createDataFrame(
    pd.DataFrame(
        {'a': [1, 2, 3, 4],
         'b': ['a', 'b', 'c', 'd']}
    )
)
# Proposed: optimize as part of the write (this option does not exist today)
df.write.mode('overwrite').format('delta').options(optimize=True).save('./folder')
# or, with a Z-order column (the option name 'zorderBy' is only illustrative)
df.write.mode('overwrite').format('delta').options(optimize=True, zorderBy='b').save('./folder')
{code}
As you can see, it's much more streamlined, and keeps the code at a higher level.

Thank you.

References:
* https://docs.azuredatabricks.net/_static/notebooks/delta/optimize-python.html
* https://medium.com/@debusinha2009/cheatsheet-on-understanding-zorder-and-optimize-for-your-delta-tables-1556282221d3
* https://www.cloudiqtech.com/partition-optimize-and-zorder-delta-tables-in-azure-databricks/
* https://docs.databricks.com/delta/optimizations/file-mgmt.html
* https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-optimize.html
* https://stackoverflow.com/questions/65320949/parquet-vs-delta-format-in-azure-data-lake-gen-2-store?_sm_au_=iVV4WjsV0q7WQktrJfsTkK7RqJB10
* https://www.i-programmer.info/news/197-data-mining/12582-databricks-delta-adds-faster-parquet-import.html#:~:text=Databricks%20says%20Delta%20is%2010,data%20management%2C%20and%20query%20serving.

> Streamline the options for the `.write` method of a Spark DataFrame
> -------------------------------------------------------------------
>
>                 Key: SPARK-39474
>                 URL: https://issues.apache.org/jira/browse/SPARK-39474
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.2.1
>            Reporter: Chris Mahoney
>            Priority: Minor
>             Fix For: 3.2.1

--
This message was sent by Atlassian Jira
(v8.20.7#820007)
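
For anyone who wants the one-call behaviour described in this issue before any such writer option exists, the multi-step flow can be wrapped in a small helper. The following is only a sketch under stated assumptions: the names {{build_optimize_sql}} and {{write_and_optimize}} are hypothetical and not part of Spark or Delta Lake, and it assumes a Delta Lake build whose SQL parser accepts {{OPTIMIZE delta.`<path>`}} with an optional {{ZORDER BY}} clause (Databricks, or a sufficiently recent open-source release). Addressing the table by path skips the {{CREATE TABLE}} step.

```python
from typing import Optional, Sequence


def build_optimize_sql(path: str, zorder_by: Optional[Sequence[str]] = None) -> str:
    """Build an OPTIMIZE statement for a path-based Delta table.

    Hypothetical helper. The path and column names are inserted as-is,
    so they must not contain backticks or untrusted input.
    """
    stmt = f"OPTIMIZE delta.`{path}`"
    if zorder_by:
        # Append the optional Z-order clause, e.g. "ZORDER BY (b)".
        stmt += " ZORDER BY (" + ", ".join(zorder_by) + ")"
    return stmt


def write_and_optimize(spark, df, path, zorder_by=None, mode="overwrite"):
    """Write `df` as a Delta table at `path`, then run OPTIMIZE on it.

    Hypothetical helper: `spark` is an active SparkSession with Delta Lake
    configured, and `df` is a Spark DataFrame.
    """
    # Step 1: the same write call used in the issue description.
    df.write.mode(mode).format("delta").save(path)
    # Step 2: compact (and optionally Z-order) the files that were just written.
    spark.sql(build_optimize_sql(path, zorder_by))
```

With this in place, the "Future" example becomes {{write_and_optimize(spark, df, './folder')}} or {{write_and_optimize(spark, df, './folder', zorder_by=['b'])}}. The write and the optimize still run as two separate operations; the helper only hides the extra steps.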