jwass opened a new issue, #1296:
URL: https://github.com/apache/sedona/issues/1296

   ## Expected behavior
   
   When writing out a GeoParquet dataframe that results in multiple files, the `_metadata` summary file is not created, even when the writer is configured to produce it.
   
   ```
   import sedona
   from sedona.spark import *
   sedona = SedonaContext.create(spark)
   
   print("spark version: {}".format(spark.version))
   print("sedona version: {}".format(sedona.version))
   spark.conf.set("parquet.summary.metadata.level", "ALL")
   
   def write_geoparquet(df, path):
       df.write.format("geoparquet") \
           .option("geoparquet.version", "1.0.0") \
           .option("geoparquet.crs", "") \
           .option("compression", "zstd") \
           .option("parquet.block.size", 16 * 1024 * 1024) \
           .option("maxRecordsPerFile", 10000000) \
           .mode("overwrite").save(path)
   
    df = sedona.read.format("geoparquet").option("mergeSchema", "true").load(input_path)
   write_geoparquet(df, output_path)
   ```
   
   If the number of records exceeds `maxRecordsPerFile` so that more than one file is written, the `_metadata` and `_common_metadata` files are not written. When there are few enough records that only one file is written, `_metadata` and `_common_metadata` are created.
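   To confirm which summary files a given write produced, a quick directory check works (a plain-stdlib sketch; the path argument is illustrative):

   ```python
   import pathlib

   def summary_files(path):
       """Return which Parquet summary files exist in an output directory."""
       # _common_metadata holds only the schema; _metadata additionally
       # holds per-file row-group metadata.
       return sorted(p.name for p in pathlib.Path(path).iterdir()
                     if p.name in ("_metadata", "_common_metadata"))
   ```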
   
   However, if I change the above to write parquet instead of geoparquet:
   
   ```
   def write_parquet(df, path):
       df.write.format("parquet") \
           .option("compression", "zstd") \
           .option("parquet.block.size", 16 * 1024 * 1024) \
           .option("maxRecordsPerFile", 10000000) \
           .mode("overwrite").save(path)
   
   write_parquet(df, output_path)
   ```
   
   Then `_metadata` and `_common_metadata` will be written even with multiple 
files. Is there a setting or other way to enable writing the common metadata 
files?
   
   I'd like to write these files so that readers such as pyarrow can load full datasets without scanning every file's footer, which can be time-consuming for large datasets.
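   In the meantime, one possible workaround is to rebuild the summary files with pyarrow after Spark finishes writing. This is a sketch under assumptions not in the original report: the part files are on a locally accessible filesystem, all share one schema, and the function name is illustrative. GeoParquet files are ordinary Parquet files, so pyarrow can read their footers.

   ```python
   import pathlib
   import pyarrow.parquet as pq

   def write_summary_metadata(dataset_dir):
       """Recreate _metadata and _common_metadata from the part files."""
       root = pathlib.Path(dataset_dir)
       parts = sorted(root.glob("*.parquet"))
       schema = pq.read_schema(parts[0])
       collector = []
       for part in parts:
           md = pq.read_metadata(part)
           # File paths inside _metadata are relative to the dataset root.
           md.set_file_path(part.name)
           collector.append(md)
       # _common_metadata carries only the schema; _metadata also carries
       # the row-group metadata collected from each part file.
       pq.write_metadata(schema, root / "_common_metadata")
       pq.write_metadata(schema, root / "_metadata", metadata_collector=collector)
   ```

   Whether downstream readers pick these files up depends on the reader, but this at least mirrors what the Parquet summary-file convention expects.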
   
   ## Settings
   
   Sedona version = 3.4.1
   Apache Spark version = 3.4.1
   
   Environment = Databricks
   