Hi

I have many thousands of small DataFrames that I would like to save to a
single Parquet file to avoid the HDFS 'small files' problem. My
understanding is that there is a 1:1 relationship between a DataFrame and
a Parquet file when a single partition is used.

Is it possible to have multiple DataFrames within a single Parquet file
using PySpark?
Or is the only way to achieve this to union the DataFrames into one? A
sketch of what I mean by the union approach is below.
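
For reference, this is roughly the union approach I have in mind (just a
sketch: the DataFrames and the output path are made up, and it assumes
all of them share the same schema):

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-ins for the real small DataFrames; all share one schema.
    dfs = [spark.createDataFrame([(i, "x")], ["id", "val"])
           for i in range(3)]

    # Fold the list down to a single DataFrame; union() matches
    # columns by position, unionByName() would match by name.
    combined = reduce(lambda a, b: a.union(b), dfs)

    # coalesce(1) keeps the output to one part file; Parquet still
    # writes a directory containing that single part-*.parquet file.
    combined.coalesce(1).write.parquet("hdfs:///tmp/combined.parquet")

My concern is that folding many thousands of union() calls like this
builds a very long query plan, which is partly why I'm asking whether
there is another way.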

Thanks,
Peter
