Hi, I have many thousands of small DataFrames that I would like to save to a single Parquet file to avoid the HDFS 'small files' problem. My understanding is that there is a 1:1 relationship between DataFrames and Parquet files when a single partition is used.
Is it possible to store multiple DataFrames within a single Parquet file using PySpark, or is the only way to achieve this to union the DataFrames into one? Thanks, Peter