How did you end up with thousands of DataFrames? Are you using streaming? In
that case you can use foreachRDD to keep merging the incoming RDDs into a
single RDD, and then save it through your own checkpoint mechanism.

If not, please share your use case.
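
A rough sketch of that foreachRDD approach in PySpark, assuming a socket text
stream as the source; the host, port, output path and 10-second batch interval
are made up, and the write uses the DataFrameWriter API from Spark 1.4+:

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="merge-small-batches")
ssc = StreamingContext(sc, 10)                     # 10-second batches
sqlContext = SQLContext(sc)

lines = ssc.socketTextStream("localhost", 9999)    # hypothetical source

# Driver-side RDD that accumulates every batch seen so far.
merged = sc.parallelize([])

def merge_and_save(time, rdd):
    global merged
    if rdd.isEmpty():
        return
    merged = merged.union(rdd).coalesce(4).cache()
    # "Your own checkpoint mechanism": rewrite the merged data as one
    # Parquet dataset instead of one tiny file per incoming batch.
    df = sqlContext.createDataFrame(merged.map(lambda line: Row(value=line)))
    df.write.mode("overwrite").parquet("hdfs:///data/merged.parquet")

lines.foreachRDD(merge_and_save)

ssc.start()
ssc.awaitTermination()

(In a long-running job you would also want to checkpoint the merged RDD now
and then so its lineage does not grow without bound.)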
On 11 May 2015 00:38, "Peter Aberline" <peter.aberl...@gmail.com> wrote:

> Hi
>
> I have many thousands of small DataFrames that I would like to save to a
> single Parquet file to avoid the HDFS 'small files' problem. My
> understanding is that there is a 1:1 relationship between DataFrames and
> Parquet files if a single partition is used.
>
> Is it possible to have multiple DataFrames within a single Parquet file
> using PySpark?
> Or is the only way to achieve this to union the DataFrames into one?
>
> Thanks,
> Peter
>
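
For what it's worth, the union route might look like this in PySpark; a
minimal sketch, assuming small_dfs is a list of DataFrames that all share one
schema and the output path is made up:

from functools import reduce

# Fold the small DataFrames into one (unionAll is the Spark 1.x name;
# it is called union from Spark 2.0 on).
combined = reduce(lambda a, b: a.unionAll(b), small_dfs)

# coalesce(1) gives a single partition, so the write produces one Parquet
# part file rather than thousands of tiny files on HDFS.
combined.coalesce(1).write.parquet("hdfs:///data/combined.parquet")

With thousands of inputs the resulting plan gets deep, so it can be cheaper to
union the underlying RDDs with SparkContext.union and build one DataFrame from
the result before writing.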
