Re: Multiple DataFrames per Parquet file?
You can union all the DataFrames together, then call repartition().

On Sun, May 10, 2015 at 8:34 AM, Peter Aberline wrote:
> Hi
>
> Thanks for the quick response.
>
> No, I'm not using Streaming. Each DataFrame represents tabular data read
> from a CSV file. They all have the same schema.
>
> There is also the option of appending each DF to the Parquet file, but
> then I can't maintain them as separate DFs when reading back in without
> filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
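A minimal sketch of that approach in PySpark, assuming `dfs` is a Python list
holding the DataFrames (the output path is illustrative):

    from functools import reduce

    # Assumes `dfs` is a list of DataFrames that all share one schema.
    # unionAll() is the Spark 1.3/1.4 name; Spark 2.x renamed it union().
    combined = reduce(lambda a, b: a.unionAll(b), dfs)

    # Collapse to a few partitions so the result is a handful of Parquet
    # part files rather than thousands of tiny ones.
    combined.repartition(1).saveAsParquetFile('hdfs:///data/combined.parquet')
    # Spark 1.4+ equivalent:
    # combined.repartition(1).write.parquet('hdfs:///data/combined.parquet')

Note that repartition(1) funnels the whole write through a single task; a
larger partition count trades file count for write parallelism.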
Re: Multiple DataFrames per Parquet file?
Hi,

In that case, read the entire folder as an RDD and give it some reasonable
number of partitions.

Best,
Ayan

On 11 May 2015 01:35, "Peter Aberline" wrote:
> Hi
>
> Thanks for the quick response.
>
> No, I'm not using Streaming. Each DataFrame represents tabular data read
> from a CSV file. They all have the same schema.
>
> There is also the option of appending each DF to the Parquet file, but
> then I can't maintain them as separate DFs when reading back in without
> filtering.
>
> I'll rethink maintaining each CSV file as a single DF.
>
> Thanks,
> Peter
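A sketch of what that could look like; the path, partition count, and schema
handling are illustrative only:

    # sc.textFile() accepts a directory (or glob), so thousands of small
    # CSVs come back as one RDD of lines.
    lines = sc.textFile('hdfs:///data/csv_dir')

    # minPartitions is only a lower bound (at least one split per file),
    # so with many small files it is coalesce() that actually brings the
    # partition count down to something reasonable:
    lines = lines.coalesce(16)

    rows = lines.map(lambda line: line.split(','))
    # rows can then become a single DataFrame via
    # sqlContext.createDataFrame(rows, schema) and be saved as one
    # Parquet file.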
Re: Multiple DataFrames per Parquet file?
Hi,

Thanks for the quick response.

No, I'm not using Streaming. Each DataFrame represents tabular data read
from a CSV file. They all have the same schema.

There is also the option of appending each DF to the Parquet file, but then
I can't maintain them as separate DFs when reading back in without
filtering.

I'll rethink maintaining each CSV file as a single DF.

Thanks,
Peter

On 10 May 2015 at 15:51, ayan guha wrote:
> How did you end up with thousands of DFs? Are you using streaming? In that
> case you can do foreachRDD and keep merging incoming RDDs into a single
> RDD, then save it through your own checkpoint mechanism.
>
> If not, please share your use case.
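For reference, a sketch of the append-and-filter option mentioned above; the
source_file column, file name, and paths are made up, and DataFrame.write /
sqlContext.read is the Spark 1.4+ API (1.3 uses df.save() and
sqlContext.parquetFile() instead):

    from pyspark.sql.functions import lit

    # Tag each DataFrame with its source CSV before appending, so the
    # individual tables stay distinguishable inside one Parquet store.
    tagged = df.withColumn('source_file', lit('data_0001.csv'))
    tagged.write.mode('append').parquet('hdfs:///data/all.parquet')

    # Reading one logical DataFrame back then needs the filter described
    # above:
    one_df = (sqlContext.read.parquet('hdfs:///data/all.parquet')
                        .where("source_file = 'data_0001.csv'"))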
Re: Multiple DataFrames per Parquet file?
How did you end up with thousands of DFs? Are you using streaming? In that
case you can do foreachRDD and keep merging incoming RDDs into a single RDD,
then save it through your own checkpoint mechanism.

If not, please share your use case.

On 11 May 2015 00:38, "Peter Aberline" wrote:
> Hi
>
> I have many thousands of small DataFrames that I would like to save to the
> one Parquet file to avoid the HDFS 'small files' problem. My understanding
> is that there is a 1:1 relationship between DataFrames and Parquet files
> if a single partition is used.
>
> Is it possible to have multiple DataFrames within the one Parquet file
> using PySpark?
> Or is the only way to achieve this to union the DataFrames into one?
>
> Thanks,
> Peter
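If streaming were in play, the merge-in-foreachRDD idea might look roughly
like this; the DStream, checkpoint path, and save cadence are illustrative
only, not from the thread:

    import time

    merged = sc.parallelize([])  # start with an empty RDD

    def absorb(rdd):
        # Fold each incoming micro-batch into one accumulated RDD.
        global merged
        if not rdd.isEmpty():
            merged = merged.union(rdd)
            # "Your own checkpoint mechanism": periodically persist the
            # merged data; a timestamped path avoids overwrite errors.
            merged.saveAsTextFile(
                'hdfs:///checkpoints/merged_%d' % int(time.time()))

    # dstream: an existing DStream from a StreamingContext.
    dstream.foreachRDD(absorb)

Repeated union() grows the RDD lineage, so in practice the merged RDD would
also need caching or checkpointing to keep recomputation bounded.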