Re: Multiple DataFrames per Parquet file?

2015-05-17 Thread Davies Liu
You can union all the DataFrames together, then call repartition().
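
A minimal PySpark sketch of that approach, assuming the Spark 1.4+
DataFrameReader/Writer API and the spark-csv package; the paths and the
header option are illustrative, not from the thread:

from functools import reduce

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="union-and-repartition")
sqlContext = SQLContext(sc)

# Stand-ins for the thousands of small, same-schema CSV files.
paths = ["/data/csv/a.csv", "/data/csv/b.csv", "/data/csv/c.csv"]
dfs = [sqlContext.read.format("com.databricks.spark.csv")
                 .option("header", "true")
                 .load(p)
       for p in paths]

# unionAll (Spark 1.x) is a cheap, shuffle-free concatenation; repartition(1)
# then yields a single Parquet part-file for the whole dataset.
combined = reduce(lambda a, b: a.unionAll(b), dfs)
combined.repartition(1).write.parquet("/data/combined.parquet")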


Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
Hi

In that case, read the entire folder as a single RDD and give it a reasonable
number of partitions.
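
A rough sketch of what that could look like; the two-column schema, paths,
and partition counts are invented, and textFile's minPartitions is only a
lower bound (coalesce afterwards if you need fewer output files):

from pyspark import SparkContext
from pyspark.sql import SQLContext, Row

sc = SparkContext(appName="folder-as-rdd")
sqlContext = SQLContext(sc)

# textFile accepts a directory or glob and reads every matching file.
lines = sc.textFile("/data/csv/*.csv", minPartitions=16)

def parse(line):
    # One parser covers every line because all files share the same schema.
    # (Assumes no header rows; filter them out first if present.)
    name, value = line.split(",")
    return Row(name=name, value=float(value))

df = sqlContext.createDataFrame(lines.map(parse))
df.coalesce(4).write.parquet("/data/combined.parquet")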

Best
Ayan


Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread Peter Aberline
Hi

Thanks for the quick response.

No, I'm not using Streaming. Each DataFrame holds tabular data read from a
CSV file, and they all share the same schema.

There is also the option of appending each DataFrame to the Parquet file, but
then I can't get them back as separate DataFrames when reading in without
filtering.
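
For reference, a hedged sketch of that append-plus-filter option, assuming
the Spark 1.4+ writer API and the spark-csv package; "source_file" is an
invented tag column, and the paths are examples only:

from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import lit

sc = SparkContext(appName="append-with-tag")
sqlContext = SQLContext(sc)

csv_paths = ["/data/csv/a.csv", "/data/csv/b.csv"]

for path in csv_paths:
    df = (sqlContext.read.format("com.databricks.spark.csv")
          .option("header", "true")
          .load(path))
    # Tag each row with its origin so the original DataFrames stay recoverable.
    df.withColumn("source_file", lit(path)) \
      .write.mode("append").parquet("/data/all.parquet")

# Getting one original DataFrame back requires the filter mentioned above.
all_df = sqlContext.read.parquet("/data/all.parquet")
one_df = all_df.filter(all_df.source_file == csv_paths[0])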

I'll rethink maintaining each CSV file as its own DataFrame.

Thanks,
Peter


Re: Multiple DataFrames per Parquet file?

2015-05-10 Thread ayan guha
How did you end up with thousands of DataFrames? Are you using Streaming? In
that case you can use foreachRDD to keep merging the incoming RDDs into a
single RDD, and then save it through your own checkpoint mechanism.

If not, please share your use case.
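
A rough sketch of that foreachRDD pattern, purely illustrative: the socket
source, batch interval, and snapshot paths are invented.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="merge-stream")
ssc = StreamingContext(sc, 10)  # 10-second batches, illustrative only

lines = ssc.socketTextStream("localhost", 9999)

# Driver-side holder for the running union plus a batch counter.
state = {"merged": None, "batch": 0}

def merge_and_save(rdd):
    if rdd.isEmpty():
        return
    state["merged"] = rdd if state["merged"] is None else state["merged"].union(rdd)
    state["batch"] += 1
    # "Your own checkpoint mechanism": write each consolidated snapshot to a
    # fresh directory. A real job would also cache or checkpoint the union
    # so its lineage does not grow without bound.
    state["merged"].saveAsTextFile("/data/snapshots/batch-%d" % state["batch"])

lines.foreachRDD(merge_and_save)
ssc.start()
ssc.awaitTermination()
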
On 11 May 2015 00:38, "Peter Aberline"  wrote:

> Hi
>
> I have many thousands of small DataFrames that I would like to save to a
> single Parquet file to avoid the HDFS 'small files' problem. My understanding
> is that there is a 1:1 relationship between DataFrames and Parquet files if
> a single partition is used.
>
> Is it possible to have multiple DataFrames within a single Parquet file
> using PySpark?
> Or is the only way to achieve this to union the DataFrames into one?
>
> Thanks,
> Peter
>
>
>