Re: Union of RDDs without the overhead of Union

Rishi Mishra Tue, 02 Feb 2016 21:25:25 -0800

I agree with Koert that UnionRDD should have only narrow dependencies, so the
union by itself should not add a stage.
A union of two RDDs does increase the number of tasks to be executed, though
(rdd1.partitions + rdd2.partitions).
If your two RDDs have the same number of partitions, you can also use
zipPartitions, which results in fewer tasks and hence less overhead.
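
A rough sketch of the two options, in case it helps. The object and method
names (UnionVsZip, viaUnion, viaZipPartitions) are made up for illustration,
and the zipPartitions variant assumes both RDDs really do have the same number
of partitions; Spark rejects the call otherwise.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object UnionVsZip {

  // union: narrow dependency, no shuffle. The result keeps every input
  // partition, so a following action runs one task per partition of
  // either RDD (rdd1.partitions + rdd2.partitions).
  def viaUnion(a: RDD[String], b: RDD[String]): RDD[String] =
    a.union(b)

  // zipPartitions: also a narrow dependency, but partition i of the result
  // is partition i of `a` followed by partition i of `b`, so the task count
  // equals the partition count of a single input.
  // Assumption: a and b have the same number of partitions.
  def viaZipPartitions(a: RDD[String], b: RDD[String]): RDD[String] =
    a.zipPartitions(b) { (left, right) => left ++ right }
}
```

With rdd1 and rdd2 each created with 4 partitions,
viaUnion(rdd1, rdd2).saveAsTextFile(path) would write 8 part files and
viaZipPartitions(rdd1, rdd2).saveAsTextFile(path) would write 4. Note that the
record order differs: union emits all of rdd1 and then all of rdd2, while
zipPartitions interleaves them partition by partition.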


On Wed, Feb 3, 2016 at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:

> i am surprised union introduces a stage. UnionRDD should have only narrow
> dependencies.
>
> On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>>
>> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> Does anyone know how to union two RDDs without the overhead of it?
>>>
>>> say rdd1.union(rdd2).saveAsTextFile(..)
>>> This requires a stage to union the 2 RDDs before saveAsTextFile (2
>>> stages). Is there a way to skip the union step but have the contents of the
>>> two RDDs saved to the same output text file?
>>>
>>> Thank you!
>>>
>>> Jerry
>>>
>>
>>
>


-- 
Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra
