I agree with Koert that UnionRDD should have only narrow dependencies. However, the union of two RDDs does increase the number of tasks to be executed (rdd1.partitions + rdd2.partitions). If your two RDDs have the same number of partitions, you can also use zipPartitions, which results in fewer tasks and hence less overhead.
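To make the difference in task count concrete, here is a minimal sketch that compares the two approaches on a local master. The object name, app name, sample data, and partition counts are all made up for illustration; only `union`, `zipPartitions`, `parallelize`, and `partitions` are Spark API calls.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; names and data are illustrative only.
object UnionVsZipPartitions {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("union-vs-zipPartitions")
      .setMaster("local[2]") // assumption: local run, just for the comparison
    val sc = new SparkContext(conf)
    try {
      // Two RDDs with the same number of partitions (4 each).
      val rdd1 = sc.parallelize(Seq("a", "b", "c", "d"), 4)
      val rdd2 = sc.parallelize(Seq("e", "f", "g", "h"), 4)

      // union keeps every input partition: 4 + 4 = 8 partitions, so 8 tasks.
      val unioned = rdd1.union(rdd2)

      // zipPartitions pairs partition i of rdd1 with partition i of rdd2
      // and concatenates their iterators: 4 partitions, so 4 tasks.
      val zipped = rdd1.zipPartitions(rdd2) { (left, right) => left ++ right }

      println(s"union partitions:         ${unioned.partitions.length}")
      println(s"zipPartitions partitions: ${zipped.partitions.length}")
    } finally {
      sc.stop()
    }
  }
}
```

Either result can then be written with saveAsTextFile. Note that zipPartitions fails if the two RDDs have different numbers of partitions, and the element order in the output differs from union: each output partition holds one partition of rdd1 followed by the matching partition of rdd2.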
On Wed, Feb 3, 2016 at 9:58 AM, Koert Kuipers <ko...@tresata.com> wrote:

> i am surprised union introduces a stage. UnionRDD should have only narrow
> dependencies.
>
> On Tue, Feb 2, 2016 at 11:25 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> well the "hadoop" way is to save to a/b and a/c and read from a/* :)
>>
>> On Tue, Feb 2, 2016 at 11:05 PM, Jerry Lam <chiling...@gmail.com> wrote:
>>
>>> Hi Spark users and developers,
>>>
>>> anyone knows how to union two RDDs without the overhead of it?
>>>
>>> say rdd1.union(rdd2).saveTextFile(..)
>>> This requires a stage to union the 2 rdds before saveAsTextFile (2
>>> stages). Is there a way to skip the union step but have the contents of the
>>> two rdds save to the same output text file?
>>>
>>> Thank you!
>>>
>>> Jerry

--
Regards,
Rishitesh Mishra,
SnappyData (http://www.snappydata.io/)
https://in.linkedin.com/in/rishiteshmishra