Re: Union of 2 RDD's only returns the first one

Aureliano Buendia Wed, 22 Jan 2014 16:45:47 -0800

On Thu, Jan 23, 2014 at 12:37 AM, Patrick Wendell <pwend...@gmail.com> wrote:


> What is the ++ operator here? Is this something you defined?
>

No, it's an alias for union defined in RDD.scala:

def ++(other: RDD[T]): RDD[T] = this.union(other)


>
> Another issue is that RDDs are not ordered, so when you union two
> together the result doesn't have a well-defined ordering.
>
> If you do want to do this you could coalesce into one partition, then
> call mapPartitions and return an iterator that first emits your header
> and then the rest of the file, then call saveAsTextFile. Keep in mind
> this will only work if you coalesce into a single partition.
>

Thanks! I'll give this a try.
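
To see why the single-partition caveat matters, here is a minimal plain-Scala
sketch that needs no Spark. The object name HeaderPerPartition and the local
mapPartitions helper are hypothetical stand-ins that only mimic how a function
is applied to each partition's iterator; they are not Spark's API.

```scala
object HeaderPerPartition {
  val header = "col1,col2,col3"

  // Hypothetical stand-in for RDD.mapPartitions: each inner Seq models one
  // partition, and f is applied once per partition iterator.
  def mapPartitions(parts: Seq[Seq[String]])(
      f: Iterator[String] => Iterator[String]): Seq[Seq[String]] =
    parts.map(p => f(p.iterator).toList)

  // Lazily emit the header first, then the partition's own rows.
  def addHeader(it: Iterator[String]): Iterator[String] =
    Iterator(header) ++ it

  def main(args: Array[String]): Unit = {
    val onePartition  = Seq(Seq("1,2,3", "4,5,6"))
    val twoPartitions = Seq(Seq("1,2,3"), Seq("4,5,6"))

    // One partition: the header appears exactly once, at the top.
    mapPartitions(onePartition)(addHeader).flatten.foreach(println)
    println("----")
    // Two partitions: the header is repeated at the start of each one.
    mapPartitions(twoPartitions)(addHeader).flatten.foreach(println)
  }
}
```

With more than one partition, every partition gets its own header line, which
is why the approach depends on coalesce(1).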


>
> myRdd.coalesce(1)
>   .map(_.mkString(","))
>   .mapPartitions(it => (Seq("col1,col2,col3") ++ it).iterator)
>   .saveAsTextFile("out.csv")
>
> - Patrick
>
> On Wed, Jan 22, 2014 at 11:12 AM, Aureliano Buendia
> <buendia...@gmail.com> wrote:
> > Hi,
> >
> > I'm trying to find a way to create a csv header when using saveAsTextFile,
> > and I came up with this:
> >
> > (sc.makeRDD(Array("col1,col2,col3"), 1) ++
> > myRdd.coalesce(1).map(_.mkString(",")))
> >       .saveAsTextFile("out.csv")
> >
> > But it only saves the header part. Why is it that the union method does not
> > return both RDDs?
>
