If you take the time to actually learn Scala starting from its fundamental
concepts AND, quite importantly, get familiar with general functional
programming concepts, you'd immediately realize which things you'd
really miss going back to Java (8).
On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła wrote:
Hi,
This is an ugly solution because it requires pulling out a row:
val rdd: RDD[Row] = ...
ctx.createDataFrame(rdd, rdd.first().schema)
Is there a better alternative to get a DataFrame from an RDD[Row], since
toDF won't work as Row is not a Product?
Thanks,
Marius
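A sketch of one alternative, assuming the row shape is known up front (the column names and types below are hypothetical): build the schema explicitly as a StructType instead of pulling it off the first row.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Hypothetical schema; replace with the actual shape of your rows.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true)
))

val rdd: RDD[Row] = ...                    // as in the question
val df = ctx.createDataFrame(rdd, schema)  // no rdd.first() needed
```

This avoids triggering a job just to read the schema, at the cost of keeping the schema definition in sync with the rows by hand.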
Collect one side of the join and register it as a broadcast variable. Then
run a map operation to perform the join and whatever else you need to do.
This will remove a shuffle stage, but you will still have to collect the
RDD being joined and broadcast it. Whether it's worth it depends on the
size of your data.
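A minimal sketch of that map-side join pattern (identifiers hypothetical; assumes the small side comfortably fits in driver memory). The per-partition logic is plain Scala; the Spark wiring is shown in comments:

```scala
// Inner-join one partition of the big side against an in-memory lookup
// table: keys missing from the lookup are dropped.
def mapSideJoin[K, V, W](rows: Iterator[(K, V)],
                         lookup: Map[K, W]): Iterator[(K, (V, W))] =
  rows.flatMap { case (k, v) => lookup.get(k).map(w => (k, (v, w))) }

// Spark wiring (sketch):
//   val lookup = sc.broadcast(small.collectAsMap().toMap)  // collect small side on the driver
//   val joined = big.mapPartitions(it => mapSideJoin(it, lookup.value))  // map stage, no shuffle
```

Because the join happens inside mapPartitions, no shuffle stage is scheduled for it; the trade-off is the driver-side collect and the broadcast cost.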
From: Marius Danciu
Date
Hi all,
If I have something like:
rdd.join(...).mapPartitionToPair(...)
It looks like mapPartitionToPair runs in a different stage than join. Is
there a way to piggyback this computation onto the join stage, such that
each resulting partition after the join is passed to
the mapPartitionToPair function?
Turned out it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com
wrote:
Hi Imran,
Yes, that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is partitioned
If you can share a bit more information on
your partitioner, and what properties you need for your f, that might
help.
thanks,
Imran
On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com
wrote:
Hello all,
I have the following Spark (pseudo)code:
rdd = mapPartitionsWithIndex
You can use repartitionAndSortWithinPartitions to do it in one shot.
Thanks,
Silvio
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 8:10 AM
To: user
Subject: Spark partitioning question
Hello all,
I have the following Spark (pseudo)code:
rdd = mapPartitionsWithIndex(...)
.mapPartitionsToPair(...)
.groupByKey()
.sortByKey(comparator)
.partitionBy(myPartitioner)
.mapPartitionsWithIndex(...)
.mapPartitionsToPair( *f* )
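Silvio's one-shot suggestion, sketched against the chain above (a sketch, not the author's exact code; `pairs` stands for the pair RDD produced by the earlier steps and `f` for the final per-partition function):

```scala
// groupByKey, sortByKey and partitionBy each imply a shuffle; this single
// call partitions by myPartitioner AND sorts keys within each partition
// in one shuffle, so f sees each partition's records in key order.
val result = pairs
  .repartitionAndSortWithinPartitions(myPartitioner)
  .mapPartitions(f)
```

Note the difference in guarantees: sortByKey gives a total order across partitions, while repartitionAndSortWithinPartitions only sorts within each partition, which is what the per-partition f needs here.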
The input
Anyone?
On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com
wrote:
Hello anyone,
I have a question regarding the sort shuffle. Roughly I'm doing something
like:
rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2)
The problem is that in f2 I don't see