Re: Java 8 vs Scala

2015-07-17 Thread Marius Danciu
If you take the time to actually learn Scala starting from its fundamental concepts AND, quite importantly, get familiar with general functional programming concepts, you'd immediately realize the things you'd really miss going back to Java (8). On Fri, Jul 17, 2015 at 8:14 AM Wojciech Pituła

DataFrame from RDD[Row]

2015-07-16 Thread Marius Danciu
Hi, This is an ugly solution because it requires pulling out a row:

    val rdd: RDD[Row] = ...
    ctx.createDataFrame(rdd, rdd.first().schema)

Is there a better alternative to get a DataFrame from an RDD[Row], since toDF won't work as Row is not a Product? Thanks, Marius
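A minimal sketch of a common alternative, assuming the schema is known up front so no row has to be pulled out of the RDD; the field names are illustrative and sc / sqlContext are assumed in scope as in spark-shell:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

    // Declare the schema explicitly instead of reading it off rdd.first(),
    // which would trigger a job just to recover metadata.
    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    val rdd: RDD[Row] = sc.parallelize(Seq(Row("alice", 30), Row("bob", 25)))
    val df = sqlContext.createDataFrame(rdd, schema)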

Re: Optimizations

2015-07-03 Thread Marius Danciu
…it as a broadcast variable. Then run a map operation to perform the join and whatever else you need to do. This will remove a shuffle stage, but you will still have to collect the joined RDD and broadcast it. It all depends on the size of your data whether it's worth it or not. From: Marius Danciu Date
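A minimal sketch of the map-side join described above, assuming the small side fits in memory; the RDD names and types are illustrative, with sc as in spark-shell:

    import org.apache.spark.rdd.RDD

    val small: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2)))
    val large: RDD[(String, String)] = sc.parallelize(Seq(("a", "x"), ("c", "y")))

    // Collect the small side and ship it to every executor once.
    val lookup = sc.broadcast(small.collectAsMap())

    // Perform the join inside a map; no shuffle stage is introduced.
    val joined: RDD[(String, (Int, String))] = large.mapPartitions { iter =>
      iter.flatMap { case (k, v) => lookup.value.get(k).map(w => (k, (w, v))) }
    }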

Optimizations

2015-07-03 Thread Marius Danciu
Hi all, If I have something like: rdd.join(...).mapPartitionsToPair(...) It looks like mapPartitionsToPair runs in a different stage than join. Is there a way to piggyback this computation inside the join stage? ... such that each result partition after the join is passed to mapPartitionsToPair
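Worth noting for this thread: stages split only at shuffle boundaries, so a narrow transformation chained after a join should already be pipelined into the join's post-shuffle stage. A sketch of the pattern, with illustrative RDD names and sc as in spark-shell:

    import org.apache.spark.rdd.RDD

    val a: RDD[(Int, String)] = sc.parallelize(Seq((1, "a"), (2, "b")))
    val b: RDD[(Int, String)] = sc.parallelize(Seq((1, "x"), (2, "y")))

    // join shuffles both inputs; the mapPartitions below is a narrow
    // dependency, so it runs in the same post-shuffle stage as the join
    // rather than adding a new one.
    val result = a.join(b).mapPartitions { iter =>
      iter.map { case (k, (v, w)) => (k, v + w) }
    }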

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
Turned out that it was sufficient to do repartitionAndSortWithinPartitions ... so far so good ;) On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com wrote: Hi Imran, Yes that's what MyPartitioner does. I do see (using traces from MyPartitioner) that the key is partitioned
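For anyone landing on this thread, a minimal sketch of the call that resolved it, with an illustrative HashPartitioner standing in for MyPartitioner and sc as in spark-shell:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val pairs: RDD[(Int, String)] = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b")))

    // Partitions by key and sorts within each partition in a single
    // shuffle, replacing a separate sortByKey + partitionBy pair.
    val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(4))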

Re: Spark partitioning question

2015-05-05 Thread Marius Danciu
…share a bit more information on your partitioner, and what properties you need for your f, that might help. thanks, Imran On Tue, Apr 28, 2015 at 7:10 AM, Marius Danciu marius.dan...@gmail.com wrote: Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex

Re: Spark partitioning question

2015-04-28 Thread Marius Danciu
…repartitionAndSortWithinPartitions to do it in one shot. Thanks, Silvio From: Marius Danciu Date: Tuesday, April 28, 2015 at 8:10 AM To: user Subject: Spark partitioning question Hello all, I have the following Spark (pseudo)code: rdd = mapPartitionsWithIndex

Spark partitioning question

2015-04-28 Thread Marius Danciu
Hello all, I have the following Spark (pseudo)code:

    rdd = mapPartitionsWithIndex(...)
      .mapPartitionsToPair(...)
      .groupByKey()
      .sortByKey(comparator)
      .partitionBy(myPartitioner)
      .mapPartitionsWithIndex(...)
      .mapPartitionsToPair( *f* )

The input

Re: Shuffle question

2015-04-22 Thread Marius Danciu
Anyone? On Tue, Apr 21, 2015 at 3:38 PM Marius Danciu marius.dan...@gmail.com wrote: Hello everyone, I have a question regarding the sort shuffle. Roughly I'm doing something like: rdd.mapPartitionsToPair(f1).groupByKey().mapPartitionsToPair(f2) The problem is that in f2 I don't see
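Since the message is truncated, a general observation: groupByKey makes no ordering guarantee for the data reaching f2, and the resolution further up this listing was to sort during the shuffle itself. A minimal sketch with illustrative names and sc as in spark-shell:

    import org.apache.spark.HashPartitioner
    import org.apache.spark.rdd.RDD

    val pairs: RDD[(Int, String)] = sc.parallelize(Seq((2, "b"), (1, "a"), (1, "c")))

    // Keys arrive sorted within each partition, because the sort happens
    // as part of the shuffle rather than in a separate pass.
    val processed = pairs
      .repartitionAndSortWithinPartitions(new HashPartitioner(4))
      .mapPartitions(iter => iter.map { case (k, v) => (k, v.toUpperCase) })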