Hi,
I have code that does the following using RDDs,
val outputPartitionCount = 300
val part = new MyOwnPartitioner(outputPartitionCount)
val finalRdd = myRdd.repartitionAndSortWithinPartitions(part)
where myRdd is correctly formed as key, value pairs. I am looking to convert
this to use Dataset/DataFrame.
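For context, a custom partitioner like MyOwnPartitioner extends Spark's Partitioner abstract class. The actual class is my own; the sketch below is a hypothetical stand-in that simply hash-partitions keys:

```scala
import org.apache.spark.Partitioner

// Hypothetical stand-in for MyOwnPartitioner: hash-partitions keys
// into a fixed number of output partitions.
class MyOwnPartitioner(numParts: Int) extends Partitioner {
  override def numPartitions: Int = numParts
  override def getPartition(key: Any): Int = {
    val h = key.hashCode % numParts
    if (h < 0) h + numParts else h // keep the partition index non-negative
  }
}
```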
I haven't worked with Datasets, but would this help?
https://stackoverflow.com/questions/37513667/how-to-create-a-spark-dataset-from-an-rdd
On Jun 23, 2017 5:43 PM, "Keith Chapman" wrote:
Hi Chapman,
You can use "cluster by" to do what you want.
https://deepsense.io/optimize-spark-with-distribute-by-and-cluster-by/
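As the article explains, DISTRIBUTE BY repartitions rows by a column and SORT BY orders rows within each partition; CLUSTER BY is shorthand for both on the same column. A minimal sketch in Spark SQL (table and column names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("cluster-by-sketch").getOrCreate()

// CLUSTER BY key = DISTRIBUTE BY key (repartition on key)
//               + SORT BY key (sort within each partition)
val clustered = spark.sql("SELECT key, value FROM my_table CLUSTER BY key")
```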
2017-06-24 17:48 GMT+07:00 Saliya Ekanayake:
Thanks for the pointer Saliya. I'm looking for an equivalent API in
Dataset/DataFrame for repartitionAndSortWithinPartitions; I've already
converted most of the RDDs to DataFrames.
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 3:48 AM, Saliya Ekanayake wrote:
Hi Nguyen,
This looks promising; it seems I could achieve this using cluster by.
Thanks for the pointer.
Regards,
Keith.
http://keith-chapman.com
On Sat, Jun 24, 2017 at 5:27 AM, nguyen duc Tuan wrote:
Dataset/DataFrame has repartition (which can be used to partition by key)
and sortWithinPartitions.
See example usage here:
https://github.com/tresata/spark-sorted/blob/master/src/main/scala/com/tresata/spark/sorted/sql/GroupSortedDataset.scala#L18
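A sketch of how that maps onto the original snippet (the sample data and
column names here are illustrative, not from your code):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("repartition-sort-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the key/value RDD, already converted to a DataFrame.
val df = Seq(("b", 2), ("a", 1), ("a", 3)).toDF("key", "value")

// Rough equivalent of repartitionAndSortWithinPartitions: hash-partition
// by key into the desired number of partitions, then sort each partition.
val outputPartitionCount = 300
val result = df
  .repartition(outputPartitionCount, col("key"))
  .sortWithinPartitions(col("key"))
```

Note this uses Spark's hash partitioning on the key rather than a custom
Partitioner; DataFrames don't take a user-defined Partitioner directly.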
On Fri, Jun 23, 2017 at 5:43 PM, Keith Chapman wrote: