*Datasets*
val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }
What is the difference between
1)
lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map { }
2)
lstgItem.join(viEvents).map { }
3)
lstgItem.join(viEvents,1200).map { }
4)
lstgItem.join(viEvents, new org.apache.spark.HashPartitioner(1200)).map { }
5)
viEvents.join(lstgItem).map { }
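One thing worth noting about variants (2)-(4): a hash partitioner routes every record with the same key to the same partition, so no partition count can spread a single hot key. A minimal sketch of that computation (mirroring what Spark's HashPartitioner.getPartition does for non-null keys, written as plain Scala so it runs without a cluster):

```scala
// Mirrors Spark's HashPartitioner for non-null keys:
// nonNegativeMod(key.hashCode, numPartitions). Every record with the
// same key lands in the same partition, no matter how many partitions
// you request.
object HashSkewDemo {
  def partitionFor(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  def main(args: Array[String]): Unit = {
    // 1000 records for one hot itemId, 1200 partitions:
    val partitions = (1 to 1000).map(_ => partitionFor(42L, 1200)).toSet
    println(partitions) // Set(42) -- all copies of the hot key share one partition
  }
}
```

This is why raising the partition count from 1200 upward cannot by itself fix skew on one key.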
Which is better in the case where I run the join and one task runs forever because it gets ~1000x more records than the others? Before running the join I do a repartition(), and I still see this behavior. So which is best, or at least which one completes? In my case I am not even able to get it running for 1G, 25G, or 100G datasets.
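One workaround that may help with this kind of key skew is "salting": append a random suffix to the skewed side's keys and replicate the other side once per suffix, so the hot key is spread over several composite keys (and therefore several tasks). The sketch below uses plain Scala collections rather than RDDs, and the names (SaltedJoinDemo, saltedJoin, Salt) are illustrative, not a Spark API; in Spark you would apply the same map/flatMap transformations to the two RDDs before calling join:

```scala
import scala.util.Random

// Salting sketch with plain Scala collections (no cluster needed):
// spread the skewed side over (key, salt) composite keys, and replicate
// the small side once per salt value so every salted key still finds its
// match.
object SaltedJoinDemo {
  val Salt = 8 // how many ways to split each hot key

  def saltedJoin(
      events: Seq[(Long, String)],   // skewed side: many rows per key
      listings: Seq[(Long, String)], // small side: one row per key
      rnd: Random): Seq[(Long, (String, String))] = {
    // skewed side: tag each record with a random salt in [0, Salt)
    val saltedEvents = events.map { case (k, v) => ((k, rnd.nextInt(Salt)), v) }
    // small side: replicate each record once per salt value
    val saltedListings = listings.flatMap { case (k, v) =>
      (0 until Salt).map(s => ((k, s), v))
    }.toMap
    // join on the composite key; the hot key is now spread over Salt keys
    saltedEvents.flatMap { case (sk, ev) =>
      saltedListings.get(sk).map(lv => (sk._1, (ev, lv)))
    }
  }

  def main(args: Array[String]): Unit = {
    val viEvents = (1 to 1000).map(i => (42L, s"event$i")) // one hot itemId
    val listings = Seq((42L, "listing42"))
    val joined = saltedJoin(viEvents, listings, new Random(0))
    println(joined.size) // 1000 -- same pairs as a plain join would produce
  }
}
```

The trade-off is that the small side is replicated Salt times, so this only pays off when one side is much smaller than the other, as with listings here.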
Regards,
Deepak