*Datasets*

val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }

val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }

What is the difference between

1)

lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(1200, viEvents)).map { }


2)

lstgItem.join(viEvents).map { }

3)

lstgItem.join(viEvents,1200).map { }


4)

lstgItem.join(viEvents, new org.apache.spark.HashPartitioner(1200)).map { }

5)

viEvents.join(lstgItem).map { }


Which of these is better when I run the join and one task runs forever
because it gets roughly 1000x more records than the others? Before running
the join I call repartition(), and I still see this behavior. So which is
best, or at least which one completes? In my case I am not even able to get
it running for 1G, 25G, or 100G datasets.
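For context on what I have tried: one common mitigation for this kind of join skew is key salting. The sketch below is only an illustration of that idea, assuming the skew is on the viEvents side and that lstgItem is the side cheap enough to replicate; the SALT value and variable names are my own, not from the job above.

```scala
import scala.util.Random

// Number of sub-keys each item id is split into (tuning assumption).
val SALT = 32

// Replicate each listing once per salt value, so every salted event
// key ((itemId, s)) can still find its listing.
val saltedLstg = lstgItem.flatMap { case (id, lstg) =>
  (0 until SALT).map(s => ((id, s), lstg))
}

// Tag each event with a random salt, spreading a hot item id
// over up to SALT different reduce tasks.
val saltedEvents = viEvents.map { case (id, vi) =>
  ((id, Random.nextInt(SALT)), vi)
}

// Join on the salted key, then drop the salt to recover (itemId, pair).
val joined = saltedLstg.join(saltedEvents, 1200)
  .map { case ((id, _), (lstg, vi)) => (id, (lstg, vi)) }
```

The trade-off is that the replicated side grows by a factor of SALT, so this only helps when one side is small relative to the skew.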
Regards,
Deepak
