*Datasets*

val viEvents = viEventsRaw.map { vi => (vi.get(14).asInstanceOf[Long], vi) }
val lstgItem = listings.map { lstg => (lstg.getItemId().toLong, lstg) }

What is the difference between

1) lstgItem.join(viEvents, new org.apache.spark.RangePartitioner(partitions = 1200, rdd = viEvents)).map { }
2) lstgItem.join(viEvents).map { }
3) lstgItem.join(viEvents, 1200).map { }
4) lstgItem.join(viEvents, new org.apache.spark.HashPartitioner(1200)).map { }
5) viEvents.join(lstgItem).map { }

Which is better in the case where I run the join and one task runs forever because it gets 1000x more records than the others? Before running the join I do repartition() and I still see this behavior. So which is best, or at least which one completes? In my case I am not even able to get it running for the 1G, 25G, or 100G datasets.

Regards,
Deepak