substitute mapPartitions by distinct

2016-05-04 Thread Batselem
Hi, I am trying to remove duplicates from an RDD of tuples in an iterative algorithm. I have found that RDD mapPartitions can be used in place of RDD distinct: first I partition the RDD, then remove duplicates locally within each partition using the mapPartitions transformation. I expect it will be much faster
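
The idea in this message can be sketched in plain Python, without Spark, to show why a per-partition pass is enough once equal records are guaranteed to share a partition. This is only an illustration under stated assumptions: the function names (hash_partition, distinct_in_partition, partitioned_distinct) and the sample tuples are hypothetical and not taken from the thread, and the lists stand in for RDD partitions. In Spark the two steps would roughly correspond to a hash-based repartitioning followed by mapPartitions.

```python
from collections import defaultdict


def hash_partition(records, num_partitions):
    """Route each record to a partition chosen by its hash.

    Equal records have equal hashes, so all copies of a record
    end up in the same partition.
    """
    partitions = defaultdict(list)
    for record in records:
        partitions[hash(record) % num_partitions].append(record)
    return [partitions[i] for i in range(num_partitions)]


def distinct_in_partition(partition):
    """Drop duplicates inside one partition, keeping first occurrences."""
    seen = set()
    for record in partition:
        if record not in seen:
            seen.add(record)
            yield record


def partitioned_distinct(records, num_partitions=4):
    """Hash-partition the records, then deduplicate each partition locally."""
    result = []
    for partition in hash_partition(records, num_partitions):
        result.extend(distinct_in_partition(partition))
    return result


# Hypothetical sample data, not from the original message.
tuples = [(1, 2), (3, 4), (1, 2), (5, 6), (3, 4), (1, 2)]
deduplicated = partitioned_distinct(tuples)
```

The local pass is only correct because of the partitioning step: if duplicates of the same tuple could sit in different partitions, deduplicating each partition separately would leave copies behind. Whether this is actually faster than distinct depends on whether the data is already partitioned that way, which the sketch does not measure.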

GC problem while filtering

2014-12-16 Thread Batselem
Hi, I am trying to filter a large table with 3 columns. My goal is to filter this bigtable using multiple clauses. I filtered bigtable 3 times, but the first filter took about 50 seconds to complete, whereas the second and third filter transformations took about 5 seconds. I wonder if it is because of