Hi, I am trying to remove duplicates from an RDD of tuples in an iterative
algorithm. I have found that it is possible to use RDD mapPartitions in
place of RDD distinct.
First I partitioned the RDD, then removed duplicates locally within each
partition using the mapPartitions transformation. I expect it will be much faster
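
For anyone following the idea above, here is a rough plain-Python model of it (not Spark code, just lists standing in for partitions). The helper names hash_partition, distinct_in_partition and local_distinct are made up for illustration. It assumes the rows are placed into partitions by a hash of the whole tuple, so that equal tuples always land in the same partition; the second half shows that with an arbitrary split, per-partition deduplication alone can leave duplicates across partitions.

```python
from typing import Hashable, Iterable, Iterator, List, Tuple

Row = Tuple[Hashable, ...]


def hash_partition(rows: Iterable[Row], num_partitions: int) -> List[List[Row]]:
    """Place each row in a partition chosen by the hash of the whole tuple.

    Assumption: equal tuples hash equally, so they share a partition.
    """
    partitions: List[List[Row]] = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row) % num_partitions].append(row)
    return partitions


def distinct_in_partition(rows: Iterable[Row]) -> Iterator[Row]:
    """Drop duplicates inside one partition (the per-partition function)."""
    seen = set()
    for row in rows:
        if row not in seen:
            seen.add(row)
            yield row


def local_distinct(partitions: List[List[Row]]) -> List[List[Row]]:
    """Apply the per-partition dedup to every partition independently."""
    return [list(distinct_in_partition(p)) for p in partitions]


rows = [(1, 2), (3, 4), (1, 2), (5, 6), (3, 4), (1, 2)]

# Case 1: partitioned by tuple hash, so local dedup is also a global dedup.
hashed = hash_partition(rows, 3)
deduped = [row for p in local_distinct(hashed) for row in p]

# Case 2: arbitrary split (every other row); duplicates survive across partitions.
arbitrary = [rows[0::2], rows[1::2]]
leftover = [row for p in local_distinct(arbitrary) for row in p]

print(sorted(deduped))
print(len(leftover), "rows left with the arbitrary split")
```

So whether the local approach is a valid replacement depends on how the data was partitioned beforehand, and the partitioning step itself moves data around, which should be counted when comparing timings.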

Hi, I am trying to filter a large table with 3 columns. My goal is to filter
this bigtable using multiple clauses. I filtered bigtable 3 times, but the first
filter took about 50 seconds to complete, whereas the second and third
filter transformations took about 5 seconds. I wonder if it is because of