Hi, I have two RDDs with CSV data as below:
RDD-1

    101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
    101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
    101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
    101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603
    101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb42,19229261643
    101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385
    101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,922926164

RDD-2

    101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160
    101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238
    101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164
    101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,92292164
    101970_5854301839,fbcf5485-e696-4100-9468-a17ec7c5bb41,9226164
    101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164
    101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164

I need to filter RDD-2 so that it keeps only those records whose first column value matches one of the first column values in RDD-1.

Currently, I collect the first column values from RDD-1 on the driver, broadcast them as a set, and then filter RDD-2 against that set:

    // Collect the distinct keys (first CSV column) of RDD-1 and broadcast them
    val rdd1broadcast = sc.broadcast(rdd1.map { uu => uu.split(",")(0) }.collect().toSet)
    // Keep only the RDD-2 records whose key is in the broadcast set
    val rdd2filtered = rdd2.filter { h => rdd1broadcast.value.contains(h.split(",")(0)) }

This filters out of RDD-2 the records whose first column is "101970_5854301838" (the last two records), since that value does not appear in RDD-1.

Is this the best way to accomplish this? I am worried that for large data volumes the collect-and-broadcast step may become an issue, because the full set of keys has to fit in memory on the driver and on every executor. I would appreciate any other suggestions.

-----------
Thanks
Himanish
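
P.S. For reference, here is a minimal self-contained sketch of the two variants I am comparing, run on a subset of the sample data above. The object name `FilterByKey`, the helper names, and the local-mode setup are only for illustration. The join-based variant is an assumption on my part about what an alternative could look like: it avoids collecting the keys to the driver but shuffles both RDDs instead, and I have not benchmarked it.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object FilterByKey {
  // The key is the first CSV column of a record.
  def key(line: String): String = line.split(",")(0)

  // Variant 1 (from the post): collect RDD-1 keys on the driver and broadcast them.
  def filterWithBroadcast(sc: SparkContext, rdd1: RDD[String], rdd2: RDD[String]): RDD[String] = {
    val keys = sc.broadcast(rdd1.map(key).collect().toSet)
    rdd2.filter(line => keys.value.contains(key(line)))
  }

  // Variant 2 (assumed alternative): join RDD-2 against the distinct keys of RDD-1.
  // distinct() ensures each RDD-2 record is emitted at most once.
  def filterWithJoin(rdd1: RDD[String], rdd2: RDD[String]): RDD[String] = {
    val rdd1Keys = rdd1.map(key).distinct().map(k => (k, ()))
    rdd2.keyBy(key).join(rdd1Keys).map { case (_, (line, _)) => line }
  }

  // Subset of the sample records from the post.
  val sample1: Seq[String] = Seq(
    "101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643",
    "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645",
    "101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d8,791191603",
    "101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,290542385")

  val sample2: Seq[String] = Seq(
    "101970_17038953,546853f9-cf07-4700-b202-00f21e7c56d9,7911160",
    "101970_5851048323,218f5485-e58c-4200-a473-348ddb858578,2954238",
    "101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9226164",
    "101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb40,929164",
    "101970_5854301838,fbcf5485-e696-4100-9468-a17ec7c5bb39,26164")

  def main(args: Array[String]): Unit = {
    // Local mode is used here only so the sketch can run standalone.
    val conf = new SparkConf().setAppName("filter-by-key").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd1 = sc.parallelize(sample1)
      val rdd2 = sc.parallelize(sample2)
      println("broadcast variant:")
      filterWithBroadcast(sc, rdd1, rdd2).collect().foreach(println)
      println("join variant (record order is not guaranteed):")
      filterWithJoin(rdd1, rdd2).collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Both variants should keep the same records and drop the two "101970_5854301838" ones. The difference is where the cost goes: the broadcast variant needs the whole key set in memory on the driver and each executor, while the join variant pays for a shuffle of RDD-2.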