Hi, I have a very simple use case. I have an RDD like the following:

d = [[1,2,3,4],[1,5,2,3],[2,3,4,5]]

I want to remove rows that have a duplicate value in a given column and return the remaining rows. For example, to deduplicate on column 1, I would drop either row 1 or row 2 from the final result, because both rows have the same value (1) in column 1. So a possible result is:

output = [[1,2,3,4],[2,3,4,5]]

How do I do this in Spark? Thanks
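
One possible way to approach this, as a minimal sketch rather than a definitive answer: key each row by the value in the chosen column, keep one row per key with `reduceByKey`, then drop the keys again. The helper name `dedupe_by_column` and its 0-based `col` argument are hypothetical, made up for illustration; `map`, `reduceByKey`, and `values` are standard RDD methods. It assumes the values in that column are hashable, and it makes no promise about which of the duplicate rows survives.

```python
def dedupe_by_column(rdd, col):
    """Keep one row per distinct value in column `col` (0-based index).

    Hypothetical helper for illustration; `rdd` is assumed to be an RDD of
    lists whose values in position `col` are hashable.
    """
    return (
        rdd.map(lambda row: (row[col], row))      # key each row by the column value
           .reduceByKey(lambda kept, _dup: kept)  # keep one row per key, drop the rest
           .values()                              # discard the keys, leaving the rows
    )
```

Assuming an existing SparkContext named `sc`, `dedupe_by_column(sc.parallelize(d), 0).collect()` should return two rows for the data above: `[2,3,4,5]` plus either `[1,2,3,4]` or `[1,5,2,3]`. "Column 1" in the question corresponds to index 0 here, and the order of the returned rows is not guaranteed.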
