Hi All,

I have a 5 GB CSV dataset with 69 columns. For each column, I need the count
of occurrences of every distinct value. What is the most efficient way to do
this using Spark with Scala?

Example CSV format :

a,b,c,d
a,c,b,a
b,b,c,d
b,b,c,a
c,b,b,a

Output expecting :

(a,2),(b,2),(c,1) #- First column distinct count
(b,4),(c,1)       #- Second column distinct count
(c,3),(b,2)       #- Third column distinct count
(d,2),(a,3)       #- Fourth column distinct count
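For what it's worth, here is a minimal sketch of one common approach: group by each column and count. The file path, session setup, and the `header=false` option are assumptions based on the sample above, not something confirmed in this post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder()
  .appName("PerColumnValueCounts")
  .getOrCreate()

// The sample rows have no header, so header=false; adjust to your file.
val df = spark.read
  .option("header", "false")
  .csv("/path/to/data.csv")   // hypothetical path

// Cache once so the 5 GB file is not re-read for each of the 69 columns.
df.cache()

// One groupBy/count per column: for each column, this yields
// (value, occurrence count) pairs, matching the expected output.
df.columns.foreach { c =>
  df.groupBy(c)
    .agg(count("*").as("cnt"))
    .show(false)
}
```

Note this launches one job per column (69 jobs). If that proves too slow, a single-pass alternative is to `flatMap` each row into `((columnIndex, value), 1)` pairs and `reduceByKey` over the whole dataset, which reads the file only once.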


Thanks in Advance
