Hi All,

I have a 5 GB CSV dataset with 69 columns. For each column, I need the count
of occurrences of every distinct value. What is the most efficient way to do
this using Spark with Scala?

Example CSV format :

a,b,c,d
a,c,b,a
b,b,c,d
b,b,c,a
c,b,b,a

Output expecting :

(a,2),(b,2),(c,1) #- First column distinct count
(b,4),(c,1)       #- Second column distinct count
(c,3),(b,2)       #- Third column distinct count
(d,2),(a,3)       #- Fourth column distinct count
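For what it's worth, here is a minimal sketch of one common approach: group by each column and count. The file path, session setup, and the `header=false` option are assumptions based on the sample above, not something confirmed in this post.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

val spark = SparkSession.builder()
  .appName("PerColumnValueCounts")
  .getOrCreate()

// The sample rows have no header, so header=false; adjust to your file.
val df = spark.read
  .option("header", "false")
  .csv("/path/to/data.csv")   // hypothetical path

// Cache once so the 5 GB file is not re-read for each of the 69 columns.
df.cache()

// One groupBy/count per column: for each column, this yields
// (value, occurrence count) pairs, matching the expected output.
df.columns.foreach { c =>
  df.groupBy(c)
    .agg(count("*").as("cnt"))
    .show(false)
}
```

Note this launches one job per column (69 jobs). If that proves too slow, a single-pass alternative is to `flatMap` each row into `((columnIndex, value), 1)` pairs and `reduceByKey` over the whole dataset, which reads the file only once.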


Thanks in Advance
