Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Thanks guys. @Filipp Zhinkin Yes, we might have a couple of string columns with 15 million+ unique values that need to be mapped to indices. @Nick Pentreath We are on 2.0.2, though I will check it out. Is it better from a hashing-collision perspective, or can it handle a large volume of data as

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Nick Pentreath
Also check out FeatureHasher in Spark 2.3.0, which is designed to handle this use case in a more natural way than HashingTF (and handles multiple columns at once). On Tue, 10 Apr 2018 at 16:00, Filipp Zhinkin wrote: > Hi Shahab, > > do you actually need to have a few
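To make the FeatureHasher idea concrete, here is a minimal plain-Python sketch of the hashing trick it relies on: each (column, value) pair is hashed into a slot of a fixed-size vector, so no label-to-index table ever has to be materialized. This is an illustration only — Spark's FeatureHasher uses MurmurHash3 and produces ML sparse vectors; `hash_features`, `num_features`, and the use of Python's built-in `hash()` are stand-ins chosen for the sketch.

```python
def hash_features(row, num_features=1 << 18):
    """Sketch of the hashing trick behind FeatureHasher/HashingTF:
    each (column, value) pair is hashed into a fixed-size sparse
    vector, so multiple columns combine into one vector and no
    per-label mapping is kept anywhere."""
    vec = {}
    for col, value in row.items():
        # Spark uses MurmurHash3; Python's hash() stands in here.
        idx = hash(f"{col}={value}") % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec

# Two string columns collapse into one sparse vector of fixed width,
# regardless of how many distinct values the columns contain.
sparse = hash_features({"city": "Boston", "device": "ios"})
```

The trade-off is that distinct values can collide in the same slot, but memory use is bounded by `num_features` rather than by cardinality.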

Re: StringIndexer with high cardinality huge data

2018-04-10 Thread Filipp Zhinkin
Hi Shahab, do you actually need a few columns with such a huge number of categories whose indices depend on the original values' frequency? If not, then you may use a value's hash code as its category, or combine all columns into a single vector using HashingTF. Regards, Filipp. On Tue, Apr 10,
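The "hash code as a category" suggestion above can be sketched in a few lines of plain Python. This is a stateless alternative to StringIndexer: the index is computed from the value itself, so no mapping table is built or held on the driver. The function name and `num_buckets` value are illustrative choices, not Spark defaults, and Python's `hash()` stands in for a real hash function.

```python
def hash_category(value, num_buckets=1 << 20):
    # Derive the category index directly from the value's hash code.
    # Collisions between distinct values are possible, but there is
    # no label-to-index table to fit, store, or broadcast.
    # num_buckets bounds the index range and is a tunable assumption.
    return hash(value) % num_buckets

bucket = hash_category("user_12345")
```

Unlike StringIndexer, the resulting indices carry no frequency ordering, which only matters if downstream code depends on that ordering.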

StringIndexer with high cardinality huge data

2018-04-10 Thread Shahab Yunus
Does StringIndexer keep the entire label-to-index mapping in the memory of the driver machine? It seems to, unless I am missing something. What if our data that needs to be
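To illustrate why driver memory is the concern here, the following is a minimal plain-Python sketch of what StringIndexer's fit step effectively produces: a complete label-to-index map, ordered by descending frequency. `fit_string_indexer` is a hypothetical name for this sketch, not a Spark API; the point is that the whole map is materialized in one place, which is what strains memory at tens of millions of unique values.

```python
from collections import Counter

def fit_string_indexer(values):
    """Sketch of StringIndexer's fit step: count label frequencies,
    then assign indices in descending-frequency order (the most
    frequent label gets index 0, matching StringIndexer's default).
    The full label -> index dict is materialized in memory."""
    freq = Counter(values)
    labels = [label for label, _ in freq.most_common()]
    return {label: i for i, label in enumerate(labels)}

index = fit_string_indexer(["b", "a", "b", "c", "b", "a"])
# "b" occurs most often, so it maps to index 0.
```

With 15 million+ unique strings, that dict alone can run to gigabytes, which is why the hashing-based approaches in the replies above avoid building it at all.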