I'm already using the SparseVector class. ~200 labels
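
For context, here is a minimal sketch of the kind of setup under discussion (illustrative only; the feature indices, values, and parse step are made up, but the APIs are MLlib's SparseVector and NaiveBayes as of Spark 1.0):

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.classification.NaiveBayes

    // Each example: ~2 million features, only ~20-50 non-zero.
    // Vectors.sparse stores just (index, value) pairs, so memory per
    // example should scale with the non-zero count, not the full size.
    val numFeatures = 2000000
    val example = LabeledPoint(
      1.0, // label: one of ~200 classes
      Vectors.sparse(numFeatures,
        Array(3, 401, 99872),        // indices of non-zero features
        Array(1.0, 2.0, 1.0))        // corresponding values
    )

    // Training on an RDD[LabeledPoint] (parseToLabeledPoint is a
    // hypothetical parser for your input format):
    // val training = sc.textFile("hdfs://...").map(parseToLabeledPoint)
    // val model = NaiveBayes.train(training, lambda = 1.0)
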
On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com> wrote:
> How many labels does your dataset have? -Xiangrui
>
> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
> > Which version of mllib are you using? For Spark 1.0, mllib will
> > support sparse feature vectors, which will improve performance a lot
> > when computing the distance between points and centroids.
> >
> > Sincerely,
> >
> > DB Tsai
> > -------------------------------------------------------
> > My Blog: https://www.dbtsai.com
> > LinkedIn: https://www.linkedin.com/in/dbtsai
> >
> >
> > On Sat, Apr 26, 2014 at 5:49 AM, John King <usedforprinting...@gmail.com> wrote:
> >> I'm just wondering: are the SparseVector calculations really taking
> >> the sparsity into account, or just converting to dense?
> >>
> >>
> >> On Fri, Apr 25, 2014 at 10:06 PM, John King <usedforprinting...@gmail.com> wrote:
> >>> I've been trying to use the Naive Bayes classifier. Each example in the
> >>> dataset has about 2 million features, only about 20-50 of which are non-zero,
> >>> so the vectors are very sparse. I keep running out of memory, though, even
> >>> for about 1000 examples on 30 GB of RAM, while the entire dataset is 4 million
> >>> examples. And I would also like to note that I'm using the sparse vector
> >>> class.