How big is your problem and how many labels? -Xiangrui

On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai <dbt...@stanford.edu> wrote:
> Hi Xiangrui,
>
> We also ran into this issue at Alpine Data Labs. We ended up using an LRU
> cache to store the counts, spilling the least-used counts to a distributed
> cache in HDFS.
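
[The LRU-plus-spill scheme described above can be sketched roughly as follows. This is a minimal illustration, not Alpine's actual implementation: the `spill` callback is a hypothetical stand-in for writing evicted counts to an HDFS-backed distributed cache.]

```python
from collections import OrderedDict

class SpillingLRUCounts:
    """Keep the hottest (label, feature) counts in memory; evict the
    least-recently-used entries to external storage when over capacity."""

    def __init__(self, capacity, spill):
        self.capacity = capacity
        self.spill = spill            # callback receiving evicted (key, count)
        self.counts = OrderedDict()   # insertion order == recency order

    def add(self, key, amount=1):
        if key in self.counts:
            self.counts.move_to_end(key)   # mark as most recently used
            self.counts[key] += amount
        else:
            self.counts[key] = amount
            if len(self.counts) > self.capacity:
                # Evict the least-recently-used count to the spill target.
                old_key, old_count = self.counts.popitem(last=False)
                self.spill(old_key, old_count)
```

[With `capacity=2`, adding counts for keys "a", "b", "a", "c" in order evicts "b" (the least recently touched) to the spill target when "c" arrives.]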
>
>
> Sincerely,
>
> DB Tsai
> -------------------------------------------------------
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> Even if the features are sparse, the conditional probabilities are stored
>> in a dense matrix. With 200 labels and 2 million features, you need to
>> store at least 4e8 doubles on the driver node. With multiple
>> partitions, you may need even more memory on the driver. Could you try
>> reducing the number of partitions and giving the driver more RAM to see
>> whether that helps? -Xiangrui
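
[The estimate above works out as follows, assuming one 8-byte double per matrix entry and ignoring JVM object overhead:]

```python
labels = 200
features = 2_000_000

# Dense conditional-probability matrix: one double per (label, feature) pair.
entries = labels * features        # 400,000,000 entries -> the "4e8 doubles"
bytes_needed = entries * 8         # 8 bytes per double

print(entries)                     # 400000000
print(bytes_needed / 2**30)        # ~3 GiB on the driver, before any overhead
```

[So a single copy of the model already needs roughly 3 GiB; duplicated intermediate copies during aggregation across many partitions push the driver's footprint well beyond that.]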
>>
>> On Sun, Apr 27, 2014 at 3:33 PM, John King <usedforprinting...@gmail.com>
>> wrote:
>> > I'm already using the SparseVector class.
>> >
>> > ~200 labels
>> >
>> >
>> > On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng <men...@gmail.com>
>> > wrote:
>> >>
>> >> How many labels does your dataset have? -Xiangrui
>> >>
>> >> On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai <dbt...@stanford.edu> wrote:
>> >> > Which version of mllib are you using? For Spark 1.0, mllib will
>> >> > support sparse feature vector which will improve performance a lot
>> >> > when computing the distance between points and centroid.
>> >> >
>> >> > Sincerely,
>> >> >
>> >> > DB Tsai
>> >> > -------------------------------------------------------
>> >> > My Blog: https://www.dbtsai.com
>> >> > LinkedIn: https://www.linkedin.com/in/dbtsai
>> >> >
>> >> >
>> >> > On Sat, Apr 26, 2014 at 5:49 AM, John King
>> >> > <usedforprinting...@gmail.com> wrote:
>> >> >> I'm just wondering: do the SparseVector calculations really take
>> >> >> the sparsity into account, or just convert to dense?
>> >> >>
>> >> >>
>> >> >> On Fri, Apr 25, 2014 at 10:06 PM, John King
>> >> >> <usedforprinting...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I've been trying to use the Naive Bayes classifier. Each example in
>> >> >>> the dataset has about 2 million features, only about 20-50 of which
>> >> >>> are non-zero, so the vectors are very sparse. I keep running out of
>> >> >>> memory, though, even for about 1000 examples on 30 GB of RAM, while
>> >> >>> the entire dataset is 4 million examples. I would also like to note
>> >> >>> that I'm using the sparse vector class.
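
[A back-of-envelope comparison shows why the sparse examples themselves are not the problem. With ~50 non-zeros out of 2 million features, a sparse (index, value) layout is tiny per example; the exact byte counts below assume a 4-byte int index and 8-byte double value, so they are illustrative rather than MLlib's precise in-memory layout.]

```python
num_features = 2_000_000

# Dense layout: every feature stored as an 8-byte double, zeros included.
dense_bytes = num_features * 8            # 16 MB per example

# Sparse layout: only the ~50 non-zeros, as (int32 index, float64 value) pairs.
nnz = 50
sparse_bytes = nnz * (4 + 8)              # 600 bytes per example

print(dense_bytes // sparse_bytes)        # sparse is >26,000x smaller here
```

[This is why the dense conditional-probability matrix on the driver, not the training data, dominates memory in this setting.]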
>> >> >>
>> >> >>
>> >
>> >
>
>
