On Tue, Feb 14, 2012 at 2:25 AM, Lance Norskog <goks...@gmail.com> wrote:

> ...
> OnlineLogisticRegression allocates DenseVector/DenseMatrix objects- if
> it used RandomSparse Vector/Matrix could it operate on million-term
> sparse arrays?
>

Not likely.

The feature vectors that come in are sparse, and the updates to the internal
coefficients are sparse, but over time many of the internal coefficients are
updated, leading to a non-sparse model.  Once enough coefficients become
non-zero, the memory used by a sparse representation would exceed that of
the dense representation, and speed would also be significantly impaired.

You can definitely try it out, however.
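To make that tradeoff concrete, here is a rough back-of-the-envelope sketch (plain Python, purely illustrative; the per-entry byte costs are assumptions, not measured Mahout numbers). A sparse entry needs at least an int index plus a double value, while a dense array pays a flat 8 bytes per coefficient, so sparse storage loses once enough coefficients fill in:

```python
# Rough memory model.  These per-entry costs are assumed, illustrative
# figures, not measured Mahout values.
DENSE_BYTES_PER_COEFF = 8       # one double per coefficient
SPARSE_BYTES_PER_NONZERO = 16   # int index + double value, ignoring hash overhead

def dense_bytes(num_coefficients):
    return num_coefficients * DENSE_BYTES_PER_COEFF

def sparse_bytes(num_nonzero):
    return num_nonzero * SPARSE_BYTES_PER_NONZERO

def sparse_wins(num_coefficients, fill_fraction):
    """True if the sparse representation is smaller at this fill level."""
    nonzero = int(num_coefficients * fill_fraction)
    return sparse_bytes(nonzero) < dense_bytes(num_coefficients)

# With these assumptions, sparse storage loses once more than about
# half of the coefficients have been touched by updates.
print(sparse_wins(1_000_000, 0.10))  # True: at 10% fill, sparse is smaller
print(sparse_wins(1_000_000, 0.60))  # False: at 60% fill, dense is smaller
```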


> The problem is that seq2sparse has several text-processing options
> which seq2encoded does not, so it is harder to experiment with the sgd
> version of asf emails examples. So, if I want to use those text
> options it sounds like I need to run seq2sparse and then write a
> sparse2dense random projector job?


No.  The feature vector passed to the OnlineLearner can be sparse.   It is
just the internal coefficient data that is dense, not the training vectors.
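A minimal sketch of that separation (plain Python, not Mahout code; the function name and the plain logistic update are illustrative assumptions): the training instance stays a sparse {index: value} map, the coefficients live in a dense array, and one SGD step only touches the indices present in the instance:

```python
import math

def sgd_step(weights, instance, target, learning_rate=0.1):
    """One logistic-regression SGD step (illustrative sketch).

    weights:  dense list of coefficients, one slot per feature
    instance: sparse feature vector as an {index: value} dict
    target:   0 or 1
    """
    # Dot product runs only over the nonzero entries of the sparse input.
    score = sum(weights[i] * v for i, v in instance.items())
    p = 1.0 / (1.0 + math.exp(-score))
    error = target - p
    # Each update is sparse too, but across many examples different
    # indices get touched, which is why the dense model fills in.
    for i, v in instance.items():
        weights[i] += learning_rate * error * v
    return p

weights = [0.0] * 10                          # dense internal state
p = sgd_step(weights, {2: 1.0, 7: 0.5}, target=1)
# Only indices 2 and 7 changed; the model storage itself is still dense.
```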

Subject to memory limits (16 bytes per vocabulary item times (k - 1)
categories), you should be entirely able to use SGD on input vectors that
are not hashed.
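Plugging numbers into that bound (a hypothetical example; the 16-bytes-per-item figure is taken from the formula above): a 1,000,000-term vocabulary with k = 21 categories needs about 16 * 1,000,000 * 20 bytes, roughly 320 MB of coefficient storage:

```python
def model_bytes(vocabulary_size, k):
    """Memory bound from the formula above: 16 bytes per vocabulary
    item for each of the (k - 1) category columns."""
    return 16 * vocabulary_size * (k - 1)

# Hypothetical example: million-term vocabulary, 21 categories.
mb = model_bytes(1_000_000, 21) / 1e6
print(f"{mb:.0f} MB")  # 320 MB
```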
