Ok, I think I understand things better now.

For Spark's current implementation, you would need to map those features as
you mention. You could also use, say, StringIndexer -> OneHotEncoder, or
VectorIndexer. You could create a Pipeline to handle the mapping and
training (e.g.
http://spark.apache.org/docs/latest/ml-guide.html#example-pipeline).
Pipeline supports persistence.
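
As a rough sketch of what that pipeline could look like (the column names
and DataFrame here are just placeholders, not your actual schema):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

    // StringIndexer maps the raw IDs to a compact 0-based index;
    // OneHotEncoder turns that index into a sparse feature vector.
    val indexer = new StringIndexer()
      .setInputCol("thingId")
      .setOutputCol("thingIndex")
    val encoder = new OneHotEncoder()
      .setInputCol("thingIndex")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")

    val pipeline = new Pipeline().setStages(Array(indexer, encoder, lr))
    val model = pipeline.fit(trainingDF)  // trainingDF: your training DataFrame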

But it depends on your scoring use case too: a Spark pipeline can be saved
and then reloaded, but you need all of Spark's dependencies in your serving
app, which is often not ideal. If you're doing bulk scoring offline, then it
may suit.
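
For instance (the path here is just a placeholder):

    // Persist the fitted pipeline, then reload and apply it in another job.
    model.write.overwrite().save("/models/thing-lr")
    val reloaded = org.apache.spark.ml.PipelineModel.load("/models/thing-lr")
    val scored = reloaded.transform(newDataDF)  // newDataDF: data to score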

Honestly though, for that data size I'd certainly go with something like
Liblinear :) Spark will ultimately scale better with the number of training
examples for very large-scale problems. However, there are currently
limitations around model dimension and sparse weight vectors. There are
potential solutions to these, but they haven't been implemented yet.

On Fri, 11 Mar 2016 at 18:35 Daniel Siegmann <daniel.siegm...@teamaol.com>
wrote:

> On Fri, Mar 11, 2016 at 5:29 AM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
>> Would you mind letting us know the number of training examples in the
>> datasets? Also, what do your features look like? Are they text,
>> categorical, etc.? You mention that most rows only have a few features,
>> and all rows together have a few tens of thousands of features, yet your
>> max feature value is 20 million. How are you constructing your feature
>> vectors to get a 20 million size? The only realistic way I can see this
>> situation occurring in practice is with feature hashing (HashingTF).
>>
>
> The sub-sample I'm currently training on is about 50K rows, so ... small.
>
> The features causing this issue are numeric (int) IDs for ... let's call
> it "Thing". For each Thing in the record, we set the feature Thing.id to
> a value of 1.0 in our vector (which is of course a SparseVector). I'm not
> sure how IDs are generated for Things, but they can be large numbers.
>
> The largest Thing ID is around 20 million, so that ends up being the size
> of the vector. But in fact there are fewer than 10,000 unique Thing IDs in
> this data. The mean number of features per record in what I'm currently
> training against is 41, while the maximum for any given record is 1,754.
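>
> For illustration, the vectors are built roughly like this (a simplified
> sketch; the IDs shown are made up):
>
>     import org.apache.spark.mllib.linalg.Vectors
>
>     // One record's Thing IDs become the indices set to 1.0, and the
>     // vector is sized by the largest possible ID (~20 million), not by
>     // the ~10,000 distinct IDs actually present.
>     val vectorSize = 20000000
>     val thingIds = Array(17, 4523, 19876543)  // hypothetical IDs, sorted
>     val values = Array.fill(thingIds.length)(1.0)
>     val features = Vectors.sparse(vectorSize, thingIds, values)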
>
> It is possible to map the features into a small set (just need to
> zipWithIndex), but this is undesirable because of the added complexity (not
> just for the training, but also for anything wanting to score against the
> model). It might be a little easier if this could be encapsulated within
> the model object itself (perhaps via composition), though I'm not sure how
> feasible that is.
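>
> Something like the following would do it (a rough sketch; "records" and
> "thingIds" are hypothetical names for our parsed input RDD and its field
> holding each row's Thing IDs):
>
>     // Build a Thing ID -> contiguous index mapping once over the data.
>     val idToIndex: Map[Int, Int] = records
>       .flatMap(_.thingIds)
>       .distinct()
>       .zipWithIndex()
>       .collect()
>       .map { case (id, idx) => id -> idx.toInt }
>       .toMap
>     val numFeatures = idToIndex.size  // ~10,000 instead of ~20 million
>
>     // Training *and* scoring then have to remap IDs through idToIndex
>     // before building each SparseVector -- the added complexity above.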
>
> But I'd rather not bother with dimensionality reduction at all - since we
> can train using liblinear in just a few minutes, it doesn't seem necessary.
>
>
>>
>> MultivariateOnlineSummarizer uses dense arrays, but it should be
>> possible to support sparse data. In theory, though, the result will tend
>> to be dense anyway, unless there are very many entries in the input
>> feature vector that never occur and are actually zero throughout the data
>> set (which seems to be the case with your data?). So I doubt that using
>> sparse vectors in the summarizer would improve performance in general.
>>
>
> Yes, that is exactly my case - the vast majority of entries in the input
> feature vector will *never* occur. Presumably that means most of the
> values in the aggregators' arrays will be zero.
>
>
>>
>> LR doesn't accept a sparse weight vector, as it uses dense vectors for
>> coefficients and gradients currently. When using L1 regularization, it
>> could support sparse weight vectors, but the current implementation doesn't
>> do that yet.
>>
>
> Good to know it is theoretically possible to implement. I'll have to give
> it some thought. In the meantime I guess I'll experiment with coalescing
> the data to minimize the communication overhead.
>
> Thanks again.
>
