In particular, this is the Samsara implementation of double-weighted ALS:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626
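For readers following the algebra, here is a rough NumPy sketch of the kind of half-step such a double-weighted solver performs: per-data-point confidence weights (the "implicit feedback" weighting) combined with a per-row regularization weight (as in weighted-lambda regularization). All names and values are illustrative and not taken from the patch above.

```python
import numpy as np

def solve_user_factors(Y, C, P, lam, reg_weights):
    """One ALS half-step: re-solve every user factor row given fixed item
    factors Y, per-entry confidence weights C, binary preferences P, a base
    regularization rate lam, and a per-user regularization weight."""
    n_users, k = C.shape[0], Y.shape[1]
    X = np.zeros((n_users, k))
    for u in range(n_users):
        Cu = np.diag(C[u])                              # data-point weights
        A = Y.T @ Cu @ Y + lam * reg_weights[u] * np.eye(k)
        b = Y.T @ Cu @ P[u]
        X[u] = np.linalg.solve(A, b)                    # normal equations
    return X

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 2
R = rng.integers(0, 5, size=(n_users, n_items)).astype(float)  # raw counts
P = (R > 0).astype(float)        # binary preference matrix
C = 1.0 + 2.0 * R                # confidence weights (alpha = 2, illustrative)
Y = rng.normal(size=(n_items, k))
n_obs = P.sum(axis=1)            # weight each user's reg rate by #observations
X = solve_user_factors(Y, C, P, lam=0.1, reg_weights=n_obs)
```

Alternating this with the symmetric item-side solve gives the full ALS loop; the point of the patch, as described below, is that both weightings live in one pure algebraic solver.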
On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> Jim,
>
> If ALS is of interest, and as far as weighted ALS is concerned (since we
> already have trivial regularized ALS in the "decompositions" package),
> here's an uncommitted Samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> It combines weights on both data points (a.k.a. "implicit feedback" ALS)
> and regularization rates (paper references are given). We combine both
> approaches in one (which is novel, I guess, but simple enough).
> Obviously the final solver can also be used as purely
> reg-rate-regularized if wanted, making it equivalent to one of the
> papers.
>
> You may know the implicit feedback paper from MLlib's implicit ALS, but
> unlike the way it was done over there (as a use-case-specific problem
> that takes input before features are even extracted), we split the
> problem into a pure algebraic solver (the double-weighted ALS math) and
> leave feature extraction outside of this issue per se (it can be added
> as a separate adapter).
>
> The reason is that a use-case-specific implementation does not
> necessarily leave room for feature extraction that differs from the
> paper's described use case of partially consumed streamed videos.
> (E.g., instead of videos one could count visits, clicks, or add-to-cart
> events, which may need an additional hyperparameter fitted for them as
> part of extracting features and converting observations into
> "weights".)
>
> The biggest problem with these ALS methods, however, is that all
> hyperparameters require multidimensional cross-validation and
> optimization. I think I mentioned it before in the list of desired
> solutions; as it stands, Mahout does not have a hyperparameter fitting
> routine.
>
> In practice, when using these kinds of ALS, we have a case of
> multidimensional hyperparameter optimization.
> One of the dimensions comes from the fitter (the reg rate, or base reg
> rate in the case of weighted regularization), and the others come from
> the feature extraction process. E.g., in the original paper they
> introduce (at least) two formulas to extract confidence weights from
> the streaming video observations, and each of them has one parameter,
> alpha, which in the context of the whole problem effectively becomes
> yet another hyperparameter to fit. In other use cases, where your
> confidence measurements may come from different sources and
> observations, the confidence extraction may actually have even more
> hyperparameters to fit than just one. And once we have a
> multidimensional case, simple approaches (like grid or random search)
> become either cost-prohibitive or ineffective, due to the curse of
> dimensionality.
>
> At the time I was contributing that method, I was using it in
> conjunction with a multidimensional Bayesian optimizer, but the company
> I wrote it for did not have that approved for contribution (unlike
> weighted ALS) at the time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization and
> performance a bit later.
>
> On the feature extraction front (as in implicit feedback ALS per Koren
> et al.), this is an ideal use case for a more general R-like formula
> approach, which is also on the desired list of things to have.
>
> So I guess we really have 3 problems here:
> (1) double-weighted ALS
> (2) Bayesian optimization and cross-validation in an n-dimensional
> hyperparameter space
> (3) feature extraction per a (preferably R-like) formula.
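The two confidence-extraction formulas Dmitriy alludes to, each carrying an alpha, are the linear and log rules from the implicit-feedback literature (Hu, Koren, Volinsky). A minimal sketch of why alpha becomes an extra hyperparameter; the epsilon and sample values here are illustrative, not from the patch:

```python
import numpy as np

def confidence_linear(r, alpha):
    """Linear rule: c_ui = 1 + alpha * r_ui."""
    return 1.0 + alpha * r

def confidence_log(r, alpha, eps=1.0):
    """Log rule: c_ui = 1 + alpha * log(1 + r_ui / eps)."""
    return 1.0 + alpha * np.log1p(r / eps)

# raw observations, e.g. watch counts or click counts
r = np.array([0.0, 1.0, 10.0])

# the choice of rule AND the value of alpha both change the weights the
# solver sees, so alpha must be cross-validated alongside the reg rate
c_lin = confidence_linear(r, alpha=40.0)
c_log = confidence_log(r, alpha=1.0)
```

Since the solver only ever sees the resulting weight matrix, keeping this extraction outside the algebraic solver (as the patch does) is what lets other observation types plug in their own rules and hyperparameters.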
>
> -d
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap....@outlook.com>
> wrote:
>
>> +1 to GLMs
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>> -------- Original message --------
>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say it would be best
>> to implement GLMs
>> (https://en.wikipedia.org/wiki/Generalized_linear_model); from there,
>> logistic regression is a trivial extension.
>>
>> Buyer beware: GLMs will be a bit of work. Doable, but that would be
>> jumping in neck-first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms; you'll see logistic
>> regression and GLMs are in there.
>>
>> If you have an algorithm you are particularly intimate with, or
>> explicitly need/want, feel free to open a JIRA and assign it to
>> yourself.
>>
>> There is also a case to be made for implementing the ALS:
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world-class recommenders; a toy ALS implementation
>> might help us think through how the other recommenders (e.g. CCO) will
>> 'fit' into the framework. I.e., ALS becomes the toy prototype
>> recommender that helps us think through building out that section of
>> the framework.
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."
>> -Virgil*
>>
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <j...@jagunet.com>
>> wrote:
>>
>>> My own thoughts are that logistic regression seems a more
>>> "generalized" and hence more useful algorithm to be factored in, at
>>> least in the use cases that I've been toying with.
>>>
>>> So I'd like to help out with that if wanted...
>>>
>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sxk1...@hotmail.com>
>>>> wrote:
>>>>
>>>> Trevor et al,
>>>>
>>>> I'd like to contribute an algorithm or two in Samsara using Spark,
>>>> as I would like to compare and contrast Mahout with R Server for a
>>>> data science pipeline / machine learning repo that I'm working on.
>>>> Looking at the list of algorithms
>>>> (https://mahout.apache.org/users/basics/algorithms.html), is there
>>>> an algorithm for Spark that would be beneficial for the community?
>>>> My use cases would typically be around clustering or real-time
>>>> machine learning for building recommendations on the fly. The
>>>> algorithms I see that could potentially be useful are:
>>>> 1) Matrix factorization with ALS
>>>> 2) Logistic regression with SVD.
>>>>
>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>> Thanks in advance.
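Trevor's point that logistic regression falls out of GLMs as a trivial extension can be seen in the generic GLM fitting loop (IRLS / Fisher scoring): specialize the family to binomial with a logit link and the loop is logistic regression. A minimal self-contained NumPy sketch, illustrative only and not Mahout/Samsara code:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression by IRLS: the generic GLM loop with the
    binomial family and logit link plugged in."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse link (sigmoid)
        w = mu * (1.0 - mu)                  # GLM working weights
        z = eta + (y - mu) / np.maximum(w, 1e-12)   # working response
        # one weighted least-squares solve per iteration
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX + ridge * np.eye(p), X.T @ (w * z))
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + 1 feature
true_beta = np.array([-0.5, 2.0])
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(200) < prob).astype(float)
beta = irls_logistic(X, y)
```

Swapping in a different family and link (Poisson with log link, Gaussian with identity) changes only the `mu`, `w`, and `z` lines, which is why a GLM framework makes logistic regression nearly free.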