In particular, this is the Samsara implementation of double-weighted ALS:
https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626
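For readers following the algebra, here is a rough NumPy sketch of the kind of half-step such a double-weighted solver performs: per-data-point confidence weights (the "implicit feedback" weighting) combined with a per-row regularization weight (as in weighted-lambda regularization). All names and values are illustrative and not taken from the patch above.

```python
import numpy as np

def solve_user_factors(Y, C, P, lam, reg_weights):
    """One ALS half-step: re-solve every user factor row given fixed item
    factors Y, per-entry confidence weights C, binary preferences P, a base
    regularization rate lam, and a per-user regularization weight."""
    n_users, k = C.shape[0], Y.shape[1]
    X = np.zeros((n_users, k))
    for u in range(n_users):
        Cu = np.diag(C[u])                              # data-point weights
        A = Y.T @ Cu @ Y + lam * reg_weights[u] * np.eye(k)
        b = Y.T @ Cu @ P[u]
        X[u] = np.linalg.solve(A, b)                    # normal equations
    return X

rng = np.random.default_rng(0)
n_users, n_items, k = 4, 6, 2
R = rng.integers(0, 5, size=(n_users, n_items)).astype(float)  # raw counts
P = (R > 0).astype(float)        # binary preference matrix
C = 1.0 + 2.0 * R                # confidence weights (alpha = 2, illustrative)
Y = rng.normal(size=(n_items, k))
n_obs = P.sum(axis=1)            # weight each user's reg rate by #observations
X = solve_user_factors(Y, C, P, lam=0.1, reg_weights=n_obs)
```

Alternating this with the symmetric item-side solve gives the full ALS loop; the point of the patch, as described below, is that both weightings live in one pure algebraic solver.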
On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> Jim,
>
> If ALS is of interest, and as far as weighted ALS is concerned (since we
> already have trivial regularized ALS in the "decompositions" package),
> here's an uncommitted Samsara-compatible patch from a while back:
> https://issues.apache.org/jira/browse/MAHOUT-1365
>
> It combines weights on both data points (a.k.a. "implicit feedback" ALS)
> and regularization rates (paper references are given). We combine both
> approaches in one (which is novel, I guess, but simple enough).
> Obviously the final solver can also be used as purely
> reg-rate-regularized if wanted, making it equivalent to one of the
> papers.
>
> You may know the implicit feedback paper from MLlib's implicit ALS, but
> unlike the way it was done over there (as a use-case-specific problem
> that takes input before features are even extracted), we split the
> problem into a pure algebraic solver (the double-weighted ALS math) and
> leave feature extraction outside of this issue per se (it can be added
> as a separate adapter).
>
> The reason is that a use-case-specific implementation does not
> necessarily leave room for feature extraction that differs from the
> paper's described use case of partially consumed streamed videos.
> (E.g., instead of videos one could count visits, clicks, or add-to-cart
> events, which may need an additional hyperparameter fitted for them as
> part of extracting features and converting observations into
> "weights".)
>
> The biggest problem with these ALS methods, however, is that all
> hyperparameters require multidimensional cross-validation and
> optimization. I think I mentioned it before in the list of desired
> solutions; as it stands, Mahout does not have a hyperparameter fitting
> routine.
>
> In practice, when using these kinds of ALS, we have a case of
> multidimensional hyperparameter optimization.
> One of the dimensions comes from the fitter (the reg rate, or base reg
> rate in the case of weighted regularization), and the others come from
> the feature extraction process. E.g., in the original paper they
> introduce (at least) two formulas to extract confidence weights from
> the streaming video observations, and each of them has one parameter,
> alpha, which in the context of the whole problem effectively becomes
> yet another hyperparameter to fit. In other use cases, where your
> confidence measurements may come from different sources and
> observations, the confidence extraction may actually have even more
> hyperparameters to fit than just one. And once we have a
> multidimensional case, simple approaches (like grid or random search)
> become either cost-prohibitive or ineffective, due to the curse of
> dimensionality.
>
> At the time I was contributing that method, I was using it in
> conjunction with a multidimensional Bayesian optimizer, but the company
> I wrote it for did not have that approved for contribution (unlike
> weighted ALS) at the time.
>
> Anyhow, perhaps you could read the algebra in both ALS papers there and
> ask questions, and we could worry about hyperparameter optimization and
> performance a bit later.
>
> On the feature extraction front (as in implicit feedback ALS per Koren
> et al.), this is an ideal use case for a more general R-like formula
> approach, which is also on the desired list of things to have.
>
> So I guess we really have 3 problems here:
> (1) double-weighted ALS
> (2) Bayesian optimization and cross-validation in an n-dimensional
> hyperparameter space
> (3) feature extraction per a (preferably R-like) formula.
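The two confidence-extraction formulas Dmitriy alludes to, each carrying an alpha, are the linear and log rules from the implicit-feedback literature (Hu, Koren, Volinsky). A minimal sketch of why alpha becomes an extra hyperparameter; the epsilon and sample values here are illustrative, not from the patch:

```python
import numpy as np

def confidence_linear(r, alpha):
    """Linear rule: c_ui = 1 + alpha * r_ui."""
    return 1.0 + alpha * r

def confidence_log(r, alpha, eps=1.0):
    """Log rule: c_ui = 1 + alpha * log(1 + r_ui / eps)."""
    return 1.0 + alpha * np.log1p(r / eps)

# raw observations, e.g. watch counts or click counts
r = np.array([0.0, 1.0, 10.0])

# the choice of rule AND the value of alpha both change the weights the
# solver sees, so alpha must be cross-validated alongside the reg rate
c_lin = confidence_linear(r, alpha=40.0)
c_log = confidence_log(r, alpha=1.0)
```

Since the solver only ever sees the resulting weight matrix, keeping this extraction outside the algebraic solver (as the patch does) is what lets other observation types plug in their own rules and hyperparameters.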
>
> -d
>
> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap....@outlook.com>
> wrote:
>
>> +1 to GLMs
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>> -------- Original message --------
>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>> To: dev@mahout.apache.org
>> Subject: Re: Contributing an algorithm for samsara
>>
>> Jim is right, and I would take it one further and say it would be best
>> to implement GLMs
>> (https://en.wikipedia.org/wiki/Generalized_linear_model); from there,
>> logistic regression is a trivial extension.
>>
>> Buyer beware: GLMs will be a bit of work. Doable, but that would be
>> jumping in neck-first for both Jim and Saikat...
>>
>> MAHOUT-1928 and MAHOUT-1929
>>
>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>
>> ^^ currently open JIRAs around Algorithms; you'll see logistic
>> regression and GLMs are in there.
>>
>> If you have an algorithm you are particularly intimate with, or
>> explicitly need/want, feel free to open a JIRA and assign it to
>> yourself.
>>
>> There is also a case to be made for implementing the ALS:
>>
>> 1) It's a much better 'beginner' project.
>> 2) Mahout has some world-class recommenders; a toy ALS implementation
>> might help us think through how the other recommenders (e.g. CCO) will
>> 'fit' into the framework. I.e., ALS becomes the toy prototype
>> recommender that helps us think through building out that section of
>> the framework.
>>
>> Trevor Grant
>> Data Scientist
>> https://github.com/rawkintrevo
>> http://stackexchange.com/users/3002022/rawkintrevo
>> http://trevorgrant.org
>>
>> *"Fortunate is he, who is able to know the causes of things."
>> -Virgil*
>>
>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <j...@jagunet.com>
>> wrote:
>>
>>> My own thoughts are that logistic regression seems a more
>>> "generalized" and hence more useful algorithm to be factored in, at
>>> least in the use cases that I've been toying with.
>>>
>>> So I'd like to help out with that if wanted...
>>>
>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sxk1...@hotmail.com>
>>>> wrote:
>>>>
>>>> Trevor et al,
>>>>
>>>> I'd like to contribute an algorithm or two in Samsara using Spark,
>>>> as I would like to compare and contrast Mahout with R Server for a
>>>> data science pipeline / machine learning repo that I'm working on.
>>>> Looking at the list of algorithms
>>>> (https://mahout.apache.org/users/basics/algorithms.html), is there
>>>> an algorithm for Spark that would be beneficial for the community?
>>>> My use cases would typically be around clustering or real-time
>>>> machine learning for building recommendations on the fly. The
>>>> algorithms I see that could potentially be useful are:
>>>> 1) Matrix factorization with ALS
>>>> 2) Logistic regression with SVD.
>>>>
>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>> Thanks in advance.
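Trevor's point that logistic regression falls out of GLMs as a trivial extension can be seen in the generic GLM fitting loop (IRLS / Fisher scoring): specialize the family to binomial with a logit link and the loop is logistic regression. A minimal self-contained NumPy sketch, illustrative only and not Mahout/Samsara code:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-6):
    """Fit logistic regression by IRLS: the generic GLM loop with the
    binomial family and logit link plugged in."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        eta = X @ beta                       # linear predictor
        mu = 1.0 / (1.0 + np.exp(-eta))      # inverse link (sigmoid)
        w = mu * (1.0 - mu)                  # GLM working weights
        z = eta + (y - mu) / np.maximum(w, 1e-12)   # working response
        # one weighted least-squares solve per iteration
        WX = X * w[:, None]
        beta = np.linalg.solve(X.T @ WX + ridge * np.eye(p), X.T @ (w * z))
    return beta

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])  # intercept + 1 feature
true_beta = np.array([-0.5, 2.0])
prob = 1.0 / (1.0 + np.exp(-(X @ true_beta)))
y = (rng.random(200) < prob).astype(float)
beta = irls_logistic(X, y)
```

Swapping in a different family and link (Poisson with log link, Gaussian with identity) changes only the `mu`, `w`, and `z` lines, which is why a GLM framework makes logistic regression nearly free.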