> On Feb 25, 2017, at 5:41 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>
> Dmitry,
>
> I have skimmed through the current Samsara implementation and your input
> below and have some initial questions. For starters, I would like to take
> advantage of the work you've already done and bring that into production state
+1. It looks v. impressive.

> Given that, here are some thoughts/questions:
>
> 1) What work does the pull request below still need: unit tests,
> integration tests? The implementation seems complete from reading the
> code, but I'm coming into this new, so I'm not sure here.
>
> 2) It seems to me that your points 2 and 3 could be written as generic
> Mahout modules that can be used by all algorithms as appropriate. What
> do you think?

Would it make sense to keep them as-is, and "pull them out", as it were,
should they prove to be wanted/needed by the other algo users?

> 3) On the feature extraction per R-like formula, can you elaborate more
> here? Are you talking about feature extraction using R-like dataframes
> and operators?
>
> More later as I read through the papers.
>
> ________________________________
> From: Dmitriy Lyubimov <dlie...@gmail.com>
> Sent: Friday, February 17, 2017 1:45 PM
> To: dev@mahout.apache.org
> Subject: Re: Contributing an algorithm for samsara
>
> In particular, this is the Samsara implementation of double-weighted ALS:
> https://github.com/apache/mahout/pull/14/files#diff-0fbeb8b848ed0c5e3f782c72569cf626
>
> On Fri, Feb 17, 2017 at 1:33 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
>> Jim,
>>
>> If ALS is of interest, and as far as weighted ALS is concerned (since we
>> already have a trivial regularized ALS in the "decompositions" package),
>> here's an uncommitted Samsara-compatible patch from a while back:
>> https://issues.apache.org/jira/browse/MAHOUT-1365
>>
>> It combines weights on both the data points (a.k.a. "implicit feedback"
>> ALS) and the regularization rates (paper references are given). We
>> combine both approaches in one, which is novel, I guess, but still
>> simple enough. Obviously the final solver can also be used as a purely
>> reg-rate-regularized one if wanted, making it equivalent to one of the
>> papers.
>>
>> You may know the implicit feedback paper from MLlib's implicit ALS, but
>> unlike what was done over there (as a use-case sort of problem that
>> takes input before features were even extracted), we split the problem
>> into a pure algebraic solver (the double-weighted ALS math) and leave
>> feature extraction outside of this issue per se (it can be added as a
>> separate adapter).
>>
>> The reason for that is that a specific use-case-oriented implementation
>> does not necessarily leave room for feature extraction that differs from
>> the use case described in the paper, i.e., partially consumed streamed
>> videos. (E.g., instead of videos one could count visits, clicks, or
>> add-to-cart events, which may need an additional hyperparameter found
>> for them as part of extracting features and converting observations
>> into "weights".)
>>
>> The biggest problem with these ALS methods, however, is that all the
>> hyperparameters require multidimensional cross-validation and
>> optimization. I think I mentioned this before in the list of desired
>> solutions; as it stands, Mahout does not have a hyperparameter fitting
>> routine.
>>
>> In practice, when using this kind of ALS, we have a case of
>> multidimensional hyperparameter optimization. One parameter comes from
>> the fitter (the reg rate, or the base reg rate in the case of weighted
>> regularization), and the others come from the feature extraction
>> process.
>> E.g., in the original paper they introduce (at least) two formulas to
>> extract measure weights from the streaming-video observations, and each
>> of them has one parameter, alpha, which in the context of the whole
>> problem effectively becomes yet another hyperparameter to fit. In other
>> use cases, when your confidence measurement may come from different
>> sources and observations, the confidence extraction may actually have
>> even more hyperparameters to fit than just one. And when we have a
>> multidimensional case, simple approaches (like grid or random search)
>> become either cost-prohibitive or ineffective, due to the curse of
>> dimensionality.
>>
>> At the time I was contributing that method, I was using it in
>> conjunction with a multidimensional Bayesian optimizer, but the company
>> I wrote it for did not have it approved for contribution (unlike the
>> weighted ALS) at that time.
>>
>> Anyhow, perhaps you could read the algebra in both ALS papers there and
>> ask questions, and we could worry about hyperparameter optimization and
>> performance a bit later.
>>
>> On the feature extraction front (as in the implicit feedback ALS per
>> Koren et al.), this is an ideal use case for a more general R-like
>> formula approach, which is also on the desired list of things to have.
>>
>> So I guess we really have 3 problems here:
>> (1) double-weighted ALS
>> (2) Bayesian optimization and cross-validation in an n-dimensional
>> hyperparameter space
>> (3) feature extraction per a (preferably R-like) formula.
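To make problem (1) above concrete for readers new to the thread, here is a sketch of the double-weighted idea: Hu-Koren-Volinsky confidence weights on the data points (c = 1 + alpha * r) combined with an ALS-WR-style regularization rate scaled by each row's observation count. This is an illustrative in-memory toy in pure Python, not the MAHOUT-1365 patch (which is distributed Samsara code); all names and the toy matrix are invented.

```python
# Illustrative sketch only: tiny "double-weighted" ALS.
import random

def solve(A, b):
    """Gauss-Jordan elimination for a small dense system A x = b."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def als_half_step(R, fixed, alpha, lam, k):
    """Re-solve one side's factors with the other side (`fixed`) held."""
    out = []
    for row in R:
        n_obs = sum(1 for r in row if r > 0) or 1
        # Normal equations: (F^T C F + lam * n_obs * I) x = F^T C p,
        # i.e. confidence weights on data points AND a per-row-scaled
        # regularization rate (the "double-weighted" part).
        A = [[lam * n_obs if u == v else 0.0 for v in range(k)]
             for u in range(k)]
        b = [0.0] * k
        for j, r in enumerate(row):
            c = 1.0 + alpha * r           # confidence weight on this cell
            p = 1.0 if r > 0 else 0.0     # binarized preference
            f = fixed[j]
            for u in range(k):
                b[u] += c * p * f[u]
                for v in range(k):
                    A[u][v] += c * f[u] * f[v]
        out.append(solve(A, b))
    return out

random.seed(7)
R = [[5, 0, 1],
     [4, 0, 0],
     [0, 3, 4]]
k, alpha, lam = 2, 10.0, 0.1
U = [[random.random() for _ in range(k)] for _ in R]
V = [[random.random() for _ in range(k)] for _ in R[0]]
for _ in range(10):                       # alternate until "good enough"
    U = als_half_step(R, V, alpha, lam, k)
    Rt = [list(col) for col in zip(*R)]
    V = als_half_step(Rt, U, alpha, lam, k)
pred = [[sum(u[f] * v[f] for f in range(k)) for v in V] for u in U]
```

Note how alpha and lam both enter as free hyperparameters, which is exactly the multidimensional tuning problem discussed in the thread.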
>>
>> -d
>>
>> On Fri, Feb 17, 2017 at 10:11 AM, Andrew Palumbo <ap....@outlook.com> wrote:
>>
>>> +1 to GLMs
>>>
>>> Sent from my Verizon Wireless 4G LTE smartphone
>>>
>>> -------- Original message --------
>>> From: Trevor Grant <trevor.d.gr...@gmail.com>
>>> Date: 02/17/2017 6:56 AM (GMT-08:00)
>>> To: dev@mahout.apache.org
>>> Subject: Re: Contributing an algorithm for samsara
>>>
>>> Jim is right, and I would take it one further and say it would be best
>>> to implement GLMs (https://en.wikipedia.org/wiki/Generalized_linear_model);
>>> from there a logistic regression is a trivial extension.
>>>
>>> Buyer beware: GLMs will be a bit of work. Doable, but that would be
>>> jumping in neck first for both Jim and Saikat...
>>>
>>> MAHOUT-1928 and MAHOUT-1929
>>>
>>> https://issues.apache.org/jira/browse/MAHOUT-1925?jql=project%20%3D%20MAHOUT%20AND%20component%20%3D%20Algorithms%20AND%20resolution%20%3D%20Unresolved%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC
>>>
>>> ^^ currently open JIRAs around Algorithms; you'll see logistic and GLMs
>>> are in there.
>>>
>>> If you have an algorithm you are particularly intimate with, or
>>> explicitly need/want, feel free to open a JIRA and assign it to yourself.
>>>
>>> There is also a case to be made for implementing the ALS...
>>>
>>> 1) It's a much better 'beginner' project.
>>> 2) Mahout has some world-class recommenders; a toy ALS implementation
>>> might help us think through how the other recommenders (e.g. CCO) will
>>> 'fit' into the framework. E.g., ALS could be the toy prototype
>>> recommender that helps us think through building out that section of
>>> the framework.
>>>
>>> Trevor Grant
>>> Data Scientist
>>> https://github.com/rawkintrevo
>>> http://stackexchange.com/users/3002022/rawkintrevo
>>> http://trevorgrant.org
>>>
>>> *"Fortunate is he, who is able to know the causes of things." -Virgil*
>>>
>>> On Fri, Feb 17, 2017 at 7:59 AM, Jim Jagielski <j...@jagunet.com> wrote:
>>>
>>>> My own thoughts are that logistic regression seems a more "generalized"
>>>> and hence more useful algo to be factored in... at least in the use
>>>> cases that I've been toying with.
>>>>
>>>> So I'd like to help out with that if wanted...
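Since logistic regression keeps coming up as the candidate "beginner" GLM (the canonical-logit member of the family), here is a minimal, hypothetical sketch of it fit by batch gradient descent. Pure Python for brevity, not Mahout code; a Samsara version would express the same update with DRM algebra, and all names and the toy data below are invented.

```python
# Illustrative only: logistic regression as the logit-link GLM,
# fit by plain batch gradient descent.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, iters=2000):
    """Return weights (bias stored in w[0]) minimizing logistic loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(iters):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            # Residual between predicted probability and label.
            err = sigmoid(w[0] + sum(wj * xj
                                     for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j in range(d):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad)]
    return w

# Toy 1-D data: class 1 when x > 1.
X = [[0.0], [0.5], [1.5], [2.0]]
y = [0, 0, 1, 1]
w = fit_logistic(X, y)
predict = lambda x: sigmoid(w[0] + w[1] * x)
```

Swapping the link function and the loss gradient is what generalizing this to other GLMs would amount to.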
>>>>
>>>>> On Feb 9, 2017, at 3:59 PM, Saikat Kanjilal <sxk1...@hotmail.com> wrote:
>>>>>
>>>>> Trevor et al,
>>>>>
>>>>> I'd like to contribute an algorithm or two in Samsara using Spark, as
>>>>> I would like to do a compare-and-contrast of Mahout with R Server for
>>>>> a data science pipeline / machine learning repo that I'm working on.
>>>>> Looking at the list of algorithms
>>>>> (https://mahout.apache.org/users/basics/algorithms.html), is there an
>>>>> algorithm for Spark that would be beneficial for the community? My
>>>>> use cases would typically be around clustering or real-time machine
>>>>> learning for building recommendations on the fly. The algorithms I
>>>>> see that could potentially be useful are:
>>>>> 1) Matrix factorization with ALS
>>>>> 2) Logistic regression with SVD
>>>>>
>>>>> Any thoughts/guidance or recommendations would be very helpful.
>>>>> Thanks in advance.
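Finally, the "R-like formula" feature extraction idea raised in this thread (problem 3) can be illustrated with a deliberately tiny, hypothetical parser: a formula string names the target and the (possibly transformed) input columns, so extraction stays decoupled from the algebraic solver. Real R/patsy-style formulas support far more (interactions, factor encoding, intercept control); everything below, including the transform registry, is invented for illustration.

```python
# Hypothetical mini "R-like formula" feature extractor.
import math

TRANSFORMS = {"log1p": math.log1p, "sqrt": math.sqrt,
              "identity": lambda v: v}

def parse_formula(formula):
    """Split 'target ~ t1 + t2 + ...' into the target name and a list of
    (transform, column) pairs."""
    target, rhs = [s.strip() for s in formula.split("~")]
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if "(" in term:
            fn, col = term[:-1].split("(")   # e.g. 'log1p(duration)'
            terms.append((TRANSFORMS[fn], col))
        else:
            terms.append((TRANSFORMS["identity"], term))
    return target, terms

def design(rows, formula):
    """Build a design matrix X and target vector y from dict rows."""
    target, terms = parse_formula(formula)
    y = [row[target] for row in rows]
    X = [[fn(row[col]) for fn, col in terms] for row in rows]
    return X, y

rows = [{"rating": 1, "views": 3, "duration": 0.0},
        {"rating": 0, "views": 1, "duration": 7.4}]
X, y = design(rows, "rating ~ views + log1p(duration)")
```

The point of the split is that hyperparameters hidden in the transforms (e.g. an alpha inside a confidence mapping) surface explicitly in the formula layer rather than being baked into the solver.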