Re: PyMahout (incore) (alpha v0.1)
Well done Trevor. -peng On Thu, Jan 7, 2021 at 04:45 Trevor Grant wrote: > Hey all, > > I made a branch for a thing I'm toying with: PyMahout. > > See https://github.com/rawkintrevo/pymahout/tree/trunk > > Right now it's sort of dumb - it just makes a couple of random in-core > matrices... but it _does_ make them. > > Next I want to show I can do something with DRMs. > > Once I know it's all possible, I'll make a batch of JIRA tickets and we can > start implementing a Python-like package so that in theory, in a PySpark > workbook, you could: > > ```jupyter > !pip install pymahout > > > import pymahout > > # do pymahout things here... in python. > > ``` > > So if you're interested in helping/playing, reach out on here or directly - > if there is a bunch of interest I can commit all of this to a branch as we > play with it. > > Thanks! > tg >
Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair
Congrats Andrew! On Thu, Jul 19, 2018 at 04:01 Andrew Musselman wrote: > Thanks Andy, looking forward to it! Thank you too for your support and > dedication the past two years; here's to continued progress! > > Best > Andrew > > On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo > wrote: > > Please join me in congratulating Andrew Musselman as the new Chair of > > the > > Apache Mahout Project Management Committee. I would like to thank > > Andrew > > for stepping up; all of us who have worked with him over the years > > know his > > dedication to the project to be invaluable. I look forward to Andrew > > taking the project into the future. > > > > Thank you, > > > > Andy >
Re: Any idea which approaches to non-linear SVM are easily parallelizable?
And I agree with Ted, the non-linearity induced by most kernel functions is overly complex and can easily overfit. Deep learning is a more reliable abstraction. On 10/22/2014 01:42 PM, peng wrote: Is the kernel projection referring to online/incremental incomplete Cholesky decomposition? Sorry, I haven't used SVM for a long time and didn't keep up with the state of the art. If that's true, I haven't found an out-of-the-box implementation, but this should be easy. Yours Peng On 10/22/2014 01:32 PM, Dmitriy Lyubimov wrote: Andrew, thanks a bunch for the pointers! On Wed, Oct 22, 2014 at 10:14 AM, Andrew Palumbo ap@outlook.com wrote: If you do want to stick with SVM - this is a question that I keep coming back to myself, and unfortunately I have forgotten more (and lost more literature) than I've retained. I believe that the most easily parallelizable sections of libSVM for small datasets are (for C-SVC(R), RBF and polynomial kernels): 1. The kernel projections 2. The hyper-parameter grid search for C, \gamma (I believe this is now included in libSVM - I haven't looked at it in a while) 3. For multi-class SVC: the concurrent computation of each SVM for each one-against-one class vote. I'm unfamiliar with any easily parallelizable method for the QP itself. Unfortunately for (2), (3) this involves broadcasting the entire dataset out to each node of a cluster (or working in a shared-memory environment), so it may not be practical depending on the size of your dataset. I've only ever implemented (2) for relatively small datasets, using MPI and with a pure Java socket implementation. Other approaches (further from simple libSVM), which are more applicable to large datasets (I'm less familiar with these): 4. Divide and conquer the QP/SMO problem and solve (as I've said, I'm unfamiliar with this and I don't know of any standard) 5. Break the training set into subsets and solve. For (5) there are several approaches; two that I know of are ensemble approaches and those that accumulate support vectors from each partition and heuristically keep/reject them until the model converges. I've also read some research on implementing this in a MapReduce style [2]. I came across paper [1] last night, which is an interesting comparison of some SVM parallelization strategies; in particular it discusses (1) for a shared-memory environment and for offloading work to GPUs (using OpenMP and CUDA). It also cites several other nice papers discussing SVM parallelization strategies, especially for (5), and goes on to discuss a more purely linear-algebra approach to optimizing SVMs (sec. 5). Also regarding (5), you may be interested in [2] (something I've only looked over briefly). [1] http://arxiv.org/pdf/1404.1066v1.pdf [2] http://arxiv.org/pdf/1301.0082.pdf From: ted.dunn...@gmail.com Date: Tue, 21 Oct 2014 17:32:22 -0700 Subject: Re: Any idea which approaches to non-linear SVM are easily parallelizable? To: dev@mahout.apache.org Last I heard, the best methods pre-project and do linear SVM. Beyond that, I would guess that deep learning techniques would subsume non-linear SVM pretty easily. The best parallel implementation I know of for that is in H2O. On Tue, Oct 21, 2014 at 4:12 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: in particular, from libSVM -- http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf ? thanks. -d
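For concreteness, points (1) and (2) above are embarrassingly parallel. A minimal sketch of both, in plain numpy and the standard library (my illustration, not Mahout or libSVM code; the toy driver and all names are assumptions):

```python
# Sketch only: each worker computes one horizontal slice of the RBF kernel
# matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)  -- point (1) above.
import itertools
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def rbf_rows(args):
    X, start, stop, gamma = args  # the whole X is shipped to every worker,
    block = X[start:stop]         # mirroring the broadcast cost noted above
    sq = ((block[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return start, np.exp(-gamma * sq)

def parallel_rbf(X, gamma, n_workers=4):
    n = X.shape[0]
    K = np.empty((n, n))
    bounds = np.linspace(0, n, n_workers + 1, dtype=int)
    tasks = [(X, a, b, gamma) for a, b in zip(bounds, bounds[1:])]
    with ProcessPoolExecutor(n_workers) as pool:
        for start, rows in pool.map(rbf_rows, tasks):
            K[start:start + rows.shape[0]] = rows
    return K

if __name__ == "__main__":
    X = np.random.rand(500, 20)
    K = parallel_rbf(X, gamma=0.5)
    # Point (2) is the same shape of parallelism one level up: each (C, gamma)
    # cell of the grid is an independent train/validate run on the same data.
    grid = list(itertools.product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]))
    print(K.shape, len(grid))
```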
Re: Any idea which approaches to non-linear SVM are easily parallelizable?
Yes I am. In fact, my question is just about whether approximation is used to make the total workload of computing the kernel matrix sub-quadratic in the training set size. On 10/22/2014 02:21 PM, Andrew Palumbo wrote: Peng, I'm not sure if you were referring to what I wrote: 1. The kernel projections - if so, I was talking about parallelizing the computation of e.g. the RBF kernels. Date: Wed, 22 Oct 2014 13:42:45 -0400 From: pc...@uowmail.edu.au To: dev@mahout.apache.org CC: dlie...@gmail.com Subject: Re: Any idea which approaches to non-linear SVM are easily parallelizable? Is the kernel projection referring to online/incremental incomplete Cholesky decomposition? Sorry, I haven't used SVM for a long time and didn't keep up with the state of the art. If that's true, I haven't found an out-of-the-box implementation, but this should be easy. Yours Peng [...]
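On the sub-quadratic question: one standard trick is a low-rank Nyström approximation, a close cousin of the incomplete Cholesky decomposition Peng mentions, which needs only O(n·m) kernel evaluations for m landmark points. A hedged numpy sketch, not tied to any Mahout API:

```python
# Hypothetical sketch: Nystrom low-rank approximation of an RBF kernel,
# K ~= C @ pinv(W) @ C.T, costing O(n*m) kernel evaluations, not O(n^2).
import numpy as np

def rbf(A, B, gamma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def nystrom(X, m, gamma, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=m, replace=False)  # m landmark points
    C = rbf(X, X[idx], gamma)   # n x m block of the full kernel
    W = C[idx]                  # m x m kernel among the landmarks
    # Feature map phi with phi @ phi.T ~= K, via eigendecomposition of W.
    vals, vecs = np.linalg.eigh(W)
    vals = np.clip(vals, 1e-12, None)
    return C @ vecs / np.sqrt(vals)  # n x m features for a linear solver

# A *linear* SVM trained on phi = nystrom(X, m, gamma) approximates the
# kernel SVM -- matching Ted's note that the best methods pre-project and
# then do linear SVM.
```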
Re: Upgrade to Spark 1.1.0?
In my experience 1.1.0 is quite stable, plus it brings some performance improvements that are totally worth the effort. On 10/19/2014 06:30 PM, Ted Dunning wrote: On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting off the dubious Spark 1.0.1 version is turning out to be a bit of work. Does anyone object to upgrading our Spark dependency? I'm not sure if Mahout built for Spark 1.1.0 will run on 1.0.1, so it may mean upgrading your Spark cluster. It is going to have to happen sooner or later. Sooner may actually be less total pain.
Re: Why is mahout moving to spark?
No it's not; Spark is a superset of MapReduce. Besides, 'Hadoop MapReduce' here denotes a specific implementation rather than an architecture. On 10/15/2014 03:44 PM, thejas prasad wrote: Hey all, I am curious why mahout is moving away from spark? I mean it is here: The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. But why did this happen? And also, is there a place I can see all the previous emails, in the user/dev list? Thanks, Thejas
Re: Upgrade to spark 1.0.x
+1, 1.0.0 is recommended. Many releases after 1.0.1 had a short test cycle, and 1.0.2 apparently reverted many fixes for causing more serious problems. On 14-08-09 04:51 PM, Ted Dunning wrote: +1 Until we release a version that uses Spark, we should stay with what helps us. Once a release goes out, tracking whichever version of Spark the big distros put out becomes more important. On Sat, Aug 9, 2014 at 9:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: +1 Seems like we ought to keep up to the bleeding edge until the next Mahout release; that's when the pain of upgrade gets spread much wider. In fact, if Spark gets moved to Scala 2.11 before our release we probably should consider upgrading Scala too.
Spark-shell web UI powered by IPython-notebook, maybe useful in promoting the Mahout DRM environment.
Dudes, For those who can't afford the glamorous Databricks Spark Cloud, or got pissed off by its incompatibility with Mahout DRM, you may consider this project as an alternative: https://github.com/tribbloid/ISpark Support for Mahout DRM is still being implemented and will be delivered in a few days. Yours Peng
Re: VOTE: moving commits to git-wp.o.a github PR features.
+1 On Sat 17 May 2014 02:18:56 PM EDT, Gokhan Capan wrote: +1 Sent from my iPhone On May 16, 2014, at 21:38, Dmitriy Lyubimov dlie...@gmail.com wrote: Hi, I would like to initiate a procedural vote on moving to git as our primary commit system, and using github PRs as described in Jake Farrel's email to @dev [1]. [1] https://blogs.apache.org/infra/entry/improved_integration_between_apache_and If the vote succeeds, I will file a ticket with infra to commence the necessary changes and to move our project to git-wp as the primary source for commits, as well as add the github integration features [1]. (I assume pure git commits will be required after that's done, with no svn commits allowed.) The motivation is to engage git and github PR features as described, and to avoid git mirror history messes like we've seen associated with authors.txt file fluctuations. PMC and committers have binding votes, so please vote. Lazy consensus with a minimum of 3 +1 votes. The vote will conclude in 96 hours to allow some extra time for the weekend (i.e. Tuesday afternoon PST). Here is my +1 -d
Re: Plan for 1.0
I have Saturday, Sunday and EDT 1700+. On Wed 19 Mar 2014 12:30:49 PM EDT, Andrew Musselman wrote: Friday afternoon Pacific Time is good for me too. On Mar 19, 2014, at 12:14 AM, Saikat Kanjilal sxk1...@hotmail.com wrote: I'm on Pacific Standard Time and am free Sundays late afternoon. Sent from my iPhone On Mar 19, 2014, at 12:13 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: i am on vacation, so most of the Pacific daylight ranges on any day should work for me. On Wed, Mar 19, 2014 at 12:07 AM, Sebastian Schelter s...@apache.org wrote: Friday would also work for me. On 03/19/2014 08:05 AM, Suneel Marthi wrote: Same here, travel next week and in Amsterdam the first week of April. I avoid Sundays or weekends for obvious reasons. How about this Friday? Sent from my iPhone On Mar 19, 2014, at 3:02 AM, Sebastian Schelter s...@apache.org wrote: Would some time on Sunday work? I'll be traveling the next two weeks starting from Tuesday. Best, Sebastian On 03/19/2014 07:55 AM, Suneel Marthi wrote: I had a hangout set up for 0.9, not sure if it's still valid; I can check on that or can set one up now. When would people want to have it? Mondays and Wednesdays don't work for me. Would Tuesdays 6pm Eastern Time work? On Wednesday, March 19, 2014 2:45 AM, Sebastian Schelter s...@apache.org wrote: Hi Saikat, 1) I think that MAHOUT-1248 and 1249 are still very important features that I would love to see in the codebase, as they would greatly improve the usability of our ALS code. 2) I think the last discussion item regarding h2o was to find a way to compare it against existing or Spark-related algorithm implementations to get a better picture of the programming model and performance. I also don't feel that a final decision has been reached about this. 3) We should have the hangout; can someone step up and organize it? Best, Sebastian On 03/19/2014 04:45 AM, Saikat Kanjilal wrote: Hi Guys, I read through the email threads with the weigh-ins on the inclusion of H2O as well as Spark, and wanted to circle back on the plan for folks to meet around 1.0, so a few questions: 1) How does the inclusion of H2O and Spark weigh in importance versus the current JIRA items that exist for potentially new feature work to be done in Mahout (in my case JIRA 1248/1249)? 2) From reading all the responses it doesn't seem like there's full consensus on what the next steps are for h2o and how that relates to the roadmap around 1.0; please correct me if I'm misunderstanding. Can someone outline whether any concrete decisions have been made on whether or not Mahout 1.0 will include h2o bindings? 3) Are we moving forward with the Google hangout? I didn't receive anything about this yet. Thanks in advance.
Re: contributing to mahout
Hi Hardik, I'm forwarding the previous thread about the 1.0 release plan to you. As a Spark user you will see many things to be done. Yours Peng On Thu 06 Mar 2014 12:57:20 PM EST, Sebastian Schelter wrote: Hi Hardik, at the moment we are heavily working on polishing our documentation. A very welcome contribution would be a nice-to-read writeup of how to use an algorithm in Mahout to solve an exemplary problem, e.g. taking a movie-ratings dataset and showing how to compute recommendations on it. Best, Sebastian On 03/06/2014 06:54 PM, Hardik Pandya wrote: Hi all, I am new to Mahout and wanted to contribute to the Mahout dev community; any initial pointers for newcomers like me are appreciated. Thanks, Hardik Pandya
Re: [jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout
Wow, I have been waiting for this for a long time - finally fixed. On Sun 02 Mar 2014 05:01:26 PM EST, Suneel Marthi (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Suneel Marthi updated MAHOUT-1178: -- Fix Version/s: (was: Backlog) 1.0 GSOC 2013: Improve Lucene support in Mahout --- Key: MAHOUT-1178 URL: https://issues.apache.org/jira/browse/MAHOUT-1178 Project: Mahout Issue Type: New Feature Reporter: Dan Filimon Assignee: Gokhan Capan Labels: gsoc2013, mentor Fix For: 1.0 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch [via Ted Dunning] It should be possible to view a Lucene index as a matrix. This would require that we standardize on a way to convert documents to rows. There are many choices, the discussion of which should be deferred to the actual work on the project, but there are a few obvious constraints: a) it should be possible to get the same result as dumping the term vectors for each document, each to a line, and converting that result using standard Mahout methods. b) numeric fields ought to work somehow. c) if there are multiple text fields, that ought to work sensibly as well. Two options include dumping multiple matrices or converting the fields into a single row of a single matrix. d) it should be possible to refer back from a row of the matrix to find the correct document. This might be because we remember the Lucene doc number or because a field is named as holding a unique id. e) named vectors and matrices should be used if plausible. -- This message was sent by Atlassian JIRA (v6.2#6252)
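For constraint (a), the shape of the conversion is simple. A hedged sketch (plain Python/scipy, my illustration rather than the eventual Mahout API) that turns dumped per-document term vectors into rows of a sparse matrix, keeping a row-to-document map for constraint (d):

```python
# Hypothetical sketch: turn dumped Lucene term vectors (one dict of
# term -> frequency per document) into a sparse row matrix, keeping a
# row -> document-id map so rows trace back to documents (constraint d).
from scipy.sparse import csr_matrix

def term_vectors_to_matrix(docs):
    """docs: iterable of (doc_id, {term: freq}) pairs."""
    vocab, row_ids = {}, []
    data, indices, indptr = [], [], [0]
    for doc_id, tv in docs:
        row_ids.append(doc_id)
        for term, freq in tv.items():
            indices.append(vocab.setdefault(term, len(vocab)))
            data.append(freq)
        indptr.append(len(indices))
    A = csr_matrix((data, indices, indptr), shape=(len(row_ids), len(vocab)))
    return A, row_ids, vocab

A, row_ids, vocab = term_vectors_to_matrix([
    ("doc1", {"mahout": 2, "lucene": 1}),
    ("doc2", {"lucene": 3, "matrix": 1}),
])
```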
Re: Mahout 1.0 goals
Hi Dr Dunning, I'm reluctant to admit that my feeling is similar to that of many of Sean's customers. As a user of mahout and lucene-solr, I see a lot of similarities in their cases: lucene | mahout; indexing takes text as sparse vectors and builds an inverted index | training takes data as sparse vectors and builds a model; the inverted index lives in memory/HDFS | the model lives in memory/HDFS; used by inputting text and returning matches with scores | used by inputting test data and returning scores/labels; model selection is done by comparing the ordinal ranking of scores with ground truth | model selection is done by comparing scores/labels with ground truth. Then lucene/solr/elasticsearch evolved to become hugely successful flagship products (as buggy and incomplete as they are, they still gained a wide usage that mahout never achieved). Yet mahout still looks like it was assembled with glue and duct tape. The major difficulties I encountered are: 1. Components are not interchangeable: e.g. the data and model representation for single-node CF is vastly different from MR CF, and new features sometimes add backward-incompatible representations. This drastically demoralizes users seeking to integrate with it while expecting improvement. 2. Components have strong dependencies on others: e.g. cross-validation of CF can only use the in-memory DataModel, which SlopeOneRecommender cannot update properly (it's removed, but you get my point). Such designs never drew enough attention beyond a 'won't fix' resolution. 3. Many models can only be used internally and cannot be exported or reused in other applications. This is true in solr as well, but its RESTful API is very universal and many ETL tools have been built for it. In contrast, mahout has a very hard learning curve for non-Java developers. It's not bad to see mahout as a service on top of a library, if it doesn't take too much effort. Yours Peng On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote: Ravi, Good points. On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla ravi.mummu...@gmail.com wrote: - Natively support Windows (guidance, etc. No documentation exists today, for instance) There is a bit of demand for that. - Faster time to first application (from discovery to first application currently takes a non-trivial amount of effort; how can we lower the bar and reduce the friction for adoption?) There is huge evidence that this is important. - Better documenting use cases with working samples/examples (documentation on https://mahout.apache.org/users/basics/algorithms.html is spread out and there is too much focus on algorithms as opposed to use cases - this is an adoption blocker) This is also important. - Uniformity of the API set across all algorithms (are we providing the same experience across all APIs?) And many people have been tripped up by this. - Measuring/publishing scalability metrics of various algorithms (why would we want users to adopt Mahout vs. other frameworks for ML at scale?) I don't see this as important as some of your other points, but it is still useful.
Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
That should be easy. But that defeats the purpose of using Mahout, as there are already enough implementations of single-node backpropagation (in which case GPU is much faster). Yexi: Regarding downpour SGD and sandblaster, may I suggest that the implementation had better have no parameter server? It's obviously a single point of failure and, in terms of bandwidth, a bottleneck. I heard that MLlib on top of Spark has a functional implementation (I've never read or tested it), and it's possible to build the workflow on top of YARN. None of those frameworks has a heterogeneous topology. Yours Peng On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913488#comment-13913488 ] Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM: --- I've read the papers. I didn't think about a distributed network. I had in mind a network that will fit into memory, but will require a significant amount of computation. I understand that there are better options for neural networks than map reduce. How about a non-map-reduce version? I see that you think it is something that would make sense. (Doing a non-map-reduce neural network in Mahout would be of substantial interest.) Do you think it would be a valuable contribution? Is there a need for this type of algorithm? I am thinking about multi-threaded batch gradient descent with pretraining (RBM or/and Autoencoders). I have looked into these old JIRAs. The RBM patch was withdrawn. I would rather like to withdraw that patch, because by the time I implemented it I didn't know that the learning algorithm is not suited for MR, so I think there is no point including the patch. was (Author: maciejmazur): [...] GSOC 2013 Neural network algorithms --- Key: MAHOUT-1426 URL: https://issues.apache.org/jira/browse/MAHOUT-1426 Project: Mahout Issue Type: Improvement Components: Classification Reporter: Maciej Mazur I would like to ask about the possibilities of implementing neural network algorithms in Mahout during GSOC. There is a classifier.mlp package with a neural network, but I can see neither RBM nor Autoencoder in these classes. There is only one mention of Autoencoders in the NeuralNetwork class. As far as I know Mahout doesn't support convolutional networks. Is it a good idea to implement one of these algorithms? Is it a reasonable amount of work? How hard is it to get GSOC in Mahout? Did anyone succeed last year? -- This message was sent by Atlassian JIRA (v6.1.5#6160)
Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
With pleasure! The original downpour paper proposes a parameter server from which subnodes download shards of the old model and upload gradients. So if the parameter server is down, the process has to be delayed. It also requires that all model parameters be stored and atomically updated on (and fetched from) a single machine, imposing asymmetric HDD and bandwidth requirements. This design is necessary only because each -=delta operation has to be atomic, which cannot be ensured across the network (e.g. on HDFS). But it doesn't mean that the operation cannot be decentralized: parameters can be sharded across multiple nodes, and multiple accumulator instances can handle parts of the vector subtraction. This should be easy if you create a buffer for the stream of gradients and allocate proper numbers of producers and consumers on each machine to make sure it doesn't overflow. Obviously this is far from the MR framework, but at least it can be made homogeneous and slightly faster (because sparse data can be distributed in a way that minimizes overlap, so gradients don't have to cross the network that frequently). If we instead use a centralized architecture, then there must be >=1 backup parameter server for mission-critical training. Yours Peng On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote: Peng, Can you provide more details about your thought? Regards, 2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au: That should be easy. But that defeats the purpose of using Mahout, as there are already enough implementations of single-node backpropagation (in which case GPU is much faster). Yexi: Regarding downpour SGD and sandblaster, may I suggest that the implementation had better have no parameter server? It's obviously a single point of failure and, in terms of bandwidth, a bottleneck. I heard that MLlib on top of Spark has a functional implementation (I've never read or tested it), and it's possible to build the workflow on top of YARN. None of those frameworks has a heterogeneous topology. Yours Peng [...]
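To make the sharded producer/consumer idea concrete, here is a toy single-process sketch (all names and sizes are my assumptions, not from the thread): the parameter vector is split across shards, each shard drains its own bounded gradient queue, and the atomic -=delta only ever happens within one shard:

```python
# Toy sketch of sharded gradient accumulation: workers (producers) push
# sparse gradient slices; one consumer per shard applies them, so no
# single machine holds or atomically updates the whole model.
import queue, threading
import numpy as np

N_SHARDS, DIM = 4, 16
shards = [np.zeros(DIM // N_SHARDS) for _ in range(N_SHARDS)]
queues = [queue.Queue(maxsize=1024) for _ in range(N_SHARDS)]  # bounded buffer

def consumer(s):
    while True:
        grad = queues[s].get()
        if grad is None:          # poison pill: shut down
            break
        shards[s] -= 0.01 * grad  # the -=delta is atomic *within* a shard

def push_gradient(full_grad):
    # Producer side: slice the gradient and route each piece to its shard.
    for s, piece in enumerate(np.split(full_grad, N_SHARDS)):
        queues[s].put(piece)

threads = [threading.Thread(target=consumer, args=(s,)) for s in range(N_SHARDS)]
for t in threads: t.start()
for _ in range(100):              # pretend workers computed 100 gradients
    push_gradient(np.random.randn(DIM))
for q in queues: q.put(None)
for t in threads: t.join()
print(shards[0][:4])
```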
Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Hi Yexi, I was reading your code and found that the MLP class is abstract-ish (both train functions throw exceptions). Is there a thread or ticket for a shippable implementation? Yours Peng On Thu 27 Feb 2014 06:56:51 PM EST, peng wrote: With pleasure! The original downpour paper proposes a parameter server from which subnodes download shards of the old model and upload gradients. So if the parameter server is down, the process has to be delayed. It also requires that all model parameters be stored and atomically updated on (and fetched from) a single machine, imposing asymmetric HDD and bandwidth requirements. [...]
Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms
Oh, thanks a lot, I missed that one :) +1 on the easiest one being implemented first. I hadn't thought about the difficulty issue; I need to read more about the YARN extension. Yours Peng On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote: Hi, Peng, Do you mean the MultilayerPerceptron? There are three 'train' methods, and only one (the one without the parameters trackingKey and groupKey) is implemented. In the current implementation, they are not used. Regards, Yexi 2014-02-27 19:31 GMT-05:00 Ted Dunning ted.dunn...@gmail.com: Generally for training models like this, there is an assumption that fault tolerance is not particularly necessary, because the low risk of failure trades against algorithmic speed. For a reasonably small chance of failure, simply re-running the training is just fine. If there is a high risk of failure, simply checkpointing the parameter server is sufficient to allow restarts without redundancy. Sharding the parameters is quite possible and is reasonable when the parameter vector exceeds tens or hundreds of millions of parameters, but isn't likely much necessary below that. The asymmetry is similarly not a big deal. The traffic to and from the parameter server isn't enormous. Building something simple and working first is a good thing. On Thu, Feb 27, 2014 at 3:56 PM, peng pc...@uowmail.edu.au wrote: With pleasure! The original downpour paper proposes a parameter server from which subnodes download shards of the old model and upload gradients. [...]
Re: Mahout on Spark?
It was suggested that I switch to MLlib for its performance, but I doubt that it is production-ready, and even if it is I would still favour Hadoop's sturdiness and self-healing. But maybe Mahout can include contribs that M/R is not fit for, like downpour SGD or graph-based algorithms? On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote: To set expectations appropriately, I think it's important to point out this is completely infeasible short of a total rewrite, and I can't imagine that will happen. It may not be obvious if you haven't looked at the code how completely dependent on M/R it is. You can swap out M/R and Spark if you write in terms of something like Crunch, but that is not at all the case here. On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote: +100 for this, different execution engines, like the direction pig and crunch take Sent from my iPhone On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote: I imagine Mahout offering an option to users to select from different execution engines (just like we currently do by giving M/R or sequential options), starting with Spark. I am not sure what changes are needed in the codebase, though. Maybe following MLI (or alike) and implementing some more stuff, such as common interfaces for iterating over data (the M/R way and the Spark way). IMO, another effort might be porting pre-online machine learning (such as transforming text into vectors based on the dictionary generated by seq2sparse), machine learning based on mini-batches, and the streaming summarization stuff in Mahout to Spark Streaming. Best, Gokhan On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: PS I am moving along a cost optimizer for Spark-backed DRMs on some multiplicative pipelines that is capable of figuring out different cost-based rewrites, and an R-like DSL that mixes in-core and distributed matrix representations and blocks, but it is painfully slow; I am really only doing it a couple of nights a month. It does not look like I will be doing it on company time any time soon (and even if I did, the company doesn't seem inclined to contribute anything new I do on their time). It is all painfully slow; there's no direct funding for it anywhere with no strings attached. That will probably be the primary reason why Mahout would not be able to get much traction compared to university-based contributions. On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Unfortunately, methinks the prospects of something like a Mahout/MLlib merge seem very unlikely due to vastly diverged approaches to the basics of linear algebra (and other things). Just like one cannot grow a single tree out of two trunks -- not easily, anyway. It is fairly easy to port (and subsequently beat) MLlib at this point from a collection-of-algorithms point of view. But IMO the goal should be more MLI-like first, and a port second. And be very careful with concepts. Something that I so far don't see happening with MLlib. MLlib seems to be an old-style Mahout-like rush to become a collection of basic algorithms rather than a coherent foundation. Admittedly, I haven't looked very closely. On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org wrote: I'm also convinced that Spark is a superior platform for executing distributed ML algorithms.
We had a discussion about a change from Hadoop to another platform some time ago, but at that point in time it was not clear which of the upcoming dataflow processing systems (Spark, Hyracks, Stratosphere) would establish itself amongst the users. To me it seems pretty obvious that Spark has made the race. I concur with Ted; it would be great to have the communities work together. I know that at least 4 Mahout committers (including me) are already following Spark's mailing list and actively participating in the discussions. What are the ideas on what a fruitful cooperation would look like? Best, Sebastian PS: I ported LLR-based cooccurrence analysis (aka item-based recommendation) to Spark some time ago, but I haven't had time to test my code on a large dataset yet. I'd be happy to see someone help with that. On 02/19/2014 08:04 AM, Nick Pentreath wrote: I know the Spark/MLlib devs can occasionally be quite set in their ways of doing certain things, but we'd welcome as many Mahout devs as possible to work together. It may be too late, but perhaps a GSoC project to look at a port of some stuff like the co-occurrence recommender and streaming k-means? N -- Sent from Mailbox for iPhone On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath nick.pentre...@gmail.com wrote: My (admittedly heavily biased) view is Spark is a superior platform overall for ML. If the two communities can work together to leverage the strengths of Spark, and the large amount of good stuff in Mahout (as well as the fantastic depth of
Re: Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
Really? I guess PageRank in Mahout was removed due to the inherent network bottleneck of MapReduce. But I didn't know MLlib had an implementation. Is the MLlib implementation based on Lanczos or SSVD? Just curious... On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote: I bet page rank in MLlib on Spark finds the stationary distribution much faster. On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote: Agreed, and this is the case where the Lanczos algorithm is obsolete. My point is: if SSVD is unable to find the eigenvector of an asymmetric matrix (this is a common formulation of PageRank, some random walks, and many other things), then we still have to rely on the large-scale Lanczos algorithm. [...]
Re: Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
Thanks a lot Sebastian, Ted and Dmitriy, I'll try Giraph for a performance benchmark. You are right, power iteration is just the simplest form of Lanczos; it shouldn't be in scope. On Tue 18 Feb 2014 03:59:57 AM EST, Sebastian Schelter wrote: You can also use Giraph for a superfast PageRank implementation. Giraph even runs on standard Hadoop clusters. PageRank is usually computed by power iteration, which is much simpler than Lanczos or SSVD and only gives the eigenvector associated with the largest eigenvalue. Am 18.02.2014 09:33 schrieb Peng Cheng pc...@uowmail.edu.au: Really? I guess PageRank in Mahout was removed due to the inherent network bottleneck of MapReduce. But I didn't know MLlib had an implementation. Is the MLlib implementation based on Lanczos or SSVD? Just curious... [...]
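For reference, power iteration itself is only a few lines. A generic numpy sketch (not the Giraph or MLlib code), using a toy column-stochastic link matrix with the damping factor omitted:

```python
# Generic power-iteration sketch: finds the eigenvector of the
# largest-magnitude eigenvalue of A, which for a column-stochastic
# PageRank matrix is the stationary distribution.
import numpy as np

def power_iteration(A, iters=100, tol=1e-10):
    v = np.ones(A.shape[0]) / A.shape[0]  # start from the uniform vector
    for _ in range(iters):
        w = A @ v
        w /= np.linalg.norm(w, 1)         # renormalize every step
        if np.linalg.norm(w - v, 1) < tol:
            break
        v = w
    return v

# Toy 3-page link matrix (columns sum to 1).
A = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
print(power_iteration(A))  # converges to the uniform stationary distribution
```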
Re: Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor. Yours Peng On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote: In PageRank I'm afraid I have no option other than the eigenvector for \lambda, not the singular vectors u, v :) The PageRank in Mahout was removed with the other graph-based algorithms. On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote: SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigendecomposition, which means that the question of symmetric matrices or positive definiteness doesn't much matter. Do you really need eigendecomposition? On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote: Just asking for a possible replacement of our Lanczos-based PageRank implementation. - Peng
Re: Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
Agreed, and this is the case where the Lanczos algorithm is obsolete. My point is: if SSVD is unable to find the eigenvector of an asymmetric matrix (this is a common formulation of PageRank, some random walks, and many other things), then we still have to rely on the large-scale Lanczos algorithm. On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote: For the symmetric case, SVD is eigendecomposition. On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote: If SSVD is not designed for such eigenvector problems, then I would vote for retaining the Lanczos algorithm. However, I would like to see the opposite case: I have tested both algorithms on the symmetric case, and SSVD is much faster and more accurate than its competitor. Yours Peng [...]
Re: Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
In PageRank I'm afraid I have no option other than the eigenvector for \lambda, not the singular vectors u, v :) The PageRank in Mahout was removed with the other graph-based algorithms. On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote: SSVD is very probably better than Lanczos for any large decomposition. That said, it does SVD, not eigendecomposition, which means that the question of symmetric matrices or positive definiteness doesn't much matter. Do you really need eigendecomposition? On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote: Just asking for a possible replacement of our Lanczos-based PageRank implementation. - Peng
Does SSVD support eigendecomposition of non-symmetric non-positive-semidefinite matrices better than Lanczos?
Just asking for a possible replacement of our Lanczos-based PageRank implementation. - Peng
Learning to rank support in Mahout and Solr integration?
This is what I believe to be a typical learning-to-rank model: 1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr these are queries/function queries). 2. Test those scorers on a ground-truth dataset, generating feature vectors for the top-n results annotated by humans. 3. Use an existing classifier/regressor (e.g. support vector ranking, GBDT, random forest, etc.) on those feature vectors to get a ranking model. 4. Export this ranking model back to Solr as a custom ensemble query (a BooleanQuery with custom boosting factors for a linear model, or a CustomScoreQuery with a custom scoring function for a non-linear model), push it to the Solr server, register it with a QParser. Push it to production. End of. But I didn't find this workflow quite easy to implement with the mahout-solr integration (is it discouraged for some reason?). Namely, there is no pipeline from the results of the scorers to a Mahout-compatible vector form, and there is no pipeline from the ranking model back to an ensemble query. (I only found the lucene2seq class, and the upcoming recommendation support, which don't quite fit this scenario.) So what's the best practice for easily implementing a realtime learning-to-rank search engine in this case? I've worked at a bunch of startups, and such an appliance seems to be in high demand. (Remember that Solr-based collaborative filtering model proposed by Dr Dunning? This is the content-based counterpart of it.) I'm looking forward to streamlining this process to make my upcoming work easier. I think Mahout/Solr is the undisputed instrument of choice due to their scalability and the machine-learning background of many of their top committers. Can we talk about it at some point? Yours Peng
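To illustrate steps 2-3, here is a minimal sketch with assumed data (random feature vectors standing in for the weak scorers' outputs and labels, all names mine): an off-the-shelf least-squares fit whose learned weights would become the boost factors of the linear ensemble query in step 4:

```python
# Hypothetical sketch of steps 2-3: feature vectors from the weak scorers
# plus human relevance labels go into an off-the-shelf regressor; the
# learned weights map onto per-subquery boosts for step 4.
import numpy as np

# X[i, j] = score of weak scorer j on annotated result i (from step 2);
# y[i]    = human relevance label for result i (synthetic here).
rng = np.random.default_rng(0)
X = rng.random((500, 4))                  # 4 weak scorers, 500 results
y = X @ np.array([0.6, 0.1, 0.25, 0.05]) + 0.02 * rng.standard_normal(500)

# A linear model via least squares: the simplest possible "step 3" learner.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w, 2))  # these weights become the boost factors in step 4
```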
Re: Learning to rank support in Mahout and Solr integration?
Hi Dr Dunning, Thanks a lot! I was trying to make the model generalizable enough, but I'm also afraid I may 'abuse' it a bit, Here is my existing solution: 1. wrap any scorer by a ValueSource (many out-of-the-box exists in lucene-solr, extensions are possible but they don't have to be registered with ValueSourceParser-they won't be used independently) 2. extend CustomScoreQuery to have a flat and straightforward explanation form. Use this as a wrapper of filters (As SubQ) and scorers (As FunctionQ) 3. write a converter to print flat explanation to Mahout-compatible vectors. 4. run a job to 'explain()' those ground truths on an index and dump the result vectors. 5. (optional) run other jobs to get not-content-based score vectors. 6. join them, feed into a classifier-regressor, do some model selections. 7. (from this point I haven't done anything) try to 'migrate' this model into another CustomScoreQuery, which has a strong scorer that ensemble features in the same way the model suggested. 8. push into Solr Cloud Server. Register with Qparser. What I found to be hard: 1. explanation is kind of abusive, its only designed for manual tweaking. I constantly run into problems where 'explain()' implementation was look down upon by developers and code stubs are used to fill. Notably, ToParentBlockJoin won't show nested scores, and ToChildBlockJoin simply doesn't work. 2. There is no automatic way to 'migrate' model to ensemble query. Though I haven't proceed that far I'm already afraid of the difficulty. 3. As a NoSQL database optimized to the core in text processing, Solr extensions are totally not intuitive and hard to debug and maintain. We try to keep this part minimal but still get stagnated at some point. Environment is build on CDH 5.0beta2 with YARN and Cloudera search (Solr 4.4), some bugs then force me to uninstall it and install Solr Cloud 4.6. I wonder if there are more 'out-of-the-box' solutions? Yours Peng On Sun 09 Feb 2014 05:53:20 PM EST, Ted Dunning wrote: I think that this is a bit of an idiosyncratic model for learning to rank, but it is a reasonably viable one. It would be good to have a discussion of what you find hard or easy and what you think is needed to make this work. Let's talk. On Sun, Feb 9, 2014 at 2:26 PM, peng pc...@uowmail.edu.au wrote: This is what I believe to be a typical learning to rank model: 1. Create many weak rankers/scorers (a.k.a feature engineering, in Solr these are queries/function queries). 2. Test those scorers on a ground truth dataset. Generating feature vectors for top-n results annotated by human. 3. Use an existing classifier/regressor (e.g. support vector ranking, GBDT, random forest etc.) on those feature vectors to get a ranking model. 4. Export this ranking model back to Solr as a custom ensemble query (a BooleanQuery with custom boosting factor for linear model, or a CustomScoreQuery with custom scoring function for non-linear model), push it to Solr server, register with QParser. Push it to production. End of. But I didn't find this workflow quite easy to implement in mahout-solr integration (is it discouraged for some reason?). Namely, there is no pipeline from results of scorers to a Mahout-compatible vector form, and there is no pipeline from ranking model back to ensemble query. (I only found the lucene2seq class, and the upcoming recommendation support, which don't quite fit into the scenario). So what's the best practice for easily implementing a realtime, learning to rank search engine in this case? 
I've worked at a bunch of startups, and such an appliance seems to be in high demand. (Remember that Solr-based collaborative filtering model proposed by Dr Dunning? This is the content-based counterpart of it.) I'm looking forward to streamlining this process to make my upcoming work easier. I think Mahout/Solr is the undisputed instrument of choice due to their scalability and the machine-learning background of many of their top committers. Can we talk about it at some point? Yours Peng
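A minimal sketch of what step 3 of the workflow above (flat explanation to Mahout vector) could look like on Lucene 4.x, assuming the wrapper from step 2 yields a one-level explanation whose leaf descriptions are the feature names; the class and the featureNames parameter are illustrative, not part of Mahout or Solr:

```java
import java.util.List;

import org.apache.lucene.search.Explanation;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

/**
 * Hypothetical converter for step 3: walk a flat (one-level) explanation
 * and emit the leaf values as a Mahout vector in a fixed feature order.
 * Assumes the wrapper from step 2 tags each wrapped scorer's explanation
 * with the feature name.
 */
public final class ExplanationToVector {

  /** featureNames fixes the dimension order; features not seen stay 0. */
  public static Vector convert(Explanation flat, List<String> featureNames) {
    Vector v = new DenseVector(featureNames.size());
    Explanation[] details = flat.getDetails();
    if (details == null) {
      return v;                      // a leaf-only explanation has no details
    }
    for (Explanation leaf : details) {
      int idx = featureNames.indexOf(leaf.getDescription());
      if (idx >= 0) {
        v.set(idx, leaf.getValue()); // one feature score per wrapped scorer
      }
    }
    return v;
  }
}
```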
Re: Mahout 0.9 Release
+1, can't see a bad side. On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote: +1 from me On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter s...@apache.org wrote: +1 On 01/29/2014 05:25 AM, Andrew Musselman wrote: Looks good. +1 On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote: a), b), c), d) all passed here. Cosine distances of clustered points from cluster-reuters.sh option 1 (kmeans) were within the range [0,1]. Date: Tue, 28 Jan 2014 16:45:42 -0800 From: suneel_mar...@yahoo.com Subject: Mahout 0.9 Release To: u...@mahout.apache.org; dev@mahout.apache.org Fixed the issues that were reported with the clustering code this past week, and upgraded the codebase to Lucene 4.6.1, which was released today. Here's the URL for the 0.9 release in staging: https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/ The artifacts have been signed with the following key: https://people.apache.org/keys/committer/smarthi.asc Please: a) Verify that you can unpack the release (tar or zip) b) Verify you are able to compile the distro c) Run through the unit tests: mvn clean test d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run through all the different options in each script. We need a minimum of 3 '+1' votes from the PMC for the release to be finalized.
Sebastian: On the subject of efficient In-memory DataModel for recommendation engine.
Hi Sebastian, Sorry I dropped out from the hangout for a few minutes; when I got back it was already over :( Well, let's continue the conversation on the DataModel improvement. I was looking into your KDDCupFactorizablePreferences and found out that it doesn't load any data into memory; the only data structure in that class is the dataFile used to generate a stream of preferences from the hard disk. I think this is why you can load it within 1G of memory without a heap-space overflow. However, I think it is only good for memory saving at the expense of lots of things (e.g. random access, random insert, delete and update, concurrency). This justifies the necessity of loading things into memory: theoretically, a preference array of Netflix size will cost at least [8 bytes (userID: long) + 8 bytes (itemID: long) + 4 bytes (value: float)] * 100,480,507 = 2,009,610,140 bytes = 1916.5 MB = 1.87 GB, plus overhead. I would rather have it a bit bigger to trade for O(1) random access/update, but not too big, unlike the current row/column sparse-matrix-ish implementation that duplicates everything. That's my concern. I have several ideas for optimizing my in-memory DataModel, but never had time to do them :( Please give me a few more weeks; when the code is optimized to the teeth and supports concurrent access, I'll submit it again for review. Gokhan has also done a lot of work on this part, so it's good to have many options. Yours Peng
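The arithmetic above corresponds to storing preferences in three parallel primitive arrays; a sketch of that layout with illustrative names (not the actual patch):

```java
/**
 * Three parallel primitive arrays at 20 bytes per preference (plus the
 * arrays' own headers), with O(1) access and update by position. For the
 * Netflix size, capacity = 100,480,507 gives roughly 1.87 GB in total.
 */
public final class PackedPreferences {
  private final long[] userIDs;
  private final long[] itemIDs;
  private final float[] values;

  public PackedPreferences(int capacity) {
    userIDs = new long[capacity];
    itemIDs = new long[capacity];
    values = new float[capacity];
  }

  public void set(int i, long userID, long itemID, float value) {
    userIDs[i] = userID;   // 8 bytes
    itemIDs[i] = itemID;   // 8 bytes
    values[i] = value;     // 4 bytes
  }

  public float valueAt(int i) {
    return values[i];      // O(1) random access by position
  }
}
```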
Why Kahan summation was not used anywhere?
For a large-scale computational engine this seems unsound: most summation/average and vector dot-product routines still use naive summation despite its O(n) error growth. Is there a reason? All the best, Yours Peng
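For reference, compensated (Kahan) summation costs a few extra flops per element but keeps the rounding error bounded by a constant instead of growing with n; a minimal sketch:

```java
/** Compensated (Kahan) summation: error stays O(1) ulp instead of
 *  growing with n as in the naive left-to-right loop. */
public final class KahanSum {
  public static double sum(double[] xs) {
    double sum = 0.0;
    double c = 0.0;            // running compensation for lost low-order bits
    for (double x : xs) {
      double y = x - c;        // apply the correction carried over
      double t = sum + y;      // low-order bits of y may be lost here...
      c = (t - sum) - y;       // ...and are recovered algebraically into c
      sum = t;
    }
    return sum;
  }
}
```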
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758103#comment-13758103 ] Peng Cheng commented on MAHOUT-1286: The existing open-addressing hash table is for 1-d arrays, not 2-d matrices. I can get the concurrency done by next week, but there are simply too many pending optimizations; e.g. if you set the load factor to 1.2 it is pretty slow. If you can help with the TODO list in the code, that would be awesome. Not sure about the consequences, as the 2-d matrix interface has an int (32-bit) index but the DataModel has a long (64-bit) index. If you don't mind adding more things to mahout-math, then it should be all right.

Memory-efficient DataModel, supporting fast online updates and element-wise iteration
Key: MAHOUT-1286
URL: https://issues.apache.org/jira/browse/MAHOUT-1286
Project: Mahout
Issue Type: Improvement
Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Labels: collaborative-filtering, datamodel, patch, recommender
Fix For: 0.9
Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, Semifinal-implementation-added.patch
Original Estimate: 336h
Remaining Estimate: 336h

Most DataModel implementations in the current CF component use a hash map to enable fast 2-d indexing and updates. This is not memory-efficient for big data sets; e.g. the Netflix prize dataset takes 11G of heap space as a FileDataModel. An improved DataModel should use a more compact data structure (like arrays): this trades a little time complexity in 2-d indexing for a vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in its training process.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754269#comment-13754269 ] Peng Cheng commented on MAHOUT-1286: Hi Gokhan, No problem; it only has two files, so I'll post the patch immediately. -Yours Peng
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1286: Attachment: Semifinal-implementation-added.patch Sorry about the late reply. Please note that the code can still be optimized in many places; I'll keep maintaining it and keep an ear open for all suggestions.
Re: You are invited to Apache Mahout meet-up
Is the presentation going to be uploaded to YouTube or Slideshare? Sorry I cannot be there. On 13-08-22 08:46 AM, Yexi Jiang wrote: A great event. I wish I were in the Bay Area. 2013/8/22 Shannon Quinn squ...@gatech.edu I'm only sorry I'm not in the Bay Area. Sounds great! On 8/22/13 3:38 AM, Stevo Slavić wrote: Retweeted the meetup invite. Have fun! Kind regards, Stevo Slavic. On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com wrote: Very cool. Would love to see folks turn out for this. On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman b.ellen.fried...@gmail.com wrote: The Apache Mahout user group has been re-activated. If you are in the Bay Area in California, join us on Aug 27 (Redwood City). Sebastian Schelter will be the main speaker, talking about new directions with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there to do a short introduction for the meet-up and an update on the 0.8 release. Here's the link to RSVP: http://bit.ly/16K32hg Hope you can come, and please spread the word. Ellen
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742714#comment-13742714 ] Peng Cheng commented on MAHOUT-1286: Hi Dr Dunning, Much appreciated. I watched your speech in Berlin on YouTube and finally have a clue about what is going on here. If I understand correctly, the core concept is to use Solr as a sparse matrix multiplier, so theoretically it can encapsulate any recommendation engine (not necessarily CF) whose recommendation phase can be cast as a linear multiplication. The co-occurrence matrix is one instance; other types of recommendation are possible but slightly harder, sometimes requiring multiple queries. The following cases should cover most classical CF instances: 1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix and Sim() is the item-to-item similarity matrix between all pairs of items): this is the easiest and has already been addressed in your speech: calculate Sim(A,A) beforehand, import it into Solr, and run a query ranked by weighted frequency. 2. User-based CF (result = A^T * Sim(A,h), where Sim() is the user-to-user similarity vector between the new user and all old users): slightly more complex; run the first query on A ranked by the customized similarity function, then use its result to run the second query on A^T ranked by weighted frequency. 3. SVD-based CF: no can do if the new user is not known beforehand; AFAIK Solr doesn't have any form of matrix pseudoinversion or optimization function, so determining a new user's projection in the SV subspace is impossible given only its dot products with some old items. However, if the user in question is old, or the new user can be merged into the model in real time, Solr can just look up its vector in the SV subspace by a full-match search. 4. Ensemble: obviously another linear operation; it can be interpreted as a query with a mixed ranking function, or as multiple queries. Multi-model recommendation, as a juxtaposition of rating matrices (A_1 | A_2), was never a problem either, using old-style CF or recommendation-as-search. Judging by the sheer performance and scalability of Solr, this could potentially make recommendation-as-search the superior option. However, as Gokhan inferred, we will likely still use the old algorithms for training, but Solr for recommendation. So I'm going back to MAHOUT-1274 anyway, using the posted DataModel as temporary glue. It won't be hard for me or anybody else to refactor it for the Solr interface. -Yours Peng
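Case 1 above reduces to a single boosted query; a sketch of how the multiplication Sim(A,A) * h could be rendered as a Solr query string, assuming a hypothetical 'indicators' field holding each item's co-occurrence bag and item IDs that need no escaping:

```java
import java.util.Map;

/**
 * Sketch of item-based CF as search: h maps the new user's rated item IDs
 * to ratings, and each rated item contributes a boosted term against the
 * (hypothetical) "indicators" field.
 */
public final class ItemBasedQuery {
  public static String fromHistory(Map<String, Float> h) {
    StringBuilder q = new StringBuilder("indicators:(");
    boolean first = true;
    for (Map.Entry<String, Float> e : h.entrySet()) {
      if (!first) {
        q.append(' ');
      }
      q.append(e.getKey()).append('^').append(e.getValue()); // term^boost
      first = false;
    }
    return q.append(')').toString();
  }
}
```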
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736962#comment-13736962 ] Peng Cheng commented on MAHOUT-1286: The idea of ArrayMap has been discarded due to its impractical time consumption for insertion (O(n) for a batch insertion) and query (O(log n)). I have moved back to HashMap. For the same reason, I suspect that using a sparse row/column matrix would have the same problem.
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1286: Attachment: InMemoryDataModelTest.java, InMemoryDataModel.java See the uploaded files for details.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736992#comment-13736992 ] Peng Cheng commented on MAHOUT-1286: Here is my final solution after numerous experiments: a combination of double hashing for storing user/item IDs, and 2-d hopscotch hashing (http://mcg.cs.tau.ac.il/papers/disc2008-hopscotch.pdf) for storing preferences as a map keyed by the user/item indices in the double-hashing table. Hopscotch hashing maintains strong locality and a high load factor, and each dimension uses an independent hash function; as a result, it can quickly extract a submatrix or a single row or column. This is the smallest implementation I can think of; apparently only a bloom map could achieve a smaller memory footprint, but it has many other problems.
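The double-hashing half of that scheme, reduced to a long-to-index table, could look like the sketch below (hypothetical class: fixed capacity, which should be prime so the probe sequence cycles through every slot, no deletion or rehashing, load kept below 1, IDs never equal to the sentinel; the hopscotch 2-d preference table is omitted):

```java
import java.util.Arrays;

/** Minimal double-hashing open-addressing table from long IDs to positions. */
public final class LongIndexTable {
  private static final long EMPTY = Long.MIN_VALUE; // reserved sentinel
  private final long[] keys;
  private final int[] indices;

  public LongIndexTable(int primeCapacity) {
    keys = new long[primeCapacity];
    indices = new int[primeCapacity];
    Arrays.fill(keys, EMPTY);
  }

  private int hash1(long k) {
    return (int) ((k ^ (k >>> 32)) & 0x7fffffffL) % keys.length;
  }

  // the step hash must be non-zero; with a prime capacity any step works
  private int hash2(long k) {
    return 1 + (int) ((k >>> 17) & 0x7fffffffL) % (keys.length - 1);
  }

  public void put(long key, int index) {
    int slot = hash1(key);
    int step = hash2(key);
    while (keys[slot] != EMPTY && keys[slot] != key) {
      slot = (slot + step) % keys.length;      // probe by the second hash
    }
    keys[slot] = key;
    indices[slot] = index;
  }

  public int get(long key) {
    int slot = hash1(key);
    int step = hash2(key);
    while (keys[slot] != EMPTY) {
      if (keys[slot] == key) {
        return indices[slot];
      }
      slot = (slot + step) % keys.length;
    }
    return -1;                                  // not present
  }
}
```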
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1286: Fix Version/s: 0.9 Labels: collaborative-filtering datamodel patch recommender (was: ) Status: Patch Available (was: Open) According to my test, it can load the entire Netflix dataset into memory using only 3G of heap space.
[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737019#comment-13737019 ] Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:44 PM: Hi Dr Dunning, Indeed both Gokhan and I have experimented with that, but I've run into some difficulties, namely: 1) a columnar form doesn't support fast extraction of rows, yet a DataModel should allow quick getPreferencesFromUser() and getPreferencesForItem(); 2) a columnar form doesn't support fast online updates (time complexity is O(n), at best O(log n) if using block copy and the columns are sorted); 3) to create such a DataModel we need to initialize a HashMap first, which uses twice as much heap space during initialization and could defeat the purpose. I'm not sure if Gokhan has encountered the same problems; I haven't heard from him for some time. The search-based recommender is indeed a very tempting solution; I'm quite sure it is an across-the-board improvement for similarity-based recommenders. But low-rank matrix-factorization-based ones should merge preferences from new users into the prediction model immediately; of course you can just project a new user into the low-rank subspace, but that reduces performance a little. I'm not sure how well Lucene supports online updates of indices, but according to the people I'm working with, online recommenders seem to be in demand these days.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737023#comment-13737023 ] Peng Cheng commented on MAHOUT-1286: Well, I mean, I partially agree that the effort I spent on this probably won't pay off, as few will use an in-memory/file DataModel in production; most will choose a database-backed one. I just try to solve it because it's a blocker.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737553#comment-13737553 ] Peng Cheng commented on MAHOUT-1286: Hi gentlemen, Thanks a lot for proving my point, Gokhan; yeah, I mean either user or item preference extraction can be fast, but not both. Sorry, I should have proposed it in our last hangout, but I missed the invitation :-( But I tried to understand your proposal on recommendation-as-search. From what I heard on YouTube, the new architecture is proposed as an easier and faster replacement for all existing recommenders that take a DataModel. Each item is a weighted 'bag of words' generated by co-occurrence analysis/item similarity on previous ratings; a new user's ratings are converted into a weighted tuple of existing words and matched against the items with the highest sum of hits. My concerns are: 1) does it support all types of recommenders and their ensembles? I know modern search engines like Google and Yandex have fairly complex ensemble search and ranking algorithms that look similar to an ensemble recommender, but IMHO Lucene is built only for text search, and I'm not sure to what extent it is customizable. 2) does it support online learning? This feature is most important to SVDRecommender, as a new user's recommendation is only known once this user is merged into the model. (Of course, an option is to project a new user into the user subspace by minimizing its distance given its dot products with existing items, but nobody has tested its performance before.)
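The projection option mentioned in (2) is the classical SVD fold-in, u = Sigma^{-1} V^T r restricted to the items the new user rated; a sketch with illustrative names (V is numItems x k, sigma holds the k singular values; nothing here claims to match SVDRecommender's internals):

```java
import java.util.Map;

/** Fold a new user's ratings into an existing SVD model. */
public final class FoldIn {
  public static double[] foldInUser(double[][] V, double[] sigma,
                                    Map<Integer, Double> ratings) {
    int k = sigma.length;
    double[] u = new double[k];
    for (Map.Entry<Integer, Double> e : ratings.entrySet()) {
      double r = e.getValue();
      double[] itemFactors = V[e.getKey()];
      for (int f = 0; f < k; f++) {
        u[f] += r * itemFactors[f];   // accumulate V^T r over rated items
      }
    }
    for (int f = 0; f < k; f++) {
      u[f] /= sigma[f];               // scale by the inverse singular values
    }
    return u;
  }
}
```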
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737563#comment-13737563 ] Peng Cheng commented on MAHOUT-1286: Also, please note that the first patch is still not optimized to the extreme; many improvements can be made to make it smaller and faster (see the TODO list in the code). But I'm trying to get back to MAHOUT-1274; if we expect large-scale refactoring of all recommenders in favor of recommendation-as-search, I'll have to suspend it until the refactoring is finished. I'm waiting online for Dr Dunning's plan.
Re: apache-math dependency
Apologies, I mistook apache-math for mahout-math and didn't know what I was talking about :) On 13-08-12 07:08 PM, Ted Dunning wrote: Yes. Apache Math linear algebra is very difficult for us to use because their matrices are non-extensible. But there is actually quite a lot of code to do with random distributions, optimization and quadrature. Those are much more likely to be useful to us. On Mon, Aug 12, 2013 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: The larger part of mahout-math is linear algebra, which is currently broken for the sparse part of the equation and which we don't use at all. One part of the problem is that our use for that library is always a fringe case and, as far as I can tell, will always continue to be such. Another part of the problem is that keeping the dependency will invite bypassing Mahout's solvers and, as a result, architectural inconsistency. That said, I guess Ted's argument (which is mainly cost, as I gathered) trumps the two above. On Mon, Aug 12, 2013 at 3:20 PM, Peng Cheng pc...@uowmail.edu.au wrote: Seriously, I would prefer the dependency as a good architectural pattern. It encourages other people to use/contribute to it to avoid repetitive work. On 13-08-12 06:16 PM, Ted Dunning wrote: I am fine with it staying. On Mon, Aug 12, 2013 at 3:14 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: So you are ok with the apache-math dependency staying? On Mon, Aug 12, 2013 at 3:09 PM, Ted Dunning ted.dunn...@gmail.com wrote: So I checked on these. The non-trivial issues with replacing Commons Math include: - Poisson and negative binomial distributions. This would be several hours' work to write and test (we have a Colt-inherited negative binomial distribution, but it takes no longer to write a new one than to test an old one). - Random number generators. This is about an hour or two of work to pull the MersenneTwister implementation into our code. - Next-prime-number finder. Not a big deal to replicate, but it would take a few hours to do. - Quadrature. We use an adaptive integration routine to check distribution properties. This, again, would take a few hours to replace. I really don't see the benefit of this work. On Mon, Aug 12, 2013 at 2:53 PM, Ted Dunning ted.dunn...@gmail.com wrote: 2 distribution.PoissonDistribution; 2 distribution.PascalDistribution; 2 distribution.NormalDistribution; 1 util.FastMath; 1 random.RandomGenerator; 1 random.MersenneTwister; 1 primes.Primes; 1 linear.RealMatrix; 1 linear.EigenDecomposition; 1 linear.Array2DRowRealMatrix; 1 distribution.RealDistribution; 1 distribution.IntegerDistribution; 1 analysis.integration.UnivariateIntegrator; 1 analysis.integration.RombergIntegrator; 1 analysis.UnivariateFunction;
Re: Hangout on Monday
Strange, I didn't see any invitation. On 13-08-05 06:54 PM, Ted Dunning wrote: Just sent invite to Mahout dev list. On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is for both. If you have g+ installed you can participate. If not, you can watch. On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org wrote: Is the link only for watching or also for participation? Never did a hangout before :) 2013/8/5 Andrew Musselman andrew.mussel...@gmail.com Can't make it alas On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang kuny...@stanford.edu wrote: what's the addr of the hangout? On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au wrote: Nice, I'll be there. On 13-08-03 02:51 PM, Andrew Musselman wrote: Sounds good On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. 1600 PDT I got that right in the linked doc, just not on the more important email. On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis andrew.psal...@webtrends.com wrote: On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote: Let's have the hangout at 1600 on Monday, August 5th. Maybe asking the obvious here so I apologize for the spam. The timezone is PDT, correct?
Re: Hangout on Monday
So buggy: the program acts as if I'm in the meeting (showing a push-to-talk button), but it doesn't do anything. On 13-08-05 08:02 PM, Ted Dunning wrote: Hangouts clearly do not work the way I thought they did. The URL that I sent out was for the archived version of the meeting.
Re: Hangout on Monday
Oh sorry, I figured out the problem: my Google+ uses my Gmail address as my account. I'll change that right away. On 13-08-05 08:16 PM, Ted Dunning wrote: Peng, It looks like you are not actually on Google Plus. I have you in my Mahout circle under your uow email address, but I am unable to add you to a hangout.
Re: Hangout on Monday
Nice, I'll be there. On 13-08-03 02:51 PM, Andrew Musselman wrote: Sounds good On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. 1600 PDT I got that right in the linked doc, just not on the more important email.
Re: [jira] [Created] (MAHOUT-1298) SparseRowMatrix,SparseColMatrix: optimize transpose()
+1, we have type conversion anyway. On 29/07/2013 6:40 PM, Sebastian Schelter wrote: +1 2013/7/29 Dmitriy Lyubimov (JIRA) j...@apache.org Dmitriy Lyubimov created MAHOUT-1298: Summary: SparseRowMatrix, SparseColMatrix: optimize transpose() Key: MAHOUT-1298 URL: https://issues.apache.org/jira/browse/MAHOUT-1298 Project: Mahout Issue Type: New Feature Components: Math Affects Versions: 0.8 Reporter: Dmitriy Lyubimov Assignee: Dmitriy Lyubimov Fix For: 0.9 These matrices lack an optimized transpose and rely on AbstractMatrix's O(mn) implementation, which is not cool for very sparse subblocks. The proposal is to implement a custom transpose with two things in mind: 1) the transpose of a row-sparse matrix should be a column-sparse matrix, and vice versa (not the default like() that the default implementation would take); 2) obviously, iterate only through the non-zero elements of all rows (columns).
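The gist of the ticket is walking only the non-zeros, so the cost drops from AbstractMatrix's O(mn) to O(nnz); a neutral sketch over a nested-map representation (illustrative only, not Mahout's actual SparseRowMatrix API):

```java
import java.util.HashMap;
import java.util.Map;

/** Transpose a row-sparse matrix, modeled as row -> (column -> value),
 *  by visiting only the stored non-zero entries. */
public final class SparseTranspose {
  public static Map<Integer, Map<Integer, Double>> transpose(
      Map<Integer, Map<Integer, Double>> rows) {
    Map<Integer, Map<Integer, Double>> cols =
        new HashMap<Integer, Map<Integer, Double>>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : rows.entrySet()) {
      int r = row.getKey();
      for (Map.Entry<Integer, Double> cell : row.getValue().entrySet()) {
        Map<Integer, Double> col = cols.get(cell.getKey());
        if (col == null) {
          col = new HashMap<Integer, Double>();
          cols.put(cell.getKey(), col);
        }
        col.put(r, cell.getValue());  // swap (row, col) for each non-zero
      }
    }
    return cols;
  }
}
```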
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717659#comment-13717659 ] Peng Cheng commented on MAHOUT-1286: Aye aye, I just did; it turns out that instances of PreferenceArray$PreferenceView have taken 1.7G. Quite unexpected, right? Thanks a lot for the advice. My next experiment will just use GenericPreference[] directly; there will be no more PreferenceArray.

Class Name | Objects | Shallow Heap | Retained Heap
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | >= 1,733,703,168
long[] | 480,199 | 818,209,680 | >= 818,209,680
float[] | 480,190 | 410,563,592 | >= 410,563,592
java.lang.Object[] | 18,230 | 361,525,488 | >= 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray | 480,189 | 15,366,048 | >= 1,237,456,672
java.util.ArrayList | 17,811 | 427,464 | >= 2,092,416,104
char[] | 2,150 | 272,632 | >= 272,632
byte[] | 141 | 54,048 | >= 54,048
java.lang.String | 2,119 | 50,856 | >= 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry | 673 | 21,536 | >= 38,104
java.net.URL | 229 | 14,656 | >= 40,720
java.util.HashMap$Entry | 344 | 11,008 | >= 68,760
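A back-of-envelope check of the dump, assuming a 64-bit JVM with compressed oops (each PreferenceView being an inner-class instance with an outer reference and an int index; the 24-byte figure is that assumption, not a measurement):

```java
/** Why 72 million small view objects cost 1.7 GB of shallow heap. */
public final class HeapMath {
  public static void main(String[] args) {
    // ~12-byte object header + 4-byte compressed reference to the outer
    // GenericUserPreferenceArray + 4-byte int index, padded to 24 bytes.
    long views = 72237632L;
    long shallowBytes = views * 24L;  // = 1,733,703,168, matching the dump
    System.out.println(shallowBytes + " bytes just for the view objects");
  }
}
```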
Re: [jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
That's exactly what I'm trying to do right now :) (I'm testing FastByIDArrayMap), but we probably have more problems than just the HashMap; based on the heap-dump analysis above, PreferenceArray will probably be our next target. This is awesome, as your FactorizablePreferences didn't use it in the first place. Yours Peng On 13-07-23 05:46 PM, Sebastian Schelter wrote: IMHO you will always have memory issues if you try to provide constant-time random access. That's why I proposed creating a special memory-efficient DataModel for sequential access.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715885#comment-13715885 ] Peng Cheng commented on MAHOUT-1286: On second thought, the hash map is very likely not the culprit for the poor memory efficiency here; apologies for the misinformation. The double-hashing algorithm in FastByIDMap, as described in Don Knuth's 'The Art of Computer Programming', has a default loadFactor of 1.5, which means the size of the array is only 1.5 times the number of keys. So theoretically the heap size of GenericDataModel should never exceed 3 times the size of FactorizablePreferences. I'm still very unclear about FastByIDMap's implementation, like how it handles deletion of entries, so I cannot tell whether my observation on Netflix is caused by GC (e.g. constructing new arrays too often), by deletion, or by extra space allocated for timestamps. We probably have to run Netflix in debug mode to identify the problem. I'll try to bring up this topic in the next hangout. Please give me some hints if you are an expert on those FastMap implementations.
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715906#comment-13715906 ] Peng Cheng commented on MAHOUT-1286: On the other hand, I'm trying to solve the problem by implementing FastByIDArrayMap, a slightly more compact Map implementation than FastByIDMap. It uses binary search to arrange all entries into a tight array, so its worst-case time complexity for get, put and delete is O(log n) (much slower than double hashing's average O(1)), but it has a (marginally) smaller memory footprint and faster iteration. It has no problem passing all unit tests, but its real performance can only be shown when embedded in FileDataModel; I'll post the result shortly. However, I don't feel this is the right direction: if Sean Owen did everything right in his FastByIDMap, then reducing the memory footprint to 0.66 times isn't worth the speed loss.
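The idea behind that experiment, sketched with illustrative names (keys kept sorted in a tight primitive array, values in a parallel array; get is O(log n) by binary search, while the put shown here pays the O(n) shift that makes insertion the slow part; no deletion shown):

```java
import java.util.Arrays;

/** A compact sorted-array map from long keys to float values. */
public final class SortedArrayMap {
  private long[] keys = new long[0];
  private float[] values = new float[0];

  public float get(long key) {
    int pos = Arrays.binarySearch(keys, key);
    return pos >= 0 ? values[pos] : Float.NaN;  // NaN signals "absent"
  }

  public void put(long key, float value) {
    int pos = Arrays.binarySearch(keys, key);
    if (pos >= 0) {
      values[pos] = value;                      // overwrite in place
      return;
    }
    int ins = -pos - 1;                         // insertion point
    long[] nk = new long[keys.length + 1];
    float[] nv = new float[values.length + 1];
    System.arraycopy(keys, 0, nk, 0, ins);
    System.arraycopy(values, 0, nv, 0, ins);
    nk[ins] = key;
    nv[ins] = value;
    System.arraycopy(keys, ins, nk, ins + 1, keys.length - ins);
    System.arraycopy(values, ins, nv, ins + 1, values.length - ins);
    keys = nk;
    values = nv;
  }
}
```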
[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715912#comment-13715912 ] Peng Cheng commented on MAHOUT-1286: Hi Sebastian, Gokhan, what do you think is causing the memory-efficiency problem? Do you think we should talk privately? I'm also interested in your experimentation results.
Re: Regarding Online Recommenders
Hi, Just one simple question: is the org.apache.mahout.math.BinarySearch.binarySearch() function an optimized version of Arrays.binarySearch()? If it is not, why implement it again? Yours Peng On 13-07-17 06:31 PM, Sebastian Schelter wrote: You are completely right, the simple interface would only be usable for readonly / batch-updatable recommenders. Online recommenders might need something different. I tried to widen the discussion here to cover all kinds of API changes in the recommenders that would be necessary in the future. 2013/7/17 Peng Cheng pc...@uowmail.edu.au One thing that suddenly comes to my mind is that, for a simple interface like FactorizablePreferences, maybe sequential READ in real time is possible, but sequential WRITE in O(1) time is utopian: you need to flush out the old preference with the same user and item IDs (in the worst case this could be an interpolation search), otherwise you are permitting a user to rate an item twice with different values. Considering how FileDataModel is supposed to work (new files flush old files), maybe using the simple interface has fewer advantages than we used to believe. On 13-07-17 04:58 PM, Sebastian Schelter wrote: Hi Peng, I never wanted to discard the old interface, I just wanted to split it up. I want to have a simple interface that only supports sequential access (and allows for very memory-efficient implementations, e.g. by the use of primitive arrays). DataModel should *extend* this interface and provide sequential and random access (basically what it already does). Then a recommender such as SGD could state that it only needs sequential access to the preferences, and you can either feed it a DataModel (so we don't break backwards compatibility) or a memory-efficient sequential-access thingy. Does that make sense for you? 2013/7/17 Peng Cheng pc...@uowmail.edu.au I see, OK, so we shouldn't use the old implementation. But I mean, the old interface doesn't have to be discarded. The discrepancy between your FactorizablePreferences and DataModel is that your model supports getPreferences(), which returns all preferences as an iterator, while DataModel supports a few old functions that return preferences for an individual user or item. My point is that it is not hard for each of them to implement what the other lacks: the old DataModel can implement getPreferences() just by a loop in the abstract class, and your new FactorizablePreferences can implement those old functions by a binary search that takes O(log n) time, or an interpolation search that takes O(log log n) time on average. The same goes for the online update. It would just be a matter of different speed and space, not a different interface standard; we could use old unit tests, old examples, old everything. And we would be more flexible in writing ensemble recommenders. Just a few thoughts, I'll have to validate the idea first before creating a new JIRA ticket. Yours Peng On 13-07-16 02:51 PM, Sebastian Schelter wrote: I completely agree, Netflix is less than one gigabyte in a smart representation; 12x more memory is a no-go. The techniques used in FactorizablePreferences allow a much more memory-efficient representation, tested on the KDD Music dataset, which is approx 2.5 times Netflix and fits into 3GB with that approach. 2013/7/16 Ted Dunning ted.dunn...@gmail.com Netflix is a small dataset. 12G for that seems quite excessive. Note also that this is before you have done any work. Ideally, 100 million observations should take 1GB.
On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote: The second idea is indeed splendid; we should separate the time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of the high memory consumption) with a list/linked list, it would simply degrade lookup to a linear search's O(n), not too bad either. The current DataModel is the result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it.
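Since random access over the compact representation hinges on the searches mentioned above, here is a small sketch of interpolation search over a sorted array of IDs (illustrative only; not the org.apache.mahout.math.BinarySearch code):

```java
// O(log log n) on average for roughly uniformly distributed keys,
// degrading to O(n) in the worst case.
public static int interpolationSearch(long[] a, long key) {
  int lo = 0;
  int hi = a.length - 1;
  while (lo <= hi && key >= a[lo] && key <= a[hi]) {
    if (a[lo] == a[hi]) {
      return a[lo] == key ? lo : -1;
    }
    // Probe proportionally to where key should fall between a[lo] and a[hi];
    // double arithmetic avoids long overflow on wide ID ranges.
    int mid = lo + (int) ((double) (key - a[lo]) / (a[hi] - a[lo]) * (hi - lo));
    if (a[mid] == key) {
      return mid;
    } else if (a[mid] < key) {
      lo = mid + 1;
    } else {
      hi = mid - 1;
    }
  }
  return -1; // not present
}
```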
Re: Regarding Online Recommenders
For a low-rank matrix-factorization-based recommender, a new preference is not stored as such; it is modeled as the dot product of two vectors in the low-dimensional space, so it needs no projection. The user and item vectors, however, may need to be projected into a lower-dimensional space, if and only if you want to reduce the rank of the preference matrix. The refactorization step in SGD is super fast--that's the charm of SGD. So, yes, we will refactorize on every update. Yours Peng On 13-07-18 11:34 AM, Pat Ferrel wrote: On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote: Hi Pat, please see my response inline. Best, Gokhan On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote: May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is still calculated offline in batch mode, then there are periodic updates? How are the updates handled? If you are referring to the recommender under discussion here, no, updating the model can be done with a single preference, using stochastic gradient descent, by updating the particular user and item factors simultaneously. Aren't there two different things needed to truly update the model: 1) add the new preference to the lower-dimensional space, 2) refactorize all the preferences? #2 only needs to be done periodically--afaik. #1 would be super fast and could be done at runtime. Am I wrong, or are you planning to incrementally refactorize the entire preference array with every new preference?
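For concreteness, a minimal sketch of the single-preference SGD step being discussed (plain regularized matrix factorization; a hedged illustration, not Mahout's exact implementation):

```java
// One SGD step on a single (user, item, rating) observation.
static void sgdUpdate(double[] userVec, double[] itemVec,
                      double rating, double learningRate, double lambda) {
  // predicted preference = dot product of the two latent vectors
  double prediction = 0;
  for (int k = 0; k < userVec.length; k++) {
    prediction += userVec[k] * itemVec[k];
  }
  double err = rating - prediction;
  // simultaneous regularized gradient step on the user and item factors
  for (int k = 0; k < userVec.length; k++) {
    double u = userVec[k];
    double v = itemVec[k];
    userVec[k] = u + learningRate * (err * v - lambda * u);
    itemVec[k] = v + learningRate * (err * u - lambda * v);
  }
}
```

This is why a single new preference can update the model cheaply: only the two touched factor vectors change, nothing else is refactorized.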
Re: Regarding Online Recommenders
If I remember right, a highlight of the 0.8 release is an online clustering algorithm. I'm not sure if it can be used in an item-based recommender, but this is definitely something I would like to pursue. It's probably the only advantage a non-hadoop implementation can offer in the future. Many non-hadoop recommenders are pretty fast, but the existing in-memory GenericDataModel and FileDataModel are largely implemented for sandboxes; IMHO they are the culprit of the scalability problem. May I ask about the scale of your dataset? How many ratings does it have? Yours Peng On 13-07-18 12:14 PM, Sebastian Schelter wrote: Well, with itembased the only problem is new items. New users can immediately be served by the model (although this is not well supported by the API in Mahout). For the majority of usecases I saw, it is perfectly fine to have a short delay until new items enter the recommender; usually this happens after a retraining in batch. You have to care for cold-start and collect some interactions anyway. 2013/7/18 Pat Ferrel pat.fer...@gmail.com Yes, what Myrrix does is good. My last aside was a wish for an item-based online recommender, not only factorized. Ted talks about using Solr for this, which we're experimenting with alongside Myrrix. I suspect Solr works but it does require a bit of tinkering and doesn't have quite the same set of options--no llr similarity for instance. On the same subject I recently attended a workshop in Seattle for UAI2013 where Walmart reported similar results using a factorized recommender. They had to increase the factor number past where it would perform well. Along the way they saw increasing performance measuring precision offline. They eventually gave up on a factorized solution. This decision seems odd but anyway… In the case of Walmart and our data set they are quite diverse. The best idea is probably to create different recommenders for separate parts of the catalog, but if you create one model on all items our intuition is that item-based works better than factorized. Again caveat--no A/B tests to support this yet. Doing an online item-based recommender would quickly run into scaling problems, no? We put together the simple Mahout in-memory version and it could not really handle more than a down-sampled few months of our data. Down-sampling lost us 20% of our precision scores, so we moved to the hadoop version. Now we have use-cases for an online recommender that handles anonymous new users, and that takes the story full circle. On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote: Hi Pat, I think we should provide simple support for recommending to anonymous users. We should have a method recommendToAnonymous() that takes a PreferenceArray as argument. For itembased recommenders, it's straightforward to compute recommendations; for userbased, you have to search through all users once; for latent factor models, you have to fold the user vector into the low-dimensional space. I think Sean already added this method in myrrix, and I have some code for my kornakapi project (a simple weblayer for mahout). Would such a method fit your needs? Best, Sebastian 2013/7/17 Pat Ferrel pat.fer...@gmail.com May I ask how you plan to support model updates and 'anonymous' users? I assume the latent factors model is calculated offline still in batch mode, then there are periodic updates? How are the updates handled? Do you plan to require batch model refactorization for any update? Or perform some partial update by maybe just transforming new data into the LF space already in place, then doing a full refactorization every so often in batch mode? By 'anonymous users' I mean users with some history that is not yet incorporated in the LF model. This could be history from a new user asked to pick a few items to start the rec process, or an old user with some new action history not yet in the model. Are you going to allow for passing the entire history vector or userID+incremental new history to the recommender? I hope so. For what it's worth, we did a comparison of Mahout item-based CF to Mahout ALS-WR CF on 2.5M users and 500K items with many millions of actions over 6 months of data. The data was purchase data from a diverse ecom source with a large variety of products, from electronics to clothes. We found item-based CF did far better than ALS. As we increased the number of latent factors the results got better, but were never within 10% of item-based (we used MAP as the offline metric). Not sure why, but maybe it has to do with the diversity of the item types. I understand that a full item-based online recommender has very different tradeoffs, and anyway others may not have seen this disparity of results. Furthermore, we don't have A/B test results yet to validate the offline metric. On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote: Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored.
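The "fold the user vector into the low-dimensional space" step Sebastian mentions for anonymous users can be sketched as a few cheap SGD passes against fixed item factors. A hedged illustration under assumed names (the method, map layout and parameters are hypothetical, not Mahout or Myrrix API):

```java
import java.util.Map;

final class FoldInSketch {
  // Solve for a temporary user vector against frozen item factors.
  static double[] foldInAnonymous(long[] itemIds, float[] ratings,
                                  Map<Long, double[]> itemFactors,
                                  int rank, int sweeps, double lr, double lambda) {
    double[] u = new double[rank]; // start from zero (or small random noise)
    for (int s = 0; s < sweeps; s++) {
      for (int i = 0; i < itemIds.length; i++) {
        double[] v = itemFactors.get(itemIds[i]);
        if (v == null) {
          continue; // skip items unseen at training time
        }
        double pred = 0;
        for (int k = 0; k < rank; k++) {
          pred += u[k] * v[k];
        }
        double err = ratings[i] - pred;
        for (int k = 0; k < rank; k++) {
          u[k] += lr * (err * v[k] - lambda * u[k]); // item factors stay fixed
        }
      }
    }
    return u; // score candidate items by the dot product with their factors
  }
}
```

This supports both of Pat's cases: a brand-new user with a few picked items, or an old user's incremental history, since only the passed-in history matters.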
Re: Regarding Online Recommenders
Strange, it's just a little larger than the libimseti dataset (17M ratings). Did you encounter an OutOfMemoryError or a 'GC overhead limit exceeded' error? Allocating more heap space usually helps. Yours Peng On 13-07-18 02:27 PM, Pat Ferrel wrote: It was about 2.5M users and 500K items with 25M actions over 6 months of data.
Re: Regarding Online Recommenders
I see, sorry, I was too presumptuous. I have only recently worked on and tested SVDRecommender, so I couldn't have known its efficiency relative to an item-based recommender. Maybe there is space for algorithmic optimization. The online recommender Gokhan is working on is also an SVDRecommender. An online user-based or item-based recommender based on a clustering technique would definitely be critical, but we need an expert to volunteer :) Perhaps Dr Dunning can have a few words? He announced the online clustering component. Yours Peng On 13-07-18 03:54 PM, Pat Ferrel wrote: No, it was CPU bound, not memory. I gave it something like 14G heap. It was running, just too slow to be of any real use. We switched to the hadoop version and stored precalculated recs in a db for every user.
Re: Regarding Online Recommenders
Wow, that's lightning fast. Is it a SparseMatrix or DenseMatrix? On 13-07-18 07:23 PM, Gokhan Capan wrote: I just started to implement a Matrix-backed data model and pushed it, to check the performance and memory considerations. I believe I can try it on some data tomorrow. Best Gokhan
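For flavor, a hypothetical sketch of what a matrix-backed preference store could look like: long user/item IDs mapped to dense indices, ratings held in per-user primitive arrays (a sparse row layout). This is an illustration of the general idea, not Gokhan's actual code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class MatrixBackedPreferences {
  private final Map<Long, Integer> userIndex = new HashMap<>();
  private final Map<Long, Integer> itemIndex = new HashMap<>();
  // row u holds the (item index, rating) pairs of user u, in insertion order
  private final List<int[]> rowItems = new ArrayList<>();
  private final List<float[]> rowRatings = new ArrayList<>();

  // Assign the next dense index to an unseen ID.
  private static int indexOf(Map<Long, Integer> dict, long id) {
    return dict.computeIfAbsent(id, k -> dict.size());
  }

  public void add(long userID, long itemID, float rating) {
    int u = indexOf(userIndex, userID);
    int i = indexOf(itemIndex, itemID);
    while (rowItems.size() <= u) { // grow the row list lazily
      rowItems.add(new int[0]);
      rowRatings.add(new float[0]);
    }
    int[] items = Arrays.copyOf(rowItems.get(u), rowItems.get(u).length + 1);
    float[] ratings = Arrays.copyOf(rowRatings.get(u), rowRatings.get(u).length + 1);
    items[items.length - 1] = i;
    ratings[ratings.length - 1] = rating;
    rowItems.set(u, items);
    rowRatings.set(u, ratings);
  }
}
```

The sparse-row choice matters here: a dense 2.5M x 500K matrix would be hopeless, while per-user primitive arrays cost roughly one int plus one float per rating.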
Re: Regarding Online Recommenders
Mmm, you are right, the simplest solution is usually the best. I'm creating the new JIRA ticket. Yours Peng On 13-07-17 04:58 PM, Sebastian Schelter wrote: Hi Peng, I never wanted to discard the old interface, I just wanted to split it up.
Re: Regarding Online Recommenders
Awesome! Your reinforcements are highly appreciated. On 13-07-17 01:29 AM, Abhishek Sharma wrote: Sorry to interrupt guys, but I just wanted to bring to your notice that I am also interested in contributing to this idea. I am planning to participate in the ASF-ICFOSS mentorship programme (https://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme), which is very similar to GSoC. I have strong concepts in machine learning (I have done the ML course by Andrew Ng on Coursera), and I am good at programming (2.5 yrs of work experience). I am not really sure how I can approach this problem (but I do have a strong interest in working on it), hence I would like to pair up on this. I am currently working as a research intern at the Indian Institute of Science (IISc), Bangalore, India, and can put in 15-20 hrs per week. Please let me know your thoughts on whether I can be a part of this. Thanks Regards, Abhishek Sharma http://www.linkedin.com/in/abhi21 https://github.com/abhi21 On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote: Peng, This is the reason I separated out the DataModel, and only put the learner stuff there. The learner I mentioned yesterday just stores the parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care where preferences are stored. I, kind of, agree with the multi-level DataModel approach: one for iterating over all preferences, and one for when one wants to deploy a recommender and perform a lot of top-N recommendation tasks. (Or one DataModel with a strategy that might reduce existing memory consumption while still providing fast access, I am not sure. Let me try a matrix-backed DataModel approach.) Gokhan
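Gokhan's point about the learner's footprint is easy to make concrete: the model is just the two factor matrices, independent of where the preferences live. A tiny sketch under hypothetical names:

```java
// (noOfUsers + noOfItems) * noOfLatentFactors doubles in total; the learner
// needs nothing else, so preference storage is a separate concern.
public final class FactorParameters {
  final double[][] userFactors;
  final double[][] itemFactors;

  FactorParameters(int noOfUsers, int noOfItems, int noOfLatentFactors) {
    userFactors = new double[noOfUsers][noOfLatentFactors];
    itemFactors = new double[noOfItems][noOfLatentFactors];
  }
}
```

For Netflix-scale numbers (roughly 500K users, 18K items, rank 16) that is about 8M doubles, on the order of 66MB, which is why the parameters themselves are never the memory problem.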
[jira] [Created] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and iteration
Peng Cheng created MAHOUT-1286: -- Summary: Memory-efficient DataModel, supporting fast online updates and iteration Key: MAHOUT-1286 URL: https://issues.apache.org/jira/browse/MAHOUT-1286 Project: Mahout Issue Type: Improvement Components: Collaborative Filtering Affects Versions: 0.9 Reporter: Peng Cheng Assignee: Sean Owen Most DataModel implementations in the current CF component use hash maps to enable fast 2D indexing and updates. This is not memory-efficient for big datasets; e.g., the Netflix prize dataset takes 11G of heap space as a FileDataModel. An improved implementation of DataModel should use a more compact data structure (like arrays); this can trade a little time complexity in 2D indexing for a vast improvement in memory efficiency. In addition, any online recommender or online-to-batch converted recommender will not be affected by this in the training process.
[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration
[ https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1286: --- Summary: Memory-efficient DataModel, supporting fast online updates and element-wise iteration (was: Memory-efficient DataModel, supporting fast online updates and iteration)
Re: Regarding Online Recommenders
Yeah, setPreference() and removePreference() shouldn't be there, but injecting the Recommender back into the DataModel is kind of a strong dependency, which may intermingle components with different concerns. Maybe we can do something to the RefreshHelper class? E.g., push something into a swap field so the downstream of a refreshable chain can read it out. I have read Gokhan's UpdateAwareDataModel, and feel that it's probably too heavyweight for a model selector, as every time he changes the algorithm he has to re-register it. The second idea is indeed splendid; we should separate the time-complexity-first and space-complexity-first implementations. What I'm not quite sure about is whether we really need to create two interfaces instead of one. Personally, I think 12G heap space is not that high, right? Most new laptops can already handle that (emphasis on laptop). And if we replace the hash map (the culprit of the high memory consumption) with a list/linked list, it would simply degrade lookup to a linear search's O(n), not too bad either. The current DataModel is the result of careful thought and has undergone extensive testing; it is easier to expand on top of it than to subvert it. All the best, Yours Peng On 13-07-16 01:05 AM, Sebastian Schelter wrote: Hi Gokhan, I like your proposals and I think this is an important discussion. Peng is also interested in working on online recommenders, so we should try to team up our efforts. I'd like to extend the discussion a little to related API changes that I think are necessary. What do you think about completely removing the setPreference() and removePreference() methods from Recommender? I think they don't belong there for two reasons: first, they duplicate functionality from DataModel, and second, a lot of recommenders are read-only/train-once and cannot handle single preference updates anyway. I think we should have a DataModel implementation that can be updated, and an online learning recommender should be able to register to be notified of updates. We should furthermore split up the DataModel interface into a hierarchy of three parts: First, a simple readonly interface that allows sequential access to the data (similar to FactorizablePreferences). This allows us to create memory-efficient implementations; e.g., Cheng reported in MAHOUT-1272 that the current DataModel needs 12GB of heap for the Netflix dataset (100M ratings), which is unacceptable. I was able to fit the KDD Music dataset (250M ratings) into 3GB with FactorizablePreferences. The second interface would extend the readonly interface and should resemble what DataModel is today: an easy-to-use in-memory implementation that trades high memory consumption for convenient random access. And finally, the third interface would extend the second and provide tooling for online updates of the data. What do you think of that? Does it sound reasonable? --sebastian The DataModel I imagine would follow the current API, where the underlying preference storage is replaced with a matrix. A Recommender would then use the DataModel and the OnlineLearner, where Recommender#setPreference is delegated to DataModel#setPreference (like it does now), and DataModel#setPreference triggers OnlineLearner#train.
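A hedged sketch of the three-level hierarchy proposed above; the interface names are hypothetical, with Preference and PreferenceArray standing in for the existing Mahout taste types:

```java
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.model.PreferenceArray;

// Level 1: readonly, sequential access; allows primitive-array backing.
interface SequentialPreferences {
  Iterable<Preference> getPreferences();
  int numUsers();
  int numItems();
}

// Level 2: convenient random access, roughly what DataModel offers today.
interface RandomAccessPreferences extends SequentialPreferences {
  PreferenceArray getPreferencesFromUser(long userID);
  PreferenceArray getPreferencesForItem(long itemID);
}

// Level 3: online updates; a listener hook is one way an online learner
// could be notified (e.g. to trigger OnlineLearner#train).
interface UpdatablePreferences extends RandomAccessPreferences {
  void setPreference(long userID, long itemID, float value);
  void removePreference(long userID, long itemID);
  void addChangeListener(Runnable onUpdate);
}
```

An SGD-style recommender would then only require level 1, while existing neighborhood recommenders keep levels 2 or 3, which is exactly the backwards-compatibility story described above.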
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830 ] Peng Cheng commented on MAHOUT-1272: Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti is a Czech dating website. This dataset has been used in a live example described in the book 'Mahout in Action', page 71, written by a few guys hanging around this site. Parameters: private final static double lambda = 0.1; private final static int rank = 16; private static int numALSIterations=5; private static int numEpochs=20; double randomNoise=0.02; double learningRate=0.01; double learningDecayRate=1; Result (using average absolute difference; the rating is based on a 1-10 scale): INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 time spent: 41.24s=== (it should be noted that the number of ALS iterations is much smaller than the others', which leads to a suboptimal result, but that is not the point of this test) Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 time spent: 118.188s=== Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 1.2798905733917445 time spent: 21.806s This is already the best result I can get; the original book claims a best result of 1.12 on this dataset, which I never achieved. If you have also experimented and found a better parameter set, please post it here. Parallel SGD matrix factorizer for SVDrecommender - Key: MAHOUT-1272 URL: https://issues.apache.org/jira/browse/MAHOUT-1272 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Peng Cheng Assignee: Sean Owen Labels: features, patch, test Fix For: 0.8 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java Original Estimate: 336h Remaining Estimate: 336h A parallel factorizer based on MAHOUT-1089 may achieve better performance on multicore processors. The existing code is single-threaded and perhaps may still be outperformed by the default ALS-WR. In addition, its hardcoded online-to-batch conversion prevents it from being used by an online recommender. An online SGD implementation may help build a high-performance online recommender as a replacement for the outdated slope-one. The new factorizer can implement either DSGD (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf). Related discussion has been carried on for a while but remains inconclusive: http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl
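For context, numbers like those above come from an evaluator run roughly along these lines, using the Mahout 0.8 taste APIs; the file path and the split/parameter values here are illustrative assumptions, not the attached runner's exact code:

```java
import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class LibimsetiEvalSketch {
  public static void main(String[] args) throws Exception {
    // illustrative path: a userID,itemID,rating CSV of the libimseti data
    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    // train on 90% of each user's ratings, score on the rest;
    // rank 16, lambda 0.1, 5 ALS iterations as in the comment above
    double score = evaluator.evaluate(
        dm -> new SVDRecommender(dm, new ALSWRFactorizer(dm, 16, 0.1, 5)),
        null, model, 0.9, 1.0);
    System.out.println("average absolute difference: " + score);
  }
}
```

Swapping in RatingSGDFactorizer or ParallelSGDFactorizer in the builder lambda reproduces the comparison being reported.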
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1272: --- Attachment: libimsetiSVDRecomenderEvaluatorRunner.java Here is the component for testing on the libimseti dataset.
[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830 ] Peng Cheng edited comment on MAHOUT-1272 at 7/13/13 8:57 PM: - Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti is a Czech dating website. This dataset has been used in a live example described in the book 'Mahout in Action', page 71, written by a few guys hanging around this site. Parameters: private final static double lambda = 0.1; private final static int rank = 16; private static int numALSIterations=5; private static int numEpochs=20; (for ratingSGD) double randomNoise=0.02; double learningRate=0.01; double learningDecayRate=1; (for parallelSGD) double mu0=1; double decayFactor=1; int stepOffset=100; double forgettingExponent=-1; Result (using average absolute difference; the rating is based on a 1-10 scale): INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 time spent: 41.24s=== (it should be noted that the number of ALS iterations is much smaller than the others', which leads to a suboptimal result, but that is not the point of this test) Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 time spent: 118.188s=== Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 1.2798905733917445 time spent: 21.806s This is already the best result I can get; the original book claims a best result of 1.12 on this dataset, which I never achieved. If you have also experimented and found a better parameter set, please post it here.
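The four parallelSGD knobs above (mu0, decayFactor, stepOffset, forgettingExponent) suggest a decaying learning-rate schedule. The following is a hypothetical reconstruction for illustration only, NOT the verified ParallelSGDFactorizer formula:

```java
// ASSUMPTION: one plausible schedule combining the four knobs; with the
// defaults above (decayFactor=1, forgettingExponent=-1, stepOffset=100)
// it reduces to the Robbins-Monro-style rate mu0 / (stepOffset + step).
static double learningRate(double mu0, double decayFactor,
                           int stepOffset, double forgettingExponent, long step) {
  double geometric = Math.pow(decayFactor, step);           // exponential decay
  double polynomial = Math.pow(stepOffset + step, forgettingExponent); // power-law decay
  return mu0 * Math.min(geometric, polynomial);
}
```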
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1272: --- Attachment: NetflixRecomenderEvaluatorRunner.java A runnable component for testing ParallelSGDFactorizer on the Netflix training dataset (yeah, only the training set generated by NetflixDatasetConverter; I cannot get judging.txt for validation, but my purpose is just to test its efficiency at extreme scale, so whatever). Warning! To run it without danger you need to allocate at least 12G of heap space to the JVM using the following VM parameters: -Xms12288M -Xmx12288M. In addition, 16G+ RAM is MANDATORY, otherwise either garbage collection or swap will kill you (or both). I almost burned my laptop on this (it has only 8G RAM). As a result, I won't be able to post any result before I can get a better machine. But since its number of ratings is about 6 times that of the movielens-10m or libimseti datasets, and SGD scales linearly in this number, I estimate the running time to be between 2.5 and 3 minutes. I will be utmost obliged to anybody who can try it and post the result here (of course, if your machine can handle it). But obviously, as Sebastian has pointed out, our FileDataModel needs some serious optimization to handle such scale. Hey Sebastian, can you try this out in your lab? That would be most helpful.
[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707065#comment-13707065 ] Peng Cheng commented on MAHOUT-1274: Main component finished. The new factorizer and recommender can support adding new users and items, and update user/item vectors in only one GD step (this is very suboptimal, but I will improve this part very soon). But I don't know how to test it: the sandbox GenericDataModel doesn't support setPreference(...) and removePreference(...) yet (SlopeOneRecommenderTest doesn't test this part either). Could someone tell me if there is an alternative to avoid this problem? As Sebastian has foretold, now is not the best time to add support for an online recommender: the SlopeOneRecommender is half-dead, many dependencies are incomplete, and everybody's attention is drawn to the core-0.8 release. Regardless, I'll try to solve it myself and spend some time on other tickets. SGD-based Online SVD recommender Key: MAHOUT-1274 URL: https://issues.apache.org/jira/browse/MAHOUT-1274 Project: Mahout Issue Type: New Feature Components: Collaborative Filtering Reporter: Peng Cheng Assignee: Sean Owen Labels: collaborative-filtering, features, machine_learning, svd Original Estimate: 336h Remaining Estimate: 336h An online SVD recommender is otherwise similar to an offline SVD recommender except that, upon receiving one or several new preferences, it can add them into the training dataModel and update the result accordingly in real time. An online SVD recommender should override setPreference(...) and removePreference(...) in AbstractRecommender such that the factorization result is updated in O(1) time and without retraining. Right now the slopeOneRecommender is the only component possessing such a capability. Since SGD is intrinsically an online algorithm and its CF implementation is available in core-0.8 (see MAHOUT-1089, MAHOUT-1272), I presume it would be a good time to convert it. Such a feature could come in handy for some websites. Implementation: Adding new users or items, or increasing the rating matrix rank, just increases the size of the user and item matrices. Reducing the rating matrix rank involves just one SVD. The real challenge here is that SGD is NOT a one-pass algorithm: multiple passes are required to achieve acceptable optimality, even more so if the hyperparameters are bad. But here are two possible workarounds: 1. Use one-pass algorithms like averaged SGD; I'm not sure it can ever work, as applying a stochastic convex-optimization algorithm to a non-convex problem is anarchy, but it may be a long shot. 2. Run incomplete passes in each online update, using ratings randomly sampled (but not uniformly sampled) from the latest dataModel. I don't know exactly how this should be done, but new ratings should be sampled more frequently; uniform sampling would result in old ratings being used more than new ratings in total. If somebody has worked on this batch-to-online conversion before and can share his insight, that would be awesome. This seems to be the most viable option, if I can get the non-uniform pseudorandom generator that maintains the cumulative uniform distribution I want. I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, but it didn't pay off. Hopefully it's not a bad idea to submit a new ticket here. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
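The "one GD step per update" idea in this comment and ticket can be made concrete with a small sketch. Everything below is hypothetical illustration (plain HashMaps instead of Mahout types, invented class and field names); it shows why the per-preference cost is O(rank) and why new users and items just grow the factor maps:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of an online SVD updater: one SGD step per incoming
// rating, so a setPreference-style update costs O(rank) instead of a retrain.
public class OnlineSvdSketch {
  private final int rank;
  private final double mu, lambda;
  private final Map<Long, double[]> userFactors = new HashMap<>();
  private final Map<Long, double[]> itemFactors = new HashMap<>();
  private final Random rnd = new Random();

  public OnlineSvdSketch(int rank, double mu, double lambda) {
    this.rank = rank; this.mu = mu; this.lambda = lambda;
  }

  public void setPreference(long userID, long itemID, float value) {
    // New users/items just grow the factor maps, as the ticket suggests.
    double[] p = userFactors.computeIfAbsent(userID, id -> randomVector());
    double[] q = itemFactors.computeIfAbsent(itemID, id -> randomVector());
    double err = value - dot(p, q);
    for (int k = 0; k < rank; k++) {       // a single gradient-descent step
      double pk = p[k];
      p[k] += mu * (err * q[k] - lambda * pk);
      q[k] += mu * (err * pk - lambda * q[k]);
    }
  }

  public float estimatePreference(long userID, long itemID) {
    double[] p = userFactors.get(userID), q = itemFactors.get(itemID);
    return (p == null || q == null) ? Float.NaN : (float) dot(p, q);
  }

  private double[] randomVector() {
    double[] v = new double[rank];
    for (int k = 0; k < rank; k++) v[k] = 0.1 * rnd.nextGaussian();
    return v;
  }

  private static double dot(double[] a, double[] b) {
    double s = 0;
    for (int k = 0; k < a.length; k++) s += a[k] * b[k];
    return s;
  }
}
```

A real implementation would plug this into AbstractRecommender's setPreference(long, long, float) and also write the preference through to the DataModel, which is exactly the part the comment says GenericDataModel cannot do yet.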
[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707104#comment-13707104 ] Peng Cheng commented on MAHOUT-1274: Totally agree. I don't know about other DataModels, but the current GenericDataModel uses two maps of PreferenceArray, which is counterintuitive. I thought it could be a double FastByIDMap that allows O(1) random access, but I must have missed some other requirements. Haven't read FactorizablePreferences yet; thanks a lot for your advice.
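For what it's worth, the "double FastByIDMap" layout mentioned here would look roughly as follows. A sketch only (FastByIDMap is Mahout's long-keyed open-addressing map; a complete DataModel needs far more than this):

```java
import org.apache.mahout.cf.taste.impl.common.FastByIDMap;

// Sketch of the "double FastByIDMap" layout: O(1) expected random access to a
// single (user, item) preference, which setPreference/removePreference need.
public class InMemoryPrefs {
  private final FastByIDMap<FastByIDMap<Float>> prefs = new FastByIDMap<>();

  public void set(long userID, long itemID, float value) {
    FastByIDMap<Float> row = prefs.get(userID);
    if (row == null) {
      row = new FastByIDMap<>();
      prefs.put(userID, row);
    }
    row.put(itemID, value);
  }

  public void remove(long userID, long itemID) {
    FastByIDMap<Float> row = prefs.get(userID);
    if (row != null) {
      row.remove(itemID);
    }
  }

  public Float get(long userID, long itemID) {
    FastByIDMap<Float> row = prefs.get(userID);
    return row == null ? null : row.get(itemID);
  }
}
```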
Re: (Bi-)Weekly/Monthly Dev Sessions
Sorry I missed the meeting. I really wanted to listen to your discussion, but yesterday a thunderstorm cut off my electricity. On 13-07-08 08:29 PM, Andrew Musselman wrote: I'm getting an error when I build after doing svn up: $ mvn package [INFO] Scanning for projects... [ERROR] The build could not read 1 project - [Help 1] [ERROR] [ERROR] The project (/home/akm/mahout/pom.xml) has 1 error [ERROR] Non-readable POM /home/akm/mahout/pom.xml: no more data available - expected end tag </project> to close start tag <project> from line 2, parser stopped on END_TAG seen ...</reporting>\n</project>\n... @1030:1 But there's a </project> tag at the end of that... On Mon, Jul 8, 2013 at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote: Hmm, seems like that old link doesn't work. Here's a new one: https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en -Grant On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote: How about tomorrow (Monday) night at 8:30 pm EDT? Anyone who wants to join can browse to https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en If for some reason that doesn't work, ping me on IRC (gsingers) in the #mahout channel on Freenode. Agenda: 0.8 Release Testing -Grant On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Is today's Hangout happening? On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org wrote: Hi, One of the things we kicked around at Buzzwords was having a weekly/bi-weekly/monthly dev session via Google hangout (Drill does this with good success, I believe). Since we are so spread out, I thought I would throw out a Doodle (scheduling tool for those unfamiliar) to see what times work best for the majority of people interested in such a thing. Anyone is free to participate, but this is not a Q and A session; it is instead focused on writing code, fixing bugs, triaging JIRA, releasing, etc. If you are interested, please fill out http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time Zone since I did the poll!) I just grabbed a sampling of hours throughout the day. I also picked 1 week as being representative of this being on a repeating schedule. If none of the times work for you, but you are still interested, please respond here. I would imagine we would meet for 1-2 hours. Also, please reply with the frequency at which you would like to meet: [] Weekly [] Bi-weekly (every 2 weeks) [] Monthly My vote is every two weeks. -Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: 0.8 progress
Hi Sebastian, I'm sorry for the entirely noobish questions: where can I download the judging.txt ground truth set? (Netflix is pulling it off everywhere; so far I can only get the legacy trainingSet and qualifying.txt.) And how do I inject the ParallelALSFactorizationJob into a common recommender class? I was trying to reproduce your result (I own a small cluster), but don't even know where to start. The only related thing I found in mahout-examples is a format converter. Thanks a lot if you can give me a hint. - Yours Peng On 13-07-01 01:24 AM, Sebastian Schelter wrote: I successfully ran the ALS and cooccurrence-based recommenders on the Netflix dataset on a 26 machine cluster using Hadoop 1.0.4. --sebastian On 28.06.2013 21:31, Jake Mannix wrote: I can run LDA on Twitter's cluster, on both reuters and some real data, as well as LR/SGD. On Fri, Jun 28, 2013 at 11:51 AM, Grant Ingersoll gsing...@apache.org wrote: We really should setup a VM that we can run a couple of nodes (perhaps at ASF?) on that we can share w/ everyone that makes it easy to test our stuff on Hadoop for the specific version that we ship. On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote: Can someone (if you have time and experience) write a small shim to run all examples one after the other on a cluster and write up instructions on how to do it? Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org wrote: It's crucial that we retest everything on a real cluster before the release. I will do this for the recommenders code next week. --sebastian On 28.06.2013 14:03, Grant Ingersoll gsing...@apache.org wrote: I should have time next week to do the release, if we can get these knocked out. If not next week, the following. On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: 1. Could someone look at Mahout-1257? There is a patch that's been submitted but I am not sure if this has been superseded by Sean's against Mahout-1239. 2. Stevo, I am for fixing the findbugs excludes as part of 0.8 release, I see that the number of warnings has gone up over the last few builds. 3. I am more concerned about the cause of the mysterious cosmic rays that randomly fail unit tests (since we have moved to running parallel tests). I see that happening on my local repository too. From: Stevo Slavić ssla...@gmail.com To: dev@mahout.apache.org Sent: Friday, June 28, 2013 3:21 AM Subject: Re: 0.8 progress Well done team! Build is unstable, oscillates, IMO regardless of changes made. Judging from logs I suspect that some of the Jenkins nodes are not configured well, /tmp directory security related issues, and file size constraints. Could also be an issue with our tests. Javadoc was reported earlier not to be OK (not all modules in aggregated javadoc), and code quality reports are not working OK, e.g. findbugs doesn't respect excludes - plan to work on this during the weekend. Do we want to fix these before or after the 0.8 release? Kind regards, Stevo Slavić. On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com wrote: All Done Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc. On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com wrote: I sent the comments. The code is good. But without the matrix/vector input we can't ship it in the release. Hope Yiqun and Da Zhang can make those changes quickly. Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.
On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll gsing...@apache.org wrote: I see 1 issue left: MAHOUT-1214. It is assigned to Robin. Any chance we can finish this up this week? -Grant On Jun 23, 2013, at 9:26 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: Finally got to finishing up M-833, the changes can be reviewed at https://reviews.apache.org/r/11774/diff/3/. From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Tuesday, June 11, 2013 10:09 AM Subject: Re: 0.8 progress I pushed M-1030 and M-1233. If we can get M-833 and M-1214 in by Thursday, I can roll an RC on Thursday. -Grant On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org wrote: Down to 4 issues! I would say what they are, but JIRA is flaking out again. My instinct is that 1030 and 1233 can be pushed. Suneel has been working hard to get M-833 in. Not sure on M-1214, Robin? -G On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org wrote: On Jun 9, 2013, at 6:02 PM, Grant Ingersoll gsing...@apache.org wrote: M-1067 -- Dmitriy -- This is an enhancement, should we push? Looks like this was committed already. Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175 ] Peng Cheng commented on MAHOUT-1272: Hey Sebastian, Hudson, Thank you so much for pushing things that hard. I owe you one. Testing on the Netflix dataset has encountered some trouble, namely, I don't know where to download it :-. Great appreciation for anyone who can share his judging.txt. In the meantime I'll try more grouplens data. Since Sebastian has taken over the code, new test cases will only be posted as code snippets.
[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175 ] Peng Cheng edited comment on MAHOUT-1272 at 7/8/13 6:06 PM: Hey Sebastian, Hudson, Thank you so much for pushing things that hard. I owe you one. I'll test more grouplens data. Since Sebastian has taken over the code, new test cases will only be posted as code snippets. was (Author: peng): Hey Sebastian, Hudson, Thank you so much for on pushing things that hard. I own you this. testing on netflix dataset has encountered some trouble, namely, I don't know where to download it :-. Great appreciation for anyone who can share his judging.txt. In the mean time I'll try more grouplens data. Since Sebastian has taken over the code, new test cases will only be posted as code snippets.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701672#comment-13701672 ] Peng Cheng commented on MAHOUT-1272: Hey honoured contributors, I've got some crude test results for the new parallel SGD factorizer for CF:
1. Parameters:
lambda = 1e-10
rank of the rating matrix/number of features of each user/item vector = 50
number of biases: 3 (average rating + user bias + item bias)
number of iterations/epochs = 2 (for all factorizers including ALSWR, ratingSGD and the proposed parallelSGD)
initial mu/learning rate = 0.01 (for ratingSGD and the proposed parallelSGD)
decay rate of mu = 1 (does not decay) (for ratingSGD and the proposed parallelSGD)
other parameters are set to default.
2. Result on movielens-10m (I don't know what the hell happened to ALSWR, the default hyperparameters must screw up real bad, but my point is the speed edge):
a. RMSE
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ALSWRFactorizer: 3.7709163950800665E21 time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.8847393972529887 time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.8805947464818478 time spent: 3.084s
b. Absolute Average
INFO: ==Recommender With ALSWRFactorizer: 1.2085420449917682E19 time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.675685274206 time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.6775774766740665 time spent: 2.365s
3. Result on movielens-1m (on average SGD works worse on it compared to movielens-10m; perhaps I should use more iterations/epochs):
a. RMSE
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ALSWRFactorizer: 1.3514189134383086E20 time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.9312989913558529 time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.9529995632658007 time spent: 0.305s
b. Absolute Average
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ALSWRFactorizer: 1.58934499216789965E18 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.7459565635961599 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.7420818642753416 time spent: 0.297s
Great thanks to Sebastian for his guidance. I'll upload the EvaluatorRunner class as a mahout-examples component and the formatted code shortly.
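For anyone wanting to reproduce numbers like these, the evaluation harness is straightforward with the Taste API. A sketch under assumptions: the ParallelSGDFactorizer constructor shown (features, lambda, epochs) is a guess at the attached class's signature, the ratings-file path is a placeholder, and the 90/10 split is just an example:

```java
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ParallelSGDFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class EvalSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    RecommenderEvaluator rmse = new RMSRecommenderEvaluator();
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        // rank=50, lambda=1e-10, 2 epochs, as in the comment above; the
        // argument order is an assumption, check the attached source.
        return new SVDRecommender(dataModel,
            new ParallelSGDFactorizer(dataModel, 50, 1e-10, 2));
      }
    };
    // Train on 90% of each user's ratings, score on the held-out 10%.
    double score = rmse.evaluate(builder, null, model, 0.9, 1.0);
    System.out.println("RMSE = " + score);
  }
}
```

Swapping RMSRecommenderEvaluator for AverageAbsoluteDifferenceRecommenderEvaluator would give the "Absolute Average" style of score reported above.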
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1272: --- Attachment: ParallelSGDFactorizerTest.java ParallelSGDFactorizer.java GroupLensSVDRecomenderEvaluatorRunner.java My laptop is an HP Pavilion with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 and 8G mem.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701679#comment-13701679 ] Peng Cheng commented on MAHOUT-1272: Hi Sebastian, may I ask a question? I dug through some old posts and found that the best result should be RMSE ~= 0.85; do you know the parameters that were used?
[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682 ] Peng Cheng edited comment on MAHOUT-1272 at 7/7/13 10:21 PM: - New parameters: lambda = 0.001, rank of the rating matrix/number of features of each user/item vector = 5, number of iterations/epochs = 20. Result on movielens-10m, all evaluations use RMSE:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.8115207244832938 time spent: 8.747s
This is fast and accurate enough; I'm advancing to the Netflix prize dataset. was (Author: peng): New parameter: lambda = 0.001 rank of the rating matrix/number of features of each user/item vectors = 5 number of iterations/epochs = 20 result on movielens-10m: Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.8119081937625745 time spent: 36.509s=== Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.8115207244832938 time spent: 8.747s This is fast and accurate enough, I'm advancing to netflix prize dataset.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682 ] Peng Cheng commented on MAHOUT-1272: New parameters: lambda = 0.001, rank of the rating matrix/number of features of each user/item vector = 5, number of iterations/epochs = 20. Result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With RatingSGDFactorizer: 0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info INFO: ==Recommender With ParallelSGDFactorizer: 0.8115207244832938 time spent: 8.747s
This is fast and accurate enough; I'm advancing to the Netflix prize dataset.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701688#comment-13701688 ] Peng Cheng commented on MAHOUT-1272: Hi Sebastian, Really? I would break my fingers to squeeze into the 0.8 release (not RC1 of course, but there is still RC2 :-). A few guys I work with are also kicking me for the online recommender, so I can work very hard and undistracted. Just tell me what to do next and I'll be thrilled to oblige.
Re: Code Freeze for 0.8
Hi Dr Dunning, I recently joined the team and am working on tickets 1272 and 1274 right now. I was planning to commit to core-0.8 rc2, but the time frame seems harsh. Could you tell me if it is practical? I'm a hard worker. PS I was there at your presentation in Toronto this year. Not ashamed to say, one of the funniest lectures of my life. -Yours Peng On 13-07-07 05:19 PM, Grant Ingersoll wrote: Working on the release now. If anyone wants to join in, I'm on IRC as well. -Grant On Jul 5, 2013, at 12:40 PM, Sebastian Schelter s...@apache.org wrote: +1 On 05.07.2013 18:06, Jake Mannix wrote: +1 On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: +1 From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Friday, July 5, 2013 10:36 AM Subject: Code Freeze for 0.8 I know it's short notice, but I'd like to suggest a code freeze for 0.8 today or tomorrow and I will do a 0.8 RC on Sunday. Based on JIRA, etc., it looks like this should be fine, but let me know if there are any objections. Thanks, Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
Re: Code Freeze for 0.8
Hi Dr Dunning, Thanks a lot. I just read that the deadline is within 7 days and immediately realized how unrealistic my plan was. There will be no rc1 or rc2, just rc. Will have to spam some tests in the next few days. - Peng On 07/07/2013 10:12 PM, Ted Dunning wrote: Peng, Strictly speaking, the code is frozen already. Sebastian seems to think some can get in, but even that is pushing things. On Sun, Jul 7, 2013 at 3:59 PM, Peng Cheng pc...@uowmail.edu.au wrote: Hi Dr Dunning, I recently joined the team and am working on tickets 1272 and 1274 right now. I was planning to commit to core-0.8 rc2, but the time frame seems harsh. Could you tell me if it is practical? I'm a hard worker. PS I was there at your presentation in Toronto this year. Not ashamed to say, one of the funniest lectures of my life. -Yours Peng On 13-07-07 05:19 PM, Grant Ingersoll wrote: Working on the release now. If anyone wants to join in, I'm on IRC as well. -Grant On Jul 5, 2013, at 12:40 PM, Sebastian Schelter s...@apache.org wrote: +1 On 05.07.2013 18:06, Jake Mannix wrote: +1 On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunning ted.dunn...@gmail.com wrote: +1 On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: +1 From: Grant Ingersoll gsing...@apache.org To: dev@mahout.apache.org Sent: Friday, July 5, 2013 10:36 AM Subject: Code Freeze for 0.8 I know it's short notice, but I'd like to suggest a code freeze for 0.8 today or tomorrow and I will do a 0.8 RC on Sunday. Based on JIRA, etc., it looks like this should be fine, but let me know if there are any objections. Thanks, Grant Grant Ingersoll | @gsingers http://www.lucidworks.com
[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701233#comment-13701233 ] Peng Cheng edited comment on MAHOUT-1272 at 7/6/13 2:43 PM: Hey, I have finished the class and test for the parallel SGD factorizer for the matrix-completion-based recommender (not MapReduce, just single-machine multi-threaded); it is loosely based on vanilla SGD and Hogwild!. I have only tested on toy and synthetic data (2000 users * 1000 items), but it is pretty fast: 3-5x faster than vanilla SGD with 8 cores (never exceeds 6x; apparently the executor induces high allocation overhead), and definitely faster than single-machine ALSWR. I'm submitting my java files and patch for review. was (Author: peng): Hey I have finished the class and test for parallel sgd factorizer for matrix-completion based recommender (not mapreduced, just single machine multi-thread), it is loosely based on vanilla sgd and hogwild!. I have only tested on toy and synthetic data (2000users * 1000 times) but it is pretty fast, 3-5x times faster than vanilla sgd with 8 cores. (never exceed 6x, apparently the executor induces high overhead allocation cost) And definitely faster than single machine ALSWR. I'm submitting my java files and patch for review.
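On the executor-overhead remark: submitting one small Runnable per SGD update makes the task-allocation cost comparable to the update itself. A common remedy, sketched below under assumptions (the LongAdder increment stands in for an actual SGD update; this is not the attached code), is to submit one long-running worker per core and loop inside it:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Sketch: amortize executor overhead by submitting one long-running worker per
// core, each looping over many SGD steps, instead of one Runnable per update.
public class WorkerPoolSketch {
  public static void main(String[] args) throws InterruptedException {
    int threads = Runtime.getRuntime().availableProcessors();
    long totalSteps = 10_000_000L;
    final long stepsPerWorker = totalSteps / threads;
    final LongAdder done = new LongAdder();

    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (int t = 0; t < threads; t++) {
      pool.execute(() -> {
        for (long s = 0; s < stepsPerWorker; s++) {
          done.increment(); // stand-in for one SGD update on the shared factors
        }
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    System.out.println("steps executed: " + done.sum());
  }
}
```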
[jira] [Updated] (MAHOUT-1274) SGD-based Online SVD recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1274: --- Description: An online SVD recommender is otherwise similar to an offline SVD recommender except that, upon receiving one or several new preferences, it can add them into the training dataModel and update the result accordingly in real time. An online SVD recommender should override setPreference(...) and removePreference(...) in AbstractRecommender such that the factorization result is updated in O(1) time and without retraining. Right now the slopeOneRecommender is the only component possessing such a capability. Since SGD is intrinsically an online algorithm and its CF implementation is available in core-0.8 (see MAHOUT-1089, MAHOUT-1272), I presume it would be a good time to convert it. Such a feature could come in handy for some websites. Implementation: Adding new users or items, or increasing the rating matrix rank, just increases the size of the user and item matrices. Reducing the rating matrix rank involves just one SVD. The real challenge here is that SGD is NOT a one-pass algorithm: multiple passes are required to achieve acceptable optimality, even more so if the hyperparameters are bad. But here are two possible workarounds: 1. Use one-pass algorithms like averaged SGD; I'm not sure it can ever work, as applying a stochastic convex-optimization algorithm to a non-convex problem is anarchy, but it may be a long shot. 2. Run incomplete passes in each online update, using ratings randomly sampled (but not uniformly sampled) from the latest dataModel. I don't know exactly how this should be done, but new ratings should be sampled more frequently; uniform sampling would result in old ratings being used more than new ratings in total. If somebody has worked on this batch-to-online conversion before and can share his insight, that would be awesome. This seems to be the most viable option, if I can get the non-uniform pseudorandom generator that maintains the cumulative uniform distribution I want. I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, but it didn't pay off. Hopefully it's not a bad idea to submit a new ticket here. was: an online SVD recommender is otherwise similar to an offline SVD recommender except that, upon receiving one or several new recommendations, it can add them into the training dataModel and update the result accordingly in real time. an online SVD recommender should override setPreference(...) and removePreference(...) in AbstractRecommender such that the factorization result is updated in O(1) time and without retraining. Right now the slopeOneRecommender is the only component possessing such capability. Since SGD is intrinsically an online algorithm and its CF implementation is available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a good time to convert it. Such feature could come in handy for some websites. Implementation: Adding new users, items, or increasing rating matrix rank are just increasing size of user and item matrices. Reducing rating matrix rank involves just one svd. The real challenge here is that sgd is NO ONE-PASS algorithm, multiple passes are required to achieve an acceptable optimality and even more so if hyperparameters are bad. But here are two possible circumvents: 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as applying stochastic convex-opt algorithm to non-convex problem is anarchy. But it may be a long shot. 2. Run incomplete passes in each online update using ratings randomly sampled (but not uniformly sampled) from latest dataModel. I don't know how exactly this should be done but new rating should be sampled more frequently. Uniform sampling will results in old ratings being used more than new ratings in total. If somebody has worked on this batch-to-online conversion before and share his insight that would be awesome. This seems to be the most viable option, if I get the non-uniform pseudorandom generator that maintains a cumulative uniform distribution I want. I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but it didn't pay off. Hopefully its not a bad idea to submit a new tickets.
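Point 2 of the description, recency-biased sampling, can be approximated very simply. The sketch below is one hypothetical scheme (the mixing probability and window size are arbitrary knobs, not anything from the ticket):

```java
import java.util.Random;

// Sketch: sample indices into a chronologically ordered rating list with a
// bias toward recent ratings. With probability p pick uniformly among the
// newest `window` ratings, otherwise uniformly over everything, so old
// ratings are still revisited but new ones are replayed more often.
public class RecencyBiasedSampler {
  private final Random rnd = new Random();
  private final double p;     // probability of drawing from the recent window
  private final int window;   // how many of the newest ratings count as "recent"

  public RecencyBiasedSampler(double p, int window) {
    this.p = p;
    this.window = window;
  }

  /** @param n current number of ratings, ordered oldest (0) to newest (n-1) */
  public int nextIndex(int n) {
    int w = Math.min(window, n);
    if (rnd.nextDouble() < p) {
      return n - w + rnd.nextInt(w);   // recent window
    }
    return rnd.nextInt(n);             // anywhere, keeps old ratings in play
  }
}
```

Schemes that decay the sampling probability smoothly with age (e.g. geometrically) would serve the same purpose; the key property is that every rating keeps non-zero probability, so older preferences are still revisited across online updates.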
[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender
[ https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380 ] Peng Cheng commented on MAHOUT-1274: BTW may I ask (noobishly) why you have deprecated the SlopeOneRecommender in the latest core-0.8 snapshot? I must have missed a lot of previous mahout-development emails before I joined, so apologies if it's a stupid question.
Re: [jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender
Hi Sebastian, Thanks a lot for the help! You mean core-1.0 or bundle-1.0? I hope I can work hard enough to catch the next release. Also, what do you think about the proposed online pseudorandom sampling problem? I was digging through old threads and found MAHOUT-1069, which already did a lot of the work I need right now and used a lot of code-optimization techniques, but was eventually rejected for being too complex and drastic. :- I wonder if overengineering is a researcher's most dangerous bane; it has happened to a lot of people. On 13-07-06 01:31 PM, Sebastian Schelter wrote: Hi Peng, We deprecated a lot of algorithms that we found to be not much used, to streamline our codebase for a coming 1.0 release. On 06.07.2013 10:25, Peng Cheng (JIRA) j...@apache.org wrote: [ https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380 ] Peng Cheng commented on MAHOUT-1274: BTW may I ask (noobishly) why you have deprecated the SlopeOneRecommender in the latest core-0.8 snapshot? I must have missed a lot of previous mahout-development emails before I joined, so apologies if it's a stupid question.
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Peng Cheng updated MAHOUT-1272: --- Labels: features patch test (was: ) Status: Patch Available (was: Open) Hey, I have finished the class and test for the parallel SGD factorizer for the matrix-completion-based recommender (not MapReduce, just single-machine multi-threaded); it is loosely based on vanilla SGD and Hogwild!. I have only tested on toy and synthetic data (2000 users * 1000 items), but it is pretty fast: 3-5x faster than vanilla SGD with 8 cores (never exceeds 6x; apparently the executor induces high allocation overhead), and definitely faster than single-machine ALSWR. I'm submitting my java files and patch for review.
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peng Cheng updated MAHOUT-1272:
---
Attachment: ParallelSGDFactorizerTest.java
            ParallelSGDFactorizer.java

Java files attached.
[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Peng Cheng updated MAHOUT-1272:
---
Attachment: mahout.patch

Patch attached.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701241#comment-13701241 ]
Peng Cheng commented on MAHOUT-1272:

The next step would be to create an online version of this factorizer (and of the recommender). SGD is an online algorithm, but right now it works only for the batch recommender; in the meantime the only online recommender in Mahout is slope-one, which is kind of a shame. Will create a new JIRA ticket tomorrow.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13701247#comment-13701247 ]
Peng Cheng commented on MAHOUT-1272:

Aye aye, more tests on the way. Much obliged for the quick suggestion.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696932#comment-13696932 ]
Peng Cheng commented on MAHOUT-1272:

Looks like the 1/n learning rate doesn't work at all on the SGD factorizer; maybe the convergence guarantees of stochastic optimization can't be applied to the non-convex MF problem. Can someone show me a paper discussing convergence bounds for such a problem? Much appreciated.
[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender
[ https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13696155#comment-13696155 ]
Peng Cheng commented on MAHOUT-1272:

The learning rate/step size is set to be identical to the ~.classifier.sgd package: the old learning rate is exponential, with a constant decay factor. That setting seems to work only for smooth functions (proved by Nesterov?), and I'm not sure it holds in CF. Otherwise, use either 1/sqrt(n) for convex f or 1/n for strongly convex f.
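Concretely, the three step-size schedules under discussion, with illustrative constants (eta0 and d here are placeholders, not the values used in ~.classifier.sgd):

```java
/** The three step-size schedules being compared; illustrative only,
    mirroring the discussion rather than the actual ~.classifier.sgd code. */
public final class StepSizeSchedules {
  /** Exponential decay with constant factor d in (0, 1): eta_n = eta0 * d^n. */
  static double exponential(double eta0, double d, long n) {
    return eta0 * Math.pow(d, n);
  }

  /** Robbins-Monro style rate for general convex f: eta_n = eta0 / sqrt(n). */
  static double convex(double eta0, long n) {
    return eta0 / Math.sqrt(n);
  }

  /** Faster decay for strongly convex f: eta_n = eta0 / n. */
  static double stronglyConvex(double eta0, long n) {
    return eta0 / n;
  }

  public static void main(String[] args) {
    // Print how quickly each schedule shrinks the step size.
    for (long n = 1; n <= 1000; n *= 10) {
      System.out.printf("n=%4d  exp=%.6f  1/sqrt(n)=%.6f  1/n=%.6f%n",
          n, exponential(0.05, 0.999, n), convex(0.05, n), stronglyConvex(0.05, n));
    }
  }
}
```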
[jira] [Commented] (MAHOUT-1089) SGD matrix factorization for rating prediction with user and item biases
[ https://issues.apache.org/jira/browse/MAHOUT-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13695745#comment-13695745 ]
Peng Cheng commented on MAHOUT-1089:

Code is slick! But apparently there is no multi-threading yet, even though the proposal for it has been around for a long time: http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl Is somebody working on an implementation? Apparently the choice between hogwild! and vanilla DSGD has no big impact on performance.

SGD matrix factorization for rating prediction with user and item biases
-------------------------------------------------------------------------
Key: MAHOUT-1089
URL: https://issues.apache.org/jira/browse/MAHOUT-1089
Project: Mahout
Issue Type: New Feature
Components: Collaborative Filtering
Reporter: Zeno Gantner
Assignee: Sebastian Schelter
Attachments: MAHOUT-1089.patch, RatingSGDFactorizer.java, RatingSGDFactorizer.java

A matrix factorization that is trained with standard SGD on all features at the same time, in contrast to ExpectationMaximizationFactorizer, which learns feature by feature. In addition to the free features, it models a rating bias for each user and item.
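For context, the prediction rule such a biased factorization learns is rhat(u,i) = mu + b_u + b_i + p_u . q_i. A minimal Java sketch, with illustrative names rather than RatingSGDFactorizer's actual fields:

```java
/** Sketch of biased-MF prediction: global mean, per-user and per-item
    biases, plus the usual dot product of latent factors. Names are
    illustrative, not RatingSGDFactorizer's actual fields. */
public final class BiasedPrediction {
  static double predict(double mu, double[] userBias, double[] itemBias,
                        double[][] P, double[][] Q, int u, int i) {
    double s = mu + userBias[u] + itemBias[i];   // baseline: mean + biases
    for (int f = 0; f < P[u].length; f++) {
      s += P[u][f] * Q[i][f];                    // latent-factor interaction
    }
    // During training, each SGD step would also nudge the biases, e.g.
    // userBias[u] += eta * (err - lambda * userBias[u]).
    return s;
  }
}
```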