Re: PyMahout (incore) (alpha v0.1)

2021-01-06 Thread Peng Zhang
Well done Trevor.

-peng

On Thu, Jan 7, 2021 at 04:45 Trevor Grant  wrote:

> Hey all,
>
> I made a branch for a thing I'm toying with. PyMahout.
>
> See https://github.com/rawkintrevo/pymahout/tree/trunk
>
> Right now, it's sort of dumb - it just makes a couple of random in-core
> matrices... but it _does_ make them.
>
> Next I want to show I can do something with DRMs.
>
> Once I know it's all possible - I'll make a batch of JIRA tickets and we can
> start implementing a Python-like package so that in theory in a PySpark
> workbook you could
>
> ```jupyter
> !pip install pymahout
>
> import pymahout
>
> # do pymahout things here... in python.
>
> ```
>
> So if you're interested in helping / playing - reach out on here or directly -
> if there is a bunch of interest I can commit all of this to a branch as we
> play with it.
>
> Thanks!
> tg
>


Re: [ANNOUNCE] Andrew Musselman, New Mahout PMC Chair

2018-07-19 Thread Peng Zhang
Congrats Andrew!

On Thu, Jul 19, 2018 at 04:01 Andrew Musselman 
wrote:

> Thanks Andy, looking forward to it! Thank you too for your support and
> dedication the past two years; here's to continued progress!
>
> Best
> Andrew
>
> On Wed, Jul 18, 2018 at 1:30 PM, Andrew Palumbo 
> wrote:
> > Please join me in congratulating Andrew Musselman as the new Chair of
> > the
> > Apache Mahout Project Management Committee. I would like to thank
> > Andrew
> > for stepping up, all of us who have worked with him over the years
> > know his
> > dedication to the project to be invaluable.  I look forward to Andrew
> > taking the project into the future.
> >
> > Thank you,
> >
> > Andy
>


Re: Any idea which approaches to non-linear svm are easily parallelizable?

2014-10-22 Thread peng
And I agree with Ted: the non-linearity induced by most kernel functions
is overly complex and can easily overfit. Deep learning is a more reliable
abstraction.


On 10/22/2014 01:42 PM, peng wrote:
Is the kernel projection referring to online/incremental incomplete
Cholesky decomposition? Sorry, I haven't used SVM for a long time and
haven't kept up with the SotA.


If that's true, I haven't found an out-of-the-box implementation, but
this should be easy.


Yours Peng

On 10/22/2014 01:32 PM, Dmitriy Lyubimov wrote:

Andrew, thanks a bunch for the pointers!



On Wed, Oct 22, 2014 at 10:14 AM, Andrew Palumbo ap@outlook.com 
wrote:



If you do want to stick with SVM-

This is a question that I keep coming back to myself and unfortunately
have forgotten more (and lost more literature) than I’ve retained.
I believe that the most easily parallelizable sections of libSVM for small
datasets are (for C-SVC(R), RBF and polynomial kernels):

 1. The kernel projections
 2. The hyper-parameter grid search for C, \gamma (I believe this
is now included in LibSVM - I haven't looked at it in a while)
 3. For multi-class SVC: the concurrent computation of each SVM for
each one-against-one class vote.

I’m unfamiliar with any easily parallelizable method for QP itself.
Unfortunately for (2), (3) this involves broadcasting the entire dataset
out to each node of a cluster (or working in a shared memory environment),
so it may not be practical depending on the size of your data set. I’ve only
ever implemented (2) for relatively small datasets using MPI and with a
pure Java socket implementation.
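For (2) concretely, a minimal single-machine sketch of the idea - evaluating the
(C, \gamma) grid concurrently - in plain Python on top of scikit-learn's libSVM
wrapper (illustrative only, not Mahout code; each grid point is independent and
could just as well be shipped to a cluster node):

```python
# Hypothetical sketch: parallel hyper-parameter grid search for an RBF C-SVC.
# scikit-learn's SVC wraps libSVM; each (C, gamma) cell is trained and scored
# independently, so the cells can run in separate processes (or on separate nodes).
from concurrent.futures import ProcessPoolExecutor
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

def evaluate(params):
    C, gamma = params
    score = cross_val_score(SVC(C=C, gamma=gamma, kernel="rbf"), X, y, cv=3).mean()
    return params, score

if __name__ == "__main__":
    grid = list(product([0.1, 1, 10, 100], [1e-3, 1e-2, 1e-1, 1]))
    with ProcessPoolExecutor() as pool:
        best = max(pool.map(evaluate, grid), key=lambda r: r[1])
    print("best (C, gamma):", best[0], "cv accuracy:", best[1])
```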

Other approaches (further from simple LibSVM), which are more applicable
to large datasets (I’m less familiar with these):

 4. Divide and conquer the QP/SMO problem and solve (as I’ve said,
I’m unfamiliar with this and I don’t know of any standard)

 5. Break the training set into subsets and solve.

For (5) there are several approaches; two that I know of are ensemble
approaches and those that accumulate support vectors from each partition
and heuristically keep/reject them until the model converges. As well, I’ve
recently read some research on implementing this in a MapReduce
style [2].

I came across this paper [1] last night which you may find interesting as
well; it is an interesting comparison of some SVM parallelization
strategies. In particular it discusses (1) for a shared memory environment
and for offloading work to GPUs (using OpenMP and CUDA). It also cites
several other nice papers discussing SVM parallelization strategies,
especially for (5), and then goes on to discuss a more purely linear
algebra approach to optimizing SVMs (sec. 5).

Also regarding (5) you may be interested in [2] (something I’ve only
looked over briefly).

[1] http://arxiv.org/pdf/1404.1066v1.pdf
[2] http://arxiv.org/pdf/1301.0082.pdf



From: ted.dunn...@gmail.com
Date: Tue, 21 Oct 2014 17:32:22 -0700
Subject: Re: Any idea which approaches to non-liniear svm are easily

parallelizable?

To: dev@mahout.apache.org

Last I heard, the best methods pre-project and do linear SVM.
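One common reading of "pre-project": approximate the kernel with an explicit
random feature map and then train a linear SVM on the projection. A minimal
sketch, with scikit-learn standing in for whatever implementation is actually
used (illustrative only):

```python
# Illustrative sketch of "pre-project and do linear SVM": approximate the RBF
# kernel with random Fourier features, then train a linear SVM on the projection.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
projector = RBFSampler(gamma=0.5, n_components=300, random_state=0)
Z = projector.fit_transform(X)   # explicit feature map approximating the RBF kernel
clf = LinearSVC().fit(Z, y)      # linear SVM in the projected space
print("train accuracy:", clf.score(Z, y))
```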

Beyond that, I would guess that deep learning techniques would subsume
non-linear SVM pretty easily.  The best parallel implementation I know
for that is in H2O.



On Tue, Oct 21, 2014 at 4:12 PM, Dmitriy Lyubimov dlie...@gmail.com

wrote:

in particular, from libSVM --
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf ?

thanks.
-d







Re: Any idea which approaches to non-linear svm are easily parallelizable?

2014-10-22 Thread peng
Yes I am. In fact, my question is just about whether approximation is
used to make the total workload of computing the matrix sub-quadratic in
the training set size.


On 10/22/2014 02:21 PM, Andrew Palumbo wrote:

Peng, I'm not sure if you were referring to what I wrote:

  1. The kernel projections

If so - I was talking about parallelizing the computation of e.g. the RBF kernels.
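To make that concrete, a tiny sketch of the computation (plain numpy, not Mahout
code): the RBF Gram matrix K(x, z) = exp(-gamma * ||x - z||^2) can be built in
independent row blocks, and those blocks are exactly the part that parallelizes.

```python
# Illustrative only: build the RBF kernel matrix in row blocks.
import numpy as np

def rbf_kernel_block(X_block, X, gamma):
    # squared Euclidean distances between the block's rows and all rows
    d2 = ((X_block[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

X = np.random.rand(100, 5)
# each block below is independent, so it could be computed on its own worker/node
K = np.vstack([rbf_kernel_block(X[i:i + 25], X, gamma=0.5)
               for i in range(0, 100, 25)])
assert K.shape == (100, 100)
```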


Date: Wed, 22 Oct 2014 13:42:45 -0400
From: pc...@uowmail.edu.au
To: dev@mahout.apache.org
CC: dlie...@gmail.com
Subject: Re: Any idea which approaches to non-liniear svm are easily 
parallelizable?

Is the kernel projection referring to online/incremental incomplete
Cholesky decomposition? Sorry I haven't used SVM for a long time and
didn't keep up with SotA.

If that's true, I haven't find an out-of-the-box implementation, but
this should be easy.

Yours Peng

On 10/22/2014 01:32 PM, Dmitriy Lyubimov wrote:

Andrew, thanks a bunch for the pointers!



On Wed, Oct 22, 2014 at 10:14 AM, Andrew Palumbo ap@outlook.com wrote:


If you do want to stick with SVM-

This is a question that I keep coming back to myself and unfortunately
have forgotten more (and lost more literature) than I’ve retained.
I believe that the most easily parallelizable sections of libSVM for small
datasets are (for C-SVC(R), RBF and polynomial kernels):

  2. The hyper-parameter grid search for C, \gamma (I believe this
is now included in LibSVM - I haven't looked at it in a while)
  3. For multi-class SVC: the concurrent computation of each SVM for
each one-against-one class vote.

I’m unfamiliar with any easily parallelizable method for QP itself.
Unfortunately for (2), (3) this involves broadcasting the entire dataset
out to each node of a cluster (or working in a shared memory environment),
so it may not be practical depending on the size of your data set.  I’ve only
ever implemented (2) for relatively small datasets using MPI and with a
pure Java socket implementation.

Other approaches (further from simple LibSVM), which are more applicable
to large datasets (I’m less familiar with these):

  4. Divide and conquer the QP/SMO problem and solve (as I’ve said,
I’m unfamiliar with this and I don’t know of any standard)

  5.Break the training set into subsets and solve.

For (5) there are several approaches; two that I know of are ensemble
approaches and those that accumulate support vectors from each partition
and heuristically keep/reject them until the model converges.  As well, I’ve
recently read some research on implementing this in a MapReduce
style [2].

I came across this paper [1] last night which you may find interesting as
well; it is an interesting comparison of some SVM parallelization
strategies. In particular it discusses (1) for a shared memory environment
and for offloading work to GPUs (using OpenMP and CUDA). It also cites
several other nice papers discussing SVM parallelization strategies,
especially for (5), and then goes on to discuss a more purely linear
algebra approach to optimizing SVMs (sec. 5).

Also regarding (5) you may be interested in [2] (something I’ve only
looked over briefly).

[1] http://arxiv.org/pdf/1404.1066v1.pdf
[2] http://arxiv.org/pdf/1301.0082.pdf



From: ted.dunn...@gmail.com
Date: Tue, 21 Oct 2014 17:32:22 -0700
Subject: Re: Any idea which approaches to non-liniear svm are easily

parallelizable?

To: dev@mahout.apache.org

Last I heard, the best methods pre-project and do linear SVM.

Beyond that, I would guess that deep learning techniques would subsume
non-linear SVM pretty easily.  The best parallel implementation I know
for that is in H2O.



On Tue, Oct 21, 2014 at 4:12 PM, Dmitriy Lyubimov dlie...@gmail.com

wrote:

in particular, from libSVM --
http://www.csie.ntu.edu.tw/~cjlin/papers/libsvm.pdf ?

thanks.
-d







Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread peng
From my experience 1.1.0 is quite stable, plus it has some performance
improvements that are totally worth the effort.


On 10/19/2014 06:30 PM, Ted Dunning wrote:

On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote:


Getting off the dubious Spark 1.0.1 version is turning out to be a bit of
work. Does anyone object to upgrading our Spark dependency? I’m not sure if
Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean upgrading
your Spark cluster.


It is going to have to happen sooner or later.

Sooner may actually be less total pain.





Re: Why is mahout moving to spark?

2014-10-15 Thread peng
No it's not; Spark is a superset of MapReduce. Besides, 'Hadoop
MapReduce' here denotes a specific implementation rather than an
architecture.


On 10/15/2014 03:44 PM, thejas prasad wrote:

Hey all,

  I am curious why mahout is moving away from spark? I mean, it says here: "The
Mahout community decided to move its codebase onto modern data processing
systems that offer a richer programming model and more efficient execution
than Hadoop MapReduce." But why did this happen?

And also, is there a place where I can see all the previous emails in the user/dev
list?

Thanks,
Thejas





Re: Upgrade to spark 1.0.x

2014-08-09 Thread Peng Cheng

+1

1.0.0 is recommended. Many releases after 1.0.1 had a short test cycle,
and 1.0.2 apparently reverted many fixes for causing more serious problems.


On 14-08-09 04:51 PM, Ted Dunning wrote:

+1

Until we release a version that uses spark, we should stay with what helps
us.  Once a release goes out then tracking whichever version of spark that
the big distros put out becomes more important.



On Sat, Aug 9, 2014 at 9:57 AM, Pat Ferrel pat.fer...@gmail.com wrote:


+1

Seems like we ought to keep up to the bleeding edge until the next Mahout
release, that’s when the pain of upgrade gets spread much wider. In fact if
Spark gets moved to Scala 2.11 before our release we probably should
consider upgrading Scala too.




Spark-shell web UI powered by IPython-notebook, maybe useful in promoting Mahout DRM environment.

2014-07-30 Thread Peng Cheng

Dudes,

For those who can't afford the glamourous DataBricks Spark Cloud, or got 
pissed by its incompatibility with Mahout DRM, you may consider this 
project as an alternative:


https://github.com/tribbloid/ISpark

Support for Mahout DRM is still being implemented and will be delivered 
in a few days.


Yours Peng


Re: VOTE: moving commits to git-wp.o.a github PR features.

2014-05-17 Thread Peng Cheng

+1

On Sat 17 May 2014 02:18:56 PM EDT, Gokhan Capan wrote:

+1

Sent from my iPhone


On May 16, 2014, at 21:38, Dmitriy Lyubimov dlie...@gmail.com wrote:

Hi,

I would like to initiate a procedural vote moving to git as our primary
commit system, and using github PRs as described in Jake Farrel's email to
@dev [1]

[1]
https://blogs.apache.org/infra/entry/improved_integration_between_apache_and

If voting succeeds, i will file a ticket with infra to commence necessary
changes and to move our project to git-wp as primary source for commits as
well as add github integration features [1]. (I assume pure git commits
will be required after that's done, with no svn commits allowed).

The motivation is to engage GIT and github PR features as described, and
avoid git mirror history messes like we've seen associated with authors.txt
file fluctuations.

PMC and committers have binding votes, so please vote. Lazy consensus with
minimum 3 +1 votes. Vote will conclude in 96 hours to allow some extra time
for weekend (i.e. Tuesday afternoon PST) .

here is my +1

-d


Re: Plan for 1.0

2014-03-19 Thread peng

I'm free Saturday, Sunday, and after 17:00 EDT.

On Wed 19 Mar 2014 12:30:49 PM EDT, Andrew Musselman wrote:

Friday afternoon Pacific Time is good for me too.


On Mar 19, 2014, at 12:14 AM, Saikat Kanjilal sxk1...@hotmail.com wrote:

I'm on pacific standard time and am free Sundays late afternoon

Sent from my iPhone


On Mar 19, 2014, at 12:13 AM, Dmitriy Lyubimov dlie...@gmail.com wrote:

i am on vacation, so most of the pacific daylight ranges on any day should
work for me.



On Wed, Mar 19, 2014 at 12:07 AM, Sebastian Schelter s...@apache.org wrote:

Friday would also work for me.



On 03/19/2014 08:05 AM, Suneel Marthi wrote:

Same here, travelling next week and in Amsterdam the first week of April.  I
avoid Sundays and weekends for obvious reasons. How about this Friday?

Sent from my iPhone


On Mar 19, 2014, at 3:02 AM, Sebastian Schelter s...@apache.org wrote:

Would some time on sunday work? I'll be traveling the next two weeks
starting from Tuesday.

Best,
Sebastian



On 03/19/2014 07:55 AM, Suneel Marthi wrote:
I had a hangout set up for 0.9, not sure if it's still valid; I can
check on that or can set one up now. When would people wanna have it?
Mondays and Wednesdays don't work for me.  Would Tuesdays 6pm Eastern
Time work?





On Wednesday, March 19, 2014 2:45 AM, Sebastian Schelter 
s...@apache.org wrote:

Hi Saikat,

1) I think that MAHOUT-1248 and 1249 are still very important features
that I would love to see in the codebase, as they would greatly improve
the usability of our ALS code.

2) I think the last discussion item regarding h2o was to find a way to
compare it against existing or Spark-related algorithm implementations to
get a better picture of the programming model and performance. I also don't
feel that a final decision has been reached about this.


3) We should have the hangout, can someone step up and organize it?

Best,
Sebastian





On 03/19/2014 04:45 AM, Saikat Kanjilal wrote:
Hi Guys,
I read through the email threads with the weigh-ins for the inclusion
of H2O as well as Spark and wanted to circle back on the plan for folks to
meet around 1.0, so a few questions:

1) How does the inclusion of H2O and Spark weigh in importance versus
the existing JIRA items for potentially new feature work
to be done in Mahout (in my case JIRA 1248/1249)?
2) From reading all the responses it doesn't seem like there's full
consensus on what the next steps are for H2O and how that relates to the
roadmap around 1.0; please correct me if I'm misunderstanding. Can someone
outline whether any concrete decisions have been made on whether or not
Mahout 1.0 will include H2O bindings?
3) Are we moving forward with the Google hangout? I didn't receive
anything about this yet.




Thanks in advance.




Re: contributing to mahout

2014-03-06 Thread peng

Hi Hardik,

I'm forwarding the previous thread about the 1.0 release plan to you. As a
Spark user you will see many things to be done.


Yours Peng

On Thu 06 Mar 2014 12:57:20 PM EST, Sebastian Schelter wrote:

Hi Hardik,

At the moment, we are working heavily on polishing our documentation.
A very welcome contribution would be a nice-to-read writeup of how to use
an algorithm in Mahout to solve an exemplary problem.

E.g. taking a movie ratings dataset and showing how to compute
recommendations on it.

Best,
Sebastian

On 03/06/2014 06:54 PM, Hardik Pandya wrote:

Hi all,

I am new to Mahout and wanted to contribute to the Mahout dev
community; any
initial pointers for newcomers like me are appreciated.

Thanks,
Hardik Pandya


On Thu, Mar 6, 2014 at 12:44 PM, Hardik Pandya
smarty.ju...@gmail.comwrote:


Hi all,

I am new to mahout and wanted to contribute into mahout dev
community, any
initial pointers for new comers like me appreciated

Thanks,
Hardik Pandya







Re: [jira] [Updated] (MAHOUT-1178) GSOC 2013: Improve Lucene support in Mahout

2014-03-02 Thread peng

Wow, I've been waiting for this for a long time; finally fixed.

On Sun 02 Mar 2014 05:01:26 PM EST, Suneel Marthi (JIRA) wrote:


  [ 
https://issues.apache.org/jira/browse/MAHOUT-1178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MAHOUT-1178:
--

 Fix Version/s: (was: Backlog)
1.0


GSOC 2013: Improve Lucene support in Mahout
---

 Key: MAHOUT-1178
 URL: https://issues.apache.org/jira/browse/MAHOUT-1178
 Project: Mahout
  Issue Type: New Feature
Reporter: Dan Filimon
Assignee: Gokhan Capan
  Labels: gsoc2013, mentor
 Fix For: 1.0

 Attachments: MAHOUT-1178-TEST.patch, MAHOUT-1178.patch


[via Ted Dunning]
It should be possible to view a Lucene index as a matrix.  This would
require that we standardize on a way to convert documents to rows.  There
are many choices, the discussion of which should be deferred to the actual
work on the project, but there are a few obvious constraints:
a) it should be possible to get the same result as dumping the term vectors
for each document each to a line and converting that result using standard
Mahout methods.
b) numeric fields ought to work somehow.
c) if there are multiple text fields that ought to work sensibly as well.
  Two options include dumping multiple matrices or to convert the fields
into a single row of a single matrix.
d) it should be possible to refer back from a row of the matrix to find the
correct document.  This might be because we remember the Lucene doc number
or because a field is named as holding a unique id.
e) named vectors and matrices should be used if plausible.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Mahout 1.0 goals

2014-03-02 Thread peng

Hi Dr Dunning,

I'm reluctant to admit that my feeling is similar to that of many of Sean's
customers. As a user of mahout and lucene-solr, I see a lot of
similarities in their cases:

lucene | mahout
indexing takes text as sparse vectors and builds an inverted index | training takes data as sparse vectors and builds a model
the inverted index exists in memory/HDFS | the model exists in memory/HDFS
used by inputting text and returning matches with scores | used by inputting test data and returning scores/labels
model selection compares the ordinal number of scores with ground truth | model selection compares scores/labels with ground truth


Then lucene/solr/elasticsearch evolved to become highly successful
flagship products (as buggy and incomplete as they are, they still gained wide
usage, which mahout never achieved). Yet mahout still looks like it was
assembled with glue and duct tape. The major difficulties I encountered
are:


1. Components are not interchangeable: e.g. the data and model
representation for single-node CF is vastly different from MR CF. New
features sometimes add backward-incompatible representations. This
drastically demoralizes users seeking to integrate with it while expecting
improvement.
2. Components have strong dependencies on others: e.g. cross-validation
of CF can only use the in-memory DataModel, which SlopeOneRecommender
cannot update properly (it's removed, but you get my point). Such designs
never drew enough attention apart from a 'won't fix' resolution.
3. Many models can only be used internally and cannot be exported or
reused in other applications. This is true of solr as well, but its
RESTful API is very universal and many ETL tools have been built for it.
In contrast mahout has a very hard learning curve for non-Java
developers.


It's not bad to see mahout as a service on top of a library, if it
doesn't take too much effort.


Yours Peng

On Sun 02 Mar 2014 11:45:33 AM EST, Ted Dunning wrote:

Ravi,

Good points.

On Sun, Mar 2, 2014 at 12:38 AM, Ravi Mummulla ravi.mummu...@gmail.comwrote:


- Natively support Windows (guidance, etc. No documentation exists today,
for instance)



There is a bit of demand for that.

- Faster time to first application (from discovery to first application

currently takes a non-trivial amount of effort; how can we lower the bar
and reduce the friction for adoption?)



There is huge evidence that this is important.



  - Better documenting use cases with working samples/examples
(Documentation
on https://mahout.apache.org/users/basics/algorithms.html is spread out
and
there is too much focus on algorithms as opposed to use cases - this is an
adoption blocker)



This is also important.



- Uniformity of the API set across all algorithms (are we providing the
same experience across all APIs?)



And many people have been tripped up by this.



  - Measuring/publishing scalability metrics of various algorithms (why
would
we want users to adopt Mahout vs. other frameworks for ML at scale?)



I don't see this as important as some of your other points, but is still
useful.



Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng
That should be easy. But that defeats the purpose of using mahout, as
there are already enough implementations of single-node backpropagation
(in which case a GPU is much faster).


Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation would be better off with no parameter server? It's obviously a single
point of failure and, in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (I never read or
tested it), and it's possible to build the workflow on top of YARN. None of
those frameworks has a heterogeneous topology.


Yours Peng

On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:


 [ 
https://issues.apache.org/jira/browse/MAHOUT-1426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13913488#comment-13913488
 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had in mind 
network that will fit into memory, but will require significant amount of 
computations.

I understand that there are better options for neural networks than map reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a 
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM or/and 
Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i implemented it i 
didn't know that the learning algorithm is not suited for MR, so I think there is no 
point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had in mind 
network that will fit into memory, but will require significant amount of 
computations.

I understand that there are better options for neural networks than map reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM or/and 
Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i implemented it i 
didn't know that the learning algorithm is not suited for MR, so I think there is no 
point including the patch.


GSOC 2013 Neural network algorithms
---

 Key: MAHOUT-1426
 URL: https://issues.apache.org/jira/browse/MAHOUT-1426
 Project: Mahout
  Issue Type: Improvement
  Components: Classification
Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network 
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes.
There is only one word about Autoencoders in NeuralNetwork class.
As far as I know Mahout doesn't support convolutional networks.
Is it a good idea to implement one of these algorithms?
Is it a reasonable amount of work?
How hard is it to get GSOC in Mahout?
Did anyone succeed last year?




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng
With pleasure! The original downpour paper proposes a parameter server
from which subnodes download shards of the old model and upload gradients.
So if the parameter server is down, the process has to be delayed; it
also requires that all model parameters be stored and atomically
updated on (and fetched from) a single machine, imposing asymmetric HDD
and bandwidth requirements. This design is necessary only because each
-=delta operation has to be atomic, which cannot be ensured across the
network (e.g. on HDFS).


But that doesn't mean the operation cannot be decentralized:
parameters can be sharded across multiple nodes, and multiple
accumulator instances can handle parts of the vector subtraction. This
should be easy if you create a buffer for the stream of gradients and
allocate proper numbers of producers and consumers on each machine to
make sure it doesn't overflow. Obviously this is far from the MR framework,
but at least it can be made homogeneous and slightly faster (because
sparse data can be distributed in a way that minimizes overlap,
so gradients don't have to go across the network that frequently).
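A toy single-process sketch of that idea (plain Python threads and queues,
purely illustrative; a real system would place the shards and buffers on
different machines): parameters are sharded across several accumulators,
producers push gradient slices into bounded buffers, and one consumer per
shard applies the -=delta, so atomicity is only needed per shard.

```python
# Illustrative sketch: sharded parameters with producer/consumer gradient buffers.
import queue
import threading
import numpy as np

NUM_SHARDS, DIM = 4, 16
SLICE = DIM // NUM_SHARDS
shards = [np.zeros(SLICE) for _ in range(NUM_SHARDS)]            # parameter shards
buffers = [queue.Queue(maxsize=128) for _ in range(NUM_SHARDS)]  # bounded, so producers block instead of overflowing

def consumer(s):
    while True:
        grad = buffers[s].get()
        if grad is None:          # poison pill: shut down
            break
        shards[s] -= grad         # the "-=delta" only needs to be atomic per shard

def producer(num_steps):
    for _ in range(num_steps):
        full_grad = 0.01 * np.random.randn(DIM)    # stand-in for a computed gradient
        for s in range(NUM_SHARDS):                # route each slice to its shard's buffer
            buffers[s].put(full_grad[s * SLICE:(s + 1) * SLICE])

consumers = [threading.Thread(target=consumer, args=(s,)) for s in range(NUM_SHARDS)]
for t in consumers:
    t.start()
producer(100)
for s in range(NUM_SHARDS):
    buffers[s].put(None)
for t in consumers:
    t.join()
```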


If we instead use a centralized architecture, then there must be >=1
backup parameter servers for mission-critical training.


Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:

Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:


That should be easy. But that defeats the purpose of using mahout as there
are already enough implementations of single node backpropagation (in which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or test
it), and its possible to build the workflow on top of YARN. Non of those
framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had in
mind network that will fit into memory, but will require significant amount
of computations.

I understand that there are better options for neural networks than map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had in
mind network that will fit into memory, but will require significant amount
of computations.

I understand that there are better options for neural networks than map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining (RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited for
MR, so I think there is no point including the patch.

  GSOC 2013 Neural network algorithms

---

  Key: MAHOUT-1426
  URL: https://issues.apache.org/jira/browse/MAHOUT-1426
  Project: Mahout
   Issue Type: Improvement
   Components: Classification
 Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes.
There is only one word about Autoencoders in NeuralNetwork class.
As far as I know Mahout doesn't support convolutional networks.
Is it a good idea to implement one of these algorithms?
Is it a reasonable amount of work?
How hard

Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng

Hi Yexi,

I was reading your code and found the MLP class is abstract-ish (both
train functions throw exceptions). Is there a thread or ticket for a
shippable implementation?


Yours Peng

On Thu 27 Feb 2014 06:56:51 PM EST, peng wrote:

With pleasure! the original downpour paper propose a parameter server
from which subnodes download shards of old model and upload gradients.
So if the parameter server is down, the process has to be delayed, it
also requires that all model parameters to be stored and atomically
updated on (and fetched from) a single machine, imposing asymmetric
HDD and bandwidth requirement. This design is necessary only because
each -=delta operation has to be atomic. Which cannot be ensured
across network (e.g. on HDFS).

But it doesn't mean that the operation cannot be decentralized:
parameters can be sharded across multiple nodes and multiple
accumulator instances can handle parts of the vector subtraction. This
should be easy if you create a buffer for the stream of gradient, and
allocate proper numbers of producers and consumers on each machine to
make sure it doesn't overflow. Obviously this is far from MR
framework, but at least it can be made homogeneous and slightly faster
(because sparse data can be distributed in a way to minimize their
overlapping, so gradients doesn't have to go across the network that
frequent).

If we instead using a centralized architect. Then there must be =1
backup parameter server for mission critical training.

Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:

Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:


That should be easy. But that defeats the purpose of using mahout as
there
are already enough implementations of single node backpropagation
(in which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or
test
it), and its possible to build the workflow on top of YARN. Non of
those
framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



  [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I
had in
mind network that will fit into memory, but will require
significant amount
of computations.

I understand that there are better options for neural networks than
map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining
(RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not
suited for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I
had in
mind network that will fit into memory, but will require
significant amount
of computations.

I understand that there are better options for neural networks than
map
reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense.
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining
(RBM
or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not
suited for
MR, so I think there is no point including the patch.

  GSOC 2013 Neural network algorithms

---

  Key: MAHOUT-1426
  URL:
https://issues.apache.org/jira/browse/MAHOUT-1426
  Project: Mahout
   Issue Type: Improvement
   Components: Classification
 Reporter: Maciej Mazur

I would like to ask about possibilites of implementing neural network
algorithms in mahout during GSOC.
There is a classifier.mlp package with neural network.
I can't see neighter RBM  nor Autoencoder in these classes

Re: [jira] [Comment Edited] (MAHOUT-1426) GSOC 2013 Neural network algorithms

2014-02-27 Thread peng

Oh, thanks a lot, I missed that one :)
+1 on implementing the easiest one first. I haven't thought about the difficulty
issue; I need to read more about the YARN extension.


Yours Peng

On Thu 27 Feb 2014 08:06:27 PM EST, Yexi Jiang wrote:

Hi, Peng,

Do you mean the MultilayerPerceptron? There are three 'train' method, and
only one (the one without the parameters trackingKey and groupKey) is
implemented. In current implementation, they are not used.

Regards,
Yexi


2014-02-27 19:31 GMT-05:00 Ted Dunning ted.dunn...@gmail.com:


Generally for training models like this, there is an assumption that fault
tolerance is not particularly necessary because the low risk of failure
trades against algorithmic speed.  For reasonably small chance of failure,
simply re-running the training is just fine.  If there is high risk of
failure, simply checkpointing the parameter server is sufficient to allow
restarts without redundancy.

Sharding the parameter is quite possible and is reasonable when the
parameter vector exceed 10's or 100's of millions of parameters, but isn't
likely much necessary below that.

The asymmetry is similarly not a big deal.  The traffic to and from the
parameter server isn't enormous.


Building something simple and working first is a good thing.


On Thu, Feb 27, 2014 at 3:56 PM, peng pc...@uowmail.edu.au wrote:


With pleasure! the original downpour paper propose a parameter server

from

which subnodes download shards of old model and upload gradients. So if

the

parameter server is down, the process has to be delayed, it also requires
that all model parameters to be stored and atomically updated on (and
fetched from) a single machine, imposing asymmetric HDD and bandwidth
requirement. This design is necessary only because each -=delta operation
has to be atomic. Which cannot be ensured across network (e.g. on HDFS).

But it doesn't mean that the operation cannot be decentralized:

parameters

can be sharded across multiple nodes and multiple accumulator instances

can

handle parts of the vector subtraction. This should be easy if you

create a

buffer for the stream of gradient, and allocate proper numbers of

producers

and consumers on each machine to make sure it doesn't overflow. Obviously
this is far from MR framework, but at least it can be made homogeneous

and

slightly faster (because sparse data can be distributed in a way to
minimize their overlapping, so gradients doesn't have to go across the
network that frequent).

If we instead using a centralized architect. Then there must be =1

backup

parameter server for mission critical training.

Yours Peng

e.g. we can simply use a producer/consumer pattern

If we use a producer/consumer pattern for all gradients,

On Thu 27 Feb 2014 05:09:52 PM EST, Yexi Jiang wrote:


Peng,

Can you provide more details about your thought?

Regards,


2014-02-27 16:00 GMT-05:00 peng pc...@uowmail.edu.au:

  That should be easy. But that defeats the purpose of using mahout as

there
are already enough implementations of single node backpropagation (in
which
case GPU is much faster).

Yexi:

Regarding downpour SGD and sandblaster, may I suggest that the
implementation better has no parameter server? It's obviously a single
point of failure and in terms of bandwidth, a bottleneck. I heard that
MLlib on top of Spark has a functional implementation (never read or

test

it), and its possible to build the workflow on top of YARN. Non of

those

framework has an heterogeneous topology.

Yours Peng


On Thu 27 Feb 2014 09:43:19 AM EST, Maciej Mazur (JIRA) wrote:



   [ https://issues.apache.org/jira/browse/MAHOUT-1426?page=
com.atlassian.jira.plugin.system.issuetabpanels:comment-
tabpanelfocusedCommentId=13913488#comment-13913488 ]

Maciej Mazur edited comment on MAHOUT-1426 at 2/27/14 2:41 PM:
---

I've read the papers. I didn't think about distributed network. I had

in

mind network that will fit into memory, but will require significant
amount
of computations.

I understand that there are better options for neural networks than

map

reduce.
How about non-map-reduce version?
I see that you think it is something that would make a sense. (Doing a
non-map-reduce neural network in Mahout would be of substantial
interest.)
Do you think it will be a valueable contribution?
Is there a need for this type of algorithm?
I think about multi-threded batch gradient descent with pretraining

(RBM

or/and Autoencoders).

I have looked into these old JIRAs. RBM patch was withdrawn.
I would rather like to withdraw that patch, because by the time i
implemented it i didn't know that the learning algorithm is not suited
for
MR, so I think there is no point including the patch.


was (Author: maciejmazur):
I've read the papers. I didn't think about distributed network. I had

in

mind network that will fit into memory, but will require significant
amount
of computations.

I understand that there are better options

Re: Mahout on Spark?

2014-02-19 Thread peng
It was suggested that I switch to MLlib for its performance, but I doubt
that it is production-ready; even if it is, I would still favour Hadoop's
sturdiness and self-healing.
But maybe mahout can include contribs that M/R is not fit for, like
downpour SGD or graph-based algorithms?


On Wed 19 Feb 2014 07:52:22 AM EST, Sean Owen wrote:

To set expectations appropriately, I think it's important to point out
this is completely infeasible short of a total rewrite, and I can't
imagine that will happen. It may not be obvious if you haven't looked
at the code how completely dependent on M/R it is.

You can swap out M/R and Spark if you write in terms of something like
Crunch, but that is not at all the case here.

On Wed, Feb 19, 2014 at 12:43 PM, Jay Vyas jayunit...@gmail.com wrote:

+100 for this, different execution engines, like the direction  pig and crunch 
take

Sent from my iPhone


On Feb 19, 2014, at 5:19 AM, Gokhan Capan gkhn...@gmail.com wrote:

I imagine in Mahout offering an option to the users to select from
different execution engines (just like we currently do by giving M/R or
sequential options), and starting from Spark. I am not sure what changes
needed in the codebase, though. Maybe following MLI (or alike) and
implementing some more stuff, such as common interfaces for iterating over
data (the M/R way and the Spark way).

IMO, another effort might be porting pre-online machine learning (such as
transforming text into vectors based on the dictionary generated by
seq2sparse before), machine learning based on mini-batches, and streaming
summarization stuff in Mahout to Spark Streaming.

Best,
Gokhan

On Wed, Feb 19, 2014 at 10:45 AM, Dmitriy Lyubimov dlie...@gmail.comwrote:


PS I am moving along a cost optimizer for Spark-backed DRMs on some
multiplicative pipelines that is capable of figuring out different cost-based
rewrites, and an R-like DSL that mixes in-core and distributed matrix
representations and blocks, but it is painfully slow; I'm really only doing it
a couple of nights a month. It does not look like I will be doing it on
company time any time soon (and even if I did, the company doesn't seem to
be inclined to contribute anything new I do on their time). It is
all painfully slow; there's no direct funding for it anywhere with no
strings attached. That probably will be the primary reason why Mahout would not
be able to get much traction compared to university-based contributions.


On Wed, Feb 19, 2014 at 12:27 AM, Dmitriy Lyubimov dlie...@gmail.com

wrote:



Unfortunately methinks the prospects of something like a Mahout/MLlib merge
seem very unlikely due to vastly diverged approaches to the basics of
linear algebra (and other things). Just like one cannot grow a single tree out of
two trunks -- not easily, anyway.

It is fairly easy to port (and subsequently beat) MLlib at this point from
a collection-of-algorithms point of view. But IMO the goal should be more
MLI-like first, and port second. And be very careful with concepts.
Something that I so far don't see happening with MLlib. MLlib seems to be
an old-style, Mahout-like rush to become a collection of basic algorithms
rather than a coherent foundation. Admittedly, I haven't looked very
closely.



On Tue, Feb 18, 2014 at 11:41 PM, Sebastian Schelter s...@apache.org
wrote:


I'm also convinced that Spark is a superior platform for executing
distributed ML algorithms. We've had a discussion about a change from
Hadoop to another platform some time ago, but at that point in time it

was

not clear which of the upcoming dataflow processing systems (Spark,
Hyracks, Stratosphere) would establish itself amongst the users. To me

it

seems pretty obvious that Spark made the race.

I concur with Ted, it would be great to have the communities work
together. I know that at least 4 mahout committers (including me) are
already following Spark's mailinglist and actively participating in the
discussions.

What are the ideas how a fruitful cooperation look like?

Best,
Sebastian

PS:

I ported LLR-based cooccurrence analysis (aka item-based recommendation)
to Spark some time ago, but I haven't had time to test my code on a
large dataset yet. I'd be happy to see someone help with that.







On 02/19/2014 08:04 AM, Nick Pentreath wrote:

I know the Spark/Mllib devs can occasionally be quite set in ways of
doing certain things, but we'd welcome as many Mahout devs as possible

to

work together.


It may be too late, but perhaps a GSoC project to look at a port of
some stuff like the co-occurrence recommender and streaming k-means?




N
--
Sent from Mailbox for iPhone

On Wed, Feb 19, 2014 at 3:02 AM, Ted Dunning ted.dunn...@gmail.com
wrote:

On Tue, Feb 18, 2014 at 1:58 PM, Nick Pentreath 

nick.pentre...@gmail.comwrote:


My (admittedly heavily biased) view is Spark is a superior platform
overall
for ML. If the two communities can work together to leverage the
strengths
of Spark, and the large amount of good stuff in Mahout (as well as

the

fantastic depth of 

Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread Peng Cheng
Really? I guess PageRank in mahout was removed due to the inherent network
bottleneck of MapReduce. But I didn't know MLlib has the implementation.
Is the MLlib implementation based on Lanczos or SSVD? Just curious...


On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:

I bet page rank in mllib in spark finds stationary distribution much faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:


Agreed, and this is the case where Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvector of asymmetric
matrix (this is a common formulation of PageRank, and some random walks,
and many other things), then we still have to rely on large-scale Lanczos
algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:


For the symmetric case, SVD is eigen decomposition.




On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

  If SSVD is not designed for such eigenvector problem. Then I would vote

for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate
than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

  In PageRank I'm afraid I have no other option than eigenvector

\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

  SSVD is very probably better than Lanczos for any large decomposition.

That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

   Just asking for possible replacement of our Lanczos-based PageRank


implementation. - Peng







Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-18 Thread peng
Thanks a lot Sebastian, Ted and Dmitriy; I'll try Giraph for a
performance benchmark.
You are right, power iteration is just the simplest form of Lanczos, so
it shouldn't be in the scope.

On Tue 18 Feb 2014 03:59:57 AM EST, Sebastian Schelter wrote:

You can also use giraph for a superfast PageRank implementation. Giraph
even runs on standard hadoop clusters.

Pagerank is usually computed by power iteration, which is much simpler than
lanczos or ssvd and only gives the eigenvector associated with the largest
eigenvalue.
On 18.02.2014 09:33, Peng Cheng pc...@uowmail.edu.au wrote:


Really? I guess PageRank in mahout was removed due to inherited network
bottleneck of mapreduce. But I didn't know MLlib has the implementation. Is
mllib implementation based on Lanczos or SSVD? Just curious...

On 17/02/2014 11:11 PM, Dmitriy Lyubimov wrote:


I bet page rank in mllib in spark finds stationary distribution much
faster.
On Feb 17, 2014 1:33 PM, peng pc...@uowmail.edu.au wrote:

  Agreed, and this is the case where Lanczos algorithm is obsolete.

My point is: if SSVD is unable to find the eigenvector of asymmetric
matrix (this is a common formulation of PageRank, and some random walks,
and many other things), then we still have to rely on large-scale Lanczos
algorithm.

On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

  For the symmetric case, SVD is eigen decomposition.





On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:

   If SSVD is not designed for such eigenvector problem. Then I would
vote


for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate
than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

   In PageRank I'm afraid I have no other option than eigenvector


\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

   SSVD is very probably better than Lanczos for any large
decomposition.


 That said, it does SVD, not eigen decomposition which means that
the
question of symmetrical matrices or positive definiteness doesn't
much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

Just asking for possible replacement of our Lanczos-based PageRank

  implementation. - Peng











Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng
If SSVD is not designed for such eigenvector problems, then I would vote
for retaining the Lanczos algorithm.
However, I would like to see the opposite case: I have tested both
algorithms on the symmetric case and SSVD is much faster and more accurate
than its competitor.


Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:

In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
  That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng





Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-17 Thread peng

Agreed, and this is the case where the Lanczos algorithm is obsolete.
My point is: if SSVD is unable to find the eigenvector of an asymmetric
matrix (this is a common formulation of PageRank, and some random
walks, and many other things), then we still have to rely on the
large-scale Lanczos algorithm.


On Mon 17 Feb 2014 04:25:16 PM EST, Ted Dunning wrote:

For the symmetric case, SVD is eigen decomposition.
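A short note on why the two coincide when A is symmetric, with A = Q \Lambda Q^T
its eigendecomposition (the signs of the eigenvalues are absorbed into the left
singular vectors; take sign(0) = 1 for zero eigenvalues):

```latex
% Eigendecomposition of a symmetric A rearranged into an SVD:
A = Q \Lambda Q^{\mathsf T}
  = \underbrace{\bigl(Q\,\operatorname{sign}(\Lambda)\bigr)}_{U}\,
    \underbrace{\lvert\Lambda\rvert}_{\Sigma}\,
    \underbrace{Q^{\mathsf T}}_{V^{\mathsf T}},
\qquad \sigma_i = \lvert\lambda_i\rvert .
```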




On Mon, Feb 17, 2014 at 1:12 PM, peng pc...@uowmail.edu.au wrote:


If SSVD is not designed for such eigenvector problem. Then I would vote
for retaining the Lanczos algorithm.
However, I would like to see the opposite case, I have tested both
algorithms on symmetric case and SSVD is much faster and more accurate than
its competitor.

Yours Peng

On Wed 12 Feb 2014 03:25:47 PM EST, peng wrote:


In PageRank I'm afraid I have no other option than eigenvector
\lambda, but not singular vector u & v :) The PageRank in Mahout was
removed with other graph-based algorithm.

On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:


SSVD is very probably better than Lanczos for any large decomposition.
   That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:

  Just asking for possible replacement of our Lanczos-based PageRank

implementation. - Peng








Re: Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-12 Thread peng
In PageRank I'm afraid I have no option other than the eigenvector for \lambda,
not the singular vectors u & v :) The PageRank in Mahout was removed with
the other graph-based algorithms.


On Tue 11 Feb 2014 06:34:17 PM EST, Ted Dunning wrote:

SSVD is very probably better than Lanczos for any large decomposition.
  That said, it does SVD, not eigen decomposition which means that the
question of symmetrical matrices or positive definiteness doesn't much
matter.

Do you really need eigen-decomposition?



On Tue, Feb 11, 2014 at 2:55 PM, peng pc...@uowmail.edu.au wrote:


Just asking for possible replacement of our Lanczos-based PageRank
implementation. - Peng





Does SSVD supports eigendecomposition of non-symmetric non-positive-semidefinitive matrix better than Lanczos?

2014-02-11 Thread peng
Just asking for possible replacement of our Lanczos-based PageRank 
implementation. - Peng


Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a. feature engineering; in Solr
these are queries/function queries).
2. Test those scorers on a ground-truth dataset, generating feature
vectors for the top-n results annotated by humans.
3. Use an existing classifier/regressor (e.g. support vector ranking,
GBDT, random forest etc.) on those feature vectors to get a ranking model
(see the sketch below).
4. Export this ranking model back to Solr as a custom ensemble query (a
BooleanQuery with custom boosting factors for a linear model, or a
CustomScoreQuery with a custom scoring function for a non-linear model),
push it to the Solr server, register it with a QParser. Push it to production.
End of.
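A minimal sketch of step 3 only (plain Python with scikit-learn as a stand-in
for whichever learner you prefer; the feature names and data are made up for
illustration). The learned linear weights then become the boost factors for the
ensemble BooleanQuery in step 4:

```python
# Illustrative sketch of step 3: learn a linear ranking model from weak-scorer
# feature vectors, then read its weights off as per-feature boosts for step 4.
# Feature names, data and labels are synthetic; any regressor/classifier works here.
import numpy as np
from sklearn.linear_model import LogisticRegression

np.random.seed(0)
feature_names = ["bm25_title", "bm25_body", "recency", "popularity"]  # the weak scorers
X = np.random.rand(200, len(feature_names))     # scorer outputs for the top-n results
y = X @ np.array([2.0, 1.0, 0.5, 0.2]) > 1.8    # stand-in for human relevance judgments

model = LogisticRegression().fit(X, y)
boosts = dict(zip(feature_names, model.coef_[0]))
print(boosts)  # -> boost factors to plug into the ensemble BooleanQuery
```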


But I didn't find this workflow quite easy to implement with mahout-solr
integration (is it discouraged for some reason?). Namely, there is no
pipeline from the results of the scorers to a Mahout-compatible vector form, and
there is no pipeline from the ranking model back to an ensemble query. (I only
found the lucene2seq class, and the upcoming recommendation support,
which don't quite fit the scenario). So what's the best practice
for easily implementing a realtime, learning-to-rank search engine in
this case? I've worked in a bunch of startups and such an appliance seems
to be in high demand. (Remember that Solr-based collaborative filtering
model proposed by Dr Dunning? This is the content-based counterpart of it.)


I'm looking forward to streamlining this process to make my upcoming work
easier. I think Mahout/Solr is the undisputed instrument of choice due
to their scalability and the machine learning background of many of their
top committers. Can we talk about it at some point?


Yours Peng


Re: Learning to rank support in Mahout and Solr integration?

2014-02-09 Thread peng

Hi Dr Dunning,

Thanks a lot! I was trying to make the model generalizable enough, but
I'm also afraid I may 'abuse' it a bit. Here is my existing solution:


1. Wrap any scorer in a ValueSource (many exist out-of-the-box in
lucene-solr; extensions are possible but they don't have to be
registered with a ValueSourceParser - they won't be used independently).
2. Extend CustomScoreQuery to have a flat and straightforward
explanation form. Use this as a wrapper of filters (as sub-queries) and
scorers (as function queries).
3. Write a converter to print the flat explanation to Mahout-compatible
vectors.
4. Run a job to 'explain()' those ground truths on an index and dump
the result vectors.

5. (optional) Run other jobs to get non-content-based score vectors.
6. Join them, feed them into a classifier/regressor, do some model
selection.
7. (from this point I haven't done anything) Try to 'migrate' this
model into another CustomScoreQuery, which has a strong scorer that
ensembles features in the same way the model suggested.

8. Push it into the SolrCloud server. Register it with a QParser.

What I found to be hard:

1. Using explanations this way is kind of abusive; they are only designed 
for manual tweaking. I constantly run into problems where the 'explain()' 
implementation was looked down upon by developers and filled with stub 
code. Notably, ToParentBlockJoin won't show nested scores, and 
ToChildBlockJoin simply doesn't work.
2. There is no automatic way to 'migrate' the model into an ensemble 
query. Though I haven't proceeded that far, I'm already afraid of the 
difficulty.
3. As a NoSQL database optimized to the core for text processing, Solr 
extensions are not at all intuitive and are hard to debug and maintain. We 
try to keep this part minimal but still stall at some point.


The environment is built on CDH 5.0beta2 with YARN and Cloudera Search 
(Solr 4.4); some bugs then forced me to uninstall it and install Solr 
Cloud 4.6. I wonder if there are more 'out-of-the-box' solutions?


Yours Peng

On Sun 09 Feb 2014 05:53:20 PM EST, Ted Dunning wrote:

I think that this is a bit of an idiosyncratic model for learning to rank,
but it is a reasonably viable one.

It would be good to have a discussion of what you find hard or easy and
what you think is needed to make this work.

Let's talk.



On Sun, Feb 9, 2014 at 2:26 PM, peng pc...@uowmail.edu.au wrote:


This is what I believe to be a typical learning to rank model:

1. Create many weak rankers/scorers (a.k.a feature engineering, in Solr
these are queries/function queries).
2. Test those scorers on a ground truth dataset. Generating feature
vectors for top-n results annotated by human.
3. Use an existing classifier/regressor (e.g. support vector ranking,
GBDT, random forest etc.) on those feature vectors to get a ranking model.
4. Export this ranking model back to Solr as a custom ensemble query (a
BooleanQuery with custom boosting factor for linear model, or a
CustomScoreQuery with custom scoring function for non-linear model), push
it to Solr server, register with QParser. Push it to production. End of.

But I didn't find this workflow quite easy to implement in mahout-solr
integration (is it discouraged for some reason?). Namely, there is no
pipeline from results of scorers to a Mahout-compatible vector form, and
there is no pipeline from ranking model back to ensemble query. (I only
found the lucene2seq class, and the upcoming recommendation support, which
don't quite fit into the scenario). So what's the best practice for easily
implementing a realtime, learning to rank search engine in this case? I've
worked in a bunch of startups and such appliance seems to be in high
demand. (Remember that solr-based collaborative filtering model proposed by
Dr Dunning? This is the content-based counterpart of it)

I'm looking forward to streamline this process to make my upcoming work
easier. I think Mahout/Solr is the undisputed instrument of choice due to
their scalability and machine learning background of many of their top
committers. Can we talk about it at some point?

Yours Peng





Re: Mahout 0.9 Release

2014-01-29 Thread peng

+1, can't see a bad side.

On Wed 29 Jan 2014 11:33:02 AM EST, Suneel Marthi wrote:

+1 from me





On Wednesday, January 29, 2014 8:58 AM, Sebastian Schelter s...@apache.org 
wrote:

+1


On 01/29/2014 05:25 AM, Andrew Musselman wrote:

Looks good.

+1


On Tue, Jan 28, 2014 at 8:07 PM, Andrew Palumbo ap@outlook.com wrote:


a), b), c), d) all passed here.

CosineDistance of clustered points from cluster-reuters.sh -1 kmeans were
within the range [0,1].


Date: Tue, 28 Jan 2014 16:45:42 -0800
From: suneel_mar...@yahoo.com
Subject: Mahout 0.9 Release
To: u...@mahout.apache.org; dev@mahout.apache.org

Fixed the issues that were reported with Clustering code this past week,

upgraded codebase to Lucene 4.6.1 that was released today.


Here's the URL for the 0.9 release in staging:-


https://repository.apache.org/content/repositories/orgapachemahout-1004/org/apache/mahout/mahout-distribution/0.9/


The artifacts have been signed with the following key:
https://people.apache.org/keys/committer/smarthi.asc

Please:-
a) Verify that u can unpack the release (tar or zip)
b) Verify u r able to compile the distro
c)  Run through the unit tests: mvn clean test
d) Run the example scripts under $MAHOUT_HOME/examples/bin. Please run

through all the different options in each script.


Need a minimum of 3 '+1' votes from PMC for the release to be finalized.







Sebastian: On the subject of efficient In-memory DataModel for recommendation engine.

2013-10-20 Thread Peng Cheng

Hi Sebastian,

Sorry I dropped out from the Hangout for a few minutes; when I got back 
it was already over.


Well, let's continue the conversation on the DataModel improvement:

I was looking into your KDDCupFactorizablePreferences and found that it 
doesn't load any data into memory; the only data structure in that class 
is the dataFile used to generate a stream of preferences from disk. I 
think this is why you can load it into 1G of memory without a heap-space 
overflow.


However, I think it is only good for saving memory at the expense of lots 
of things (e.g. random access, random insert, delete and update, 
concurrency). This justifies the need to load things into memory. 
Theoretically, a preference array of Netflix size will cost at least:


[8 bytes (userID : long) + 8 bytes (itemID : long) + 4 bytes (value : 
float)] * 100,480,507 = 2,009,610,140 bytes, roughly 1.87 GB (1916.5 MB)


...plus overhead. But I would rather have it a bit bigger to trade for 
O(1) random access/update, though not as big as the current row/column 
sparse matrix-ish implementation that duplicates everything.
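
For reference, a minimal sketch of the parallel-array layout that the 
estimate above assumes (the class is my own illustration, not the attached 
patch): roughly 20 bytes per preference plus array overhead, with O(1) 
access by position.

```java
public class CompactPreferenceStore {
  private final long[] userIDs;    // 8 bytes per preference
  private final long[] itemIDs;    // 8 bytes per preference
  private final float[] values;    // 4 bytes per preference

  public CompactPreferenceStore(int numPreferences) {
    userIDs = new long[numPreferences];
    itemIDs = new long[numPreferences];
    values = new float[numPreferences];
  }

  public void set(int index, long userID, long itemID, float value) {
    userIDs[index] = userID;
    itemIDs[index] = itemID;
    values[index] = value;
  }

  public float getValue(int index) {
    return values[index];
  }
}
```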


That's my concern. I have several ideas for optimizing my in-memory 
dataModel, but never had time to do them. Please give me a few more 
weeks; when the code is optimized to the teeth and supports concurrent 
access, I'll submit it again for review. Gokhan has also done a lot of 
work on this part, so it's good to have many options.


Yours Peng




Why Kahan summation was not used anywhere?

2013-09-18 Thread Peng Cheng
For a large-scale computational engine this seems surprising. Most 
summation/average and dot-product code for vectors still uses naive 
summation despite its O(n) error growth.


Is there a reason?
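
For reference, a minimal Kahan (compensated) summation sketch in plain 
Java; the compensation term keeps the accumulated rounding error roughly 
constant instead of growing with n:

```java
public final class KahanSum {
  public static double sum(double[] values) {
    double sum = 0.0;
    double compensation = 0.0;            // running compensation for lost low-order bits
    for (double value : values) {
      double y = value - compensation;    // subtract the error carried over from the last step
      double t = sum + y;                 // sum is big, y is small: low-order digits of y are lost
      compensation = (t - sum) - y;       // algebraically zero; numerically it recovers the loss
      sum = t;
    }
    return sum;
  }
}
```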

All the best,
Yours Peng



[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-09-04 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13758103#comment-13758103
 ] 

Peng Cheng commented on MAHOUT-1286:


The existing open-addressing hash table is for 1-d arrays, not 2-d matrices. I can get 
the concurrency done by next week, but there are simply too many pending optimizations, 
e.g. if you set the load factor to 1.2 it is pretty slow. If you can help improve the 
items on the TODO list in the code, that would be awesome.

I'm not sure about the consequences, as the 2-d matrix interface has an int (32-bit) 
index but the dataModel has a long (64-bit) index. If you don't mind adding more things 
to mahout-math, then it should be alright.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13754269#comment-13754269
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gokhan,

No problem; it only has two files, so I'll post the patch immediately. -Yours 
Peng

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-29 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: Semifinal-implementation-added.patch

Sorry about the late reply, and please note that the code can still be optimized in 
many places; I'll keep maintaining it and keep an ear open for all suggestions.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java, 
 Semifinal-implementation-added.patch

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: You are invited to Apache Mahout meet-up

2013-08-22 Thread Peng Cheng
Is the presentation going to be uploaded on Youtube or Slideshare? Sorry 
I cannot be there.


On 13-08-22 08:46 AM, Yexi Jiang wrote:

A great event. I wish I were in Bay area.


2013/8/22 Shannon Quinn squ...@gatech.edu


I'm only sorry I'm not in the Bay area. Sounds great!


On 8/22/13 3:38 AM, Stevo Slavić wrote:


Retweeted meetup invite. Have fun!

Kind regards,
Stevo Slavic.


On Thu, Aug 22, 2013 at 8:34 AM, Ted Dunning ted.dunn...@gmail.com
wrote:

  Very cool.

Would love to see folks turn out for this.


On Wed, Aug 21, 2013 at 9:38 PM, Ellen Friedman
b.ellen.fried...@gmail.com wrote:

  The Apache Mahout user group has been re-activated. If you are in the Bay
Area in California, join us on Aug 27 (Redwood City).

Sebastian Schelter will be the main speaker, talking about new directions
with Mahout recommendation. Grant Ingersoll, Ted Dunning and I will be there
to do a short introduction for the meet-up and an update on the 0.8 release.

Here's the link to rsvp: http://bit.ly/16K32hg

Hope you can come, and please spread the word.

Ellen









[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-16 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13742714#comment-13742714
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Much appreciated. I watched your talk in Berlin on YouTube and finally have a clue 
about what is going on here.

If I understand correctly, the core concept is to use Solr as a sparse matrix 
multiplier. So theoretically it can encapsulate any recommendation engine (not 
necessarily CF) as long as the recommendation phase can be cast as a linear 
multiplication. The co-occurrence matrix is one instance; other types of 
recommendation are possible, but slightly harder, and sometimes require multiple 
queries. The following 3 cases should cover most classical CF instances:

1. Item-based CF (result = Sim(A,A) * h, where A is the rating matrix and Sim() is 
the item-to-item similarity matrix between all pairs of items): this is the easiest 
and has already been addressed in your talk: calculate Sim(A,A) beforehand, import it 
into Solr, and run a query ranked by weighted frequency (a minimal query sketch 
follows after this list).

2. User-based CF (result = A^T * Sim(A,h), where Sim() is the user-to-user similarity 
vector between the new user and all old users): slightly more complex; run the first 
query on A ranked by the customized similarity function, then use its result to run 
the second query on A^T ranked by weighted frequency.

3. SVD-based CF: no can do if the new user is not known beforehand; AFAIK Solr doesn't 
have any form of matrix pseudoinversion or optimization function, so determining a new 
user's projection in the SV subspace is impossible given only its dot products with 
some old items. However, if the user in question is old, or the new user can be merged 
into the model in real time, Solr can just look up its vector in the SV subspace with 
a full-match search.

4. Ensemble: obviously another linear operation; it can be expressed as a query with a 
mixed ranking function or as multiple queries. Multi-model recommendation, as a 
juxtaposition of rating matrices (A_1 | A_2), was never a problem either, using 
old-style CF or recommendation-as-search.
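
A minimal sketch of case 1 (the field name "indicators" and the query shape are my own 
assumptions, not an existing Mahout/Solr API): the co-occurrence indicators of each 
item are indexed as terms, a user's history h becomes a disjunction, and the ranking 
then approximates Sim(A,A) * h by weighted term-match frequency.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class ItemBasedAsSearch {
  /** One SHOULD clause per item in the user's history, against the indicator field. */
  public static Query historyQuery(String[] itemIdsInUserHistory) {
    BooleanQuery query = new BooleanQuery();   // Lucene 4.x mutable BooleanQuery
    for (String itemId : itemIdsInUserHistory) {
      query.add(new TermQuery(new Term("indicators", itemId)), BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}
```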

Judging by the sheer performance and scalability of Solr, this could potentially make 
recommendation-as-search a superior option. However, as Gokhan inferred, we will likely 
still use the old algorithms for training, but Solr for recommendation. So I'm going 
back to 1274 anyway, using the posted DataModel as temporary glue. It won't be hard for 
me or anybody else to refactor it for the Solr interface.

-Yours Peng

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng commented on MAHOUT-1286:


The idea of ArrayMap has been discarded due to its impractical insertion time (O(n) 
for a batch insertion) and query time (O(log n)). I have moved back to HashMap. For the 
same reason, I suspect that using a sparse row/column matrix would have the same 
problem.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Attachment: InMemoryDataModelTest.java
InMemoryDataModel.java

See the uploaded files for details

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736992#comment-13736992
 ] 

Peng Cheng commented on MAHOUT-1286:


Here is my final solution after numerous experiments: a combination of double hashing 
for storing user/item IDs and 2-d hopscotch hashing 
(http://mcg.cs.tau.ac.il/papers/disc2008-hopscotch.pdf) for storing preferences as a 
map keyed by the user/item indices in the double hashing table. Hopscotch hashing 
maintains strong locality and a high load factor, and each dimension uses an 
independent hash function. As a result, it can quickly extract a submatrix or a single 
row or column.
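
For reference, a generic double-hashing probe sketch (my own illustration, not the 
actual FastByIDMap or hopscotch code): the i-th probe for a key is (h1 + i * h2) mod 
capacity, with h2 forced odd so the probe sequence covers a power-of-two table.

```java
public final class DoubleHashProbe {
  /** capacity is assumed to be a power of two. */
  public static int probe(long key, int attempt, int capacity) {
    int h1 = (int) (key ^ (key >>> 32));                    // first hash: fold the long
    int h2 = 1 | (int) (key * 0x9E3779B97F4A7C15L >>> 33);  // second hash: odd step size
    return (h1 + attempt * h2) & (capacity - 1);            // mask keeps the result in range
  }
}
```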

This is the smallest implementation I can think of; apparently only a Bloom map could 
achieve a smaller memory footprint, but it has many other problems.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Fix Version/s: 0.9
   Labels: collaborative-filtering datamodel patch recommender  (was: )
   Status: Patch Available  (was: Open)

According to my test, it can load the entire Netflix dataset into memory using 
only 3G heap space.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: patch, collaborative-filtering, datamodel, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:44 PM:
-

Hi Dr Dunning,

Indeed both Gokhan and I have experimented with that, but I've run into some 
difficulties, namely: 1) a columnar form doesn't support fast extraction of rows, yet a 
dataModel should allow quick getPreferencesFromUser() and getPreferencesForItem(); 2) a 
columnar form doesn't support fast online updates (time complexity is O( n ), or 
O( log n ) at best if using block copy and the columns are sorted); 3) to create such a 
dataModel we need to initialize a HashMap first, which uses twice as much heap space 
during initialization and could defeat the purpose.

I'm not sure if Gokhan has encountered the same problems. I didn't hear from him for 
some time.

The search-based recommender is indeed a very tempting solution. I'm quite sure it is 
an all-around improvement for similarity-based recommenders. But low-rank 
matrix-factorization based ones should merge preferences from new users into the 
prediction model immediately; of course you can just project a new user into the 
low-rank subspace, but this reduces performance a little bit.

I'm not sure how well Lucene supports online updates of indices, but according to the 
people I'm working with, online recommenders seem to be in demand these days.

  was (Author: peng):
Hi Dr Dunning,

Indeed both Gokhan and me have experimented on that, but I've run into some 
difficulties, namely 1) a columnar form doesn't support fast extraction of 
rows, yet dataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(). 2) a columnar form doesn't support fast online update 
(time complexity is O(n), maximally O(n) if using block copy and columns are 
sorted). 3) To create such dataModel we need to initialize a HashMap first, 
this uses twice as much as heap space for initialization, could defeat the 
purpose though.

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search based recommender is indeed a very tempting solution. I'm very sure 
it is an all-improving solution to similarity-based recommenders. But low rank 
matrix-factorization based ones should merge preferences from the new users 
immediately into the prediction model, of course you can just project it into 
the low rank subspace, but this reduces the performance a little bit.

I'm not sure how much Lucene supports online update of indices, but according 
to guys I'm working with the online recommender seems to be in demand these 
days.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737019#comment-13737019
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Dr Dunning,

Indeed both Gokhan and me have experimented on that, but I've run into some 
difficulties, namely 1) a columnar form doesn't support fast extraction of 
rows, yet dataModel should allow quick getPreferencesFromUser() and 
getPreferencesForItem(). 2) a columnar form doesn't support fast online update 
(time complexity is O(n), maximally O(n) if using block copy and columns are 
sorted). 3) To create such dataModel we need to initialize a HashMap first, 
this uses twice as much as heap space for initialization, could defeat the 
purpose though.

I'm not sure if Gokhan has encountered the same problem. Didn't hear from him 
for some time.

The search based recommender is indeed a very tempting solution. I'm very sure 
it is an all-improving solution to similarity-based recommenders. But low rank 
matrix-factorization based ones should merge preferences from the new users 
immediately into the prediction model, of course you can just project it into 
the low rank subspace, but this reduces the performance a little bit.

I'm not sure how much Lucene supports online update of indices, but according 
to guys I'm working with the online recommender seems to be in demand these 
days.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737023#comment-13737023
 ] 

Peng Cheng commented on MAHOUT-1286:


Well, I mean, I partially agree that the effort I spent on this probably won't pay 
off, as few will use an in-memory/file dataModel in production; most will choose a 
database-backed one. I'm just trying to solve it because it's a blocker.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13736962#comment-13736962
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 8/12/13 4:54 PM:
-

The idea of ArrayMap has been discarded due to its impractical time consumption 
of insertion (O( n ) for a batch insertion) and query (O(logn)). I have moved 
back to HashMap. Due to the same reason, I feel that using Sparse Row/Column 
matrix may have the same problem.

  was (Author: peng):
The idea of ArrayMap has been discarded due to its impractical time 
consumption of insertion (O(n) for a batch insertion) and query (O(logn)). I 
have moved back to HashMap. Due to the same reason, I feel that using Sparse 
Row/Column matrix may have the same problem.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737553#comment-13737553
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Gentlemen,

Thanks a lot for proving my point, Gokhan; yes, I mean that either user or item 
preference extraction can be fast, but not both.

Sorry, I should have proposed it in our last hangout, but I missed the invitation. 
Still, I have tried to understand your proposal on recommendation-as-search.

From what I heard on YouTube, the new architecture is proposed as an easier and faster 
replacement for all existing recommenders that take a DataModel. Each item is a 
weighted 'bag of words' generated by co-occurrence analysis/item similarity on previous 
ratings. A new user's ratings are converted into a weighted tuple of existing words and 
matched against the items that have the highest sum of hits.

My concerns are: 1) does it support all types of recommenders and their ensembles? I 
know modern search engines like Google and Yandex have fairly complex ensemble search 
and ranking algorithms that look similar to an ensemble recommender, but IMHO Lucene is 
built only for text search, and I'm not sure to what extent it is customizable. 2) Does 
it support online learning? This feature is more important for SVDRecommender, as a new 
user's recommendations are only known once this user is merged into the model. (Of 
course, an option is to project a new user into the user subspace by minimizing its 
distance given its dot products with existing items, but nobody has tested its 
performance before; see the note below.)
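
For reference, that fold-in projection written out (assuming an item-factor matrix 
\(V\), a regularization weight \(\lambda\), and the new user's observed ratings \(r\) 
restricted to the items it rated; this is the standard ridge-regression fold-in, not 
something already in Mahout):

\[
u_{\text{new}} \;=\; \arg\min_{u} \; \lVert r - V u \rVert^{2} + \lambda \lVert u \rVert^{2}
\;=\; (V^{\top} V + \lambda I)^{-1} V^{\top} r
\]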

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-08-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13737563#comment-13737563
 ] 

Peng Cheng commented on MAHOUT-1286:


Also, please note that the first patch is still not optimized to the extreme; many 
improvements can be made to make it smaller and faster (see the TODO list in the code). 
But I'm trying to get back to MAHOUT-1274; if we expect large-scale refactoring of all 
recommenders in favor of recommendation-as-search, I'll have to suspend it until the 
refactoring is finished.

I'm waiting online for Dr Dunning's plan.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, datamodel, patch, recommender
 Fix For: 0.9

 Attachments: InMemoryDataModel.java, InMemoryDataModelTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: apache-math dependency

2013-08-12 Thread Peng Cheng
Apologies, I mistook apache-math for mahout-math and didn't know what 
I was talking about :)


On 13-08-12 07:08 PM, Ted Dunning wrote:

Yes.  Apache Math linear algebra is very difficult for us to use because
their matrices are non-extensible.

But there is actually quite a lot of code to do with random distributions,
optimization and quadrature. Those are much more likely to be useful to us.


On Mon, Aug 12, 2013 at 3:26 PM, Dmitriy Lyubimov dlie...@gmail.com wrote:


Larger part of mahout-math is linear algebra, which is currently broken for
sparse part of the equation and which we don't use at all.

One part of the problem is that our use for that library is always a fringe
case, and as far as i can tell, will always continue to be such.

Another part of the problem is that keeping dependency will invite
bypassing Mahout's solvers and, as a result, architecture inconsistency.

That said, I guess Ted's argument (which is mainly cost, as i gathered),
trumps the two above.


On Mon, Aug 12, 2013 at 3:20 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Seriously, I would prefer keeping the dependency as a good architectural pattern.
It encourages other people to use/contribute to it and avoid repetitive
work.


On 13-08-12 06:16 PM, Ted Dunning wrote:


I am fine with it staying.


On Mon, Aug 12, 2013 at 3:14 PM, Dmitriy Lyubimov dlie...@gmail.com
wrote:

  So you are ok with apache-math dependency to stay?


On Mon, Aug 12, 2013 at 3:09 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  So I checked on these.  The non-trivial issues with replacing Commons Math include:

- Poisson and negative binomial distributions.  This would be several hours' work to
write and test (we have a Colt-inherited negative binomial distribution, but it takes
no longer to write a new one than to test an old one).

- random number generators.  This is about an hour or two of work to pull the
MersenneTwister implementation into our code.

- next prime number finder.  Not a big deal to replicate, but it would take a few
hours to do.

- quadrature.  We use an adaptive integration routine to check distribution
properties.  This, again, would take a few hours to replace.

I really don't see the benefit to this work.

On Mon, Aug 12, 2013 at 2:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  2 distribution.PoissonDistribution;
 2 distribution.PascalDistribution;
 2 distribution.NormalDistribution;
 1 util.FastMath;
 1 random.RandomGenerator;
 1 random.MersenneTwister;
 1 primes.Primes;
 1 linear.RealMatrix;
 1 linear.EigenDecomposition;
 1 linear.Array2DRowRealMatrix;
 1 distribution.RealDistribution;
 1 distribution.IntegerDistribution;
 1 analysis.integration.UnivariateIntegrator;
 1 analysis.integration.RombergIntegrator;
 1 analysis.UnivariateFunction;









Re: Hangout on Monday

2013-08-05 Thread Peng Cheng

Strange, I didn't see any invitation.

On 13-08-05 06:54 PM, Ted Dunning wrote:

Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com wrote:


It is for both.

If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org wrote:


Is the link only for watching or also for participation? Never did a
hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com


Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang kuny...@stanford.edu

wrote:
what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au

wrote:

Nice, I'll be there.


On 13-08-03 02:51 PM, Andrew Musselman wrote:


Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 

ted.dunn...@gmail.com

wrote:

  Yes.  1600 PDT

I got that right in the linked doc, just not on the more important

email.




On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com


wrote:
On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The

timezone

is


PDT, correct?












Re: Hangout on Monday

2013-08-05 Thread Peng Cheng
So buggy; the program acts as if I'm in the meeting (showing a push-to-talk 
button), but it doesn't do anything.


On 13-08-05 08:02 PM, Ted Dunning wrote:

Hangouts clearly do not work the way I thought they did.  The URL that I
sent out was for the archived version of the meeting.


On Mon, Aug 5, 2013 at 5:00 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Strange, I didn't see any invitation.


On 13-08-05 06:54 PM, Ted Dunning wrote:


Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

  It is for both.

If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org
wrote:

  Is the link only for watching or also for participation? Never did a

hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com

  Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang kuny...@stanford.edu


wrote:
what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au


wrote:


Nice, I'll be there.


On 13-08-03 02:51 PM, Andrew Musselman wrote:

  Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 


ted.dunn...@gmail.com

  wrote:

   Yes.  1600 PDT


I got that right in the linked doc, just not on the more important


email.


On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

  wrote:

On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

   Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The


timezone

is

  PDT, correct?











Re: Hangout on Monday

2013-08-05 Thread Peng Cheng
Oh sorry, I figured out the problem: my Google+ account uses my Gmail address. 
I'll change that right away.


On 13-08-05 08:16 PM, Ted Dunning wrote:

Peng,

It looks like you are not actually on google plus.  I have you in my Mahout
circle under your iowa email address, but I am unable to add you to a
hangout.


On Mon, Aug 5, 2013 at 5:07 PM, Peng Cheng pc...@uowmail.edu.au wrote:


So buggy, the program act as i'm in the meeting (showing a push to talk
button), but it doesn't do anything.


On 13-08-05 08:02 PM, Ted Dunning wrote:


Hangouts clearly do not work the way I thought they did.  The URL that I
sent out was for the archived version of the meeting.


On Mon, Aug 5, 2013 at 5:00 PM, Peng Cheng pc...@uowmail.edu.au wrote:

  Strange, I didn't see any invitation.


On 13-08-05 06:54 PM, Ted Dunning wrote:

  Just sent invite to Mahout dev list.


On Mon, Aug 5, 2013 at 3:53 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

   It is for both.


If you have g+ installed you can participate.  If not, you can watch.



On Mon, Aug 5, 2013 at 3:51 PM, Sebastian Schelter s...@apache.org
wrote:

   Is the link only for watching or also for participation? Never did a


hangout before :)

2013/8/5 Andrew Musselman andrew.mussel...@gmail.com

   Can't make it alas


On Mon, Aug 5, 2013 at 3:12 PM, Michael Kun Yang 
kuny...@stanford.edu

  wrote:

what's the addr of the hangout?


On Sun, Aug 4, 2013 at 10:37 AM, Peng Cheng pc...@uowmail.edu.au

  wrote:

  Nice, I'll be there.

On 13-08-03 02:51 PM, Andrew Musselman wrote:

   Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning 

  ted.dunn...@gmail.com

   wrote:
Yes.  1600 PDT

  I got that right in the linked doc, just not on the more

important

  email.
On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

   wrote:


On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:

Let's have the hangout at 1600 on Monday, August 5th.
Maybe asking the obvious here so I apologize for the spam. The

  timezone

is
   PDT, correct?











Re: Hangout on Monday

2013-08-04 Thread Peng Cheng

Nice, I'll be there.

On 13-08-03 02:51 PM, Andrew Musselman wrote:

Sounds good


On Sat, Aug 3, 2013 at 12:04 AM, Ted Dunning ted.dunn...@gmail.com wrote:


Yes.  1600 PDT

I got that right in the linked doc, just not on the more important email.




On Fri, Aug 2, 2013 at 3:30 PM, Andrew Psaltis 
andrew.psal...@webtrends.com

wrote:
On 8/2/13 4:42 PM, Ted Dunning ted.dunn...@gmail.com wrote:


Let's have the hangout at 1600 on Monday, August 5th.

Maybe asking the obvious here so I apologize for the spam. The timezone

is

PDT, correct?







Re: [jira] [Created] (MAHOUT-1298) SparseRowMatrix,SparseColMatrix: optimize transpose()

2013-07-29 Thread Peng Cheng

+1, we have type conversion anyway.

On 29/07/2013 6:40 PM, Sebastian Schelter wrote:

+1

2013/7/29 Dmitriy Lyubimov (JIRA) j...@apache.org


Dmitriy Lyubimov created MAHOUT-1298:


  Summary: SparseRowMatrix,SparseColMatrix: optimize transpose()
  Key: MAHOUT-1298
  URL: https://issues.apache.org/jira/browse/MAHOUT-1298
  Project: Mahout
   Issue Type: New Feature
   Components: Math
 Affects Versions: 0.8
 Reporter: Dmitriy Lyubimov
 Assignee: Dmitriy Lyubimov
  Fix For: 0.9


these matrices lack an optimized transpose and rely on AbstractMatrix's
O(mn) implementation, which is not cool for very sparse subblocks.

the proposal is to implement a custom transpose with two things in mind:

1) the transpose of a row-sparse matrix should be a column-sparse matrix, and
vice versa (and not whatever the default like() implementation would produce);

2) obviously, iterate only through the non-zero elements of all
rows (columns).
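
A rough sketch of that proposal (assuming Mahout 0.8's Matrix.viewRow() and
Vector.iterateNonZero(); treat the exact class names and signatures as assumptions,
not the final patch): transpose a row-sparse matrix into a column-sparse one,
touching only the non-zero cells.

```java
import java.util.Iterator;
import org.apache.mahout.math.Matrix;
import org.apache.mahout.math.SparseColumnMatrix;
import org.apache.mahout.math.SparseRowMatrix;
import org.apache.mahout.math.Vector;

public class SparseTranspose {
  public static Matrix transpose(SparseRowMatrix m) {
    Matrix result = new SparseColumnMatrix(m.columnSize(), m.rowSize());
    for (int row = 0; row < m.rowSize(); row++) {
      Iterator<Vector.Element> it = m.viewRow(row).iterateNonZero();
      while (it.hasNext()) {
        Vector.Element e = it.next();
        result.set(e.index(), row, e.get());   // (column, row) in the transposed matrix
      }
    }
    return result;
  }
}
```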

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-23 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717659#comment-13717659
 ] 

Peng Cheng commented on MAHOUT-1286:


Aye aye, I just did; it turns out that instances of PreferenceArray$PreferenceView 
have taken 1.7G. Quite unexpected, right? Thanks a lot for the advice.
My next experiment will just use GenericPreference[] directly; there will be 
no more PreferenceArray.

Class Name                                                                       | Objects    | Shallow Heap  | Retained Heap
------------------------------------------------------------------------------------------------------------------------------
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView | 72,237,632 | 1,733,703,168 | 1,733,703,168
long[]                                                                           |    480,199 |   818,209,680 |   818,209,680
float[]                                                                          |    480,190 |   410,563,592 |   410,563,592
java.lang.Object[]                                                               |     18,230 |   361,525,488 | 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray                 |    480,189 |    15,366,048 | 1,237,456,672
java.util.ArrayList                                                              |     17,811 |       427,464 | 2,092,416,104
char[]                                                                           |      2,150 |       272,632 |       272,632
byte[]                                                                           |        141 |        54,048 |        54,048
java.lang.String                                                                 |      2,119 |        50,856 |       271,920
java.util.concurrent.ConcurrentHashMap$HashEntry                                 |        673 |        21,536 |        38,104
java.net.URL                                                                     |        229 |        14,656 |        40,720
java.util.HashMap$Entry                                                          |        344 |        11,008 |        68,760
------------------------------------------------------------------------------------------------------------------------------


 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-23 Thread Peng Cheng
That's exactly what I'm trying to do right now :) (I'm testing 
FastByIDArrayMap), but we probably have more problems than just the 
HashMap; based on the heap dump analysis, PreferenceArray will probably be 
our next target. This is awesome, since your FactorizablePreferences didn't 
use it in the first place.


Yours Peng

On 13-07-23 05:46 PM, Sebastian Schelter wrote:

IMHO you will always have memory issues if you try to provide constant time
random access. Thats why I proposed to created a special memory efficient
DataModel for sequential access.


2013/7/23 Peng Cheng (JIRA) j...@apache.org


 [
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717659#comment-13717659]

Peng Cheng commented on MAHOUT-1286:


Aye aye, I just did, turns out that instances of
PreferenceArray$PreferenceView has taken 1.7G. Quite unexpected right?
Thanks a lot for the advice.
My next experiment will just use GenericPreference [] directly, there will
be no more PreferenceArray.

Class Name
 |Objects |  Shallow Heap |Retained Heap

---
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray$PreferenceView|
72,237,632 | 1,733,703,168 | = 1,733,703,168
long[]
 |480,199 |   818,209,680 |   = 818,209,680
float[]
  |480,190 |   410,563,592 |   = 410,563,592
java.lang.Object[]
 | 18,230 |   361,525,488 | = 2,443,647,088
org.apache.mahout.cf.taste.impl.model.GenericUserPreferenceArray
 |480,189 |15,366,048 | = 1,237,456,672
java.util.ArrayList
  | 17,811 |   427,464 | = 2,092,416,104
char[]
 |  2,150 |   272,632 |   = 272,632
byte[]
 |141 |54,048 |= 54,048
java.lang.String
 |  2,119 |50,856 |   = 271,920
java.util.concurrent.ConcurrentHashMap$HashEntry
 |673 |21,536 |= 38,104
java.net.URL
 |229 |14,656 |= 40,720
java.util.HashMap$Entry
  |344 |11,008 |= 68,760

---



Memory-efficient DataModel, supporting fast online updates and

element-wise iteration
-

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

Most DataModel implementation in current CF component use hash map to

enable fast 2d indexing and update. This is not memory-efficient for big
data set. e.g. Netflix prize dataset takes 11G heap space as a
FileDataModel.

Improved implementation of DataModel should use more compact data

structure (like arrays), this can trade a little of time complexity in 2d
indexing for vast improvement in memory efficiency. In addition, any online
recommender or online-to-batch converted recommender will not be affected
by this in training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715885#comment-13715885
 ] 

Peng Cheng commented on MAHOUT-1286:


On second thought, the hash map is very likely not the culprit for poor memory 
efficiency here; apologies for the misinformation. The double hashing algorithm in 
FastByIDMap, as described in Don Knuth's book 'The Art of Computer Programming', has a 
default loadFactor of 1.5, which means the size of the array is only 1.5 times the 
number of keys. So theoretically the heap size of GenericDataModel should never exceed 
3 times the size of FactorizablePreferences. I'm still very unclear about FastByIDMap's 
implementation, like how it handles deletion of entries, so I cannot tell whether my 
observation on Netflix is caused by GC (e.g. constructing new arrays too often), by 
deletion, or by the extra space allocated for timestamps. We probably have to run 
Netflix in debug mode to identify the problem.

I'll try to bring this topic up in the next hangout. Please give me some hints if you 
are an expert in those FastMap implementations.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng commented on MAHOUT-1286:


On the other hand, I tried to solve the problem by implementing FastByIDArrayMap, a 
slightly more compact Map implementation than FastByIDMap. It uses binary 
search to arrange all entries into a tight array, so its worst-case time 
complexity for get, put and delete is O(log n) (much slower than double hashing's 
average O(1)), but it has a (marginally) smaller memory footprint and faster 
iteration. It has no problem passing all unit tests, but its real performance 
can only be shown when embedded in FileDataModel. I'll post the results shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing the memory footprint to 0.66 times isn't 
worth the speed loss.
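
A rough sketch of the idea (a hypothetical simplification; the actual FastByIDArrayMap in the patch also has to handle growth and deletion):

```java
// Simplified sorted-array map: keys kept sorted, lookups via binary search (O(log n)),
// inserts shift the tail of the arrays. A real implementation would grow the arrays in
// amortized chunks and support deletion; this is just the core idea, not the patch code.
import java.util.Arrays;

final class SortedArrayMapSketch {

  private long[] keys = new long[0];
  private Object[] values = new Object[0];

  Object get(long key) {
    int pos = Arrays.binarySearch(keys, key);
    return pos >= 0 ? values[pos] : null;
  }

  void put(long key, Object value) {
    int pos = Arrays.binarySearch(keys, key);
    if (pos >= 0) {
      values[pos] = value;            // overwrite the existing entry
      return;
    }
    int insertAt = -pos - 1;          // insertion point, per the binarySearch contract
    long[] newKeys = new long[keys.length + 1];
    Object[] newValues = new Object[values.length + 1];
    System.arraycopy(keys, 0, newKeys, 0, insertAt);
    System.arraycopy(values, 0, newValues, 0, insertAt);
    newKeys[insertAt] = key;
    newValues[insertAt] = value;
    System.arraycopy(keys, insertAt, newKeys, insertAt + 1, keys.length - insertAt);
    System.arraycopy(values, insertAt, newValues, insertAt + 1, values.length - insertAt);
    keys = newKeys;
    values = newValues;
  }
}
```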

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715906#comment-13715906
 ] 

Peng Cheng edited comment on MAHOUT-1286 at 7/23/13 12:26 AM:
--

In another hand, I try to solve the problem by implementing FastByIDArrayMap, a 
slightly more compact Map implementation than FastByIDMap, it uses binary 
search to arrange all entries into a tight array, so its worst-case time 
complexity for get, put and delete is log( n ) (much slower than double 
hashing's average O(1)). But has a (marginally) smaller memory footprint and 
faster iteration. It has no problem passing all unit tests. But its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing memory footage to 0.66 times doesn't 
worth the speed loss.

  was (Author: peng):
In another hand, I try to solve the problem by implementing 
FastByIDArrayMap, a slightly more compact Map implementation than FastByIDMap, 
it uses binary search to arrange all entries into a tight array, so its 
worst-case time complexity for get, put and delete is log(n) (much slower than 
double hashing's average O(1)). But has a (marginally) smaller memory footprint 
and faster iteration. It has no problem passing all unit tests. But its real 
performance can only be shown when embedded in FileDataModel. I'll post the 
result shortly.

However, I don't feel this is the right direction. If Sean Owen did everything 
right in his FastByIDMap, then reducing memory footage to 0.66 times doesn't 
worth the speed loss.
  
 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-22 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13715912#comment-13715912
 ] 

Peng Cheng commented on MAHOUT-1286:


Hi Sebastian, Gokhan, how do you feel about the cause of the memory efficiency 
problem? Do you think we should talk privately? I'm also interested in your 
experimentation results.

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Regarding Online Recommenders

2013-07-19 Thread Peng Cheng

Hi,

Just one simple question: Is the 
org.apache.mahout.math.BinarySearch.binarySearch() function an optimized 
version of Arrays.binarySearch()? If it is not, why implement it again?


Yours Peng

On 13-07-17 06:31 PM, Sebastian Schelter wrote:

You are completely right, the simple interface would only be usable for
readonly / batch-updatable recommenders. Online recommenders might need
something different. I tried to widen the discussion here to discuss all
kinds of API changes in the recommenders that would be necessary in the
future.



2013/7/17 Peng Cheng pc...@uowmail.edu.au


One thing that suddenly comes to my mind is that, for a simple interface
like FactorizablePreferences, maybe sequential READ in real time is
possible, but sequential WRITE in O(1) time is Utopia. Because you need to
flush out old preference with same user and item ID (in worst case it could
be an interpolation search), otherwise you are permitting a user rating an
item twice with different values. Considering how FileDataModel suppose to
work (new files flush old files), maybe using the simple interface has less
advantages than we used to believe.


On 13-07-17 04:58 PM, Sebastian Schelter wrote:


Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access
(and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au

  I see, OK so we shouldn't use the old implementation. But I mean, the old

interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an
interpolation
search that takes O(log log n) time in average. So does the online
update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:

  I completely agree, Netflix is less than one gigabye in a smart

representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient
representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

   Netflix is a small dataset.  12G for that seems quite excessive.


Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

   The second idea is indeed splendid, we should separate
time-complexity


first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

  laptop

  can already handle that (emphasis on laptop). And if we replace hash

map
(the culprit of high memory consumption) with list/linkedList, it
would
simply degrade time complexity for a linear search to O(n), not too
bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead
of
subverting it.









Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
For a low-rank matrix-factorization-based recommender, a new preference 
is not stored as itself, but as a dot product of two vectors in the low-dimensional 
space, so it needs no projection. The user and item vectors, however, may 
need to be projected into a lower-dimensional space, if and only if you 
want to reduce the rank of the preference matrix. The refactorization 
step in SGD is super fast--that's the charm of SGD. So, yes, we will 
refactorize on every update.
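
To make the per-update refactorization concrete, here is a minimal sketch of a single-preference SGD step (the standard regularized update, not the exact code of the factorizer under discussion):

```java
// One SGD step for a single observed (user, item, rating) triple: the prediction is the
// dot product of the two factor vectors, and both vectors are nudged along the gradient
// of the squared error with L2 regularization. Textbook update, not the factorizer's code.
final class SgdStepSketch {
  static void sgdStep(double[] userFactors, double[] itemFactors,
                      float rating, double learningRate, double lambda) {
    double prediction = 0.0;
    for (int k = 0; k < userFactors.length; k++) {
      prediction += userFactors[k] * itemFactors[k];
    }
    double err = rating - prediction;
    for (int k = 0; k < userFactors.length; k++) {
      double u = userFactors[k];
      double v = itemFactors[k];
      userFactors[k] = u + learningRate * (err * v - lambda * u);
      itemFactors[k] = v + learningRate * (err * u - lambda * v);
    }
  }
}
```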


Yours Peng

On 13-07-18 11:34 AM, Pat Ferrel wrote:

On Jul 17, 2013, at 1:19 PM, Gokhan Capan gkhn...@gmail.com wrote:

Hi Pat, please see my response inline.

Best,
Gokhan


On Wed, Jul 17, 2013 at 8:23 PM, Pat Ferrel pat.fer...@gmail.com wrote:


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled?


If you are referring to the recommender of discussion here, no, updating
the model can be done with a single preference, using stochastic gradient
descent, by updating the particular user and item factors simultaneously.


Aren't there two different things needed to truly update the model: 1) add the 
new preference to the lower dimensional space 2) refactorize the all 
preferences. #2 only needs to be done periodically--afaik. #1 would be super 
fast and could be done at runtime.  Am I wrong or are you planning to 
incrementally refactorize the entire preference array with every new preference?






Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
If I remember right, a highlight of the 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in an item-based recommender, but 
this is definitely something I would like to pursue. It's probably the only 
advantage a non-hadoop implementation can offer in the future.


Many non-hadoop recommenders are pretty fast. But the existing in-memory 
GenericDataModel and FileDataModel are largely implemented for 
sandboxes; IMHO they are the culprit of the scalability problem.


May I ask about the scale of your dataset? How many ratings does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model. This could be history from a new user asked
to pick a few items to start the rec process, or an old user with some

new

action history not yet in the model. Are you going to allow for passing

the

entire history vector or userID+incremental new history to the

recommender?

I hope so.

For what it's worth we did a comparison of Mahout Item based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months

of

data. The data was purchase data from a diverse ecom source with a large
variety of products from electronics to clothes. We found Item based CF

did

far better than ALS. As we increased the number of latent factors the
results got better but were never within 10% of item based (we used MAP

as

the offline metric). Not sure why but maybe it has to do with the

diversity

of the item types.

I understand that a full item based online recommender has very different
tradeoffs and anyway others may not have seen this disparity of results.
Furthermore we don't have A/B test results yet to validate the offline
metric.

On Jul 16, 2013, at 2:41 PM, Gokhan Capan gkhn...@gmail.com wrote:

Peng,

This is the reason I separated out the DataModel, and only put the

learner

stuff there. The learner I

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
Strange, it's just a little bit larger than the libimseti dataset (17m 
ratings). Did you encounter an OutOfMemory or GCTimeOut exception? 
Allocating more heap space usually helps.


Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in item-based recommender, but this 
is definitely I would like to pursue. It's probably the only advantage a 
non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO 
they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model. This could be history from a new user asked
to pick a few items to start the rec process, or an old user with some

new

action history not yet in the model. Are you going to allow for passing

the

entire history vector or userID+incremental new history to the

recommender?

I hope so.

For what it's worth we did a comparison of Mahout Item based CF to Mahout
ALS-WR CF on 2.5M users and 500K items with many M actions over 6 months

of

data. The data was purchase data from a diverse ecom source with a large
variety of products from electronics to clothes. We found Item based CF

did

far better than ALS. As we increased the number of latent factors the
results got better but were never within 10% of item based (we used MAP

as

the offline metric). Not sure why but maybe it has to do with the

diversity

of the item types.

I understand that a full

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng
I see, sorry I was too presumptuous. I only recently worked on and tested 
SVDRecommender, so I couldn't have known how efficient it is compared to an 
item-based recommender. Maybe there is room for algorithmic optimization.


The online recommender Gokhan is working on is also an SVDRecommender. 
An online user-based or item-based recommender based on a clustering 
technique would definitely be critical, but we need an expert to 
volunteer :)


Perhaps Dr Dunning can say a few words? He announced the online 
clustering component.


Yours Peng

On 13-07-18 03:54 PM, Pat Ferrel wrote:

No it was CPU bound not memory. I gave it something like 14G heap. It was 
running, just too slow to be of any real use. We switched to the hadoop version 
and stored precalculated recs in a db for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

Strange, its just a little bit larger than limibseti dataset (17m ratings), did 
you encountered an outOfMemory or GCTimeOut exception? Allocating more heap 
space usually help.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:

It was about 2.5M users and 500K items with 25M actions over 6 months of data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering 
algorithm. I'm not sure if it can be used in item-based recommender, but this 
is definitely I would like to pursue. It's probably the only advantage a 
non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory 
GenericDataModel and FileDataModel are largely implemented for sandboxes, IMHO 
they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:

Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender, usually
this happens after a retraining in batch. You have to care for cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com


Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit of
tinkering and doesn't have quite the same set of options--no llr similarity
for instance.

On the same subject I recently attended a workshop in Seattle for UAI2013
where Walmart reported similar results using a factorized recommender. They
had to increase the factor number past where it would perform well. Along
the way they saw increasing performance measuring precision offline. They
eventually gave up on a factorized solution. This decision seems odd but
anyway… In the case of Walmart and our data set they are quite diverse. The
best idea is probably to create different recommenders for separate parts
of the catalog but if you create one model on all items our intuition is
that item-based works better than factorized. Again caveat--no A/B tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and it
could not really handle more than a down-sampled few months of our data.
Down-sampling lost us 20% of our precision scores so we moved to the hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org wrote:

Hi Pat

I think we should provide a simple support for recommending to anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com


May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How are the updates handled? Do

you

plan to require batch model refactorization for any update? Or perform

some

partial update by maybe just transforming new data into the LF space
already in place then doing full refactorization every so often in batch
mode?

By 'anonymous users' I mean users with some history that is not yet
incorporated in the LF model

Re: Regarding Online Recommenders

2013-07-18 Thread Peng Cheng

Wow, that's lightning fast.

Is it a SparseMatrix or DenseMatrix?

On 13-07-18 07:23 PM, Gokhan Capan wrote:

I just started to implement a Matrix backed data model and pushed it, to
check the performance and memory considerations.

I believe I can try it on some data tomorrow.

Best

Gokhan


On Thu, Jul 18, 2013 at 11:05 PM, Peng Cheng pc...@uowmail.edu.au wrote:


I see, sorry I was too presumptuous. I only recently worked and tested
SVDRecommender, never could have known its efficiency using an item-based
recommender. Maybe there is space for algorithmic optimization.

The online recommender Gokhan is working on is also an SVDRecommender. An
online user-based or item-based recommender based on clustering technique
would definitely be critical, but we need an expert to volunteer :)

Perhaps Dr Dunning can have a few words? He announced the online
clustering component.

Yours Peng


On 13-07-18 03:54 PM, Pat Ferrel wrote:


No it was CPU bound not memory. I gave it something like 14G heap. It was
running, just too slow to be of any real use. We switched to the hadoop
version and stored precalculated recs in a db for every user.

On Jul 18, 2013, at 12:06 PM, Peng Cheng pc...@uowmail.edu.au wrote:

Strange, its just a little bit larger than limibseti dataset (17m
ratings), did you encountered an outOfMemory or GCTimeOut exception?
Allocating more heap space usually help.

Yours Peng

On 13-07-18 02:27 PM, Pat Ferrel wrote:


It was about 2.5M users and 500K items with 25M actions over 6 months of
data.

On Jul 18, 2013, at 10:15 AM, Peng Cheng pc...@uowmail.edu.au wrote:

If I remember right, a highlight of 0.8 release is an online clustering
algorithm. I'm not sure if it can be used in item-based recommender, but
this is definitely I would like to pursue. It's probably the only advantage
a non-hadoop implementation can offer in the future.

Many non-hadoop recommenders are pretty fast. But existing in-memory
GenericDataModel and FileDataModel are largely implemented for sandboxes,
IMHO they are the culprit of scalability problem.

May I ask about the scale of your dataset? how many rating does it have?

Yours Peng

On 13-07-18 12:14 PM, Sebastian Schelter wrote:


Well, with itembased the only problem is new items. New users can
immediately be served by the model (although this is not well supported
by
the API in Mahout). For the majority of usecases I saw, it is perfectly
fine to have a short delay until new items enter the recommender,
usually
this happens after a retraining in batch. You have to care for
cold-start
and collect some interactions anyway.


2013/7/18 Pat Ferrel pat.fer...@gmail.com

  Yes, what Myrrix does is good.

My last aside was a wish for an item-based online recommender not only
factorized. Ted talks about using Solr for this, which we're
experimenting
with alongside Myrrix. I suspect Solr works but it does require a bit
of
tinkering and doesn't have quite the same set of options--no llr
similarity
for instance.

On the same subject I recently attended a workshop in Seattle for
UAI2013
where Walmart reported similar results using a factorized recommender.
They
had to increase the factor number past where it would perform well.
Along
the way they saw increasing performance measuring precision offline.
They
eventually gave up on a factorized solution. This decision seems odd
but
anyway… In the case of Walmart and our data set they are quite
diverse. The
best idea is probably to create different recommenders for separate
parts
of the catalog but if you create one model on all items our intuition
is
that item-based works better than factorized. Again caveat--no A/B
tests to
support this yet.

Doing an online item-based recommender would quickly run into scaling
problems, no? We put together the simple Mahout in-memory version and
it
could not really handle more than a down-sampled few months of our
data.
Down-sampling lost us 20% of our precision scores so we moved to the
hadoop
version. Now we have use-cases for an online recommender that handles
anonymous new users and that takes the story full circle.

On Jul 17, 2013, at 1:28 PM, Sebastian Schelter s...@apache.org
wrote:

Hi Pat

I think we should provide a simple support for recommending to
anonymous
users. We should have a method recommendToAnonymous() that takes a
PreferenceArray as argument. For itembased recommenders, its
straightforward to compute recommendations, for userbased you have to
search through all users once, for latent factor models, you have to
fold
the user vector into the low dimensional space.

I think Sean already added this method in myrrix and I have some code
for
my kornakapi project (a simple weblayer for mahout).

Would such a method fit your needs?

Best,
Sebastian



2013/7/17 Pat Ferrel pat.fer...@gmail.com

  May I ask how you plan to support model updates and 'anonymous' users?

I assume the latent factors model is calculated offline still in batch
mode, then there are periodic updates? How

Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
I see, OK so we shouldn't use the old implementation. But I mean, the 
old interface doesn't have to be discarded. The discrepancy between your 
FactorizablePreferences and DataModel is that your model supports 
getPreferences(), which returns all preferences as an iterator, while 
DataModel supports a few old functions that return preferences for an 
individual user or item.


My point is that it is not hard for each of them to implement what they 
lack: the old DataModel can implement getPreferences() just by a loop 
in the abstract class, and your new FactorizablePreferences can implement those 
old functions by a binary search that takes O(log n) time, or an 
interpolation search that takes O(log log n) time on average. The same goes for 
the online update. It would just be a matter of different speed and 
space, not of different interface standards; we could use the old unit tests, 
old examples, old everything. And we would be more flexible in writing 
ensemble recommenders.


Just a few thoughts, I'll have to validate the idea first before 
creating a new JIRA ticket.
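
For illustration, a minimal sketch of the "loop in the abstract class" direction, using the existing Taste DataModel API to provide sequential access over all preferences:

```java
// Sketch: sequential iteration over all preferences, derived from the existing
// random-access DataModel API (getUserIDs / getPreferencesFromUser). A memory-efficient
// sequential interface could be satisfied by a default loop like this one.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;

final class SequentialAccessSketch {
  static void forEachPreference(DataModel model) throws TasteException {
    LongPrimitiveIterator userIDs = model.getUserIDs();
    while (userIDs.hasNext()) {
      long userID = userIDs.nextLong();
      for (Preference pref : model.getPreferencesFromUser(userID)) {
        // consume (pref.getUserID(), pref.getItemID(), pref.getValue()) here
      }
    }
  }
}
```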


Yours Peng


On 13-07-16 02:51 PM, Sebastian Schelter wrote:

I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com


Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au wrote:


The second idea is indeed splendid, we should separate time-complexity
first and space-complexity first implementation. What I'm not quite sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

laptop

can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.





Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
Mmm, you are right, the simplest solution is usually the best. I'm 
creating a new JIRA ticket.


Yours Peng

On 13-07-17 04:58 PM, Sebastian Schelter wrote:

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au


I see, OK so we shouldn't use the old implementation. But I mean, the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an interpolation
search that takes O(log log n) time in average. So does the online update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

  The second idea is indeed splendid, we should separate time-complexity

first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new


laptop


can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.








Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng

Awesome! Your reinforcements are highly appreciated.

On 13-07-17 01:29 AM, Abhishek Sharma wrote:

Sorry to interrupt guys, but I just wanted to bring it to your notice that
I am also interested in contributing to this idea. I am planning to
participate in ASF-ICFOSS mentor-ship
programmehttps://cwiki.apache.org/confluence/display/COMDEV/ASF-ICFOSS+Pilot+Mentoring+Programme.
(this is very similar to GSOC)

I do have strong concepts in machine learning (have done the ML course by
Andrew NG on coursera) also, I am good in programming (have 2.5 yrs of work
experience). I am not really sure of how can I approach this problem (but I
do have a strong interest to work on this problem) hence would like to pair
up on this. I am currently working as a research intern at Indian Institute
of Science (IISc), Bangalore India and can put up 15-20 hrs per week.

Please let me know your thoughts if I can be a part of this.

Thanks  Regards,
Abhishek Sharma
http://www.linkedin.com/in/abhi21
https://github.com/abhi21


On Wed, Jul 17, 2013 at 3:11 AM, Gokhan Capan gkhn...@gmail.com wrote:


Peng,

This is the reason I separated out the DataModel, and only put the learner
stuff there. The learner I mentioned yesterday just stores the
parameters, (noOfUsers+noOfItems)*noOfLatentFactors, and does not care
where preferences are stored.

I, kind of, agree with the multi-level DataModel approach:
One for iterating over all preferences, one for if one wants to deploy a
recommender and perform a lot of top-N recommendation tasks.

(Or one DataModel with a strategy that might reduce existing memory
consumption, while still providing fast access, I am not sure. Let me try a
matrix-backed DataModel approach)

Gokhan


On Tue, Jul 16, 2013 at 9:51 PM, Sebastian Schelter s...@apache.org
wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient

representation,

tested on KDD Music dataset which is approx 2.5 times Netflix and fits

into

3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com


Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au

wrote:

The second idea is indeed splendid, we should separate

time-complexity

first and space-complexity first implementation. What I'm not quite

sure,

is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new

laptop

can already handle that (emphasis on laptop). And if we replace hash

map

(the culprit of high memory consumption) with list/linkedList, it

would

simply degrade time complexity for a linear search to O(n), not too

bad

either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead

of

subverting it.








[jira] [Created] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and iteration

2013-07-17 Thread Peng Cheng (JIRA)
Peng Cheng created MAHOUT-1286:
--

 Summary: Memory-efficient DataModel, supporting fast online 
updates and iteration
 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen


Most DataModel implementation in current CF component use hash map to enable 
fast 2d indexing and update. This is not memory-efficient for big data set. 
e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.

Improved implementation of DataModel should use more compact data structure 
(like arrays), this can trade a little of time complexity in 2d indexing for 
vast improvement in memory efficiency. In addition, any online recommender or 
online-to-batch converted recommender will not be affected by this in training 
process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1286) Memory-efficient DataModel, supporting fast online updates and element-wise iteration

2013-07-17 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1286:
---

Summary: Memory-efficient DataModel, supporting fast online updates and 
element-wise iteration  (was: Memory-efficient DataModel, supporting fast 
online updates and iteration)

 Memory-efficient DataModel, supporting fast online updates and element-wise 
 iteration
 -

 Key: MAHOUT-1286
 URL: https://issues.apache.org/jira/browse/MAHOUT-1286
 Project: Mahout
  Issue Type: Improvement
  Components: Collaborative Filtering
Affects Versions: 0.9
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 Most DataModel implementation in current CF component use hash map to enable 
 fast 2d indexing and update. This is not memory-efficient for big data set. 
 e.g. Netflix prize dataset takes 11G heap space as a FileDataModel.
 Improved implementation of DataModel should use more compact data structure 
 (like arrays), this can trade a little of time complexity in 2d indexing for 
 vast improvement in memory efficiency. In addition, any online recommender or 
 online-to-batch converted recommender will not be affected by this in 
 training process.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Regarding Online Recommenders

2013-07-17 Thread Peng Cheng
One thing that suddenly comes to my mind is that, for a simple interface 
like FactorizablePreferences, maybe sequential READ in real time is 
possible, but sequential WRITE in O(1) time is utopian. You would need 
to flush out the old preference with the same user and item ID (in the worst case 
this requires a search, e.g. an interpolation search); otherwise you are permitting 
a user to rate an item twice with different values. Considering how 
FileDataModel is supposed to work (new files flush old files), maybe using 
the simple interface has fewer advantages than we used to believe.
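
For concreteness, here is a sketch of the interpolation search over sorted IDs I have in mind (average O(log log n) on roughly uniformly distributed IDs; illustrative only, not patch code):

```java
// Interpolation search over a sorted long[]: the probe position is estimated from the
// key's value relative to the current range, giving ~O(log log n) average on roughly
// uniformly distributed IDs (worst case O(n)). Illustrative sketch only.
final class InterpolationSearchSketch {
  static int indexOf(long[] sortedIDs, long target) {
    int low = 0;
    int high = sortedIDs.length - 1;
    while (low <= high && target >= sortedIDs[low] && target <= sortedIDs[high]) {
      if (sortedIDs[high] == sortedIDs[low]) {
        break;  // remaining range is constant; avoid division by zero
      }
      int pos = low + (int) ((double) (target - sortedIDs[low])
          / (sortedIDs[high] - sortedIDs[low]) * (high - low));
      if (sortedIDs[pos] == target) {
        return pos;
      } else if (sortedIDs[pos] < target) {
        low = pos + 1;
      } else {
        high = pos - 1;
      }
    }
    return (low <= high && sortedIDs[low] == target) ? low : -1;
  }
}
```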


On 13-07-17 04:58 PM, Sebastian Schelter wrote:

Hi Peng,

I never wanted to discard the old interface, I just wanted to split it up.
I want to have a simple interface that only supports sequential access (and
allows for very memory efficient implementions, e.g. by the use of
primitive arrays). DataModel should *extend* this interface and provide
sequential and random access (basically what is already does).

Than a recommender such as SGD could state that it only needs sequential
access to the preferences and you can either feed it a DataModel (so we
dont break backwards compatibility) or a memory efficient sequential
access thingy.

Does that make sense for you?


2013/7/17 Peng Cheng pc...@uowmail.edu.au


I see, OK so we shouldn't use the old implementation. But I mean, the old
interface doesn't have to be discarded. The discrepancy between your
FactorizablePreferences and DataModel is that, your model supports
getPreferences(), which returns all preferences as an iterator, and
DataModel supports a few old functions that returns preferences for an
individual user or item.

My point is that, it is not hard for each of them to implement what they
lack of: old DataModel can implement getPreferences() just by a a loop in
abstract class. Your new FactorizablePreferences can implement those old
functions by a binary search that takes O(log n) time, or an interpolation
search that takes O(log log n) time in average. So does the online update.
It will just be a matter of different speed and space, but not different
interface standard, we can use old unit tests, old examples, old
everything. And we will be more flexible in writing ensemble recommender.

Just a few thoughts, I'll have to validate the idea first before creating
a new JIRA ticket.

Yours Peng



On 13-07-16 02:51 PM, Sebastian Schelter wrote:


I completely agree, Netflix is less than one gigabye in a smart
representation, 12x more memory is a nogo. The techniques used in
FactorizablePreferences allow a much more memory efficient representation,
tested on KDD Music dataset which is approx 2.5 times Netflix and fits
into
3GB with that approach.


2013/7/16 Ted Dunning ted.dunn...@gmail.com

  Netflix is a small dataset.  12G for that seems quite excessive.

Note also that this is before you have done any work.

Ideally, 100million observations should take  1GB.

On Tue, Jul 16, 2013 at 8:19 AM, Peng Cheng pc...@uowmail.edu.au
wrote:

  The second idea is indeed splendid, we should separate time-complexity

first and space-complexity first implementation. What I'm not quite
sure,
is that if we really need to create two interfaces instead of one.
Personally, I think 12G heap space is not that high right? Most new


laptop


can already handle that (emphasis on laptop). And if we replace hash map
(the culprit of high memory consumption) with list/linkedList, it would
simply degrade time complexity for a linear search to O(n), not too bad
either. The current DataModel is a result of careful thoughts and has
underwent extensive test, it is easier to expand on top of it instead of
subverting it.








Re: Regarding Online Recommenders

2013-07-16 Thread Peng Cheng
Yeah, setPreference() and removePreference() shouldn't be there, but 
injecting the Recommender back into the DataModel is kind of a strong dependency, 
which may intermingle components with different concerns. Maybe we can do 
something to the RefreshHelper class? E.g. push something into a swap field 
so the downstream of a refreshable chain can read it out. I have read 
Gokhan's UpdateAwareDataModel, and feel that it's probably too 
heavyweight for a model selector, as every time he changes the algorithm 
he has to re-register it.


The second idea is indeed splendid; we should separate time-complexity-first 
and space-complexity-first implementations. What I'm not quite 
sure about is whether we really need to create two interfaces instead of one. 
Personally, I think 12G heap space is not that high, right? Most new 
laptops can already handle that (emphasis on laptop). And if we replace the 
hash map (the culprit of high memory consumption) with a list/linkedList, it 
would simply degrade the time complexity of a lookup to a linear search, O(n), not 
too bad either. The current DataModel is a result of careful thought and 
has undergone extensive testing; it is easier to expand on top of it 
instead of subverting it.


All the best,
Yours Peng

On 13-07-16 01:05 AM, Sebastian Schelter wrote:

Hi Gokhan,

I like your proposals and I think this is an important discussion. Peng
is also interested in working on online recommenders, so we should try
to team up our efforts. I'd like to extend the discussion a little to
related API changes, that I think are necessary.

What do you think about completely removing the setPreference() and
removePreference() methods from Recommender? I think they don't belong
there for two reasons: First,  they duplicate functionality from
DataModel and second, a lot of recommenders are read-only/train-once and
cannot handle single preference updates anyway.

I think we should have a DataModel implementation that can be updated
and an online learning recommender should be able to register to be
notified with updates.

We should further more split up the DataModel interface into a hierarchy
of three parts:

First, a simple readonly interface that allows sequential access to the
data (similar to FactorizablePreferences). This allows us to create
memory efficient implementations. E.g. Cheng reported in MAHOUT-1272
that the current DataModel needs 12GB heap for the Netflix dataset (100M
ratings) which is unacceptable. I was able to fit the KDD Music dataset
(250M ratings) into 3GB with FactorizablePreferences.

The second interface would extend the readonly interface and should
resemble what DataModel is today: An easy-to-use in-memory
implementation that trades high memory consumption for convenient random
access.

And finally the third interface would extend the second and provide
tooling for online updates of the data.

What do you think of that? Does it sound reasonable?

--sebastian



The DataModel I imagine would follow the current API, where underlying
preference storage is replaced with a matrix.

A Recommender would then use the DataModel and the OnlineLearner, where
Recommender#setPreference is delegated to DataModel#setPreference (like it
does now), and DataModel#setPreference triggers OnlineLearner#train.
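
To visualize the three-level split Sebastian describes above, here is a purely hypothetical sketch (interface and method names are illustrative, not an agreed API):

```java
// Hypothetical sketch of the proposed hierarchy: sequential read-only access at the
// bottom, today's random-access DataModel style in the middle, online updates on top.
// Names are illustrative only.
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.Preference;

interface SequentialPreferences {
  Iterable<Preference> getPreferences() throws TasteException;  // one pass over all ratings
  int getNumUsers() throws TasteException;
  int getNumItems() throws TasteException;
}

interface RandomAccessPreferences extends SequentialPreferences {
  // roughly what org.apache.mahout.cf.taste.model.DataModel offers today
  Iterable<Preference> getPreferencesFromUser(long userID) throws TasteException;
  Iterable<Preference> getPreferencesForItem(long itemID) throws TasteException;
}

interface UpdatablePreferences extends RandomAccessPreferences {
  void setPreference(long userID, long itemID, float value) throws TasteException;
  void removePreference(long userID, long itemID) throws TasteException;
}
```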









[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng commented on MAHOUT-1272:


Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti 
is a Czech dating website.
This dataset has been used in a live example described in the book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (should be noted the number of ALS 
iteration is much smaller than others, which leads to suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the original book claims a best 
result of 1.12 on this dataset, which I never achieved. If you have also 
experimented and found a better parameter set, please post it here.
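
For anyone who wants to reproduce the setup, here is a rough sketch of how such an evaluation run is wired up. The attached *SVDRecomenderEvaluatorRunner.java files are the authoritative versions; the data file path below is a placeholder, and the SGD runs are built the same way by swapping in the respective factorizer from the patch.

```java
// Rough shape of the evaluator run behind the ALSWR number above; not the attached runner.
import java.io.File;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class LibimsetiEvalSketch {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv"));  // placeholder path
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder alswr = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        // rank = 16, lambda = 0.1, 5 ALS iterations, as listed above
        return new SVDRecommender(dataModel, new ALSWRFactorizer(dataModel, 16, 0.1, 5));
      }
    };

    // train on 90% of each user's ratings, evaluate on the rest, using all users
    double score = evaluator.evaluate(alswr, null, model, 0.9, 1.0);
    System.out.println("ALSWRFactorizer average absolute difference: " + score);
  }
}
```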


 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: libimsetiSVDRecomenderEvaluatorRunner.java

Here is the component for testing on the libimseti dataset.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707830#comment-13707830
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/13/13 8:57 PM:
-

Test on the libimseti dataset (http://www.occamslab.com/petricek/data/); libimseti 
is a Czech dating website.
This dataset has been used in a live example described in the book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

(for ratingSGD)
  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

(for parallelSGD)
  double mu0=1;
  double decayFactor=1;
  int stepOffset=100;
  double forgettingExponent=-1;
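
(For context, these four parallelSGD knobs describe a step-size decay schedule. The 
sketch below is my reading of the parameter names, not necessarily the exact formula in 
the committed factorizer; with decayFactor = 1 and forgettingExponent = -1 it reduces to 
mu0 / (stepOffset + step).)

  // Assumed schedule: per-epoch multiplicative decay times (stepOffset + step)^forgettingExponent.
  static double learningRate(double mu0, double decayFactor, int stepOffset,
                             double forgettingExponent, int epoch, long step) {
    return mu0
        * Math.pow(decayFactor, epoch - 1)
        * Math.pow(stepOffset + step, forgettingExponent);
  }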

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (it should be noted that the number of ALS 
iterations is much smaller than for the others, which leads to a suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get; the original book claims a best 
result of 1.12 on this dataset, which I have never achieved. If you have also 
experimented and found a better parameter set, please post it here.

  was (Author: peng):
Test on libimseti dataset (http://www.occamslab.com/petricek/data/), 
libimseti is a czech dating website.
This dataset has been used in a live example described in book 'Mahout in 
Action', page 71, written by a few guys hanging around this site.

parameters:
  private final static double lambda = 0.1;
  private final static int rank = 16;
  
  private static int numALSIterations=5;
  private static int numEpochs=20;

  double randomNoise=0.02;
  double learningRate=0.01;
  double learningDecayRate=1;

result (using average absolute difference, the rating is based on a 1-10 scale):

INFO: ==Recommender With ALSWRFactorizer: 1.5623366369454739 
time spent: 41.24s=== (should be noted the number of ALS 
iteration is much smaller than others, which leads to suboptimal result, but 
this is not the point of this test)
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 1.28022379922957 
time spent: 118.188s===
Jul 13, 2013 4:39:34 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
1.2798905733917445 time spent: 21.806s

This is already the best result I can get, the original book claims a best 
result of 1.12 on this dataset, which I never achieve. If you have also 
experimented and find a better parameter set, please post here.

  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent

[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-13 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: NetflixRecomenderEvaluatorRunner.java

Runnable component for testing ParallelSGDFactorizer on the Netflix training 
dataset (yeah, only the trainingSet generated by NetflixDatasetConverter; I 
cannot get judging.txt for validation, but my purpose is just to test its 
efficiency at extreme scale, so whatever).

Warning! To run it safely you need to allocate at least 12G of heap 
space to the JVM by using the following VM parameters:

-Xms12288M -Xmx12288M

In addition, 16G+ RAM is MANDATORY, otherwise either garbage collection or swap 
will kill you (or both). I almost burned my laptop on this (it has only 8G 
RAM). As a result, I won't be able to post any results before I can get a better 
machine. But since its number of ratings is about 6 times that of the 
movielens-10m or libimseti datasets, and SGD scales linearly in this number, I 
estimate the running time to be between 2.5 and 3 minutes.

I will be most obliged to anybody who can try it and post the result here (of 
course, if your machine can handle it). But obviously, as Sebastian has pointed 
out, our FileDataModel needs some serious optimization to handle such scale.

Hey Sebastian, can you try this out in your lab? That would be most helpful.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 libimsetiSVDRecomenderEvaluatorRunner.java, mahout.patch, 
 NetflixRecomenderEvaluatorRunner.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizer.java, ParallelSGDFactorizerTest.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707065#comment-13707065
 ] 

Peng Cheng commented on MAHOUT-1274:


Main component finished. The new factorizer and recommender can support adding 
new users and items, and can update user/item vectors in a single GD step (this is 
very suboptimal, but I will improve this part very soon).

But I don't know how to test it: the sandbox GenericDataModel doesn't support 
setPreference(...) and removePreference(...) yet (SlopeOneRecommenderTest 
doesn't test this part either). Could someone tell me if there is an 
alternative that avoids this problem?

As Sebastian has foretold, now is not the best time for adding support for an 
online recommender: the SlopeOneRecommender is half-dead, many dependencies are 
incomplete, and everybody's attention is drawn to the core-0.8 release. 
Regardless, I'll try to solve it myself and spend some time on other tickets.
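
To make the single-GD-step update concrete, here is a rough sketch of what 
setPreference(userID, itemID, value) could trigger against the in-memory factorization; 
the method and parameter names below are illustrative, not the actual Mahout API.

// Illustrative only: apply one SGD step to the affected rows when a new
// preference (u, i, r) arrives. userFeatures/itemFeatures are the factor
// matrices produced by the factorizer; mu is the learning rate, lambda the
// regularization weight.
void onSetPreference(int u, int i, double r,
                     double[][] userFeatures, double[][] itemFeatures,
                     double mu, double lambda) {
  double[] pu = userFeatures[u];
  double[] qi = itemFeatures[i];
  double estimate = 0.0;
  for (int k = 0; k < pu.length; k++) {
    estimate += pu[k] * qi[k];
  }
  double err = r - estimate;
  for (int k = 0; k < pu.length; k++) {
    double puk = pu[k];
    pu[k] += mu * (err * qi[k] - lambda * puk);  // user-factor update
    qi[k] += mu * (err * puk - lambda * qi[k]);  // item-factor update
  }
}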

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-12 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13707104#comment-13707104
 ] 

Peng Cheng commented on MAHOUT-1274:


Totally agree. I don't know about other DataModels, but the current GenericDataModel 
uses two maps of PreferenceArray, which is counterintuitive. I thought it could be 
a double FastByIDMap that allows O(1) random access, but I must have missed some 
other requirements.

Haven't read FactorizablePreferences yet, thanks a lot for your advice.
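
For what it's worth, the 'double FastByIDMap' idea could look roughly like the sketch 
below; this is only an illustration of the data structure, not a proposal for the actual 
GenericDataModel internals.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;

// Sketch: nested FastByIDMaps keyed by userID then itemID give O(1) random access
// to a single rating, which is what an online setPreference/removePreference needs.
final class InMemoryRatings {
  private final FastByIDMap<FastByIDMap<Float>> ratings = new FastByIDMap<FastByIDMap<Float>>();

  void set(long userID, long itemID, float value) {
    FastByIDMap<Float> row = ratings.get(userID);
    if (row == null) {
      row = new FastByIDMap<Float>();
      ratings.put(userID, row);
    }
    row.put(itemID, value);
  }

  Float get(long userID, long itemID) {
    FastByIDMap<Float> row = ratings.get(userID);
    return row == null ? null : row.get(itemID);
  }

  void remove(long userID, long itemID) {
    FastByIDMap<Float> row = ratings.get(userID);
    if (row != null) {
      row.remove(itemID);
    }
  }
}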

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: (Bi-)Weekly/Monthly Dev Sessions

2013-07-09 Thread Peng Cheng
Sorry I missed the meeting; I really wanted to listen to your discussion, 
but yesterday a thunderstorm cut off my electricity.


On 13-07-08 08:29 PM, Andrew Musselman wrote:

I'm getting an error when I build after doing svn up:

$ mvn package
[INFO] Scanning for projects...
[ERROR] The build could not read 1 project - [Help 1]
[ERROR]
[ERROR]   The project  (/home/akm/mahout/pom.xml) has 1 error
[ERROR] Non-readable POM /home/akm/mahout/pom.xml: no more data
available - expected end tag /project to close start tag project from
line 2, parser stopped on END_TAG seen .../reporting\n/project\n...
@1030:1

But there's a /project tag at the end of that..


On Mon, Jul 8, 2013 at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:


Hmm, seems like that old link doesn't work.  Here's a new one:
https://plus.google.com/hangouts/_/899b63ca1b3864c749886348cdddfcd80d00bb0b?hl=en

-Grant

On Jul 7, 2013, at 5:24 PM, Grant Ingersoll gsing...@apache.org wrote:


How about tomorrow (Monday) night at 8:30 pm EDT?

Anyone who wants to join, can browse to

https://plus.google.com/hangouts/_/1aa32da8d1f9b1669cf6b5ec8bce123d12aec409?hl=en
 If for some reason that doesn't work, ping me on IRC (gsingers) in the
#mahout channel on Freenode.


Agenda:

0.8 Release Testing

-Grant


On Jun 25, 2013, at 6:17 PM, Suneel Marthi suneel_mar...@yahoo.com

wrote:

Is today's Hangout happening?



On Wed, Jun 12, 2013 at 4:26 AM, Grant Ingersoll gsing...@apache.org

wrote:
Hi,

One of the things we kicked around at Buzzwords was having a
weekly/bi-weekly/monthly dev session via Google hangout (Drill does

this

with good success, I believe).  Since we are so spread out, I thought

I

would throw out a Doodle (scheduling tool for those unfamiliar) to see

what

times work best for the majority of people interested in such a thing.
   Anyone is free to participate, but this is not a Q and A session,

but is

instead focused on writing code, fixing bugs, triaging JIRA,

releasing,

etc.

If you are interested, please fill out

http://doodle.com/gatxxkm7f25fq5y8 (note, all times are Eastern Time

Zone

since I did the poll!)  I just

grabbed a sampling of hours throughout the day.  I also picked 1 week

as

being representative of this being on a repeating schedule.  If none

of

the

times work for you, but you are still interested, please respond

here.  I

would imagine we would meet for 1-2 hours.

Also, please reply with the frequency at which you would like to meet:

[]  Weekly
[]  Bi-weekly (every 2 weeks)
[]  Monthly

My vote is every two weeks.

-Grant




--
Thanks,
Pradeep


Grant Ingersoll | @gsingers
http://www.lucidworks.com







Grant Ingersoll | @gsingers
http://www.lucidworks.com











Re: 0.8 progress

2013-07-08 Thread Peng Cheng

Hi Sebastian,

I'm sorry for the entirely noobish questions: where can I download the 
judging.txt ground truth set? (Netflix is pulling it off everywhere; so 
far I can only get the legacy trainingSet and qualifying.txt.)
And how do I inject the ParallelAlsFactorizationJob into a common 
recommender class?
I was trying to reproduce your result (I own a small cluster), but I don't 
even know where to start. The only related thing I found in 
mahout-examples is a format converter.


Thanks a lot if you can give me a hint.

- Yours Peng

On 13-07-01 01:24 AM, Sebastian Schelter wrote:

I successfully ran the ALS and cooccurrence-based recommenders on the
Netflix dataset on a 26 machine cluster using Hadoop 1.0.4.

--sebastian


On 28.06.2013 21:31, Jake Mannix wrote:

I can run LDA on Twitter's cluster, on both reuters and some real data,
as well as LR/SGD.


On Fri, Jun 28, 2013 at 11:51 AM, Grant Ingersoll gsing...@apache.orgwrote:


We really should setup a VM that we can run a couple of nodes (perhaps at
ASF?) on that we can share w/ everyone that makes it easy to test our stuff
on Hadoop for the specific version that we ship.

On Jun 28, 2013, at 2:41 PM, Robin Anil robin.a...@gmail.com wrote:


Can someone (if you have time and experience). Write a small shim to run
all examples one after the other on a cluster and write up instructions

on

how to do it.?

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Fri, Jun 28, 2013 at 1:11 PM, Sebastian Schelter s...@apache.org

wrote:

Its crucial that we retest everything on a real cluster before the

release.

I will do this for the recommenders code next week.

--sebastian
Am 28.06.2013 14:03 schrieb Grant Ingersoll gsing...@apache.org:


I should have time next week to do the release, if we can get these
knocked out.  If not next week, the following.

On Jun 28, 2013, at 5:46 AM, Suneel Marthi suneel_mar...@yahoo.com
wrote:


1. Could someone look at Mahout-1257? There is a patch that's been

submitted but I am not sure if this has been superseded by Sean's

against

Mahout-1239.

2. Stevo, I am for fixing the findbugs excludes as part of 0.8

release,

I see that the number of warnings has gone up over the last few builds.

3. I am more concerned about the cause of the mysterious cosmic rays

that randomly fail unit tests (since we have moved to running parallel
tests).  I see that happening on my local repository too.





From: Stevo Slavić ssla...@gmail.com
To: dev@mahout.apache.org
Sent: Friday, June 28, 2013 3:21 AM
Subject: Re: 0.8 progress


Well done team!

Build is unstable, oscillates, IMO regardless of changes made. Judging

from

logs I suspect that some of the Jenkins nodes are not configured well,

/tmp

directory security related issues, and file size constraints. Could be

also

issue with our tests.

Javadoc was reported earlier not to be OK (not all modules in

aggregated

javadoc), and code quality reports are not working OK, e.g. findbugs
doesn't respect excludes - plan to work on this during weekend.

Do we want to fix these before or after 0.8 release?

Kind regards,
Stevo Slavić.


On Fri, Jun 28, 2013 at 12:32 AM, Robin Anil robin.a...@gmail.com

wrote:

All Done

Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Jun 23, 2013 at 11:36 PM, Robin Anil robin.a...@gmail.com

wrote:

I sent the comments. The code is good. But without the matrix/vector

input

we cant ship it in the release. Hope Yiqun and Da Zhang can make

those

changes quickly.


Robin Anil | Software Engineer | +1 312 869 2602 | Google Inc.


On Sun, Jun 23, 2013 at 8:46 PM, Grant Ingersoll 

gsing...@apache.org

wrote:


I see 1 issue left: MAHOUT-1214.  It is assigned to Robin.  Any

chance

we

can finish this up this week?

-Grant

On Jun 23, 2013, at 9:26 AM, Suneel Marthi 

suneel_mar...@yahoo.com

wrote:


Finally got to finishing up M-833, the changes can be reviewed at

https://reviews.apache.org/r/11774/diff/3/.






From: Grant Ingersoll gsing...@apache.org
To: dev@mahout.apache.org
Sent: Tuesday, June 11, 2013 10:09 AM
Subject: Re: 0.8 progress


I pushed M-1030 and M-1233.  If we can get M-833 and M-1214 in by

Thursday, I can roll an RC on Thursday.

-Grant

On Jun 11, 2013, at 8:56 AM, Grant Ingersoll gsing...@apache.org

wrote:

Down to 4 issues!  I would say what they are, but JIRA is flaking

out

again.

My instinct is that 1030 and 1233 can be pushed.  Suneel has been

working hard to get M-833 in.  Not sure on M-1214, Robin?

-G

On Jun 9, 2013, at 6:10 PM, Grant Ingersoll gsing...@apache.org

wrote:

On Jun 9, 2013, at 6:02 PM, Grant Ingersoll 

gsing...@apache.org

wrote:

M-1067 -- Dmitriy  --  This is an enhancement, should we push?

Looks like this was committed already.





Grant Ingersoll | @gsingers
http://www.lucidworks.com


Grant Ingersoll


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe 
you one.
Testing on the Netflix dataset has run into some trouble, namely, I don't know 
where to download it :-. Great appreciation for anyone who can share their 
judging.txt. In the meantime I'll try more GroupLens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-08 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13702175#comment-13702175
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/8/13 6:06 PM:


Hey Sebastian, Hudson, thank you so much for pushing things that hard. I owe 
you one.
I'll test more GroupLens data. Since Sebastian has taken over the code, new 
test cases will only be posted as code snippets.

  was (Author: peng):
Hey Sebastian, Hudson, Thank you so much for on pushing things that hard. I 
own you this.
testing on netflix dataset has encountered some trouble, namely, I don't know 
where to download it :-. Great appreciation for anyone who can share his 
judging.txt. In the mean time I'll try more grouplens data.
Since Sebastian has taken over the code, new test cases will only be posted as 
code snippets.
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Fix For: 0.8

 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701672#comment-13701672
 ] 

Peng Cheng commented on MAHOUT-1272:


Hey honoured contributors, I've got some crude test results for the new parallel 
SGD factorizer for CF:

1. parameters:
lambda = 1e-10
rank of the rating matrix/number of features of each user/item vectors = 50
number of biases: 3 (average rating + user bias + item bias)
number of iterations/epochs = 2 (for all factorizers including ALSWR, 
ratingSGD and the proposed parallelSGD)
initial mu/learning rate = 0.01 (for ratingSGD and proposed parallelSGD)
decay rate of mu = 1 (does not decay) (for ratingSGD and proposed 
parallelSGD)
other parameters are set to default.

2. result on movielens-10m (I don't know what the hell happened to ALSWR, the 
default hyperparameters must have screwed things up real bad, but my point is the speed edge):
  a. RMSE

Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 3.7709163950800665E21 
time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8847393972529887 time spent: 6.179s===
Jul 07, 2013 5:20:23 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8805947464818478 time spent: 3.084s

  b. Absolute Average

INFO: ==Recommender With ALSWRFactorizer: 1.2085420449917682E19 
time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.675685274206 time spent: 7.444s===
Jul 07, 2013 5:22:39 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.6775774766740665 time spent: 2.365s

3. result on movielens-1m (on average SGD works worse on it compared to 
movielens-10m; perhaps I could use more iterations/epochs)

  a. RMSE

Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 1.3514189134383086E20 
time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.9312989913558529 time spent: 0.637s===
Jul 07, 2013 5:26:04 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.9529995632658007 time spent: 0.305s

  b. Absolute Average

Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ALSWRFactorizer: 
1.58934499216789965E18 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.7459565635961599 time spent: 0.626s===
Jul 07, 2013 5:25:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.7420818642753416 time spent: 0.297s

Great thanks to Sebastian for his guidance. I'll upload the EvaluatorRunner 
class as a mahout-examples component and the formatted code shortly.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http

[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java
GroupLensSVDRecomenderEvaluatorRunner.java

My laptop is an HP Pavilion with an Intel® Core™ i7-3610QM CPU @ 2.30GHz × 8 and 8G 
of memory.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701679#comment-13701679
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian, may I ask a question? I dug up some old posts and found that the best 
result should be RMSE ~= 0.85; do you know the parameters that were used?

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/7/13 10:21 PM:
-

New parameters:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m; all evaluations use RMSE:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.

  was (Author: peng):
New parameter:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vectors = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough, I'm advancing to netflix prize dataset.
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701682#comment-13701682
 ] 

Peng Cheng commented on MAHOUT-1272:


New parameters:
lambda = 0.001
rank of the rating matrix/number of features of each user/item vector = 5
number of iterations/epochs = 20

result on movielens-10m:
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With RatingSGDFactorizer: 
0.8119081937625745 time spent: 36.509s===
Jul 07, 2013 6:18:57 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: ==Recommender With ParallelSGDFactorizer: 
0.8115207244832938 time spent: 8.747s

This is fast and accurate enough; I'm advancing to the Netflix prize dataset.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-07 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701688#comment-13701688
 ] 

Peng Cheng commented on MAHOUT-1272:


Hi Sebastian,

Really? I would break my fingers to squeeze into the 0.8 release (not RC1 of 
course, but there is still RC2 :-). A few guys I work with are also pushing me 
for the online recommender, so I can work very hard and undistracted. You just 
tell me what to do next and I'll be thrilled to oblige.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: GroupLensSVDRecomenderEvaluatorRunner.java, 
 mahout.patch, ParallelSGDFactorizer.java, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java, ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Code Freeze for 0.8

2013-07-07 Thread Peng Cheng

Hi Dr Dunning,

I recently joined the team and am working on tickets 1272 and 1274 right 
now. I was planning to commit to core-0.8 rc2, but the time frame seems 
harsh. Could you tell me if it is practical? I'm a hard worker.


PS I was there at your presentation in Toronto this year. Not ashamed to 
say, it was one of the funniest lectures of my life.


-Yours Peng

On 13-07-07 05:19 PM, Grant Ingersoll wrote:

Working on the release now.  If anyone wants to join in, I'm on IRC as well.

-Grant


On Jul 5, 2013, at 12:40 PM, Sebastian Schelters...@apache.org  wrote:


+1

On 05.07.2013 18:06, Jake Mannix wrote:

+1



On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunningted.dunn...@gmail.com  wrote:


+1


On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com

wrote:
+1




From: Grant Ingersollgsing...@apache.org
To:dev@mahout.apache.org  dev@mahout.apache.org
Sent: Friday, July 5, 2013 10:36 AM
Subject: Code Freeze for 0.8


I know it's short notice, but I'd like to suggest a code freeze for 0.8
today or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA, etc.,
it looks like this should be fine, but let me know if there are any
objections.

Thanks,
Grant





Grant Ingersoll | @gsingers
http://www.lucidworks.com











Re: Code Freeze for 0.8

2013-07-07 Thread Peng Cheng

Hi Dr Dunning,

Thanks a lot. I just read that the deadline is within 7 days and 
immediately realized how ill-conceived my plan was. There will be no rc1 or 
rc2, just an rc.

Will have to cram in a bunch of tests over the next few days.

- Peng

On 07/07/2013 10:12 PM, Ted Dunning wrote:

Peng,

Strictly speaking, the code is frozen already.  Sebastian seems to think
some can get in, but even that is pushing things.


On Sun, Jul 7, 2013 at 3:59 PM, Peng Cheng pc...@uowmail.edu.au wrote:


Hi Dr Dunning,

I recently joined the team and am working on tickets 1272 and 1274 right
now. I was planning to commit to core-0.8 rc2 but the time frame seems
harsh. Could you tell me if is it practical? I'm a hard worker.

PS I was there at your presentation in Toronto this year. Not ashamed to
say, one of the funniest lecture in my life.

-Yours Peng


On 13-07-07 05:19 PM, Grant Ingersoll wrote:


Working on the release now.  If anyone wants to join in, I'm on IRC as
well.

-Grant


On Jul 5, 2013, at 12:40 PM, Sebastian Schelters...@apache.org  wrote:

  +1

On 05.07.2013 18:06, Jake Mannix wrote:


+1



On Fri, Jul 5, 2013 at 8:47 AM, Ted Dunningted.dunn...@gmail.com
  wrote:

  +1


On Fri, Jul 5, 2013 at 7:43 AM, Suneel Marthi suneel_mar...@yahoo.com


wrote:
+1



__**__
From: Grant Ingersollgsing...@apache.org
To:dev@mahout.apache.org  dev@mahout.apache.org
Sent: Friday, July 5, 2013 10:36 AM
Subject: Code Freeze for 0.8


I know it's short notice, but I'd like to suggest a code freeze for
0.8
today or tomorrow and I will do a 0.8 RC on Sunday.  Based on JIRA,
etc.,
it looks like this should be fine, but let me know if there are any
objections.

Thanks,
Grant



  --**--

Grant Ingersoll | @gsingers
http://www.lucidworks.com














[jira] [Comment Edited] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701233#comment-13701233
 ] 

Peng Cheng edited comment on MAHOUT-1272 at 7/6/13 2:43 PM:


Hey, I have finished the class and test for the parallel SGD factorizer for the 
matrix-completion based recommender (not MapReduced, just single-machine 
multi-threaded); it is loosely based on vanilla SGD and hogwild!. I have only 
tested on toy and synthetic data (2000 users * 1000 items) but it is pretty 
fast, 3-5x faster than vanilla SGD with 8 cores (never exceeding 6x; 
apparently the executor induces a high allocation overhead), and definitely 
faster than single-machine ALSWR.

I'm submitting my java files and patch for review.

  was (Author: peng):
Hey I have finished the class and test for parallel sgd factorizer for 
matrix-completion based recommender (not mapreduced, just single machine 
multi-thread), it is loosely based on vanilla sgd and hogwild!. I have only 
tested on toy and synthetic data (2000users * 1000 times) but it is pretty 
fast, 3-5x times faster than vanilla sgd with 8 cores. (never exceed 6x, 
apparently the executor induces high overhead allocation cost) And definitely 
faster than single machine ALSWR. 

I'm submitting my java files and patch for review.
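
Since the factorizer above is described as loosely based on vanilla SGD and hogwild!, 
here is a minimal sketch of the hogwild! idea for readers who haven't seen it: several 
workers run SGD over interleaved slices of the ratings and write to the shared factor 
matrices without locking, accepting occasional lost updates. This is a simplified 
illustration, not the submitted ParallelSGDFactorizer.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Minimal hogwild!-style loop: workers update shared user/item factors without locks.
public final class HogwildSketch {

  public static void run(final int[][] ratings,          // rows of {userIndex, itemIndex, rating}
                         final double[][] userFeatures,  // shared, updated concurrently
                         final double[][] itemFeatures,
                         final double mu, final double lambda,
                         int numThreads, int numEpochs) throws InterruptedException {
    for (int epoch = 0; epoch < numEpochs; epoch++) {
      ExecutorService pool = Executors.newFixedThreadPool(numThreads);
      for (int t = 0; t < numThreads; t++) {
        final int offset = t;
        final int stride = numThreads;
        pool.execute(new Runnable() {
          @Override
          public void run() {
            for (int n = offset; n < ratings.length; n += stride) {
              double[] pu = userFeatures[ratings[n][0]];
              double[] qi = itemFeatures[ratings[n][1]];
              double estimate = 0.0;
              for (int k = 0; k < pu.length; k++) {
                estimate += pu[k] * qi[k];
              }
              double err = ratings[n][2] - estimate;
              for (int k = 0; k < pu.length; k++) {
                double puk = pu[k];
                pu[k] += mu * (err * qi[k] - lambda * puk);  // racy on purpose: no locks
                qi[k] += mu * (err * puk - lambda * qi[k]);
              }
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
    }
  }
}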
  
 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1274:
---

Description: 
an online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new recommendations, it can add them 
into the training dataModel and update the result accordingly in real time.

an online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such feature could come in handy for some websites.

Implementation: Adding new users or items, or increasing the rating matrix rank, 
just increases the size of the user and item matrices. Reducing the rating matrix rank 
involves just one SVD. The real challenge here is that SGD is NOT a ONE-PASS 
algorithm; multiple passes are required to achieve an acceptable optimality, and 
even more so if the hyperparameters are bad. But here are two possible workarounds:

1. Use one-pass algorithms like averaged-SGD. I'm not sure if this can ever work, as 
applying a stochastic convex-opt algorithm to a non-convex problem is anarchy. But 
it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly sampled 
(but not uniformly sampled) from the latest dataModel. I don't know exactly how 
this should be done, but new ratings should be sampled more frequently. Uniform 
sampling will result in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and can share their 
insight, that would be awesome. This seems to be the most viable option, if I can 
get the non-uniform pseudorandom generator that maintains the cumulative uniform 
distribution I want.

I found a very old ticket (MAHOUT-572) mentioning an online SVD recommender, but it 
didn't pay off. Hopefully it's not a bad idea to submit a new ticket here.

  was:
an online SVD recommender is otherwise similar to an offline SVD recommender 
except that, upon receiving one or several new recommendations, it can add them 
into the training dataModel and update the result accordingly in real time.

an online SVD recommender should override setPreference(...) and 
removePreference(...) in AbstractRecommender such that the factorization result 
is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such 
capability.

Since SGD is intrinsically an online algorithm and its CF implementation is 
available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
good time to convert it. Such feature could come in handy for some websites.

Implementation: Adding new users, items, or increasing rating matrix rank are 
just increasing size of user and item matrices. Reducing rating matrix rank 
involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
algorithm, multiple passes are required to achieve an acceptable optimality and 
even more so if hyperparameters are bad. But here are two possible circumvents:

1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
applying stochastic convex-opt algorithm to non-convex problem is anarchy. But 
it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly sampled 
(but not uniformly sampled) from latest dataModel. I don't know how exactly 
this should be done but new rating should be sampled more frequently. Uniform 
sampling will results in old ratings being used more than new ratings in total. 
If somebody has worked on this batch-to-online conversion before and share his 
insight that would be awesome. This seems to be the most viable option, if I 
get the non-uniform pseudorandom generator that maintains a cumulative uniform 
distribution I want.

I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but it 
didn't pay off. Hopefully its not a bad idea to submit a new tickets.
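
As a rough illustration of the O(1)-per-rating update described above, 
overriding setPreference could reduce to a few SGD steps that touch only the 
affected user and item vectors. This is a minimal sketch with hypothetical 
names; it is not the Mahout recommender API.

```java
// Hypothetical sketch of an O(1)-per-rating update: a new preference only nudges
// the affected user and item latent vectors with a few SGD steps, no retraining.
// Field and method names are illustrative; this is not the Mahout recommender API.
public class OnlineFoldInSketch {

  private final double[][] userF;   // user latent vectors, one row per user
  private final double[][] itemF;   // item latent vectors, one row per item
  private final double eta = 0.01, lambda = 0.05;
  private final int stepsPerUpdate = 5;

  public OnlineFoldInSketch(double[][] userF, double[][] itemF) {
    this.userF = userF;
    this.itemF = itemF;
  }

  /** Fold a single new rating into the factorization; cost is O(rank), not O(#ratings). */
  public void setPreference(int userId, int itemId, double rating) {
    double[] u = userF[userId];
    double[] v = itemF[itemId];
    for (int s = 0; s < stepsPerUpdate; s++) {
      double pred = 0;
      for (int k = 0; k < u.length; k++) pred += u[k] * v[k];
      double err = rating - pred;
      for (int k = 0; k < u.length; k++) {
        double uk = u[k], vk = v[k];
        u[k] += eta * (err * vk - lambda * uk);
        v[k] += eta * (err * uk - lambda * vk);
      }
    }
  }

  public static void main(String[] args) {
    java.util.Random rnd = new java.util.Random(1);
    double[][] u = new double[10][4], v = new double[20][4];
    for (double[] row : u) for (int k = 0; k < 4; k++) row[k] = 0.1 * rnd.nextGaussian();
    for (double[] row : v) for (int k = 0; k < 4; k++) row[k] = 0.1 * rnd.nextGaussian();
    new OnlineFoldInSketch(u, v).setPreference(3, 7, 4.0);
  }
}
```

Because only two rows of length rank are touched, the cost per incoming 
preference stays constant no matter how many ratings are already in the 
DataModel.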


 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them

[jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380
 ] 

Peng Cheng commented on MAHOUT-1274:


BTW, may I ask (noobishly) why you have deprecated the SlopeOneRecommender in 
the latest core-0.8 snapshot? I must have missed a lot in previous 
mahout-development emails before I joined, so apologies if it's a stupid question.

 SGD-based Online SVD recommender
 

 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features, machine_learning, svd
   Original Estimate: 336h
  Remaining Estimate: 336h

 an online SVD recommender is otherwise similar to an offline SVD recommender 
 except that, upon receiving one or several new recommendations, it can add 
 them into the training dataModel and update the result accordingly in real 
 time.
 an online SVD recommender should override setPreference(...) and 
 removePreference(...) in AbstractRecommender such that the factorization 
 result is updated in O(1) time and without retraining.
 Right now the slopeOneRecommender is the only component possessing such 
 capability.
 Since SGD is intrinsically an online algorithm and its CF implementation is 
 available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would be a 
 good time to convert it. Such feature could come in handy for some websites.
 Implementation: Adding new users, items, or increasing rating matrix rank are 
 just increasing size of user and item matrices. Reducing rating matrix rank 
 involves just one svd. The real challenge here is that sgd is NO ONE-PASS 
 algorithm, multiple passes are required to achieve an acceptable optimality 
 and even more so if hyperparameters are bad. But here are two possible 
 circumvents:
 1. Use one-pass algorithms like averaged-SGD, not sure if it can ever work as 
 applying stochastic convex-opt algorithm to non-convex problem is anarchy. 
 But it may be a long shot.
 2. Run incomplete passes in each online update using ratings randomly sampled 
 (but not uniformly sampled) from latest dataModel. I don't know how exactly 
 this should be done but new rating should be sampled more frequently. Uniform 
 sampling will results in old ratings being used more than new ratings in 
 total. If somebody has worked on this batch-to-online conversion before and 
 share his insight that would be awesome. This seems to be the most viable 
 option, if I get the non-uniform pseudorandom generator that maintains a 
 cumulative uniform distribution I want.
 I found a very old ticket (MAHOUT-572) mentioning online SVD recommender but 
 it didn't pay off. Hopefully its not a bad idea to submit a new ticket here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: [jira] [Commented] (MAHOUT-1274) SGD-based Online SVD recommender

2013-07-06 Thread Peng Cheng

Hi Sebastian,

Thanks a lot for the help! Do you mean core-1.0 or bundle-1.0? I hope I can 
work hard enough to catch the next release. Also, what do you think 
about the proposed online pseudorandom sampling problem?


I was digging through old threads and found MAHOUT-1069, which already did a lot 
of the work I need right now and used a lot of code optimization 
techniques, but was eventually rejected for being too complex and 
drastic. :-(


I wonder if overengineering is a researcher's most dangerous bane; it has 
happened to a lot of people.
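
For concreteness, here is one way the recency-biased sampling asked about above 
could look, with made-up class and method names (an assumption on my part, not 
an existing or proposed Mahout class): each incoming rating gets a geometrically 
growing weight, and sampling is done against the cumulative weights, so new 
ratings dominate without old ones being frozen out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical recency-biased sampler: items added later carry geometrically
// larger weights, so they are drawn more often, while old items still appear.
public class RecencyBiasedSampler<T> {

  private final List<T> items = new ArrayList<>();
  private final List<Double> cumulative = new ArrayList<>(); // running sum of weights
  private final double growth;     // > 1.0 biases sampling toward recent items
  private final Random rnd = new Random();
  private double nextWeight = 1.0;

  public RecencyBiasedSampler(double growth) { this.growth = growth; }

  public void add(T item) {
    items.add(item);
    double total = cumulative.isEmpty() ? 0.0 : cumulative.get(cumulative.size() - 1);
    cumulative.add(total + nextWeight);
    nextWeight *= growth;
  }

  public T sample() {
    double u = rnd.nextDouble() * cumulative.get(cumulative.size() - 1);
    int lo = 0, hi = cumulative.size() - 1;
    while (lo < hi) {                 // binary search for the first cumulative weight >= u
      int mid = (lo + hi) >>> 1;
      if (cumulative.get(mid) < u) lo = mid + 1; else hi = mid;
    }
    return items.get(lo);
  }

  public static void main(String[] args) {
    RecencyBiasedSampler<Integer> s = new RecencyBiasedSampler<>(1.01);
    for (int i = 0; i < 1000; i++) s.add(i);
    System.out.println("sampled rating index: " + s.sample()); // usually a recent index
  }
}
```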


On 13-07-06 01:31 PM, Sebastian Schelter wrote:

Hi Peng,

We deprecated a lot of algorithms that we found to be not much used, in order to
streamline our codebase for the coming 1.0 release.
On 06.07.2013 10:25, Peng Cheng (JIRA) j...@apache.org wrote:


 [
https://issues.apache.org/jira/browse/MAHOUT-1274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701380#comment-13701380]

Peng Cheng commented on MAHOUT-1274:


BTW may I ask (noobishly) that why you have deprecated the
SlopeOneRecommender in the latest core-0.8 snapshot? i must have missed a
lot in previous mahout-development emails before i join so apologies if its
a stupid question.


SGD-based Online SVD recommender


 Key: MAHOUT-1274
 URL: https://issues.apache.org/jira/browse/MAHOUT-1274
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: collaborative-filtering, features,

machine_learning, svd

   Original Estimate: 336h
  Remaining Estimate: 336h

an online SVD recommender is otherwise similar to an offline SVD

recommender except that, upon receiving one or several new recommendations,
it can add them into the training dataModel and update the result
accordingly in real time.

an online SVD recommender should override setPreference(...) and

removePreference(...) in AbstractRecommender such that the factorization
result is updated in O(1) time and without retraining.

Right now the slopeOneRecommender is the only component possessing such

capability.

Since SGD is intrinsically an online algorithm and its CF implementation

is available in core-0.8 (See MAHOUT-1089, MAHOUT-1272), I presume it would
be a good time to convert it. Such feature could come in handy for some
websites.

Implementation: Adding new users, items, or increasing rating matrix

rank are just increasing size of user and item matrices. Reducing rating
matrix rank involves just one svd. The real challenge here is that sgd is
NO ONE-PASS algorithm, multiple passes are required to achieve an
acceptable optimality and even more so if hyperparameters are bad. But here
are two possible circumvents:

1. Use one-pass algorithms like averaged-SGD, not sure if it can ever

work as applying stochastic convex-opt algorithm to non-convex problem is
anarchy. But it may be a long shot.

2. Run incomplete passes in each online update using ratings randomly

sampled (but not uniformly sampled) from latest dataModel. I don't know how
exactly this should be done but new rating should be sampled more
frequently. Uniform sampling will results in old ratings being used more
than new ratings in total. If somebody has worked on this batch-to-online
conversion before and share his insight that would be awesome. This seems
to be the most viable option, if I get the non-uniform pseudorandom
generator that maintains a cumulative uniform distribution I want.

I found a very old ticket (MAHOUT-572) mentioning online SVD recommender

but it didn't pay off. Hopefully its not a bad idea to submit a new ticket
here.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA
administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira






[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Labels: features patch test  (was: )
Status: Patch Available  (was: Open)

Hey, I have finished the class and test for a parallel SGD factorizer for 
matrix-completion based recommenders (not MapReduce, just single-machine 
multi-threading); it is loosely based on vanilla SGD and Hogwild!. I have only 
tested it on toy and synthetic data (2000 users x 1000 items), but it is pretty 
fast: 3-5x faster than vanilla SGD with 8 cores (it never exceeds 6x; 
apparently the executor induces a high allocation overhead), and definitely 
faster than single-machine ALS-WR. 

I'm submitting my java files and patch for review.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: patch, test, features
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: ParallelSGDFactorizerTest.java
ParallelSGDFactorizer.java

java file

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Cheng updated MAHOUT-1272:
---

Attachment: mahout.patch

patch

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701241#comment-13701241
 ] 

Peng Cheng commented on MAHOUT-1272:


The next step would be to create an online version of this (and of the recommender): 
SGD is an online algorithm, but right now it works only in a batch recommender.
In the meantime the only online recommender in Mahout is slope-one, which is kind 
of a shame.
Will create a new JIRA ticket tomorrow.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-05 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13701247#comment-13701247
 ] 

Peng Cheng commented on MAHOUT-1272:


Aye aye, more tests on the way. Much obliged for the quick suggestion.





 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
  Labels: features, patch, test
 Attachments: mahout.patch, ParallelSGDFactorizer.java, 
 ParallelSGDFactorizerTest.java

   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-07-01 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696932#comment-13696932
 ] 

Peng Cheng commented on MAHOUT-1272:


Looks like the 1/n learning rate doesn't work at all on the SGD factorizer; maybe 
the convergence results for stochastic optimization can't be applied to the 
non-convex MF problem. Can someone point me to a paper discussing convergence 
bounds for such a problem? Much appreciated.

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1272) Parallel SGD matrix factorizer for SVDrecommender

2013-06-29 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13696155#comment-13696155
 ] 

Peng Cheng commented on MAHOUT-1272:


The learning rate/step size is set to be identical to the one in package 
~.classifier.sgd: the old learning rate decays exponentially with a constant 
factor. This setting seems to work only for smooth functions (proved by 
Nesterov?), and I'm not sure it holds in CF. Otherwise, use either 1/sqrt(n) 
for convex f or 1/n for strongly convex f.
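
For reference, the three schedules side by side; the constants below are made 
up for illustration and are not the classifier.sgd defaults.

```java
// Illustrative comparison of the three step-size schedules mentioned above.
public class StepSizeSchedules {

  static double exponentialDecay(long n, double eta0, double decay) {
    return eta0 * Math.pow(decay, n);          // eta_n = eta0 * decay^n, with 0 < decay < 1
  }

  static double inverseSqrt(long n, double eta0) {
    return eta0 / Math.sqrt(n + 1);            // O(1/sqrt(n)): typical for convex objectives
  }

  static double inverse(long n, double eta0) {
    return eta0 / (n + 1);                     // O(1/n): typical for strongly convex objectives
  }

  public static void main(String[] args) {
    for (long n : new long[] {0, 10, 100, 1000, 10_000}) {
      System.out.printf("n=%5d  exp=%.6f  1/sqrt=%.6f  1/n=%.6f%n",
          n, exponentialDecay(n, 0.05, 0.999), inverseSqrt(n, 0.05), inverse(n, 0.05));
    }
  }
}
```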

 Parallel SGD matrix factorizer for SVDrecommender
 -

 Key: MAHOUT-1272
 URL: https://issues.apache.org/jira/browse/MAHOUT-1272
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Peng Cheng
Assignee: Sean Owen
   Original Estimate: 336h
  Remaining Estimate: 336h

 a parallel factorizer based on MAHOUT-1089 may achieve better performance on 
 multicore processor.
 existing code is single-thread and perhaps may still be outperformed by the 
 default ALS-WR.
 In addition, its hardcoded online-to-batch-conversion prevents it to be used 
 by an online recommender. An online SGD implementation may help build 
 high-performance online recommender as a replacement of the outdated 
 slope-one.
 The new factorizer can implement either DSGD 
 (http://www.mpi-inf.mpg.de/~rgemulla/publications/gemulla11dsgd.pdf) or 
 hogwild! (www.cs.wisc.edu/~brecht/papers/hogwildTR.pdf).
 Related discussion has been carried on for a while but remain inconclusive:
 http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAHOUT-1089) SGD matrix factorization for rating prediction with user and item biases

2013-06-28 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13695745#comment-13695745
 ] 

Peng Cheng commented on MAHOUT-1089:


Code is slick! But apparently there is no multi-threading yet.
The proposal for it has been there for a long time:
http://web.archiveorange.com/archive/v/z6zxQUSahofuPKEzZkzl

Is somebody working on its implementation?
Apparently using Hogwild! or vanilla DSGD makes no big difference in performance.

 SGD matrix factorization for rating prediction with user and item biases
 

 Key: MAHOUT-1089
 URL: https://issues.apache.org/jira/browse/MAHOUT-1089
 Project: Mahout
  Issue Type: New Feature
  Components: Collaborative Filtering
Reporter: Zeno Gantner
Assignee: Sebastian Schelter
 Attachments: MAHOUT-1089.patch, RatingSGDFactorizer.java, 
 RatingSGDFactorizer.java


 A matrix factorization that is trained with standard SGD on all features at 
 the same time, in contrast to ExpectationMaximizationFactorizer, which learns 
 feature by feature.
 Additionally to the free features it models a rating bias for each user and 
 item.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

