Re: Number of features for ALS

2014-03-27 Thread j.barrett Strausser
Thanks Ted. Yes for the time problem. We tend to use aggregations of session data. So instead of asking for user recommendations we do things like user+session recommendations. Of course, deciding when sessions start and stop isn't trivial. Ideally, what I would want to do is time-weight views usi
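
For readers unfamiliar with the session problem mentioned above, here is a minimal sketch of one common heuristic: start a new session whenever the gap between consecutive events exceeds a threshold. The function name `split_sessions` and the 30-minute default are assumptions for illustration only, not something described in this thread.

```python
def split_sessions(timestamps, max_gap_seconds=1800):
    """Group event timestamps (in seconds) into sessions.

    A new session starts whenever the gap to the previous event exceeds
    max_gap_seconds. The 1800 s default is an illustrative assumption.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= max_gap_seconds:
            # Close enough to the previous event: same session.
            sessions[-1].append(ts)
        else:
            # First event, or the gap was too long: open a new session.
            sessions.append([ts])
    return sessions


if __name__ == "__main__":
    views = [0, 60, 4000, 4100, 10000]
    for number, session in enumerate(split_sessions(views), start=1):
        print(number, session)
```

The threshold is the non-trivial part: a short gap splits one visit into several sessions, while a long gap merges unrelated visits, so it usually has to be tuned per data set.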

Re: Number of features for ALS

2014-03-27 Thread Ted Dunning
For the polysyllable-challenged: heteroscedasticity - the degree of variation changes. This is common with counts because you expect the standard deviation of count data to be proportional to sqrt(n). Time inhomogeneity - changes in behavior over time. One way to handle this (roughly) is to first r
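
The sqrt(n) point above can be checked with a small simulation. This is a rough sketch that assumes the counts are Poisson distributed, which the message does not state explicitly; the helper names `poisson_sample` and `spread_by_mean` are made up for illustration. It uses only the Python standard library.

```python
import math
import random
import statistics


def poisson_sample(rng, mean):
    """Draw one Poisson variate using Knuth's multiplication method.

    Suitable for small to moderate means only, since it relies on exp(-mean).
    """
    limit = math.exp(-mean)
    k, product = 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def spread_by_mean(means, draws=20000, seed=0):
    """Return {mean: (observed standard deviation, sqrt(mean))} for simulated counts."""
    rng = random.Random(seed)
    result = {}
    for mean in means:
        samples = [poisson_sample(rng, mean) for _ in range(draws)]
        result[mean] = (statistics.pstdev(samples), math.sqrt(mean))
    return result


if __name__ == "__main__":
    for mean, (observed, expected) in spread_by_mean([1, 4, 16, 64]).items():
        print(f"mean={mean:>3}  observed std={observed:.2f}  sqrt(mean)={expected:.2f}")
```

Under that assumption the spread grows with the size of the count, so the error variance is not constant across observations, which is what heteroscedasticity means here.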

Re: Number of features for ALS

2014-03-27 Thread j.barrett Strausser
For my team it has usually been heteroscedasticity and time inhomogeneity. On Thu, Mar 27, 2014 at 10:18 AM, Tevfik Aytekin wrote: > Interesting topic, > Ted, can you give examples of those mathematical assumptions > under-pinning ALS which are violated by the real world? > > On Thu, Mar 27,

Re: Number of features for ALS

2014-03-27 Thread Ted Dunning
Least squares techniques in general depend on an assumption of normal distribution of errors. With counts, that is only plausible with large values. Also, decompositions like this make linearity assumptions which imply all items/words are independent. They are clearly not. Sent from my iPhon

Fuzzy KMeans fails on reuters corpus with 4GB max heap size

2014-03-27 Thread tuxdna
I am running the Fuzzy KMeans algorithm on the Reuters corpus. I am using Mahout 0.7 on Hadoop 1.1 on an Ubuntu 12.04 machine. The Hadoop cluster consists of two machines * master: 8GB RAM ( 4 cores ) * slave: 4GB RAM ( a KVM vm with only 1 core ) When I run this command, the clustering fails at iteration 3

Re: Number of features for ALS

2014-03-27 Thread Tevfik Aytekin
Interesting topic, Ted, can you give examples of those mathematical assumptions under-pinning ALS which are violated by the real world? On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning wrote: > How can there be any other practical method? Essentially all of the > mathematical assumptions under-pinni

Re: Number of features for ALS

2014-03-27 Thread Ted Dunning
How can there be any other practical method? Essentially all of the mathematical assumptions under-pinning ALS are violated by the real world. Why would any mathematical consideration of the number of features be much more than heuristic? That said, you can make an information content argument.

Number of features for ALS

2014-03-27 Thread Sebastian Schelter
Hi, does anyone know of a principled approach to choosing the number of features for ALS (other than cross-validation)? --sebastian

RE: reduce is too slow in StreamingKmeans

2014-03-27 Thread fx MA XIAOJUN
Dear Suneel, I am very sorry that I did not reply to you for so long. I just realized that your mail was automatically recognized as spam. My data is 700MB. mapred.child.java.opts=-Xmx4g It takes 2 hours for one map to complete its task. 15 maps were started for the job. Map runs very fast, I th