Re: Avoiding OOM for large datasets
This is not right. The sequential version would have finished long before this for any reasonable value of k. I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), this may actually be increased inside the streaming k-means when it computes the number of sketch centroids by a factor of roughly 2 log N \approx 2 * 18. This gives far more clusters than you have data points, which is silly. Try again with a more reasonable value of k such as 1000. On Wed, Dec 11, 2013 at 7:02 AM, Amir Mohammad Saied amirsa...@gmail.com wrote: Hi, I first tried Streaming K-means with about 5000 news stories, and it worked just fine. Then I tried it over 300,000 news stories and gave it 10GB of RAM. After more than 43 hours, it was still in the last merge pass when I eventually decided to stop it. I set K to 20 and KM to 2522308 (it's for detecting similar/related news stories). Using these values, is it expected to take so long? Cheers, amir On Thu, Dec 5, 2013 at 3:38 PM, Amir Mohammad Saied amirsa...@gmail.com wrote: Suneel, Thanks! I tried Streaming K-Means, and now I have two naive questions: 1) If I understand correctly, to use the results of streaming k-means I need to iterate over all of my vectors again and assign them to the cluster with the closest centroid to the vector, right? 2) In clustering news, the number of clusters isn't known beforehand. We used to use canopy as a fast approximate clustering technique, but as I understand it, streaming k-means requires K in advance. How can I avoid guessing K? Regards, Amir On Wed, Dec 4, 2013 at 6:27 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Amir, This has been reported before by several others (and has been my experience too). The OOM happens during the Canopy Generation phase of Canopy clustering because it only runs with a single reducer. If you are using Mahout 0.8 (or trunk), I suggest that you look at the new Streaming KMeans clustering, which is quicker and more efficient than the traditional Canopy -> KMeans. See the following link for how to run Streaming KMeans. http://stackoverflow.com/questions/17272296/how-to-use-mahout-streaming-k-means On Wednesday, December 4, 2013 1:19 PM, Amir Mohammad Saied amirsa...@gmail.com wrote: Hi, I've been trying to run Mahout (with Hadoop) on our data for quite some time now. Everything is fine on relatively small data sets, but when I try to do K-Means clustering with the aid of Canopy on like 30 documents, I can't even get past the canopy generation because of OOM. We're going to cluster similar news, so T1 and T2 are set to 0.84 and 0.6 (those values lead to desired results on sample data). I tried setting both mapred.map.child.java.opts and mapred.reduce.child.java.opts to -Xmx4096M, I also exported HADOOP_HEAPSIZE to 4000, and I'm still having issues. I'm running all of this in Hadoop's single-node, pseudo-distributed mode on a machine with 16GB of RAM. Searching the Internet for solutions, I found this[1]. One of the bullet points states that: In all of the algorithms, all clusters are retained in memory by the mappers and reducers. So my question is, does Mahout on Hadoop only help in distributing CPU-bound operations? What should one do if they have a large dataset and only a handful of low-RAM commodity nodes? I'm obviously a newbie, thanks for bearing with me. [1] http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3c506307eb.3090...@windwardsolutions.com%3E Cheers, Amir
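A quick back-of-the-envelope version of the arithmetic above, as a plain Java sketch (the exact formula inside Mahout's streaming k-means may differ slightly; the point is just that k = 200,000 blows up while k = 1000 stays manageable for 300,000 documents):

    public class SketchSizeEstimate {
      public static void main(String[] args) {
        long n = 300000;                              // number of documents
        double log2n = Math.log(n) / Math.log(2);     // roughly 18
        for (int k : new int[]{200000, 1000}) {
          // streaming k-means keeps on the order of k * 2 * log2(N) sketch centroids
          long sketchCentroids = (long) (k * 2 * log2n);
          System.out.printf("k = %d -> ~%d sketch centroids for %d points%n", k, sketchCentroids, n);
        }
      }
    }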
Re: Slope one algorithm performance
Use a better recommender. Slope one is just there for completeness. Sent from my iPhone On Dec 8, 2013, at 2:24, Siddharth Patnaik spatnai...@gmail.com wrote: What should be done to improve the runtime performance?
Re: SVM Implementation for mahout?
The problem of correlation of features is clearly present in text, but it is not so clear what the effect will be. For naive Bayes this has the effect of making the classifier overconfident, but it usually still works reasonably well. For logistic regression without regularization it can cause the learning algorithm to fail (Mahout's logistic regression is regularized, btw). Empirical evidence dominates theory in this situation. Sent from my iPhone On Dec 8, 2013, at 9:14, Fernando Santos fernandoleandro1...@gmail.com wrote: Now just a theoretical doubt. In a text classification example, what would it mean to have features that are highly correlated? I mean, in this case our features are basically words; do you have an example of how these features might not be independent? This concept is not really clear in my mind...
Re: SVM Implementation for mahout?
On Sun, Dec 8, 2013 at 5:50 PM, Fernando Santos fernandoleandro1...@gmail.com wrote: Actually I had never heard of PCA and LDA. I'll take a look on it. PCA and LDA are probably not quite what you want for Naive Bayes, especially in Mahout. There is an assumption of a sparse binary representation for data.
Re: Question about Pearson Correlation in non-Taste mode
See http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf http://arxiv.org/abs/1207.1847 On Fri, Dec 6, 2013 at 1:09 PM, Amit Nithian anith...@gmail.com wrote: Hey Sebastian, Thanks again for the explanation. So now you have me intrigued about something else. Why is it that logliklihood ratio test is a better measure for essentially implicit ratings? Are there resources/research papers you can point me to explaining this? Take care Amit On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter ssc.o...@googlemail.comwrote: Hi Amit, No need to excuse for picking on me, I'm happy about anyone digging into the paper :) The reason, I implemented Pearson in this (flawed) way has to do with the way the parallel algorithm works: It never compares two item vectors in memory, instead it preprocesses the vectors and computes sparse dot products in parallel. The centering which is usually done for Pearson correlation is dependent on which pair of vectors you're currently looking at (and doesn't fit the parallel algorithm). We had an earlier implementation that didn't have this flaw, but was way slower than the current one. Rating prediction on explicit feedback data like ratings for which Pearson correlation is mostly used in CF, is a rather academic topic and in science there are nearly no datasets that really require you to go to Hadoop. On the other hand item prediction on implicit feedback data (like clicks) is the common scenario in the majority of industry usecases, but here count-based similarity measures like the loglikelihood ratio test give much better results. The current implementation of Mahout's distributed itembased recommender is clearly designed and tuned for the latter usecase. I hope that answers your question. --sebastian On 01.12.2013 18:10, Amit Nithian wrote: Thanks guys! So the real question is not so much what's the average of the vector with the missing rating (although yes that was a question) but what's the average of the vector with all the ratings specified but the second rating that is not shared with the first user: [5 - 4] vs [4 5 2]. If we agree that the first is 4.5 then is the second one 11/3 or 3 ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it as 11/3. Since Taste (and Lenskit) is sequential, it can (and will only) look at co-occurring ratings whereas the Hadoop implementation doesn't. The paper that Sebastian wrote has a pre-processing step where (for Pearson) you subtract each element of an item-rating vector from the average rating which implies that each item-rating vector is treated independently of each other whereas in the sequential/non-distributed mode it's all considered together. My main reason for posting is because the Taste implementation of item-item similarity differs from the distributed implementation. Since I am totally new to this space and these similarities I wanted to understand if there is a reason for this difference and whether or not it matters. Sounds like from the discussion it doesn't matter but understanding why helps me explain this to others. My guess (and I'm glad Sebastian is on this list so he can help confirm/deny this.. sorry I'm not picking on you just happy to be able to talk to you about your good paper) is that considering co-occuring ratings in a distributed implementation would require access to the full matrix which defeats the parallel nature of computing item-item similarity? Thanks again! 
Amit On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen sro...@gmail.com wrote: It's not an issue of how to be careful with sparsity and subtracting means, although that's a valuable point in itself. The question is what the mean is supposed to be. You can't think of missing ratings as 0 in general, and the example here shows why: you're acting as if most movies are hated. Instead they are excluded from the computation entirely. m_x should be 4.5 in the example here. That's consistent with literature and the other implementations earlier in this project. I don't know the Hadoop implementation well enough, and wasn't sure from the comments above, whether it does end up behaving as if it's 4.5 or 3. If it's not 4.5 I would call that a bug. Items that aren't co-rated can't meaningfully be included in this computation. On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning ted.dunn...@gmail.com wrote: Good point Amit. Not sure how much this matters. It may be that PearsonCorrelationSimilarity is bad name that should be PearonInspiredCorrelationSimilarity. My guess is that this implementation is lifted directly from the very early recommendation literature and is reflective of the way that it was used back then.
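For anyone wondering what the log-likelihood ratio test mentioned above looks like in code, Mahout ships a helper for exactly this; a minimal sketch (the 2x2 counts below are invented purely for illustration):

    import org.apache.mahout.math.stats.LogLikelihood;

    public class LlrExample {
      public static void main(String[] args) {
        // 2x2 contingency table for items A and B:
        // k11 = users who interacted with both, k12 = A only,
        // k21 = B only, k22 = neither
        long k11 = 50, k12 = 200, k21 = 150, k22 = 10000;
        double llr = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22);
        System.out.println("LLR = " + llr);  // larger = more surprising co-occurrence
      }
    }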
Re: Question about Pearson Correlation in non-Taste mode
The second link was an article I wrote that led eventually to the dissertation (third link). On Fri, Dec 6, 2013 at 5:15 PM, Jason Xin jason@sas.com wrote: Ted, Is this your doctoral Accurate Methods for the Statistics of Surprise and Coincidence , the second one PDF you attached, or you have another one you can forward to me, your doctoral dissertation? Thanks. Jason Xin -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Friday, December 06, 2013 7:56 PM To: user@mahout.apache.org Subject: Re: Question about Pearson Correlation in non-Taste mode See http://tdunning.blogspot.com/2008/03/surprise-and-coincidence.html http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf http://arxiv.org/abs/1207.1847 On Fri, Dec 6, 2013 at 1:09 PM, Amit Nithian anith...@gmail.com wrote: Hey Sebastian, Thanks again for the explanation. So now you have me intrigued about something else. Why is it that logliklihood ratio test is a better measure for essentially implicit ratings? Are there resources/research papers you can point me to explaining this? Take care Amit On Sun, Dec 1, 2013 at 9:25 AM, Sebastian Schelter ssc.o...@googlemail.comwrote: Hi Amit, No need to excuse for picking on me, I'm happy about anyone digging into the paper :) The reason, I implemented Pearson in this (flawed) way has to do with the way the parallel algorithm works: It never compares two item vectors in memory, instead it preprocesses the vectors and computes sparse dot products in parallel. The centering which is usually done for Pearson correlation is dependent on which pair of vectors you're currently looking at (and doesn't fit the parallel algorithm). We had an earlier implementation that didn't have this flaw, but was way slower than the current one. Rating prediction on explicit feedback data like ratings for which Pearson correlation is mostly used in CF, is a rather academic topic and in science there are nearly no datasets that really require you to go to Hadoop. On the other hand item prediction on implicit feedback data (like clicks) is the common scenario in the majority of industry usecases, but here count-based similarity measures like the loglikelihood ratio test give much better results. The current implementation of Mahout's distributed itembased recommender is clearly designed and tuned for the latter usecase. I hope that answers your question. --sebastian On 01.12.2013 18:10, Amit Nithian wrote: Thanks guys! So the real question is not so much what's the average of the vector with the missing rating (although yes that was a question) but what's the average of the vector with all the ratings specified but the second rating that is not shared with the first user: [5 - 4] vs [4 5 2]. If we agree that the first is 4.5 then is the second one 11/3 or 3 ((4+2)/2)? Taste has this as ((4+2)/2) while distributed mode has it as 11/3. Since Taste (and Lenskit) is sequential, it can (and will only) look at co-occurring ratings whereas the Hadoop implementation doesn't. The paper that Sebastian wrote has a pre-processing step where (for Pearson) you subtract each element of an item-rating vector from the average rating which implies that each item-rating vector is treated independently of each other whereas in the sequential/non-distributed mode it's all considered together. My main reason for posting is because the Taste implementation of item-item similarity differs from the distributed implementation. 
Since I am totally new to this space and these similarities I wanted to understand if there is a reason for this difference and whether or not it matters. Sounds like from the discussion it doesn't matter but understanding why helps me explain this to others. My guess (and I'm glad Sebastian is on this list so he can help confirm/deny this.. sorry I'm not picking on you just happy to be able to talk to you about your good paper) is that considering co-occuring ratings in a distributed implementation would require access to the full matrix which defeats the parallel nature of computing item-item similarity? Thanks again! Amit On Sun, Dec 1, 2013 at 2:55 AM, Sean Owen sro...@gmail.com wrote: It's not an issue of how to be careful with sparsity and subtracting means, although that's a valuable point in itself. The question is what the mean is supposed to be. You can't think of missing ratings as 0 in general, and the example here shows why: you're acting as if most movies are hated. Instead they are excluded from the computation entirely. m_x should be 4.5 in the example here. That's consistent with literature and the other implementations
Re: KMeans cluster analysis
Angelo, The first question is how you intend to define which items are similar. Also, what is the intended use of the clustering? Without knowing that, it is very hard to say how best to do the clustering. For instance, are two records more similar if the records are at the same time of day? Or do you really want to cluster arcs by getting all of the records for a single arc and finding other arcs which have similar characteristics in different weather conditions and times of day? Without some more idea about what is going on, it will not be possible for you to succeed with clustering, nor for us to help you. On Thu, Dec 5, 2013 at 3:38 AM, Angelo Immediata angelo...@gmail.com wrote: Hi First of all I'm sorry if I repeat this question... but it's a pretty old one and I really need some help, since I'm a real newbie to mahout and hadoop. I need to do some cluster analysis by using some data. At the beginning this data may not be too huge, but after some time it can be really huge (I did some calculation and after 1 year this data can be around 37 billion records). Since I have this huge data, I decided to do the cluster analysis by using Mahout on top of Apache Hadoop and its HDFS. Regarding where to store this big amount of data, I decided to use Apache HBase, also on top of Apache Hadoop HDFS. Now I need to do this cluster analysis by considering some environment variables. These variables may be the following: - *recordId* = id of the record - *arcId *= id of the arc between 2 points of my street graph - *mediumVelocity *= average velocity of the considered arc in the specified - *vehiclesNumber* = number of the monitored vehicles used in order to get that velocity - *meteo *= weather condition (a numeric representing if there is sun, rain etc...)
- *manifestation *= a numeric representing if there is any kind of manifestation (sport manifestation or other) - *day of the week* - *month of the year* - *hour of the day* - *vacation *= a numeric representing if it's a vacation day or a working day So my data are so formatted (raw representation): *recordId arcId mediumVelocity vehiclesNumber meteo manifestation weekDay yearMonth dayHour vacation* 1 1 34.5201 34 2011 10 3 2 15666.53 2 51 20086 2 The clustering should be done by taking care of at least these variables: meteo, manifestation, weekDay, dayHour, vacation No in order to take data from HBase I used the MapReduce funcionalities provided by HBase; basically I wrote this code: My MapperReducer class: package hadoop.mapred; import hadoop.hbase.model.HistoricalDataModel; import java.io.IOException; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.hadoop.hbase.client.Result; import org.apache.hadoop.hbase.io.ImmutableBytesWritable; import org.apache.hadoop.hbase.mapreduce.TableMapper; import org.apache.hadoop.hbase.mapreduce.TableReducer; import org.apache.hadoop.hbase.util.Bytes; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.SequenceFile; import org.apache.hadoop.io.Text; import org.apache.hadoop.io.Writable; import org.apache.hadoop.mapred.join.TupleWritable; public class HistoricalDataMapRed { public static class HistoricalDataMapper extends TableMapperText, TupleWritable { private static final Log logger = LogFactory.getLog(HistoricalDataMapper.class.getName()); private int numRecords = 0; @SuppressWarnings({ unchecked, rawtypes }) protected void map(Text key, Result result, org.apache.hadoop.mapreduce.Mapper.Context context) throws IOException, InterruptedException { try{ Writable[] vals = new Writable[4]; IntWritable calFest = new IntWritable(Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_FESTIVO))); vals[0] = calFest; IntWritable calEven = new IntWritable(Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.CALENDARIO_EVENTI))); vals[1] = calEven; IntWritable meteo = new IntWritable(Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.EVENTO_METEO))); vals[2] = meteo; IntWritable manifestazione = new IntWritable(Bytes.toInt(result.getValue(HistoricalDataModel.HISTORICAL_DATA_FAMILY, HistoricalDataModel.MANIFESTAZIONE))); vals[3] = manifestazione; String chiave = Bytes.toString(result.getRow()); Text text = new Text(); text.set(chiave); context.write(text, new TupleWritable(vals)); numRecords++; if ((numRecords % 1) == 0) { context.setStatus(mapper processed + numRecords + records so far); } }catch(Exception e){ String message = Errore nel mapper; messaggio errore: +e.getMessage();
Re: Outlier detection/Pruning
You should move to 0.8 and explore ball k-means. On Tue, Dec 3, 2013 at 8:44 PM, Prabhakar Srinivasan prabhakar.sriniva...@gmail.com wrote: Hello I am using Mahout 0.7 currently and this question is pertaining to that version. I am using Canopy clustering (CanopyDriver class) first to determine the optimal number of clusters that best fits the dataset and passing that information as parameter to Kmeans clustering (kmeansDriver class). Regards Prabhakar On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you be more specific about which code you are asking about? The ball k-means implementation provides a capability somewhat like this, but perhaps in a more clearly defined way. On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan prabhakar.sriniva...@gmail.com wrote: Hello! Can someone point me to some explanatory documentation for Outlier Detection Removal in Clustering in Mahout. I am unable to understand the internal mechanism of outlier detection just by reading the Javadoc: clusterClassificationThreshold Is a clustering strictness / outlier removal parameter. Its value should be between 0 and 1. Vectors having pdf below this value will not be clustered. What does the pdf represent? Thanks Prabhakar
Re: TF-IDF confusion
Ani, I really don't understand your second point. Here is how I view things ... if you can phrase things in those terms, it might help me understand your question. The TF part of TF-IDF refers to the term frequencies in a document. Typically, each possible word is assigned to a positive integer that represents a position in a vector. A term frequency vector is a sparse vector with counts or functions of counts at locations corresponding to the words in a document. If the document has words that do not have assigned positions in the vector, they are either ignored or the counts are put into a special UNKNOWN-WORD position. By definition, there is no way that the term frequency vector can be too long or too short. Likewise, a document's length only matters if the counts get too large to store (completely implausible for this to happen since we use a double). The IDF part of TF-IDF refers to weights that are applied to these TF vectors. These weights are conventionally computed by using the log of the number of documents which have the corresponding word. The IDF weighting has one weight for each position in the term frequency vector, and thus length is again not a problem. This is why I don't understand your second point. Is it that you mean that many of the words in the document do not have assigned positions in the term frequency vector? If so, that means that you didn't analyze the corpus ahead of time to get a good dictionary of word locations. Or is it that you are worried that the counts would be large? On Tue, Dec 3, 2013 at 7:03 AM, Ani Tumanyan a...@bnotions.com wrote: Hello everyone, I'm working on a project where I'm trying to extract topics from news articles. I have around 500,000 articles as a dataset. Here are the steps that I'm following: 1. First of all I'm doing some sort of preprocessing. For this I'm using Behemoth to annotate the documents and get rid of non-English documents, 2. Then I'm running Mahout's sparse vector command to generate TF-IDF vectors. The problem with the TF-IDF vectors is that the number of words for a document is far more than the number of words in the TF vectors. Moreover there are some words/terms in the TF-IDF vector that didn't appear in that specific document at all. Is this correct behaviour, or is there something wrong with my approach? Thanks in advance! Ani
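To make the TF/IDF split above concrete, here is a toy version of the usual weighting: one count per dictionary position for TF, and an IDF weight per position that depends only on document frequency (Mahout's seq2sparse applies a similar formula internally, though its exact smoothing may differ):

    public class TfIdfToy {
      public static void main(String[] args) {
        long numDocs = 500000;   // corpus size from the thread
        int tf = 3;              // occurrences of the word in this document
        long df = 1200;          // documents containing the word
        double idf = Math.log((double) numDocs / df);
        System.out.printf("tf=%d df=%d idf=%.3f tf-idf=%.3f%n", tf, df, idf, tf * idf);
      }
    }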
Re: Outlier detection/Pruning
Can you be more specific about which code you are asking about? The ball k-means implementation provides a capability somewhat like this, but perhaps in a more clearly defined way. On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan prabhakar.sriniva...@gmail.com wrote: Hello! Can someone point me to some explanatory documentation for Outlier Detection Removal in Clustering in Mahout. I am unable to understand the internal mechanism of outlier detection just by reading the Javadoc: clusterClassificationThreshold Is a clustering strictness / outlier removal parameter. Its value should be between 0 and 1. Vectors having pdf below this value will not be clustered. What does the pdf represent? Thanks Prabhakar
Re: Clustering Spatial Data
Peter, What you say is a bit confusing to me. You say you have centers already. But then you talk about algorithms which find the centers. Also, you say you want to assign points based on centers, but you also say that clusters have different shapes, areas, sizes and point counts. Do you mean that assignment should be purely based on proximity to the center and that the shape will be whatever it happens to be as a result? Or do you mean that there is an a priori known shape that has to be taken into account during point assignment? If proximity is the only question, and if you can use great circle distance as your proximity measure, then this problem is fairly easy and can be handled in just a few lines of code. One easy way to handle this is to convert your centers to normalized x, y, z locations using x = cos \lambda cos \phi, y = cos \lambda sin \phi, z = sin \lambda, where \lambda is the latitude and \phi is the longitude. Great circle distance is monotonically related to Euclidean distance in 3-space and thus is inversely monotonically related to the dot product. This means you can sort the centers by distance to a point by simply computing x, y, z for the point and then doing the dot product and sorting in descending order. The nice thing with this is that there are no trig functions inside your inner loop. You can also use the haversine formula, but that requires 3-4 trig functions in the inner loop and is likely to be slower. You don't really need Mahout for this at all (unless I completely misunderstand your problem, which is quite possible). On Mon, Dec 2, 2013 at 1:31 AM, Peter K peat...@yahoo.de wrote: Hi there, I have no experience with mahout but I know that it will solve my problem :) ! I have the following requirements: * No hadoop setup should be necessary. I want a simple approach and I know this is possible with mahout! * I have lots of points (~100 million) but also some RAM (32GB) * I know the clusters upfront via their center positions. * I need to assign every point to exactly one cluster. * Every cluster can have a different shape, area size and point count I've found: http://en.wikipedia.org/wiki/OPTICS_algorithm http://en.wikipedia.org/wiki/DBSCAN Both algorithms do not really pay attention to the fixed cluster centers, but I think I will start there. Is one of them implemented in mahout? Or do you have another idea or hint/link? Regards, Peter.
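A minimal sketch of the x, y, z trick described above: convert every center once, then assign each point to the center with the largest dot product, with no trig in the inner loop (plain Java; the coordinates are made up):

    public class NearestCenter {
      // lat/lon in degrees -> unit vector on the sphere
      static double[] toXyz(double latDeg, double lonDeg) {
        double lat = Math.toRadians(latDeg), lon = Math.toRadians(lonDeg);
        return new double[]{Math.cos(lat) * Math.cos(lon),
                            Math.cos(lat) * Math.sin(lon),
                            Math.sin(lat)};
      }

      // larger dot product == smaller great circle distance
      static int nearest(double[][] centers, double[] p) {
        int best = -1;
        double bestDot = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
          double dot = centers[i][0] * p[0] + centers[i][1] * p[1] + centers[i][2] * p[2];
          if (dot > bestDot) { bestDot = dot; best = i; }
        }
        return best;
      }

      public static void main(String[] args) {
        double[][] centers = {toXyz(52.5, 13.4), toXyz(48.1, 11.6)};
        System.out.println(nearest(centers, toXyz(50.1, 8.7)));   // index of the closer center
      }
    }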
Re: Pig vector project
Elephant bird is distinctly superior to Pig Vector for many things (it moved forward, Pig Vector did not). I believe here is also a Twitter internal project known as PigML which is much more what Pig Vector wanted to be. There is also https://github.com/hanborq/pigml, but I think it is very different. You might ping @pbrane (Jake Mannix jake.man...@gmail.com) or @lintool (Jimmy Lin ji...@twitter.com) to see if they have anything to say on the topic. On Mon, Dec 2, 2013 at 4:14 PM, Andrew Musselman andrew.mussel...@gmail.com wrote: You might also look into elephant-bird from Twitter; covers a lot of ground. https://github.com/kevinweil/elephant-bird On Mon, Dec 2, 2013 at 4:10 PM, Sameer Tilak ssti...@live.com wrote: Hi All,We are using Pig top build our data pipeline. I came across the following:https://github.com/tdunning/pig-vector The last commit was 2 yrs ago. Any information on will there be any further work on this project?
Re: Mahout for clustering
Do you want to cluster users or items? For items, the vectorization that you suggest will work reasonably well, especially if you use TF.IDF weighting and normalize the resulting vectors. You can also use one of the matrix decomposition techniques and cluster the resulting vectors. The spectral clustering system that is part of Mahout will do all of this in one step. SVD + streaming k-means + ball k-means should also work well. On Mon, Dec 2, 2013 at 4:22 PM, Sameer Tilak ssti...@live.com wrote: Hi All, We are using Apache Pig for building our data pipeline. We have data in the following fashion: userid, age, items {code 1, code 2, ….}, few other features... Each item has a unique alphanumeric code. I would like to use mahout for clustering it. Based on my current reading I see the following few options: 1. Map each alphanumeric item code to a numeric code -- A1 -> 0, A2 -> 1, A3 -> 2 etc. Then run the clustering algorithm on the reformatted data and then map the results back onto the real item codes. 2. Represent the info on item codes as a 1 x M matrix where a column represents an item (1 if a given user has viewed a particular item, 0 otherwise); it will have millions of columns. So each user will have an id, age, and this matrix. Not sure if this will work….. We also want to do frequent pattern mining etc. on the same data. Any thoughts on data representation and clustering would be great.
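A hedged sketch of option 2 from the question, along the lines of the advice above: one sparse vector per user over the item-code dictionary, binary (or TF-IDF style) weights, then L2 normalization (the dictionary and codes here are invented):

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class UserItemVectors {
      public static void main(String[] args) {
        // dictionary: alphanumeric item code -> column index
        Map<String, Integer> dict = new HashMap<String, Integer>();
        dict.put("A1", 0); dict.put("A2", 1); dict.put("A3", 2);

        List<String> viewedByUser = Arrays.asList("A1", "A3");
        Vector v = new RandomAccessSparseVector(dict.size());
        for (String code : viewedByUser) {
          v.set(dict.get(code), 1.0);       // or a TF-IDF style weight instead of 1.0
        }
        Vector normalized = v.normalize(2); // unit L2 norm, as suggested above
        System.out.println(normalized);
      }
    }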
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Inline On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote: ... To accomplish this, we used AdaptiveLogisticRegression and trained 46 binary classification models. Our approach has been to do an 80/20 split on the data, holding the 20% back for cross-validation of the models we generate. Sounds reasonable. We've been playing around with a number of different parameters, feature selection, etc. and are able to achieve pretty good results in cross-validation. When you say cross validation, do you mean the magic cross validation that the ALR uses? Or do you mean your 20%? We have a ton of different metrics we're tracking on the results, most significant to this discussion is that it looks like we're achieving very good precision (typically .85 or .9) and a good f1-score (typically again .85 or .9). These are extremely good results. In fact they are good enough I would starting thinking about a target leak. However, when we then take the models generated and try to apply them to some new documents, we're getting many more false positives than we would expect. Documents that should have 2 categories are testing positive for 16, which is well above what I'd expect. By my math I should expect 2 true positives, plus maybe 4.4 (.10 false positives * 44 classes) additional false positives. You said documents. Where do these documents come from? One way to get results just like you describe is if you train on raw news wire that is split randomly between training and test. What can happen is that stories that get edited and republished have a high chance of getting at least one version in both training and test. This means that the supposedly independent test set actually has significant overlap with the training set. If your classifier over-fits, then the test set doesn't catch the problem. Another way to get this sort of problem is if you do your training/test randomly, but the new documents come from a later time. If your classifier is a good classifier, but is highly specific to documents from a particular moment in time, then your test performance will be a realistic estimate of performance for contemporaneous documents but will be much higher than performance on documents from a later point in time. A third option could happen if your training and test sets were somehow scrubbed of poorly structured and invalid documents. This often happens. Then, in the real system, if the scrubbing is not done, the classifier may fail because the new documents are not scrubbed in the same way as the training documents. These are just a few of the ways that *I* have screwed up building classifiers. I am sure that there are more. We suspected that perhaps our models were underfitting or overfitting, hence this post. However, I'll take any and all suggestions for anything else we should be looking at. Well, I think that, almost by definition, you have an overfitting problem of some kind. The question is what kind. The only think that I think that you don't have is a frank target leak in your documents. That would (probably) have given you even higher scores on your test case.
Re: Question about Pearson Correlation in non-Taste mode
Good point Amit. Not sure how much this matters. It may be that PearsonCorrelationSimilarity is a bad name and that it should be PearsonInspiredCorrelationSimilarity. My guess is that this implementation is lifted directly from the very early recommendation literature and is reflective of the way that it was used back then. Remember that the context here is prediction of ratings. If you assume that you really want correlation and that missing elements are zero, then this is mathematically wrong. On the other hand, if you assume missing elements are equal to the mean (whatever it is), then this definition is correct. In any case, I don't think that PearsonCorrelationSimilarity should be fixed at this point. First of all, a substantial change here is somewhat risky since there may be people who depend on current behavior. Second, I think that this is almost never a particularly good recommendation algorithm, so even if the proposed change is a small improvement, it will have negligible positive effect on the universe of production recommenders. Remember that this function is not a stats routine. It is an embodiment of recommendation practice. Were it the former, I would strongly recommend we fix it. On Sat, Nov 30, 2013 at 10:18 AM, Amit Nithian anith...@gmail.com wrote: Hi Ted, Thanks, that is what I would have thought too, but I don't think that the Pearson Similarity (in Hadoop mode) does this: in org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.PearsonCorrelationSimilarity around line 31 double average = vector.norm(1) / vector.getNumNonZeroElements(); Which looks like it's taking the sum and dividing by the number of defined elements. Which would make my [5 - 4] average be 4.5. Thanks again Amit On Fri, Nov 29, 2013 at 10:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote: Hi Ted, Thanks for your response. I thought that the mean of a sparse vector is simply the mean of the defined elements? Why would the vectors become dense unless you're meaning that all the undefined elements (0?) now will be (0-m_x)? Yes. Just so. All those zero elements become non-zero and the vector is thus no longer sparse. Looking at the following example: X = [5 - 4] and Y = [4 5 2], is m_x 4.5 or 3? 3. This is because the elements of X are really 5, 0, and 4. The zero is just not stored, but it still is the value of that element. Is m_y 11/3 or (6/2) because we ignore the 5 since its counterpart in X is undefined? 11/3
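The line Amit quotes is easy to reproduce in isolation; a small sketch showing the two candidate means for X = [5 - 4] (norm(1)/nonZeros gives 4.5, while treating the missing rating as a stored zero gives 3):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SparseMeanExample {
      public static void main(String[] args) {
        Vector x = new RandomAccessSparseVector(3);
        x.set(0, 5);
        x.set(2, 4);   // element 1 is "missing", i.e. an unstored zero

        double meanNonZero = x.norm(1) / x.getNumNonZeroElements(); // 9 / 2 = 4.5
        double meanAll = x.zSum() / x.size();                       // 9 / 3 = 3.0
        System.out.println(meanNonZero + " vs " + meanAll);
      }
    }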
Re: Test naivebayes task running really slowly and not in distributed mode
Did the training run use both machines? How large is the input for the test run? Is it contained in a single file? On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos fernandoleandro1...@gmail.com wrote: Hello everyone, I'm trying to do a text classification task. My dataset is not that big, I have around 700.000 small comments. Following the 20newsgroups example, I created the vector from the text, splited it and trained the model. Now I'm trying to test it but it is really slow and also I cannot make it to run in the cluster. Whatever I do it always just run in one machine. And I think the testnb algorithm is supposed to run using mapReduce, right? I also tried this example here ( http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/ ) but also, the other box in the cluster is not executing any task. In fact, when I execute the testnb or using the MapReduceClassifier proposed in this tutorial above, I get one job, executing one task and this task runs really slowly (like 6 minutes to achieve 0.13% of the task). I think I must be doing something wrong so that the cluster is not working how it is supposed to be. I have a cluster with 2 box configured with hadoop 0.20.205.0 and using mahout 0.8. I also tried versions 0.7 and 0.6 of mahout but nothing changed. Any help would be aprreciated. The logs I have from this task: *stdout logs* Java HotSpot(TM) 64-Bit Server VM warning: You have loaded library /usr/local/hadoop/lib/libhadoop.so which might have disabled stack guard. The VM will try to fix the stack guard now. It's highly recommended that you fix the library with 'execstack -c libfile', or link it with '-z noexecstack'. *syslog logs* 2013-11-30 17:09:19,191 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2013-11-30 17:09:19,400 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists! 2013-11-30 17:09:19,472 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2013-11-30 17:09:19,474 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5810d963 2013-11-30 17:09:19,543 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2013-11-30 17:09:19,569 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 -- Fernando Santos +55 61 8129 8505
Re: Clustering without Hadoop
The new Ball k-means and streaming k-means implementations have non-Hadoop versions. The streaming k-means implementation also has a threaded implementation that runs without Hadoop. The threaded streaming k-means implementation should be pretty fast. On Sun, Dec 1, 2013 at 7:55 PM, Shan Lu shanlu...@gmail.com wrote: Thanks, Suneel, I'll try this way. In this recommender example: https://github.com/ManuelB/facebook-recommender-demo/blob/master/src/main/java/de/apaxo/bedcon/AnimalFoodRecommender.java#L142 , they only use mahout api. So I am thinking that can I do the clustering similarly. On Sun, Dec 1, 2013 at 10:42 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Shan, All of Mahout implementations use Hadoop API, but if u r trying to run kmeans in sequential (non-MapReduce) mode; pass in runSequential = true instead of false as the last parameter to KMeansDriver.run() or Amit run them in LOCAL_MODE as pointed out earlier by Amit. On Sunday, December 1, 2013 10:28 PM, Shan Lu shanlu...@gmail.com wrote: Thanks for your reply. In the example code, they run the k-means algorithm using org.apache.hadoop.conf.Configuration, org.apache.hadoop.fs.FileSystem, and org.apache.hadoop.fs.Path parameters. Is there any algorithm that doesn't need any Configuration and Path parameter, just use the data in memory? I mean, can I run the k-means algorithm without using the hadoop api, just using java? Thanks. On Sun, Dec 1, 2013 at 9:58 PM, Amit Nithian anith...@gmail.com wrote: When you say without hadoop does that include local mode? You can run these examples in local mode that doesn't require a cluster for testing and poking around. Everything then runs in a single jvm. On Dec 1, 2013 9:18 PM, Shan Lu shanlu...@gmail.com wrote: Hi, I am working on a very simple k-means clustering example. Is there a way to run clustering algorithms in mahout without using Hadoop? I am reading the book Mahout in Action. In chapter 7, the hello world clustering code example, they use == KMeansDriver.run(conf, new Path(testdata/points), new Path(testdata/clusters), new Path(output), new EuclideanDistanceMeasure(), 0.001, 10, true, false); == to run the k-means algorithm. How can I run the k-means algorithm without Hadoop? Thanks! Shan -- Shan Lu ECE Dept., NEU, Boston, MA 02115 -- Shan Lu ECE Dept., NEU, Boston, MA 02115
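Putting Suneel's suggestion next to the book snippet quoted above: the same KMeansDriver call runs entirely in one JVM once the last argument is flipped to true (the signature shown is the older Mahout in Action one from the thread; newer releases insert a clusterClassificationThreshold argument, so check the version you are on):

    // k-means without any MapReduce jobs; paths still go through the (local) Hadoop FS API
    KMeansDriver.run(conf,
        new Path("testdata/points"),
        new Path("testdata/clusters"),
        new Path("output"),
        new EuclideanDistanceMeasure(),
        0.001,   // convergence delta
        10,      // max iterations
        true,    // also assign points to clusters after the iterations
        true);   // runSequential = true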
Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
The default with the Mahout encoders is two probes. This is unnecessary with the intercept term, of course, if you protect the intercept term from other updates, possible by encoding other data using a view of the original feature vector. For each probe, a different hash is used so each value is put into multiple locations. Multiple probes are useful in general to decrease the effect of the reduced dimensionality of the hashed representation. On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven paul.van.ho...@gmail.comwrote: For an example program using mahout I use the donut.csv sample data from the project ( https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv ). My code looks like this: import org.apache.mahout.math.RandomAccessSparseVector; import org.apache.mahout.math.Vector; import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder; import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder; import com.csvreader.CsvReader; public class Runner { //Set the path accordingly! public static final String csvInputDataPath = /path/to/donut.csv; public static void main(String[] args) { FeatureVectorEncoder encoder = new StaticWordValueEncoder(features); ArrayListRandomAccessSparseVector featureVectors = new ArrayListRandomAccessSparseVector(); try { CsvReader csvReader = new CsvReader(csvInputDataPath); csvReader.readHeaders(); while( csvReader.readRecord() ) { Vector featureVector = new RandomAccessSparseVector(30); featureVector.set(0, new Double(csvReader.get(x))); featureVector.set(1, new Double(csvReader.get(y))); featureVector.set(2, new Double(csvReader.get(c))); featureVector.set(3, new Integer(csvReader.get(color))); System.out.println(Before: + featureVector.toString()); encoder.addToVector(csvReader.get(shape).getBytes(), featureVector); System.out.println( After: + featureVector.toString()); featureVectors.add((RandomAccessSparseVector) featureVector); } } catch(Exception e) { e.printStackTrace(); } System.out.println(Program is done.); } } What confuses me is the following output (one sample): Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0} After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0} As you can see, I added just one value shape to the vector. However two dimensions of this vector are encoded with 1.0. On the other hand, for some other data I get the output Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0} After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0} Why? I would expect that _always_ only one dimension gets occupied by 1.0 as this is the standard case for categorial encoding. This way this seems to be wrong. Thanks in advance, Paul
Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?
If you always insert 1's for each element, then you can detect collisions by inserting all your elements (or all elements in each document separately) and looking for the max value in the vector. If you see something > 1, you have a collision. But collisions are actually good. The only way to completely avoid them is to use a vector as large as your vocabulary, which is often painfully large. You can also view multiple probes not so much as avoiding collisions, but as making the linear transformation from the very large dimensional representation of one dimension per word to the lower hashed representation more likely to be nearly invertible in the sense that the Euclidean metric will be approximately preserved. Think Johnson-Lindenstrauss random projections. On Fri, Nov 29, 2013 at 1:54 AM, Paul van Hoven paul.van.ho...@gmail.com wrote: Hi, thanks for your quick reply. So multiple probes are a protection against collisions? After playing a little with the default length of a RandomAccessSparseVector object I noticed that (of course) collisions occur when the length is too short. Therefore, I'm asking myself if there is a possibility to check whether a collision occurred after encoding a new value in the vector? This would give a user the information that the length of the chosen vector is too short. So far, I did not find any method in the api to check for that. 2013/11/29 Ted Dunning ted.dunn...@gmail.com: The default with the Mahout encoders is two probes. This is unnecessary with the intercept term, of course, if you protect the intercept term from other updates, possibly by encoding other data using a view of the original feature vector. For each probe, a different hash is used so each value is put into multiple locations. Multiple probes are useful in general to decrease the effect of the reduced dimensionality of the hashed representation. On Fri, Nov 29, 2013 at 1:14 AM, Paul van Hoven paul.van.ho...@gmail.com wrote: For an example program using mahout I use the donut.csv sample data from the project ( https://svn.apache.org/repos/asf/mahout/trunk/examples/src/main/resources/donut.csv ). My code looks like this: import org.apache.mahout.math.RandomAccessSparseVector; import org.apache.mahout.math.Vector; import org.apache.mahout.vectorizer.encoders.FeatureVectorEncoder; import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder; import com.csvreader.CsvReader; public class Runner { //Set the path accordingly!
public static final String csvInputDataPath = /path/to/donut.csv; public static void main(String[] args) { FeatureVectorEncoder encoder = new StaticWordValueEncoder(features); ArrayListRandomAccessSparseVector featureVectors = new ArrayListRandomAccessSparseVector(); try { CsvReader csvReader = new CsvReader(csvInputDataPath); csvReader.readHeaders(); while( csvReader.readRecord() ) { Vector featureVector = new RandomAccessSparseVector(30); featureVector.set(0, new Double(csvReader.get(x))); featureVector.set(1, new Double(csvReader.get(y))); featureVector.set(2, new Double(csvReader.get(c))); featureVector.set(3, new Integer(csvReader.get(color))); System.out.println(Before: + featureVector.toString()); encoder.addToVector(csvReader.get(shape).getBytes(), featureVector); System.out.println( After: + featureVector.toString()); featureVectors.add((RandomAccessSparseVector) featureVector); } } catch(Exception e) { e.printStackTrace(); } System.out.println(Program is done.); } } What confuses me is the following output (one sample): Before: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0} After: {0:0.923307513352484,1:0.0135197141207755,2:0.644866125183976,3:2.0,29:1.0,25:1.0} As you can see, I added just one value shape to the vector. However two dimensions of this vector are encoded with 1.0. On the other hand, for some other data I get the output Before: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:2.0} After: {0:0.711011884035543,1:0.909141522599384,2:0.46035073663368,3:3.0,16:1.0} Why? I would expect that _always_ only one dimension gets occupied by 1.0 as this is the standard case for categorial encoding. This way this seems to be wrong. Thanks in advance, Paul
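A sketch of the max-value collision check described above, together with the probes knob on the encoder (setProbes(1) gives the single-location behaviour Paul expected, at the cost of more sensitivity to collisions; this assumes the 0.8-era encoder API):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    public class CollisionCheck {
      public static void main(String[] args) {
        StaticWordValueEncoder encoder = new StaticWordValueEncoder("shape");
        encoder.setProbes(1);                  // one hashed location per value instead of two

        Vector v = new RandomAccessSparseVector(30);
        for (String shape : new String[]{"circle", "square", "triangle"}) {
          encoder.addToVector(shape, 1.0, v);  // unit weight for every element
        }
        // with unit weights, any cell greater than 1 means two values hashed to the same slot
        System.out.println("collision? " + (v.maxValue() > 1.0));
      }
    }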
Re: Question about Pearson Correlation in non-Taste mode
Well, the best way to compute correlation using sparse vectors is to make sure you keep them sparse. To do that, you must avoid subtracting the mean by expanding whatever formulae you are using. For instance, suppose you are computing (x - m_x e) . (y - m_y e) (here . means dot product and e represents a vector full of 1's). If you do this directly, then you lose all benefit of sparse vectors since subtracting the means makes each vector dense. What you should compute instead is the expanded form x . y - m_x (e . y) - m_y (e . x) + n m_x m_y, where n is the length of the vectors (since e . e = n). The dot product here is sparse, and the expression m_x (e . y) can be computed (at least in Mahout) in map-reduce idiom as y.aggregate(Functions.PLUS, Functions.mult(m_x)) On Fri, Nov 29, 2013 at 9:31 PM, Amit Nithian anith...@gmail.com wrote: Okay, so I rethought my question and realized that the paper never really talked about collaborative filtering but just how to calculate item-item similarity in a scalable fashion. Perhaps this is the reason why the common ratings aren't used? Because that's not a pre-req for this calculation? Although for my own clarity, I'd still like to get a better understanding of what it means to calculate the correlation between sparse vectors where you're normalizing each vector using a separate denominator. P.S. If my question(s) don't make sense please let me know, for it's very possible I am completely misunderstanding something :-). Thanks again! Amit On Wed, Nov 27, 2013 at 8:23 AM, Amit Nithian anith...@gmail.com wrote: Hey Sebastian, Thanks again. Actually I'm glad that I am talking to you as it's your paper and presentation I have questions with! :-) So to clarify my question further, looking at this presentation ( http://isabel-drost.de/hadoop/slides/collabMahout.pdf) you have the following user x item matrix (users A, B, P as rows; items M, A, I as columns): A: 5 1 4; B: - 2 5; P: 4 3 2. If I want to calculate the pearson correlation between Matrix and Inception, I'd have the rating vectors: [5 - 4] vs [4 5 2]. One of the steps in your paper is the normalization step, which subtracts the mean item rating from each value and essentially does the L2 norm of the resulting vector (or in other words, the L2 norm of the mean-centered vector?). The question I have had is what is the average rating for Matrix and Inception? I can see the following: (1) Matrix = 4.5 (9/2), Inception = 3 (6/2), because you only consider shared ratings; (2) Matrix = 3 (9/3), Inception = 3.667 (11/3), assuming that the missing rating is 0; (3) Matrix = 4.5 (9/2), Inception = 3.667 (11/3), subtracting from the average of all non-zero ratings == This is what I believe the current implementation does. Unfortunately, none of these yield the 0.47 listed in the presentation, but that's a separate issue. In my testing, I see that Mahout Taste (non-distributed) uses the 1st approach while the distributed approach uses the 3rd approach. I am okay with #3; however I just want to understand that this is the case and that it's okay. This is why I was asking about pearson correlation between vectors of different lengths, because the average rating is being computed using a denominator (number of users) that is different between the two (2 vs 3). I know you said in practice that people don't use Pearson to compute inferred ratings but this is just for my complete understanding (and since it's the example used in your presentation). This same question applies to cosine, as you are doing an L2 norm of the vector as a pre-processing step and including/excluding non-shared ratings may make a difference. Thanks again!
Amit On Wed, Nov 27, 2013 at 7:13 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: Hi Amit, Yes, it gives different results. However in practice, most people don't do rating prediction with Pearson coefficient, but use count-based measures like the loglikelihood ratio test. The distributed code doesn't look at vectors of different lengths, but simply assumes non-existent ratings as zero. --sebastian On 27.11.2013 16:09, Amit Nithian wrote: Comparing this against the non distributed (taste) gives different answers for item item similarity as of course the non distributed looks only at corated items. I was more wondering if this difference in practice mattered or not. Also I'm confused on how you can compute the Pearson similarity between two vectors of different length which essentially is going on here I think? Thanks again Amit On Nov 27, 2013 9:06 AM, Sebastian Schelter ssc.o...@googlemail.com wrote: Yes, it is due to the parallel algorithm which only looks at co-ratings from a given user. On 27.11.2013 15:02, Amit Nithian wrote: Thanks Sebastian! Is there a particular reason for that? On Nov 27, 2013 7:47 AM, Sebastian Schelter
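Ted's expansion translates almost verbatim to the Mahout vector API; a sketch using the [5 - 4] and [4 5 2] vectors from the thread (note e . y is just y.zSum(), and the aggregate idiom from the post is shown at the end):

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.function.Functions;

    public class SparseCenteredDot {
      // (x - m_x e) . (y - m_y e) computed without densifying either vector
      static double centeredDot(Vector x, Vector y, double mx, double my) {
        int n = x.size();
        return x.dot(y) - mx * y.zSum() - my * x.zSum() + n * mx * my;
      }

      public static void main(String[] args) {
        Vector x = new RandomAccessSparseVector(3);
        x.set(0, 5); x.set(2, 4);
        Vector y = new RandomAccessSparseVector(3);
        y.set(0, 4); y.set(1, 5); y.set(2, 2);

        double mx = 3.0, my = 11.0 / 3.0;   // means with missing entries counted as zero
        System.out.println(centeredDot(x, y, mx, my));   // -5.0

        // the m_x (e . y) term written in the aggregate idiom from the post
        System.out.println(y.aggregate(Functions.PLUS, Functions.mult(mx)));   // 33.0
      }
    }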
Re: Question about Pearson Correlation in non-Taste mode
On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote: Hi Ted, Thanks for your response. I thought that the mean of a sparse vector is simply the mean of the defined elements? Why would the vectors become dense unless you're meaning that all the undefined elements (0?) now will be (0-m_x)? Yes. Just so. All those zero elements become non-zero and the vector is thus no longer sparse. Looking at the following example: X = [5 - 4] and Y = [4 5 2], is m_x 4.5 or 3? 3. This is because the elements of X are really 5, 0, and 4. The zero is just not stored, but it still is the value of that element. Is m_y 11/3 or (6/2) because we ignore the 5 since its counterpart in X is undefined? 11/3
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Yes. Exactly. On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi vishal.santo...@gmail.comwrote: Absolutely. I will read through. The idea is to first fix the learning rate update equation in OLR. I think this code in OnlineLogisticRegression is the current equation ? @Override public double currentLearningRate() { return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() + stepOffset, forgettingExponent); } I presume that you would like Adagrad-like solution to replace the above ? On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, expecially if OLR could be used in the distributed asynchronous learner.
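For contrast with the currentLearningRate() shown above, here is a rough sketch of the Adagrad-style per-feature rate Ted keeps pointing to (this is the textbook rule only, not code that exists anywhere in Mahout):

    public class AdagradRateSketch {
      private final double eta;          // base learning rate
      private final double[] gradSqSum;  // running sum of squared gradients per feature

      AdagradRateSketch(double eta, int numFeatures) {
        this.eta = eta;
        this.gradSqSum = new double[numFeatures];
      }

      // the rate shrinks per feature as evidence for that feature accumulates,
      // instead of decaying globally with the overall step count
      double rateFor(int feature, double gradient) {
        gradSqSum[feature] += gradient * gradient;
        return eta / (1e-6 + Math.sqrt(gradSqSum[feature]));
      }
    }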
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
No problem at all. Kind of funny. On Wed, Nov 27, 2013 at 7:08 AM, Vishal Santoshi vishal.santo...@gmail.comwrote: Sorry to spam, I never meant the Hello to come out as Hell. Given a little disappointment in the mail, I figure I rather spam than be misunderstood, On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi vishal.santo...@gmail.com wrote: Hell Ted, Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation ? How can any one help ? Regards, On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn...@gmail.com wrote: Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however. All this aside, I am happy to help in any way that I can given my recent time limits. [1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used? They are too raw to be accepted uncritically, for sure. They have been used successfully in production. The evolutionary algorithm seems to be the core of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data.. What would it take to get to an implementation? How can anyone help? Would you like to help on this? The amount of work required to get a distributed asynchronous learner up is moderate, but definitely not huge. I think that OnlineLogisticRegression is basically sound, but should get a better learning rate update equation. That would largely make the Adaptive* stuff unnecessary, especially if OLR could be used in the distributed asynchronous learner.
Re: Good centroid generation algorithm for top-down clustering approach
Have you looked at the streaming k-means work? The basic idea is that you generate a sketch of the data which you can then cluster in-memory. That lets you use very advanced centroid generation algorithms that require lots of processing. On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu chjaso...@gmail.com wrote: Hi all, I'm trying to cluster text documents via a top-down approach. I have experimented with both random seed and canopy generation, and have seen their pros and cons. I realize that canopy is useful when the exact number of clusters is not known; nevertheless, its memory requirements are high. I was hoping to find something similar to canopy generation and was wondering if there is any other recommendation?
Re: Detecting high bias and variance in AdaptiveLogisticRegression classification
Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide for handling multiple passes through the data is very easy to screw up. I think that simply separating the data entirely might be a better approach. b) for truly on-line learning where no repeated passes through the data will ever occur, then cross validation is not the best choice. Much better in those cases to use what Google researchers described in [1]. c) it is clear from several reports that the evolutionary algorithm prematurely shuts down the learning rate. I think that Adagrad-like learning rates are more reliable. See [1] again for one of the more readable descriptions of this. See also [2] for another view on adaptive learning rates. d) item (c) is also related to the way that learning rates are adapted in the underlying OnlineLogisticRegression. That needs to be fixed. e) asynchronous parallel stochastic gradient descent with mini-batch learning is where we should be headed. I do not have time to write it, however. All this aside, I am happy to help in any way that I can given my recent time limits. [1] http://research.google.com/pubs/pub41159.html [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf On Tue, Nov 26, 2013 at 12:54 PM, optimusfan optimus...@yahoo.com wrote: Hi- We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class. We're trying to determine whether or not the models are suffering from high bias or variance and were wondering how to do this using Mahout's APIs? I can easily calculate the cross validation error and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this. Or, any other ideas would be appreciated! Thanks, Ian
Re: Algorithms in Mahout
On Mon, Nov 25, 2013 at 3:14 AM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: There are/were multiple kNN implementations in Mahout: Recommender knn http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.6/org/apache/mahout/cf/taste/impl/recommender/knn/Optimizer.java (will be removed for 0.9) stream knn https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/StreamingKMeans.java normal knn Streaming k-means isn't strictly a knn implementation. It is a k-means clustering application.
Re: OnlineLogisticRegression: Are my settings sensible
You are correct that it should work with smaller data as well, but the trade-offs are going to be very different. In particular, some algorithms are completely infeasible at large scale, but are very effective at small scale. Some, like those used in glmnet, inherently require multiple passes through the data. The Mahout committers have generally elected to spend time on larger scale problems, especially where really good small-scale solutions already exist. That could change if somebody wanted to come in and support some set of algorithms (hint, hint). On Fri, Nov 8, 2013 at 3:15 AM, Andreas Bauer b...@gmx.net wrote: Ok, I'll have a look. Thanks! I know mahout is intended for large scale machine learning, but I guess it shouldn't have problems with such small data either. Ted Dunning ted.dunn...@gmail.com wrote: On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote: Hi, Thanks for your comments. I modified the examples from the Mahout in Action book, therefore I used the hashed approach and that's why I used 100 features. I'll adjust the number. Makes sense. But the book was doing sparse features. You say that I'm using the same CVE for all features, so you mean I should create 12 separate CVEs for adding features to the vector like this? Yes. Otherwise you don't get different hashes. With a CVE, the hashing pattern is generated from the name of the variable. For a word encoder, the hashing pattern is generated by the name of the variable (specified at construction of the encoder) and the word itself (specified at encode time). Text is just repeated words except that the weights aren't necessarily linear in the number of times a word appears. In your case, you could have used a goofy trick with a word encoder where the word is the variable name and the value of the variable is passed as the weight of the word. But all of this hashing is really just extra work for you. Easier to just pack your data into a dense vector. Finally, I thought online logistic regression meant that it is an online algorithm, so it's fine to train only once. Does it mean I should invoke the train method over and over again with the same training sample until the next one arrives, or how should I make the model converge (or at least try to with the few samples)? What online really implies is that training data is measured in terms of number of input records instead of in terms of passes through the data. To converge, you have to see enough data. If that means you need to pass through the data several times to fool the learner ... well, it means you have to pass through the data several times. Some online learners are exact in that they always have the exact result at hand for all the data they have seen. Welford's algorithm for computing sample mean and variance is like that. Others approximate an answer. Most systems which are estimating some property of a distribution are necessarily approximate. In fact, even Welford's method for means is really only approximating the mean of the distribution based on what it has seen so far. It happens that it gives you the best possible estimate so far, but that is just because computing a mean is simple enough. With regularized logistic regression, the estimation is trickier and you can only say that the algorithm will converge to the correct result eventually rather than say that the answer is always as good as it can be.
Another way to say it is that the key property of on-line learning is that the learning takes a fixed amount of time and no additional memory for each input example. What would you suggest to use for incremental training instead of OLR? Is mahout perhaps the wrong library? Well, for thousands of examples, anything at all will work quite well, even R. Just keep all the data around and fit the data whenever requested. Take a look at glmnet for a very nicely done in-memory L1/L2 regularized learner. A quick experiment indicates that it will handle 200K samples of the sort you are looking at in about a second, with multiple levels of lambda thrown into the bargain. Versions available in R, Matlab and Fortran (at least). http://www-stat.stanford.edu/~tibs/glmnet-matlab/ This kind of in-memory, single machine problem is just not what Mahout is intended to solve.
Re: Solr-recommender for Mahout 0.9
For recommendation work, I suggest that it would be better to simply code out an explicit OR query. On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Pat, On Nov 7, 2013, at 7:30pm, Pat Ferrel pat.fer...@gmail.com wrote: Another approach would be to weight the terms in the docs by their Mahout similarity strength. But that will be for another day. My current question is whether Lucene looks at word proximity. I see the query syntax supports proximity but I don't see that it is the default, so that's good. Based on your description of what you do (generate an OR query of N terms) then no, you shouldn't be getting a boost from proximity. Note that with edismax you can specify a phrase boost, but it will be on the entire set of terms being searched, so unlikely to come into play even if you were using that. -- Ken On Nov 7, 2013, at 12:41 PM, Dyer, James james.d...@ingramcontent.com wrote: Best to my knowledge, Lucene does not care about the position of a keyword within a document. You could bucket the ids into several fields. Then use a dismax query to boost the top-tier ids more than the second tier, etc. A more fine-grained approach would probably involve a custom Similarity class that scales the score based on its position in the document. If we did this, it might be simpler to index as 1 single-valued field so each id was position+1 rather than position+100, etc. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Pat Ferrel [mailto:pat.fer...@gmail.com] Sent: Thursday, November 07, 2013 1:46 PM To: user@mahout.apache.org Subject: Re: Solr-recommender for Mahout 0.9 Interesting to think about ordering and adjacency. The index ids are sorted by Mahout strength so the first id is the most similar to the row key and so forth. But the query is ordered by recency. In both cases the first id is in some sense the most important. Does Solr/Lucene care about closeness to the top of the doc for queries or indexed docs? I don't recall any mention of this. However adjacency has no meaning in recommendations, though I think it's used in default queries, so I may have to account for that. The object returned is an ordered list of ids. I use only the IDs now, but there are cases when the contents are also of interest; shopping cart/watchlist queries for example. On Nov 7, 2013, at 10:00 AM, Dyer, James james.d...@ingramcontent.com wrote: The multivalued field will obey the positionIncrementGap value you specify (default=100). So for querying purposes, those id's will be 100 (or whatever you specified) positions apart. So a phrase search for adjacent ids would not match, unless you set the slop to >= the positionIncrementGap. Other than this, both scenarios index the same. For stored fields, solr returns an array of values for multivalued fields, which is convenient when writing a UI. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Dominik Hübner [mailto:cont...@dhuebner.com] Sent: Thursday, November 07, 2013 11:23 AM To: user@mahout.apache.org Subject: Re: Solr-recommender for Mahout 0.9 Does anyone know what the difference is between keeping the ids in a space delimited string and indexing a multivalued field of ids? I recently tried the latter since ... it felt right, however I am not sure which of both has which advantages. On 07 Nov 2013, at 18:18, Pat Ferrel pat.fer...@gmail.com wrote: I have dismax (not edismax) but am not using it yet, using the default query, which does use 'AND'. I had much the same thought as I slept on it.
Changing to OR is now working much much better. So obvious it almost bit me, not good in this case... With only a trivially small amount of testing I'd say we have a new recommender on the block. If anyone would like to help eyeball test the thing let me know off-list. There are a few instructions I'll need to give. And it can't handle much load right now due to intentional design limits. On Nov 7, 2013, at 6:11 AM, Dyer, James james.d...@ingramcontent.com wrote: Pat, Can you give us the query it generates when you enter vampire werewolf zombie, q/qt/defType ? My guess is you're using the default query parser with q.op=AND , or, you're using dismax/edismax with a high mm (min-must-match) value. James Dyer Ingram Content Group (615) 213-4311 -Original Message- From: Pat Ferrel [mailto:pat.fer...@gmail.com] Sent: Wednesday, November 06, 2013 5:53 PM To: s...@apache.org Schelter; user@mahout.apache.org Subject: Re: Solr-recommender for Mahout 0.9 Done, BTW I have the thing running on a demo site but am getting very poor results that I think are related to the Solr setup. I'd appreciate any ideas. The sample data has 27,000 items and something like 4000 users. The preference data is
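For readers who want to see what an explicit OR query looks like from client code, here is a minimal SolrJ sketch. The Solr core name (recommender), field name (indicator_items), and query terms are made-up placeholders for illustration, not names from the solr-recommender project.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class OrQueryExample {
      public static void main(String[] args) throws SolrServerException {
        // Hypothetical core and field names; adjust to your own schema.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/recommender");
        SolrQuery query = new SolrQuery("vampire werewolf zombie");
        query.set("df", "indicator_items");   // assumed field holding the indicator item ids
        query.set("q.op", "OR");              // make the default operator OR, not AND
        query.setFields("id", "score");
        query.setRows(20);
        QueryResponse response = solr.query(query);
        for (SolrDocument doc : response.getResults()) {
          System.out.println(doc.getFieldValue("id") + " " + doc.getFieldValue("score"));
        }
      }
    }

Setting q.op per request avoids relying on whatever default operator the schema or request handler happens to use.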
Re: Decaying score for old preferences when using the .refresh()
On Thu, Nov 7, 2013 at 12:50 AM, Gokhan Capan gkhn...@gmail.com wrote: This particular approach is discussed, and proven to increase the accuracy, in "Collaborative Filtering with Temporal Dynamics" by Yehuda Koren. The decay function is parameterized per user, keeping track of how consistent the user behavior is. Note that user-level temporal dynamics does not actually improve the accuracy of ranking. It improves the accuracy of ratings. Since recommendation quality is primarily a precision@20 sort of activity, improving ratings does no good at all. Item-level temporal dynamics is a different beast.
Re: OnlineLogisticRegression: Are my settings sensible
Why is FEATURE_NUMBER != 13? With 12 features that are already lovely and continuous, just stick them in elements 1..12 of a 13 long vector and put a constant value at the beginning of it. Hashed encoding is good for sparse stuff, but confusing for your case. Also, it looks like you only pass through the (very small) training set once. The OnlineLogisticRegression is unlikely to converge very well with such a small number of examples. Finally, in the hashed representation that you are using, you use exactly the same CVE to put all 15 (12?) of the variables into the vector. Since you are using the same CVE, all of these values will be put into exactly the same location, which is going to kill performance since you will get the effect of summing all your variables together. On Thu, Nov 7, 2013 at 1:48 PM, Andreas Bauer b...@gmx.net wrote: Hi, I'm trying to use OnlineLogisticRegression for a two-class classification problem, but as my classification results are not very good, I wanted to ask for support to find out if my settings are correct and if I'm using Mahout correctly. Because if I'm doing it correctly then probably my features are crap... In total I have 12 features. All are continuous values and all are normalized/standardized (this has no effect on the classification performance at the moment). Training samples keep flowing in at a constant rate (i.e. incremental training), but in total it won't be more than a few thousand (class split pos/negative 30:70). My performance measures never really get good, e.g. with approx. 3600 training samples I get f-measure(beta=0.5): 0.38, precision: 0.33, recall: 0.47. The parameters I use are lambda=0.0001, offset=1000, alpha=1, decay_exponent=0.9, learning_rate=50, FEATURE_NUMBER = 100, CATEGORIES_NUMBER = 2. Java code snip:

    private OnlineLogisticRegression olr;
    private ContinuousValueEncoder continousValueEncoder;
    private static final FeatureVectorEncoder BIAS = new ConstantValueEncoder("Intercept");
    ...
    public Training() {
        olr = new OnlineLogisticRegression(CATEGORIES_NUMBER, FEATURE_NUMBER, new L1()); // L2 or ElasticBandPrior do not affect the performance
        olr.lambda(lambda).learningRate(learning_rate).stepOffset(offset).decayExponent(decay_exponent);
        this.continousValueEncoder = new ContinuousValueEncoder("ContinuousValueEncoder");
        this.continousValueEncoder.setProbes(20);
        ...
    }

    public void train(TrainingSample sample, int target) {
        DenseVector denseVector = new DenseVector(FEATURE_NUMBER);
        // sample.getFeatureValue1() ... getFeatureValue15() return a double value
        this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue1(), denseVector);
        ...
        this.continousValueEncoder.addToVector((byte[]) null, sample.getFeatureValue15(), denseVector);
        BIAS.addToVector((byte[]) null, 1, denseVector);
        olr.train(target, denseVector);
    }

It is also interesting to notice that when I use the model, both test and classification always yield probabilities of 1.0 or 0.99xxx for either class.

    result = this.olr.classifyFull(input);
    LOGGER.debug("TrainingSink test: classify real category: " + realCategory + " olr classifier result: " + result.maxValueIndex() + " prob: " + result.maxValue());

Would be great if you could give me some advice. Many thanks, Andreas
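As an illustration of the dense-vector approach Ted describes, here is a minimal sketch that puts a constant 1 in element 0 and the 12 raw feature values in elements 1..12, and makes several passes over the small training set. The parameter values, the pass count, and the List<double[]> input representation are illustrative assumptions, not a tuned or recommended configuration.

    import java.util.List;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    public class DenseOlrSketch {
      private static final int FEATURES = 13;   // constant term plus 12 continuous features
      private static final int CATEGORIES = 2;

      public static OnlineLogisticRegression train(List<double[]> samples, List<Integer> targets) {
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(CATEGORIES, FEATURES, new L1())
                .lambda(1e-4)
                .learningRate(1)
                .stepOffset(1000)
                .decayExponent(0.9);
        // With only a few thousand examples, make several passes so the learner can converge.
        for (int pass = 0; pass < 20; pass++) {
          for (int i = 0; i < samples.size(); i++) {
            olr.train(targets.get(i), encode(samples.get(i)));
          }
        }
        return olr;
      }

      // Constant 1 in element 0, the 12 raw feature values in elements 1..12.
      private static Vector encode(double[] features) {
        Vector v = new DenseVector(FEATURES);
        v.set(0, 1);
        for (int i = 0; i < features.length; i++) {
          v.set(i + 1, features[i]);
        }
        return v;
      }
    }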
Re: OnlineLogisticRegression: Are my settings sensible
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote: Hi, Thanks for your comments. I modified the examples from the Mahout in Action book, therefore I used the hashed approach and that's why I used 100 features. I'll adjust the number. Makes sense. But the book was doing sparse features. You say that I'm using the same CVE for all features, so you mean I should create 12 separate CVEs for adding features to the vector like this? Yes. Otherwise you don't get different hashes. With a CVE, the hashing pattern is generated from the name of the variable. For a word encoder, the hashing pattern is generated by the name of the variable (specified at construction of the encoder) and the word itself (specified at encode time). Text is just repeated words except that the weights aren't necessarily linear in the number of times a word appears. In your case, you could have used a goofy trick with a word encoder where the word is the variable name and the value of the variable is passed as the weight of the word. But all of this hashing is really just extra work for you. Easier to just pack your data into a dense vector. Finally, I thought online logistic regression meant that it is an online algorithm, so it's fine to train only once. Does it mean I should invoke the train method over and over again with the same training sample until the next one arrives, or how should I make the model converge (or at least try to with the few samples)? What online really implies is that training data is measured in terms of number of input records instead of in terms of passes through the data. To converge, you have to see enough data. If that means you need to pass through the data several times to fool the learner ... well, it means you have to pass through the data several times. Some online learners are exact in that they always have the exact result at hand for all the data they have seen. Welford's algorithm for computing sample mean and variance is like that. Others approximate an answer. Most systems which are estimating some property of a distribution are necessarily approximate. In fact, even Welford's method for means is really only approximating the mean of the distribution based on what it has seen so far. It happens that it gives you the best possible estimate so far, but that is just because computing a mean is simple enough. With regularized logistic regression, the estimation is trickier and you can only say that the algorithm will converge to the correct result eventually rather than say that the answer is always as good as it can be. Another way to say it is that the key property of on-line learning is that the learning takes a fixed amount of time and no additional memory for each input example. What would you suggest to use for incremental training instead of OLR? Is mahout perhaps the wrong library? Well, for thousands of examples, anything at all will work quite well, even R. Just keep all the data around and fit the data whenever requested. Take a look at glmnet for a very nicely done in-memory L1/L2 regularized learner. A quick experiment indicates that it will handle 200K samples of the sort you are looking at in about a second, with multiple levels of lambda thrown into the bargain. Versions available in R, Matlab and Fortran (at least). http://www-stat.stanford.edu/~tibs/glmnet-matlab/ This kind of in-memory, single machine problem is just not what Mahout is intended to solve.
Re: Scheduled tasks in Mahout
No. Scheduling is outside of Mahout's scope. On Wed, Oct 30, 2013 at 12:55 PM, Cassio Melo melo.cas...@gmail.com wrote: I wonder if Mahout (more precisely org.apache.mahout.cf.taste package) has any helper class to execute scheduled tasks like fetch data, compute similarity, etc. Thank you Cassio
Re: TravellingSalesman
Actually that isn't quite correct. Watchmaker was removed. That was a genetic algorithm implementation. EP or evolutionary programming still has an implementation in Mahout in the class org.apache.mahout.ep.EvolutionaryProcess. This algorithm is documented here: http://arxiv.org/abs/0803.3838 On Tue, Oct 29, 2013 at 9:33 AM, Suneel Marthi suneel_mar...@yahoo.com wrote: EP has been removed as of mahout 0.7 Sent from my iPhone On Oct 29, 2013, at 9:31 AM, Pavan K Narayanan pavan.naraya...@gmail.com wrote: Hi, is the evolutionary algorithm package still in active development in Mahout? I am interested in running a sample TSP with some benchmark data using 0.7. I entered $ bin/mahout org.apache.mahout.ga.watchmaker.travellingsalesman.TravellingSalesman and got an "unknown program chosen" error. I was actually hoping it would show all the options that we can use with traveling salesman. Can anyone please give me the correct syntax? It is not even to be found in the list of valid program names. Regards,
Re: Mahout 0.8 Random Forest Accuracy
Tim, Yes, RF's are ensemble learners, but that doesn't mean that you couldn't wrap them up with other classifiers to have a higher level ensemble. On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut t...@timpeut.com wrote: Thanks for the info and suggestions everyone. On 19 October 2013 01:00, Ted Dunning ted.dunn...@gmail.com wrote: On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser j.barrett.straus...@gmail.com wrote: How difficult would it be to wrap the RF classifier into an ensemble learner? It is callable. Should be relatively easy. I'm still becoming familiar with machine learning terminology so please forgive my ignorance. I thought that random forests are, by nature, ensemble learners? What exactly do you mean by this?
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote: Has anyone found that Mahout's random forest doesn't perform as well as other implementations? If not, is there any reason why it wouldn't perform as well? This is disappointing, but not entirely surprising. There has been considerably less effort applied to Mahout's random forest package than the comparable R packages. Note, particularly, that the Mahout implementation is not regularized. That could well be a big difference.
Re: Mahout 0.8 Random Forest Accuracy
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser j.barrett.straus...@gmail.com wrote: How difficult would it be to wrap the RF classifier into an ensemble learner? It is callable. Should be relatively easy.
Re: Clustering of text data on external categories
Search engines do cool things. On Fri, Oct 11, 2013 at 7:42 AM, Jens Bonerz jbon...@googlemail.com wrote: what a nice idea :-) really like that approach 2013/10/11 Ted Dunning ted.dunn...@gmail.com You don't need Mahout for this. A very easy way to do this is to gather all the words for each category into a document. Thus: CatA:selling buying sales payment CatB:gathering collecting CatC:information data info Then put these into a text retrieval engine so that you have one document per category. When you get a new document to categorize, just use the document as a query and you will get a list of possible categories back. Make sure you set the default query mode to OR for this. See http://wiki.apache.org/solr/SolrQuerySyntax for more on the syntax. On Fri, Oct 11, 2013 at 5:04 AM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote: Hi, I have a problem that I would like to implement in mahout clustering. I have input text documents with data like below. Document1: This is the first document of selling information. Document2: This is the second document of gathering information. I also have another look-up file with data like below selling:CatA gathering:CatB information:CatC Now I would like to cluster the documents with output being generated as Document1:CatA,CatC Document2:CatB,CatC Please let me know how to achieve this. Thanks, Subbu
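As a toy stand-in for the search-engine setup Ted describes (one document per category, default operator OR), the sketch below scores each category by simple word overlap with the incoming document. It is only meant to show the shape of the idea; a real deployment would index the category documents in Solr or Lucene as suggested above. The category names and word lists are taken from the example in the thread.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Toy stand-in for the "one document per category" trick: score each category
    // by how many of the incoming document's words appear in its word list.
    public class CategoryMatcher {
      private final Map<String, Set<String>> categories = new HashMap<String, Set<String>>();

      public void addCategory(String name, String... words) {
        categories.put(name, new HashSet<String>(Arrays.asList(words)));
      }

      public Map<String, Integer> score(String document) {
        Map<String, Integer> scores = new HashMap<String, Integer>();
        for (Map.Entry<String, Set<String>> cat : categories.entrySet()) {
          int hits = 0;
          for (String token : document.toLowerCase().split("\\W+")) {
            if (cat.getValue().contains(token)) {
              hits++;                      // behaves like an OR query: any overlap counts
            }
          }
          if (hits > 0) {
            scores.put(cat.getKey(), hits);
          }
        }
        return scores;
      }

      public static void main(String[] args) {
        CategoryMatcher matcher = new CategoryMatcher();
        matcher.addCategory("CatA", "selling", "buying", "sales", "payment");
        matcher.addCategory("CatB", "gathering", "collecting");
        matcher.addCategory("CatC", "information", "data", "info");
        System.out.println(matcher.score("This is the first document of selling information."));
        // Prints something like {CatA=1, CatC=1}
      }
    }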
Re: Naive bayes and character n-grams
For language detection, you are going to have a hard time doing better than one of the standard packages for the purpose. See here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones dean.m.jo...@gmail.com wrote: Hi Si, On 10 October 2013 07:59, simon.2.thomp...@bt.com wrote: what do you mean by character n-grams? If you mean things like ab or ui2 then, given that there are so few characters compared to words, is there a problem that can't be solved without a look-up table for n = y (where y < 4ish)? Or are you looking at y > 4 ish, because if so then do you run into the issue of a sudden space explosion? Yes, just tokens in a text broken up into sequences of their constituent characters. In my initial tests, language detection works well where n=3, particularly when including the head and tail bigrams. So I need something to generate the required sequence files from my training data.
Re: Naive bayes and character n-grams
Cool. Sounds like you are ahead of the game. Sent from my iPhone On Oct 10, 2013, at 13:15, Dean Jones dean.m.jo...@gmail.com wrote: On 10 October 2013 12:46, Ted Dunning ted.dunn...@gmail.com wrote: For language detection, you are going to have a hard time doing better than one of the standard packages for the purpose. See here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html Thanks for the pointer Ted. I'm a big fan of the Tika project, we use it for content extraction already. For various reasons though, we have rolled our own language detector (mainly, neither of these packages cover all of the languages we need to identify - language-detection doesn't do Catalan, Tika doesn't do Welsh). Dean.
Re: Naive bayes and character n-grams
Yes. Should work to use character n-grams. There are oddities in the stats because the different n-grams are not independent, but Naive Bayes methods are in such a state of sin that it shouldn't hurt any worse. No... I don't think that there is a capability built in to generate the character n-grams. Should be relatively trivial to build. On Wed, Oct 9, 2013 at 3:18 AM, Dean Jones dean.m.jo...@gmail.com wrote: Hello folks, I see that it's possible to use mahout to train a naive bayes classifier using n-grams as features (or I guess, strictly speaking, mahout can be used to generate sequence files containing n-grams; I suspect the naive bayes trainer is indifferent to the form of features it trains on). Is there any facility to generate character n-grams instead of word n-grams? Thanks, Dean.
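Since there is no built-in facility, here is a minimal sketch of a character n-gram generator of the kind discussed; the '^' and '$' boundary markers are an arbitrary choice for distinguishing head and tail grams, not a Mahout convention.

    import java.util.ArrayList;
    import java.util.List;

    // Minimal character n-gram generator: emits all n-character substrings of a
    // token, optionally with boundary markers so head and tail grams stand out.
    public class CharNgrams {
      public static List<String> ngrams(String token, int n, boolean markBoundaries) {
        String s = markBoundaries ? "^" + token + "$" : token;
        List<String> grams = new ArrayList<String>();
        for (int i = 0; i + n <= s.length(); i++) {
          grams.add(s.substring(i, i + n));
        }
        return grams;
      }

      public static void main(String[] args) {
        // Prints [^he, hel, ell, llo, lo$]
        System.out.println(ngrams("hello", 3, true));
      }
    }

The resulting grams can then be written to sequence files in place of word tokens before running the naive bayes trainer.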
Re: Solr-recommender
Mike, Thanks for the vote of confidence! On Wed, Oct 9, 2013 at 6:13 AM, Michael Sokolov msoko...@safaribooksonline.com wrote: Just to add a note of encouragement for the idea of better integration between Mahout and Solr: On safariflow.com, we've recently converted our recommender, which computes similarity scores w/Mahout, from storing scores and running queries w/Postgres, to doing all that in Solr. It's been a big improvement, both in terms of indexing speed, and more importantly, the flexibility of the queries we can write. I believe that having scoring built in to the query engine is a key feature for recommendations. More and more I am coming to believe that recommendation should just be considered as another facet of search: as one among many variables the system may take into account when presenting relevant information to the user. In our system, we still clearly separate search from recommendations, and we probably will always do that to some extent, but I think we will start to blend the queries more so that there will be essentially a continuum of query options including more or less user preference data. I think what I'm talking about may be a bit different than what Pat is describing (in implementation terms), since we do LLR calculations off-line in Mahout and then bulk load them into Solr. We took one of Ted's earlier suggestions to heart, and simply ignored the actual numeric scores: we index the top N similar items for each item. Later we may incorporate numeric scores in Solr as term weights. If people are looking for things to do :) I think that would be a great software contribution that could spur this effort onward since it's difficult to accomplish right now given the Solr/Lucene indexing interfaces, but is already supported by the underlying data model and query engine. -Mike On 10/2/13 12:19 PM, Pat Ferrel wrote: Excellent. From Ellen's description the first Music use may be an implicit preference based recommender using synthetic data? I'm quickly discovering how flexible Solr use is in many of these cases. Here's another use you may have thought of: Shopping cart recommenders, as goes the intuition, are best modeled as recommending from similar item-sets. If you store all shopping carts as your training data (play lists, watch lists etc.) then as a user adds things to their cart you query for the most similar past carts. Combine the results intelligently and you'll have an item set recommender. Solr is built to do this item-set similarity. We tried to do this for a ecom site with pure Mahout but the similarity calc in real time stymied us. We knew we'd need Solr but couldn't devote the resources to spin it up. On the Con-side Solr has a lot of stuff you have to work around. It also does not have the ideal similarity measure for many uses (cosine is ok but llr would probably be better). You don't want stop word filtering, stemming, white space based tokenizing or n-grams. You would like explicit weighting. A good thing about Solr is how well it integrates with virtually any doc store independent of the indexing and query. A bit of an oval peg for a round hole. It looks like the similarity code is replaceable if not pluggable. Much of the rest could be trimmed away by config or adherence to conventions I suspect. In the demo site I'm working on I've had to adopt some slightly hacky conventions that I'll describe some day. 
On Oct 1, 2013, at 10:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: Pat, Ellen and some folks in Britain have been working with some data I produced from synthetic music fans. On Tue, Oct 1, 2013 at 2:22 PM, Pat Ferrel p...@occamsmachete.com wrote: Hi Ellen, On Oct 1, 2013, at 12:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: As requested, Pat, meet Ellen. Ellen, meet Pat. On Tue, Oct 1, 2013 at 8:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: Tunneling (rat-holing?) into the cross-recommender and Solr+Mahout version. Things to note: 1) The pure Mahout XRecommenderJob needs a cross-LLR or a cross-similairty job. Currently there is only cooccurrence for sparsification, which is far from optimal. This might take the form of a cross RSJ with two DRMs as input. I can't commit to this but would commit to adding it to the XRecommenderJob. 2) output to Solr needs a lot of options implemented and tested. The hand-run test should be made into some junits. I'm slowly doing this. 3) the Solr query API is unimplemented unless someone else is working on that. I'm building one in a demo site but it looks to me like a static recommender API is not going to be all that useful and maybe a document describing how to do it with the Solr query interface would be best, especially for a first step. The reasoning here is that it is so tempting to mix in metadata to the recommendation query that a static API is not so obvious. For the demo site the recommender API
Re: Solr-recommender
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 10/9/13 3:08 PM, Pat Ferrel wrote: Solr uses cosine similarity for its queries. The implementation on github uses Mahout LLR for calculating the item-item similarity matrix but when you do the more-like-this query at runtime Solr uses cosine. This can be fixed in Solr, not sure how much work. It's not clear to me whether it's worth fixing this or not. It would certainly complicate scoring calculations when mixing with traditional search terms. I am pretty convinced it is not worth fixing. This is particularly true because when you fix one count at 1 and take the limiting form of LLR, you get something quite similar to LLR in any case. This means that Solr's current query is very close to what we want theoretically ... certainly at least as close as theory is to practice.
Re: Solr-recommender
On Wed, Oct 9, 2013 at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote: 2) What you are doing is something else that I was calling a shopping-cart recommender. You are using the item-set in the current cart and finding similar, what, items? A different way to tackle this is to store all other shopping carts then use the current cart contents as a more-like-this query against past carts. This will give you items-purchased-together by other users. If you have enough carts it might give even better results. In any case they will be different. Or the shopping cart can be used as a query for the current indicator fields. That gives you an item-based recommendation from shopping cart contents. I am not sure that the more-like-this query buys all that much versus an ordinary query on the indicator fields.
Re: Solr-recommender
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: It sounds like you are doing item-item similarities for recommendations, not actually calculating user-history based recs, is that true? Yes that's true so far. Our recommender system has the ability to provide recs based on user history, but we have not deployed this in our app yet. My plan was simply to query based on all the items in the user's basket - not sure that this would require a different back end? We're not at the moment considering user-user similarity measures. The items in the basket really are kind of a history (a history of the items placed in the basket). It is quite reasonable to use those as a query against indicator fields. It would be nice to generate indicators (aka binarized item-item LLR similarities) from a number of different actions such as view, dwell, scroll, add-to-basket and see which ones or which combos give you the best recommendation.
Re: What are the best settings for my clustering task
It is there, at the very least as part of the streaming k-means code. The abbreviation bkm has been used in the past. In looking at the code just now I don't find any command line invocation of bkm. It should be quite simple to write one and it would be very handy to have a way to run streaming k-means without a map-reduce step as well. As such it might be good to have a new option to streaming k-means to use just bkm in a single thread, to use threaded streaming k-means on a single machine, or to use map-reduce streaming k-means. You up for trying to make a patch? Sent from my iPhone On Oct 6, 2013, at 12:37, Jens Bonerz jbon...@googlemail.com wrote: Hmmm.. has ballkmeans already made it into the 0.8 release? I can't find it in the list of available programs when calling the mahout binary... 2013/10/3 Ted Dunning ted.dunn...@gmail.com What you are seeing here are the cluster centroids themselves, not the cluster assignments. Streaming k-means is a single-pass algorithm to derive these centroids. Typically, the next step is to cluster these centroids using ball k-means. *Those* results can then be applied back to the original (or new) input vectors to get cluster assignments for individual input vectors. I don't have command line specifics handy, but you seem to have done very well already at figuring out the details. On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote: I created a series of scripts to try out streamingkmeans in mahout and increased the number of clusters to a high amount as suggested by Ted. Everything seems to work. However, I can't figure out how to access the actual cluster data at the end of the process. It just gives me output that I cannot really understand... I would expect my product_ids to be referenced to cluster ids... Example of the procedure's output:

    hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
    Input Path: file:MahoutCluster/part-r-0
    Key class: class org.apache.hadoop.io.IntWritable
    Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable
    Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.3130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307...
    Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958...
    Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.61132911475,7950:8.364990234375,8018:8.68657398223877,15433:7.959...
    Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1...
    Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143...
    Key: 5:..

this is my recipe: Step 1 Create a seqfile from my data with Python. It's the product_id (key) and the short normalized description (value) that is written into the sequence file.
Step 2 Create vectors from that data with the following command:

    mahout seq2sparse \
      -i productClusterSequenceData/productClusterSequenceData.seq \
      -o productClusterSequenceData/vectors

Step 3 Cluster the vectors using streamingkmeans with this command:

    mahout streamingkmeans \
      -i productClusterSequenceData/vectors/tfidf-vectors \
      -o MahoutCluster \
      --tempDir /tmp \
      -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \
      -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \
      -k 1 -km 50

Step 4 Export the streamingkmeans cluster data into a textfile (for an extract of the result see above):

    mahout seqdumper -i MahoutCluster > similarProducts.txt

What am I missing? 2013/10/3 Ted Dunning ted.dunn...@gmail.com Yes. That will work. The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N \approx 30 so the sketch will have about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a relatively small number of data points, this clustering step should proceed relatively quickly.
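To connect product_ids back to cluster ids, the step missing from the recipe above is the assignment pass Ted mentions: cluster the sketch centroids down to the final centroids (for example with ball k-means), then assign each original tf-idf vector to its nearest centroid. Here is a minimal sketch of that last step, assuming the final centroids have already been loaded into a List<Vector>; reading them out of the output sequence file is left aside.

    import java.util.List;
    import org.apache.mahout.common.distance.CosineDistanceMeasure;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.math.Vector;

    // Assign each original tf-idf vector to the closest final centroid.
    public class NearestCentroid {
      private final DistanceMeasure distance = new CosineDistanceMeasure();

      public int assign(Vector point, List<Vector> centroids) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.size(); i++) {
          double d = distance.distance(centroids.get(i), point);
          if (d < bestDistance) {
            bestDistance = d;
            best = i;
          }
        }
        return best;   // index of the cluster this point belongs to
      }
    }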
Re: Editing Dictionary Vector Generated
Why do you say that this is unacceptable? If the phrase is the most common way that the word English is used, this isn't such a bad thing. In general, with machine learning, the idea is to let the data speak. If the data say something you don't like, you have to be careful about contradicting it. That said, you might be happier with something other than naive bayes classifiers (which I am guessing you are using). For instance, with regularized logistic regression, if the bigram is sufficiently predictive then the model will prefer to put zero weight on the constituent unigrams. Sent from my iPhone On Oct 4, 2013, at 9:50, Puneet Arora arorapuneet2...@gmail.com wrote: anti is marked as negative, which is also acceptable, but it is also taking English as negative, which is not acceptable
Re: What are the best settings for my clustering task
What you are seeing here are the cluster centroids themselves, not the cluster assignments. Streaming k-means is a single pass algorithm to derive these centroids. Typically, the next step is to cluster these centroids using ball k-means. *Those* results can then be applied back to the original (or new) input vectors to get cluster assignments for individual input vectors. I don't have command line specifics handy, but you seem to have done very well already at figuring out the details. On Oct 3, 2013, at 7:30 AM, Jens Bonerz wrote: I created a series of scripts to try out streamingkmeans in mahout an increased the number of clusters to a high amount as suggested by Ted. Everything seems to work. However, I can't figure out how to access the actual cluster data at the end of the process. It just gives me output that I cannot really understand... I would expect my product_ids being referenced to cluster ids... Example of the procedure's output: hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally Input Path: file:MahoutCluster/part-r-0 Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.clustering.streaming.mapreduce.CentroidWritable Key: 0: Value: key = 8678, weight = 3.00, vector = {37:26.83479118347168,6085:8.162049293518066,4785:10.3130493164,2493:19.677349090576172,2494:16.06648826599121,9659:9.568963050842285,20877:9.307... Key: 1: Value: key = 3118, weight = 14.00, vector = {19457:5.646900812784831,8774:4.746263821919759,9738:1.022495985031128,13301:5.762300491333008,14947:0.6774413585662842,8787:6.841406504313151,14958... Key: 2: Value: key = 2867, weight = 3.00, vector = {15873:10.955257415771484,1615:4.029662132263184,20963:4.979445934295654,3978:5.61132911475,7950:8.364990234375,8018:8.68657398223877,15433:7.959... Key: 3: Value: key = 6295, weight = 1.00, vector = {17113:10.955257415771484,15347:9.568963050842285,15348:10.955257415771484,19845:7.805374622344971,7945:10.262109756469727,15356:18.090286254882812,1... Key: 4: Value: key = 6725, weight = 4.00, vector = {10570:7.64715051651001,14915:6.126943588256836,14947:4.064648151397705,14330:9.414812088012695,18271:2.7172491550445557,14335:19.677349090576172,143... Key: 5:.. this is my recipe: Step 1 Create a seqfile from my data with Python. Its the product_id (key) and the short normalized descripti (value) that is written into the sequence file. Step 2 create vectors from that data with the following command: mahout seq2sparse \ -i productClusterSequenceData/productClusterSequenceData.seq \ -o productClusterSequenceData/vectors \ Step 3 Cluster the vectors using streamingkeans with this command: mahout streamingkmeans \ -i productClusterSequenceData/vectors/tfidf-vectors \ -o MahoutCluster \ --tempDir /tmp \ -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure \ -sc org.apache.mahout.math.neighborhood.FastProjectionSearch \ -k 1 -km 50 \ Step 4 Export the streamingkmeans cluster data into a textfile (for an extract of the result see above) mahout seqdumper \ -i MahoutCluster similarProducts.txt What am I missing? 2013/10/3 Ted Dunning ted.dunn...@gmail.com Yes. That will work. The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N \approx 30 so the sketch will have at about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a relatively small number of data points, this clustering step should proceed relatively quickly. 
On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz jbon...@googlemail.com wrote: thx for your elaborate answer. so if the upper bound on the final number of clusters is unknown in the beginning, what would happen, if I define a very high number that is guaranteed to be the estimated number of clusters. for example if I set it to 10.000 clusters if an estimate of 5.000 is likely, will that work? 2013/10/2 Ted Dunning ted.dunn...@gmail.com The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or less clusters depending on the data and your bound. This sketch is guaranteed to be computed within at most one map-reduce pass. There is a threaded version that runs (fast) on a single machine. The threaded version
Re: Editing Dictionary Vector Generated
On Fri, Oct 4, 2013 at 6:13 AM, Puneet Arora arorapuneet2...@gmail.comwrote: yes you guessed correct that I am using naive bayes, but how can I handle this type of problem. I didn't hear about a problem. You said you didn't like weights on words like English to reflect the fact that they are used in certain contexts. I said that this is the way it should work. Unless you demonstrate that you increase accuracy by changing the weights, I don't know how to go further. Other algorithms are specifically designed so that if the weights on English are redundant, then they will be set to near zero. Naive bayes purposely ignores such redundancy in order to be simpler.
Re: What are the best settings for my clustering task
The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or less clusters depending on the data and your bound. This sketch is guaranteed to be computed within at most one map-reduce pass. There is a threaded version that runs (fast) on a single machine. The threaded version is liable to be faster than the map-reduce version for moderate or smaller data sizes. That sketch can then be used to do all kinds of things that rely on Euclidean distance and still get results within a small factor of the same algorithm applied to all of the data. Typically this second phase is a ball k-means algorithm, but it could easily be a dp-means algorithm [1] if you want a variable number of clusters. Indeed, you could run many dp-means passes with different values of lambda on the same sketch. Note that the sketch is small enough that in-memory clustering is entirely viable and is very fast. For the problem you describe, however, you probably don't need the sketch approach at all and can probably apply ball k-means or dp-means directly. Running many k-means clusterings with differing values of k should be entirely feasible as well with such data sizes. [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz jbon...@googlemail.com wrote: Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map reduce cycles? I just read a paper about the k-means clustering and its underlying algorithm. According to that paper, k-means relies on a preknown/predefined amount of clusters as an input parameter. Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf In my current scenario however, the number of clusters is unknown at the beginning. Maybe k-means is just not the right algorithm for clustering similar products based on their short description text? What else could I use? 2013/10/1 Ted Dunning ted.dunn...@gmail.com At such small sizes, I would guess that the sequential version of the streaming k-means or ball k-means would be better options. On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com wrote: Hello all, I am currently trying create clusters from a group of 50.000 strings that contain product descriptions (around 70-100 characters length each). That group of 50.000 consists of roughly 5.000 individual products and ten varying product descriptions per product. The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc. What would be a good approach to maximise the amound of found clusters (the best possible value would be 5.000 clusters with 10 products each) I adapted the reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count. The question is: what do I need to tweak with regard to the available mahout settings, so the clusters are created as precisely as possible? Many regards! Jens -- View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html Sent from the Mahout User List mailing list archive at Nabble.com.
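To make the dp-means idea concrete, here is an illustrative sequential variant written against Mahout's Vector and DistanceMeasure types. It is not the batch algorithm from [1] and not a Mahout API; it only shows how a distance threshold lambda, rather than a fixed k, controls how many clusters appear.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.mahout.common.distance.DistanceMeasure;
    import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
    import org.apache.mahout.math.Vector;

    // Sequential dp-means-style pass over a small set of points or sketch
    // centroids: a point farther than lambda from every existing centroid
    // starts a new cluster, so the data drives the cluster count.
    public class DpMeansSketch {
      public static List<Vector> cluster(List<Vector> points, double lambda) {
        DistanceMeasure measure = new EuclideanDistanceMeasure();
        List<Vector> centroids = new ArrayList<Vector>();
        List<Integer> counts = new ArrayList<Integer>();
        for (Vector point : points) {
          int nearest = -1;
          double best = Double.POSITIVE_INFINITY;
          for (int i = 0; i < centroids.size(); i++) {
            double d = measure.distance(centroids.get(i), point);
            if (d < best) {
              best = d;
              nearest = i;
            }
          }
          if (nearest < 0 || best > lambda) {
            centroids.add(point.clone());            // open a new cluster
            counts.add(1);
          } else {
            // move the centroid toward the point (running mean of its members)
            int n = counts.get(nearest) + 1;
            Vector c = centroids.get(nearest);
            c.assign(c.times((n - 1.0) / n).plus(point.divide(n)));
            counts.set(nearest, n);
          }
        }
        return centroids;
      }
    }

Running this several times with different values of lambda on the same sketch is one cheap way to explore how many clusters the data supports.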
Re: What are the best settings for my clustering task
Yes. That will work. The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N \approx 30 so the sketch will have at about 300,000 weighted centroids in it. The final clustering will have to process these centroids to produce the desired 5,000 clusters. Since 300,000 is a relatively small number of data points, this clustering step should proceed relatively quickly. On Wed, Oct 2, 2013 at 10:21 AM, Jens Bonerz jbon...@googlemail.com wrote: thx for your elaborate answer. so if the upper bound on the final number of clusters is unknown in the beginning, what would happen, if I define a very high number that is guaranteed to be the estimated number of clusters. for example if I set it to 10.000 clusters if an estimate of 5.000 is likely, will that work? 2013/10/2 Ted Dunning ted.dunn...@gmail.com The way that the new streaming k-means works is that there is a first sketch pass which only requires an upper bound on the final number of clusters you will want. It adaptively creates more or less clusters depending on the data and your bound. This sketch is guaranteed to be computed within at most one map-reduce pass. There is a threaded version that runs (fast) on a single machine. The threaded version is liable to be faster than the map-reduce version for moderate or smaller data sizes. That sketch can then be used to do all kinds of things that rely on Euclidean distance and still get results within a small factor of the same algorithm applied to all of the data. Typically this second phase is a ball k-means algorithm, but it could easily be a dp-means algorithm [1] if you want a variable number of clusters. Indeed, you could run many dp-means passes with different values of lambda on the same sketch. Note that the sketch is small enough that in-memory clustering is entirely viable and is very fast. For the problem you describe, however, you probably don't need the sketch approach at all and can probably apply ball k-means or dp-means directly. Running many k-means clusterings with differing values of k should be entirely feasible as well with such data sizes. [1] http://www.cs.berkeley.edu/~jordan/papers/kulis-jordan-icml12.pdf On Wed, Oct 2, 2013 at 9:11 AM, Jens Bonerz jbon...@googlemail.com wrote: Isn't the streaming k-means just a different approach to crunch through the data? In other words, the result of streaming k-means should be comparable to using k-means in multiple chained map reduce cycles? I just read a paper about the k-means clustering and its underlying algorithm. According to that paper, k-means relies on a preknown/predefined amount of clusters as an input parameter. Link: http://books.nips.cc/papers/files/nips22/NIPS2009_1085.pdf In my current scenario however, the number of clusters is unknown at the beginning. Maybe k-means is just not the right algorithm for clustering similar products based on their short description text? What else could I use? 2013/10/1 Ted Dunning ted.dunn...@gmail.com At such small sizes, I would guess that the sequential version of the streaming k-means or ball k-means would be better options. On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com wrote: Hello all, I am currently trying create clusters from a group of 50.000 strings that contain product descriptions (around 70-100 characters length each). That group of 50.000 consists of roughly 5.000 individual products and ten varying product descriptions per product. 
The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc. What would be a good approach to maximise the amount of found clusters (the best possible value would be 5.000 clusters with 10 products each)? I adapted the reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count. The question is: what do I need to tweak with regard to the available mahout settings, so the clusters are created as precisely as possible? Many regards! Jens -- View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html Sent from the Mahout User List mailing list archive at Nabble.com.
Re: What are the best settings for my clustering task
At such small sizes, I would guess that the sequential version of the streaming k-means or ball k-means would be better options. On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com wrote: Hello all, I am currently trying to create clusters from a group of 50.000 strings that contain product descriptions (around 70-100 characters each). That group of 50.000 consists of roughly 5.000 individual products and ten varying product descriptions per product. The product descriptions are already prepared for clustering and contain a normalized brand name, product model number, etc. What would be a good approach to maximise the amount of found clusters (the best possible value would be 5.000 clusters with 10 products each)? I adapted the reuters cluster script to read in my data and managed to create a first set of clusters. However, I have not managed to maximise the cluster count. The question is: what do I need to tweak with regard to the available mahout settings, so the clusters are created as precisely as possible? Many regards! Jens -- View this message in context: http://lucene.472066.n3.nabble.com/What-are-the-best-settings-for-my-clustering-task-tp4092807.html Sent from the Mahout User List mailing list archive at Nabble.com.
Re: Multidimensional log-likelihood similarity
Yes. You can turn the normal item-item relationships around to get this. What you have is an item x feature matrix. Normally, one has a user x item matrix in cooccurrence analysis and you get an item x item matrix. If you consider the features to be users in the computation, then the resulting indicator matrix would be just what you want. The basic idea is that items would be related if they share features. Two items that have the same feature would be said to co-occur on that feature. Finding anomalous cooccurrence would be what you need to do to find items that co-occur on many features. This works by building a small 2x2 matrix that relates item A and item B. The entries would be feature counts. The upper left entry of the matrix is the number of features that A and B both have, the upper right is the number of features that B has that A does not and so on. Put another way, the columns represent features that A has or does not have respectively and the rows represent the features that B has or does not have respectively. Items that give high root log-likelihood ratio values should be considered connected. Those that have small values for root LLR should be considered not connected. The value of the root-LLR should be discarded after thresholding and should not be considered a measure of the strength of the relationship. I would recommend the same down-sampling that the rowSimilarityJob already does. On Sun, Sep 29, 2013 at 3:40 AM, Mridul Kapoor mridulkap...@gmail.com wrote: Hi I have records - items - with many features. Something like ID, feature1, feature2, ..., featureN Can I leverage Mahout's log-likelihood similarity metrics for calculating the K-Most similar items to a given item X? - Thanks Mridul
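Mahout already exposes the root LLR computation used here, so the per-pair test reduces to filling the 2x2 table described above and calling a single method. The counts in the example below are invented purely for illustration.

    import org.apache.mahout.math.stats.LogLikelihood;

    // Sketch of the 2x2 feature-count table for a pair of items A and B.
    public class ItemFeatureLlr {
      /**
       * @param both    number of features A and B both have (upper left)
       * @param onlyB   features B has that A does not (upper right)
       * @param onlyA   features A has that B does not (lower left)
       * @param neither features in the vocabulary that neither item has (lower right)
       * @return root log-likelihood ratio; threshold it, then discard the value
       */
      public static double rootLlr(long both, long onlyB, long onlyA, long neither) {
        return LogLikelihood.rootLogLikelihoodRatio(both, onlyB, onlyA, neither);
      }

      public static void main(String[] args) {
        // e.g. 40 shared features, 10 unique to B, 15 unique to A, 935 held by neither
        System.out.println(rootLlr(40, 10, 15, 935));
      }
    }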
Re: Mahout in one PC - multiple cores processor
Just runs in one process. Sent from my iPhone On Sep 20, 2013, at 11:32, Fernando Santos fernandoleandro1...@gmail.com wrote: Thanks for the help guys. But do these parts of Mahout that don't work with Hadoop also work with some other distributed file system, or do they just run in one process? 2013/9/20 Ted Dunning ted.dunn...@gmail.com It also depends on what you are doing. Several parts of Mahout have non-Hadoop versions. On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote: It is always possible to run mahout without a cluster on a single machine, but do not expect too much performance gain on it if you are using a huge data set. Such a setup is primarily meant for development and testing purposes on small datasets. If you have a machine with many cores, you can configure hadoop in pseudo-cluster mode and then point mahout to the hadoop directory. Set the number of map and reduce slots in the hadoop conf file to properly utilize the cores of your processor. Thanks, Parnab On Fri, Sep 20, 2013 at 5:27 PM, Fernando Santos fernandoleandro1...@gmail.com wrote: Hello everyone, I'm working with some classification tasks that are taking a long time to be processed. So looking for a solution I found Mahout. Does anyone know if using Mahout without any cluster, just on my computer, gives better performance than not using it? I mean, is it possible to treat the different cores of my computer's processor as if they were a cluster of other machines? Thanks! -- Fernando Santos +55 61 8129 8505
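For reference, the map and reduce slots that Parnab mentions are configured in mapred-site.xml on a Hadoop 1.x pseudo-distributed node. The values below are only an example for an eight-core machine, not a recommendation:

    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>8</value>
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>4</value>
      </property>
    </configuration>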
Re: Mahout in one PC - multiple cores processor
It also depends on what you are doing. Several parts of Mahout have non-Hadoop versions. On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote: It is always possible to run Mahout without a cluster on a single machine, but do not expect too much performance gain from it if you are using a huge data set. Such a setup is primarily meant for development and testing purposes on small datasets. If you have a machine with many cores, you can configure Hadoop in pseudo-cluster mode and then point Mahout to the Hadoop directory. Set the number of map and reduce slots in the Hadoop conf file to properly utilize the cores of your processor. Thanks, Parnab On Fri, Sep 20, 2013 at 5:27 PM, Fernando Santos fernandoleandro1...@gmail.com wrote: Hello everyone, I'm working with some classification tasks that are taking a long time to be processed. So looking for a solution I found Mahout. Does anyone know if using Mahout without any cluster, just on my computer, gives better performance than not using it? I mean, is it possible to treat the different cores of my computer's processor as if they were a cluster of other machines? Thanks! -- Fernando Santos +55 61 8129 8505
Re: Clustering algorithms
Right now the best in terms of speed without losing quality in Mahout is the streaming k-means implementation. One exciting possibility is that you probably can combine a streaming k-means pre-pass with a regularized k-means algorithm in order to get results more like Lingo. You could also follow with a DP-means pass to get an idea of optimal number of clusters. The idea with streaming k-means is that a first pass does a rough clustering into a whole lot of clusters. This pass is fast because only approximate search is needed. It is also adaptive so you only have to specify very roughly how many clusters you will probably be interested in having later. The output is an approximate k-means clustering with many more clusters than you asked for. This output can then be clustered in memory using any weighted clustering algorithm you care to use. For k-means and certain kinds of data, you can even get nice probabilistic accuracy bounds for the combo. On Tue, Sep 17, 2013 at 12:06 PM, Mike Hugo m...@piragua.com wrote: Hello, I'm new to mahout but have been working with Solr, Carrot2 and clustering documents with the Lingo algorithm. This has worked well for us for clustering small sets of search results, but we are now branching out into wanting to cluster larger sets of documents (millions of documents to 10s of millions of document for now). Could someone point me in the right direction as to which of the clustering algorithms I should take a look at first (that would be similar to Lingo)? Thanks, Mike
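To make the two-pass idea above concrete, here is a deliberately simplified toy sketch of the streaming pre-pass. This is not Mahout's actual StreamingKMeans (which also collapses the sketch and uses approximate search); all class and parameter names here are invented for illustration.

    import java.util.ArrayList;
    import java.util.List;

    // Toy streaming sketch: greedily keep far-away points as new sketch centroids and
    // fold nearby points into existing ones with weights. The resulting weighted
    // centroids are small enough to re-cluster in memory with any weighted k-means.
    public class StreamingSketch {
      static class WeightedPoint {
        double[] v; double weight;
        WeightedPoint(double[] v, double weight) { this.v = v.clone(); this.weight = weight; }
      }

      private final List<WeightedPoint> sketch = new ArrayList<>();
      private double cutoff;           // starts small, grows as the sketch fills up
      private final int maxSketchSize; // roughly k * log(N) in the real algorithm

      StreamingSketch(double initialCutoff, int maxSketchSize) {
        this.cutoff = initialCutoff; this.maxSketchSize = maxSketchSize;
      }

      void add(double[] x) {
        WeightedPoint nearest = null; double best = Double.MAX_VALUE;
        for (WeightedPoint c : sketch) {
          double d = dist(c.v, x);
          if (d < best) { best = d; nearest = c; }
        }
        if (nearest == null || best > cutoff) {
          sketch.add(new WeightedPoint(x, 1));
          if (sketch.size() > maxSketchSize) cutoff *= 1.5;  // crude adaptation only
        } else {
          // fold x into the nearest sketch centroid (weighted mean update)
          nearest.weight += 1;
          for (int i = 0; i < x.length; i++) {
            nearest.v[i] += (x[i] - nearest.v[i]) / nearest.weight;
          }
        }
      }

      List<WeightedPoint> sketch() { return sketch; }  // feed this to an in-memory weighted k-means

      private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
      }
    }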
Re: Tuning parameters for ALS-WR
On Wed, Sep 11, 2013 at 12:07 AM, Sean Owen sro...@gmail.com wrote: 2. Do we have to tune the similarityclass parameter in item-based CF? If so, do we compare the mean average precision values based on validation data, and then report the same for the test set? Yes, you are conceptually looking over the entire hyper-parameter space. If the similarity metric is one of those, you are trying different metrics. Grid search, just brute-force trying combinations, works for 1-2 hyper-parameters. Otherwise I'd try randomly choosing parameters, really, or else it will take way too long to explore. You try to pick hyper-parameters 'nearer' to those that have yielded better scores. Or use a real exploration algorithm. For my favorite (hear that horn blowing?) see this article on recorded step meta-mutation: http://arxiv.org/abs/0803.3838 The idea is a randomized search, but with something akin to momentum. This lets you search nasty landscapes with pretty good robustness and smooth ones with fast convergence. The code and theory are simple and there is an implementation in Mahout.
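For readers who want to try the plain random-search option mentioned above (not the recorded-step meta-mutation from the paper), a toy sketch; the parameter ranges and the evaluate() stub are made-up placeholders for "train on the training split, score on the validation split".

    import java.util.Random;

    // Minimal random search over the ALS hyper-parameters discussed in this thread.
    public class RandomSearch {
      public static void main(String[] args) {
        Random rng = new Random(42);
        double bestScore = Double.NEGATIVE_INFINITY;
        double bestAlpha = 0, bestLambda = 0; int bestFeatures = 0;
        for (int trial = 0; trial < 50; trial++) {
          double alpha = Math.pow(10, rng.nextDouble() * 3 - 1);   // 0.1 .. 100, log-uniform
          double lambda = Math.pow(10, rng.nextDouble() * 4 - 3);  // 0.001 .. 10, log-uniform
          int numFeatures = 10 + rng.nextInt(190);                 // 10 .. 200
          double score = evaluate(alpha, lambda, numFeatures);     // e.g. mean average precision
          if (score > bestScore) {
            bestScore = score; bestAlpha = alpha; bestLambda = lambda; bestFeatures = numFeatures;
          }
        }
        System.out.printf("best: alpha=%.3f lambda=%.4f k=%d score=%.4f%n",
            bestAlpha, bestLambda, bestFeatures, bestScore);
      }

      // Placeholder: train ALS with these settings and return the validation metric.
      static double evaluate(double alpha, double lambda, int numFeatures) {
        return -(Math.log(alpha) * Math.log(alpha)) - lambda;  // dummy value, illustration only
      }
    }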
Re: Tuning parameters for ALS-WR
You definitely need to separate into three sets. Another way to put it is that with cross validation, any learning algorithm needs to have test data withheld from it. The remaining data is training data to be used by the learning algorithm. Some training algorithms such as the one that you describe divide their training data into portions so that they can learn hyper-parameters separately from parameters. Whether the learning algorithm does this or uses some other technique to come to a final value for the model has no bearing on whether the original test data is withheld and because the test data has to be unconditionally withheld, any sub-division of the training data cannot include any of the test data. In your case, you hold back 25% test data. Then you divide the remaining 75% into 25% validation and 50% training. The validation set has to be separate from the 50% in order to avoid over-fitting, but the test data has to be separate from the training+validation for the same reason. On Tue, Sep 10, 2013 at 4:22 PM, Parimi Rohit rohit.par...@gmail.comwrote: Hi All, I was wondering if there is any experimental design to tune the parameters of the ALS algorithm in Mahout, so that we can compare its recommendations with recommendations from another algorithm. My datasets have implicit data and I would like to use the following design for tuning the ALS parameters (alpha, lambda, numFeatures). 1. Split the data such that for each user, 50% of the clicks go to train, 25% go to validation, 25% go to test. 2. Create the user and item features by applying the ALS algorithm on training data, and test on the validation set. (We can pick the parameters which minimize the RMSE score; in case of implicit data, Pui - XY’) 3. Once we find the parameters which give the best RMSE value on validation, use the user and item matrices generated for those parameters to predict the top k items and test it with the items in the test set (compute mean average precision). Although the above setting looks good, I have a few questions: 1. Do we have to follow this setting, to compare algorithms? Can't we report the parameter combination for which we get highest mean average precision for the test data, when trained on the train set, without any validation set? 2. Do we have to tune the similarityclass parameter in item-based CF? If so, do we compare the mean average precision values based on validation data, and then report the same for the test set? My ultimate objective is to compare different algorithms but I am confused as to how to compare the best results (based on parameter tuning) between algorithms. Are there any publications that explain this in detail? Any help/comments about the design of experiments is much appreciated. Thanks, Rohit
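A toy sketch of the per-user 50/25/25 split described above, using plain Java collections; the single hard-coded user and the item ids are illustration only.

    import java.util.*;

    // Each user's clicks are shuffled and divided into train/validation/test; the test
    // slice is never shown to any stage of training or hyper-parameter tuning.
    public class PerUserSplit {
      public static void main(String[] args) {
        Map<Long, List<Long>> clicksByUser = new HashMap<>();   // userID -> clicked itemIDs (assumed loaded)
        clicksByUser.put(1L, new ArrayList<>(Arrays.asList(10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L)));

        Map<Long, List<Long>> train = new HashMap<>(), validation = new HashMap<>(), test = new HashMap<>();
        Random rng = new Random(42);
        for (Map.Entry<Long, List<Long>> e : clicksByUser.entrySet()) {
          List<Long> items = new ArrayList<>(e.getValue());
          Collections.shuffle(items, rng);
          int n = items.size();
          int trainEnd = (int) (0.5 * n);
          int validationEnd = trainEnd + (int) (0.25 * n);
          train.put(e.getKey(), items.subList(0, trainEnd));
          validation.put(e.getKey(), items.subList(trainEnd, validationEnd));
          test.put(e.getKey(), items.subList(validationEnd, n));
        }
        System.out.println(train + " / " + validation + " / " + test);
      }
    }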
Re: Solr recommender
On Fri, Sep 6, 2013 at 9:33 AM, Pat Ferrel pat.fer...@gmail.com wrote: One of the unique things about the Solr recommender is online recs. Two scenarios come to mind: 1) ask the user to pick from among a list of videos, taking the picks as preferences and making recs. Make more and see if recs improve. 2) watch the users' detail views during a browsing session and make recs based on those in realtime. A sort of are you looking for something like this? recommender. For #1 I've seen several examples (BTW very few give instant recs). Not sure how they pick what to rate. It seems to me a mix of popular and the videos with the most varying ratings would be best. Since we have thumbs up and down it would be simple to find individual videos with a high degree of both love and hate. Intuitively this would seem to help find the birds of a feather among the reviewers and help put the user in with the right set with the fewest preferences required. For #1, Ken's suggestion of clustering seems quite reasonable. The only diff is that I would tend to pick something near the centroid of the cluster *and* that is very popular. You need to have something people will recognize. Clustering can be done by doing SVD or ALS on the user x thing matrix first or by directly clustering the columns of the user x thing matrix after some kind of IDF weighting. I think that only the streaming k-means currently does well on sparse vectors. #2 seems straightforward. No idea if it will be useful. If #2 doesn't seem useful it may be modified to become the typical, makes recs based on all reviews but also includes recent reviews not yet in the training data. That's OK since we'd want to do it anyway. For #2, I think that this is a great example of multi-modal recommendations. You have browsing behavior and your tomatoes-reviews behavior. Combining that allows you to recommend for people who have only one kind of behavior. Of course, our viewing behavior will be very sparse to start.
Re: Hadoop implementation of ParallelSGDFactorizer
That means If I Recall Correctly. It is internet slang. See also http://en.wiktionary.org/wiki/Appendix:English_internet_slang On Sat, Sep 7, 2013 at 12:39 PM, Tevfik Aytekin tevfik.ayte...@gmail.comwrote: Sebastian, what is IIRC? On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: IIRC the algorithm behind ParallelSGDFactorizer needs shared memory, which is not given in a shared-nothing environment. On 07.09.2013 19:08, Tevfik Aytekin wrote: Hi, There seems to be no Hadoop implementation of ParallelSGDFactorizer. ALSWRFactorizer has a Hadoop implementation. ParallelSGDFactorizer (since it is based on stochastic gradient descent) is much faster than ALSWRFactorizer. I don't know Hadoop much. But it seems to me that a Hadoop implementation of ParallelSGDFactorizer will also be much faster than the Hadoop implementation of ALSWRFactorizer. Is there a specific reason for why there is no Hadoop implementation of ParallelSGDFactorizer? Is it because, since Hadoop operations are already slow, the slowness of ALSWRFactorizer does not matter much? Or is it simply because nobody has implemented it yet? Thanks Tevfik
Re: Mahout readable output
Darius's comments are good. You also have to think about what similar means to you. From the data you describe, I see several possibilities: - geo-location from machine id (if it includes IP address) - content from the query - frequency of posting - diurnal phase of posting (tells us time zone) Once you know what similar means, you can meaningfully talk about next steps. If you assume that only query content matters, then I would consider several approaches: - cluster directly based on query histories using IDF weighting (likely to be kinda sorta lousy results) - use cooccurrence analysis to augment query histories and repeat the clustering - use SVD or ALS to generate user vectors and query term vectors and cluster users using user vectors and then look for coherence. If you want to use geo, the question of scaling comes in. If you want to use time, you have to derive some sort of features. I find latent variable methods useful for this. On Fri, Sep 6, 2013 at 1:25 AM, Darius Miliauskas dariui.miliaus...@gmail.com wrote: Dear Vishal, can you give some code how you performed your mentioned steps: #) Created custom VectorIterable by inheriting IterableVector. #) Created custom VectorIterator by inheriting AbstractIteratorVector #) Model class which will be responsible to pass attribute values (username or data etc) to custom VectorIterator #) Custom VectorIterator.computeNext() will read line, create dense vector having size equal to number of attribute in a row. Can you compile the code? Best, Darius 2013/9/6 Vishal Danech vishal.dan...@gmail.com Hi I have custom log data which contains the following details. 1) UserName 2) MachineId 3) DateTime 4) Data - which contains text - search term etc I would like to use this data to know #) how much time they are spending on browsing etc. #) User based search pattern First problem can be addressed using Hive query. For the second problem, I suppose clustering can be applied and for this I have converted data to vectors. I have used dense vector and applied Canopy algorithm on it. I got an output which I provided as an input to ClusterDump utility but the output I got was not in readable form, I figured out that I need to use named vectors so that Key can be displayed as an output. Here I am facing an issue: how to use NamedVector? I am performing following steps to generate vectors.. #) Created custom VectorIterable by inheriting IterableVector. #) Created custom VectorIterator by inheriting AbstractIteratorVector #) Model class which will be responsible to pass attribute values (username or data etc) to custom VectorIterator #) Custom VectorIterator.computeNext() will read line, create dense vector having size equal to number of attribute in a row. Please let me know how to add NamedVector here so that I can get some readable output from ClusterDump utility. -- Thanks and Regards Vishal Danech
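The NamedVector part of the original question is not directly answered above; for what it's worth, a minimal sketch (paths and values are invented) of wrapping each row in a NamedVector when writing the input sequence file, which is what lets clusterdump print a readable key instead of an opaque one.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Write one named row per record of the log data; clusterdump can then show the name.
    public class WriteNamedVectors {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("vectors/part-r-00000");   // example path
        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, path, Text.class, VectorWritable.class);
        try {
          double[] features = {0.0, 3.5, 1.0};          // one row of your log data (example values)
          NamedVector row = new NamedVector(new DenseVector(features), "user42");
          writer.append(new Text(row.getName()), new VectorWritable(row));
        } finally {
          writer.close();
        }
      }
    }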
Re: Solr recommender
On Sat, Sep 7, 2013 at 2:35 PM, Pat Ferrel p...@occamsmachete.com wrote: ... Clustering can be done by doing SVD or ALS on the user x thing matrix first or by directly clustering the columns of the user x thing matrix after some kind of IDF weighting. I think that only the streaming k-means currently does well on sparse vectors. Was thinking about filtering out all but the top x% of items to get things the user is likely to have heard about if not seen. Do this before any factorizing or clustering. Hmm... My reflex would be to trim *after* clustering so that clustering has the benefit of the long-tail. ... For #2, I think that this is a great example of multi-modal recommendations. You have browsing behavior and your tomatoes-reviews behavior. Combining that allows you to recommend for people who have only one kind of behavior. Of course, our viewing behavior will be very sparse to start. Yes, that's why I'm not convinced it will be useful but an interesting experiment now that we have the online Solr recommender. Soon we'll have category and description metadata from the crawler. We can experiment with things like category boosting if a category trend emerges during the browsing session and I suspect it often does--maybe release date etc. The ease of mixing metadata with behavior is another thing worth experimenting with. Cool. And remember meta-data becomes behavior when you interact with an item since you have just interacted with the meta-data as well. Btw... I am spinning up a team internally and a team at a partner site to help with the Mahout demo. I am trying to generate realistic music consumption data this weekend as well.
Re: lucene.vectors not working
Ahh... That makes a lot of sense. On Thu, Sep 5, 2013 at 11:38 PM, Lauren Massa-Lochridge laurl...@ieee.orgwrote: Ted Dunning ted.dunning at gmail.com writes: OK. So the easy answer strikes out. On Sat, Aug 3, 2013 at 5:04 AM, Swami Kevala swami.kevala at ishafoundation.org wrote: Ted Dunning ted.dunning at gmail.com writes: Does your index actually have term vectors? On Fri, Aug 2, 2013 at 9:00 PM, Swami Kevala swami.kevala at ishafoundation.org wrote: Well yes... I used the example data that was supplied with the Solr 4.3.1 installation. I checked the schema before posting the example docs to the index, and it already had the option termVectors=true set for the includes field by default I've had the same error message only once, using a schema I've had in use over multiple version upgrades. I.e. a schema known to be correctly configured for term vectors. I hadn't noticed that only a minuscule count of documents had been indexed. If I recall correctly, it was well under 100. I never use the example data, but I would check to see that it's all really indexed or try a larger data set in case something changed relative to the example data. Lauren Massa-Lochridge AC7IONABL3
Re: using KmeansDriver with HDFS
On Wed, Sep 4, 2013 at 6:58 PM, Alan Krumholz alan_krumh...@yahoo.com.mxwrote: I pulled that code (org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:215)) and I think it is trying to read a file from one of the paths I passed to the method but with a new instance of the configuration object (not the configuration object I passed to the method but one that doesn't have my HDFS configured) This is quite plausibly a bug. This is a common error when using the HDFS API. Have you checked what happens with 0.8?
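To illustrate the caller-side expectation behind the suspected bug: the Configuration handed to the driver is the one that knows where HDFS lives, so if library code silently does new Configuration() instead, paths resolve against the default filesystem and reads fail. The namenode URI and path below are made-up examples; fs.default.name is the Hadoop 1.x key (fs.defaultFS in Hadoop 2).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConfigExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode.example.com:9000");  // Hadoop 1.x key
        FileSystem fs = FileSystem.get(conf);
        // This resolves against the HDFS named above only because conf carries that setting.
        System.out.println(fs.exists(new Path("/user/alan/clusters/part-r-00000")));
      }
    }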
Re: Has anyone implemented true L-LDA out of Mahout?
I haven't seen any discussion of this other than what you reference. On Thu, Sep 5, 2013 at 7:59 AM, Henry Lee honesthe...@gmail.com wrote: I am about to implement Jake Mannix's suggestion out of Twitter fork. Has anyone already implemented true L-LDA out of Mahout? http://markmail.org/message/cm2a6rnxblj5azuh over this fork? https://github.com/twitter/mahout/blob/master/core/src/main/java/org/apache/mahout/clustering/lda/cvb/CVB0PriorMapper.java Thanks, Henry Lee
Re: Tweaking ALS models to filter out highly related items when an item has been purchased
I think that Dominik's comments are exactly on target. As far as implementation is concerned, I think that it is very important to not distort the basic recommendation algorithm with business rules like this. It is much better to post-process the results to impose your will directly. One exception to this is that I think it is reasonable to use ordered cooccurrence and also repeated cooccurrence here for some hints. This lets you determine likely accessories (purchased after the main item, mostly) and also find razor-blades (highly repetitive purchases). You still have the problem of flooding with similar items. The diversity that you are talking about is a critical quality in recommendation results. The basic intuition is that recommendation results are not individual recommendations, but are included in a portfolio of recommendations. You need the diversity in this portfolio because if you are wrong about an item, the likelihood of being wrong about very similar items is high. If you flood the first and second pages with these similar items, then you don't have room for the alternative items that might well be correct. My approach in the past was to define heuristic definitions for too similar and do a pass over the sorted recommendation results giving each item that passes the too-similar criterion a penalty score. When done with this, I re-sort the results and the duplicative content falls to the bottom of the recommendations. On Thu, Sep 5, 2013 at 1:15 AM, Dominik Hübner cont...@dhuebner.com wrote: Just a quick assumption, maybe I have not thought this through enough: 1. Users probably tend to compare products = similar VIEWS 2. Users as well might tend to PURCHASE accessory products, like the laptop bag you mentioned Maybe you could filter out products that have a similarity computed from the product views, but leave those similar, based on purchases, in your recommendation set? Nevertheless, I guess this will strongly depend on the domain the data comes from. On Sep 5, 2013, at 10:07 AM, Nick Pentreath nick.pentre...@gmail.com wrote: Hi all Say I have a set of ecommerce data (views, purchases etc). I've built my model using implicit feedback ALS. Now, I want to add a little bit of smart filtering. Filtering based on not recommending something that has been purchased is straightforward, but I'd like to also filter so as not to recommend highly similar items to someone who has purchased an item. In other words, if someone has just purchased a laptop, then I'd like to not recommend other laptops. Ideally while still recommending related items such as laptop bags, mouse etc etc. (this is just an example). Now, I could filter based on metadata tags like category, but assuming I don't always have that data, then simplistically I have the option of filtering out products based on those that have high cosine similarity to the purchased products. However, this risks filtering out good similar products (like the laptop bags) as well as the bad similar products. I'm experimenting with building a second variant of the model that effectively downweights views to near zero, hence leaving something sort of like a purchased together model variant. 
Then recommendations can be made using this model when a user purchases an item (or perhaps a re-scorer that is a weighted variant of model A and model B but that tends to weight model B - the purchased together model - higher) Are there other mechanisms to tweak the ALS model such that it tends towards recommending related products (but not highly similar of the exact same narrow product type)? Any other ideas about how best to go about this? Many thanks Nick
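A toy sketch of the penalty-and-resort pass described earlier in this thread: walk the recommendations in ranked order, penalize anything "too similar" to an item already kept, then re-sort so duplicative content sinks. The tooSimilar() heuristic and the 0.5 penalty are illustration values, not a prescription.

    import java.util.*;

    public class DiversityRescorer {
      static class Rec { long itemId; double score; Rec(long i, double s) { itemId = i; score = s; } }

      static List<Rec> rescore(List<Rec> ranked) {
        List<Rec> kept = new ArrayList<>();
        for (Rec r : ranked) {
          for (Rec earlier : kept) {
            if (tooSimilar(earlier.itemId, r.itemId)) {
              r.score *= 0.5;   // penalty; tune to taste
              break;
            }
          }
          kept.add(r);
        }
        kept.sort((a, b) -> Double.compare(b.score, a.score));
        return kept;
      }

      // Placeholder heuristic: e.g. cosine similarity of item feature vectors above a cutoff,
      // or "same narrow category" if you have metadata.
      static boolean tooSimilar(long itemA, long itemB) {
        return false;
      }
    }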
Re: ALS and SVD feature vectors
On Wed, Sep 4, 2013 at 10:59 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Now, what happens in the case of SVD? The vectors are normal by definition. Are singular values used at all, or just left and right singular vectors? SVD does not take weights so it cannot ignore or weigh out a non-observation, which is why it is not well suited for the matrix completion problem per se There are multiple ways to read the use of weights here. In the original posting, I think the gist was how to treat the singular values, not how to weight different observations. Mahout's SSVD allows the singular values to be kept separate, to be applied entirely to the left or right singular vectors or to be split across both in a square root sort of way.
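A plain-array sketch of the three conventions mentioned above for folding the singular values into the factor matrices; this is just the algebra, not Mahout's SSVD API, and the mode numbering is invented for illustration.

    // U is users x k, V is items x k, sigma holds the k singular values.
    public class FoldSingularValues {
      // mode 0: leave factors as-is (keep sigma separate)
      // mode 1: fold all of sigma into U      -> U * diag(sigma), V unchanged
      // mode 2: split sigma as square roots   -> U * diag(sqrt(sigma)), V * diag(sqrt(sigma))
      static void fold(double[][] u, double[][] v, double[] sigma, int mode) {
        for (int j = 0; j < sigma.length; j++) {
          double uScale = mode == 1 ? sigma[j] : mode == 2 ? Math.sqrt(sigma[j]) : 1.0;
          double vScale = mode == 2 ? Math.sqrt(sigma[j]) : 1.0;
          for (double[] row : u) row[j] *= uScale;
          for (double[] row : v) row[j] *= vScale;
        }
      }
    }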
Re: Cannot build source version mahout-distribution-0.8
You also have to watch out in the case of web errors. Maven can store an error message instead of a well formed file in your repo leading to all kinds of confusion. Try deleting thus: rm -rf ~/.m2/repository/com/ibm On Tue, Aug 27, 2013 at 7:37 AM, Stevo Slavić ssla...@gmail.com wrote: Hello Michael, Seems like a temporary Maven Central repo mirror(s) issue. I've just tried several times to open with a browser http://repo1.maven.org/maven2/org/apache/maven/plugins/ and sometimes it responds well, and a few times it returns an empty page. So, please try again. Kind regards, Stevo Slavic. On Tue, Aug 27, 2013 at 3:59 PM, Michael Wechner michael.wech...@wyona.comwrote: Hi I have downloaded http://mirror.switch.ch/mirror/apache/dist/mahout/0.8/mahout-distribution-0.8-src.zip and tried to build it with mvn -DskipTests clean install on Mac OS X 10.6.8 with Java 1.6.0_45 and Maven 3.0.4 but received the following errors: [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Mahout Build Tools ........ SUCCESS [13.168s] [INFO] Apache Mahout ........ SUCCESS [2.823s] [INFO] Mahout Math ........ SUCCESS [1:02.822s] [INFO] Mahout Core ........ SUCCESS [1:26.430s] [INFO] Mahout Integration ........ FAILURE [1:45.435s] [INFO] Mahout Examples ........ SKIPPED [INFO] Mahout Release Package SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 4:31.448s [INFO] Finished at: Tue Aug 27 15:19:31 CEST 2013 [INFO] Final Memory: 27M/123M [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal on project mahout-integration: Could not resolve dependencies for project org.apache.mahout:mahout-integration:jar:0.8: Could not transfer artifact com.ibm.icu:icu4j:jar:49.1 from/to central (http://repo.maven.apache.org/maven2): GET request of: com/ibm/icu/icu4j/49.1/icu4j-49.1.jar from central failed: Premature end of Content-Length delimited message body (expected: 7407144; received: 4098921) - [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn goals -rf :mahout-integration Does anybody else experience the same problem? Thanks Michael
Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations
Yes. Correlation is a problem because tables like 1 0 0 10^6 and 10 0 0 10^6 produce the same correlation. LLR correctly distinguishes these cases. On Mon, Aug 19, 2013 at 7:16 AM, Pat Ferrel p...@occamsmachete.com wrote: Which is why LLR would be really nice in two action cross-similairty case. The cross-corelation sparsification via cooccurrence is probably pretty weak, no? On Aug 18, 2013, at 11:53 AM, Ted Dunning ted.dunn...@gmail.com wrote: Outside of the context of your demo, suppose that you have events a, b, c and d. Event a is the one we are centered on and is relatively rare. Event b is not so rare, but has weak correlation with a. Event c is as rare as a, but correlates strongly with it. Even d is quite common, but has no correlation with a. The 2x2 matrices that you would get would look something like this. In each of these, a and NOT a are in rows while other and NOT other are in columns. versus b, llrRoot = 8.03 b NOT b a *10* *10* NOT a *1000* *99000* versus c, llrRoot = 11.5 c NOT c a *10* *10* NOT a *30* *99970* versus d, llrRoot = 0 d NOT d a *10* *10* NOT a *5* *5* Note that what we are holding constant here is the prevalence of a (20 times) and the distribution of a under the conditions of the other symbol. What is being varied is the distribution of the other symbol in the NOT a case. On Sun, Aug 18, 2013 at 10:50 AM, B Lyon bradfl...@gmail.com wrote: Thanks folks for taking a look. I haven't sat down to try it yet, but wondering how hard it is to construct (realizable and realistic) k11, k12, k21, k22 values for three binary sequences X, Y, Z where (X,Y) and (Y,Z) have same co-occurrence, but you can tweak k12 and k21 so that the LLR values are extremely different in both directions. I assume that k22 doesn't matter much in practice since things are sparse and k22 is huge. Well, obviously, I guess you could simply switch the k12/k21 values between the two sequence pairs to flip the order at will... which is information that co-occurrence of course does not know about. On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is nice. As you say, k11 is the only part that is used in cooccurrence and it doesn't weight by prevalence, either. This size analysis is hard to demonstrate much difference because it is hard to show interesting values of LLR without absurdly string coordination between items. On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote: As part of trying to get a better grip on recommenders, I have started a simple interactive visualization that begins with the raw data of user-item interactions and goes all the way to being able to twiddle the interactions in a test user vector to see the impact on recommended items. This is for simple user interacted with an item case rather than numerical preferences for items. The goal is to show the intermediate pieces and how they fit together via popup text on mouseovers and dynamic highlighting of the related pieces. I am of course interested in feedback as I keep tweaking on it - not sure I got all the terminology quite right yet, for example, and might have missed some other things I need to know about. Note that this material is covered in Chapter 6.2 in MIA in the discussion on distributed recommenders. 
It's on googledrive here (very much a work-in-progress): https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/ (apologies to small resolution screens) This is based only on the co-occurrence matrix, rather than including the other similarity measures, although in working through this, it seems that the other ones can just be interpreted as having alternative definitions of what * means in matrix multiplication of A^T*A, where A is the user-item matrix... and as an aside to me begs the interesting question of [purely hypotheticall?] situations where LLR and co-occurrence are at odds with each other in making recommendations, as co-occurrence seems to be just using the k11 term that is part of the LLR calculation. My goal (at the moment at least) is to eventually continue this for the solr-recommender project that started as few weeks ago, where we have the additional cross-matrix, as well as a kind of regrouping of pieces for solr. -- BF Lyon http://www.nowherenearithaca.com -- BF Lyon http://www.nowherenearithaca.com
Re: Setting up a recommender
Pat, That really sounds great. I should find some time (who needs sleep) to generate music logs for you as well. On Mon, Aug 19, 2013 at 8:31 AM, Pat Ferrel p...@occamsmachete.com wrote: There are three things I could work on my free time: 1) test this on a bigger data set gathered from rotten tomatoes, which only has B data (movie thumbs up) 2) begin work on the Solr query and service integration, rather than the current loose LucidWorks Search integration. 3) make sure everything is set up for different item spaces in B and A. Planning to tackle in this order, unless someone speaks up. On Aug 16, 2013, at 1:39 PM, Pat Ferrel pat.fer...@gmail.com wrote: Works on a cluster but have only tested on the trivial test data set. On Aug 13, 2013, at 4:49 PM, Pat Ferrel p...@occamsmachete.com wrote: OK single action recs are working so output to Solr with only [B'B] and B. On Aug 13, 2013, at 10:52 AM, Pat Ferrel pat.fer...@gmail.com wrote: Corrections inline On Aug 13, 2013, at 10:49 AM, Pat Ferrel pat.fer...@gmail.com wrote: I finally got some time to work on this and have a first cut at output to Solr working on the github repo. It only works on 2-action input but I'll have that cleaned up soon so it will work with one action. Solr indexing has not been tested yet and the field names and/or types may need tweaking. It takes the result of the previous drop: 1) DRMs for B (user history or B items action1) and A (user history of A items action2) 2) DRMs for [B'B] using LLR, and [B'A] using cooccurrence There are two final outputs created using mapreduce but requiring 2 in-memory hashmaps. I think this will work on a cluster (the hashmaps are instantiated on each node) but haven't tried yet. It orders items in #2 fields by strength of link, which is the similarity value used in [B'B] or [B'A]. It would be nice to order #1 by recency but there is no provision for passing through timestamps at present so they are ordered by the strength of preference. This is probably not useful and so can be ignored. Ordering by recency might be useful for truncating queries by recency while leaving the training data containing 100% of available history. 1) It joins #1 DRMs to produce a single set of docs in CSV form, which looks like this: id,history_b,history_a u1,iphone ipad,iphone ipad galaxy ... 2) it joins #2 DRMs to produce a single set of docs in CSV form, which looks like this: id,b_b_links,b_a_links iphone,iphone ipad,iphone ipad galaxy … It may work on a cluster, I haven't tried yet. As soon as someone has some large-ish sample log files I'll give them a try. Check the sample input files in the resources dir for format. https://github.com/pferrel/solr-recommender On Aug 13, 2013, at 10:17 AM, Pat Ferrel p...@occamsmachete.com wrote: When I started looking at this I was a bit skeptical. As a Search engine Solr may be peerless, but as yet another NoSQL db? However getting further into this I see one very large benefit. It has one feature that sets it completely apart from the typical NoSQL db. The type of queries you do return fuzzy results--in the very best sense of that word. The most interesting queries are based on similarity to some exemplar. Results are returned in order of similarity strength, not ordered by a sort field. Wherever similarity based queries are important I'll look at Solr first. SolrJ looks like an interesting way to get Solr queries on POJOs. It's probably at least an alternative to using docs and CSVs to import the data from Mahout. 
On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. That would be interesting. On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote: A little digression: Might a Matrix implementation backed by a Solr index and uses SolrJ for querying help at all for the Solr recommendation approach? It supports multiple fields of String, Text, or boolean flags. Best Gokhan On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote: Also a question about user history. I was planning to write these into separate directories so Solr could fetch them from different sources but it occurs to me that it would be better to join A and B by user ID and output a doc per user ID with three fields, id, A item history, and B item history. Other fields could be added for users metadata. Sound correct? This is what I'll do unless someone stops me. On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote: Once you have a sample or example of what you think the log file version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs. I'll make sure we can digest it. I thought more about the ingest part and I don't think the one-item-space is actually a problem. It just
Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations
Outside of the context of your demo, suppose that you have events a, b, c and d. Event a is the one we are centered on and is relatively rare. Event b is not so rare, but has weak correlation with a. Event c is as rare as a, but correlates strongly with it. Even d is quite common, but has no correlation with a. The 2x2 matrices that you would get would look something like this. In each of these, a and NOT a are in rows while other and NOT other are in columns. versus b, llrRoot = 8.03 b NOT b a *10* *10* NOT a *1000* *99000* versus c, llrRoot = 11.5 c NOT c a *10* *10* NOT a *30* *99970* versus d, llrRoot = 0 d NOT d a *10* *10* NOT a *5* *5* Note that what we are holding constant here is the prevalence of a (20 times) and the distribution of a under the conditions of the other symbol. What is being varied is the distribution of the other symbol in the NOT a case. On Sun, Aug 18, 2013 at 10:50 AM, B Lyon bradfl...@gmail.com wrote: Thanks folks for taking a look. I haven't sat down to try it yet, but wondering how hard it is to construct (realizable and realistic) k11, k12, k21, k22 values for three binary sequences X, Y, Z where (X,Y) and (Y,Z) have same co-occurrence, but you can tweak k12 and k21 so that the LLR values are extremely different in both directions. I assume that k22 doesn't matter much in practice since things are sparse and k22 is huge. Well, obviously, I guess you could simply switch the k12/k21 values between the two sequence pairs to flip the order at will... which is information that co-occurrence of course does not know about. On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is nice. As you say, k11 is the only part that is used in cooccurrence and it doesn't weight by prevalence, either. This size analysis is hard to demonstrate much difference because it is hard to show interesting values of LLR without absurdly string coordination between items. On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote: As part of trying to get a better grip on recommenders, I have started a simple interactive visualization that begins with the raw data of user-item interactions and goes all the way to being able to twiddle the interactions in a test user vector to see the impact on recommended items. This is for simple user interacted with an item case rather than numerical preferences for items. The goal is to show the intermediate pieces and how they fit together via popup text on mouseovers and dynamic highlighting of the related pieces. I am of course interested in feedback as I keep tweaking on it - not sure I got all the terminology quite right yet, for example, and might have missed some other things I need to know about. Note that this material is covered in Chapter 6.2 in MIA in the discussion on distributed recommenders. It's on googledrive here (very much a work-in-progress): https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/ (apologies to small resolution screens) This is based only on the co-occurrence matrix, rather than including the other similarity measures, although in working through this, it seems that the other ones can just be interpreted as having alternative definitions of what * means in matrix multiplication of A^T*A, where A is the user-item matrix... and as an aside to me begs the interesting question of [purely hypotheticall?] 
situations where LLR and co-occurrence are at odds with each other in making recommendations, as co-occurrence seems to be just using the k11 term that is part of the LLR calculation. My goal (at the moment at least) is to eventually continue this for the solr-recommender project that started as few weeks ago, where we have the additional cross-matrix, as well as a kind of regrouping of pieces for solr. -- BF Lyon http://www.nowherenearithaca.com -- BF Lyon http://www.nowherenearithaca.com
Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations
This is nice. As you say, k11 is the only part that is used in cooccurrence and it doesn't weight by prevalence, either. This size analysis is hard to demonstrate much difference because it is hard to show interesting values of LLR without absurdly strong coordination between items. On Fri, Aug 16, 2013 at 8:21 PM, B Lyon bradfl...@gmail.com wrote: As part of trying to get a better grip on recommenders, I have started a simple interactive visualization that begins with the raw data of user-item interactions and goes all the way to being able to twiddle the interactions in a test user vector to see the impact on recommended items. This is for simple user interacted with an item case rather than numerical preferences for items. The goal is to show the intermediate pieces and how they fit together via popup text on mouseovers and dynamic highlighting of the related pieces. I am of course interested in feedback as I keep tweaking on it - not sure I got all the terminology quite right yet, for example, and might have missed some other things I need to know about. Note that this material is covered in Chapter 6.2 in MIA in the discussion on distributed recommenders. It's on googledrive here (very much a work-in-progress): https://googledrive.com/host/0B2GQktu-wcTiWHRwZFJacjlqODA/ (apologies to small resolution screens) This is based only on the co-occurrence matrix, rather than including the other similarity measures, although in working through this, it seems that the other ones can just be interpreted as having alternative definitions of what * means in matrix multiplication of A^T*A, where A is the user-item matrix... and as an aside to me begs the interesting question of [purely hypothetical?] situations where LLR and co-occurrence are at odds with each other in making recommendations, as co-occurrence seems to be just using the k11 term that is part of the LLR calculation. My goal (at the moment at least) is to eventually continue this for the solr-recommender project that started a few weeks ago, where we have the additional cross-matrix, as well as a kind of regrouping of pieces for solr. -- BF Lyon http://www.nowherenearithaca.com
Re: Install mahout 0.8 with hadoop 2.0
Honest feedback is always welcome on this mailing list. Don't ever worry about flames for that. Don't forget that mr v1 is an option with hadoop 2. Confusing as that may be. Iterative algos are, as you say, very important. My current inclination is to lean toward a downpour style of implementation. That fits well with yarn but it also actually fits reasonably with mr v1. Sent from my iPhone On Aug 13, 2013, at 20:13, Carlos Mundi cmu...@gmail.com wrote: Anyway, I apologize if anyone takes offense. None is meant, so please flame me off-list if you must. But since I self-identify as a member of the small demand set Ted Dunning describes, I figure I can chime in. As always, YMMV.
Re: Install mahout 0.8 with hadoop 2.0
No. There is very small demand for Mahout on Hadoop 2.0 so far and the forward/backward incompatibility of 2.0 has made it difficult to motivate moving to 2.0. The bigtop guys built a maven profile for 0.23 some time ago. I don't know the status of that. I don't think that the differences are huge ... it is just the standard Hadoop forklift-the-world upgrade experience. On Tue, Aug 13, 2013 at 6:49 AM, Sergey Svinarchuk ssvinarc...@hortonworks.com wrote: Hi all, Somebody compile and install mahout with hadoop 2.0? If yes, that what changes you make in mahout, that it have 100% passed unit tests and successful work with hadoop 2.0? Thanks
Re: RowSimilarityJob, sampleDown method problem
Why do you think this? On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote: Mahout 0.9 snapshot RowSimilarityJob.java , sampleDown method line 291 or 300 double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; return either 0.0 or 1.0, not fraction. needs (double) casting BR Sam
Re: RowSimilarityJob, sampleDown method problem
Ouch. Sorry... your original posting made it sound like you *wanted* it to be 0.0 or 1.0. This is a bug. Can you file a JIRA? On Tue, Aug 13, 2013 at 12:04 PM, sam wu swu5...@gmail.com wrote: say column a has 1000 entries, maxPref=700 rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; we get rowSampleRate = 0.0 (not 0.7) do we totally skip this column or sample column entries with .7 probability (roughly get 700 entries) On Tue, Aug 13, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: Why do you think this? On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote: Mahout 0.9 snapshot RowSimilarityJob.java, sampleDown method line 291 or 300 double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; returns either 0.0 or 1.0, not a fraction; needs (double) casting BR Sam
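For reference, the fix being discussed is a small cast; a minimal before/after illustration with the example counts from the thread.

    int observationsPerRow = 1000;
    int maxObservationsPerRow = 700;

    // Buggy: both operands are integers, so the division truncates to 0 (or 1).
    double buggyRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;

    // Fixed: promote to double before dividing, giving 0.7 here.
    double fixedRate = (double) Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;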
Re: Help regarding Seq2sparse utility
Ah. I get it. Ish. I think, but am not entirely sure that there are two outputs possible that you might be seeing. One is the centroids of the vectors themselves. These tend to densify, but I am not sure if these actually are dense vectors (I would tend to think so). That might be what you are seeing. The second is the assignment of your original vectors to the nearest cluster. Here, the vector is just your original vector. This output could be in the form of a cluster id followed by the id's on all the vectors in that cluster. That doesn't look like what you are seeing. Can you say what actual commands you are running? Without that, it is a bit hard to say what you are seeing. On Sun, Aug 11, 2013 at 10:57 PM, Ashwini P ashwini.a...@gmail.com wrote: Hi Ted, My apologies for not framing the question on clusterdumper properly. I am getting the output from clusterdumper in the expected format. A sample vector from the clusterdumper output is as shown below: 1.0: /all-exchanges-strings.lc.txt = [amex:0.161, ase:0.161, asx:0.161, biffex:0.161, bse:0.161, cboe:0.161, cbt:0.161, cme:0.161, comex:0.161, cse:0.161, fox:0.136, fse:0.161, hkse:0.161, ipe:0.161, jse:0.161, klce:0.161, klse:0.161, liffe:0.161, lme:0.161, lse:0.161, mase:0.161, mise:0.161, mnse:0.161, mose:0.161, nasdaq:0.161, nyce:0.161, nycsce:0.161, nymex:0.161, nyse:0.161, ose:0.161, pse:0.161, set:0.136, simex:0.161, sse:0.161, stse:0.161, tose:0.161, tse:0.161, wce:0.161, zse:0.161] What I originally wanted to know is: are these vectors just the way clusterdumper prints them (i.e. are they dense vectors), or are they sparse vectors and the clusterdumper iterates over the non-zero values and prints only those values? If they are sparse vectors, can you kindly tell me in which directory the vectors generated by the algorithm are so I can read them. If the vectors are in dense format then I need to convert them to sparse vectors. As can be seen from the clusterdump output sample above, only the features which have non-zero values for each vector are being printed. The set of features which have non-zero values will differ from vector to vector. Consider we have 3 vectors f1,f2,f3 each with a set of nonzero features s1,s2 and s3 respectively. What I want is a set S={s1 U s2 U s3} i.e. S is the union of the sets of non-zero features for each vector so that I can convert the dense vectors to sparse vectors. Your thoughts on this are welcome. Thanks, Ashvini On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning ted.dunn...@gmail.com wrote: Aside from your issues with clusterdumper, the values you want can be had from a sparse vector using v.iterateNonZero() and v.norm(0). The issue with clusterdumper is odd. Are you saying that the display shows all the components of the vector? Or that there is an in-memory representation that has been densified? On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P ashwini.a...@gmail.com wrote: Hello, I am new to Mahout. I want to know how I can get the list of features that were extracted from the corpus by seq2sparse and the count of the total number of features. My problem is that when I view the clustering output using clusterdumper I get only dense vectors for each point that belongs in the cluster but I want the sparse vector for each point. What I want to know is that are the vectors output from the clustering algorithm stored as dense vector or is the clusterdumper converting the vectors to dense vectors. 
If the clustering algorithm generates sparse vectors I can directly use them or else I will have to convert the vectors from dense to sparse for which I need the information mentioned in the above paragraph. Your suggestions on this are welcome. Thanks, Ashvini
Re: Clustering for customer segmentation
The tasks that you need to do include: a) group your history by user id b) extract the features you want to use from each user history c) repeat clustering and adjusting the scaling of your features until you are happy If you have a few hundred examples of customers broken down by the segmentation that you want, then one thing that you might look at is this paper: http://www.cs.cmu.edu/~epxing/papers/Old_papers/xing_nips02_metric.pdf It shows a method for learning a metric that optimizes clustering of labeled and unlabeled points. Mahout currently does not have support for this kind of metric learning, but it would make an excellent addition. On Sat, Aug 10, 2013 at 11:54 AM, Martin, Nick nimar...@pssd.com wrote: Hi all, I'm new to Mahout and wondering if anyone could point me in the right direction for doing customer purchase behavior clustering in Mahout. Seems most of what I encounter in online and book examples for clustering is text/document based. Basically, I'd like to be able to explore passing n years of customer transaction data into one of the clustering algorithms and have my customer population be segmented into similar groups. Key determinants of similarity would be things like sales volume, purchase frequency, sales channel, profitability, tenure, category mix, etc. Anywhere I can see examples of this kind of thing? Thanks!! Nick Sent from my iPhone
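A toy sketch of steps a) and b) above for transaction data: group by customer id, derive a couple of numeric features, and scale them so no single feature dominates the distance metric. The feature choices and the log scaling are illustrative assumptions, not a prescription.

    import java.util.*;

    public class CustomerFeatures {
      static class Txn { long customerId; double amount; long timestamp; }

      static Map<Long, double[]> featurize(List<Txn> txns) {
        // a) group the history by customer id
        Map<Long, List<Txn>> byCustomer = new HashMap<>();
        for (Txn t : txns) {
          byCustomer.computeIfAbsent(t.customerId, k -> new ArrayList<>()).add(t);
        }
        // b) extract features per customer, c) scale them before clustering
        Map<Long, double[]> features = new HashMap<>();
        for (Map.Entry<Long, List<Txn>> e : byCustomer.entrySet()) {
          double volume = e.getValue().stream().mapToDouble(t -> t.amount).sum();
          double frequency = e.getValue().size();
          // log-scale skewed features so clustering distances stay comparable
          features.put(e.getKey(), new double[] {Math.log1p(volume), Math.log1p(frequency)});
        }
        return features;
      }
    }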
Re: Clustering for customer segmentation
On Mon, Aug 12, 2013 at 12:52 PM, Martin, Nick nimar...@pssd.com wrote: I'd love to contribute so I'll get on JIRA and sign up for the dev@mailing list to start getting a feel for that process. Sounds like you already know the drill. Welcome!
Re: Setting up a recommender
Yes. That would be interesting. On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote: A little digression: Might a Matrix implementation backed by a Solr index and uses SolrJ for querying help at all for the Solr recommendation approach? It supports multiple fields of String, Text, or boolean flags. Best Gokhan On Wed, Aug 7, 2013 at 9:42 PM, Pat Ferrel pat.fer...@gmail.com wrote: Also a question about user history. I was planning to write these into separate directories so Solr could fetch them from different sources but it occurs to me that it would be better to join A and B by user ID and output a doc per user ID with three fields, id, A item history, and B item history. Other fields could be added for users metadata. Sound correct? This is what I'll do unless someone stops me. On Aug 7, 2013, at 11:25 AM, Pat Ferrel p...@occamsmachete.com wrote: Once you have a sample or example of what you think the log file version will look like, can you post it? It would be great to have example lines for two actions with or without the same item IDs. I'll make sure we can digest it. I thought more about the ingest part and I don't think the one-item-space is actually a problem. It just means one item dictionary. A and B will have the right content, all I have to do is make sure the right ranks are input to the MM, Transpose, and RSJ. This in turn is only one extra count of the # of items in A's item space. This should be a very easy change If my thinking is correct. On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
Re: Help regarding Seq2sparse utility
Aside from your issues with clusterdumper, the values you want can be had from a sparse vector using v.iterateNonZero() and v.norm(0). The issue with clusterdumper is odd. Are you saying that the display shows all the components of the vector? Or that there is an in-memory representation that has been densified? On Sun, Aug 11, 2013 at 9:24 PM, Ashwini P ashwini.a...@gmail.com wrote: Hello, I am new to Mahout. I want to know how I can get the list of features that were extracted from the corpus by seq2sparse and the count of the total number of features. My problem is that when I view the clustering output using clusterdumper I get only dense vectors for each point that belongs in the cluster but I want the sparse vector for each point. What I want to know is that are the vectors output from the clustering algorithm stored as dense vector or is the clusterdumper converting the vectors to dense vectors. If the clustering algorithm generates sparse vectors I can directly use them or else I will have to convert the vectors from dense to sparse for which I need the information mentioned in the above paragraph. Your suggestions on this are welcome. Thanks, Ashvini
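A minimal illustration of the two calls mentioned above, against the Mahout 0.7/0.8 Vector API; the cardinality and values are made up.

    import java.util.Iterator;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class SparseVectorWalk {
      public static void main(String[] args) {
        Vector v = new RandomAccessSparseVector(100000);   // cardinality is an example value
        v.setQuick(42, 0.161);
        v.setQuick(7, 0.136);

        // Iterate only the non-zero elements of the sparse vector.
        Iterator<Vector.Element> it = v.iterateNonZero();
        while (it.hasNext()) {
          Vector.Element e = it.next();
          System.out.println(e.index() + " -> " + e.get());
        }
        // norm(0) counts the non-zero entries.
        System.out.println("non-zero features: " + (long) v.norm(0));
      }
    }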
Re: Changing weightings in kmeans
Check out the streaming k-means code. It provides capabilities for weighted samples. On Sat, Aug 10, 2013 at 6:57 AM, William Moran echofo...@gmail.com wrote: Hi, How would I go about changing the weighting of certain words when preparing data for kmeans? Also, in clusterdumps I have already made, some of my clusters are marked 'VL-' and some are 'CL-'. I believe this is to do with convergence, is it bad if the clusters have not converged and if so how can I ensure they do converge? Thanks (P.S. I did send a question similar to this a while ago but I'm not sure it worked)
Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender
On Fri, Aug 9, 2013 at 12:30 PM, Matt Molek mpmo...@gmail.com wrote: From some local IR precision/recall testing, I've found that user based recommenders do better on my data, so I'd like to stick with user based if I can. I know precision/recall measures aren't always that important when dealing with recommendation, but in the case I'm using the recommender for, I think it's worth maximizing. I'm getting more than double the precision out of the user based recommenders. What kind of user based recommender are you using? Most competitive user based recommenders can be restated as item-based recommenders. Those are much easier to deploy.
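For reference, a minimal Taste-API sketch of the kind of boolean-preference, user-based recommender under discussion; the file name, neighborhood size, similarity choice, and user id are made-up example values, not Matt's actual setup.

    import java.io.File;
    import java.util.List;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class BooleanUserBasedExample {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("prefs.csv"));   // userID,itemID per line
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(50, similarity, model);
        Recommender recommender =
            new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
        List<RecommendedItem> recs = recommender.recommend(42L, 10);
        for (RecommendedItem rec : recs) {
          System.out.println(rec.getItemID() + " " + rec.getValue());
        }
      }
    }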
Re: Is OnlineSummarizer mergeable?
I just looked at the source for QDigest from streamlib. I think that the memory usage could be trimmed substantially, possibly by as much as 5:1 by using more primitive friendly structures. On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I need percentiles. Ideally not pre-defined ones, because one person may want e.g. 70th pctile, while somebody else might want 75th pctile for the same metric. Deal breakers: High memory footprint. (high means higher than QDigest from stream-lib for us and we could test and compare with QDigest relatively easily with live data) Algos that create data structures that cannot be merged Loss of accuracy that is not predictably small or configurable Thank you, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Wednesday, August 7, 2013 11:48 PM Subject: Re: Is OnlineSummarizer mergeable? Otis, What statistics do you need? What guarantees? On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I'm actually trying to find an alternative to QDigest (the stream-lib impl specifically) because even though it seems good, we have to deal with crazy volumes of data in SPM (performance monitoring service, see signature)... I'm hoping we can find something that has both a lower memory footprint than QDigest AND that is mergeable a la QDigest. Utopia? Thanks, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, August 7, 2013 4:51 PM Subject: Re: Is OnlineSummarizer mergeable? It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Is OnlineSummarizer algo mergeable? Say that we compute a percentile for some metric for time 12:00-12:01 and store that somewhere, then we compute it for 1201-12:02 and store that separately, and so on. Can we then later merge these computed and previously stored percentile instances and get an accurate value? Thanks, Otis -- Performance Monitoring -- http://sematext.com/spm Solr ElasticSearch Support -- http://sematext.com/
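For readers unfamiliar with the class being compared against QDigest here, a minimal usage sketch of Mahout's OnlineSummarizer; the sample data is synthetic.

    import java.util.Random;
    import org.apache.mahout.math.stats.OnlineSummarizer;

    // Feed the summarizer samples and read off quartiles. Note it tracks the five quartile
    // points (0=min .. 4=max), not arbitrary percentiles, and as discussed in this thread
    // merging two summarizers is not generally safe.
    public class SummarizerExample {
      public static void main(String[] args) {
        OnlineSummarizer summarizer = new OnlineSummarizer();
        Random rng = new Random(42);
        for (int i = 0; i < 100000; i++) {
          summarizer.add(rng.nextGaussian());    // e.g. one latency measurement
        }
        System.out.println("median    = " + summarizer.getMedian());
        System.out.println("quartile3 = " + summarizer.getQuartile(3));
      }
    }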
Re: RecommenderJob Recommending an Item Already Preferred by a User
That might slow down the job enormously for certain nasty inputs. The more that I think about things, the more convinced I am that there should be a post-processing pass to enforce things like not recommending input items. The recommendation algorithm itself should not be distorted to do this if it is unnatural (and forcing a user to not use sampling is a great example ... there should be two controls here). I think that the original point is also correct, however. The user should not be forced to implement this very common step. As such I think that the recommender code should still support doing this, but it really ought to be as an output filter. On Wed, Aug 7, 2013 at 9:19 AM, Sebastian Schelter s...@apache.org wrote: if you also set --maxPrefsPerUserInItemSimilarity to a number higher than the max preferences per user, no sampling should occur. This might slow down the job however. 2013/8/7 Rafal Lukawiecki ra...@projectbotticelli.com Is there a set of parameters which I could pass to RecommenderJob to avoid that random sampling, in order to create a test case for the issue I have experienced? Would setting --maxSimilaritiesPerItem and/or --maxPrefsPerUserInItemSimilarity help? Many thanks. On 7 Aug 2013, at 16:12, Sebastian Schelter ssc.o...@googlemail.com wrote: It could affect the results even in this case, as we also sample the preferences when computing similar items. On 07.08.2013 17:07, Rafal Lukawiecki wrote: Thank you, Sebastian. Would the random sampling affect the results of RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the actual, maximum number of preferences expressed by every user. Rafal On 7 Aug 2013, at 15:48, Sebastian Schelter ssc.o...@googlemail.com wrote: The code in trunk allows to you to specify a randomSeed, the older versions don't unfortunately. On 07.08.2013 16:35, Rafal Lukawiecki wrote: Hi Sebastian, The quantity of returned duplicates is much too large to be caused just by sampling's randomness. I wonder if this could be related to something that is platform-specific, as in Windows vs. *nix representation of input files, data types etc. For argument's sake, is it possible to fix the seed of the random aspect of the sampling so I could feed the same input through two platforms and compare the results? Rafal On 7 Aug 2013, at 15:20, Sebastian Schelter ssc.o...@googlemail.com wrote: Hi Rafal, this sounds really strange, the bug should not have anything to do with the version of Hadoop that you are running. You could sometimes not see it due to the random sampling of the preferences. --sebastian On 07.08.2013 13:53, Rafal Lukawiecki wrote: Sebastian, I've been doing a little more digging regarding the issue of preferences being calculated for already preferred items. I re-run the jobs using the same data and the same parameters on a different installation of Hadoop, and the problem seems to have gone away. For now it looks like the issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show up, yet in my tests, under Hadoop 1.2.1 compiled for OS X. I will work a little more to ensure my results, but if they stood up, should I still report it as a Mahout issue? Rafal -- Rafal Lukawiecki Strategic Consultant and Director Project Botticelli Ltd On 1 Aug 2013, at 17:31, Sebastian Schelter s...@apache.org wrote: Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests. 
2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com Should I have set that parameter to a value much, much larger than the maximum number of preferences actually expressed by a user? I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's. Rafal -- Rafal Lukawiecki Pardon brevity, mobile device. On 1 Aug 2013, at 17:19, Sebastian Schelter s...@apache.org wrote: Ok, please file a bug report detailing what you've tested and what results you got. Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me. 2013/8/1 Rafal Lukawiecki ra...@projectbotticelli.com Hi Sebastian, I've rechecked the results, and I'm afraid the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8, I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as: user,item (one per line) as a --filterFile, with and without the --maxPrefsPerUser, and
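For readers who end up implementing this filtering step themselves until such an output filter exists, the post-processing pass is straightforward once each user's training-set items are in memory. A minimal sketch (the class name and the knownItemsByUser map are invented for the illustration; only RecommendedItem is an actual Mahout type):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;

/** Drops recommendations for items the user already expressed a preference on. */
public final class KnownItemFilter {

  public static List<RecommendedItem> filter(Map<Long, Set<Long>> knownItemsByUser,
                                             long userId,
                                             List<RecommendedItem> recommendations) {
    Set<Long> known = knownItemsByUser.get(userId);
    if (known == null || known.isEmpty()) {
      return recommendations;               // nothing to filter for this user
    }
    List<RecommendedItem> kept = new ArrayList<RecommendedItem>();
    for (RecommendedItem item : recommendations) {
      if (!known.contains(item.getItemID())) {
        kept.add(item);                     // keep only items the user has not already preferred
      }
    }
    return kept;
  }
}
```

If the final list must still contain N items, recommend a few more than N and trim after filtering.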
Re: How to get human-readable output for large clustering?
Mahout is a library. You can link against any version you like and still have a perfectly valid Hadoop program. On Wed, Aug 7, 2013 at 11:51 AM, Adam Baron adam.j.ba...@gmail.com wrote: Suneel, Unfortunately no, we're still on Mahout 0.7. My team is one of many teams which share a large, centrally administered Hadoop cluster. The admins are pretty strict about only installing official CDH releases. I don't believe Mahout 0.8 is in an official CDH release yet. Has the ClusterDumper code changed in 0.8? Regards, Adam On Tue, Aug 6, 2013 at 9:00 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Adam, Pardon my asking again if this has already been answered - Are you running against Mahout 0.8? -- From: Adam Baron adam.j.ba...@gmail.com To: user@mahout.apache.org; Suneel Marthi suneel_mar...@yahoo.com Sent: Tuesday, August 6, 2013 6:56 PM Subject: Re: How to get human-readable output for large clustering? Suneel, I was trying -n 25 and -b 100 when I sent my e-mail about it not working for me. Just tried -n 20 and got the same error message. Any other ideas? Thanks, Adam On Mon, Aug 5, 2013 at 7:40 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Adam/Florian, Could you try running clusterdump with a limit on the number of terms, by specifying -n 20 (outputs the top 20 terms)? From: Adam Baron adam.j.ba...@gmail.com To: user@mahout.apache.org Sent: Monday, August 5, 2013 8:03 PM Subject: Re: How to get human-readable output for large clustering? Florian, Any luck finding an answer over the past 5 months? I'm also dealing with similar out-of-memory errors when I run clusterdump. I'm using 50,000 features and tried k=500. The kmeans command ran fine, but then I got the dreaded OutOfMemory error with the clusterdump command:
2013-08-05 18:46:01,686 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:434)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:387)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:118)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:2114)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2242)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:543)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:103)
at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.getRepresentativePoints(RepresentativePointsMapper.java:97)
at org.apache.mahout.clustering.evaluation.RepresentativePointsMapper.setup(RepresentativePointsMapper.java:87)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:138)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:673)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:331)
at org.apache.hadoop.mapred.Child$4.run(Child.java:268)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.Child.main(Child.java:262)
Thanks, Adam On Mon, Mar 11, 2013 at 8:42 AM, Florian Laws flor...@florianlaws.de wrote: Hi, I have
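For very large clusterings, another way to get a first human-readable look without going through clusterdump at all is to stream the final cluster file directly, one cluster at a time. A rough sketch, under the assumption that the output was written by Mahout 0.7+ k-means and therefore holds ClusterWritable values (the class name and path argument are illustrative, not something from this thread):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.clustering.iterator.ClusterWritable;

/** Prints a one-line summary per cluster, keeping only one cluster in memory at a time. */
public class ClusterCentroidPrinter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);   // e.g. a part file under .../clusters-*-final
    FileSystem fs = FileSystem.get(input.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      ClusterWritable value = new ClusterWritable();   // assumes the 0.7+ cluster output format
      while (reader.next(key, value)) {
        System.out.println("cluster " + value.getValue().getId()
            + " centroid norm = " + value.getValue().getCenter().norm(2));
      }
    } finally {
      reader.close();
    }
  }
}
```

Mapping centroid indices back to terms with the dictionary file is a separate step, but it can be done in the same streaming fashion.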
Re: Regarding starting up our project
If you are doing a student project, it may be best for you to do this as a separate github project that *depends* on Mahout rather than trying to build a modification to Mahout in the first instance. The reasons that I say this include: a) the Apache process will probably be foreign to you at first and will significantly slow you down as a result. b) the enthusiasm for your code by the community will depend very much on whether you can convince us that your code will be high quality and you will be around to help maintain it. Purely because this is a student project, you will have a very hard time doing this. That will also slow down your progress. c) the level of review for your code will be variable, but if you are able to get reviews, they are likely to be more stringent than you are used to. This can be disheartening and, again, can slow you down. d) the best route to guarantee the success of your school project is to get something working well as soon as possible. This implies that (a-c) can seriously decrease your success rate. Taking all of this together, what I suggest is that you start by developing as a separate project. This will let you get started instantly and make progress immediately. Being separate does not mean that you will lack support from the Mahout community; you can still invite reviews and commentary on your approach and your code. All it means is that you won't be slowed down by the whole community process and are more likely to have a successful project. If your project is successful and if your code fits into the Mahout style and structure, then moving from a separate project into the Mahout mainline is relatively easy for a self-contained project like a neural network implementation. All of this said, you should look at the archives of the mailing list. Yexi just recently put up some code to do much of what you suggest and you should comment on the code review. You should also decide how that code affects your project. On Wed, Aug 7, 2013 at 11:46 PM, Sushanth Bhat(MT2012147) sushanth.b...@iiitb.org wrote: Hi, We are planning to implement a neural networks algorithm in Mahout. We are doing this as part of a Machine Learning course project. As we don't have much knowledge about Mahout, can anyone please help us with how to get started with implementing the algorithm? Thanks and regards, Sushanth Bhat IIIT-Bangalore
Re: Regarding starting up our project
On Thu, Aug 8, 2013 at 1:31 PM, Sushanth Bhat(MT2012147) sushanth.b...@iiitb.org wrote: One more doubt I have: do we need to start our project without the Mahout library, I mean just implementing the algorithm? I would suggest that Mahout would be very useful for your project. Use Maven and include Mahout math as a dependency. If you do a map-reduce implementation of neural nets, add Mahout core as well.
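For reference, the dependency setup amounts to a pom.xml fragment along these lines (the version shown is simply the 0.8 release current at the time; adjust to whatever release you target):

```xml
<dependencies>
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-math</artifactId>
    <version>0.8</version>
  </dependency>
  <!-- only needed for a map-reduce implementation -->
  <dependency>
    <groupId>org.apache.mahout</groupId>
    <artifactId>mahout-core</artifactId>
    <version>0.8</version>
  </dependency>
</dependencies>
```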
Re: Is OnlineSummarizer mergeable?
I was about to point you at that pull request. How droll. Didn't know it was from you guys. On Thu, Aug 8, 2013 at 3:35 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, Yes, that's what we did recently, too: https://github.com/clearspring/stream-lib/pull/47 ... but it's still a little too phat...which is what made me think of your OnlineSummarizer as a possible, slimmer alternative. Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Thursday, August 8, 2013 8:27 AM Subject: Re: Is OnlineSummarizer mergeable? I just looked at the source for QDigest from streamlib. I think that the memory usage could be trimmed substantially, possibly by as much as 5:1 by using more primitive friendly structures. On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I need percentiles. Ideally not pre-defined ones, because one person may want e.g. 70th pctile, while somebody else might want 75th pctile for the same metric. Deal breakers: High memory footprint. (high means higher than QDigest from stream-lib for us and we could test and compare with QDigest relatively easily with live data) Algos that create data structures that cannot be merged Loss of accuracy that is not predictably small or configurable Thank you, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Wednesday, August 7, 2013 11:48 PM Subject: Re: Is OnlineSummarizer mergeable? Otis, What statistics do you need? What guarantees? On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I'm actually trying to find an alternative to QDigest (the stream-lib impl specifically) because even though it seems good, we have to deal with crazy volumes of data in SPM (performance monitoring service, see signature)... I'm hoping we can find something that has both a lower memory footprint than QDigest AND that is mergeable a la QDigest. Utopia? Thanks, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, August 7, 2013 4:51 PM Subject: Re: Is OnlineSummarizer mergeable? It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Is OnlineSummarizer algo mergeable? Say that we compute a percentile for some metric for time 12:00-12:01 and store that somewhere, then we compute it for 1201-12:02 and store that separately, and so on. Can we then later merge these computed and previously stored percentile instances and get an accurate value? Thanks, Otis -- Performance Monitoring -- http://sematext.com/spm Solr ElasticSearch Support -- http://sematext.com/
Re: Evaluating Precision and Recall of Various Similarity Metrics
Rafal, The major problems with these sorts of metrics with recommendations include a) different algorithms pull up different data and you don't have any deeply scored reference data. The problem is similar to search, except without test collections. There are some partial solutions to this. b) recommendations are typically very strongly dependent on feedback from data that they themselves sample. This means, for instance, that a system with dithering will often out-perform the same system without dithering. Dithering is a form of noise added to the result of a recommender, so the quality of the system with dithering logically has to be worse than the system without. The system with dithering performs much better, however, because it is able to gather broader information and thus learns about things that the version without dithering would never find. Problem (b) is the strongly limiting case because dithering can make a bigger change than almost any reasonable algorithmic choice. Sadly, problem (a) is the one attacked in most academic research. On Thu, Aug 8, 2013 at 10:34 AM, Rafal Lukawiecki ra...@projectbotticelli.com wrote: Hi Sebastian, thank you for your suggestions, including considering other similarity measures like the log-likelihood ratio. I still hope to do a comparison of all of the available ones on our data. I realise the importance (and also some limitations) of in-production A/B testing, but having a broader way to test the recommender would have been useful. I suppose I am used to looking at lift/profit charts, cross-validation, RMSE and other metrics of accuracy and reliability when working with data mining models, such as decision trees or clustering, but also using this technique for association rules evaluation, where I'd be hoping that the model correctly predicts basket completions. I am curious if there is anything along this line of thinking for evaluating recommenders that do not expose explicit models. Many thanks, very much indeed, for all your replies. Rafal On 8 Aug 2013, at 17:58, Sebastian Schelter s...@apache.org wrote: Hi Rafal, you are right, unfortunately there is no tooling available for doing holdout tests with RecommenderJob. It would be an awesome contribution to Mahout though. Ideally, you would want to split your dataset in a way that you retain some portion of the interactions of each user and then see how much of the held-out interactions you can reproduce. You should be aware that this is basically a test of how well a recommender can reproduce what already happened. If you get recommendations for items that are not in your held-out data, this does not automatically mean that they are wrong. They might be very interesting things that the user simply hasn't had a chance to look at yet. The real performance of a recommender can only be found via extensive A/B testing in production systems. Btw, I would strongly recommend that you use a more sophisticated similarity than cooccurrence count, e.g. the log-likelihood ratio. Best, Sebastian 2013/8/8 Rafal Lukawiecki ra...@projectbotticelli.com I'd like to compare the accuracy, precision and recall of various vector similarity measures with regards to our data sets. Ideally, I'd like to do that for RecommenderJob, including CooccurrenceCount. However, I don't think RecommenderJob supports calculation of the performance metrics. Alternatively, I could use the evaluator logic in the non-Hadoop-based item-based recommenders, but they do not seem to support the option of using CooccurrenceCount as a measure, or am I wrong?
Reading archived conversations from here, I can see others have asked a similar question in 2011 ( http://comments.gmane.org/gmane.comp.apache.mahout.user/9758) but there seems no clear guidance. Also, I am unsure if it is valid to split the data set into training/testing that way, as testing users' key characteristic is the items they have preferred—and there is no model to fit them to, so to speak, or they would become anonymous users if we stripped their preferences. Am I right in thinking that I could test RecommenderJob by feeding X random preferences of a user, having hidden the remainder of their preferences, and see if the hidden items/preferences would become their recommendations? However, that approach would change what a user likes (by hiding their preferences for testing purposes) and I'd be concerned about the value of the recommendation. Am I in a loop? Is there a way to somehow tap into the recommendation to get an accuracy metric out? Did anyone, perhaps, share a method or a script (R, Python, Java) for evaluating RecommenderJob results? Many thanks, Rafal Lukawiecki
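Nothing ships with Mahout for this, but the hold-out test described above can be scripted in a few lines once the recommendations and the hidden preferences are loaded into memory. A sketch of precision-at-N, with all names invented for the example:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class HoldoutEval {

  /** Average fraction of the top-N recommended items that appear in the user's held-out set. */
  public static double precisionAtN(Map<Long, Set<Long>> heldOut,
                                    Map<Long, List<Long>> recommended,
                                    int n) {
    double sum = 0;
    int users = 0;
    for (Map.Entry<Long, List<Long>> entry : recommended.entrySet()) {
      Set<Long> truth = heldOut.get(entry.getKey());
      List<Long> recs = entry.getValue();
      if (truth == null || truth.isEmpty() || recs.isEmpty()) {
        continue;                                   // nothing to score for this user
      }
      List<Long> topN = recs.subList(0, Math.min(n, recs.size()));
      int hits = 0;
      for (Long itemId : topN) {
        if (truth.contains(itemId)) {
          hits++;
        }
      }
      sum += hits / (double) topN.size();
      users++;
    }
    return users == 0 ? 0 : sum / users;
  }
}
```

Recall-at-N is the same loop with the denominator replaced by the size of the held-out set.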
Re: Arff files to Naive Bayes
On Wed, Aug 7, 2013 at 3:56 PM, John Meagher john.meag...@gmail.com wrote: Continuous values are being used now in addition to a large set of boolean flags. I think I could convert the continuous values to some sort of bucketed values that could be represented as additional flags. If that were the case, would the format need to be ... id1 flaga flagb id2 flagb flagc? Yes.
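In other words, each bucketed continuous value simply becomes one more token on the line. A hypothetical helper (the field name and bucket boundaries are made up for the example):

```java
public final class BucketTokens {

  /** Maps a continuous value to a bucket token that can sit next to the boolean flags. */
  static String bucketToken(String field, double value, double[] upperBounds) {
    for (double bound : upperBounds) {
      if (value <= bound) {
        return field + "_le_" + bound;
      }
    }
    return field + "_gt_" + upperBounds[upperBounds.length - 1];
  }

  public static void main(String[] args) {
    // Prints: id1 flaga flagb age_le_30.0
    System.out.println("id1 flaga flagb " + bucketToken("age", 27, new double[] {18, 30, 45, 65}));
  }
}
```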
Re: Content-Based Recommendation Approaches
On Wed, Aug 7, 2013 at 7:29 AM, cont...@dhuebner.com wrote: This typically won't be fast enough if you have something like a random forest, but if your final targeting model is logistic regression, it probably will be fast enough. So usually I do need to train a custom model for each user independently? Not necessarily. Usually you need a global model that has user x item interaction variables. It isn't unusual to need a per user adjustment model, but if you can make that rare, you can do better. From the linear user x item interaction model, for instance, you may be able to convert the model into a sparse weighted query that could retrieve items from an inverted index such as Solr. This might also be possible with a per user model, but I would have to think about that.
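To make the last idea concrete, here is one hedged sketch of what such a sparse weighted query could look like: emit the item-side weights of the linear model as boosted terms for the search engine. The field name, weight map, and formatting are all invented; this is not an existing Mahout or Solr API.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public final class BoostedQueryBuilder {

  /** Positive item-feature weights become boosted query terms; negative weights are
      dropped here because Lucene-style boosts must be non-negative. */
  static String toBoostedQuery(String field, Map<String, Double> itemFeatureWeights) {
    StringBuilder q = new StringBuilder();
    for (Map.Entry<String, Double> e : itemFeatureWeights.entrySet()) {
      if (e.getValue() <= 0) {
        continue;
      }
      if (q.length() > 0) {
        q.append(' ');
      }
      q.append(field).append(':').append(e.getKey())
       .append('^').append(String.format(Locale.ROOT, "%.3f", e.getValue()));
    }
    return q.toString();
  }

  public static void main(String[] args) {
    Map<String, Double> weights = new LinkedHashMap<String, Double>();
    weights.put("jazz", 0.812);
    weights.put("vinyl", 0.304);
    weights.put("polka", -0.5);
    System.out.println(toBoostedQuery("tags", weights));   // tags:jazz^0.812 tags:vinyl^0.304
  }
}
```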
Re: Is OnlineSummarizer mergeable?
It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Is OnlineSummarizer algo mergeable? Say that we compute a percentile for some metric for time 12:00-12:01 and store that somewhere, then we compute it for 1201-12:02 and store that separately, and so on. Can we then later merge these computed and previously stored percentile instances and get an accurate value? Thanks, Otis -- Performance Monitoring -- http://sematext.com/spm Solr ElasticSearch Support -- http://sematext.com/
Re: Setting up a recommender
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am working on this issue now. The current state is that I can bring in a bunch of track names and links to artist names and so on. This would provide the basic set of items (artists, genres, tracks and tags). There is a hitch in bringing in the data needed to generate the logs since that part of MB is not Apache compatible. I am working on that issue. Technically, the data is in a massively normalized relational form right now, but it isn't terribly hard to denormalize into a form that we need.
Re: Is OnlineSummarizer mergeable?
Otis, What statistics do you need? What guarantees? On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I'm actually trying to find an alternative to QDigest (the stream-lib impl specifically) because even though it seems good, we have to deal with crazy volumes of data in SPM (performance monitoring service, see signature)... I'm hoping we can find something that has both a lower memory footprint than QDigest AND that is mergeable a la QDigest. Utopia? Thanks, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, August 7, 2013 4:51 PM Subject: Re: Is OnlineSummarizer mergeable? It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Is OnlineSummarizer algo mergeable? Say that we compute a percentile for some metric for time 12:00-12:01 and store that somewhere, then we compute it for 1201-12:02 and store that separately, and so on. Can we then later merge these computed and previously stored percentile instances and get an accurate value? Thanks, Otis -- Performance Monitoring -- http://sematext.com/spm Solr ElasticSearch Support -- http://sematext.com/
Re: up-to-date book or tutorial
There is a considerable amount of discussion going on about a new edition of Mahout in Action. On Wed, Aug 7, 2013 at 12:36 PM, Piero Giacomelli pgiac...@gmail.com wrote: Basically all my examples will be based on Mahout 0.8. So, for example, the k-means clustering will be used with the updated version. I think that by the end of August the preorder will be available. On 7 Aug 2013 21:23, Suneel Marthi suneel_mar...@yahoo.com wrote: Congrats on the book, Piero. Would this book be based on Mahout 0.8 (and exclude stuff that has been marked as deprecated in 0.8)? From: Piero Giacomelli pgiac...@gmail.com To: user@mahout.apache.org Sent: Wednesday, August 7, 2013 3:18 PM Subject: Re: up-to-date book or tutorial Packt will publish a cookbook on Mahout in a couple of months. On 6 Aug 2013 10:53, Prasad, Girijesh g.pra...@ulster.ac.uk wrote: I am looking for an up-to-date book or tutorial. Is Mahout in Action (http://www.manning.com/owen/) the only/best option? Earlier I saw a promotion code but I am unable to find it any more. Please advise. With best wishes, Girijesh.
Re: Arff files to Naive Bayes
By non-text, do you mean continuous values? Or sparse sets of tokens? The general idea for Naive Bayes is that it requires input consisting of sparse sets of tokens. On Wed, Aug 7, 2013 at 2:00 PM, John Meagher john.meag...@gmail.com wrote: I'm just starting work with Mahout and I'm struggling to get an example of a non-text-based Naive Bayes classifier up and running. The input will be feature vectors generated outside of Mahout. As a test I'm using arff files (anything else CSV-ish will work). I've been able to convert things into vectors in a few different ways, but can't figure out what is needed to get the trainnb command to work. Does the label index need to be generated through some manual process or something other than the arff.vector or trainnb command? Is there a specific format needed for the input arff files? Specific columns in a specific order? Here's what I've tried so far in both 0.7 from CDH4 and 0.8 direct from Apache:
$ wget http://repository.seasr.org/Datasets/UCI/arff/iris.arff
$ mahout arff.vector --input iris.arff --output iris.model --dictOut iris.labels
This works and seems to be right so far. This is the command I think I need to train the Naive Bayes model. It fails when creating the label index with the exception below.
$ mahout trainnb -i iris.model/ -o iris.training -el -li iris.training.labels
... Exception in thread main java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:123)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
... Thanks for the help, John
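For what it's worth, the ArrayIndexOutOfBoundsException in writeLabelIndex is the symptom usually seen when the sequence-file keys do not have the /label/docid shape that trainnb -el appears to extract labels from. A hedged sketch of writing vectors with keys in that shape (paths, the label, and the feature values are invented for the example):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

/** Writes labeled vectors with keys of the form "/<label>/<docId>" for use with trainnb -el. */
public class LabeledVectorWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("iris-labeled/part-r-00000");
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, VectorWritable.class);
    try {
      double[] features = {5.1, 3.5, 1.4, 0.2};   // one Iris instance
      writer.append(new Text("/Iris-setosa/doc-0"),
                    new VectorWritable(new DenseVector(features)));
    } finally {
      writer.close();
    }
  }
}
```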
Re: Is OnlineSummarizer mergeable?
Ouch. You didn't mention accuracy. I will assume a standard sort of 2-3% accuracy or better and let you correct me if necessary. I could meet all but one or two of those requirements several different ways. For instance, very high or low quantiles can be met with stacked min-sets or max-sets. The idea is that you keep the highest k values and the highest k 10x downsampled data and so on. This is pretty good for down to the 90+%-ile (or up to the 10th %-ile). This structure merges without loss of accuracy. For well-defined quantiles like 25-50-75, then the Mahout OnlineSummarizer is excellent. You can choose your arbitrary quantile ahead of time and you can sometimes merge (but perverse data can kill you). And then the QDigest. It is, by definition, as big as a QDigest, but is mergeable and allows any quantile. Also cool, is the fact that you can pick the quantile late in the process. Maybe the answer is to make the QDigest structure smaller. How well is the streamlib implementation cranked down? Is it really tight? On Wed, Aug 7, 2013 at 3:04 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I need percentiles. Ideally not pre-defined ones, because one person may want e.g. 70th pctile, while somebody else might want 75th pctile for the same metric. Deal breakers: High memory footprint. (high means higher than QDigest from stream-lib for us and we could test and compare with QDigest relatively easily with live data) Algos that create data structures that cannot be merged Loss of accuracy that is not predictably small or configurable Thank you, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Wednesday, August 7, 2013 11:48 PM Subject: Re: Is OnlineSummarizer mergeable? Otis, What statistics do you need? What guarantees? On Wed, Aug 7, 2013 at 1:26 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: Hi Ted, I'm actually trying to find an alternative to QDigest (the stream-lib impl specifically) because even though it seems good, we have to deal with crazy volumes of data in SPM (performance monitoring service, see signature)... I'm hoping we can find something that has both a lower memory footprint than QDigest AND that is mergeable a la QDigest. Utopia? Thanks, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, August 7, 2013 4:51 PM Subject: Re: Is OnlineSummarizer mergeable? It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at 4:21 AM, Otis Gospodnetic otis.gospodne...@gmail.com wrote: Hi, Is OnlineSummarizer algo mergeable? Say that we compute a percentile for some metric for time 12:00-12:01 and store that somewhere, then we compute it for 1201-12:02 and store that separately, and so on. Can we then later merge these computed and previously stored percentile instances and get an accurate value? Thanks, Otis -- Performance Monitoring -- http://sematext.com/spm Solr ElasticSearch Support -- http://sematext.com/
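For concreteness, a toy version of the stacked max-set structure described above, enough to show why it merges without loss (the class, the fixed 10x sampling factor, and the omission of the actual quantile estimate are all just for the illustration; this is not Mahout code):

```java
import java.util.PriorityQueue;
import java.util.Random;

/** Keeps the top-k values of the raw stream plus the top-k of a 10x down-sampled stream. */
public final class StackedMaxSets {

  private final int k;
  private final PriorityQueue<Double> topRaw = new PriorityQueue<Double>();      // min-heap holding the top-k
  private final PriorityQueue<Double> topSampled = new PriorityQueue<Double>();  // top-k of the 10x sample
  private final Random rand = new Random();

  public StackedMaxSets(int k) {
    this.k = k;
  }

  public void add(double x) {
    offer(topRaw, x);
    if (rand.nextInt(10) == 0) {        // 10x down-sampling; deeper levels would sample further
      offer(topSampled, x);
    }
  }

  /** Merging two summaries is just merging the retained values level by level. */
  public void merge(StackedMaxSets other) {
    for (double x : other.topRaw) {
      offer(topRaw, x);
    }
    for (double x : other.topSampled) {
      offer(topSampled, x);
    }
  }

  private void offer(PriorityQueue<Double> heap, double x) {
    heap.offer(x);
    if (heap.size() > k) {
      heap.poll();                      // drop the smallest retained value
    }
  }
}
```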
Re: Content-Based Recommendation Approaches
On Tue, Aug 6, 2013 at 5:27 PM, Dominik Hübner cont...@dhuebner.com wrote: I wonder how model based approaches might be scaled to a large number of users. My understanding is that I would have to train some model like a decision tree or naive bayes (or regression … etc.) for each user and do the prediction for each item using this model. Is there any common approach to get those techniques scaling up with larger datasets? Yes. There are several approaches. One of the most effective is rescoring. You use a performant recommender such as a search engine based recommender and then rescore the top few hundred items using a more detailed model. This typically won't be fast enough if you have something like a random forest, but if your final targeting model is logistic regression, it probably will be fast enough. In any case, there are also tricks you can pull in the evaluation of certain classes of models. For instance, with logistic regression, you can remove the link function (doesn't change ordering) and you can ignore all user specific features and weights (this doesn't change ordering either). This leaves you with a relatively small number of computations in the form of a sparse by dense dot product.
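A small sketch of the scoring shortcut in the last paragraph, under the assumption that items are encoded as sparse indices into the dense weight vector of a logistic-regression model (all names and numbers are invented):

```java
import java.util.HashMap;
import java.util.Map;

public final class RescoreSketch {

  /** Sparse (item features) by dense (learned weights) dot product. The logistic link and
      all user-only terms are omitted: neither changes the ordering of items for one user. */
  static double itemScore(double[] weights, Map<Integer, Double> sparseItemFeatures) {
    double score = 0;
    for (Map.Entry<Integer, Double> e : sparseItemFeatures.entrySet()) {
      score += weights[e.getKey()] * e.getValue();
    }
    return score;
  }

  public static void main(String[] args) {
    double[] weights = {0.1, -0.4, 0.9, 0.0, 0.3};     // learned weights (made up)
    Map<Integer, Double> item = new HashMap<Integer, Double>();
    item.put(2, 1.0);                                   // the item has features 2 and 4
    item.put(4, 2.0);
    System.out.println(itemScore(weights, item));       // 0.9*1.0 + 0.3*2.0 = 1.5
  }
}
```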
Re: solr-recommender, recent changes to ToItemVectorsMapper
Concur here. Obviously CrossRowSimilarityJob and RowSimilarityJob will be able to share some downstream code. But there are economies in RSJ that probably can't apply to CRSJ. On Mon, Aug 5, 2013 at 7:20 AM, Sebastian Schelter s...@apache.org wrote: I think the downsampling belongs into RowSimilarityJob. But I also think that we need a special CrossRowSimilarityJob that computes B'A and also downsamples them during the computation. Furthermore it should compute LLR similarities between the rows, not dot products. --sebastian On 05.08.2013 16:14, Pat Ferrel wrote: OK, I see it in my build now. Also not sufficient repos in the pom. Looks like some major refactoring of RowSimilarity is in progress. Sebastian, are you sure downsampling belongs in RowSimilarity? It won't be applied to [B'A]? If so I'll update to the latest Mahout trunk. On Aug 4, 2013, at 8:57 PM, B Lyon bradfl...@gmail.com wrote: Hi Pat Below is the compilation error - it's what led me to look at the SAMPLE_SIZE stuff in the first place, where I confirmed via javap that the downloaded mahout jar did not have it any more and then I started looking at the svn source.
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.5.1:compile (default-compile) on project solr-recommender: Compilation failure: Compilation failure:
[ERROR] /Users/bradflyon/Documents/solr-recommender/src/main/java/finderbots/recommenders/hadoop/PrepareActionMatrixesJob.java:[120,71] cannot find symbol
[ERROR] symbol : variable SAMPLE_SIZE
[ERROR] location: class org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper
[ERROR] /Users/bradflyon/Documents/solr-recommender/src/main/java/finderbots/recommenders/hadoop/PrepareActionMatrixesJob.java:[168,71] cannot find symbol
[ERROR] symbol : variable SAMPLE_SIZE
[ERROR] location: class org.apache.mahout.cf.taste.hadoop.preparation.ToItemVectorsMapper
[
On Sun, Aug 4, 2013 at 8:57 PM, Pat Ferrel pat.fer...@gmail.com wrote: Just updated to today's Mahout trunk and everything works for me. Can you send me the error? Sebastian, do we really want this limit in RowSimilarity? It will not be applied to [B'A] unless you also do a mod to give us RowSimilarity on two matrices. Now that would be very nice indeed… On Aug 3, 2013, at 9:48 PM, B Lyon bradfl...@gmail.com wrote: Hi Pat I was going to just play with building the solr-recommender stuff in its current wip state and noticed a compile error (running mvn install), I think because the 0.9 snapshot has some changes from July 30th http://svn.apache.org/viewvc?view=revision&revision=1508302 Basically, back on June 18, Ted noticed that the downsampling might not be being done at the right place to actually avoid overwork due to perversely prolific users (thread is here: http://web.archiveorange.com/archive/v/z6zxQatCzHoFxbdLF0of), and someone else (Sebastian Schelter) has already acted on this (July 30) to move the downsampling to somewhere else (MAHOUT-1289 - https://issues.apache.org/jira/browse/MAHOUT-1289), which (among other things) removes the SAMPLE_SIZE static variable from ToItemVectorsMapper. I don't know how the general changes affect what you were setting up/playing with. Let me know if I've missed something here.