Re: Avoiding OOM for large datasets

2013-12-11 Thread Ted Dunning
This is not right. The sequential version would have finished long before this for any reasonable value of k. I do note, however, that you have set k = 200,000 where you only have 300,000 documents. Depending on which value you set (I don't have the code handy), this may actually be increased

Re: Slope one algorithm performance

2013-12-08 Thread Ted Dunning
Use a better recommender. Slope one is just there for completeness. Sent from my iPhone On Dec 8, 2013, at 2:24, Siddharth Patnaik spatnai...@gmail.com wrote: What should be done to improve the runtime performance?

Re: SVM Implementation for mahout?

2013-12-08 Thread Ted Dunning
The problem of correlation of features is clearly present in text, but it is not so clear what the effect will be. For naive bayes this has the effect of making the classifier overconfident but it usually still works reasonably well. For logistic regression without regularization it can

Re: SVM Implementation for mahout?

2013-12-08 Thread Ted Dunning
On Sun, Dec 8, 2013 at 5:50 PM, Fernando Santos fernandoleandro1...@gmail.com wrote: Actually I had never heard of PCA and LDA. I'll take a look at it. PCA and LDA are probably not quite what you want for Naive Bayes, especially in Mahout. There is an assumption of a sparse binary

Re: Question about Pearson Correlation in non-Taste mode

2013-12-06 Thread Ted Dunning
that aren't co-rated can't meaningfully be included in this computation. On Sun, Dec 1, 2013 at 8:29 AM, Ted Dunning ted.dunn...@gmail.com wrote: Good point Amit. Not sure how much this matters. It may be that PearsonCorrelationSimilarity is bad name that should

Re: Question about Pearson Correlation in non-Taste mode

2013-12-06 Thread Ted Dunning
, or you have another one you can forward to me, your doctoral dissertation? Thanks. Jason Xin -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Friday, December 06, 2013 7:56 PM To: user@mahout.apache.org Subject: Re: Question about Pearson Correlation in non

Re: KMeans cluster analysis

2013-12-05 Thread Ted Dunning
Angelo, The first question is how you intend to define which items are similar. Also, what is the intended use of the clustering? Without knowing that, it is very hard to say how to best do the clustering. For instance, are two records more similar if the records are at the same time of day?

Re: Outlier detection/Pruning

2013-12-05 Thread Ted Dunning
to determine the optimal number of clusters that best fits the dataset and passing that information as parameter to Kmeans clustering (kmeansDriver class). Regards Prabhakar On Tue, Dec 3, 2013 at 6:00 PM, Ted Dunning ted.dunn...@gmail.com wrote: Can you be more specific about which code you

Re: TF-IDF confusion

2013-12-03 Thread Ted Dunning
Ani, I really don't understand your second point. Here is how I view things ... if you can phrase things in those terms, it might help me understand your question. The TF part of TF-IDF refers to the term frequencies in a document. Typically, each possible word is assigned to a positive

Re: Outlier detection/Pruning

2013-12-03 Thread Ted Dunning
Can you be more specific about which code you are asking about? The ball k-means implementation provides a capability somewhat like this, but perhaps in a more clearly defined way. On Tue, Dec 3, 2013 at 9:34 AM, Prabhakar Srinivasan prabhakar.sriniva...@gmail.com wrote: Hello! Can someone

Re: Clustering Spatial Data

2013-12-02 Thread Ted Dunning
Peter, What you say is a bit confusing to me. You say you have centers already. But then you talk about algorithms which find the centers. Also, you say you want to assign points based on centers, but you also say that clusters have different shapes, area, size and point count. Do you mean

Re: Pig vector project

2013-12-02 Thread Ted Dunning
Elephant bird is distinctly superior to Pig Vector for many things (it moved forward, Pig Vector did not). I believe there is also a Twitter internal project known as PigML which is much more what Pig Vector wanted to be. There is also https://github.com/hanborq/pigml, but I think it is very

Re: Mahout for clustering

2013-12-02 Thread Ted Dunning
Do you want to cluster users or items? For items, the vectorization that you suggest will work reasonably well, especially if you use TF.IDF weighting and normalize the resulting vectors. You can also use one of the matrix decomposition techniques and cluster the resulting vectors. The spectral
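
As a rough illustration of the weighting and normalization step mentioned above, here is a minimal Java sketch. It is not Mahout's vectorizer: the class name TfIdfSketch, the array-based layout, and the particular IDF form log(numDocs / docFreq) are assumptions chosen for brevity.

```java
/** Illustrative TF-IDF weighting followed by L2 normalization (hypothetical helper, not Mahout code). */
class TfIdfSketch {
    /**
     * termCounts[i] is the count of term i in one item's text,
     * docFreq[i] is the number of items containing term i (must be at least 1),
     * numDocs is the total number of items.
     */
    static double[] weigh(int[] termCounts, int[] docFreq, int numDocs) {
        double[] weighted = new double[termCounts.length];
        double sumSquares = 0.0;
        for (int i = 0; i < termCounts.length; i++) {
            // Assumed IDF variant: log(N / df). Terms present in every item get weight 0.
            weighted[i] = termCounts[i] * Math.log((double) numDocs / docFreq[i]);
            sumSquares += weighted[i] * weighted[i];
        }
        double norm = Math.sqrt(sumSquares);
        if (norm > 0.0) {
            // Scale to unit length so long and short descriptions are comparable.
            for (int i = 0; i < weighted.length; i++) {
                weighted[i] /= norm;
            }
        }
        return weighted;
    }
}
```

Normalizing to unit length means that Euclidean distance between two such vectors depends only on the angle between them, which is usually what is wanted when clustering text.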

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-12-02 Thread Ted Dunning
Inline On Mon, Dec 2, 2013 at 8:55 AM, optimusfan optimus...@yahoo.com wrote: ... To accomplish this, we used AdaptiveLogisticRegression and trained 46 binary classification models. Our approach has been to do an 80/20 split on the data, holding the 20% back for cross-validation of the

Re: Question about Pearson Correlation in non-Taste mode

2013-12-01 Thread Ted Dunning
, Ted Dunning ted.dunn...@gmail.com wrote: On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote: Hi Ted, Thanks for your response. I thought that the mean of a sparse vector is simply the mean of the defined elements? Why would the vectors become dense unless

Re: Test naivebayes task running really slowly and not in distributed mode

2013-12-01 Thread Ted Dunning
Did the training run use both machines? How large is the input for the test run? Is it contained in a single file? On Sat, Nov 30, 2013 at 11:22 AM, Fernando Santos fernandoleandro1...@gmail.com wrote: Hello everyone, I'm trying to do a text classification task. My dataset is not that

Re: Clustering without Hadoop

2013-12-01 Thread Ted Dunning
The new Ball k-means and streaming k-means implementations have non-Hadoop versions. The streaming k-means implementation also has a threaded implementation that runs without Hadoop. The threaded streaming k-means implementation should be pretty fast. On Sun, Dec 1, 2013 at 7:55 PM, Shan Lu

Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

2013-11-29 Thread Ted Dunning
The default with the Mahout encoders is two probes. This is unnecessary with the intercept term, of course, if you protect the intercept term from other updates, possibly by encoding other data using a view of the original feature vector. For each probe, a different hash is used so each value is
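
To make the two-probe behaviour concrete, here is a small self-contained Java sketch of hashed encoding with multiple probes. It does not use Mahout's encoder classes or its hash function: TwoProbeEncoder, probeLocations, and the use of String.hashCode with the probe index mixed in are all stand-ins chosen for illustration.

```java
/** Illustrative hashed feature encoding with several probes (hypothetical, not Mahout's encoder). */
class TwoProbeEncoder {
    /** Returns the vector slots touched by one feature value, one slot per probe. */
    static int[] probeLocations(String name, String value, int numFeatures, int probes) {
        int[] slots = new int[probes];
        for (int p = 0; p < probes; p++) {
            // Mixing the probe index into the key gives each probe a different hash.
            int h = (name + ":" + value + "#" + p).hashCode();
            slots[p] = Math.floorMod(h, numFeatures);
        }
        return slots;
    }

    /** Adds 1.0 to every probed slot, so a single value shows up in up to 'probes' dimensions. */
    static double[] encode(String name, String value, double[] vector, int probes) {
        for (int slot : probeLocations(name, value, vector.length, probes)) {
            vector[slot] += 1.0;
        }
        return vector;
    }
}
```

With two probes, one feature value normally lights up two dimensions, which matches what the subject line describes. When the vector is short, different values are more likely to land in the same slots.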

Re: RandomAccessSparseVector setting 1.0 in 2 dims for 1 feature value?

2013-11-29 Thread Ted Dunning
after encoding a new value in the vector? This would give a user the information that the length of the chosen vector is too short. So far, I did not find any method in the api to check for that. 2013/11/29 Ted Dunning ted.dunn...@gmail.com: The default with the Mahout encoders is two probes

Re: Question about Pearson Correlation in non-Taste mode

2013-11-29 Thread Ted Dunning
Well, the best way to compute correlation using sparse vectors is to make sure you keep them sparse. To do that, you must avoid subtracting the mean by expanding whatever formulae you are using. For instance, if you are computing (x - m_x) . (y - m_y) (here . means dot product) If you do
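
The expansion referred to above can be sketched as follows. If the means are taken over all n dimensions (missing entries counted as zero, which is an assumption of this sketch), then (x - m_x) . (y - m_y) = x . y - n m_x m_y, and the right-hand side only touches the non-zero entries. The class and method names are hypothetical and the sparse vectors are represented as sorted index arrays plus value arrays rather than Mahout Vector objects.

```java
/** Centered dot product of sparse vectors without densifying them (illustrative sketch). */
class SparseCenteredDot {
    static double sum(double[] values) {
        double s = 0.0;
        for (double v : values) {
            s += v;
        }
        return s;
    }

    /** Dot product of two sparse vectors given as sorted index arrays plus matching values. */
    static double dot(int[] xi, double[] xv, int[] yi, double[] yv) {
        double s = 0.0;
        int a = 0;
        int b = 0;
        while (a < xi.length && b < yi.length) {
            if (xi[a] == yi[b]) {
                s += xv[a] * yv[b];
                a++;
                b++;
            } else if (xi[a] < yi[b]) {
                a++;
            } else {
                b++;
            }
        }
        return s;
    }

    /** Computes (x - m_x) . (y - m_y) as x . y - n * m_x * m_y, where n is the full dimension. */
    static double centeredDot(int[] xi, double[] xv, int[] yi, double[] yv, int n) {
        double meanX = sum(xv) / n;
        double meanY = sum(yv) / n;
        return dot(xi, xv, yi, yv) - n * meanX * meanY;
    }
}
```

The same trick applies to the variance terms in the denominator of the correlation, since (x - m_x) . (x - m_x) = x . x - n m_x^2.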

Re: Question about Pearson Correlation in non-Taste mode

2013-11-29 Thread Ted Dunning
On Fri, Nov 29, 2013 at 10:16 PM, Amit Nithian anith...@gmail.com wrote: Hi Ted, Thanks for your response. I thought that the mean of a sparse vector is simply the mean of the defined elements? Why would the vectors become dense unless you're meaning that all the undefined elements (0?) now

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-28 Thread Ted Dunning
public double currentLearningRate() { return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() + stepOffset, forgettingExponent); } I presume that you would like an Adagrad-like solution to replace the above? On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning ted.dunn
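
For readers who want to see how the quoted schedule behaves, here is a standalone Java version of the same formula with the fields turned into parameters. The class name and the parameter values used in the comments are illustrative assumptions, not Mahout defaults.

```java
/** Standalone version of the quoted learning-rate formula (illustrative sketch). */
class LearningRateSchedule {
    /**
     * rate = mu0 * decayFactor^step * (step + stepOffset)^forgettingExponent.
     * A negative forgettingExponent makes the rate shrink polynomially with the step count;
     * a decayFactor below 1 adds exponential decay on top of that.
     */
    static double currentLearningRate(double mu0, double decayFactor, int step,
                                      double stepOffset, double forgettingExponent) {
        return mu0 * Math.pow(decayFactor, step)
                * Math.pow(step + stepOffset, forgettingExponent);
    }
}
```

Note that this is a single global rate that depends only on the step count. An Adagrad-style scheme, as raised in the quoted question, would instead keep a separate rate per coefficient.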

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
of OnlineLogisticRegression, which in turn builds up to Adaptive/Cross Fold. b) for truly on-line learning where no repeated passes through the data. What would it take to get to an implementation? How can anyone help? Regards, On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning ted.dunn

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-27 Thread Ted Dunning
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi vishal.santo...@gmail.com Are we to assume that SGD is still a work in progress and implementations ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ? They are too raw to be accepted uncritically, for sure. They have

Re: Good centroid generation algorithm for top-down clustering approach

2013-11-26 Thread Ted Dunning
Have you looked at the streaming k-means work? The basic idea is that you generate a sketch of the data which you can then cluster in-memory. That lets you use very advanced centroid generation algorithms that require lots of processing. On Tue, Nov 26, 2013 at 6:29 AM, Chih-Hsien Wu

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

2013-11-26 Thread Ted Dunning
Well, first off, let me say that I am much less of a fan now of the magical cross validation approach and adaptation based on that than I was when I wrote the ALR code. There are definitely legs in the ideas, but my implementation has a number of flaws. For example: a) the way that I provide

Re: Algorithms in Mahout

2013-11-25 Thread Ted Dunning
On Mon, Nov 25, 2013 at 3:14 AM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: There are/were multiple kNN implementations in Mahout: Recommender knn

Re: OnlineLogisticRegression: Are my settings sensible

2013-11-08 Thread Ted Dunning
, Andreas Bauer b...@gmx.net wrote: Ok, I'll have a look. Thanks! I know mahout is intended for large scale machine learning, but I guess it shouldn't have problems with such small data either. Ted Dunning ted.dunn...@gmail.com schrieb: On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b

Re: Solr-recommender for Mahout 0.9

2013-11-08 Thread Ted Dunning
For recommendation work, I suggest that it would be better to simply code out an explicit OR query. On Thu, Nov 7, 2013 at 8:11 PM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi Pat, On Nov 7, 2013, at 7:30pm, Pat Ferrel pat.fer...@gmail.com wrote: Another approach would be to weight

Re: Decaying score for old preferences when using the .refresh()

2013-11-07 Thread Ted Dunning
On Thu, Nov 7, 2013 at 12:50 AM, Gokhan Capan gkhn...@gmail.com wrote: This particular approach is discussed, and proven to increase the accuracy in Collaborative filtering with Temporal Dynamics by Yehuda Koren. The decay function is parameterized per user, keeping track of how consistent

Re: OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Ted Dunning
Why is FEATURE_NUMBER != 13? With 12 features that are already lovely and continuous, just stick them in elements 1..12 of a 13 long vector and put a constant value at the beginning of it. Hashed encoding is good for sparse stuff, but confusing for your case. Also, it looks like you only pass
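
The layout suggested above, a constant in element 0 and the continuous features in elements 1..12, can be sketched in plain Java like this. The class and method names are hypothetical, and a plain double[] stands in for a Mahout dense vector.

```java
/** Builds a dense feature vector with a constant intercept slot in front (illustrative sketch). */
class DenseFeatureVector {
    /** Returns a vector one longer than the input: element 0 is 1.0, elements 1..n are the features. */
    static double[] withIntercept(double[] features) {
        double[] v = new double[features.length + 1];
        v[0] = 1.0; // constant term, so the model can learn an intercept
        System.arraycopy(features, 0, v, 1, features.length);
        return v;
    }
}
```

With 12 continuous inputs this gives a 13-element vector with no hash collisions, which is why hashing is unnecessary in this case.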

Re: OnlineLogisticRegression: Are my settings sensible

2013-11-07 Thread Ted Dunning
On Thu, Nov 7, 2013 at 9:45 PM, Andreas Bauer b...@gmx.net wrote: Hi, Thanks for your comments. I modified the examples from the mahout in action book, therefore I used the hashed approach and that's why i used 100 features. I'll adjust the number. Makes sense. But the book was doing

Re: Scheduled tasks in Mahout

2013-10-30 Thread Ted Dunning
No. Scheduling is outside of Mahout's scope. On Wed, Oct 30, 2013 at 12:55 PM, Cassio Melo melo.cas...@gmail.com wrote: I wonder if Mahout (more precisely org.apache.mahout.cf.taste package) has any helper class to execute scheduled tasks like fetch data, compute similarity, etc. Thank

Re: TravellingSaleman

2013-10-29 Thread Ted Dunning
Actually that isn't quite correct. Watchmaker was removed. That was a genetic algorithm implementation. EP or evolutionary programming still has an implementation in Mahout in the class org.apache.mahout.ep.EvolutionaryProcess This algorithm is documented here: http://arxiv.org/abs/0803.3838

Re: Mahout 0.8 Random Forest Accuracy

2013-10-19 Thread Ted Dunning
Tim, Yes, RF's are ensemble learners, but that doesn't mean that you couldn't wrap them up with other classifiers to have a higher level ensemble. On Sat, Oct 19, 2013 at 6:48 AM, Tim Peut t...@timpeut.com wrote: Thanks for the info and suggestions everyone. On 19 October 2013 01:00, Ted

Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 7:48 AM, Tim Peut t...@timpeut.com wrote: Has anyone found that Mahout's random forest doesn't perform as well as other implementations? If not, is there any reason why it wouldn't perform as well? This is disappointing, but not entirely surprising. There has been

Re: Mahout 0.8 Random Forest Accuracy

2013-10-18 Thread Ted Dunning
On Fri, Oct 18, 2013 at 3:50 PM, j.barrett Strausser j.barrett.straus...@gmail.com wrote: How difficult would it be to wrap the RF classifier into an ensemble learner? It is callable. Should be relatively easy.

Re: Clustering of text data on external categories

2013-10-11 Thread Ted Dunning
Search engines do cool things. On Fri, Oct 11, 2013 at 7:42 AM, Jens Bonerz jbon...@googlemail.com wrote: what a nice idea :-) really like that approach 2013/10/11 Ted Dunning ted.dunn...@gmail.com You don't need Mahout for this. A very easy way to do this is to gather all the words

Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
For language detection, you are going to have a hard time doing better than one of the standard packages for the purpose. See here: http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html On Thu, Oct 10, 2013 at 1:01 AM, Dean Jones dean.m.jo...@gmail.com wrote: Hi Si,

Re: Naive bayes and character n-grams

2013-10-10 Thread Ted Dunning
Cool. Sounds like you are ahead of the game. Sent from my iPhone On Oct 10, 2013, at 13:15, Dean Jones dean.m.jo...@gmail.com wrote: On 10 October 2013 12:46, Ted Dunning ted.dunn...@gmail.com wrote: For language detection, you are going to have a hard time doing better than one

Re: Naive bayes and character n-grams

2013-10-09 Thread Ted Dunning
Yes. Should work to use character n-grams. There are oddities in the stats because the different n-grams are not independent, but Naive Bayes methods are in such a state of sin that it shouldn't hurt any worse. No... I don't think that there is a capability built in to generate the character

Re: Solr-recommender

2013-10-09 Thread Ted Dunning
of the rest could be trimmed away by config or adherence to conventions I suspect. In the demo site I'm working on I've had to adopt some slightly hacky conventions that I'll describe some day. On Oct 1, 2013, at 10:38 PM, Ted Dunning ted.dunn...@gmail.com wrote: Pat, Ellen and some folks

Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: On 10/9/13 3:08 PM, Pat Ferrel wrote: Solr uses cosine similarity for its queries. The implementation on github uses Mahout LLR for calculating the item-item similarity matrix but when you do the

Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 2:07 PM, Pat Ferrel p...@occamsmachete.com wrote: 2) What you are doing is something else that I was calling a shopping-cart recommender. You are using the item-set in the current cart and finding similar, what, items? A different way to tackle this is to store all other

Re: Solr-recommender

2013-10-09 Thread Ted Dunning
On Wed, Oct 9, 2013 at 12:54 PM, Michael Sokolov msoko...@safaribooksonline.com wrote: It sounds like you are doing item-item similarities for recommendations, not actually calculating user-history based recs, is that true? Yes that's true so far. Our recommender system has the ability to

Re: What are the best settings for my clustering task

2013-10-06 Thread Ted Dunning
iPhone On Oct 6, 2013, at 12:37, Jens Bonerz jbon...@googlemail.com wrote: Hmmm.. has ballkmeans made it already into the 0.8 release? can't find it in the list of available programs when calling the mahout binary... 2013/10/3 Ted Dunning ted.dunn...@gmail.com What you are seeing here

Re: Editing Dictionary Vector Generated

2013-10-04 Thread Ted Dunning
Why do you say that this is unacceptable? If the phrase is the most common way that the word English is used, this isn't such a bad thing. In general, with machine learning, the idea is to let the data speak. If the data say something you don't like, you have to be careful about

Re: What are the best settings for my clustering task

2013-10-04 Thread Ted Dunning
MahoutCluster similarProducts.txt What am I missing? 2013/10/3 Ted Dunning ted.dunn...@gmail.com Yes. That will work. The sketch will then contain 10,000 x log N centroids. If N = 10^9, log N \approx 30 so the sketch will have at about 300,000 weighted centroids
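
The arithmetic in the quoted reply can be checked with a few lines of Java. The figure "log N is about 30 for N = 10^9" only works out with a base-2 logarithm, so base 2 is assumed here; the class and method names are hypothetical.

```java
/** Rough size of a streaming k-means sketch, assuming about k * log2(N) centroids (illustrative). */
class SketchSize {
    static long estimatedCentroids(long k, double numPoints) {
        double log2n = Math.log(numPoints) / Math.log(2.0);
        return Math.round(k * log2n);
    }
}
```

For k = 10,000 and N = 10^9 this gives a little under 300,000 weighted centroids, in line with the estimate in the message, and small enough to cluster in memory.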

Re: Editing Dictionary Vector Generated

2013-10-04 Thread Ted Dunning
On Fri, Oct 4, 2013 at 6:13 AM, Puneet Arora arorapuneet2...@gmail.com wrote: yes you guessed correct that I am using naive bayes, but how can I handle this type of problem. I didn't hear about a problem. You said you didn't like weights on words like English to reflect the fact that they

Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
on their short description text? What else could I use? 2013/10/1 Ted Dunning ted.dunn...@gmail.com At such small sizes, I would guess that the sequential version of the streaming k-means or ball k-means would be better options. On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon

Re: What are the best settings for my clustering task

2013-10-02 Thread Ted Dunning
happen, if I define a very high number that is guaranteed to be the estimated number of clusters. for example if I set it to 10.000 clusters if an estimate of 5.000 is likely, will that work? 2013/10/2 Ted Dunning ted.dunn...@gmail.com The way that the new streaming k-means works

Re: What are the best settings for my clustering task

2013-10-01 Thread Ted Dunning
At such small sizes, I would guess that the sequential version of the streaming k-means or ball k-means would be better options. On Mon, Sep 30, 2013 at 2:14 PM, mercutio7979 jbon...@googlemail.com wrote: Hello all, I am currently trying to create clusters from a group of 50.000 strings that

Re: Multidimensional log-likelihood similarity

2013-09-29 Thread Ted Dunning
Yes. You can turn the normal item-item relationships around to get this. What you have is an item x feature matrix. Normally, one has a user x item matrix in cooccurrence analysis and you get an item x item matrix. If you consider the features to be users in the computation, then the resulting
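
A minimal sketch of the idea, assuming a small binary item x feature matrix held in memory: treating features the way users are treated in ordinary cooccurrence analysis amounts to multiplying the matrix by its own transpose, which yields item x item counts. The class name is hypothetical, and a real job would follow this with a log-likelihood ratio test rather than use the raw counts.

```java
/** Item-item cooccurrence counts from a binary item x feature matrix (illustrative sketch). */
class ItemCooccurrence {
    /** Returns C = A * A^T, where C[i][j] is the number of features items i and j share. */
    static int[][] cooccurrence(int[][] itemByFeature) {
        int items = itemByFeature.length;
        int[][] counts = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                int shared = 0;
                for (int f = 0; f < itemByFeature[i].length; f++) {
                    shared += itemByFeature[i][f] * itemByFeature[j][f];
                }
                counts[i][j] = shared;
            }
        }
        return counts;
    }
}
```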

Re: Mahout in one PC - multiple cores processor

2013-09-21 Thread Ted Dunning
? 2013/9/20 Ted Dunning ted.dunn...@gmail.com It also depends on what you are doing. Several parts of Mahout have non Hadoop versions. On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote: It is always possible to run mahout without a cluster on a single

Re: Mahout in one PC - multiple cores processor

2013-09-20 Thread Ted Dunning
It also depends on what you are doing. Several parts of Mahout have non Hadoop versions. On Fri, Sep 20, 2013 at 5:53 AM, parnab kumar parnab.2...@gmail.com wrote: It is always possible to run mahout without a cluster on a single machine but do not expect too much performance gain on it if

Re: Clustering algorithms

2013-09-17 Thread Ted Dunning
Right now the best in terms of speed without losing quality in Mahout is the streaming k-means implementation. One exciting possibility is that you probably can combine a streaming k-means pre-pass with a regularized k-means algorithm in order to get results more like Lingo. You could also

Re: Tuning parameters for ALS-WR

2013-09-11 Thread Ted Dunning
On Wed, Sep 11, 2013 at 12:07 AM, Sean Owen sro...@gmail.com wrote: 2. Do we have to tune the similarityclass parameter in item-based CF? If so, do we compare the mean average precision values based on validation data, and then report the same for the test set? Yes you are

Re: Tuning parameters for ALS-WR

2013-09-10 Thread Ted Dunning
You definitely need to separate into three sets. Another way to put it is that with cross validation, any learning algorithm needs to have test data withheld from it. The remaining data is training data to be used by the learning algorithm. Some training algorithms such as the one that you
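
One simple way to get three disjoint sets is to assign each record to a partition at random, as in the Java sketch below. The class name, the 0/1/2 coding, and the fractions are illustrative assumptions; the point is only that the test partition is never seen while parameters are being tuned on the validation partition.

```java
import java.util.Random;

/** Randomly assigns records to training, validation, and test partitions (illustrative sketch). */
class ThreeWaySplit {
    /** Returns one code per record: 0 = training, 1 = validation (for tuning), 2 = held-out test. */
    static int[] assign(int numRecords, double trainFraction, double validationFraction, long seed) {
        Random random = new Random(seed);
        int[] partition = new int[numRecords];
        for (int i = 0; i < numRecords; i++) {
            double u = random.nextDouble();
            if (u < trainFraction) {
                partition[i] = 0;
            } else if (u < trainFraction + validationFraction) {
                partition[i] = 1;
            } else {
                partition[i] = 2;
            }
        }
        return partition;
    }
}
```

A fixed seed keeps the split reproducible, so that different parameter settings are compared on the same validation records.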

Re: Solr recommender

2013-09-07 Thread Ted Dunning
On Fri, Sep 6, 2013 at 9:33 AM, Pat Ferrel pat.fer...@gmail.com wrote: One of the unique things about the Solr recommender is online recs. Two scenarios come to mind: 1) ask the user to pick from among a list of videos, taking the picks as preferences and making recs. Make more and see if

Re: Hadoop implementation of ParallelSGDFactorizer

2013-09-07 Thread Ted Dunning
That means If I Recall Correctly. It is internet slang. See also http://en.wiktionary.org/wiki/Appendix:English_internet_slang On Sat, Sep 7, 2013 at 12:39 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Sebastian, what is IIRC? On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter

Re: Mahout readable output

2013-09-07 Thread Ted Dunning
Darius's comments are good. You also have to think about what similar means to you. From the data you describe, I see several possibilities: - geo-location from machine id (if it includes IP address) - content from the query - frequency of posting - diurnal phase of posting (tells us time

Re: Solr recommender

2013-09-07 Thread Ted Dunning
On Sat, Sep 7, 2013 at 2:35 PM, Pat Ferrel p...@occamsmachete.com wrote: ... Clustering can be done by doing SVD or ALS on the user x thing matrix first or by directly clustering the columns of the user x thing matrix after some kind of IDF weighting. I think that only the streaming

Re: lucene.vectors not working

2013-09-06 Thread Ted Dunning
Ahh... That makes a lot of sense. On Thu, Sep 5, 2013 at 11:38 PM, Lauren Massa-Lochridge laurl...@ieee.org wrote: Ted Dunning ted.dunning at gmail.com writes: OK. So the easy answer strikes out. On Sat, Aug 3, 2013 at 5:04 AM, Swami Kevala swami.kevala at ishafoundation.org

Re: using KmeansDriver with HDFS

2013-09-05 Thread Ted Dunning
On Wed, Sep 4, 2013 at 6:58 PM, Alan Krumholz alan_krumh...@yahoo.com.mx wrote: I pulled that code (org.apache.mahout.clustering.classify.ClusterClassifier.readFromSeqFiles(ClusterClassifier.java:215)) and I think it is trying to read a file from one of the paths I passed to the method but with a

Re: Has anyone implemented true L-LDA out of Mahout?

2013-09-05 Thread Ted Dunning
I haven't seen any discussion of this other than what you reference. On Thu, Sep 5, 2013 at 7:59 AM, Henry Lee honesthe...@gmail.com wrote: I am about to implement Jake Mannix's suggestion out of Twitter fork. Has anyone already implemented true L-LDA out of Mahout?

Re: Tweaking ALS models to filter out highly related items when an item has been purchased

2013-09-05 Thread Ted Dunning
I think that Dominik's comments are exactly on target. As far as implementation is concerned, I think that it is very important to not distort the basic recommendation algorithm with business rules like this. It is much better to post-process the results to impose your will directly. One

Re: ALS and SVD feature vectors

2013-09-04 Thread Ted Dunning
On Wed, Sep 4, 2013 at 10:59 AM, Dmitriy Lyubimov dlie...@gmail.com wrote: Now, what happens in the case of SVD? The vectors are normal by definition. Are singular values used at all, or just left and right singular vectors? SVD does not take weights so it cannot ignore or weigh out a

Re: Cannot build source version mahout-distribution-0.8

2013-08-27 Thread Ted Dunning
You also have to watch out in the case of web errors. Maven can store an error message instead of a well-formed file in your repo, leading to all kinds of confusion. Try deleting it thus: rm -rf ~/.m2/repository/com/ibm On Tue, Aug 27, 2013 at 7:37 AM, Stevo Slavić ssla...@gmail.com wrote:

Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-19 Thread Ted Dunning
-similarity case. The cross-correlation sparsification via cooccurrence is probably pretty weak, no? On Aug 18, 2013, at 11:53 AM, Ted Dunning ted.dunn...@gmail.com wrote: Outside of the context of your demo, suppose that you have events a, b, c and d. Event a is the one we are centered

Re: Setting up a recommender

2013-08-19 Thread Ted Dunning
the data from Mahout. On Aug 12, 2013, at 2:32 PM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. That would be interesting. On Mon, Aug 12, 2013 at 1:25 PM, Gokhan Capan gkhn...@gmail.com wrote: A little digression: Might a Matrix implementation backed by a Solr index

Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-18 Thread Ted Dunning
values between the two sequence pairs to flip the order at will... which is information that co-occurrence of course does not know about. On Sat, Aug 17, 2013 at 10:30 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is nice. As you say, k11 is the only part that is used in cooccurrence

Re: Draft Interactive Viz for Exploring Co-occurrence, Recommender calculations

2013-08-17 Thread Ted Dunning
This is nice. As you say, k11 is the only part that is used in cooccurrence and it doesn't weight by prevalence, either. This size analysis is hard to demonstrate much difference because it is hard to show interesting values of LLR without absurdly strong coordination between items. On Fri,
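
To show how LLR uses the whole 2x2 table where plain cooccurrence uses only k11, here is a compact Java sketch of the log-likelihood ratio (G^2) statistic written in terms of unnormalized entropies. It follows the standard formulation and is written from scratch for illustration rather than copied from Mahout; the class and method names are assumptions.

```java
/** Log-likelihood ratio (G^2) for a 2x2 contingency table (illustrative sketch). */
class LlrSketch {
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    /** Unnormalized Shannon entropy: N * H, where N is the sum of the counts. */
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            sum += c;
            parts += xLogX(c);
        }
        return xLogX(sum) - parts;
    }

    /**
     * k11 = both events together, k12 = first without second,
     * k21 = second without first, k22 = neither.
     */
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb tiny negative values caused by round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }
}
```

Two tables with the same k11 can score very differently: if the events are independent the score is essentially zero however large k11 is, while the same k11 with small k12 and k21 gives a large score.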

Re: Install mahout 0.8 with hadoop 2.0

2013-08-14 Thread Ted Dunning
-identify as a member of the small demand set Ted Dunning describes, I figure I can chime in. As always, YMMV.

Re: Install mahout 0.8 with hadoop 2.0

2013-08-13 Thread Ted Dunning
No. There is very small demand for Mahout on Hadoop 2.0 so far and the forward/backward incompatibility of 2.0 has made it difficult to motivate moving to 2.0. The bigtop guys built a maven profile for 0.23 some time ago. I don't know the status of that. I don't think that the differences are

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
Why do you think this? On Tue, Aug 13, 2013 at 11:56 AM, sam wu swu5...@gmail.com wrote: Mahout 0.9 snapshot RowSimilarityJob.java , sampleDown method line 291 or 300 double rowSampleRate = Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow; return either 0.0

Re: RowSimilarityJob, sampleDown method problem

2013-08-13 Thread Ted Dunning
, observationsPerRow) / observationsPerRow; we get rowSampleRate = 0.0 (not 0.7) do we totally skip this column or sample column entries with .7 probability (roughly get 700 entries) On Tue, Aug 13, 2013 at 11:58 AM, Ted Dunning ted.dunn...@gmail.com wrote: Why do you think this? On Tue
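
The behaviour the poster describes is what Java integer division would produce if both operands were ints. Whether that is the case in RowSimilarityJob is not settled by the snippet, so the Java sketch below only demonstrates the language rule; the class and method names are hypothetical and the variable names are borrowed from the quoted line.

```java
/** Demonstrates int versus double division for a sampling rate (illustrative sketch). */
class SampleRateSketch {
    /** Both operands are ints, so the quotient is truncated to 0 or 1 before widening to double. */
    static double truncatedRate(int maxObservationsPerRow, int observationsPerRow) {
        return Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }

    /** Casting one operand to double first keeps the fractional rate. */
    static double fractionalRate(int maxObservationsPerRow, int observationsPerRow) {
        return (double) Math.min(maxObservationsPerRow, observationsPerRow) / observationsPerRow;
    }
}
```

With a cap of 700 and a row of 1000 observations, the first form yields 0.0 and the second yields 0.7, which are the two values contrasted in the message above.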

Re: Help regarding Seq2sparse utility

2013-08-12 Thread Ted Dunning
features for each vector so that I can convert the dense vectors to sparse vectors. Your thoughts on this are welcome. Thanks, Ashvini On Mon, Aug 12, 2013 at 10:55 AM, Ted Dunning ted.dunn...@gmail.com wrote: Aside from your issues with clusterdumper, the values you want can be had from

Re: Clustering for customer segmentation

2013-08-12 Thread Ted Dunning
The tasks that you need to do include: a) group your history by user id b) extract the features you want to use from each user history c) repeat clustering and adjusting the scaling of your features until you are happy If you have a few hundred examples of customers broken down by the

Re: Clustering for customer segmentation

2013-08-12 Thread Ted Dunning
On Mon, Aug 12, 2013 at 12:52 PM, Martin, Nick nimar...@pssd.com wrote: I'd love to contribute so I'll get on JIRA and sign up for the dev@ mailing list to start getting a feel for that process. Sounds like you already know the drill. Welcome!

Re: Setting up a recommender

2013-08-12 Thread Ted Dunning
item space. This should be a very easy change If my thinking is correct. On Aug 7, 2013, at 8:09 AM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer

Re: Help regarding Seq2sparse utility

2013-08-11 Thread Ted Dunning
Aside from your issues with clusterdumper, the values you want can be had from a sparse vector using v.iterateNonZero() and v.norm(0). The issue with clusterdumper is odd. Are you saying that the display shows all the components of the vector? Or that there is an in-memory representation that

Re: Changing weightings in kmeans

2013-08-10 Thread Ted Dunning
Check out the streaming k-means code. It provides capabilities for weighted samples. On Sat, Aug 10, 2013 at 6:57 AM, William Moran echofo...@gmail.com wrote: Hi, How would I go about changing the weighting of certain words when preparing data for kmeans? Also, in clusterdumps I have

Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-09 Thread Ted Dunning
On Fri, Aug 9, 2013 at 12:30 PM, Matt Molek mpmo...@gmail.com wrote: From some local IR precision/recall testing, I've found that user based recommenders do better on my data, so I'd like to stick with user based if I can. I know precision/recall measures aren't always that important when

Re: Is OnlineSummarizer mergeable?

2013-08-08 Thread Ted Dunning
From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org; Otis Gospodnetic otis_gospodne...@yahoo.com Sent: Wednesday, August 7, 2013 11:48 PM Subject: Re: Is OnlineSummarizer mergeable? Otis, What statistics do you need? What

Re: RecommenderJob Recommending an Item Already Preferred by a User

2013-08-08 Thread Ted Dunning
That might slow down the job enormously for certain nasty inputs. The more that I think about things, the more convinced I am that there should be a post-processing pass to enforce things like not recommending input items. The recommendation algorithm itself should not be distorted to do this if
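
A post-processing pass of the kind described above can be very small. The Java sketch below takes a ranked list of item ids from any recommender and drops the items the user already has, stopping once enough remain. The class and method names are hypothetical and plain long arrays stand in for Mahout's recommendation objects.

```java
import java.util.HashSet;
import java.util.Set;

/** Filters already-preferred items out of a ranked recommendation list (illustrative sketch). */
class RecommendationPostFilter {
    /** Keeps the original ranking order and returns at most howMany items. */
    static long[] removeAlreadyPreferred(long[] rankedItemIds, long[] preferredItemIds, int howMany) {
        Set<Long> preferred = new HashSet<>();
        for (long id : preferredItemIds) {
            preferred.add(id);
        }
        long[] kept = new long[Math.min(howMany, rankedItemIds.length)];
        int count = 0;
        for (long id : rankedItemIds) {
            if (count == kept.length) {
                break;
            }
            if (!preferred.contains(id)) {
                kept[count++] = id;
            }
        }
        // Trim in case fewer than howMany items survived the filter.
        return java.util.Arrays.copyOf(kept, count);
    }
}
```

In practice the recommender would be asked for somewhat more items than are needed, so that enough survive the filter. Other business rules can be applied in the same pass without touching the underlying algorithm.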

Re: How to get human-readable output for large clustering?

2013-08-08 Thread Ted Dunning
Mahout is a library. You can link against any version you like and still have a perfectly valid Hadoop program. On Wed, Aug 7, 2013 at 11:51 AM, Adam Baron adam.j.ba...@gmail.com wrote: Suneel, Unfortunately no, we're still on Mahout 0.7. My team is one of many teams which share a

Re: Regarding starting up our project

2013-08-08 Thread Ted Dunning
If you are doing a student project, it may be best for you to do this as a separate github project that *depends* on Mahout rather than trying to build a modification to Mahout in the first instance. The reasons that I say this include: a) the Apache process will probably be foreign to you at

Re: Regarding starting up our project

2013-08-08 Thread Ted Dunning
On Thu, Aug 8, 2013 at 1:31 PM, Sushanth Bhat(MT2012147) sushanth.b...@iiitb.org wrote: One more doubt I have that do we need to start our project without Mahout library, I mean just implementing algorithm? I would suggest that Mahout would be very useful for your project. Use Maven and

Re: Is OnlineSummarizer mergeable?

2013-08-08 Thread Ted Dunning
still a little too phat...which is what made me think of your OnlineSummarizer as a possible, slimmer alternative. Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com

Re: Evaluating Precision and Recall of Various Similarity Metrics

2013-08-08 Thread Ted Dunning
Rafal, The major problems with these sorts of metrics with recommendations include a) different algorithms pull up different data and you don't have any deeply scored reference data. The problem is similar to search except without test collections. There are some partial solutions to this b)

Re: Arff files to Naive Bayes

2013-08-08 Thread Ted Dunning
On Wed, Aug 7, 2013 at 3:56 PM, John Meagher john.meag...@gmail.com wrote: Continuous values are being used now in addition to a large set of boolean flags. I think I could convert the continuous values to some sort of bucketed values that could be represented as additional flags. If that

Re: Content-Based Recommendation Approaches

2013-08-07 Thread Ted Dunning
On Wed, Aug 7, 2013 at 7:29 AM, cont...@dhuebner.com wrote: This typically won't be fast enough if you have something like a random forest, but if your final targeting model is logistic regression, it probably will be fast enough. So usually I do need to train a custom model for each user

Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
It isn't as mergeable as I would like. If you have randomized record selection, it should be possible, but perverse ordering can cause serious errors. It would be better to use something like a Q-digest. http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf On Wed, Aug 7, 2013 at

Re: Setting up a recommender

2013-08-07 Thread Ted Dunning
On Tue, Aug 6, 2013 at 7:57 AM, Pat Ferrel pat.fer...@gmail.com wrote: 4) To add more metadata to the Solr output will be left to the consumer for now. If there is a good data set to use we can illustrate how to do it in the project. Ted may have some data for this from musicbrainz. I am

Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
/ HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: user@mahout.apache.org user@mahout.apache.org Sent: Wednesday, August 7, 2013 4:51 PM Subject: Re: Is OnlineSummarizer mergeable? It isn't as mergeable as I would like. If you

Re: up-to-date book or tutorial

2013-08-07 Thread Ted Dunning
There is a considerable amount of discussion going on about a new edition of Mahout in Action. On Wed, Aug 7, 2013 at 12:36 PM, Piero Giacomelli pgiac...@gmail.com wrote: Basically all my examples will be based on mahout 0.8. So for example the k-means clustering will be used with the updated

Re: Arff files to Naive Bayes

2013-08-07 Thread Ted Dunning
By non-text, do you mean continuous values? Or sparse sets of tokens? The general idea for Naive Bayes is that it requires input consisting of sparse sets of tokens. On Wed, Aug 7, 2013 at 2:00 PM, John Meagher john.meag...@gmail.com wrote: I'm just starting work with Mahout and I'm

Re: Is OnlineSummarizer mergeable?

2013-08-07 Thread Ted Dunning
that create data structures that cannot be merged Loss of accuracy that is not predictably small or configurable Thank you, Otis Performance Monitoring for Solr / ElasticSearch / Hadoop / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn

Re: Content-Based Recommendation Approaches

2013-08-06 Thread Ted Dunning
On Tue, Aug 6, 2013 at 5:27 PM, Dominik Hübner cont...@dhuebner.com wrote: I wonder how model based approaches might be scaled to a large number of users. My understanding is that I would have to train some model like a decision tree or naive bayes (or regression … etc.) for each user and do

Re: solr-recommender, recent changes to ToItemVectorsMapper

2013-08-05 Thread Ted Dunning
Concur here. Obviously CrossRowSimilarityJob and RowSimilarityJob will be able to share some down-stream code. But there are economies in RSJ that probably can't apply to CRSJ. On Mon, Aug 5, 2013 at 7:20 AM, Sebastian Schelter s...@apache.org wrote: I think the downsampling belongs into
