Re: distributed cholesky on mahout

2018-04-19 Thread Ted Dunning
There was a variant of cholesky decomposition in Mahout at one time not so long ago. I would guess that it is still there. It is difficult to make a truly distributed version of QR decomposition, but for the purposes of the randomized SVD in Mahout, it wasn't actually necessary to have a true QR.

Re: "LLR with time"

2017-11-12 Thread Ted Dunning
hows and products, it also helps that there is traffic driven > from external sources. > > Thanks for the detailed hints - now it's time to see what comes out of > this. > > Johannes > > On Sun, Nov 12, 2017 at 7:52 AM, Ted Dunning <ted.dunn...@gmail.com> > wrote

Re: "LLR with time"

2017-11-11 Thread Ted Dunning
years of batch :) > > Thanks for your thoughts, I am happy I can rule something out given the > domain (poisson llr). Luckily the domain I'm working on is event > recommendations, so there is a natural deterministic item expiry (as > compared to christmas like stuff). > > Aga

Re: "LLR with time"

2017-11-11 Thread Ted Dunning
rsonalized but would yield “hot in > Greece” > I think that this is a good approach. > > Ted’s “Christmas video” tag is what I was calling a business rule and can > be added to either of the above techniques. > But the (not) hotness feature might help with automating this.

Re: New logo

2017-05-06 Thread Ted Dunning
On Sat, May 6, 2017 at 2:43 PM, Scott C. Cote wrote: > Will you be wearing “one of those t-shirts” on Monday in Houston :) ? > Not likely. It is in the archive.

Re: New logo

2017-05-06 Thread Ted Dunning
ng old mahout color palette if one were to dab their brush in the appropriate colors. This could also be represented in any single color. (Not sure what that does to our TM, is it ok i

Re: New logo

2017-04-27 Thread Ted Dunning
> > > On Thu, Apr 27, 2017 at 3:48 PM, Pat Ferrel <p...@occamsmachete.com> wrote: > > > Do you have constructive input (guidance or opinion is welcome input) or > > would you like to discontinue the contest? If the latter, -1 now. > > > > > > On Apr 27, 2017, at 3:

Re: New logo

2017-04-27 Thread Ted Dunning
ive input (guidance or opinion is welcome input) or > would you like to discontinue the contest? If the latter, -1 now. > > > On Apr 27, 2017, at 3:42 PM, Ted Dunning <ted.dunn...@gmail.com> wrote: > > I thought that none of the proposals were worth continuing with. > > >

Re: New logo

2017-04-27 Thread Ted Dunning
I thought that none of the proposals were worth continuing with. On Thu, Apr 27, 2017 at 3:36 PM, Pat Ferrel wrote: > Yes, -1 means you hate them all or think the designers are not worth > paying. We have to pay to continue, I’ll foot the bill (donations >

Re: Reg:-Integrating Mahout with Solr

2017-04-02 Thread Ted Dunning
ds, > Arun > > On 2 April 2017 at 11:59, Ted Dunning <ted.dunn...@gmail.com> wrote: > > > Arun, > > > > That's good news. > > > > The second limitation will be how much data you have for each document > and > > whether you have a good measure

Re: Reg:-Integrating Mahout with Solr

2017-04-02 Thread Ted Dunning
> > It won't be a problem for me to use the LAN path for configurations and > index. I can use the larger document base. > > Thanks and Regards, > Arun > > On 2 April 2017 at 07:00, Ted Dunning <ted.dunn...@gmail.com> wrote: > > > On Sat, Apr 1, 2017 at 6:21 PM, arun abraham

Re: Reg:-Integrating Mahout with Solr

2017-04-01 Thread Ted Dunning
On Sat, Apr 1, 2017 at 6:21 PM, arun abraham wrote: > As a first step I am trying to recommend a minimum of two documents (as my > Solr document index is ~100 docs). > This is kind of weird. Can you say why you have so very few documents? There may be something special

Re: Marketing

2017-03-24 Thread Ted Dunning
On Fri, Mar 24, 2017 at 8:27 AM, Pat Ferrel wrote: > maybe we should drop the name Mahout altogether. I have been told that there is a cool secondary interpretation of Mahout as well. I think that the Hebrew word is pronounced roughly like Mahout. מַהוּת The cool

Re: Mahout ML vs Spark Mlib vs Mahout-Spark integreation

2017-01-31 Thread Ted Dunning
From my perspective, the state of the art of machine learning is with systems like TensorFlow and dl4j. If you can deal with the limits of a non-clustered GPU system, then Theano and Caffe are very useful. Keras papers over the difference between different back-ends nicely. TensorFlow and Theano

Re: Scaling up spark Iitem similarity on big data data sets

2016-06-23 Thread Ted Dunning
This actually sounds like a very small problem. My guess is that there are bad settings for the interaction and frequency cuts. On Thu, Jun 23, 2016 at 11:07 AM, Pat Ferrel wrote: > In addition to increasing downsampling there are some other things to > note. The

Re: mahout tf-idf vs lucene tf-idf

2016-06-04 Thread Ted Dunning
On Sat, Jun 4, 2016 at 10:14 AM, forme book wrote: > The Lucene side already has this implementation by default. What I > struggle to understand is the advantage of having lucene.vector in > Mahout when Lucene offers that feature out of the box. > > Maybe I'm

Re: LLR quick clarification

2016-05-12 Thread Ted Dunning
It just means that there is an association. Causation is much more difficult to ascertain. On Wed, May 4, 2016 at 6:06 AM, Nikaash Puri wrote: > Hi, > > Just wanted to clarify a small doubt. On running LLR with primary > indicator as view and secondary indicator as
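
For readers who want to see what an LLR score is computed from, here is a minimal sketch of the G^2 log-likelihood ratio statistic for a 2x2 table of cooccurrence counts. The function name and argument names are illustrative and are not Mahout's API. A large score says only that the two events are associated more strongly than independence would predict; it says nothing about causation.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """G^2 log-likelihood ratio statistic for a 2x2 contingency table of counts.

    k11: both events occurred, k12: only the first, k21: only the second,
    k22: neither. (Hypothetical names chosen for this illustration.)
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    g2 = 0.0
    for count, i, j in cells:
        if count > 0:  # a zero cell contributes nothing (0 * log 0 is taken as 0)
            expected = rows[i] * cols[j] / total  # count expected under independence
            g2 += count * math.log(count / expected)
    return 2.0 * g2
```

A table whose cells match the independence expectation exactly scores zero, while a table with all mass on the diagonal scores high.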

Re: Matrix inversion

2016-05-05 Thread Ted Dunning
Mahout is considerably better at sparse operations and optimizations than dense ones. Beyond that, I would expect that you would do better with traditional math libraries. And, are you really trying to invert a matrix? The common maxim is that this implies an error in your method because

Re: Algorithms of prediction

2016-02-25 Thread Ted Dunning
On Thu, Feb 25, 2016 at 6:52 AM, wrote: > Thank you for your answer > What other tools do you advise me to use? > Do you recommend Rhadoop? > Try h2o instead. Good R interface. Decent model building.

Re: What's the mr item-based recommend algorithm essay?

2016-02-20 Thread Ted Dunning
See here: https://ssc.io/pdf/rec11-schelter.pdf On Fri, Feb 19, 2016 at 3:16 AM, Lee S wrote: > Hi: >Does anybody know which paper the mr algorithm is based on? >

Re: Document similarity

2016-02-14 Thread Ted Dunning
Did you want textual similarity? Or semantic similarity? The actual semantics of a message can be opaque from the content, but clear from the usage. On Sun, Feb 14, 2016 at 5:29 AM, Charles Earl wrote: > David, > LDA or LSI can work quite nicely for similarity (YMMV of

Re: Mahout - Recommenditemvalue with magnitude of 1

2015-11-29 Thread Ted Dunning
On Sun, Nov 29, 2015 at 9:36 PM, Niklas Ekvall wrote: > My conclusion is that recommenditembased in Mahout works better for ratings > than binary data, what are your conclusions? > Still operator error somewhere. Binary data works much better as a real recommender.

Re: Efficiently writing all the recommendation to a file

2015-11-20 Thread Ted Dunning
There are a few problems that you have. 1) user-based recommendation is often slower than item-based (sometimes MUCH slower). This can make a 2-10x difference in practice. 2) pre-computing recommendations is usually much less efficient than computing them on the fly (because typically few users

Re: Haters get Love too

2015-11-03 Thread Ted Dunning
No. Not entirely surprising, but it is *really* nice to get some public results on this. The treatment of the negatives as a separate cross term instead of just lumping them together is a very significant difference. On Tue, Nov 3, 2015 at 3:42 PM, Peter Jaumann

Re: Haters get Love too

2015-11-03 Thread Ted Dunning
On Tue, Nov 3, 2015 at 3:20 PM, Pat Ferrel wrote: > For the strict out there we did not directly isolate the two actions, > which is work remaining so some of the lift might be due to just having > more data but it’s a really good first step because more data doesn't >

Re: matrix inversion in plan ?

2015-10-08 Thread Ted Dunning
om SVD right ? thanks, canal > > > On Monday, October 5, 2015 2:25 PM, Ted Dunning < > ted.dunn...@gmail.com> wrote: > > > That isn't enough detail. > > How do you mean to compute degrees of freedom? Why do you need the inverse > to do this? > > Where di

Re: matrix inversion in plan ?

2015-10-05 Thread Ted Dunning
ay, October 5, 2015 6:25 AM, Peter Jaumann < > peter.jauma...@gmail.com> wrote: > > > This should be done with a matrix solver indeed!!! > > > > On Oct 4, 2015 11:53 AM, "Ted Dunning" <ted.dunn...@gmail.com> wrote: > > > > > > It is almos

Re: matrix inversion in plan ?

2015-10-05 Thread Ted Dunning
On Sun, Oct 4, 2015 at 10:32 PM, go canal wrote: > in fact i need to support both double and complex double for either > distributed memory based or out-of-core. Ahh... Well Mahout doesn't support complex anything. So this isn't going to help you.

Re: matrix inversion in plan ?

2015-10-05 Thread Ted Dunning
lid> wrote: > I will be more than interested to extend to complex double, when the > solver is ready for double data type. thanks, canal > > > On Monday, October 5, 2015 2:02 PM, Ted Dunning < > ted.dunn...@gmail.com> wrote: > > > On Sun, Oct 4, 2015 at 10:32

Re: matrix inversion in plan ?

2015-10-04 Thread Ted Dunning
roject requires the > inversion of a very large matrix. will have to revert back to scalapack or MR > based solutions I guess. > thanks, canal > > > On Saturday, October 3, 2015 11:31 PM, Ted Dunning > <ted.dunn...@gmail.com> wrote: > > > I doubt serio

Re: matrix inversion in plan ?

2015-10-03 Thread Ted Dunning
I doubt seriously that Samsara will support matrix inversion per se. The problem is that a) it densifies sparse matrices and b) it is much more costly than solving a linear system. Samsara is roughly memory based, but different back-ends will try to spill to disk if necessary. It is likely that the
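
To illustrate the point about solving instead of inverting, here is a small single-machine sketch that solves A x = b by Gaussian elimination with partial pivoting and never forms the inverse. It is plain Python for illustration only; it is not Samsara code, and the function name is made up for this example.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (no inverse formed)."""
    n = len(a)
    # Work on an augmented copy [a | b] so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or nearly singular")
        m[col], m[pivot] = m[pivot], m[col]
        # Eliminate this column from all rows below the pivot row.
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x
```

If only x is needed, one elimination pass per right-hand side is enough, whereas an explicit inverse amounts to solving for n right-hand sides and, for a sparse A, usually produces a dense result.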

Re: Modifying kmeans algo

2015-09-23 Thread Ted Dunning
On Tue, Sep 22, 2015 at 5:51 PM, Ankit Goel wrote: > What I wanted to do was modify the clustering algorithm, in hopes of > experimenting with different versions of it. I'm not much hung over the MR > part of things, rather the clustering algo itself. > Have at it.

Re: [mahout 0.9 | k-means] methodology for selecting k to cluster very large datasets

2015-09-15 Thread Ted Dunning
My own feeling is that the right answer is to look at average squared distance on your training data and on held-out data. As long as these values are nearly the same, you likely have a value of k that is smaller than (or equal to) the optimal one. When the average squared distance is significantly less on the
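
The comparison described above can be sketched in a few lines. This is an illustrative helper, not Mahout code: it assumes the centroids have already been fitted on the training data for some k, and the names are hypothetical. The snippet is cut off in the archive, so reading a large gap as a sign that k is too large is an assumption about where the sentence was going.

```python
def mean_sq_dist(points, centroids):
    """Average squared Euclidean distance from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        total += min(sum((x - c) ** 2 for x, c in zip(p, centroid))
                     for centroid in centroids)
    return total / len(points)

def generalization_gap(train, held_out, centroids):
    """Held-out distortion minus training distortion for centroids fitted on `train`."""
    return mean_sq_dist(held_out, centroids) - mean_sq_dist(train, centroids)
```

One way to use it is to fit centroids for increasing k and watch for the k at which the gap stops being close to zero.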

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning
The most central point in a cluster is often referred to as a medoid (similar to median, but multi-dimensional). The Mahout code does not compute medoids. In general, they are difficult to compute and implementing a full k-medoid clustering algorithm even more so. On Mon, Jul 20, 2015 at 6:25
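
As a concrete picture of the definition, a brute-force medoid for a small set of points can be written as below. This is a sketch for illustration and is not part of Mahout. It compares every point with every other point, so the cost grows quadratically with the number of points, which is part of why medoids are awkward at scale.

```python
import math

def medoid(points):
    """Return the point whose total Euclidean distance to all other points is smallest."""
    def total_distance(p):
        return sum(math.dist(p, q) for q in points)
    return min(points, key=total_distance)
```

Unlike a mean, the medoid is always one of the input points, so a single far outlier does not drag it away from the bulk of the data.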

Re: Kmeans clusterdump Interpretation

2015-07-20 Thread Ted Dunning
suggestions how I should go about that? So far I'm using nutch to crawl, solr to index and now I'm here on mahout. On Tue, Jul 21, 2015 at 7:10 AM, Ted Dunning ted.dunn...@gmail.com wrote: The most central point in a cluster is often referred to as a medoid (similar to median

Re: Realtime update of similarity matrices

2015-06-19 Thread Ted Dunning
The standard approach is to re-run the off-line learning. It is possible, though not yet supported in Mahout tools, to do real-time updates. See here for some details: https://www.mapr.com/resources/videos/fully-real-time-recommendation-%E2%80%93-ted-dunning-sf-data-mining On Fri, Jun 19

Re: Streaming K-means

2015-06-01 Thread Ted Dunning
The streaming k-means works by building a sketch of the data which is then used to do real clustering. It might be that this sketch would be acceptable to do k-medoids, but that is definitely not guaranteed. Similarly, it might be possible to build a medoid sketch instead of a mean based sketch,

Re: Regression using MapReduce

2015-05-30 Thread Ted Dunning
Mahout is deprecating pretty much all of the classic MapReduce implementations in any case in favor of algorithms based fundamentally on a new linear algebra system known as Mahout-Samsara. On Fri, May 29, 2015 at 10:52 PM, Punit Naik naik.puni...@gmail.com wrote: Hello all users I just

Re: Row Similarity

2015-05-14 Thread Ted Dunning
Actually, this is probably done more easily using a simple matrix multiplication. The reason for not using recommendation code for this is that your problem is entirely dense. How exactly you should go about this is a different question. Up to tens of thousands of stars, you can probably do
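
A minimal sketch of the matrix-multiplication idea for dense rows follows. Cosine similarity is an assumption here, since the message does not name a similarity measure, and the function name is made up. With every row scaled to unit length, the product of the matrix with its own transpose holds all pairwise cosine similarities.

```python
import math

def cosine_row_similarity(rows):
    """All-pairs cosine similarity of dense rows, computed as N * N^T with unit-length rows."""
    normalized = []
    for row in rows:
        norm = math.sqrt(sum(x * x for x in row))
        # Leave all-zero rows as they are to avoid dividing by zero.
        normalized.append([x / norm for x in row] if norm > 0 else list(row))
    # Entry [i][j] is the dot product of normalized rows i and j.
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in normalized]
            for r1 in normalized]
```

For real data sizes the same product would be done with a dense linear algebra library rather than nested Python loops.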

Re: [VOTE] Apache Mahout 0.10.0 Release

2015-04-11 Thread Ted Dunning
) On Fri, Apr 10, 2015 at 9:34 PM, Ted Dunning ted.dunn...@gmail.com wrote: Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x

Re: [VOTE] Apache Mahout 0.10.0 Release

2015-04-10 Thread Ted Dunning
I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact doesn't include apache. Not sure that is a hard requirement, but it seems a good thing to do. On Fri, Apr 10,

Re: [VOTE] Apache Mahout 0.10.0 Release

2015-04-10 Thread Ted Dunning
Ah... forgot this. +1 (binding) On Fri, Apr 10, 2015 at 11:14 PM, Ted Dunning ted.dunn...@gmail.com wrote: I downloaded and tested the signatures and check-sums on {binary,source} x {zip,tar} + pom. All were correct. One thing that I worry a little about is that the name of the artifact

Re: adjusted cosine similarity for item-based recommender?

2015-04-03 Thread Ted Dunning
For practical recommendation systems, ratings are almost irrelevant. Ratings were prominent in the original academic work on recommendations largely because with the early research systems, users had no recordable interactions with content other than ratings. The Taste component of Mahout was

Re: fast performance way of writing preferences to file?

2015-04-03 Thread Ted Dunning
Are you sure that the problem is writing the results? It seems to me that the real problem is the use of a user-based recommender. For such a small data set, for instance, a search-based recommender will be able to make recommendations in less than a millisecond with multiple recommendations

Re: Latent Semantic Analysis for Document Categorization

2015-03-30 Thread Ted Dunning
, 2015 at 2:45 AM, Ted Dunning ted.dunn...@gmail.com wrote: Also, if you can include linking information between documents, you should be able to substantially improve accuracy. Same goes for behavioral data like browsing history. On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta

Re: Text clustering with SVD

2015-03-30 Thread Ted Dunning
Lanczos may be more accurate than SSVD, but if you use a power step or three, this difference goes away as well. The best way to select k is actually to pick a value k_max larger than you expect to need and then pick random vectors instead of singular vectors. To evaluate how many singular

Re: Fw: Mahout dataset Vectorization

2015-03-26 Thread Ted Dunning
in text format. Destination IP address is not implicit; in fact, it is in the second row and is a server. Kindly suggest how I can do the kmeans clustering wrt timestamp, or is there a better way? Regards, Raghuveer On Thursday, March 26, 2015 6:34 AM, Ted Dunning ted.dunn...@gmail.com wrote

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread Ted Dunning
Also, if you can include linking information between documents, you should be able to substantially improve accuracy. Same goes for behavioral data like browsing history. On Thu, Mar 26, 2015 at 6:10 AM, Hersheeta Chandankar hersheetachandan...@gmail.com wrote: Thank you so much Chirag and

Re: Fw: Mahout dataset Vectorization

2015-03-25 Thread Ted Dunning
helpful if you can show me a sample for this issue. Kindly suggest. Thanks, Raghuveer On Tuesday, February 17, 2015 12:24 AM, Ted Dunning ted.dunn...@gmail.com wrote: Please take questions like this to the Mahout mailing list. I really prefer to answer these questions in public

Re: implementation of context-aware recommender in Mahout

2015-03-10 Thread Ted Dunning
Glad to help. You can help us by reporting your results when you get them. We look forward to that! On Tue, Mar 10, 2015 at 4:22 AM, Efi Koulouri ekoulou...@gmail.com wrote: Things got clearer with your help! Thank you very much On 9 March 2015 at 01:50, Ted Dunning ted.dunn

Re: implementation of context-aware recommender in Mahout

2015-03-08 Thread Ted Dunning
interesting but in my case I think that building the recommender using the Java classes is more appropriate as I need to use both approaches (post-filtering, pre-filtering). Am I right? On 8 March 2015 at 16:08, Ted Dunning ted.dunn...@gmail.com wrote: By far the easiest way to build

Re: implementation of context-aware recommender in Mahout

2015-03-08 Thread Ted Dunning
By far the easiest way to build a recommender (especially for production) is to use the search engine approach (what Pat was recommending). Post-filtering can be done using the search engine far more easily than using Java classes. On Sat, Mar 7, 2015 at 8:44 AM, Pat Ferrel

Re: problem in recommender similarity computation (taste)

2015-03-08 Thread Ted Dunning
On Sat, Mar 7, 2015 at 3:05 AM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: There can be two solutions: 1. There should be a parameter n, which determines the minimum number of common ratings needed to compute a similarity otherwise the system should return NaN. 2. The similarity should

Re: spark-itemsimilarity question: what's the difference between indicator-matrix and cross-indicator-matrix

2015-03-06 Thread Ted Dunning
The terms main and secondary are a bit confusing. The easiest definition is that cooccurrence analyzes the record of actions you want to recommend. Cross occurrence tries to transfer from one behavior to another. In practice, it has been common to conflate many behaviors into one precisely

Re: How can I manually specify user similarities in the user-based algorithm?

2015-02-15 Thread Ted Dunning
On Sat, Feb 14, 2015 at 6:05 AM, Eugenio Tacchini eugenio.tacch...@gmail.com wrote: Hi Pat, I don't understand why it is not a Mahout problem, my goal is to evaluate (RMSE) the output of a user based algorithm comparing different user similarity measures, Mahout already has everything I need

Re: Apache Mahout Project for GSOC 2015

2015-02-15 Thread Ted Dunning
We haven't had anyone volunteer as a mentor this year as far as I know. On Sun, Feb 15, 2015 at 12:36 PM, Prasad Priyadarshana Fernando bpp...@gmail.com wrote: Hi, I am interested in doing a project on recommender system framework for GSOC 2015. Can somebody tell me whether Apache is

Re: Documentation

2015-02-13 Thread Ted Dunning
On Fri, Feb 13, 2015 at 9:37 AM, Eugenio Tacchini eugenio.tacch...@gmail.com wrote: If I need to use a classical user-based technique, however, the only alternative is the Taste-oriented code, am I right? Right. Still, I can't see how to perform a prediction for a user/item couple, is

Re: How can I manually specify user similarities in the user-based algorithm?

2015-02-13 Thread Ted Dunning
On Fri, Feb 13, 2015 at 11:11 AM, Eugenio Tacchini eugenio.tacch...@gmail.com wrote: Is there anyone who can give me some hints about this task? Another way to look at this is to try to wedge this into the item similarity code. There are hooks available in the map-reduce version of item

Re: Neural Network in hadoop

2015-02-12 Thread Ted Dunning
That is a really old paper that basically pre-dates all of the recent important work in neural networks. You should look for works on Rectified Linear Units (ReLU), drop-out regularization, parameter servers (downpour sgd) and deep learning. Map-reduce as you have used it will not produce

Re: Own recommender

2015-01-21 Thread Ted Dunning
Juanjo, Using the Taste components, it will be almost impossible to get really high performance. For that, using the itemsimilarity program to feed a search index is the best alternative. A version of the itemsimilarity program is available in Scala and could be called fairly easily as

Re: Own recommender

2015-01-15 Thread Ted Dunning
The old Taste code is not the state of the art. User-based recommenders built on that will be slow. On Thu, Jan 15, 2015 at 7:10 AM, Juanjo Ramos jjar...@gmail.com wrote: Hi David, You implement your custom algorithm and create your own class that implements the UserSimilarity interface.

Re: boost selected dimensions in kmeans clustering

2015-01-15 Thread Ted Dunning
On Thu, Jan 15, 2015 at 5:23 AM, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: My question is: is it better to scale up these dimensions directly in the tf-idf sequence final mix file using these correction factors, or first scale up each of the tf-vectors and

Re: DTW distance measure and K-medioids, Hierarchical clustering

2015-01-15 Thread Ted Dunning
trying to find a scalable solution for my problem, I tried to fit it in what's already implemented in Mahout (for clustering), but it's not so obvious to me. I'm open to suggestions, I'm still new to all of this. Thanks, Marko On Sat 10 Jan 2015 07:32:33 AM CET, Ted Dunning wrote

Re: boost selected dimensions in kmeans clustering

2015-01-14 Thread Ted Dunning
The easiest way is to scale those dimensions up. On Wed, Jan 14, 2015 at 2:41 AM, Miguel Angel Martin junquera mianmarjun.mailingl...@gmail.com wrote: hi all, I am clustering using kmeans several text documents from distinct sources and I have generated the sparse vectors of each
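
Scaling selected dimensions can be sketched as below. The helper names and the dense list representation are assumptions made for illustration; Mahout's sparse vectors would need the equivalent per-index multiplication. Because k-means here is assumed to use squared Euclidean distance, multiplying a dimension by a factor f multiplies its contribution to the distance by f squared.

```python
def boost_dimensions(vector, weights):
    """Multiply selected dimensions of a dense vector by a boost factor.

    `weights` maps a dimension index to its factor; unlisted dimensions keep factor 1.
    """
    return [x * weights.get(i, 1.0) for i, x in enumerate(vector)]

def sq_dist(a, b):
    """Squared Euclidean distance between two dense vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

The same factors have to be applied to every vector, including any new points assigned to clusters later, or the distances stop being comparable.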

Re: How to partition a file to smaller size for performing KNN in hadoop mapreduce

2015-01-14 Thread Ted Dunning
Have you considered implementing this using something like Spark? That could be much easier than raw map-reduce. On Wed, Jan 14, 2015 at 10:06 PM, unmesha sreeveni unmeshab...@gmail.com wrote: In a KNN-like algorithm we need to load model data into cache for predicting the records. Here is the

Re: DTW distance measure and K-medioids, Hierarchical clustering

2015-01-10 Thread Ted Dunning
On Sat, Jan 10, 2015 at 3:02 AM, Marko Dinic marko.di...@nissatech.com wrote: For example, mean of two sinusoids while one of them is shifted by Pi is 0. And that's definitely not a good centroid in my case. Well, if you think that phase shifts represent small distance proportional to phase

Re: DTW distance measure and K-medioids, Hierarchical clustering

2015-01-09 Thread Ted Dunning
with others in cluster (some kind of centroid/medoid). What do you think about this approach and about the scalability? I would highly appreciate your answer, thanks. On Thu 08 Jan 2015 08:19:18 PM CET, Ted Dunning wrote: On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di

Re: DTW distance measure and K-medioids, Hierarchical clustering

2015-01-08 Thread Ted Dunning
On Thu, Jan 8, 2015 at 7:00 AM, Marko Dinic marko.di...@nissatech.com wrote: 1) Is there an implementation of DTW (Dynamic Time Warping) in Mahout that could be used as a distance measure for clustering? No. 2) Why isn't there an implementation of K-medoids in Mahout? I'm guessing that

Re: consistency of StaticWordValueEncoder

2015-01-07 Thread Ted Dunning
On Wed, Jan 7, 2015 at 2:20 PM, chirag lakhani chirag.lakh...@gmail.com wrote: In the Mahout in Action book I got the impression that the term memo will seed the random number generator and I wanted to confirm that means I will have consistency if I deploy this vectorizer in both my Hadoop

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 7:39 AM, AlShater, Hani halsha...@souq.com wrote: @Ted, It is a small 3-node cluster for a POC. The Spark executor is given 2g and YARN is configured accordingly. I am trying to avoid Spark memory caching. Have you tried the map-reduce version?

Re: spark-itemsimilarity out of memory problem

2014-12-23 Thread Ted Dunning
On Tue, Dec 23, 2014 at 9:16 AM, Pat Ferrel p...@occamsmachete.com wrote: To use the hadoop mapreduce version (Ted’s suggestion) you’ll lose the cross-cooccurrence indicators and you’ll have to translate your IDs into Mahout IDs. This means mapping user and item IDs from your values into

Re: spark-itemsimilarity out of memory problem

2014-12-22 Thread Ted Dunning
Can you say what kind of cluster you have? How many machines? How much memory? How much memory is given to Spark? On Sun, Dec 21, 2014 at 11:44 PM, AlShater, Hani halsha...@souq.com wrote: Hi All, I am trying to use spark-itemsimilarity on 160M user interactions dataset. The job launches

Re: Question about choice of a recommender

2014-12-16 Thread Ted Dunning
How much data are you going to be collecting? How many users and how many presentations per user? Are you saying that the products for each video are completely fixed? Does the same product appear for more than one video? Do users interact with products outside of the narrow confines that you

Re: Collaborative filtering item-based in mahout - without isolating users

2014-12-11 Thread Ted Dunning
Natalia, It sounds like you are starting from the assumption that ratings are being done. This can happen, but in production recommendation settings, ratings are typically a very low-value input because the meaning of a rating is very complex and because so few users actually do ratings unless

Re: Process UnStructured Data in Mahout for Clustering

2014-12-05 Thread Ted Dunning
On Thu, Dec 4, 2014 at 5:38 AM, Shahid Shaikh shaikhshah...@gmail.com wrote: i see the problem is with the way data is written What exactly do you mean by this?

Re: User based recommender

2014-12-05 Thread Ted Dunning
etc) Maybe location,sales per item(similarity might lead to knowledge of people who share same purchasing patterns) etc. On Wed, Dec 3, 2014 at 5:28 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Wed, Dec 3, 2014 at 6:22 AM, Yash Patel yashpatel1...@gmail.com wrote: I have multiple

Re: User based recommender

2014-12-04 Thread Ted Dunning
On Wed, Dec 3, 2014 at 6:22 AM, Yash Patel yashpatel1...@gmail.com wrote: I have multiple different columns such as category, shipping location, item price, online user, etc. How can I use all these different columns and improve recommendation quality (i.e. calculate more precise similarity

Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
On Sat, Nov 29, 2014 at 8:31 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote: Since density-based clustering algorithms are being utilised extensively, especially by the GIS research groups, it is a bit sad that there isn't a Map Reduce implementation available. I think I

Re: DBSCAN implementation in Mahout

2014-11-30 Thread Ted Dunning
'. I think scalability should not be an issue for a Map Reduce implementation. Chirag Nagpal University of Pune, India www.chiragnagpal.com From: Ted Dunning ted.dunn...@gmail.com Sent: Sunday, November 30, 2014 6:29 PM To: user@mahout.apache.org

Re: Mahout 0.7 ALS Recommender: java.lang.Exception: java.lang.RuntimeException: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

2014-11-24 Thread Ted Dunning
The error message that you got indicated that some input was textual and needed to be an integer. Is there a chance that the type of some of your input is incorrect in your sequence files? On Mon, Nov 24, 2014 at 3:47 PM, Ashok Harnal ashokhar...@gmail.com wrote: Thanks for reply. I did not

Re: Bi-Factorization vs Tri-Factorization for recommender systems

2014-11-24 Thread Ted Dunning
There is no inherent mathematical difference, but there may be some pretty significant practical differences. Using the three-matrix form (X = USV') puts the normalization constants into a place where you can control them a bit more easily. This can be useful if you want *both* user and item vectors

Re: Re: why rbm was removed from mahout?

2014-11-09 Thread Ted Dunning
Check out H2O. http://0xdata.com/ On Mon, Nov 10, 2014 at 1:38 AM, zhonghong...@yy.com zhonghong...@yy.com wrote: So are there any scalable RBMs available? I'm going to implement a recommender based on it. From: Ted Dunning Date: 2014-11-10 15:34 To: user@mahout.apache.org Subject: Re

Re: Why do most algorithms use sequencefile as input and output?

2014-11-04 Thread Ted Dunning
What should the input be? On Tue, Nov 4, 2014 at 12:28 AM, Lee S sle...@gmail.com wrote: Hi all: I'm wondering why the input and output of most algorithms, like kmeans and naivebayes, are all sequencefiles. One more conversion step needs to be done if we want the algorithm to work. And I think

Re: Why do most algorithms use sequencefile as input and output?

2014-11-04 Thread Ted Dunning
in vector (dense or sparse) format, so a conversion step needs to be done before algorithms deal with data. Is that right? 2014-11-04 23:56 GMT+08:00 Ted Dunning ted.dunn...@gmail.com: What should the input be? On Tue, Nov 4, 2014 at 12:28 AM, Lee S sle...@gmail.com wrote: Hi all

Re: using Mahout to classify customer service and sales emails?

2014-10-26 Thread Ted Dunning
process from scratch or can it be done incrementally? Best, Mahesh.B. On Thu, Oct 23, 2014 at 1:13 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Mahout can do this. Pro: MapR classifiers are pretty easy to integrate because of a simple API. Con: The state of the art

Re: Mahout Vs Spark

2014-10-24 Thread Ted Dunning
. The Python API uses the standard CPython implementation, and can call into existing C libraries for Python such as NumPy. On Thu, Oct 23, 2014 at 1:11 PM, Ted Dunning ted.dunn...@gmail.com wrote: Hmmm I don't think that the array formats used by Spark are compatible

Re: Mahout Vs Spark

2014-10-23 Thread Ted Dunning
vibhanshugs...@gmail.com wrote: Actually, Spark is available in Python also, so users of Spark have the upper hand over traditional users of Mahout. This is applicable to all the libraries of Python (including NumPy). On Wed, Oct 22, 2014 at 3:54 AM, Ted Dunning ted.dunn

Re: using Mahout to classify customer service and sales emails?

2014-10-22 Thread Ted Dunning
Yes. Mahout can do this. Pro: MapR classifiers are pretty easy to integrate because of a simple API. Con: The state of the art with MapR classifiers is pretty far behind the rest-of-the-world state of the art. On Tue, Oct 21, 2014 at 5:26 PM, Si Chen sic...@opensourcestrategies.com wrote:

Re: Upgrade to Spark 1.1.0?

2014-10-21 Thread Ted Dunning
On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel p...@occamsmachete.com wrote: The problem is not in building Spark it is in building Mahout using the correct Spark jars. If you are using CDH and hadoop 2 the correct jars are in the repos. This should be true for MapR as well.

Re: Mahout Vs Spark

2014-10-21 Thread Ted Dunning
On Tue, Oct 21, 2014 at 3:04 PM, Mahesh Balija balijamahesh@gmail.com wrote: I am trying to differentiate between Mahout and Spark, here is the small list,
Features                  Mahout  Spark
Clustering                Y       Y
Classification            Y       Y
Regression                Y       Y
Dimensionality Reduction  Y       Y
Java                      Y       Y
Scala                     N       Y

Re: Upgrade to Spark 1.1.0?

2014-10-19 Thread Ted Dunning
On Sun, Oct 19, 2014 at 1:49 PM, Pat Ferrel p...@occamsmachete.com wrote: Getting off the dubious Spark 1.0.1 version is turning out to be a bit of work. Does anyone object to upgrading our Spark dependency? I’m not sure if Mahout built for Spark 1.1.0 will run on 1.0.1 so it may mean

Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Ted Dunning
On Mon, Oct 13, 2014 at 11:56 AM, Reinis Vicups mah...@orbit-x.de wrote: I have my own implementation of SimilarityAnalysis and by tuning number of tasks I have reached HUGE performance gains. Since I couldn't find how to pass the number of tasks to shuffle operations directly, I have set

Re: Mahout 1.0: parallelism/number tasks during SimilarityAnalysis.rowSimilarity

2014-10-13 Thread Ted Dunning
On Mon, Oct 13, 2014 at 12:32 PM, Reinis Vicups mah...@orbit-x.de wrote: Do you think that simply increasing this parameter is a safe and sane thing to do? Why would it be unsafe? In my own implementation I am using 400 tasks on my 4-node-2cpu cluster and the execution times of largest

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-10-01 Thread Ted Dunning
me some pointers on how I can apply it in this setting? Thanks, Rohit On Tue, Sep 30, 2014 at 4:33 PM, Ted Dunning ted.dunn...@gmail.com wrote: This is an incredibly tiny dataset. If you delete singletons, it is likely to get significantly smaller. I think that something like LDA

Re: word weights using BM25

2014-10-01 Thread Ted Dunning
. I'll try to share something if I succeed. Arian Pasquali http://about.me/arianpasquali 2014-09-24 5:12 GMT+01:00 Suneel Marthi suneel.mar...@gmail.com: Lucene 4.x supports Okapi BM25. So it should be easy to implement. On Tue, Sep 23, 2014 at 11:57 PM, Ted Dunning ted.dunn

Re: word weights using BM25

2014-10-01 Thread Ted Dunning
On Wed, Oct 1, 2014 at 7:52 AM, Arian Pasquali ar...@arianpasquali.com wrote: My dataset is a collection of documents in German and I can say that the scores seem better compared to my TFIDF scores. Results make more sense now, especially my bi-grams. OK. I will take note.

Re: how to get recommendations by using user-user correlation for the given table in this mail

2014-09-29 Thread Ted Dunning
I would recommend that you look at actions other than ratings as well. Did a user expand and read 1 review? Did they read 3 reviews? Did they mark a rating as useful? Did they ask for contact information? You know your system better than I possibly could, but using other information in

Re: Cosine Similarity and LogLikelihood not helpful for implicit feedback!

2014-09-29 Thread Ted Dunning
How are you using LLR to compute user similarity? It is normally used to compute item similarity. Also, what is your scale? How many users? How many items? How many actions per user? On Mon, Sep 29, 2014 at 6:24 PM, Parimi Rohit rohit.par...@gmail.com wrote: Hi, I am exploring a

Re: LogLikelihoodSimilarity calculation

2014-09-26 Thread Ted Dunning
again! On Sun, Sep 21, 2014 at 10:06 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Fri, Sep 19, 2014 at 3:29 AM, mario.al...@gmail.com wrote: So my question was -shouldn't we consider both the frequency distribution of item sales *and* of users purchases in the same formula

Re: Performance of RowSimilarityJob

2014-09-26 Thread Ted Dunning
Can you say how many words you are seeing? How many unique bigrams? As Suneel asked, which version of Mahout? On Fri, Sep 26, 2014 at 1:23 PM, Burke Webster bu...@collectiveip.com wrote: I've been implementing the RowSimilarityJob on our 40-node cluster and have run into some serious
