Re: Using mahout for pre-defined clusters

2012-08-01 Thread Biju Balakrishnan
Hi Salman, I want to create clusters that represent which company the news belongs to, e.g. if the news says Apple launches new iPhone, I want this to be in the Apple cluster. Similarly, if the news says Microsoft share price rises by 10%, I want it to be in the Microsoft cluster. I have a list

Re: ERROR: OutOfMemoryError: Java heap space

2012-08-01 Thread pricila rr
Even after changing the memory settings, the error continues. What else can I do? And how can I split the file? With smaller files the error does not occur. I'm using Mahout and Hadoop on Linux machines, with one master and two slaves. Thank you. 2012/7/28 Anandha L Ranganathan analog.s...@gmail.com You

RE: cmdump

2012-08-01 Thread Sam Hodgson
Unfortunately I don't know any Java as yet; I'm using PHP, so I'm going to have to pipe the output to a file and extract what I need from that. Messy but it should work for what I need. Thanks for your input! :) Date: Mon, 30 Jul 2012 20:03:12 -0700 Subject: Re: cmdump From: goks...@gmail.com To:

Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-01 Thread Abramov Pavel
Hello! I have trouble running the seq2sparse example with TFIDF weights. My TF vectors are OK, while the TFIDF vectors are 10 times smaller. It looks like seq2sparse cuts my terms during the TFxIDF step. Document1 in the TF vector has 20 terms, while Document1 in the TFIDF vector has only 2 terms. What is wrong? I

Re: Seq2sparse example produces bad TFIDF vectors while TF vectors are Ok.

2012-08-01 Thread Robin Anil
The TFIDF job is where the document frequency pruning is applied. Try increasing maxDFPercent to 100%. On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel p.abra...@rambler-co.ru wrote: Hello! I have trouble running the seq2sparse example with TFIDF weights. My TF vectors are OK, while the TFIDF vectors
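
To illustrate why document frequency pruning can shrink a TFIDF vector, here is a minimal Java sketch. It is a simplified model, not Mahout's actual implementation: the class name `DfPruningSketch`, the method `prune`, and the sample counts are all made up for illustration. It only assumes that a term is dropped when the share of documents containing it exceeds the maxDFPercent threshold.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: a simplified model of max-document-frequency pruning, not Mahout's code. */
final class DfPruningSketch {

    /**
     * Keeps the terms whose document frequency, expressed as a percentage of
     * all documents, does not exceed maxDfPercent.
     */
    static Map<String, Integer> prune(Map<String, Integer> docFreq, int numDocs, int maxDfPercent) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            double dfPercent = 100.0 * entry.getValue() / numDocs;
            if (dfPercent <= maxDfPercent) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical document frequencies in a corpus of 10 documents.
        Map<String, Integer> docFreq = new LinkedHashMap<>();
        docFreq.put("the", 10);
        docFreq.put("mahout", 4);
        docFreq.put("lanczos", 1);

        // With a threshold of 50, the term found in every document is removed.
        System.out.println(prune(docFreq, 10, 50).keySet());
        // With a threshold of 100, no term can exceed it, so nothing is removed.
        System.out.println(prune(docFreq, 10, 100).keySet());
    }
}
```

In a small or very homogeneous corpus many terms appear in most documents, so a strict threshold can remove most of each vector, which would match the symptom described in the question.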

Clustering or Classification?

2012-08-01 Thread Salman Mahmood
Hi all, I am stuck on whether to apply classification or clustering to the data set I have. The more I think about it, the more confused I get. Here's what I am confronted with. I have got news documents (around 3000 and continuously increasing) containing news about companies, investment,

Re: Clustering or Classification?

2012-08-01 Thread Sean Owen
Classifiers are supervised learning algorithms, so you need to provide a bunch of examples of positive and negative classes. In your example, it would be fine to label a bunch of articles as about Apple or not, then use feature vectors derived from TF-IDF as input, with these labels, to train a

Re: Clustering or Classification?

2012-08-01 Thread Salman Mahmood
Hi Sean, Thank you for the clarification. So are you saying that Mahout is not suitable in this case, or that clustering is not the right way to go and, if it's worth it, I should go for classification? Secondly, are you the same Sean Owen who wrote Mahout in Action? :) On Wed, Aug 1, 2012

Re: Clustering or Classification?

2012-08-01 Thread Sean Owen
I'm suggesting that classification sounds like the right solution for the problem you have described. You can use Mahout (or anything else that classifies) for that. Yes I am the same. On Wed, Aug 1, 2012 at 6:50 PM, Salman Mahmood salman...@gmail.com wrote: Hi Sean, Thank you for the

Re: Clustering or Classification?

2012-08-01 Thread syed kather
Hi Salman Mahmood, why don't you try applying clustering first? Once you have applied high-level clustering, check the top terms. You can skip the clusters that look good and look more closely at the clusters that seem confused. Once you have found that all the clusters are fine. To

Re: Clustering or Classification?

2012-08-01 Thread syed kather
Sorry, I had not seen Sean Owen's post as it had not updated on my mobile. Syed Abdul Kather, sent from Samsung S3 On Aug 1, 2012 11:32 PM, syed kather in.ab...@gmail.com wrote: Hi Salman Mahmood, why don't you try applying clustering first? Once you have applied high-level clustering, check the top terms

Re: Clustering or Classification?

2012-08-01 Thread John Conwell
Here is an article I ran across a few weeks ago that I think describes what you're after (at least at a high level): http://blog.getprismatic.com/blog/2012/4/17/clustering-related-stories.html On Wed, Aug 1, 2012 at 10:08 AM, Salman Mahmood salman...@gmail.com wrote: Hi all, I am stuck between

Re: performance study

2012-08-01 Thread Dmitriy Lyubimov
I only know of comparisons of parallel algorithms. There's a performance and accuracy comparison between Mahout's SSVD and Lanczos in the dissertation of N. Halko (see the link on the SSVD page of the Mahout wiki). There's also a Heigen SVD paper that discusses a distributed modified Lanczos method of a

Re: Kmeans algorithm Error

2012-08-01 Thread kiran kumar
No, it is not in the out.txt file. The out.txt file basically contains the vectors, and the same command works on another machine. I suspect some issue in the Hadoop jar file. It runs the df command and tries to parse the header information. I am not sure what the reason is. Thanks, Kiran

Re: performance study

2012-08-01 Thread Ted Dunning
I would like to endorse this point. If your sparse data fits in memory on a single machine, it is very unlikely that you will be able to improve on the cost of doing a stochastic projection on that one machine using any Hadoop based solution. Even with MPI and crazy RDMA networking, I doubt that

MongoDBDataModel doesn't work?

2012-08-01 Thread Winnu Ayi Satria
Hi all, I am trying to combine MongoDB and Mahout using the same code as in the Mahout in Action book, chapter 2 (the very first code listing). But now the user-item-preference data comes from MongoDB instead of a CSV file. So the model is instantiated from MongoDBDataModel, not FileDataModel anymore.

Re: MongoDBDataModel doesn't work?

2012-08-01 Thread Sean Owen
If the data is 'really' there in the DataModel you seem to have ruled out all the differences. ;) I imagine there is something slightly amiss. Can you step through with a debugger to see what the UserSimilarity calculates? Look at what data it gets and see if it makes sense. If it seems to,

Unable to find KMeans Cluster class

2012-08-01 Thread Abhinav M Kulkarni
Hi, I have the following code snippet from the book 'Hadoop in Action': Vector vec = vectors.get(i); Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure()); I am unable to find a Cluster class anywhere with a constructor as above. In fact, under package org.apache.mahout.clustering

Re: Unable to find KMeans Cluster class

2012-08-01 Thread Sean Owen
That may be a typo in the book. I don't know if it was non-abstract in the past. But try against version 0.5 to be sure. I don't know what the replacement code is if so but someone else here likely does. On Wed, Aug 1, 2012 at 9:20 PM, Abhinav M Kulkarni abhinavkulka...@gmail.com wrote: Hi,

UUID based user IDs

2012-08-01 Thread Matt Mitchell
Question about dealing with UUIDs as Mahout user IDs. I'm considering ways to deal with these values: 1. use getLeastSignificantBits 2. re-map to a database auto-increment number (this would take a very long time to do?) 3. customize Mahout so that it accepts UUIDs as user IDs Any feedback here?

Re: UUID based user IDs

2012-08-01 Thread Sean Owen
Yep, just hash to a long, from UUID or String or whatever. The occasional collision does not cause a real problem. If you mix the tastes of two users or items once in a billion times, the overall results will hardly be different. You have to maintain the reverse mapping of course. Look at the
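
The idea of hashing an arbitrary ID to a long while keeping the reverse mapping can be sketched in plain Java as follows. This is an illustration under assumptions, not a Mahout API: the class `IdMappingSketch` and its methods are hypothetical names, and taking the first 8 bytes of an MD5 digest is just one reasonable choice of hash.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: hashes string IDs to longs and remembers the reverse mapping in memory. */
final class IdMappingSketch {

    // Reverse mapping from the hashed long back to the original string ID.
    private final Map<Long, String> longToString = new HashMap<>();

    /** Hashes the string ID to a long and records the pair for later lookup. */
    long toLongId(String stringId) {
        long id = hash(stringId);
        longToString.put(id, stringId);
        return id;
    }

    /** Returns the original string ID, or null if this long was never produced here. */
    String toStringId(long longId) {
        return longToString.get(longId);
    }

    /** Packs the first 8 bytes of the MD5 digest of the UTF-8 bytes into a long. */
    static long hash(String value) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (md5[i] & 0xFFL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the digests every Java platform must provide.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        IdMappingSketch mapping = new IdMappingSketch();
        String userId = "123e4567-e89b-12d3-a456-426614174000"; // made-up example ID
        long id = mapping.toLongId(userId);
        System.out.println(userId + " -> " + id + " -> " + mapping.toStringId(id));
    }
}
```

An in-memory map is the simplest way to keep the reverse mapping; for a large user base it would more likely live in a database table.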

Re: UUID based user IDs

2012-08-01 Thread Matt Mitchell
Thanks Sean! That all makes sense. Would you mind recommending a hashing function for this? Is there something in Mahout I could use? - Matt On Wed, Aug 1, 2012 at 4:34 PM, Sean Owen sro...@gmail.com wrote: Yep, just hash to a long, from UUID or String or whatever. The occasional collision does

Re: UUID based user IDs

2012-08-01 Thread Sean Owen
No, but I'd recommend XORing the top 64 bits with the bottom 64 bits, something simple like that. On Wed, Aug 1, 2012 at 9:40 PM, Matt Mitchell goodie...@gmail.com wrote: Thanks Sean! That all makes sense. Would you mind recommending a hashing function for this? Is there something in Mahout I
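
The XOR suggestion can be written with the standard `java.util.UUID` accessors. A minimal sketch follows; the class name `UuidToLongSketch` and the sample UUID are invented for illustration.

```java
import java.util.UUID;

/** Illustrative only: folds a 128-bit UUID into a 64-bit long by XORing its two halves. */
final class UuidToLongSketch {

    /** XORs the top 64 bits of the UUID with the bottom 64 bits. */
    static long toLongId(UUID uuid) {
        return uuid.getMostSignificantBits() ^ uuid.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        UUID uuid = UUID.fromString("123e4567-e89b-12d3-a456-426614174000"); // made-up example
        System.out.println(uuid + " -> " + toLongId(uuid));
    }
}
```

Unlike using getLeastSignificantBits alone, the XOR lets all 128 bits influence the result, although distinct UUIDs can still collide.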

Re: Unable to find KMeans Cluster class

2012-08-01 Thread Abhinav M Kulkarni
Okay, I used the Kluster class under the org.apache.mahout.clustering.kmeans package. This implements the Cluster interface. On 08/01/2012 01:25 PM, Sean Owen wrote: That may be a typo in the book. I don't know if it was non-abstract in the past. But try against version 0.5 to be sure. I don't know what

Re: MongoDBDataModel doesn't work?

2012-08-01 Thread Winnu Ayi Satria
After stepping through with a debugger, I can confirm that the simple code from the Mahout in Action book works with MongoDBDataModel. Actually it is a trivial problem: the actual userID in MongoDB or the CSV file is different from the userID inside MongoDBDataModel. The same goes for the itemID. For example:

Re: ERROR: OutOfMemoryError: Java heap space

2012-08-01 Thread Lance Norskog
If you are on Unix, and you want to split your text on line boundaries, the 'split' program will create many files with the same number of lines. On Wed, Aug 1, 2012 at 5:29 AM, pricila rr pricila...@gmail.com wrote: Even changing the memory settings the error continues. What else can I do? And

Re: UUID based user IDs

2012-08-01 Thread Manuel Blechschmidt
Hello Matt, On 01.08.2012, at 22:40, Matt Mitchell wrote: Thanks Sean! That all makes sense. Would you mind recommending a hashing function for this? Is there something in Mahout I could use? The following class uses a string-to-long mapping based on a MemoryIDMigrator:

Re: UUID based user IDs

2012-08-01 Thread Matt Mitchell
Thanks Manuel, that's very helpful. So you're saying I can just use MemoryIDMigrator, even after my preferences have been created with UUID values? Or, should I create my preferences using the MemoryIDMigrator? - Matt On Wed, Aug 1, 2012 at 8:49 PM, Manuel Blechschmidt manuel.blechschm...@gmx.de

Question about recommender database drivers

2012-08-01 Thread Matt Mitchell
Hi, The data I'm using to generate preferences happens to be in a Solr index. Would it be feasible, or make any sense, to write an adapter so that I can use Solr to store the preferences as well? The Solr instance could be embedded since this is all Java, and would probably end up being pretty

Re: Kmeans algorithm Error

2012-08-01 Thread Paritosh Ranjan
The input should be a sequence file. Maybe that's the error. On 01-08-2012 22:30, Kate Ericson wrote: Hi, From the error message, it's tripping over 1K-blocks when it's expecting a long. Is that somewhere in your input file (F:/docsite/CIIndex/index/out.txt)? Or perhaps part of your hadoop

Re: Clustering or Classification?

2012-08-01 Thread Paritosh Ranjan
Would it help if you find clusters and map top terms to the categories? I think mapping terms to categories will need to be a manual process, as no software will be able to map iPhone to Apple by itself. So, having a term-to-category mapping beforehand, and using this mapping on cluster's
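
One way to read this suggestion is as a vote over a cluster's top terms using a hand-built term-to-category table. The Java sketch below shows that reading only; the class `ClusterLabelSketch`, the method `label`, the sample mapping, and the majority-vote rule are all assumptions rather than anything Mahout provides.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative only: labels a cluster by majority vote of its top terms over a manual mapping. */
final class ClusterLabelSketch {

    /**
     * Returns the category that most of the mapped top terms vote for,
     * or null if none of the top terms appears in the mapping.
     * On a tie, the category that reached the winning count first is kept.
     */
    static String label(List<String> topTerms, Map<String, String> termToCategory) {
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestVotes = 0;
        for (String term : topTerms) {
            String category = termToCategory.get(term.toLowerCase(Locale.ROOT));
            if (category == null) {
                continue; // term is not in the manual mapping
            }
            int count = votes.merge(category, 1, Integer::sum);
            if (count > bestVotes) {
                bestVotes = count;
                best = category;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical, manually curated mapping with lower-case keys.
        Map<String, String> termToCategory = new HashMap<>();
        termToCategory.put("iphone", "Apple");
        termToCategory.put("ipad", "Apple");
        termToCategory.put("windows", "Microsoft");

        List<String> topTerms = Arrays.asList("iPhone", "launch", "iPad", "Windows");
        System.out.println(label(topTerms, termToCategory));
    }
}
```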

Re: Clustering or Classification?

2012-08-01 Thread Biju Balakrishnan
Hi Salman, I have got news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, economy, quarterly income etc. My goal is to have the news sorted in such a way that I know which news corresponds to which company, e.g. for the news item Apple