Hi Salman,
I want to create clusters that represent which company each news item belongs
to. E.g. if the news says "Apple launches new iPhone", I want this to be in the
Apple cluster. Similarly, if the news says "Microsoft share price rises by
10%", I want it to be in the Microsoft cluster. I have a list
Even after changing the memory settings, the error continues. What else can I do?
And how can I split the file? The error does not occur with smaller files.
I'm using Mahout and Hadoop on Linux machines, with one master and two
slaves.
Thank you.
2012/7/28 Anandha L Ranganathan analog.s...@gmail.com
You
Unfortunately I don't know any Java as yet; I'm using PHP, so I'm going to have
to pipe the output to a file and extract what I need from that. Messy, but it
should work for what I need.
Thanks for your input! :)
Date: Mon, 30 Jul 2012 20:03:12 -0700
Subject: Re: cmdump
From: goks...@gmail.com
To:
Hello!
I have trouble running the seq2sparse example with TF-IDF weights. My TF
vectors are OK, while the TF-IDF vectors are 10 times smaller. It looks like
seq2sparse cuts my terms during the TFxIDF step. Document1 in the TF vectors
has 20 terms, while Document1 in the TF-IDF vectors
has only 2 terms. What is wrong? I
The tf-idf job is where document-frequency pruning is applied. Try
increasing maxDFPercent to 100%.
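The pruning Robin describes can be illustrated with a toy, pure-Java sketch. This is not Mahout's actual code; the class, method names, and corpus below are made up to show why a document-frequency cutoff removes terms that appear in too many documents:

```java
import java.util.*;

public class DfPruningSketch {
    // Toy illustration (not Mahout code): keep only terms whose document
    // frequency is at most maxDFPercent of the corpus size.
    static Set<String> keptTerms(List<Set<String>> docs, int maxDFPercent) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs)
            for (String term : doc)
                df.merge(term, 1, Integer::sum);
        Set<String> kept = new HashSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet())
            if (100.0 * e.getValue() / docs.size() <= maxDFPercent)
                kept.add(e.getKey());
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("the", "apple", "iphone")),
            new HashSet<>(Arrays.asList("the", "microsoft", "windows")),
            new HashSet<>(Arrays.asList("the", "apple", "stock")));
        // "the" appears in 100% of documents: pruned at 50, kept at 100.
        System.out.println(keptTerms(docs, 50));
        System.out.println(keptTerms(docs, 100));
    }
}
```

With a low cutoff, common terms vanish from every document's vector, which is one way a 20-term TF vector can shrink to a 2-term TF-IDF vector.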
On Wed, Aug 1, 2012 at 11:22 AM, Abramov Pavel p.abra...@rambler-co.ru wrote:
Hello!
I have trouble running the seq2sparse example with TF-IDF weights. My TF
vectors are OK, while the TF-IDF vectors
Hi all,
I am stuck between a decision to apply classification or clustering to the
data set I have. The more I think about it, the more confused I get. Here's
what I am confronted with.
I have got news documents (around 3000 and continuously increasing)
containing news about companies, investment,
Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as about Apple or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a
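Sean's advice above (labeled examples plus TF-IDF-style feature vectors) can be sketched with a toy perceptron in plain Java. Everything here is made up for illustration; it is not Mahout's classifier API, and a real setup would use proper TF-IDF vectors rather than raw word counts:

```java
import java.util.*;

public class ToyClassifierSketch {
    // Hypothetical tiny vocabulary; real feature vectors would come
    // from TF-IDF over the whole corpus.
    static final String[] VOCAB = {"apple", "iphone", "microsoft", "windows"};

    // Bag-of-words feature vector over the toy vocabulary.
    static double[] featurize(String text) {
        double[] v = new double[VOCAB.length];
        for (String tok : text.toLowerCase().split("\\s+"))
            for (int i = 0; i < VOCAB.length; i++)
                if (VOCAB[i].equals(tok)) v[i] += 1;
        return v;
    }

    // 1 = "about Apple", 0 = "not about Apple".
    static int predict(double[] w, double[] x) {
        double s = 0;
        for (int i = 0; i < x.length; i++) s += w[i] * x[i];
        return s > 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        String[] docs = {"Apple launches new iPhone", "Microsoft updates Windows",
                         "iPhone sales boost Apple", "Windows is a Microsoft product"};
        int[] labels = {1, 0, 1, 0}; // hand-labeled training examples
        double[] w = new double[VOCAB.length];
        // Perceptron updates: nudge weights whenever a prediction is wrong.
        for (int epoch = 0; epoch < 10; epoch++)
            for (int d = 0; d < docs.length; d++) {
                double[] x = featurize(docs[d]);
                int err = labels[d] - predict(w, x);
                for (int i = 0; i < w.length; i++) w[i] += err * x[i];
            }
        System.out.println(predict(w, featurize("new Apple iPhone announced")));
    }
}
```

The point is the workflow, not the algorithm: hand-label documents, derive feature vectors, train, then classify new articles.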
Hi Sean,
Thank you for the clarification. So are you saying that Mahout is not
suitable in this case, or did you say clustering is not the right way to go,
and if it's worth it, I should go for classification?
Secondly are you the same Sean Owen who wrote Mahout in Action? :)
On Wed, Aug 1, 2012
I'm suggesting that classification sounds like the right solution for
the problem you have described. You can use Mahout (or anything else
that classifies) for that. Yes I am the same.
On Wed, Aug 1, 2012 at 6:50 PM, Salman Mahmood salman...@gmail.com wrote:
Hi Sean,
Thank you for the
Hi salman mahmood,
Why don't you try applying clustering first? Once you have applied high-level
clustering, check the top terms. Set aside the clusters that look good and
look into the clusters that seem confused. Once you find
that all the clusters are fine . To
Sorry, I had not seen Sean Owen's post, as it had not updated on my mobile.
Syed Abdul Kather
Sent from Samsung S3
On Aug 1, 2012 11:32 PM, syed kather in.ab...@gmail.com wrote:
Hi salman mahmood,
Why don't you try applying clustering first? Once you have applied
high-level clustering, check the top terms
Here is an article I ran across a few weeks ago that I think describes what
you're after (at least at a high level):
http://blog.getprismatic.com/blog/2012/4/17/clustering-related-stories.html
On Wed, Aug 1, 2012 at 10:08 AM, Salman Mahmood salman...@gmail.com wrote:
Hi all,
I am stuck between
I only know of comparisons of parallel algorithms. There's a
performance and accuracy comparison between Mahout's SSVD and Lanczos
in N. Halko's dissertation (see the link on the SSVD page of the Mahout
wiki). There's also a HEigen SVD paper that discusses a distributed
modified Lanczos method of a
No, it is not there in the out.txt file. The out.txt file basically contains the
vectors, and the same command works on another machine. I am thinking it is some
issue in the Hadoop jar file. It runs the command df and tries to parse the
header information. I am not sure what the reason is.
Thanks,
Kiran
I would like to endorse this point.
If your sparse data fits in memory on a single machine, it is very unlikely
that you will be able to improve on the cost of doing a stochastic
projection on that one machine using any Hadoop based solution.
Even with MPI and crazy RDMA networking, I doubt that
Hi all,
I am trying to combine MongoDB and Mahout using the same code as in the Mahout
in Action book, chapter 2 (the very first code example). But now I have replaced
the source of the user-item preferences: it no longer comes from a CSV file but
from MongoDB, so the model is instantiated from MongoDBDataModel, not
FileDataModel anymore.
If the data is 'really' there in the DataModel, you seem to have ruled
out all the differences. ;) I imagine there is something slightly
amiss. Can you step through with a debugger to see what the
UserSimilarity calculates? Look at what data it gets and see if it makes
sense. If it seems to,
Hi,
I have the following code snippet from the book 'Hadoop in Action':
Vector vec = vectors.get(i);
Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
I am unable to find a Cluster class anywhere with a constructor like the
above. In fact, under the package
org.apache.mahout.clustering
That may be a typo in the book. I don't know if it was non-abstract in the
past. But try against version 0.5 to be sure. I don't know what the
replacement code is if so but someone else here likely does.
On Wed, Aug 1, 2012 at 9:20 PM, Abhinav M Kulkarni
abhinavkulka...@gmail.com wrote:
Hi,
Question about dealing with UUIDs as Mahout user IDs. I'm considering
ways to deal with these values:
1. use getLeastSignificantBits
2. re-map to a database auto-increment number (would this take a very
long time to do?)
3. customize mahout so that it accepts UUIDs as user IDs
Any feedback here?
Yep, just hash to a long, from a UUID or String or whatever. The occasional
collision does not cause a real problem. If you mix the tastes of two users
or items once in a billion times, the overall results will hardly be
different.
You have to maintain the reverse mapping, of course. Look at the
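The reverse mapping Sean mentions can be sketched with a hypothetical helper class in plain Java (this is not a Mahout class; the hash choice here is only a stand-in):

```java
import java.util.*;

public class IdReverseMap {
    // Hypothetical sketch: remember which original String ID produced
    // each hashed long, so results can be translated back afterwards.
    private final Map<Long, String> longToString = new HashMap<>();

    public long toLongID(String id) {
        long hashed = id.hashCode(); // stand-in; any String -> long hash works
        longToString.put(hashed, id);
        return hashed;
    }

    public String toStringID(long id) {
        return longToString.get(id);
    }

    public static void main(String[] args) {
        IdReverseMap map = new IdReverseMap();
        long key = map.toLongID("user-123e4567");
        System.out.println(map.toStringID(key)); // recovers the original ID
    }
}
```

In practice the mapping would need to be persisted alongside the recommender's data, not just held in memory.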
Thanks Sean! That all makes sense. Would you mind recommending a
hashing function for this? Is there something in Mahout I could use?
- Matt
On Wed, Aug 1, 2012 at 4:34 PM, Sean Owen sro...@gmail.com wrote:
Yep, just hash to a long, from UUID or String or whatever. The occasional
collision does
No, but I'd recommend XORing the top 64 bits with the bottom 64 bits,
something simple like that.
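Sean's XOR suggestion maps directly onto `java.util.UUID`, which exposes its two 64-bit halves. A minimal sketch (the sample UUID is arbitrary):

```java
import java.util.UUID;

public class UuidToLong {
    // Fold a 128-bit UUID into a 64-bit long by XORing its top and
    // bottom halves. Collisions are possible but, as discussed above,
    // rare enough not to matter for recommendation quality.
    public static long hash(UUID id) {
        return id.getMostSignificantBits() ^ id.getLeastSignificantBits();
    }

    public static void main(String[] args) {
        UUID id = UUID.fromString("123e4567-e89b-12d3-a456-426614174000");
        // Deterministic: the same UUID always yields the same long ID.
        System.out.println(hash(id));
    }
}
```

The determinism is what matters here: the same user's UUID must always hash to the same Mahout user ID.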
On Wed, Aug 1, 2012 at 9:40 PM, Matt Mitchell goodie...@gmail.com wrote:
Thanks Sean! That all makes sense. Would you mind recommending a
hashing function for this? Is there something in Mahout I
Okay, I used the Kluster class under the org.apache.mahout.clustering.kmeans
package. It implements the Cluster interface.
On 08/01/2012 01:25 PM, Sean Owen wrote:
That may be a typo in the book. I don't know if it was non-abstract in the
past. But try against version 0.5 to be sure. I don't know what
After checking with the debugger, I can confirm that the simple code from the
Mahout in Action book works with MongoDBDataModel. It was actually a
trivial problem: the actual userID in MongoDB or the CSV file is different from
the userID inside MongoDBDataModel. So is the itemID.
for example:
If you are on Unix, and you want to split your text on line
boundaries, the 'split' program will create many files with the same
number of lines.
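For reference, the effect of the Unix `split -l` command can be sketched in plain Java; the class, file names, and chunk-naming scheme below are made up for illustration:

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

public class LineSplitter {
    // How many chunk files a split into linesPerChunk-line pieces produces.
    static int numChunks(int totalLines, int linesPerChunk) {
        return (totalLines + linesPerChunk - 1) / linesPerChunk;
    }

    // Sketch of what `split -l` does: write the input out as numbered
    // chunk files, each holding at most linesPerChunk lines.
    static List<Path> split(Path input, Path outDir, int linesPerChunk) throws IOException {
        List<String> lines = Files.readAllLines(input);
        List<Path> chunks = new ArrayList<>();
        for (int n = 0; n < numChunks(lines.size(), linesPerChunk); n++) {
            int start = n * linesPerChunk;
            Path chunk = outDir.resolve(String.format("part-%04d", n));
            Files.write(chunk, lines.subList(start,
                    Math.min(start + linesPerChunk, lines.size())));
            chunks.add(chunk);
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("split-demo");
        Path input = dir.resolve("big.txt");
        Files.write(input, Arrays.asList("l1", "l2", "l3", "l4", "l5"));
        System.out.println(split(input, dir, 2).size()); // 3 chunk files
    }
}
```

On a real cluster the `split` program itself is the simpler choice; this sketch just shows the line-boundary behaviour.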
On Wed, Aug 1, 2012 at 5:29 AM, pricila rr pricila...@gmail.com wrote:
Even after changing the memory settings, the error continues. What else can I do?
And
Hello Matt,
On 01.08.2012, at 22:40, Matt Mitchell wrote:
Thanks Sean! That all makes sense. Would you mind recommending a
hashing function for this? Is there something in Mahout I could use?
The following class uses a string-to-long mapping based on a MemoryIDMigrator:
Thanks Manuel, that's very helpful. So you're saying I can just use
MemoryIDMigrator, even after my preferences have been created with UUID
values? Or should I create my preferences using the MemoryIDMigrator?
- Matt
On Wed, Aug 1, 2012 at 8:49 PM, Manuel Blechschmidt
manuel.blechschm...@gmx.de
Hi,
The data I'm using to generate preferences happens to be in a Solr
index. Would it be feasible, or make any sense, to write an adapter so
that I can use Solr to store the preferences as well? The Solr
instance could be embedded since this is all Java, and would probably
end up being pretty
The input should be a sequence file. Maybe that's the error.
On 01-08-2012 22:30, Kate Ericson wrote:
Hi,
From the error message, it's tripping over "1K-blocks" when it's
expecting a long.
Is that somewhere in your input file (F:/docsite/CIIndex/index/out.txt)?
Or perhaps part of your Hadoop
Would it help if you found the clusters and mapped their top terms to the
categories? I think mapping terms to categories will need to be a manual
process, as no software will be able to map iPhone to Apple by itself.
So, having a term-to-category mapping beforehand, and using this mapping
on a cluster's
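The manual term-to-company mapping idea could be sketched as follows; the class, the mapping entries, and the majority-vote labeling rule are all hypothetical, chosen just to illustrate the approach:

```java
import java.util.*;

public class TermCategoryMapping {
    // Hand-curated mapping, as the thread suggests: no software can
    // infer that "iphone" means Apple without being told.
    static final Map<String, String> TERM_TO_COMPANY = Map.of(
        "iphone", "Apple", "ipad", "Apple", "macbook", "Apple",
        "windows", "Microsoft", "azure", "Microsoft", "xbox", "Microsoft");

    // Label a cluster by the company whose terms dominate its top terms.
    static String labelCluster(List<String> topTerms) {
        Map<String, Integer> votes = new HashMap<>();
        for (String term : topTerms) {
            String company = TERM_TO_COMPANY.get(term.toLowerCase());
            if (company != null) votes.merge(company, 1, Integer::sum);
        }
        return votes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse("unknown");
    }

    public static void main(String[] args) {
        System.out.println(labelCluster(Arrays.asList("iphone", "launch", "macbook")));
    }
}
```

Clusters whose top terms match no mapped company would stay unlabeled and could be reviewed by hand, which fits the manual-process caveat above.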
Hi Salman
I have got news documents (around 3000 and continuously increasing)
containing news about companies, investment, stocks, the economy, quarterly
income, etc. My goal is to have the news sorted in such a way that I know
which news item corresponds to which company, e.g. for the news item Apple