Scott,
Based on the dictionary output, it looks like the process of generating
vectors from your tokenized text is not working properly. The only term
that's making it into your dictionary is 'java' - everything else is being
filtered out. Furthermore, your tf vectors have a single dimension
Pat,
For what it's worth, in many cases the n-grams with the highest LLR
scores tend to be kinda cruddy too. For example, here are the top few
from the reuters data set after tokenization in preparation for
k-means clustering.
reuter 3	203110.22877580073
mar 1987	108503.63631130551
Hi Sharath,
Just getting back to this -- what is in the reuters/reuters21578
directory? Are they text files of some sort, or are they the
reuters-21578 sgm files from
http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
To answer your original question -- there isn't anything in
Hi Wenyia,
The chunk size property will cause seqdirectory to output smaller
sequence files. Using multiple small files as input will allow a
greater number of map tasks to be run in parallel because each file
will be assigned to its own map task.
In the case of the Reuters example, forcing the
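As a sketch, the chunk size is set when running seqdirectory; the input/output paths and the chunk value below (in MB) are hypothetical choices for illustration, not recommendations:

```shell
# Hypothetical invocation: write the output as ~64MB sequence-file
# chunks so downstream jobs get one map task per chunk.
bin/mahout seqdirectory \
  -i reuters-extracted \
  -o reuters-seqfiles \
  -c UTF-8 \
  -chunk 64
```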
Hi Sharath,
Does the reuters/reuters-vectors-bigram directory contain a
tfidf-vectors directory? If so, try using that as input. If not, what
is in that directory?
This sounds similar to the problem Hector ran into running one of the
examples from the mahout-in-action book.
Thanks,
Drew
On
Sean, I'd be surprised to find out that k-means was busted. It was
working just prior to release 0.5 when I was working on
https://issues.apache.org/jira/browse/MAHOUT-694 which may be related
to Mark's problems, but then again I haven't been tracking the other
patches that were applied around
Jeff,
Could you tell me about what's failing in KMeans and LDA when running
on a cluster? I had this working just prior to 0.5 in
https://issues.apache.org/jira/browse/MAHOUT-694
Thanks,
Drew
On Thu, Jun 9, 2011 at 2:01 PM, Jeff Eastman jeast...@narus.com wrote:
Ahem, KMeans is not busted. It
It is just a warning that can be safely ignored. Are you encountering
some other problem?
On Mon, May 2, 2011 at 5:20 PM, Simon Chu simonchu@gmail.com wrote:
11/05/02 14:17:43 WARN driver.MahoutDriver: No
org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath,
will use
Welcome Dmitry and Shannon! Looking forward to working with both of you.
On Sat, Feb 12, 2011 at 12:12 PM, Grant Ingersoll gsing...@apache.org wrote:
I am pleased to announce that the Mahout PMC has, in recognition of their
continued contributions to Mahout, elected Shannon Quinn and Dmitry
On Sun, Jan 23, 2011 at 11:09 PM, Darren Govoni dar...@ontrenet.com wrote:
Drew,
Thanks for the tip. It works great now!
Great, glad it's working.
P.S. The sort command you suggested doesn't quite sort by LLR score
because it's only a lexical sort, and misses cases where something like 70.000 should be
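To illustrate the difference, here is a lexical versus a general-numeric sort on a tab-separated term/score file (the terms and scores are made up):

```shell
# Made-up term/LLR pairs, tab-separated.
printf 'foo\t70.000\nbar\t9.5\nbaz\t108503.6\n' > llr.txt

# Lexical sort: compares character by character, so "9.5" sorts
# above "70.000" and "108503.6".
sort -t$'\t' -k2,2 -r llr.txt

# General-numeric sort (-g): orders by actual value, highest LLR first.
sort -t$'\t' -k2,2 -gr llr.txt
```

Plain `-n` also works for values like these; `-g` additionally handles scientific notation.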
Hi Darren,
From the error message you receive, it is not exactly clear what is
happening here. I suppose it could be due to the format of the input
sequence file, but I'm not certain.
A couple questions that will help me answer your question:
1) What version of Mahout are you using?
2) How are
2010/12/2 Jure Jeseničnik jure.jesenic...@planet9.si
When running locally, Mahout was only consuming one CPU core. I'm running
it on Win 7 through Cygwin, but it behaved pretty much the same on some proper
Linux machines. How could I make it use all the available CPU power?
IIRC, LocalJobRunner
Per o.a.m.utils.vectors.lucene.TFDFMapper, which is called from
o.a.m.utils.vectors.lucene.Driver, the vectors created are instances
of RandomAccessSparseVector
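As a sketch, that Driver class is exposed through the lucene.vector subcommand; the index directory, field names, and output paths below are hypothetical:

```shell
# Hypothetical invocation of o.a.m.utils.vectors.lucene.Driver:
# pull term vectors for the "body" field out of a Lucene index
# and write them out as RandomAccessSparseVectors.
bin/mahout lucene.vector \
  --dir ./lucene-index \
  --field body \
  --idField id \
  --output ./vectors \
  --dictOut ./dictionary.txt
```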
On Sun, Nov 21, 2010 at 9:28 AM, Mike Perry mikeperrycan...@gmail.com wrote:
Thanks Ted for the answer.
Should be sparse, but I can't
FWIW, Jimmy Lin's book has a chapter on MapReduce-based EM algorithms
(http://www.umiacs.umd.edu/~jimmylin/book.html)
On Mon, Nov 8, 2010 at 8:01 AM, Sebastian Schelter s...@apache.org wrote:
I'm moving a twitter conversation to the mailing list so that it doesn't
vanish in the short-lived
The JIRA issue MAHOUT-520 includes a patch containing a script that
can be used to run the twenty newsgroups example. If the wiki isn't
clear regarding input and output paths, the script should give you a
good idea of what goes where. At the very least you should be able to run
the script and
You can get a preview of the talk from the Booz Allen Hamilton folks
here: http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010
Their talk will be less focused on biometrics per se, though, and
more on general uses of their Fuzzy Table code. They use Mahout canopy
and kmeans to partition
Rosario uclamath...@gmail.com wrote:
Thank you for your help.
I tried dividing the data into two files spam.txt and nonspam.txt
within directory simple_spam,
but still have the same problem. No useful output.
Ryan
On Mon, Oct 4, 2010 at 7:42 PM, Drew Farris d...@apache.org wrote:
Hi Ryan
Hi Ryan,
Your format looks good. The -i argument must point to a directory of
one or more files as input. In the example, the 20newsgroups data is
separated into a single file per class. I'm not certain this is a
requirement, since the class label is in the first column after all.
If you are running
On Thu, Sep 30, 2010 at 10:00 AM, Neil Ghosh neil.gh...@gmail.com wrote:
My question is: if I want to test unknown documents, do I need them in a
specific format, or can I just keep them (as raw text) in the input folder while
testing?
If I interpret your question correctly, you're saying I've
Hi Bhaskar,
Take a look at the latest from svn trunk:
https://svn.apache.org/repos/asf/mahout/trunk/, you'll find the
TrainNewsGroups class in the examples project. It is all pretty new,
so there are no docs on the wiki, but the code is very readable.
If you are interested in working with the
Congratulations!
What's the best way to send messages back to the caller of an EMR job,
using stderr instead of the log framework here?
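For what it's worth, a minimal sketch of the stderr approach from a shell step (the helper name is made up):

```shell
# Hypothetical helper: write a status message to stderr so it reaches
# the job step's stderr log rather than the log framework's output.
report_to_caller() {
  echo "status: $1" >&2
}

report_to_caller "starting pass 1"
```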
On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll gsing...@apache.org wrote:
And indeed, running this via the Ruby CLI works as well. Woo hoo!
-Grant
On
The new location is: http://svn.apache.org/repos/asf/mahout/trunk
On Thu, Sep 2, 2010 at 9:45 AM, Jeff Zhang zjf...@gmail.com wrote:
Thanks Sean, but why this link
http://svn.apache.org/repos/asf/lucene/mahout/trunk is empty ?
Isn't it mahout's official site?
On Thu, Sep 2, 2010 at 1:20 AM,
/2/10 10:04 AM, Drew Farris wrote:
Were there specific issues you ran into? I suspect the documentation
on the wiki is out of date.
Drew
On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersollgsing...@apache.org
wrote:
Has anyone successfully run any of the clustering algorithms on Amazon's
On Mon, Aug 9, 2010 at 4:14 PM, Simon Reavely simon.reav...@gmail.com wrote:
Please note, I suspect that this might be an issue with how I
hacked together my package, since I can't figure out how to create a proper
binary release from src.
I'm not familiar with the taste code, but as far as
Hi Kris,
Could you try the code in the patch at:
https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch
This should cause VectorDumper to emit the names found in NamedVectors.
Thanks,
Drew
On Thu, Jul 1, 2010 at 10:23 AM, Kris Jack mrkrisj...@gmail.com wrote:
Hi Grant,
Manish,
Have you looked at Gephi at all? http://gephi.org
- Drew
On Sun, Jun 27, 2010 at 12:20 PM, Manish Katyal manish.kat...@gmail.comwrote:
Any recommendations on visualization tools for a sparse but large social
network graph?
This is for exploratory analysis of the graph so I need to
On Thu, May 27, 2010 at 2:59 PM, Jake Mannix jake.man...@gmail.com wrote:
Ditto this. I thought we already had one in mahout somewhere too?
Not that I know of.
There are a couple implementations in hbase too, not sure how similar these
are to the one in hadoop: