Re: Problem converting tokenized documents into TFIDF vectors

2014-01-26 Thread Drew Farris
Scott, Based on the dictionary output, it looks like the process of generating vectors from your tokenized text is not working properly. The only term that's making it into your dictionary is 'java' - everything else is being filtered out. Furthermore, your tf vectors have a single dimension

Re: n-gram and ml

2012-06-11 Thread Drew Farris
Pat, For what it's worth, in many cases the n-grams with the highest llr scores tend to be kinda cruddy too. For example, here are the top few from the reuters data set after tokenization in preparation for k-means clustering. reuter 3203110.22877580073 mar 1987108503.63631130551

Re: Exception while testing reuters data

2011-06-29 Thread Drew Farris
Hi Sharath, Just getting back to this -- what is in the reuters/reuters21578 directory? Are they text files of some sort, or are they the reuters-21578 sgm files from http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz? To answer your original question -- there isn't anything in

Re: parameter setting for using Seqdirectory and SequenceFile

2011-06-29 Thread Drew Farris
Hi Wenyia, The chunk size property will cause seqdirectory to output smaller sequence files. Using multiple small files as input will allow a greater number of map tasks to be run in parallel because each file will be assigned to its own map task. In the case of the Reuters example, forcing the
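
The relationship between chunk size and map-task count described above can be sketched with a little arithmetic. This is only a rough illustration, assuming (as the message says) that each output sequence file gets its own map task; the function name and the sizes used are hypothetical, and real Hadoop input splitting may behave differently for large files.

```python
import math

def estimated_map_tasks(total_mb, chunk_mb):
    """Rough estimate: one map task per chunk-sized sequence file.

    Assumption (not from the original thread): input is split evenly
    into files of at most chunk_mb megabytes.
    """
    if chunk_mb <= 0:
        raise ValueError("chunk_mb must be positive")
    return math.ceil(total_mb / chunk_mb)

# Hypothetical numbers: 64 MB of input written as one file vs. 5 MB chunks.
single_file_tasks = estimated_map_tasks(64, 64)
small_chunk_tasks = estimated_map_tasks(64, 5)
```

With these made-up sizes, the smaller chunk size yields more files and therefore more map tasks that can run in parallel.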

Re: Exception while testing reuters data

2011-06-22 Thread Drew Farris
Hi Sharath, Does the reuters/reuters-vectors-bigram directory contain a tfidf-vectors directory? If so, try using that as input. If not, what is in that directory? This sounds similar to the problem Hector ran into running one of the examples from the mahout-in-action book. Thanks, Drew On

Re: Problems running examples

2011-06-09 Thread Drew Farris
Sean, I'd be surprised to find out that k-means was busted. It was working just prior to release 0.5 when I was working on https://issues.apache.org/jira/browse/MAHOUT-694 which may be related to Mark's problems, but then again I haven't been tracking the other patches that were applied around

Re: Problems running examples

2011-06-09 Thread Drew Farris
Jeff, Could you tell me about what's failing in KMeans and LDA when running on a cluster? I had this working just prior to 0.5 in https://issues.apache.org/jira/browse/MAHOUT-694 Thanks, Drew On Thu, Jun 9, 2011 at 2:01 PM, Jeff Eastman jeast...@narus.com wrote: Ahem, KMeans is not busted. It

Re: mahout example warning

2011-05-04 Thread Drew Farris
It is just a warning that can be safely ignored. Are you encountering some other problem? On Mon, May 2, 2011 at 5:20 PM, Simon Chu simonchu@gmail.com wrote: 11/05/02 14:17:43 WARN driver.MahoutDriver: No org.apache.lucene.benchmark.utils.ExtractReuters.props found on classpath, will use

Re: Welcome new committers: Shannon Quinn and Dmitry Lyubimov

2011-02-13 Thread Drew Farris
Welcome Dmitry and Shannon! Looking forward to working with both of you. On Sat, Feb 12, 2011 at 12:12 PM, Grant Ingersoll gsing...@apache.org wrote: I am pleased to announce that the Mahout PMC has, in recognition of their continued contributions to Mahout, elected Shannon Quinn and Dmitry

Re: Running CollocDriver, exception

2011-01-24 Thread Drew Farris
On Sun, Jan 23, 2011 at 11:09 PM, Darren Govoni dar...@ontrenet.com wrote: Drew, Thanks for the tip. It works great now! Great, glad it's working. PS. the sort command you suggested doesn't quite sort by LLR score because it's only a lexical sort and misses something like 70.000 should be
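
The lexical-versus-numeric sorting issue raised above can be illustrated with a short sketch. The input lines here are invented, and the tab-separated "n-gram, score" layout is an assumption for illustration only, not the actual CollocDriver output format.

```python
# Hypothetical "ngram<TAB>score" lines; the layout is assumed, not taken
# from real CollocDriver output.
lines = ["a b\t70.000", "c d\t8.5", "e f\t1200.25"]

def score_text(line):
    # The score as a string, which is what a lexical sort compares.
    return line.rsplit("\t", 1)[1]

def score_value(line):
    # The score parsed as a number, which is what we actually want to compare.
    return float(score_text(line))

# Lexical sort compares character by character, so "8.5" ranks above "70.000".
lexical = sorted(lines, key=score_text, reverse=True)

# Numeric sort orders by the parsed value, highest score first.
numeric = sorted(lines, key=score_value, reverse=True)
```

Sorting on the parsed value puts 1200.25 first, whereas the lexical ordering puts 8.5 first because the character "8" compares greater than "7" and "1".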

Re: Running CollocDriver, exception

2011-01-23 Thread Drew Farris
Hi Darren, From the error message you receive, it is not exactly clear what is happening here. I suppose it could be due to the format of the input sequence file, but I'm not certain. A couple questions that will help me answer your question: 1) What version of Mahout are you using? 2) How are

Re: Running CollocDriver, exception

2011-01-23 Thread Drew Farris
, Drew Farris wrote: Hi Darren,  From the error message you receive, it is not exactly clear what is happening here. I suppose it could be due to the format of the input sequence file, but I'm not certain. A couple questions that will help me answer your question: 1) What version of Mahout

Re: Clustering performance

2010-12-05 Thread Drew Farris
2010/12/2 Jure Jeseničnik jure.jesenic...@planet9.si When running locally, mahout was only consuming one cpu core? I’m running it on win 7 through Cygwin, but it behaved pretty much the same on some proper linux machines. How could I make it use all the available cpu power? IIRC, LocalJobRunner

Re: Sparse Vectors

2010-11-21 Thread Drew Farris
Per o.a.m.utils.vectors.lucene.TFDFMapper, which is called from o.a.m.utils.vectors.lucene.Driver, the vectors created are instances of RandomAccessSparseVector On Sun, Nov 21, 2010 at 9:28 AM, Mike Perry mikeperrycan...@gmail.com wrote: Thanks Ted for the answer. Should be sparse, but I can't

Re: Moving a twitter conversation to the mailing list

2010-11-12 Thread Drew Farris
FWIW, Jimmy Lin's book has a chapter on MapReduce-based EM algorithms (http://www.umiacs.umd.edu/~jimmylin/book.html) On Mon, Nov 8, 2010 at 8:01 AM, Sebastian Schelter s...@apache.org wrote: I'm moving a twitter conversation to the mailing list so that it doesn't vanish in the short-lived

Re: Regarding Mahout NaiveBayes classifier and Cbayes classifier training

2010-10-12 Thread Drew Farris
The JIRA issue MAHOUT-520 includes a patch that contains a script that can be used to run the twenty newsgroups example. If the wiki isn't clear regarding input and output paths, the script should give you a good idea what goes where. At the very least you should be able to run the script and

Re: Offtopic: Hadoop World Talks?

2010-10-08 Thread Drew Farris
You can get a preview of the talk from the Booz Allen Hamilton folks here: http://www.slideshare.net/ydn/3-biometric-hadoopsummit2010 Although their talk will be less focused on Biometrics per se, and more on general uses of their Fuzzy Table code. They use Mahout canopy and kmeans to partition

Re: Can't Get Bayes Classifier to Work Properly

2010-10-05 Thread Drew Farris
Rosario uclamath...@gmail.com wrote: Thank you for your help. I tried dividing the data into two files spam.txt and nonspam.txt within directory simple_spam, but still have the same problem. No useful output. Ryan On Mon, Oct 4, 2010 at 7:42 PM, Drew Farris d...@apache.org wrote: Hi Ryan

Re: Can't Get Bayes Classifier to Work Properly

2010-10-04 Thread Drew Farris
Hi Ryan, Your format looks good. The -i argument must point to a directory of one or more files as input. In the example the 20newsgroups data is separated into a single file per class. I'm not certain this is a requirement because the class is in the first column after all. If you are running

Re: unknown test data twenty-newsgroups example

2010-09-30 Thread Drew Farris
On Thu, Sep 30, 2010 at 10:00 AM, Neil Ghosh neil.gh...@gmail.com wrote: My question is: if I want to test unknown documents, do I need them in a specific format, or do I just keep them (as raw text) in the input folder while testing? If I interpret your question correctly, you're saying I've

Re: What are the ways to train and run classifiers on text?

2010-09-26 Thread Drew Farris
Hi Bhaskar, Take a look at the latest from svn trunk: https://svn.apache.org/repos/asf/mahout/trunk/, you'll find the TrainNewsGroups class in the examples project. It is all pretty new, so there are no docs on the wiki, but the code is very readable. If you are interested in working with the

Re: Clustering on Elastic Map Reduce

2010-09-11 Thread Drew Farris
Congratulations! What's the best way to send messages back to the caller of an EMR job, using stderr instead of the log framework here? On Sat, Sep 11, 2010 at 9:32 PM, Grant Ingersoll gsing...@apache.org wrote: And indeed, running this via the Ruby CLI works as well.  Woo hoo! -Grant On

Re: Mahout svn is empty ?

2010-09-02 Thread Drew Farris
The new location is: http://svn.apache.org/repos/asf/mahout/trunk On Thu, Sep 2, 2010 at 9:45 AM, Jeff Zhang zjf...@gmail.com wrote: Thanks Sean, but why is this link http://svn.apache.org/repos/asf/lucene/mahout/trunk empty? Isn't it Mahout's official site? On Thu, Sep 2, 2010 at 1:20 AM,

Re: Clustering on Elastic Map Reduce

2010-09-02 Thread Drew Farris
/2/10 10:04 AM, Drew Farris wrote: Were there specific issues you ran into? I suspect the documentation on the wiki is out of date. Drew On Sun, Aug 29, 2010 at 10:58 AM, Grant Ingersoll gsing...@apache.org wrote: Has anyone successfully run any of the clustering algorithms on Amazon's

Re: Trouble running RecommenderJob with Mahout 0.3 - class not found issues

2010-08-09 Thread Drew Farris
On Mon, Aug 9, 2010 at 4:14 PM, Simon Reavely simon.reav...@gmail.com wrote: Please note, I suspect that this might be an issue with how I hacked together my package since I can't figure out how to create a proper binary release from src. I'm not familiar with the taste code, but as far as

Re: Reading Vectors Created from a Lucene Index

2010-07-01 Thread Drew Farris
Hi Kris, Could you try the code in the patch at: https://issues.apache.org/jira/secure/attachment/12448536/MAHOUT-402.patch This should cause VectorDumper to emit the names found in NamedVectors. Thanks, Drew On Thu, Jul 1, 2010 at 10:23 AM, Kris Jack mrkrisj...@gmail.com wrote: Hi Grant,

Re: question: network visualization

2010-06-27 Thread Drew Farris
Manish, Have you looked at Gephi at all? http://gephi.org - Drew On Sun, Jun 27, 2010 at 12:20 PM, Manish Katyal manish.kat...@gmail.com wrote: Any recommendations on visualization tools for a sparse but large social network graph? This is for exploratory analysis of the graph so I need to

Re: Collocation and Seq2Sparse Questions

2010-05-27 Thread Drew Farris
On Thu, May 27, 2010 at 2:59 PM, Jake Mannix jake.man...@gmail.com wrote: Ditto this. I thought we already had one in mahout somewhere too? Not that I know of. There are a couple implementations in hbase too, not sure how similar these are to the one in hadoop: