20news

2011-07-04 Thread Vijay Santhanam
Hi All, I'm new to Mahout and I'm interested in experimenting with its classifiers. Right now, I'm just trying to get up and running with the demos and examples. After checking out the mahout trunk, I've tried running the classification example 20news, but after running the ./examples/bin/buil

Re: 20news

2011-07-04 Thread Sergey Bartunov
When I started with Mahout I had the same errors. In my case, I just hadn't run PrepareTwentyNewsgroups. Try carefully repeating all the steps from https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html On 4 July 2011 12:52, Vijay Santhanam wrote: > Hi All, > > I'm new to Mahout and I'm inte

Re: 20news

2011-07-04 Thread Vijay Santhanam
Thanks Sergey, I'm still receiving the same error after following those steps. I've chosen not to use hadoop - does yours work WITH hadoop? A few bits of info that might be relevant. My examples/bin/work folder contains the expected folders from test data preparation and training... drwxr-xr-x@

Re: 20news

2011-07-04 Thread Sergey Bartunov
Yes, I worked WITH hadoop, but there should be no difference. Why do you use examples/bin/build/20news-bayes.sh instead of running bin/mahout directly? Is it the same? On 4 July 2011 13:12, Vijay Santhanam wrote: > Thanks Sergey, > > I'm still receiving the same error after following those steps.

Re: 20news

2011-07-04 Thread Sergey Bartunov
Paste your bayes-test-input file somewhere. On 4 July 2011 13:20, Sergey Bartunov wrote: > Yes, I worked WITH hadoop, but there should be no difference. > > Why do you use examples/bin/build/20news-bayes.sh instead of direct > running bin/mahout? Is it the same? > > On 4 July 2011 13:12, Vijay S

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi Sergey, I've tried using both the sh script file and following the instructions at https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html - like you suggested. Both return the same results. I've uploaded my bayes-test-input folder to dropbox, the first file is here... http://dl.dropbox.com/u/7

Re: 20news

2011-07-04 Thread Sergey Bartunov
Stop, did you _train_ the classifier successfully before running the _test_? On 4 July 2011 13:30, Vijay Santhanam wrote: > Hi Sergey, > > I've tried using both the sh script file and following the instructions at > https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html - like you suggested. > Bo

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi Sergey, Yes, there were no errors. And all the model data seems to have been populated into the bayes-model folder. Also, each main folder in bayes-model has a _SUCCESS file. See the tarball of my trained model here, http://dl.dropbox.com/u/7881451/bayes-model.tar.gz Please compare it to your trai

Re: 20news

2011-07-04 Thread Sergey Bartunov
Well, that's strange. Sorry, I can't help you at the moment, maybe someone else in the mailing list could. On 4 July 2011 13:49, Vijay Santhanam wrote: > Hi Sergey, > > Yes, there were no errors. > > And all the model data seems to have been populated into bayes-model folder. > Also, each main fo

Re: 20news

2011-07-04 Thread Vijay Santhanam
Thanks anyway Sergey. Could you perhaps upload your bayes-model folder so I could try that out? On Mon, Jul 4, 2011 at 7:57 PM, Sergey Bartunov wrote: > Well, that's strange. Sorry, I can't help you at the moment, maybe > someone else in the mailing list could. > > On 4 July 2011 13:49, Vijay

Re: 20news

2011-07-04 Thread Vijay Santhanam
I tried deleting all the folders from the test and train data except for alt.atheism, but I get the identical error. I might try debugging the problem in Eclipse rather than from the command line, but Eclipse doesn't quite want to work either. On Mon, Jul 4, 2011 at 8:02 PM, Vijay Santhanam wrote: >

Re: 20news

2011-07-04 Thread Robin Anil
Can you send me the console dump (command line + log written by the program) and put it on, say, pastebin? Robin On Mon, Jul 4, 2011 at 3:48 PM, Vijay Santhanam wrote: > I tried deleting all the folders from the test and train data except for > alt.atheism, but I get the identical error. > > I might

Re: Exclude by RuleSet

2011-07-04 Thread Marko Ciric
Hi Em, If I understood correctly what you're asking, you could implement a new CandidateItemStrategy class. If you look at that interface, there's this method getCandidateItems(long userID, DataModel dataModel) that has all parameters you need in order to filter out items that belong to the unwanted

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi Robin, The console dump was too large for pastebin, so I uploaded it here -- http://dl.dropbox.com/u/7881451/build-20news-bayes-console-output.txt I performed a fresh checkout only hours ago, and I used the script examples/bin/build-20news-bayes.sh I've opted to avoid hadoop, but from what I can

Re: Exclude by RuleSet

2011-07-04 Thread Em
Hi Marko, thank you for pointing me in this direction. Again I have to ask: What would be more efficient? Rescoring or CandidateItemStrategy? What are the differences? Thanks! On 04.07.2011 12:39, Marko Ciric wrote: > > Hi Em, > > If I understood well what you're asking, you could impleme

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi, I got debugger running w/ eclipse so I could watch what was happening under the hood. Here's the exception again Exception in thread "main" java.lang.IllegalArgumentException: Label not found: alt.atheism from at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88) at or

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi, Okay, I replaced all the tab characters with space characters for each file in the bayes-test-input folder and now the classifier completes without error. Tomorrow I'll investigate why the trainer parses the tab-separated label correctly, but the classifier does not. Actually, I kno
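
The workaround described in this message (replacing tabs with spaces in every file of the test-input folder) can be sketched roughly as follows. This is an illustrative sketch only: the function names are hypothetical, the folder layout is assumed to be a flat directory of text files, and nothing here is part of Mahout.

```python
from pathlib import Path


def tabs_to_spaces(data: bytes) -> bytes:
    """Replace every tab byte with a single space byte."""
    return data.replace(b"\t", b" ")


def convert_folder(folder: Path) -> int:
    """Rewrite each regular file in `folder` in place (hypothetical helper).

    Works on raw bytes so the file encoding and line endings are left alone.
    Returns the number of files that actually changed.
    """
    changed = 0
    for path in sorted(folder.iterdir()):
        if not path.is_file():
            continue
        original = path.read_bytes()
        converted = tabs_to_spaces(original)
        if converted != original:
            path.write_bytes(converted)
            changed += 1
    return changed
```

A call such as `convert_folder(Path("bayes-test-input"))` would apply it to a folder of that name in the current directory; the exact location of the folder is an assumption, so back up the data first.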

Re: Exclude by RuleSet

2011-07-04 Thread Marko Ciric
Rescoring is done after an item is processed which would be "too late". CandidateItemStrategy is the one that returns a set of all possible items that could be recommended, inside an ItemBasedRecommender so it is done before any rescoring (and even estimating) is processed. Therefore, implementing
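
The distinction drawn in this message, filtering candidates before estimation versus dropping items after they have been scored, can be illustrated with a small generic sketch. It is not Mahout code: `recommend`, `estimate` and `filter_early` are hypothetical names, and the only point is to count how many preference estimates each approach computes.

```python
def recommend(candidates, estimate, excluded, filter_early):
    """Return (ranked items, number of estimate() calls).

    filter_early=True  mimics a candidate strategy: excluded items are
                       removed before any estimate is computed.
    filter_early=False mimics filtering afterwards: every item is
                       estimated first and excluded items are dropped later.
    """
    calls = 0
    scored = []
    for item in candidates:
        if filter_early and item in excluded:
            continue  # never estimated
        calls += 1
        score = estimate(item)
        if not filter_early and item in excluded:
            continue  # estimated, then thrown away
        scored.append((score, item))
    scored.sort(reverse=True)
    return [item for _, item in scored], calls


items = [1, 2, 3, 4]
unwanted = {2, 4}
early = recommend(items, lambda i: i * 0.5, unwanted, filter_early=True)
late = recommend(items, lambda i: i * 0.5, unwanted, filter_early=False)
print(early)  # same ranking, fewer estimates
print(late)
```

Both variants return the same ranking; the early filter simply avoids the estimation work for items that can never be recommended.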

Re: 20news

2011-07-04 Thread Robin Anil
Are you using some non-standard Java character encoding? On Mon, Jul 4, 2011 at 5:23 PM, Vijay Santhanam wrote: > Hi, > > Okay, I replaced all the tab characters with space characters for each file > in the bayes-test-input folder and now the classifier completes without > error. > > Tomorrow I'

Re: 20news

2011-07-04 Thread Vijay Santhanam
No sir. UTF-8 all the way. When doing non-sequential training and classification, what class is used for tokenization? I get the feeling different tokenizer classes are used for sequential and parallel training/classification. On Mon, Jul 4, 2011 at 10:23 PM, Robin Anil wrote: > Are you us

Re: 20news

2011-07-04 Thread Robin Anil
We are using the default lucene tokenizer. You can also pass in a tokenizer via the command line. On Mon, Jul 4, 2011 at 5:55 PM, Vijay Santhanam wrote: > No sir. > > UTF-8 all the way. > > When doing non-sequential training and classification, what class is used > for tokenization? > > I get t

Re: 20news

2011-07-04 Thread Vijay Santhanam
Sorry, I think I asked the wrong question. I'm asking about training and classification (i.e. post-preparation -- after steps 3/4 in https://cwiki.apache.org/MAHOUT/twenty-newsgroups.html) phases. In parallel mode, what class is used for extracting the "label" for a document? In sequential mode,

Re: 20news

2011-07-04 Thread Sean Owen
This could be my doing. I noticed that various bits of code split input files in different ways: StringTokenizer, Pattern, Splitter. And using different delimiters: space, space/tab, or the weird collection of delimiters from StringTokenizer. (BTW StringTokenizer is all but deprecated for this reas
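
To see why code paths that split on different delimiters can disagree about a label, here is a minimal sketch. The two parser functions and the sample lines are hypothetical and are not taken from Mahout; they only contrast "split on a single space" with "split on spaces or tabs".

```python
import re


def label_split_on_space(line: str) -> str:
    """Treat everything before the first space character as the label."""
    return line.split(" ", 1)[0]


def label_split_on_whitespace(line: str) -> str:
    """Treat everything before the first run of spaces or tabs as the label."""
    return re.split(r"[ \t]+", line, maxsplit=1)[0]


known_labels = {"alt.atheism"}
tab_line = "alt.atheism\tfrom some words"    # label followed by a tab
space_line = "alt.atheism from some words"   # label followed by a space

for line in (tab_line, space_line):
    print(
        label_split_on_space(line) in known_labels,
        label_split_on_whitespace(line) in known_labels,
    )
```

With the tab-separated line, the space-only parser returns the label glued to the next token, so a lookup against the known labels fails, while the whitespace parser still finds it.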

Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Vijay Santhanam
Hi, I'm new to Mahout and to many machine learning ideas, but as I understand it, a Naive Bayes model can be trained with continuous, categorical and word-like features, going by the Wikipedia entry http://en.wikipedia.org/wiki/Naive_Bayes_c

Re: 20news

2011-07-04 Thread Vijay Santhanam
Hi Sean, Thanks for responding. I would expect the sequential classifier tokenizer to be identical to what's used in the parallel classifier tokenizer. If that's not possible, then NGrams should perhaps be configurable as to where it finds its first token (i.e. the label). I'm very new to hadoop

MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Mark
I've read the source for FileDataModel and it suggested using a JDBC-backed implementation for larger datasets, so I decided to upgrade our recommendation system to use MySQLJDBCDataModel with MySQLJDBCInMemoryItemSimilarity. I've found that the JDBC-backed version's performance is actually wors

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sean Owen
Yes, this is trading memory for speed. If you can fit everything in memory, then you should. FileDataModel is in memory. MySQLJDBCDataModel is not in memory and queries the DB every time. This is pretty slow, though by caching item-item similarity as you do, a lot of the load is removed. However i

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Mark
Ahh ok. So if I want everything in memory like the file backed solution I should use ReloadFromJDBCDataModel? I'm going to give that a try right now. Typically which solution is recommended for production use? Thanks On 7/4/11 10:09 AM, Sean Owen wrote: Yes, this is trading memory for speed.

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sean Owen
Yes. Both are just fine to use in production. For speed and avoiding abuse of the database, I'd load into memory and tell it to periodically reload. But that too is a bit of a choice between how often you want to consume new data and how much work you want to do to recompute new values. On Mon, Ju
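
The "load into memory and periodically reload" approach mentioned here can be sketched generically. This is not Mahout's ReloadFromJDBCDataModel; the class name `ReloadingCache`, the `loader` callback and the `max_age_seconds` parameter are all hypothetical, and the loader stands in for whatever query reads the preferences from the database.

```python
import time


class ReloadingCache:
    """Keep a full copy of the data in memory and refresh it when it gets stale."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader            # callable that fetches all data
        self._max_age = max_age_seconds  # how long a snapshot stays valid
        self._clock = clock              # injectable for testing
        self._data = loader()
        self._loaded_at = clock()
        self.load_count = 1

    def get(self):
        """Return the in-memory snapshot, reloading first if it is too old."""
        if self._clock() - self._loaded_at >= self._max_age:
            self._data = self._loader()
            self._loaded_at = self._clock()
            self.load_count += 1
        return self._data
```

The reload interval is the trade-off described in the message: a short interval picks up new data sooner but hits the database and recomputes more often.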

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Mark
I wouldn't use the in memory JDBC solution. I was wondering do most people choose the JDBC backed solutions or the File backed? On 7/4/11 10:17 AM, Sean Owen wrote: Yes. Both are just fine to use in production. For speed and avoiding abuse of the database, I'd load into memory and tell it to

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sebastian Schelter
A recent blog post of mine might help with choosing the appropriate data access strategies for your recommender setup. It covers a very common use case in great detail: http://ssc.io/deploying-a-massively-scalable-recommender-system-with-apache-mahout/ --sebastian 2011/7/4

how do I choose appropriate OnlineLogisticRegression parameters for modelling this?

2011-07-04 Thread Vijay Santhanam
Hi, I'm trying to model the following training data, which targets gender, and from what I've been reading in the archives of the mailing list, the OnlineLogisticRegression classifier is the easiest to get up and running. sex height (feet) weight (lbs) foot size (inches) male 6 180 12 male

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Mark
May I ask why you chose to go with AllSimilarItemsCandidateItemsStrategy over the default PreferredItemsNeighborhoodCandidateItemsStrategy? On 7/4/11 10:23 AM, Sebastian Schelter wrote: A look into a recent blogpost of mine might maybe be helpful with choosing the appropriate data access stra

Re: MySQLJDBCDataModel vs FileDataModel

2011-07-04 Thread Sebastian Schelter
If the item similarities are already precomputed there's no sense in fetching them from the data model; you can just use the already precomputed set of possibly similar items, as no other items can be recommended anyway and it's faster to fetch them from a similarity implementation that holds t

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Ted Dunning
The mahout implementation of Naive_Bayes does not use continuous variables well. The best bet is to discretize these variables either individually or together using k-means. Then use the discrete version for the classifier. The random forest implementation and the SGD implementation are both hap
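
As a rough illustration of discretizing continuous variables individually, here is a sketch that uses simple equal-width binning rather than k-means. The function names, the value ranges and the bin counts are assumptions chosen for the example, not recommendations from the thread.

```python
def equal_width_bin(value, low, high, bins):
    """Map a continuous value to a bin index in [0, bins - 1].

    Values outside [low, high] are clamped into the first or last bin.
    """
    if bins < 1 or high <= low:
        raise ValueError("need bins >= 1 and high > low")
    width = (high - low) / bins
    index = int((value - low) // width)
    return max(0, min(bins - 1, index))


def discretize(record, ranges):
    """Turn continuous fields into word-like tokens such as 'height_bin2'.

    `ranges` maps a field name to a hypothetical (low, high, bins) triple.
    """
    return [
        f"{name}_bin{equal_width_bin(record[name], low, high, bins)}"
        for name, (low, high, bins) in ranges.items()
    ]


# Assumed ranges: height in feet split into 3 bins, weight in lbs into 4 bins.
ranges = {"height": (4.0, 7.0, 3), "weight": (100.0, 200.0, 4)}
print(discretize({"height": 6.0, "weight": 180.0}, ranges))
```

The resulting tokens can then be fed to a classifier that expects discrete, word-like features.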

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Ted Dunning
The wikipedia page recommends binning if you have a large amount of data and a supervised variable extraction method if not. These are both ways of preprocessing to discretize continuous variables. On Mon, Jul 4, 2011 at 11:28 AM, Ted Dunning wrote: > The mahout implementation of Naive_Bayes do

Re: Using naive bayes classification with continuous, categorical and word-like features

2011-07-04 Thread Vijay Santhanam
Thank you, Ted. However, even with the default OnlineLogisticRegression I'm unable to get acceptable results when trying to replicate the gender-guesser discussed in the example of http://en.wikipedia.org/wiki/Naive_Bayes_classifier For that particular problem, do you recommend I take a binn

How could I use bayse model with my C++ online classifier

2011-07-04 Thread 刘逸哲
Hi all, I have trained a bayes model using mahout on my hadoop cluster, and I want to use this model with my C++ online application. So I will implement the classifier as mahout did, but I don't know how to load the model using C++, as the model is stored as sequence files. I want a bayes mod

Re: How could I use bayse model with my C++ online classifier

2011-07-04 Thread beneo_7
Read the Java source code and implement it in C++. I also don't understand why you are using an Alibaba email address. 2011-07-05 beneo_7 From: 刘逸哲 Sent: 2011-07-05 10:55 Subject: How could I use bayse model with my C++ online classifier To: "user@mahout.apache.org" Hi all, I have trained a bayes model using mahout on my hadoop

Re: how do I choose appropriate OnlineLogisticRegression parameters for modelling this?

2011-07-04 Thread Xiaobo Gu
You should use AdaptiveLogisticRegression, and you can try MAHOUT-696. Xiaobo Gu On Tue, Jul 5, 2011 at 1:30 AM, Vijay Santhanam wrote: > Hi, > > I'm trying to model the following training data that's targeting the gender > and from what I've been reading in the archives of the mailing list

Lanczos SVD scalability

2011-07-04 Thread agnonchik
What could be the reason for poor Lanczos SVD scalability on a cluster? I don't observe any speed-up at all when increasing the number of nodes. What am I doing wrong? I'm processing a 1x1000 matrix with 1% non-zeros. The elapsed CPU time scales like this: 1 slave node - 89m39.399s 2 slave nodes - 9