Re: Mahout 0.6 Naive Bayes Accuracy

2012-03-29 Thread Dimitri Goldin
Hi Isabel, First of all, thanks for your reply. On 03/28/2012 09:10 AM, Isabel Drost wrote: On 27.03.2012 Dimitri Goldin wrote: Having tried Mallet's Naive Bayes implementation, we achieved ~95% accuracy without having to balance the training data. Does anybody know which implementation detail

Re: CityBlockSimilarity details

2012-03-29 Thread Sean Owen
Nope, it's the sum of the absolute values of differences in ratings, for your purposes. On Thu, Mar 29, 2012 at 7:29 PM, ziad kamel ziad.kame...@gmail.com wrote: City block distance or Manhattan distance: Wikipedia defines it for points at http://en.wikipedia.org/wiki/Taxicab_geometry So how
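The definition given in this reply (the sum of the absolute differences in ratings) can be sketched in a few lines of Python. This is only an illustration, not Mahout's code: the function name `city_block_distance`, the dict-of-ratings representation, and the choice to sum only over items both users rated are all assumptions made for the example.

```python
def city_block_distance(ratings_a, ratings_b):
    """Sum of absolute rating differences over items both users rated.

    ratings_a, ratings_b: dicts mapping item id -> rating (hypothetical layout).
    """
    common_items = ratings_a.keys() & ratings_b.keys()
    return sum(abs(ratings_a[item] - ratings_b[item]) for item in common_items)


# Made-up ratings for two users; only m1 and m2 are rated by both.
alice = {"m1": 5.0, "m2": 3.0, "m3": 1.0}
bob = {"m1": 4.0, "m2": 1.0, "m4": 2.0}

# |5 - 4| + |3 - 1| = 3.0
print(city_block_distance(alice, bob))
```

A smaller value means the two users rated their shared items more alike; how unshared items should be treated is a separate design choice that this sketch leaves out.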

Re: CityBlockSimilarity details

2012-03-29 Thread ziad kamel
Dear Owen, I tried to look for other representations like the one you used but didn't find any. Can you direct me to any that you are aware of? Why did you use the distance below and not just the absolute difference between ratings? Many thanks! On Thu, Mar 29, 2012 at 1:30 PM, Sean Owen

Re: CityBlockSimilarity details

2012-03-29 Thread ziad kamel
What makes me wonder is that CityBlockSimilarity gave much higher precision than EuclideanDistanceSimilarity, PearsonCorrelationSimilarity, and others. Is this usual? Is there a reason behind it? On Thu, Mar 29, 2012 at 1:49 PM, ziad kamel ziad.kame...@gmail.com wrote:

Getting InMemBuilder to use more mappers

2012-03-29 Thread Jason L Shaw
I have a dataset that is not terribly large (~31 MB on disk in plaintext, ~145,000 records with 26 fields). I am trying to build random forests over the data, but the process is quite slow. It takes about half an hour to build 100 trees using the partial implementation. (I didn't realize I

Re: CityBlockSimilarity details

2012-03-29 Thread Sean Owen
Like I think we've said, it depends on your data. I expect that some similarity metrics will work better than others. Why is hard to say without knowing anything about your data. I don't understand your previous question about representation. I just gave you the definition of city-block distance.

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread Sean Owen
Hadoop is what chooses the number of mappers, and it bases it on input size. Generally it will not assign less than one worker per chunk and a chunk is usually 64MB (still, I believe). You can override this directly (well, at least, register a suggestion to Hadoop). I would tell you the exact flag

Re: CityBlockSimilarity details

2012-03-29 Thread ziad kamel
I think that it is NOT using preference values. Also, the algorithm's documentation says that it uses the NUMBER and not the values: * @param pref1 number of non-zero values in left vector * @param pref2 number of non-zero values in right vector * @param intersection number of
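The Javadoc quoted in this message describes counts of non-zero entries rather than rating values. A minimal sketch of how such counts could yield a city-block style similarity follows. The formula is an assumption on my part (distance as the number of items rated by exactly one of the two users, mapped to a similarity with 1 / (1 + distance)); it should be checked against the Mahout source before relying on it. The function name is hypothetical.

```python
def count_based_city_block_similarity(pref1, pref2, intersection):
    """Similarity from counts only, ignoring the rating values themselves.

    pref1:        number of non-zero values in the left vector
    pref2:        number of non-zero values in the right vector
    intersection: number of positions that are non-zero in both vectors
    """
    # Assumed formula: items present in exactly one of the two vectors.
    distance = pref1 + pref2 - 2 * intersection
    # Map the distance into (0, 1]; identical item sets give 1.0.
    return 1.0 / (1.0 + distance)


# Two users with 3 rated items each, 2 of them shared: distance is 2.
print(count_based_city_block_similarity(3, 3, 2))
```

Under this reading, the metric behaves like a set-overlap measure, which would be consistent with the observation that it does not use the preference values.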

Re: CityBlockSimilarity details

2012-03-29 Thread Sean Owen
What top items? I am not sure what you're referring to here, but no, I do not expect things to be identical when changing metrics in general. I've already answered your other question. On Thu, Mar 29, 2012 at 10:52 PM, ziad kamel ziad.kame...@gmail.com wrote: OK, things are becoming clearer.

Can I create/make preferences ?

2012-03-29 Thread ziad kamel
Hi, I want to recommend movies based on user preferences and movie type (comedy, etc.). I have data on users watching movies over the years. I don't have direct preferences but was wondering if I can create some from the years and movie types. Any suggestions? data format user - movie -

Re: CityBlockSimilarity details

2012-03-29 Thread Ted Dunning
It is very common that preferences or ratings DECREASE recommendation performance. The basic reason is that there is little or no real signal in the ratings after you account for the fact that the rating exists at all. In practice, there is the additional reason that if you don't need a rating,

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread Jason L Shaw
Suggestion, indeed. I passed that option, but still only 2 mappers were created. On Thu, Mar 29, 2012 at 5:23 PM, Sean Owen sro...@gmail.com wrote: Hadoop is what chooses the number of mappers, and it bases it on input size. Generally it will not assign less than one worker per chunk and a

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread Sean Owen
(If you're using a modern version of Hadoop, the flag is something different, so make sure you check what the real value is.) There's another option concerning minimum split size that you could reduce from its default too. On Thu, Mar 29, 2012 at 11:05 PM, Jason L Shaw jls...@uw.edu wrote:

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread Ted Dunning
Split your training data into lots of little files. Depending on the wind, that may cause more mappers to be invoked. On Thu, Mar 29, 2012 at 3:05 PM, Jason L Shaw jls...@uw.edu wrote: Suggestion, indeed. I passed that option, but still only 2 mappers were created. On Thu, Mar 29, 2012 at
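The suggestion in this reply, splitting the training data into many small files, could be done with a short script before uploading to HDFS. The sketch below is one way to do it under stated assumptions: records are newline-terminated text lines, and the helper names `split_into_chunks` and `write_chunks`, the `part-NNNNN.csv` naming, and the output directory are all hypothetical. Whether Hadoop then starts more mappers still depends on the input format and cluster settings.

```python
from pathlib import Path


def split_into_chunks(records, n_chunks):
    """Split a list into n_chunks contiguous parts of nearly equal size."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    base, extra = divmod(len(records), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional record.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def write_chunks(lines, out_dir, n_chunks, prefix="part"):
    """Write each chunk of newline-terminated lines to its own file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(split_into_chunks(lines, n_chunks)):
        path = out / f"{prefix}-{i:05d}.csv"
        path.write_text("".join(chunk))
        paths.append(path)
    return paths
```

A possible use would be reading the original file with `open(...).readlines()` and calling `write_chunks(lines, "train_parts", 20)`; empty chunks are written as empty files when there are fewer records than chunks.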

Re: CityBlockSimilarity details

2012-03-29 Thread ziad kamel
I never thought ratings could decrease recommendation quality. Does this have a name, like under-fitting, in recommender systems? On Thu, Mar 29, 2012 at 5:04 PM, Ted Dunning ted.dunn...@gmail.com wrote: It is very common that preferences or ratings DECREASE recommendation performance.

Re: CityBlockSimilarity details

2012-03-29 Thread Ted Dunning
No. It is more related to the fact that ratings are just very strange things. On Thu, Mar 29, 2012 at 3:35 PM, ziad kamel ziad.kame...@gmail.com wrote: I never thought ratings could decrease recommendation quality. Does this have a name, like under-fitting, in recommender systems?

content of forest.seq

2012-03-29 Thread Xiaomeng Wan
Hi, I am trying the partial decision forest example and wondering whether there is any way to inspect the built trees stored in the forest.seq file. I cannot find any function in DecisionForest that can do that. Thanks! Regards, Shawn

Re: Getting InMemBuilder to use more mappers

2012-03-29 Thread deneche abdelhakim
-Dmapred.map.tasks=N only gives a suggestion to Hadoop, and in most cases (especially when the data is small) Hadoop doesn't take it into consideration. To generate more mappers use -Dmapred.max.split.size=S, S being the size of each data partition in bytes. So your data ~ 3100B, if you want
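The arithmetic behind choosing S for -Dmapred.max.split.size=S can be sketched as follows. This is plain arithmetic, not a Hadoop API: the helper names are hypothetical, the ~31 MB input size is taken from the original question in this thread, and the real number of splits also depends on block size, minimum split size, and file boundaries.

```python
import math


def max_split_size_for(total_bytes, desired_mappers):
    """Largest split size (in bytes) that yields at least desired_mappers splits."""
    if desired_mappers < 1:
        raise ValueError("desired_mappers must be at least 1")
    return math.ceil(total_bytes / desired_mappers)


def estimated_splits(total_bytes, split_size):
    """Rough split count for a single file cut into split_size-byte pieces."""
    return math.ceil(total_bytes / split_size)


# Assumption: about 31 MB of input, as described in the original question.
total = 31 * 1024 * 1024
split = max_split_size_for(total, 10)
print(f"-Dmapred.max.split.size={split}")
print("estimated mappers:", estimated_splits(total, split))
```

The computed value would be passed on the command line in place of S; treating the result as an estimate rather than a guarantee matches the caveat in this thread that Hadoop makes the final decision.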

Re: content of forest.seq

2012-03-29 Thread deneche abdelhakim
You can use DecisionForest.load(Configuration conf, Path path) (org.apache.mahout.classifier.df package). You can just pass the output path that contains the trees and this function will load them all. On Fri, Mar 30, 2012 at 3:41 AM, Xiaomeng Wan shawn...@gmail.com wrote: Hi, I am trying the