Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-12 Thread Matt Molek
ecommender returning 0 recommendations. On Mon, Aug 12, 2013 at 9:47 AM, Matt Molek wrote: > I'm using a custom PlusAnonUser recommender which is just a > GenericBooleanPrefUserBasedRecommender with a PlusAnonymousUser DataModel > wrapped around a GenericBoo

Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-12 Thread Matt Molek
2013 at 4:13 PM, Ted Dunning wrote: > On Fri, Aug 9, 2013 at 12:30 PM, Matt Molek wrote: > > > From some local IR precision/recall testing, I've found that user based > > recommenders do better on my data, so I'd like to stick with user based > if > > I ca

Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-09 Thread Matt Molek
input to the mostSimilarItems() > method. That should give you the same results. > > > On 09.08.2013 03:33, Matt Molek wrote: > > Thanks, Sebastian. > > > > To get around this problem, I was just reading about the > > PlusAnonymousUserDataModel. Would that be

Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-08 Thread Matt Molek
save new interactions in a database and > load them into memory from time to time. > > 2013/8/8 Matt Molek > > > Ok, having implemented a recommender that tried to call > setPreference(...) > > on a GenericBooleanPrefUserBasedRec > > o

Re: Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-08 Thread Matt Molek
d new user-item associations to the model though. Is this just not possible? That seems weird. I thought all of the in-memory models supported having new data added on the fly. Am I missing something? Thanks for the help, Matt On Thu, Aug 8, 2013 at 12:31 PM, Matt Molek wrote: >

Setting new preferences on GenericBooleanPrefUserBasedRecommender

2013-08-08 Thread Matt Molek
I'm using a GenericBooleanPrefUserBasedRecommender with a GenericBooleanPrefDataModel. When I load the historical user/item associations from a file, they're just in the format of userid, itemid, and as I understand it, the GenericBooleanPrefDataModel does not store any 'rating' data. I'd like to

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
Oops, I'm sorry. I had one too many zeros there; it should be '-Dmapred.max.split.size=100000'. It's just (input size)/(desired number of mappers).
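The rule of thumb in this message, (input size)/(desired number of mappers), can be sketched as a small calculation. This is only an illustration: the function names are hypothetical, "10MB" is assumed to mean 10,000,000 bytes, and the number of mappers Hadoop actually launches may also depend on block size and the input format.

```python
def max_split_size(input_bytes: int, desired_mappers: int) -> int:
    """Approximate value for mapred.max.split.size, in bytes.

    Hypothetical helper: applies (input size) / (desired number of mappers).
    """
    if desired_mappers <= 0:
        raise ValueError("desired_mappers must be positive")
    # Integer division; never return a split size below 1 byte.
    return max(1, input_bytes // desired_mappers)


def split_size_flag(input_bytes: int, desired_mappers: int) -> str:
    """Format the result as a -D argument (no space after -D)."""
    return f"-Dmapred.max.split.size={max_split_size(input_bytes, desired_mappers)}"


# Assumed example: a 10 MB input (taken as 10,000,000 bytes) and ~100 mappers.
print(split_size_flag(10_000_000, 100))
```

For the assumed 10,000,000-byte input and 100 mappers this prints -Dmapred.max.split.size=100000.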

Re: Modify number of mappers for a mahout process?

2013-08-01 Thread Matt Molek
One trick to getting more mappers on a job when running from the command line is to pass a '-Dmapred.max.split.size=' argument. The value is a size in bytes. So if you have some hypothetical 10MB input set, but you want to force ~100 mappers, use '-Dmapred.max.split.size=1000000' On Wed, Jul 3

Does seq2sparse drop empty documents?

2013-04-22 Thread Matt Molek
I'm losing some documents when running seq2sparse. I think it's because the documents are composed of common terms, and end up having no terms at all once common words are pruned. I couldn't find documentation that this is what's supposed to be happening though, so I wanted to ask if this is expe

Re: Does it make sense to use Mahout for text classification when I have a huge number of documents but a small number of labels?

2013-04-19 Thread Matt Molek
Instead of manually splitting your files, you should be able to pass -Dmapred.min.split.size= at the command line, or otherwise set the mapred.min.split.size property to get the number of mappers you want. On Wed, Apr 17, 2013 at 7:55 PM, Ryan Compton wrote: > Got it, thanks. > > For some reason I h

Re: Error while compiling Ted dunning algo - knn-master

2013-03-14 Thread Matt Molek
Hi Vivek, Not sure if you're still trying to resolve this, but I got it working. git clone http://git-wip-us.apache.org/repos/asf/mrunit.git failed for me as well with error 405, "fatal: The remote end hung up unexpectedly" Just on a whim, I tried https://... instead and that worked. On Fri,

Re: KMeans Results: Finding Cluster Members

2013-03-07 Thread Matt Molek
oints file I > see the following output: > > 1.0: [3887:3.000, 9441:1.000] is in 1205002 > 1.0: [6773:1.000] is in 1205002 > 1.0: [8987:2.000] is in 1205002 > 1.0: [2956:1.000] is in 1205002 > > > Thanks again, > Colum > > On Tue, Mar 5, 2013 at 8:57 PM, Matt Mo

Re: KMeans Results: Finding Cluster Members

2013-03-05 Thread Matt Molek
If you run kmeans with the "-cl" option (or set the runClustering option to true if you're calling the driver from Java code), you'll get a sequence file in the directory $KMEANS_OUT/clusteredPoints with an IntWritable key identifying the cluster, and a WeightedVectorWritable with a pdf weight (al

Re: What do "normal" pdf values look like for points clustered with kmeans?

2013-03-01 Thread Matt Molek
n pdf value of 0.0200. Do these pdf values say anything about the fit or quality of my cluster results? On Fri, Mar 1, 2013 at 2:56 AM, Ted Dunning wrote: > How high is the dimension? > > How is your data generated? > > > > On Wed, Feb 27, 2013 at 1:38 PM, Matt Molek w

Re: kmeans clustering - how to leave some docs unclustered

2013-02-27 Thread Matt Molek
articular document was clustered or not. > > On 27 Feb 2013, at 03:01, Matt Molek wrote: > > > I think you have the right idea about the clusterClassificationThreshold, > > but something just isn't working right in your case. > > > > I know this answer won't

Re: kmeans clustering - how to leave some docs unclustered

2013-02-26 Thread Matt Molek
I think you have the right idea about the clusterClassificationThreshold, but something just isn't working right in your case. I know this answer won't be particularly helpful since I don't have any suggestions to fix your problem, but I did a test recently where I tried clusterClassificationThres

Re: Run multiple kmeans jobs at once from the same bash script as part of top down clustering

2012-11-20 Thread Matt Molek
te. That's the sort of thing I want to be able to do with KMeans on multiple separate datasets. On Tue, Nov 20, 2012 at 11:58 AM, Matt Molek wrote: > I've given up on the CLI and I'm trying to do this in java now, but it > looks like I can't launch multiple KMeans driv

Re: Run multiple kmeans jobs at once from the same bash script as part of top down clustering

2012-11-20 Thread Matt Molek
ot too familiar with concurrency in java). I'd really like to be able to launch multiple clustering runs at the same time since launching them one at a time and waiting for each to finish is killing my overall performance. On Thu, Nov 8, 2012 at 1:48 PM, Matt Molek wrote: > When do

Re: Increase timeout for running PFPGrowth

2012-10-22 Thread Matt Molek
Isn't this the same question you asked earlier today? I responded to the initial one that "-D mapred.task.timeout=1800" shouldn't have a space after the D. It should be "-Dmapred.task.timeout=1800" And IIRC, these Hadoop parameters need to go before all of your other parameters. On Mon,

Re: clusterpp is only writing directories for about half of my clusters.

2012-10-22 Thread Matt Molek
ses: 1) do LSA ( in terms of SSVD, it means --pca >>>> false and take U output for document topic space), or 2) perhaps do >>>> sphere projection first and then do dimensionality reduction with >>>> --pca true. the latter will at least preserve cosine distances as f

Re: Increase timeout for running PFPGrowth

2012-10-22 Thread Matt Molek
Did you have those spaces "-D mapred.task.timeout=1800"? That won't be parsed correctly. It should be: "-Dmapred.task.timeout=1800" On Mon, Oct 22, 2012 at 1:08 PM, Amit Krishna Joshi wrote: > Hi, > > I am running PFP on several datasets and it works well for smaller ones (< > 5GB) > Howe

Re: clusterpp is only writing directories for about half of my clusters.

2012-10-22 Thread Matt Molek
I've done some more testing and submitted a JIRA: https://issues.apache.org/jira/browse/MAHOUT-1103 On Sat, Oct 20, 2012 at 9:01 PM, Matt Molek wrote: > Thanks for the quick response! > > I will do some testing tomorrow with various numbers of clusters and > create a JIRA

Re: clusterpp is only writing directories for about half of my clusters.

2012-10-20 Thread Matt Molek
e/MAHOUT. > And if you are interested, this would be a good starting point to > contribute to Mahout also. > > On Sun, Oct 21, 2012 at 1:14 AM, Matt Molek wrote: > >> First off, thank you everyone for your help so far. This mailing list >> has been a great help getting m

How to use ssvd for dimensionality reduction of tfidf-vectors?

2012-10-19 Thread Matt Molek
Sorry for the basic question. I've been reading about this for a few hours, but I'm still confused. I want to use ssvd to reduce the dimensionality of some tfidf-vectors so I can perform clustering on the result. Among many other things, I've read: https://cwiki.apache.org/MAHOUT/dimensional-reduc

seq2sparse seems to be ignoring the value of my “-x” parameter

2012-09-25 Thread Matt Molek
I'm using mahout 0.7 on a pseudo-distributed hadoop installation for testing purposes. A lot of what I'm doing is being guided by Mahout in Action, which I know deals with 0.5, but as far as I can tell, nothing major has changed with seq2sparse. I'm having a problem with the tfidf vectors generat