Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Johannes, Your summary is good. I would add that the precalculated recommendations can be large enough that the lookup becomes more expensive. Your point about staleness is very on-point. On Mon, May 20, 2013 at 10:15 PM, Johannes Schulte johannes.schu...@gmail.com wrote: I think Pat is

Re: convert input for SVD

2013-05-21 Thread Ted Dunning
Are you using Lanczos instead of SSVD for a reason? On Mon, May 20, 2013 at 4:13 AM, Rajesh Nikam rajeshni...@gmail.com wrote: Hello, I have arff / csv file containing input data that I want to pass to svd : Lanczos Singular Value Decomposition. Which tool to use to convert it to

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks! Could you also add how to learn the weights you talked about, or at least a hint? Learning weights for search engine query terms always sounds like learning to rank to me, but this always seemed pretty complicated and I never managed to try it out. On Tue, May 21, 2013 at 8:01 AM, Ted

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
Dear all, May I ask about usage here? Previously I had: import com.carrotsearch.hppc.IntSet; import com.carrotsearch.hppc.IntOpenHashSet; IntSet columnValues = new IntOpenHashSet(); for loop (...) { if (columnValues.contains(x)) continue; ... columnValues.add(x); } It
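The loop pattern in the post above (skip a value that was already seen, otherwise process it and record it) can be sketched as follows. This is a minimal illustration only: it swaps in `java.util.HashSet<Integer>` for the hppc `IntOpenHashSet` so that it runs without extra jars, and the class and method names (`SeenFilter`, `firstOccurrences`) are hypothetical, not taken from the thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; java.util.HashSet stands in for hppc's IntOpenHashSet.
class SeenFilter {
    // Returns the values of input in first-seen order, skipping repeats.
    static List<Integer> firstOccurrences(int[] input) {
        Set<Integer> columnValues = new HashSet<>();
        List<Integer> kept = new ArrayList<>();
        for (int x : input) {
            if (columnValues.contains(x)) {
                continue; // already handled this value
            }
            kept.add(x);         // placeholder for the elided per-value work
            columnValues.add(x); // remember the value for later iterations
        }
        return kept;
    }
}
```

Calling `SeenFilter.firstOccurrences` with more than four distinct values exercises the point at which the set has to grow, which is the situation discussed later in the thread.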

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
Dear all, After a while of debugging I understand that 3 elements were added fine, but when adding the fourth one it does not crash but reports com.sun.jdi.InvocationException occurred invoking method. So a table (hashtable?) of fixed size is created. Why is that happening? On 21 May 2013 10:18,

Re: mahout colt collections

2013-05-21 Thread Dan Filimon
Hi Sophie, What you're describing is odd and while the hash set is allocated with a fixed small size initially, it's resized as you add more elements. Can you please post the full stack trace of the exception? On Tue, May 21, 2013 at 12:35 PM, Sophie Sperner sophie.sper...@gmail.com wrote:

Feature vector generation from Bag-of-Words

2013-05-21 Thread Stuti Awasthi
Hi all, I have a query regarding feature vector generation for text documents. I have read Mahout in Action and understood how to convert a text document into a feature vector weighted by TF or TF-IDF schemes. My use case differs slightly from that. I have a few keywords, say 100, and I

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
Dear Dan, all, I do not know how to get the stack trace. The code hangs, and Eclipse does not print a stack trace because it does not terminate the program. So I decided to make a small test.java file that you can easily run. This code has the main function that simply runs getItemList()

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
Link to hppc jar file - http://labs.carrotsearch.com/hppc-download.html then press Download button on the right. On 21 May 2013 13:23, Sophie Sperner sophie.sper...@gmail.com wrote: Dear Dan, all, I do not have skills to get the stack trace. The code hangs one, Eclipse does not print me its

Re: convert input for SVD

2013-05-21 Thread Rajesh Nikam
Hello Ted, Thanks for the reply. I started exploring SVD because it was mentioned that it could help drop features that are not relevant for clustering. My objective is to reduce the number of features before passing them to clustering and keep just the important features. arff/csv== ssvd (for dimensionality

Re: mahout colt collections

2013-05-21 Thread Robin Anil
I think you forgot to attach the test file. On May 21, 2013 7:30 AM, Sophie Sperner sophie.sper...@gmail.com wrote: Link to hppc jar file - http://labs.carrotsearch.com/hppc-download.html then press Download button on the right. On 21 May 2013 13:23, Sophie Sperner sophie.sper...@gmail.com

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
Alright, below is my message. In the next mail I will attach my files. Dear Dan, all, I do not know how to get the stack trace. The code hangs, and Eclipse does not print a stack trace because it does not terminate the program. So I decided to make a small test.java file that you can

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner

Re: mahout colt collections

2013-05-21 Thread Sophie Sperner
I am fine with using partly hppc libs and partly Mahout. For the moment I have converted my code; the API is very similar. But you may be interested in running test.java, a quite simple example, in order to find the possible bug. Best of luck to you. On 21 May 2013 15:24, Sophie Sperner

Re: convert input for SVD

2013-05-21 Thread Dmitriy Lyubimov
Sounds like dimensionality reduction to me. You may want to use ssvd -pca. Apologies for brevity. Sent from my Android phone. -Dmitriy On May 21, 2013 6:27 AM, Rajesh Nikam rajeshni...@gmail.com wrote: Hello Ted, Thanks for reply. I have started exploring SVD based on its mention of could

Re: Which database should I use with Mahout

2013-05-21 Thread Pat Ferrel
In the interest of getting some empirical data out about various architectures: On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: ... You use the user history vector as a query? The most recent suffix of the history vector. How much is used varies by the purpose. We

Re: Feature vector generation from Bag-of-Words

2013-05-21 Thread Suneel Marthi
Stuti, Here's how I would do it. 1. Create a collection of the 100 keywords that are of interest. Collection<String> keywords = new ArrayList<String>(); keywords.addAll(your 100 keywords); 2. For each word in each of the text documents, create a Multiset (which is a bag of words),
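The two steps visible in the reply above (a fixed keyword collection, then a bag of words per document) can be sketched as below. The remainder of the reply is cut off in this archive, so everything past step 2 is an assumption: the sketch uses a plain `LinkedHashMap` in place of a Guava `Multiset`, lowercases and splits on non-word characters as a stand-in tokenizer, expects lowercase keywords, and emits raw term counts in keyword order. The names `KeywordVectors` and `termFrequencies` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; a LinkedHashMap stands in for a Guava Multiset.
class KeywordVectors {
    // Counts how often each keyword occurs in the document and returns the
    // counts as a vector with one slot per keyword, in keyword order.
    static int[] termFrequencies(List<String> keywords, String document) {
        Map<String, Integer> bag = new LinkedHashMap<>();
        for (String k : keywords) {
            bag.put(k, 0); // only the keywords of interest are tracked
        }
        // Naive tokenizer (assumption): lowercase, split on non-word characters.
        for (String token : document.toLowerCase().split("\\W+")) {
            Integer count = bag.get(token);
            if (count != null) {
                bag.put(token, count + 1);
            }
        }
        int[] vector = new int[keywords.size()];
        int i = 0;
        for (String k : keywords) {
            vector[i++] = bag.get(k);
        }
        return vector;
    }
}
```

Words outside the keyword list are ignored, so every document maps to a vector of the same fixed length. Any TF-IDF weighting would be applied on top of these counts.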

Re: mahout colt collections

2013-05-21 Thread Dan Filimon
Sophie, you still haven't attached your test.java file. :) On Tue, May 21, 2013 at 6:03 PM, Sophie Sperner sophie.sper...@gmail.com wrote: I fine with using partially hppc libs partially mahout. At the moment converted my code. Very similar API. But you may be interested in running test.java

Re: convert input for SVD

2013-05-21 Thread Suneel Marthi
Filling in for Dmitriy's brief reply: mahout ssvd -i input -o output -pca true -us true -U false -V false -k <no of columns> From: Dmitriy Lyubimov dlie...@gmail.com To: user@mahout.apache.org Sent: Tuesday, May 21, 2013 11:48 AM Subject: Re: convert input for

Re: Feature vector generation from Bag-of-Words

2013-05-21 Thread Suneel Marthi
It should be easy to convert the below pseudocode to MapReduce to scale for a large collection of documents. From: Suneel Marthi suneel_mar...@yahoo.com To: user@mahout.apache.org user@mahout.apache.org Sent: Tuesday, May 21, 2013 12:20 PM Subject: Re: Feature

Re: mahout colt collections

2013-05-21 Thread Ted Dunning
Dan, I think that she did do the attachment and it got filtered away. Sophie, One easy thing to do is to file a JIRA report using https://issues.apache.org/jira/browse/MAHOUT Then you can attach your program to that bug report. Alternatively, you can attach the program to some other service.

Re: convert input for SVD

2013-05-21 Thread Dmitriy Lyubimov
ps as far as U, V data close to zero, yes that's what you'd expect. Here, by close to zero it still means much bigger than a rounding error of course. e.g. 1E-12 is indeed a small number, and 1E-16 to 1E-18 would be indeed close to zero for the purposes of singularity. 1E-2..1E-5 are actually

Re: convert input for SVD

2013-05-21 Thread Dmitriy Lyubimov
PPS As far as the tool for arff, I am frankly not sure, but it sounds like you've already solved this. On Tue, May 21, 2013 at 1:41 PM, Dmitriy Lyubimov dlie...@gmail.com wrote: ps as far as U, V data close to zero, yes that's what you'd expect. Here, by close to zero it still means much

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
I have so far just used the weights that Solr applies natively. In my experience, what makes a recommendation engine work better is, in order of importance, a) dithering so that you gather wider data b) using multiple sources of input c) returning results quickly and reliably d) the actual
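The post above names dithering without defining it in the visible snippet. One common reading, offered here as an assumption and not as what the poster necessarily meant, is to perturb an already ranked result list with random noise so that lower-ranked items are occasionally shown and gather feedback. The sketch below re-sorts by log(rank) plus Gaussian noise. The class name `Dithering`, the method `dither`, and the `sigma` parameter are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of rank dithering: re-sort by log(rank) + noise.
class Dithering {
    // ranked: items in their original best-first order.
    // sigma: noise scale (assumed parameter); 0 leaves the order unchanged.
    static <T> List<T> dither(List<T> ranked, double sigma, Random rng) {
        int n = ranked.size();
        double[] score = new double[n];
        List<Integer> order = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            // log of the 1-based rank keeps the head of the list fairly stable
            // while letting items further down swap places more freely.
            score[i] = Math.log(i + 1) + sigma * rng.nextGaussian();
            order.add(i);
        }
        order.sort(Comparator.comparingDouble(i -> score[i]));
        List<T> out = new ArrayList<>(n);
        for (int i : order) {
            out.add(ranked.get(i));
        }
        return out;
    }
}
```

A larger `sigma` mixes the list more aggressively. Passing a seeded `Random` makes the shuffle reproducible, for example per user and per day.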

Re: Which database should I use with Mahout

2013-05-21 Thread Ted Dunning
Inline On Tue, May 21, 2013 at 8:59 AM, Pat Ferrel p...@occamsmachete.com wrote: In the interest of getting some empirical data out about various architectures: On Mon, May 20, 2013 at 9:46 AM, Pat Ferrel pat.fer...@gmail.com wrote: ... You use the user history vector as a query?

Re: Interpreting Cluster Dump Metrics

2013-05-21 Thread Ted Dunning
On Tue, May 21, 2013 at 8:47 PM, Pat Ferrel pat.fer...@gmail.com wrote: For this sample it looks like about 20-40 clusters is best? Looking at the results for k=40 by eyeball they do seem pretty good. It is really hard to tell with these numbers. In spite of their heritage, these scaled

Re: Which database should I use with Mahout

2013-05-21 Thread Johannes Schulte
Thanks for the list... As a non-native speaker I have problems understanding the meaning of dithering here. I have the feeling that somewhere between a) and d) there is also diversification of items in the recommendation list, so increasing the distance between the list items according to some