Using Rowsimilarity

2016-04-14 Thread David Starina
showed you here and I should check some other place, like my input data? Thanks in advance for any help! --David Starina

Re: Removing MAHOUT_LOCAL option

2016-03-21 Thread David Starina
Anyhow, I'm +1 for removing MAHOUT_LOCAL, but I believe the deprecated MapReduce-based code still makes sense if it is running well on Ignite. On Mon, Mar 21, 2016 at 8:20 AM, David Starina <david.star...@gmail.com> wrote: > Has anyone tried to run the deprecated MapReduce code

Re: Removing MAHOUT_LOCAL option

2016-03-21 Thread David Starina
Has anyone tried to run the deprecated MapReduce code on Ignite? Is the performance improvement good enough to reconsider leaving those algorithms in Mahout? On Mon, Mar 21, 2016 at 12:45 AM, Andrew Musselman < andrew.mussel...@gmail.com> wrote: > Yes I agree; will leave the question open a

LDA - parameters "maxIter" and "max_doc_topic_iters"

2016-03-20 Thread David Starina
What are the best values for those two parameters? I usually only read suggestions on how to set the number of iterations (=maxIter). Some suggest it is best to set it as high as 1000 iterations. However, what about the number of iterations per document? How is this one really used and what would be

Re: Document similarity

2016-03-11 Thread David Starina
you want to know cluster > inclusion or get a list of similar docs? > > On Feb 23, 2016, at 1:01 PM, David Starina <david.star...@gmail.com> > wrote: > > Guys, one more question ... Are there some incremental methods to do this? > I don't want to run the whole job

Re: LDA - help me understand

2016-03-10 Thread David Starina
About the last question: it probably has something to do with setting the max iterations and max iterations per document to the same value ... What is the "number of iterations per document" really doing? --David On Thu, Mar 10, 2016 at 5:39 PM, David Starina <david.star...@gma

Re: LDA - help me understand

2016-03-10 Thread David Starina
. Is there something I don't understand about this algorithm? Why would one iteration take that much longer just because you run more iterations? --David On Thu, Mar 10, 2016 at 2:24 PM, David Starina <david.star...@gmail.com> wrote: > How does the memory requirement grow with the number of topics?

Re: LDA - help me understand

2016-03-10 Thread David Starina
How does the memory requirement grow with the number of topics? A little experimentation shows me that the number of documents doesn't matter as much as the number of topics ... Does the memory requirement grow exponentially with the number of topics? --David On Thu, Mar 10, 2016 at 11:43 AM, David

LDA - help me understand

2016-03-10 Thread David Starina
Hi, I realize MapReduce algorithms are not the "hot new stuff" anymore, but I am playing around with LDA. I have some memory problems; can you suggest how to set the parameters to make this work? I am running on a virtual cluster on my laptop - two nodes with 3 GB of memory each

Re: Document similarity

2016-02-23 Thread David Starina
they work well. > > The query to the KNN engine is a document, each field mapped to the > corresponding field of the index. The result is the k nearest neighbors to > the query doc. > > > > On Feb 14, 2016, at 11:05 AM, David Starina <david.star...@gmail.com>

Re: Document similarity

2016-02-14 Thread David Starina
ll lead you to a good similarity or distance measure. > > As I recall, Spark does provide an LDA implementation. Gensim provides an > > API for doing LDA similarity out of the box. Vowpal Wabbit is also worth > > looking at, particularly for a large dataset. > > Hope th

Document similarity

2016-02-14 Thread David Starina
Hi, I need to build a system to determine the N (e.g. 10) most similar documents to a given document. I have some (theoretical) knowledge of Mahout algorithms, but not enough to build the system. Can you give me some suggestions? At first I was researching Latent Semantic Analysis for the task, but
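
The task in this message (find the N documents most similar to a given one) can be illustrated with a brute-force sketch. This is not Mahout code and not what the thread settled on; it assumes plain whitespace tokenisation, raw term counts, and cosine similarity, and the names term_vector, cosine, and most_similar are hypothetical helpers made up for the illustration.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term counts for a lower-cased, whitespace-split text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, docs, n=10):
    """Return the ids of the n documents most similar to docs[query_id]."""
    vectors = {doc_id: term_vector(text) for doc_id, text in docs.items()}
    query = vectors[query_id]
    scored = [
        (cosine(query, vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id  # never return the query document itself
    ]
    # Highest similarity first; ties keep their original insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


# Toy corpus, invented for the example.
docs = {
    "a": "mahout lda topic model",
    "b": "mahout lda topic model memory",
    "c": "eclipse maven import error",
}
print(most_similar("a", docs, n=2))
```

Comparing the query against every document costs time linear in the corpus size per query, so this only suits small collections; a larger corpus would typically need weighting such as TF-IDF or a topic model, plus an index.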

Running Mahout on Hadoop cluster

2016-02-08 Thread David Starina
Hi, I am not sure why I cannot find the info I am looking for online, probably not searching in the right way, so I am hoping you guys will be able to point me in the right direction. I have set up a Mahout project in IntelliJ IDEA on my machine. I created a class extending AbstractJob to run

Re: Mahout - problem importing to Eclipse

2016-02-08 Thread David Starina
can just leave it at that. Best regards, David On Mon, Feb 1, 2016 at 11:52 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > the user list will not let attachments thru. > > On Sun, Jan 31, 2016 at 11:59 PM, David Starina <david.star...@gmail.com> > wrote: > >

Mahout - problem importing to Eclipse

2016-02-01 Thread David Starina
Hi, I have a problem importing the project to Eclipse - I get the error "Could not update project mahout-mr configuration". Attaching the error as an image. Anyone seen this problem before? I am using Eclipse 4.5.1 (Mars.1) on Fedora 22. I did a Maven build successfully, installed m2eclipse and

Re: word2vec in mahout.

2015-05-13 Thread David Starina
You can also check out the implementation in MLlib: https://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec On Wed, May 13, 2015 at 9:11 PM, Dan Dong dongda...@gmail.com wrote: Thanks Andrew, I will turn to DL4J. Cheers, Dan 2015-05-13 10:34 GMT-05:00 Andrew

Re: Latent Semantic Analysis for Document Categorization

2015-03-26 Thread David Starina
Hi, as Chirag said, try LDA. You can also check an implementation of pLSA, but it is not part of Mahout; you can find it here: https://github.com/akopich/dplsa --David On Thu, Mar 26, 2015 at 2:01 PM, 3316 Chirag Nagpal chiragnagpal_12...@aitpune.edu.in wrote: A better approach I can think