Incremental Clustering from Text Data

2014-01-16 Thread John White
Hello, I use seq2sparse with the -wt tfidf option and run the k-means pipeline. If new data arrives at a later date, should I decide which cluster it belongs to using Listing 9.4 (News clustering using canopy generation and k-means clustering) in Mahout in Action, or is there a better/more generic (i.e.

Re: Classification of books

2014-01-16 Thread Ted Dunning
You generally want to do linguistic pre-processing (finding phrases, synonymizing certain forms such as abbreviations, tokenizing, dropping stop words, removing boilerplate, removing tables) before doing vectorization. Altogether, these form pre-processing. To classify books, you need to
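
A minimal sketch of the kind of pre-processing described above, in plain Java: tokenizing, mapping abbreviations to one canonical form, and dropping stop words before any vectorization happens. The class name `Preprocessor`, the stop-word set, and the synonym map are hypothetical and only for illustration. Phrase detection and boilerplate or table removal are not shown, and a real pipeline would more likely rely on a Lucene analyzer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Mahout or Lucene.
class Preprocessor {
    private final Set<String> stopWords;        // assumed: lower-case stop words
    private final Map<String, String> synonyms; // assumed: abbreviation -> canonical form

    Preprocessor(Set<String> stopWords, Map<String, String> synonyms) {
        this.stopWords = stopWords;
        this.synonyms = synonyms;
    }

    // Lower-cases the text, splits on anything that is not a letter or digit,
    // maps each token to its canonical form, and drops stop words.
    List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (raw.isEmpty()) {
                continue; // split() can yield an empty leading token
            }
            String canonical = synonyms.getOrDefault(raw, raw);
            if (!stopWords.contains(canonical)) {
                tokens.add(canonical);
            }
        }
        return tokens;
    }
}
```

The resulting token list would then be handed to whatever vectorizer is in use, so that vectorization only ever sees cleaned tokens.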

Re: Classification of books

2014-01-16 Thread Saeed Iqbal KhattaK
Dear Suresh, I am also working on classification of books. First I collect the metadata of my e-books; after collecting the metadata, I start the second stage, which is pre-processing each e-book. In pre-processing, I collect information regarding *book titles, chapter titles, sections, subsection

Re: Classification of books

2014-01-16 Thread Suresh M
Hi, Thanks for your reply. I have got the table of contents, meta-data, title, author, etc. for the books. Can you please tell me the next steps to proceed? I have read in the Mahout in Action book that there are a few tools available for vectorization, e.g. Lucene analyzers and Mahout vector encoders. Can

RE: travelling salesman on Mahout

2014-01-16 Thread simon.2.thompson
Hi all - there is a project at MIT called FlexGP that has done more work on this. http://groups.csail.mit.edu/EVO-DesignOpt/groupWebSite/index.php?n=Site.FlexGP Unfortunately I can't find a download for the code so I suppose that it's not opensource, however you might like to contact these

problem with recommendation algorithm

2014-01-16 Thread Giuseppe
Hi guys, I'm new to Mahout. I'm using it to experiment with a recommender system. I'm using this code: import org.apache.mahout.cf.taste.impl.neighborhood.*; import org.apache.mahout.cf.taste.impl.recommender.*; import org.apache.mahout.cf.taste.impl.similarity.*; import

Re: problem with recommendation algorithm

2014-01-16 Thread Sebastian Schelter
Does the csv file that you load contain user with id 1 ? On 01/16/2014 12:02 PM, Giuseppe wrote: Hi guys, I'm new with mahout. I'm using it for an experimentation with recommender system. I'm using this code: import org.apache.mahout.cf.taste.impl.neighborhood.*; import
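
One way to answer the question above is to scan the file for the user ID before building the recommender. This is a sketch in plain Java, assuming the common `userID,itemID,preference` layout with the user ID in the first column. The class and method names are hypothetical, and skipping blank lines and lines starting with `#` is a choice made for this sketch.

```java
import java.util.List;

// Hypothetical helper, not part of Mahout.
class UserIdCheck {
    // Returns true if any line has the given user ID in its first
    // comma-separated field. Blank lines, '#' lines, and lines whose first
    // field is not a number (e.g. a header row) are skipped.
    static boolean containsUser(List<String> lines, long userId) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            String first = trimmed.split(",")[0].trim();
            try {
                if (Long.parseLong(first) == userId) {
                    return true;
                }
            } catch (NumberFormatException e) {
                // Not a numeric user ID; ignore this line.
            }
        }
        return false;
    }
}
```

The lines could come from `java.nio.file.Files.readAllLines(...)` on the same CSV file that is passed to the data model. If the method returns false for ID 1, asking for recommendations for user 1 cannot work with that data.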

Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Suneel Marthi
Here's the new URL for Mahout 0.9 Release: https://repository.apache.org/content/repositories/orgapachemahout-1001/org/apache/mahout/mahout-buildtools/0.9/ For those volunteering to test this, some of the things to be verified: a) Verify that u can unpack the release (tar or zip) b) Verify u r

Re: Mahout 0.9 Release Candidate - VOTE

2014-01-16 Thread Suneel Marthi
Please hold off on this, screwed up the future development version#. Have to redo this again. Sorry about that. On Thursday, January 16, 2014 8:47 AM, spa...@gmail.com spa...@gmail.com wrote: Sorry, sent little too early :). Got email from Suneel. On Thu, Jan 16, 2014 at 7:16 PM,

Re: Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Chameera Wijebandara
Hi Suneel, It is still giving a 404 error. Thanks, Chameera On Thu, Jan 16, 2014 at 7:11 PM, Suneel Marthi suneel_mar...@yahoo.com wrote: Here's the new URL for Mahout 0.9 Release:

Re: Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Koji Sekiguchi
Is it: https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-buildtools/0.9/ koji -- http://soleami.com/blog/mahout-and-machine-learning-training-course-is-here.html (14/01/16 23:23), Chameera Wijebandara wrote: Hi Suneel, Still it getting 404

Re: Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Suneel Marthi
https://repository.apache.org/content/repositories/orgapachemahout-1002/org/apache/mahout/mahout-distribution/0.9/ On Thursday, January 16, 2014 9:43 AM, Koji Sekiguchi k...@r.email.ne.jp wrote: Is it:

Re: Classification of books

2014-01-16 Thread Ken Krugler
On Jan 16, 2014, at 1:58am, Suresh M suresh4mas...@gmail.com wrote: Hi, Thanks for your reply. I have got the table of contents, meta-data, title, author, etc for the books. Can you please tell me the next steps to proceed. I have read in Mahout In Action book that there are few tools

Re: Incremental Clustering from Text Data

2014-01-16 Thread John White
Hi, Clarifying my question a little bit: How can I create a vector from a single text document that conforms to the schema of the collection of vectors that I created earlier using seq2sparse? I want to use it to find the closest centroid to a text document that is submitted by a client. Best
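
The general idea behind this question can be sketched in plain Java: reuse the term-to-index dictionary and document frequencies from the earlier batch run, weight the new document's term counts with them, and pick the nearest centroid. Everything here is an assumption for illustration: the class name, the in-memory dictionary and frequency arrays, the textbook weighting tf * ln(N / df), and cosine distance. It does not read Mahout's output files, and the weighting may differ from what seq2sparse actually computes, so the exact formula would need to be matched for the distances to be comparable.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Mahout.
class SingleDocVectorizer {
    private final Map<String, Integer> dictionary; // term -> index in 0..size-1
    private final int[] docFreq;                   // index -> document frequency (>= 1)
    private final int numDocs;                     // documents in the original corpus

    SingleDocVectorizer(Map<String, Integer> dictionary, int[] docFreq, int numDocs) {
        this.dictionary = dictionary;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
    }

    // Builds a dense TF-IDF vector in the same index space as the dictionary.
    // Terms that were not in the original corpus are dropped.
    double[] vectorize(String text) {
        double[] vec = new double[dictionary.size()];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            Integer idx = dictionary.get(token);
            if (idx != null) {
                vec[idx] += 1.0; // raw term frequency
            }
        }
        for (int i = 0; i < vec.length; i++) {
            if (vec[i] > 0) {
                vec[i] *= Math.log((double) numDocs / docFreq[i]); // assumed idf
            }
        }
        return vec;
    }

    // Cosine distance; defined as 1.0 when either vector is all zeros.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 1.0;
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the centroid closest to v, or -1 if there are none.
    static int closestCentroid(double[] v, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double d = cosineDistance(v, centroids[c]);
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }
}
```

The key design point is that the dictionary and document frequencies are frozen: the new document is projected into the existing vector space rather than triggering a new vectorization of the whole corpus.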

Re: Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Suneel Marthi
This is not a Maven issue. Andrew, r u on Mac OS 10.8? If so u would be seeing these errors. These errors are being spewed by Carrot RandomizedRunner and per the conversation in Mahout-1345 this happens on Mac OS X due to an issue in Lucene 4.3.1 and below that was fixed in later Lucene releases.

Re: Mahout 0.9 Release - Call for Volunteers

2014-01-16 Thread Sékine Coulibaly
Suneel, Mahout build is OK. However, at least 3 integration test cases fail, as follows: Failed tests: ARFFVectorIterableTest.testNumerics:237->Assert.assertEquals:592->Assert.assertEquals:494->Assert.failNotEquals:743->Assert.fail:88 expected:<1.0> but was:<NaN>

mahout text mining

2014-01-16 Thread qiaoresearcher
Mahout has an example of using naive Bayes to classify the 20 newsgroups data set, but how can I classify just paragraphs (e.g. Twitter messages, movie reviews) in text files? The text files have content like: -- text paragraph 1 class

Re: mahout text mining

2014-01-16 Thread Suneel Marthi
See http://chimpler.wordpress.com/2013/03/13/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages/ for classifying twitter messages. Lucene has support for ngrams, stopwords, porter stemmer, snowball stemmer, language specific analyzers etc... Mahout uses Lucene

Re: mahout text mining

2014-01-16 Thread qiaoresearcher
Suneel, thanks a lot. I assume the example you mentioned generates a numerical vector for each paragraph, is that right? Now, to further improve the performance, I may add other features from other data sets into this vector and make it much longer, then use the enriched vector for naive