This sounds like the best suggestion so far.

On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:
> This is typically what Behemoth can be used for:
> https://github.com/DigitalPebble/behemoth. It has a Mahout module to
> generate vectors in the same format as SparseVectorsFromSequenceFiles.
> Assuming that the document similarity job itself can run on the same input
> as the clustering, you'd be able to use that in combination with the other
> Behemoth modules, e.g. import the documents, parse with Tika, tokenize, do
> some NLP with GATE or UIMA, find the similarities with Mahout, send to Solr,
> etc.
>
> Julien
>
> On 3 April 2013 16:28, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
>
>> Thinking out loud here: it would be great to have a DocumentSimilarityJob
>> that is supplied a collection of documents, applies the necessary
>> preprocessing (tokenization, vectorization, etc.), and computes document
>> similarities.
>>
>> Could be a nice starter task to add something like this.
>>
>> On 03.04.2013 17:09, Suneel Marthi wrote:
>>> Akshay,
>>>
>>> If you are trying to determine document similarity using MapReduce,
>>> Mahout's RowSimilarityJob may be useful here.
>>>
>>> Have a look at the following thread:
>>>
>>> http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
>>>
>>> I had tried this on a corpus of 2 million web sites and had good results.
>>>
>>> Let us know if this works for you.
>>>
>>> From: akshay bhatt <akshay22bh...@gmail.com>
>>> To: user@mahout.apache.org
>>> Sent: Wednesday, April 3, 2013 5:36 AM
>>> Subject: Integrating Mahout with existing nlp libraries
>>>
>>> I tried searching for this here and there but could not find any good
>>> solution, so I thought of asking the NLP experts. I am developing a text
>>> similarity application for which I need to match thousands and thousands
>>> of documents (of around 1000 words each) against each other. For the NLP
>>> part, my best bet is NLTK (given its capabilities and the
>>> algorithm-friendliness of Python), but now that part-of-speech tagging by
>>> itself is taking so much time, I believe NLTK may not be the best fit.
>>> Java or C won't hurt me, so any solution will work. Please note that I
>>> have already started migrating from MySQL to HBase in order to work with
>>> more freedom on such a large amount of data. But the question still
>>> remains: how do I run the algorithms? Mahout may be a choice, but it too
>>> is aimed at machine learning rather than dedicated NLP (maybe good for
>>> speech recognition). What other options are available? In short, I need
>>> high-performance NLP (a step down from high-performance machine
>>> learning). (I am inclined a bit towards Mahout, considering future usage.)
>>>
>>> (already asked at
>>> http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
>>>
>>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
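
In case it helps, here is a minimal sketch of how the RowSimilarityJob step Suneel mentions can be driven from Java, assuming the TF-IDF vectors have already been produced with seq2sparse (SparseVectorsFromSequenceFiles). The paths, vocabulary size, and per-row similarity cap below are placeholders, and the option names may vary slightly between Mahout versions:

import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

public class DocSimilaritySketch {

  public static void main(String[] args) throws Exception {
    // Input: TF-IDF vectors written by seq2sparse.
    // Output: for each document (row), a sparse vector of cosine
    // similarities to its most similar documents.
    // --numberOfColumns is the vocabulary size (see the dictionary file
    // that seq2sparse writes); adjust all paths and numbers to your setup.
    ToolRunner.run(new RowSimilarityJob(), new String[] {
        "--input", "/tmp/vectors/tfidf-vectors",
        "--output", "/tmp/doc-similarities",
        "--numberOfColumns", "50000",
        "--similarityClassname", "SIMILARITY_COSINE",
        "--maxSimilaritiesPerRow", "100"
    });
  }
}

The output is a sequence file keyed by document row id, which you can then join back to the original document names from the seq2sparse docIndex to get human-readable pairs.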