This sounds like the best suggestion so far.
        
On Apr 3, 2013, at 8:45 AM, Julien Nioche wrote:

> This is typically what Behemoth can be used for:
> https://github.com/DigitalPebble/behemoth. It has a Mahout module that
> generates vectors in the same format as SparseVectorsFromSequenceFiles.
> Assuming the document similarity job itself can run on the same input
> as the clustering, you'd be able to use it in combination with the
> other Behemoth modules, e.g. import the documents, parse them with
> Tika, tokenize, do some NLP with GATE or UIMA, find the similarities
> with Mahout, send to Solr, etc.
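> 
> Roughly, driving the modules from Java could look like this (an
> untested sketch: the driver class names and flags are from memory, and
> the /docs paths are placeholders, so double-check everything against
> the Behemoth wiki):
> 
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.hadoop.util.ToolRunner;
>   import com.digitalpebble.behemoth.tika.TikaDriver;
>   import com.digitalpebble.behemoth.util.CorpusGenerator;
> 
>   public class BehemothPipeline {
>     public static void main(String[] args) throws Exception {
>       Configuration conf = new Configuration();
>       // 1) import raw documents into a Behemoth SequenceFile corpus
>       //    (paths are placeholders)
>       ToolRunner.run(conf, new CorpusGenerator(),
>           new String[] {"-i", "/docs/raw", "-o", "/docs/corpus"});
>       // 2) extract text and metadata with Tika
>       ToolRunner.run(conf, new TikaDriver(),
>           new String[] {"-i", "/docs/corpus", "-o", "/docs/parsed"});
>       // 3) the mahout module can then vectorize /docs/parsed in the
>       //    same format as seq2sparse, ready for the similarity step
>     }
>   }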
> 
> Julien
> 
> 
> 
> On 3 April 2013 16:28, Sebastian Schelter <ssc.o...@googlemail.com> wrote:
> 
>> Thinking out loud here: it would be great to have a
>> DocumentSimilarityJob that takes a collection of documents, applies the
>> necessary preprocessing (tokenization, vectorization, etc.), and
>> computes the document similarities.
>> 
>> Could be a nice starter task to add something like this.
>> 
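>> A minimal sketch of what such a job could wrap, just chaining the
>> existing jobs (untested; the /docs paths are placeholders, and
>> depending on the Mahout version you may also have to pass
>> --numberOfColumns to RowSimilarityJob):
>> 
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.util.ToolRunner;
>>   import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;
>>   import org.apache.mahout.utils.vectors.RowIdJob;
>>   import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;
>> 
>>   public class DocumentSimilarityJob {
>>     public static void main(String[] args) throws Exception {
>>       Configuration conf = new Configuration();
>>       // tokenize and vectorize a SequenceFile<Text,Text> corpus
>>       // (produces tf-idf vectors by default)
>>       ToolRunner.run(conf, new SparseVectorsFromSequenceFiles(),
>>           new String[] {"-i", "/docs/seq", "-o", "/docs/vectors"});
>>       // map the Text document keys to the int row ids the similarity
>>       // job expects; writes 'matrix' and 'docIndex' under the output
>>       ToolRunner.run(conf, new RowIdJob(),
>>           new String[] {"-i", "/docs/vectors/tfidf-vectors",
>>                         "-o", "/docs/matrix"});
>>       // pairwise cosine similarities between the document rows
>>       ToolRunner.run(conf, new RowSimilarityJob(),
>>           new String[] {"-i", "/docs/matrix/matrix",
>>                         "-o", "/docs/similarities",
>>                         "--similarityClassname", "SIMILARITY_COSINE",
>>                         "--maxSimilaritiesPerRow", "100"});
>>     }
>>   }
>> 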
>> On 03.04.2013 17:09, Suneel Marthi wrote:
>>> Akshay,
>>> 
>>> If you are trying to determine document similarity with MapReduce,
>>> Mahout's RowSimilarityJob may be useful here.
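>>> 
>>> It writes its results as a SequenceFile<IntWritable,VectorWritable>,
>>> where each row's vector holds the ids and scores of that document's
>>> most similar documents. Reading them back could look something like
>>> this (an untested sketch; the output path is a placeholder):
>>> 
>>>   import java.util.Iterator;
>>>   import org.apache.hadoop.conf.Configuration;
>>>   import org.apache.hadoop.fs.Path;
>>>   import org.apache.hadoop.io.IntWritable;
>>>   import org.apache.mahout.common.Pair;
>>>   import org.apache.mahout.common.iterator.sequencefile.SequenceFileIterable;
>>>   import org.apache.mahout.math.Vector;
>>>   import org.apache.mahout.math.VectorWritable;
>>> 
>>>   public class PrintSimilarities {
>>>     public static void main(String[] args) {
>>>       Configuration conf = new Configuration();
>>>       // placeholder path to one part file of the similarity output
>>>       Path part = new Path("/docs/similarities/part-r-00000");
>>>       for (Pair<IntWritable, VectorWritable> row :
>>>           new SequenceFileIterable<IntWritable, VectorWritable>(part, true, conf)) {
>>>         Iterator<Vector.Element> it = row.getSecond().get().iterateNonZero();
>>>         while (it.hasNext()) {
>>>           Vector.Element e = it.next();
>>>           // doc <row id> is similar to doc <e.index()> with score e.get()
>>>           System.out.println(row.getFirst().get() + "\t" + e.index() + "\t" + e.get());
>>>         }
>>>       }
>>>     }
>>>   }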
>>> 
>>> Have a look at the following thread:
>>> 
>>> http://markmail.org/message/ddkd3qbuub3ak6gl#query:+page:1+mid:x5or2x4rsv2kl4wv+state:results
>>> 
>>> I tried this on a corpus of 2 million web sites and got good results.
>>> 
>>> Let us know if this works for you.
>>> 
>>> 
>>> 
>>> ________________________________
>>> From: akshay bhatt <akshay22bh...@gmail.com>
>>> To: user@mahout.apache.org
>>> Sent: Wednesday, April 3, 2013 5:36 AM
>>> Subject: Integrating Mahout with existing nlp libraries
>>> 
>>> I tried searching here and there but could not find a good solution,
>>> so I thought I would ask the NLP experts. I am developing a
>>> text-similarity application for which I need to match many thousands
>>> of documents (of around 1000 words each) against each other. For the
>>> NLP part my first choice was NLTK, given its capabilities and the
>>> algorithm-friendliness of Python, but with part-of-speech tagging
>>> alone taking so much time, I believe NLTK may not be the best fit.
>>> Java or C won't hurt me, so any solution will work.
>>> 
>>> Note that I have already started migrating from MySQL to HBase in
>>> order to work more freely with such a large amount of data. But the
>>> question remains: how do I run the algorithms? Mahout may be a
>>> choice, but it too is aimed at machine learning, not dedicated to NLP
>>> (though maybe good for speech recognition). What other options are
>>> available? In short, I need high-performance NLP (a step down from
>>> high-performance machine learning). I am leaning a bit towards
>>> Mahout, with future usage in mind.
>>> 
>>> (Already asked at
>>> http://stackoverflow.com/questions/15782898/how-can-i-imporve-performance-of-nltk-alternatives)
>>> 
>> 
>> 
> 
> 
> -- 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
