For example, I am able to do:
Analyzer analyzer = new StandardAnalyzer(); // or any other analyzer
TokenStream ts = analyzer.tokenStream("myfield", new StringReader("some text goes here"));
Token t = ts.next(); // returns null once the stream is exhausted
while (t != null) {
    System.out.println("token: " + t);
    t = ts.next();
}
But I would need to enhance it to include:
- splitting on hyphens, semicolons, etc.
- stemming (Porter)
- synonyms
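
To make the three requirements above concrete without depending on any Lucene version, here is a minimal plain-Java sketch of the same pipeline: split, drop stopwords, stem, map synonyms, count. Everything in it is an assumption made for illustration: the class name TermFrequencySketch, the tiny stopword list, the synonym table, and especially NAIVE_STEMMER, which is only a stand-in that strips a trailing "s" and is not the Porter algorithm. The stemmer is passed in as a parameter so a real Porter implementation could be swapped in.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

class TermFrequencySketch {
    // Hypothetical stopword list; a real one would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to", "is", "in"));

    // Hypothetical synonym table: variant -> canonical term.
    static final Map<String, String> SYNONYMS = new TreeMap<>();
    static {
        SYNONYMS.put("automobile", "car");
        SYNONYMS.put("auto", "car");
    }

    // Stand-in stemmer, NOT Porter: only strips a single trailing "s".
    static final UnaryOperator<String> NAIVE_STEMMER =
            w -> (w.length() > 3 && w.endsWith("s") && !w.endsWith("ss"))
                    ? w.substring(0, w.length() - 1)
                    : w;

    // Returns term -> frequency, sorted by term.
    static Map<String, Integer> termFrequencies(String text, UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new TreeMap<>();
        // Split on whitespace, hyphens, semicolons and other common punctuation.
        for (String raw : text.toLowerCase(Locale.ROOT).split("[\\s\\-;:,.!?()\"]+")) {
            if (raw.isEmpty() || STOPWORDS.contains(raw)) {
                continue; // skip empty fragments and stopwords
            }
            String stemmed = stemmer.apply(raw);
            // Collapse synonyms onto one canonical term before counting.
            String term = SYNONYMS.getOrDefault(stemmed, stemmed);
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "The auto-repair shop sells cars; an automobile is a car.";
        termFrequencies(text, NAIVE_STEMMER)
                .forEach((term, n) -> System.out.println(term + ": " + n));
    }
}
```

In this sketch stopwords are removed before stemming and synonyms are looked up after it, so the synonym table has to be keyed on stemmed forms. That ordering is a design choice of the example, not something Lucene prescribes.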
Thanks
joe_coder wrote:
>
> Grant, thanks for responding.
>
> My issue is that I am not planning to use Lucene (as I don't need any
> search capability, at least not yet). All I have is a text document, and I
> need to extract keywords and their frequencies (which could be a simple split
> on spaces plus a running count). But I realize that I would need to do some
> preprocessing to remove stopwords, stem words, and check for synonyms.
> So I am wondering if there is already such code in Lucene (or any
> other project) that I can use directly.
>
> Thanks!
>
>
>
> Grant Ingersoll-6 wrote:
>>
>>
>> On Aug 13, 2009, at 7:40 AM, joe_coder wrote:
>>
>>>
>>> I was wondering if there is any way to directly use the Lucene API to
>>> extract terms from a given string. My requirement is that I have a text
>>> document for which I need a term frequency vector (after stemming, removing
>>> stopwords, and checking synonyms). The result needs to be the terms and
>>> their frequencies.
>>
>> IndexReader.getTermFreqVector(), assuming you have indexed using Term
>> Vectors.
>>
>>
>>>
>>> Is it possible to get this using any Lucene API? (As I see it, Lucene also
>>> needs to stem, remove stopwords, handle synonyms, etc. before indexing.) Or
>>> is there any Java project that would help me with this?
>>> --
>>> View this message in context:
>>> http://www.nabble.com/Term-Extraction-tp24953406p24953406.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)
>> using Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>>
>>
>
>
--
View this message in context:
http://www.nabble.com/Term-Extraction-tp24953406p24954264.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.