Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
Hi Geet,

I suggest you do this kind of transformation at query time only. Don't 
interfere with the index. That way is more flexible: you can enable/disable it 
on the fly and change your list without re-indexing. 

Just an imaginary example: when the user enters the string "International 
Business Machine logo", this query can be generated: 

PhraseQuery("International Business Machine") AND/OR TermQuery(logo)
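
For instance, a rough (untested) sketch of how that rewritten query could be
built with Lucene's classic query API (Lucene 4.x style; the field name and the
lowercased terms are just assumptions about your analysis chain):

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryTimeRewrite {

    // Builds: PhraseQuery("international business machine") AND TermQuery(logo)
    public static Query build(String field) {
        PhraseQuery phrase = new PhraseQuery();
        phrase.add(new Term(field, "international"));
        phrase.add(new Term(field, "business"));
        phrase.add(new Term(field, "machine"));

        TermQuery logo = new TermQuery(new Term(field, "logo"));

        BooleanQuery query = new BooleanQuery();
        query.add(phrase, BooleanClause.Occur.MUST); // use SHOULD for OR semantics
        query.add(logo, BooleanClause.Occur.MUST);
        return query;
    }
}
```

Because the phrase list lives only in this query-rewrite step, you can change
it at any time without touching the index.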



I know this is Solr, but please see: 
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/





On Thursday, February 20, 2014 11:47 AM, Geet Gangwar  
wrote:
Hi,

I have a requirement to write a custom tokenizer using the Lucene framework.

It should be able to match multiple words as one token. For example, when the
user passes the string "International Business Machine logo" or "IBM logo", it
should return "International Business Machine" as one token and "logo" as one
token.

Please help me with how I can approach this.

Regards

Geet





Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Michael McCandless
If you already know the set of phrases you need to detect, then you can
use Lucene's SynonymFilter to spot them and insert a new token.
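
For example, something along these lines (a rough, untested sketch against the
Lucene 4.x analysis API; the version constant and the single hard-coded entry
are placeholders for however you load your phrase list):

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.util.CharsRef;
import org.apache.lucene.util.Version;

public class PhraseAnalyzer {

    public static Analyzer create() throws IOException {
        SynonymMap.Builder builder = new SynonymMap.Builder(true); // dedup entries

        // Multi-word entries separate words with '\u0000' (SynonymMap.WORD_SEPARATOR).
        CharsRef phrase = new CharsRef("international\u0000business\u0000machine");
        CharsRef acronym = new CharsRef("ibm");

        // Map both directions; the last argument keeps the original tokens as well.
        builder.add(phrase, acronym, true);
        builder.add(acronym, phrase, true);
        final SynonymMap map = builder.build();

        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                Tokenizer source = new StandardTokenizer(Version.LUCENE_46, reader);
                TokenStream stream = new LowerCaseFilter(Version.LUCENE_46, source);
                stream = new SynonymFilter(stream, map, true); // true = ignore case
                return new TokenStreamComponents(source, stream);
            }
        };
    }
}
```

The same map can be applied at index time, query time, or both, depending on
how often you expect the phrase list to change.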

Mike McCandless

http://blog.mikemccandless.com





Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.




Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Yann-Erwan Perio
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar  wrote:

Hi,

> It should be able to match multiple words as one token. For example, when the
> user passes the string "International Business Machine logo" or "IBM logo", it
> should return "International Business Machine" as one token and "logo" as one
> token.

This is an interesting problem. I suppose that if the user enters
"International Business Machines", possibly with some misspelling, you
want to find all documents containing "IBM" - and that if he enters
the string "IBM", you want to find documents which contain the string
"International Business Machines", or even only parts of it. So this
means you need some kind of map relating some acronyms with their
content parts. There really are two directions here: acronym to
content and content to acronym.

One cannot find what an acronym means without some kind of acronym
dictionary. This means that whatever approach you intend to use, there
should be an external dictionary involved, which, for each acronym,
would map a list of possible phrases. Retrieving all phrases matching
the inputted acronym, you'd inject each part of each phrase as a token
(removing possible duplicates between phrase parts). That's basically
it for the direction "acronym to content".
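
For instance (only a sketch; the dictionary is hard-coded here, whereas in
practice you would load it from an external file):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AcronymExpander {

    // External acronym dictionary: acronym -> possible phrases.
    static final Map<String, List<String>> ACRONYMS = new HashMap<String, List<String>>();
    static {
        ACRONYMS.put("ibm", Arrays.asList(
                "international business machines", "international business machine"));
    }

    // De-duplicated words to inject as extra tokens for a given acronym.
    public static Set<String> expand(String acronym) {
        Set<String> tokens = new LinkedHashSet<String>();
        List<String> phrases = ACRONYMS.get(acronym.toLowerCase());
        if (phrases != null) {
            for (String phrase : phrases) {
                tokens.addAll(Arrays.asList(phrase.split("\\s+")));
            }
        }
        return tokens; // expand("IBM") -> [international, business, machines, machine]
    }
}
```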

The direction "content to acronym" is trickier, I believe. One way is
to generate a second (reversed) map, matching each acronym content
part to a list of acronyms containing that part. You'd simply inject
acronyms (and possibly other things) if one part of their content is
matched (or more than one part, if you want to increase relevance).
This could however possibly require the definition of a specific
hashing mechanism, if you want to find approximate (distanced) keys
(e.g. "intenational", with the missing "r", would still find "IBM"). A
second way (more coupled to the concept of acronym, so less generic)
could be to consider that every word starting with a capital letter is
part of an acronym, buffering sequences of words starting with a
capital letter, and eventually injecting the resulting acronym, if
found in the acronym dictionary. This might not be safe, though - the
user may not have the discipline to capitalize the words being part of
an acronym (or may even misspell the first letter), or concatenated
first letters could match an irrelevant acronym (many word sequences
can give the acronym "IBM").
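
The reversed map itself is straightforward to derive from the same dictionary;
for example (again only a plain sketch, with no approximate matching of keys):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AcronymReverser {

    // Builds: content word -> acronyms whose expansion contains that word.
    public static Map<String, Set<String>> reverse(Map<String, List<String>> acronyms) {
        Map<String, Set<String>> wordToAcronyms = new HashMap<String, Set<String>>();
        for (Map.Entry<String, List<String>> entry : acronyms.entrySet()) {
            for (String phrase : entry.getValue()) {
                for (String word : phrase.split("\\s+")) {
                    Set<String> hits = wordToAcronyms.get(word);
                    if (hits == null) {
                        hits = new HashSet<String>();
                        wordToAcronyms.put(word, hits);
                    }
                    hits.add(entry.getKey());
                }
            }
        }
        return wordToAcronyms; // e.g. "business" -> {ibm}
    }
}
```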

I do not know whether there already exists some Lucene module which
processes acronyms, or if someone is working on one. It's definitely
worth a search though, because writing a good one from scratch could
mean a few days of work, or more.

HTH.
