Custom Tokenizer/Analyzer

2014-02-20 Thread Geet Gangwar
Hi,

I have a requirement to write a custom tokenizer using the Lucene framework.

My requirement is that it should be able to match multiple words as one
token. For example, when a user passes the string "International Business
Machine logo" or "IBM logo", it should return "International Business
Machine" as one token and "logo" as one token.

Please help me with how I can approach this.

Regards

Geet


Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Yann-Erwan Perio
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar geetgang...@gmail.com wrote:

Hi,

 My requirement is that it should be able to match multiple words as
 one token. For example, when a user passes the string "International
 Business Machine logo" or "IBM logo", it should return "International
 Business Machine" as one token and "logo" as one token.

This is an interesting problem. I suppose that if the user enters
"International Business Machines", possibly with some misspelling, you
want to find all documents containing "IBM" - and that if they enter
the string "IBM", you want to find documents which contain the string
"International Business Machines", or even only parts of it. So this
means you need some kind of map relating acronyms to their
content parts. There really are two directions here: acronym to
content and content to acronym.

One cannot find what an acronym means without some kind of acronym
dictionary. This means that whatever approach you intend to use, there
should be an external dictionary involved which, for each acronym,
maps a list of possible phrases. Retrieving all phrases matching
the input acronym, you'd inject each part of each phrase as a token
(removing possible duplicates between phrase parts). That's basically
it for the acronym-to-content direction.
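
A minimal sketch of this acronym-to-content direction using a plain
Java map (the class and method names here are hypothetical, not part of
any Lucene API): look up the acronym, then emit each distinct phrase
part as a token, deduplicated across phrases.

```java
import java.util.*;

// Hypothetical sketch: an external dictionary mapping each acronym to a
// list of possible phrases, expanded into distinct phrase-part tokens.
class AcronymDictionary {
    private final Map<String, List<String>> acronymToPhrases = new HashMap<>();

    public void add(String acronym, String phrase) {
        acronymToPhrases.computeIfAbsent(acronym.toLowerCase(), k -> new ArrayList<>())
                        .add(phrase);
    }

    // All distinct, lower-cased parts of every phrase mapped to the acronym.
    public Set<String> expand(String acronym) {
        Set<String> tokens = new LinkedHashSet<>(); // dedupes across phrases
        for (String phrase : acronymToPhrases.getOrDefault(acronym.toLowerCase(),
                                                           Collections.emptyList())) {
            for (String part : phrase.toLowerCase().split("\\s+")) {
                tokens.add(part);
            }
        }
        return tokens;
    }
}
```

With "IBM" mapped to "International Business Machines", expanding "IBM"
would yield the tokens "international", "business", and "machines",
each of which could then be injected into the token stream.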

The content-to-acronym direction is trickier, I believe. One way is
to generate a second (reversed) map, matching each acronym content
part to a list of acronyms containing that part. You'd simply inject
acronyms (and possibly other things) if one part of their content is
matched (or more than one part, if you want to increase relevance).
This could, however, require the definition of a specific hashing
mechanism if you want to find approximate (distanced) keys (e.g.
"intenational", with the missing "r", would still find "IBM"). A
second way (more coupled to the concept of acronym, so less generic)
could be to consider that every word starting with a capital letter is
part of an acronym, buffering sequences of words starting with a
capital letter and eventually injecting the resulting acronym, if
found in the acronym dictionary. This might not be safe, though - the
user may not have the discipline to capitalize the words that are part
of an acronym (or may even misspell the first letter), and the
concatenated first letters could match an irrelevant acronym (many
word sequences can give the acronym "IBM").
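
The capitalization-based heuristic can be sketched as follows. This is
illustrative stdlib code, not a Lucene filter: buffer runs of
capitalized words, concatenate their initials into a candidate acronym,
and inject it only when the dictionary confirms it (which addresses the
"irrelevant acronym" risk above, though not the capitalization-discipline
one).

```java
import java.util.*;

// Illustrative sketch of the capitalization heuristic: buffer sequences
// of words starting with a capital letter, build a candidate acronym from
// their initials, and emit it only if it is in the known-acronym set.
class AcronymSpotter {
    public static List<String> spot(String text, Set<String> knownAcronyms) {
        List<String> found = new ArrayList<>();
        StringBuilder candidate = new StringBuilder();
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
                candidate.append(word.charAt(0)); // extend the buffered run
            } else {
                flush(candidate, knownAcronyms, found); // run of capitals ended
            }
        }
        flush(candidate, knownAcronyms, found); // handle a run ending at EOS
        return found;
    }

    private static void flush(StringBuilder candidate, Set<String> known,
                              List<String> out) {
        String acronym = candidate.toString();
        if (acronym.length() > 1 && known.contains(acronym)) {
            out.add(acronym); // only inject acronyms the dictionary confirms
        }
        candidate.setLength(0);
    }
}
```

For instance, "the International Business Machines logo" with a known
set of {"IBM"} would yield the single injected acronym "IBM".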

I do not know whether there already exists some Lucene module which
processes acronyms, or if someone is working on one. It's definitely
worth a search though, because writing a good one from scratch could
mean a few days of work, or more.

HTH.

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Benson Margulies
It sounds like you've been asked to implement Named Entity Recognition.
OpenNLP has some capability here. There are also, um, commercial
alternatives.



Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Michael McCandless
If you already know the set of phrases you need to detect then you can
use Lucene's SynonymFilter to spot them and insert a new token.

Mike McCandless

http://blog.mikemccandless.com
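
To illustrate what the SynonymFilter approach buys you, here is a
minimal stdlib emulation of its longest-match phrase grouping (this is
not the Lucene API - in real code you would build a SynonymMap and wrap
your TokenStream with the filter): known multi-word phrases in the
input are collapsed into a single token.

```java
import java.util.*;

// Stdlib emulation of the multi-word matching Lucene's SynonymFilter
// performs: at each position, greedily match the longest known phrase
// and emit it as one token; otherwise emit the single word unchanged.
class PhraseGrouper {
    public static List<String> tokenize(String text, Set<String> phrases) {
        String[] words = text.toLowerCase().split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int best = 0; // length of the longest phrase found at position i
            for (int len = words.length - i; len >= 2; len--) {
                String candidate = String.join(" ", Arrays.copyOfRange(words, i, i + len));
                if (phrases.contains(candidate)) { best = len; break; }
            }
            if (best > 0) {
                tokens.add(String.join(" ", Arrays.copyOfRange(words, i, i + best)));
                i += best; // skip past the matched phrase
            } else {
                tokens.add(words[i]);
                i++;
            }
        }
        return tokens;
    }
}
```

Given the phrase set {"international business machine"}, the input
"International Business Machine logo" comes out as exactly the two
tokens the original question asked for: the full phrase, then "logo".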





Re: Custom Tokenizer/Analyzer

2014-02-20 Thread Ahmet Arslan
Hi Geet,

I suggest you do this kind of transformation at query time only. Don't
interfere with the index. This way is more flexible: you can enable or
disable it on the fly and change your list without re-indexing.

Just an imaginary example: when the user passes the string "International
Business Machine logo", then this query can be generated:

PhraseQuery("International Business Machine") AND/OR TermQuery(logo)



I know this is Solr, but please see:
http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/
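
A hedged sketch of that query-time rewrite, with the query clauses
rendered as plain strings for illustration (real code would construct
Lucene PhraseQuery and TermQuery objects instead): an acronym known to
the map becomes a phrase clause, and every other word becomes a term
clause.

```java
import java.util.*;

// Sketch of a query-time rewrite: expand a known acronym into a phrase
// clause and keep the remaining words as term clauses. Queries are
// rendered as strings here; real code would build Lucene Query objects.
class QueryRewriter {
    public static String rewrite(String input, Map<String, String> acronymToPhrase) {
        List<String> clauses = new ArrayList<>();
        for (String word : input.split("\\s+")) {
            String phrase = acronymToPhrase.get(word.toUpperCase());
            if (phrase != null) {
                clauses.add("PhraseQuery(\"" + phrase + "\")"); // expanded acronym
            } else {
                clauses.add("TermQuery(" + word.toLowerCase() + ")");
            }
        }
        return String.join(" AND ", clauses);
    }
}
```

Because the mapping lives entirely on the query side, the acronym list
can be changed at any time without touching the index, which is the
flexibility argued for above.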




