Re: Custom Tokenizer/Analyzer
Hi Geet,

I suggest you do this kind of transformation at query time only. Don't interfere with the index. This way is more flexible: you can enable or disable it on the fly, and change your phrase list without re-indexing.

Just an imaginary example: when the user passes the string "International Businessmachine logo", this query can be generated:

PhraseQuery("International Business Machine") AND/OR TermQuery(logo)

I know this is Solr, but please see: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/

On Thursday, February 20, 2014 11:47 AM, Geet Gangwar wrote:

> Hi,
>
> I have a requirement to write a custom tokenizer using the Lucene framework.
> My requirement is that it should be able to match multiple words as one
> token. For example, when the user passes the string "International Business
> machine logo" or "IBM logo", it should return "International Business
> Machine" as one token and "logo" as one token.
>
> Please help me with how I can approach this ...
>
> Regards
> Geet

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
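To make the query-time idea above a bit more concrete, here is a minimal sketch in plain Java. It is not Lucene code: `QueryTimePhraseRewriter` and its methods are hypothetical names, and it only handles exact, case-insensitive matches against a known phrase list (so a run-together input like "Businessmachine" would not match). A multi-word clause in the result would become a PhraseQuery and a single word a TermQuery; building the actual Lucene query objects is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: groups known phrases in a user
// query at query time, so the index itself is never touched.
class QueryTimePhraseRewriter {
    private final List<List<String>> phrases = new ArrayList<>();

    QueryTimePhraseRewriter(List<String> knownPhrases) {
        for (String phrase : knownPhrases) {
            phrases.add(tokenize(phrase));
        }
    }

    // Assumption: lowercasing and whitespace splitting stand in for a real analyzer.
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Returns one clause per element: a quoted clause is a phrase,
    // an unquoted clause is a single term.
    List<String> rewrite(String userQuery) {
        List<String> tokens = tokenize(userQuery);
        List<String> clauses = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            int best = 0; // length of the longest known phrase starting at i
            for (List<String> phrase : phrases) {
                int n = phrase.size();
                if (n > best && i + n <= tokens.size()
                        && tokens.subList(i, i + n).equals(phrase)) {
                    best = n;
                }
            }
            if (best > 0) {
                clauses.add("\"" + String.join(" ", tokens.subList(i, i + best)) + "\"");
                i += best;
            } else {
                clauses.add(tokens.get(i));
                i++;
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        QueryTimePhraseRewriter rewriter = new QueryTimePhraseRewriter(
                Arrays.asList("International Business Machine"));
        System.out.println(rewriter.rewrite("International Business Machine logo"));
    }
}
```

Because the phrase list lives only in the rewriter, it can be swapped or switched off without re-indexing, which is the flexibility argued for above.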
Re: Custom Tokenizer/Analyzer
If you already know the set of phrases you need to detect, then you can use Lucene's SynonymFilter to spot them and insert a new token.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Feb 20, 2014 at 7:21 AM, Benson Margulies wrote:

> It sounds like you've been asked to implement Named Entity Recognition.
> OpenNLP has some capability here. There are also, um, commercial
> alternatives.
>
> On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote:
>
>> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote:
>>
>>> My requirement is that it should be able to match multiple words as
>>> one token. [...]
>>
>> This is an interesting problem. [...]
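As a rough illustration of what "spot known phrases and emit one token" means, here is a small plain-Java stand-in. To be clear, this is not SynonymFilter and does not use its API; `PhraseTokenMapper`, `add` and `tokenize` are hypothetical names. It maps each known surface form (the full phrase or the acronym) to one canonical token, trying the longest match first, which is the behaviour asked for in the original question.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a synonym-style filter: known word sequences are
// replaced by a single canonical token, everything else passes through.
class PhraseTokenMapper {
    private final Map<String, String> canonicalByPhrase = new HashMap<>();
    private int maxWords = 1; // longest known phrase, in words

    void add(String phrase, String canonicalToken) {
        String[] words = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        canonicalByPhrase.put(String.join(" ", words), canonicalToken);
        maxWords = Math.max(maxWords, words.length);
    }

    List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] words = trimmed.split("\\s+");
        int i = 0;
        while (i < words.length) {
            boolean matched = false;
            // Try the longest candidate first so longer phrases win.
            for (int len = Math.min(maxWords, words.length - i); len >= 1; len--) {
                StringBuilder key = new StringBuilder();
                for (int j = i; j < i + len; j++) {
                    if (j > i) {
                        key.append(' ');
                    }
                    key.append(words[j].toLowerCase(Locale.ROOT));
                }
                String canonical = canonicalByPhrase.get(key.toString());
                if (canonical != null) {
                    out.add(canonical);
                    i += len;
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                out.add(words[i]); // unknown word: keep it as its own token
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        PhraseTokenMapper mapper = new PhraseTokenMapper();
        mapper.add("International Business Machine", "International Business Machine");
        mapper.add("IBM", "International Business Machine");
        System.out.println(mapper.tokenize("IBM logo"));
    }
}
```

In a real analyzer chain the same mapping would be expressed as synonym rules fed to the filter; the sketch only shows the matching logic.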
Re: Custom Tokenizer/Analyzer
It sounds like you've been asked to implement Named Entity Recognition. OpenNLP has some capability here. There are also, um, commercial alternatives.

On Thu, Feb 20, 2014 at 6:24 AM, Yann-Erwan Perio wrote:

> On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote:
>
>> My requirement is that it should be able to match multiple words as
>> one token. [...]
>
> This is an interesting problem. [...]
Re: Custom Tokenizer/Analyzer
On Thu, Feb 20, 2014 at 10:46 AM, Geet Gangwar wrote:

Hi,

> My requirement is that it should be able to match multiple words as one
> token. For example, when the user passes the string "International Business
> machine logo" or "IBM logo", it should return "International Business
> Machine" as one token and "logo" as one token.

This is an interesting problem. I suppose that if the user enters "International Business Machines", possibly with some misspelling, you want to find all documents containing "IBM" - and that if he enters the string "IBM", you want to find documents which contain the string "International Business Machines", or even only parts of it. So this means you need some kind of map relating acronyms to their content parts. There really are two directions here: acronym to content and content to acronym.

One cannot find what an acronym means without some kind of acronym dictionary. This means that whatever approach you intend to use, there should be an external dictionary involved which maps each acronym to a list of possible phrases. Retrieving all phrases matching the input acronym, you'd inject each part of each phrase as a token (removing possible duplicates between phrase parts). That's basically it for the direction "acronym to content".

The direction "content to acronym" is trickier, I believe. One way is to generate a second (reversed) map, matching each acronym content part to a list of acronyms containing that part. You'd simply inject acronyms (and possibly other things) if one part of their content is matched (or more than one part, if you want to increase relevance). This could however require the definition of a specific hashing mechanism, if you want to find approximate (distanced) keys (e.g. "intenational", with the "r" missing, would still find "IBM").
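The two maps described so far could look roughly like the following plain-Java sketch. All names here (`AcronymDictionary`, `expand`, `acronymsFor`, `minParts`) are made up for illustration, matching is exact and case-insensitive on the phrase parts, and the approximate-key lookup for misspellings is deliberately not attempted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical dictionary holding both directions:
// acronym -> phrases, and phrase part -> acronyms (the reversed map).
class AcronymDictionary {
    private final Map<String, List<String>> phrasesByAcronym = new HashMap<>();
    private final Map<String, Set<String>> acronymsByPart = new HashMap<>();

    private static List<String> words(String text) {
        String t = text.trim().toLowerCase(Locale.ROOT);
        return t.isEmpty() ? Collections.<String>emptyList() : Arrays.asList(t.split("\\s+"));
    }

    void add(String acronym, String phrase) {
        phrasesByAcronym.computeIfAbsent(acronym, k -> new ArrayList<>()).add(phrase);
        for (String part : words(phrase)) {
            acronymsByPart.computeIfAbsent(part, k -> new LinkedHashSet<>()).add(acronym);
        }
    }

    // Acronym to content: every distinct part of every phrase, duplicates removed.
    Set<String> expand(String acronym) {
        Set<String> parts = new LinkedHashSet<>();
        for (String phrase : phrasesByAcronym.getOrDefault(acronym, Collections.emptyList())) {
            parts.addAll(words(phrase));
        }
        return parts;
    }

    // Content to acronym: acronyms for which at least minParts distinct
    // words of the text appear among their phrase parts.
    Set<String> acronymsFor(String text, int minParts) {
        Map<String, Integer> hits = new LinkedHashMap<>();
        for (String word : new LinkedHashSet<>(words(text))) {
            for (String acronym : acronymsByPart.getOrDefault(word, Collections.emptySet())) {
                hits.merge(acronym, 1, Integer::sum);
            }
        }
        Set<String> result = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> e : hits.entrySet()) {
            if (e.getValue() >= minParts) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        AcronymDictionary dict = new AcronymDictionary();
        dict.add("IBM", "International Business Machines");
        System.out.println(dict.expand("IBM"));
        System.out.println(dict.acronymsFor("international business", 2));
    }
}
```

Raising `minParts` is one simple way to get the "more than one part" behaviour: a single common word such as "business" then no longer injects the acronym on its own.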
A second way (more coupled to the concept of acronym, so less generic) could be to consider that every word starting with a capital letter is part of an acronym, buffering sequences of words starting with a capital letter, and eventually injecting the resulting acronym, if found in the acronym dictionary. This might not be safe, though - the user may not have the discipline to capitalize the words being part of an acronym (or may even misspell the first letter), or concatenated first letters could match an irrelevant acronym (many word sequences can give the acronym "IBM").

I do not know whether there already exists some Lucene module which processes acronyms, or if someone is working on one. It's definitely worth a search though, because writing a good one from scratch could mean a few days of work, or more.

HTH.
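For what it's worth, the capital-letter buffering idea can be sketched in a few lines of plain Java. The class name `CapitalizedRunAcronymizer` and the set of known acronyms are assumptions for illustration only. The sketch considers only the initials of a whole run of capitalized words (not sub-runs), and it shares the weaknesses mentioned above: lowercase input produces nothing, and unrelated word sequences with the same initials will match.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: buffer the initials of consecutive capitalized words
// and inject the acronym after the run if the dictionary knows it.
class CapitalizedRunAcronymizer {
    private final Set<String> knownAcronyms;

    CapitalizedRunAcronymizer(Set<String> knownAcronyms) {
        this.knownAcronyms = new HashSet<>(knownAcronyms);
    }

    List<String> process(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        StringBuilder initials = new StringBuilder();
        for (String word : trimmed.split("\\s+")) {
            if (Character.isUpperCase(word.charAt(0))) {
                initials.append(word.charAt(0)); // still inside a capitalized run
            } else {
                flush(initials, out);            // run ended before this word
            }
            out.add(word);
        }
        flush(initials, out); // a run may end at the end of the text
        return out;
    }

    // Injects the buffered acronym if it spans at least two words and is known.
    private void flush(StringBuilder initials, List<String> out) {
        if (initials.length() > 1 && knownAcronyms.contains(initials.toString())) {
            out.add(initials.toString());
        }
        initials.setLength(0);
    }

    public static void main(String[] args) {
        CapitalizedRunAcronymizer acronymizer =
                new CapitalizedRunAcronymizer(Collections.singleton("IBM"));
        System.out.println(acronymizer.process("the International Business Machines logo"));
    }
}
```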