Specialized Analyzer for names

2012-11-23 Thread Carsten Schnober
Hi,
I'm indexing names in a dedicated Lucene field and I wonder which
analyzer to use for that purpose. Typically, the names are in the format
"John Smith", so the WhitespaceAnalyzer is likely the best choice in most
cases. The field type to choose seems to be TextField.
Or would you rather recommend the KeywordAnalyzer? I'm a bit cautious
about that because it indexes the whole name as a single token, so
matching a surname would need wildcard or regex queries such as
*Smith or .*Smith, respectively.

However, there might also be special cases and spelling exceptions of
all kinds, e.g. "Smith, John", "John 'Hammer' Smith", "Abd al-Aziz",
"Stan van Hoop", and whatever else one could imagine. Is there a special
Analyzer that is optimized for dealing with such cases, or do I have to
normalize the names beforehand?
I see that such special characters and spellings can easily be covered
by the right queries, but that requires the user to know the exact
spelling, which is what I'm trying to spare her.
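
To make the "normalization beforehand" option concrete, here is a rough
sketch in plain Java of what such a pre-processing step might look like.
It is only an illustration under my own assumptions, not a Lucene API:
the class NameNormalizer and its rules (swap around a single comma, strip
quote characters, split on hyphens, drop accents, lowercase) are
hypothetical and would certainly need tuning for real data. Note that
stripping apostrophes also turns O'Brien into obrien, which may or may
not be wanted.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical helper, not part of Lucene: reduces a raw name to a
// lowercase, accent-free "given family" form before indexing.
final class NameNormalizer {

    static String normalize(String raw) {
        String s = raw.trim();

        // "Smith, John" -> "John Smith": swap the parts around a single comma.
        int comma = s.indexOf(',');
        if (comma >= 0 && comma == s.lastIndexOf(',')) {
            s = s.substring(comma + 1).trim() + " " + s.substring(0, comma).trim();
        }

        // Decompose accented letters and drop the combining marks.
        s = Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

        // Drop quote characters, e.g. around nicknames (assumption: they carry no meaning).
        s = s.replaceAll("['\"]", "");

        // Treat hyphens as token separators: "al-Aziz" -> "al Aziz".
        s = s.replace('-', ' ');

        // Collapse runs of whitespace and lowercase the result.
        return s.replaceAll("\\s+", " ").trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] samples = {
            "John Smith", "Smith, John", "John 'Hammer' Smith", "Abd al-Aziz", "Stan van Hoop"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + normalize(sample));
        }
    }
}
```

The same normalize() call would have to be applied to the user's query
string as well, otherwise indexed and queried forms would not line up.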

Best regards,
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789  | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Specialized Analyzer for names

2012-11-23 Thread Ian Lea
I'd use StandardAnalyzer or ClassicAnalyzer.  It also depends on how you
want to search.  You probably want a query for "John Smith" to match
"John Smith" and "Smith, John", but maybe not "John Brown" and "Sam
Smith".  The latter is a problem.  You can partially work around it by
using a BooleanQuery made up of a phrase query and/or a SpanNearQuery
with small slop and inOrder set to true, plus a general catch-all
clause, with boosts on the first two.
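
To show the ranking this combination is meant to produce, here is a toy
sketch in plain Java.  It does not use Lucene and does not reproduce
Lucene's scoring: the class NameMatchSketch, the weights 4/2/1 and the
slop of 1 are all made-up assumptions, chosen only to mimic "boosted
phrase clause + boosted in-order near clause + unboosted catch-all".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Toy illustration only: hypothetical class and weights, not Lucene scoring.
final class NameMatchSketch {

    // Lowercase and split on anything that is not a letter or digit.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // True if all query tokens occur in doc in the same order, with at most
    // `slop` other tokens in between (slop 0 behaves like an exact phrase).
    static boolean inOrderWithin(List<String> query, List<String> doc, int slop) {
        if (query.isEmpty()) return false;
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equals(query.get(0))) continue;
            int pos = start;
            boolean found = true;
            for (int k = 1; k < query.size() && found; k++) {
                int next = -1;
                for (int j = pos + 1; j < doc.size(); j++) {
                    if (doc.get(j).equals(query.get(k))) { next = j; break; }
                }
                if (next < 0) found = false; else pos = next;
            }
            int gaps = (pos - start) - (query.size() - 1);
            if (found && gaps <= slop) return true;
        }
        return false;
    }

    static double score(String queryText, String docText) {
        List<String> q = tokens(queryText);
        List<String> d = tokens(docText);
        double score = 0.0;
        if (inOrderWithin(q, d, 0)) score += 4.0;  // "phrase" clause, highest boost
        if (inOrderWithin(q, d, 1)) score += 2.0;  // "near, in order" clause, smaller boost
        for (String t : q) {
            if (d.contains(t)) score += 1.0;       // catch-all clause, no boost
        }
        return score;
    }

    public static void main(String[] args) {
        String[] names = {"John Smith", "John 'Hammer' Smith", "Smith, John", "John Brown", "Sam Smith"};
        for (String name : names) {
            System.out.println(name + " -> " + score("John Smith", name));
        }
    }
}
```

With these toy weights "Smith, John" still ranks above "John Brown" and
"Sam Smith" because both of its tokens hit the catch-all clause, but the
partial matches are not excluded, which is why this is only a partial
workaround.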

If this is real world data there will always be exceptions and problems.


--
Ian.

