Leaving certain tokens intact during indexing and search
I have documents containing tokens of a certain format in arbitrary positions, like this:

  ... blah blahblah AB/1234/5678 blah blah blahblah ...

I would like to enable usual keyword searching within these documents. In addition, I'd also like to enable users to find AB/1234/5678, ideally without a need to quote it as a phrase. And match highlighting should highlight this term just as other term matches would be highlighted. BTW, it's *not* necessary to find this document by searching for parts of that token, like ab, 1234 or 5678.

As I understand it, StandardTokenizerFactory considers the slash a word delimiter and thus removes it. Is there a Tokenizer available that allows me to skip tokenizing on slashes in this case, but only in this case? Or how could I create one myself? Do I extend StandardTokenizerFactory in my own Java class?

Thanks!

Marian
Re: Leaving certain tokens intact during indexing and search
There are about a zillion tokenizers; for what you're describing, WhitespaceTokenizerFactory is a good candidate. See http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters for a partial list, with links to the authoritative docs.

Best,
Erick
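[Editor's note: as a rough sketch of this suggestion (untested; the fieldType name text_ws is hypothetical and not from the thread), the schema.xml analyzer chain might look like:

```xml
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so AB/1234/5678 survives as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is that whitespace-only tokenization also leaves punctuation attached to adjacent tokens.]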
Re: Leaving certain tokens intact during indexing and search
Thanks for the quick response! Are you saying that I should extend WhitespaceTokenizerFactory to create my own, or should I simply use it? Because I guess tokenizing on spaces wouldn't be enough: I would need tokenizing on slashes in other positions, just not within strings matching ([A-Z]+/[0-9]+/[0-9]+).

Marian
Re: Leaving certain tokens intact during indexing and search
Well, it depends (tm). No, in your case WhitespaceTokenizer wouldn't work, although it did satisfy your initial statement. You could consider PatternTokenizerFactory, but take a look at the link I provided and follow it to the javadocs to see if there are better matches.

Best,
Erick
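[Editor's note: one way PatternTokenizerFactory could be applied here (an untested sketch; the exact pattern is only illustrative) is in "group" mode, where every full regex match becomes a token, listing the special format as an alternative ahead of a generic word pattern:

```xml
<analyzer>
  <!-- group="0" emits each full regex match as a token: the first
       alternative keeps AB/1234/5678-style strings intact, the second
       matches ordinary runs of word characters -->
  <tokenizer class="solr.PatternTokenizerFactory"
             pattern="[A-Z]+/[0-9]+/[0-9]+|\w+" group="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```
]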
RE: Leaving certain tokens intact during indexing and search
Hi Marian,

Extending the StandardTokenizer(Factory) Java class is not the way to go if you want to change its behavior. StandardTokenizer is generated from a JFlex (http://jflex.de/) specification, so you would need to modify the specification to include your special slash-containing-word rule, then regenerate the Java code, and then compile it.

It would be much simpler to use a PatternReplaceCharFilter (http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html) to convert the slashes into unusual (sequences of) characters that won't be broken up by the analyzer you're using, then add a PatternReplaceFilter to convert the unusual sequences back to slashes. E.g. if you used -blah- as the unusual sequence (note: people have also reported using a single character drawn from a script that would otherwise not be used in the text, e.g. a Chinese ideograph in English text), AB/1234/5678 would become AB-blah-1234-blah-5678.

Here's an (untested!) analyzer specification that would do this:

  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
                replacement="$1-blah-$2-blah-$3"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="-blah-" replacement="/" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

Steve
Re: Leaving certain tokens intact during indexing and search
That's pretty helpful, thanks! Especially since I hadn't understood so far that I could use a filter like PatternReplaceCharFilterFactory both as a charFilter and as a filter.

In the meantime I had figured out another alternative, involving WordDelimiterFilterFactory. But I had to use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which meant I had to use extra PatternReplaceCharFilterFactory filters to get rid of leading/trailing punctuation.

Again, thanks!

Marian
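[Editor's note: Marian's actual configuration wasn't posted; the alternative he describes might look roughly like the following untested sketch. The punctuation-stripping patterns and the WordDelimiterFilterFactory options are guesses for illustration only:

```xml
<analyzer>
  <!-- strip leading/trailing punctuation that whitespace tokenization
       would otherwise leave attached to tokens -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="(\s|^)\p{Punct}+" replacement="$1"/>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="\p{Punct}+(\s|$)" replacement="$1"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- keep AB/1234/5678 as a single token and suppress its sub-parts -->
  <filter class="solr.WordDelimiterFilterFactory"
          preserveOriginal="1" generateWordParts="0"
          generateNumberParts="0" catenateWords="0" catenateNumbers="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```
]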
RE: Leaving certain tokens intact during indexing and search
Note that my example does not actually use PatternReplaceCharFilterFactory twice; the second one is a PatternReplaceFilterFactory ("Char" isn't present in the second name). CharFilters operate before tokenizers, and regular filters operate after tokenizers.

Steve
Re: Leaving certain tokens intact during indexing and search
Your correction reached me right when Solr reported the error on restart :) Thanks!

Marian