Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
I have documents containing tokens of a certain format in arbitrary
positions, like this:

... blah blahblah AB/1234/5678 blah blah blahblah ...

I would like to enable usual keyword searching within these documents. In
addition, I'd also like to enable users to find AB/1234/5678, ideally
without a need to quote this as a phrase. And match highlighting should
highlight this term just as other term matches would be highlighted.

BTW, it's *not* necessary to find this document by searching for parts of
that token, like ab, 1234 or 5678.

As I understand, StandardTokenizerFactory considers the slash as a word
delimiter and thus removes it.

Is there a Tokenizer available that lets me skip tokenizing on
slashes in this case, but only in this case? Or how could I create one
myself? Do I extend StandardTokenizerFactory in my own Java class?

Thanks!

Marian


Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Erick Erickson
There are about a zillion tokenizers; for what you're describing,
WhitespaceTokenizerFactory is a good candidate.

See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
for a partial list, and it has links to the authoritative docs.
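
For example, something like this (untested; the field type name is just an
example) would keep AB/1234/5678 in one piece, since it only splits on
whitespace:

<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- WhitespaceTokenizer only breaks on whitespace, so AB/1234/5678 stays one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>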

Best
Erick

On Wed, Nov 30, 2011 at 9:23 AM, Marian Steinbach
marian.steinb...@gmail.com wrote:
 I have documents containing tokens of a certain format in arbitrary
 positions, like this:

    ... blah blahblah AB/1234/5678 blah blah blahblah ...

 I would like to enable usual keyword searching within these documents. In
 addition, I'd also like to enable users to find AB/1234/5678, ideally
 without a need to quote this as a phrase. And match highlighting should
 highlight this term just as other term matches would be highlighted.

 BTW, it's *not* necessary to find this document by searching for parts of
 that token, like ab, 1234 or 5678.

 As I understand, StandardTokenizerFactory considers the slash as a word
 delimiter and thus removes it.

 Is there a Tokenizer available that allows me to to skip tokenizing on
 slashes in this case, but only on this case? Or how could I create one
 myself? Do I extend StandardTokenizerFactory in my own Java class?

 Thanks!

 Marian


Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
Thanks for the quick response!

Are you saying that I should extend WhitespaceTokenizerFactory to create my
own? Or should I simply use it?

Because I guess tokenizing on spaces wouldn't be enough. I would still need
to tokenize on slashes in other positions, just not within strings matching
([A-Z]+/[0-9]+/[0-9]+).

Marian


2011/11/30 Erick Erickson erickerick...@gmail.com

 There's about a zillion tokenizers, for what you're describing
 WhitespaceTokenizerFactory is a good candidate.

 See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 for a partial list, and it has links to the authoritative docs.

 Best
 Erick




Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Erick Erickson
Well, it depends (tm). No, in your case WhitespaceTokenizer wouldn't work,
although it did satisfy your initial statement.

You could consider PatternTokenizerFactory, but take a look at the
link I provided, and follow it to the javadocs to see if there are
better matches.
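
For instance, something along these lines (completely untested, and the
pattern would need tuning to your data) emits every regex match as its own
token, so the slashed IDs stay whole while ordinary words still come through:

<analyzer>
  <!-- group="0": each match of the pattern becomes a token -->
  <tokenizer class="solr.PatternTokenizerFactory"
             pattern="[A-Z]+/[0-9]+/[0-9]+|\w+" group="0"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>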

Best
Erick

On Wed, Nov 30, 2011 at 9:41 AM, Marian Steinbach
marian.steinb...@gmail.com wrote:
 Thanks for the quick response!

 Are you saying that I should extend WhitespaceTokenizerFactory to create my
 own? Or should I simply use it?

 Because, I guess tokenizing on spaces wouldn't be enough. I would need
 tokenizing on slashes in other positions, just not within strings matching
 ([A-Z]+/[0-9]+/[0-9]+).

 Marian


 2011/11/30 Erick Erickson erickerick...@gmail.com

 There's about a zillion tokenizers, for what you're describing
 WhitespaceTokenizerFactory is a good candidate.

 See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
 for a partial list, and it has links to the authoritative docs.

 Best
 Erick




RE: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Steven A Rowe
Hi Marian,

Extending the StandardTokenizer(Factory) Java class is not the way to go if you
want to change its behavior.

StandardTokenizer is generated from a JFlex (http://jflex.de/) specification,
so you would need to modify the specification to include your special
slash-containing-word rule, then regenerate the Java code, and then compile it.

It would be much simpler to use a PatternReplaceCharFilter
(http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html)
to convert the slashes into unusual (sequences of) characters that won't be
broken up by the analyzer you're using, then add a PatternReplaceFilter to
convert the unusual sequences back to slashes.  E.g. if you used "-blah-" as
the unusual sequence (note: people have also reported using a single character
drawn from a script that would otherwise not be used in the text, e.g. a
Chinese ideograph in English text), AB/1234/5678 could become
AB-blah-1234-blah-5678.

Here's an (untested!) analyzer specification that would do this:

<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
              replacement="$1-blah-$2-blah-$3"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
          replacement="/" replace="all"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
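
If it helps, that analyzer would sit inside a fieldType in schema.xml and be
referenced by the field you search; the type and field names below are just
examples.  Since no separate query-time analyzer is given, the same chain runs
at index and query time, so a query for AB/1234/5678 gets the same treatment:

<fieldType name="text_slashed" class="solr.TextField" positionIncrementGap="100">
  <!-- the <analyzer> element from above goes here; it applies at both index and query time -->
</fieldType>

<field name="body" type="text_slashed" indexed="true" stored="true"/>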

Steve

 -Original Message-
 From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
 Sent: Wednesday, November 30, 2011 9:41 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Leaving certain tokens intact during indexing and search
 
 Thanks for the quick response!
 
 Are you saying that I should extend WhitespaceTokenizerFactory to create
 my
 own? Or should I simply use it?
 
 Because, I guess tokenizing on spaces wouldn't be enough. I would need
 tokenizing on slashes in other positions, just not within strings matching
 ([A-Z]+/[0-9]+/[0-9]+).
 
 Marian
 
 
 2011/11/30 Erick Erickson erickerick...@gmail.com
 
  There's about a zillion tokenizers, for what you're describing
  WhitespaceTokenizerFactory is a good candidate.
 
  See: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
  for a partial list, and it has links to the authoritative docs.
 
  Best
  Erick
 
 


Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
That's pretty helpful, thanks! Especially since I hadn't realized until now
that I could use a filter like PatternReplaceCharFilterFactory both as a
charFilter and as a filter.

In the meantime I had figured out another alternative,
involving WordDelimiterFilterFactory. But I had to
use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which
means that I had to use extra PatternReplaceCharFilterFactory filters to
get rid of leading/trailing punctuation.
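
For the record, a rough sketch of that alternative (the cleanup pattern and the
WordDelimiterFilterFactory flags below are illustrative, not my exact config):

<analyzer>
  <!-- strip punctuation glued to the end of tokens, since the whitespace tokenizer won't -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[,;:.!?]+(\s|$)" replacement="$1"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <!-- preserveOriginal keeps AB/1234/5678 intact alongside any split parts -->
  <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
          generateWordParts="1" generateNumberParts="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>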

Again, thanks!

Marian

2011/11/30 Steven A Rowe sar...@syr.edu

 Hi Marian,

 Extending the StandardTokenizer(Factory) java class is not the way to go
 if you want to change its behavior.

 StandardTokenizer is generated from a JFlex http://jflex.de/
 specification, so you would need to modify the specification to include
 your special slash-containing-word rule, then regenerate the java code, and
 then compile it.

 It would be much simpler to use a PatternReplaceCharFilter 
 http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceCharFilter.html
 to convert the slashes into unusual (sequences of) characters that won't be
 broken up by the analyzer you're using, then add a PatternReplaceFilter to
 convert the unusual sequences back to slashes.  E.g. if you used -blah-
 as the unusual sequence (note: people have also reported using a single
 character drawn from a script that would otherwise not be used in the text,
 e.g. a Chinese ideograph in English text), AB/1234/5678 could become
 AB-blah-1234-blah-5678.

 Here's an (untested!) analyzer specification that would do this:

 <analyzer>
   <charFilter class="solr.PatternReplaceCharFilterFactory"
               pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
               replacement="$1-blah-$2-blah-$3"/>
   <tokenizer class="solr.StandardTokenizerFactory"/>
   <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
           replacement="/" replace="all"/>
   <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>

 Steve




RE: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Steven A Rowe
Note that my example does not actually use PatternReplaceCharFilterFactory
twice - the second one is actually a PatternReplaceFilterFactory; "Char" isn't
present in the second one.

CharFilters operate before tokenizers, and regular filters operate after 
tokenizers.
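
So for your example token, the intended flow through the chain in my earlier
example is roughly:

  raw text:                    ... AB/1234/5678 ...
  after the charFilter:        ... AB-blah-1234-blah-5678 ...   (applied to the character stream, before tokenization)
  after the tokenizer:         [AB-blah-1234-blah-5678]         (assuming the unusual sequence comes through as one token, which is why you pick something the tokenizer won't break on)
  after PatternReplaceFilter:  [AB/1234/5678]
  after LowerCaseFilter:       [ab/1234/5678]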

Steve

 -Original Message-
 From: Marian Steinbach [mailto:marian.steinb...@gmail.com]
 Sent: Wednesday, November 30, 2011 10:44 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Leaving certain tokens intact during indexing and search
 
 That's pretty helpful, thanks! Especially since I didn't understand so far
 that I could use a filter like PatternReplaceCharFilterFactory both as a
 charFilter and as a filter.
 
 In the meantime I had figured out another alternative,
 involving WordDelimiterFilterFactory. But I had to
 use WhitespaceTokenizerFactory instead of StandardTokenizerFactory, which
 means that I had to use extra PatternReplaceCharFilterFactory filters to
 get rid of leading/trailing punctuation.
 
 Again, thanks!
 
 Marian
 
 2011/11/30 Steven A Rowe sar...@syr.edu
 
  Hi Marian,
 
  Extending the StandardTokenizer(Factory) java class is not the way to go
  if you want to change its behavior.
 
  StandardTokenizer is generated from a JFlex http://jflex.de/
  specification, so you would need to modify the specification to include
  your special slash-containing-word rule, then regenerate the java code,
 and
  then compile it.
 
  It would be much simpler to use a PatternReplaceCharFilter 
 
 http://lucene.apache.org/solr/api/org/apache/solr/analysis/PatternReplaceC
 harFilter.html
  to convert the slashes into unusual (sequences of) characters that won't
 be
  broken up by the analyzer you're using, then add a PatternReplaceFilter
 to
  convert the unusual sequences back to slashes.  E.g. if you used -blah-
 
  as the unusual sequence (note: people have also reported using a single
  character drawn from a script that would otherwise not be used in the
 text,
  e.g. a Chinese ideograph in English text), AB/1234/5678 could become
  AB-blah-1234-blah-5678.
 
  Here's an (untested!) analyzer specification that would do this:
 
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([A-Z]+)/([0-9]+)/([0-9]+)"
                replacement="$1-blah-$2-blah-$3"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="-blah-"
            replacement="/" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
 
  Steve
 
 


Re: Leaving certain tokens intact during indexing and search

2011-11-30 Thread Marian Steinbach
Your message got to me right as Solr reported the error on restart :) Thanks!

2011/11/30 Steven A Rowe sar...@syr.edu

 Note that my example does not actually use PatternReplaceCharFilterFactory
 twice - the second one is actually a PatternReplaceFilterFactory - note
 that Char isn't present in the second one.

 CharFilters operate before tokenizers, and regular filters operate after
 tokenizers.