PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour

2015-05-11 Thread Mihran Shahinian
I must be missing something obvious. I have a simple regex that removes
the space-hyphen-space pattern.

The unit test below works fine, but when I plug it into the schema and query,
the regex does not match, because the input has already been split on spaces
(further below). My understanding is that the charFilter operates on the raw
input string and then passes it to the whitespace tokenizer, which seems to be
the case, but I am not sure why I get an already-split token stream.

Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader,
                                                MockTokenizer.WHITESPACE,
                                                false);
        return new TokenStreamComponents(tokenizer, tokenizer);
    }

    // The char filter wraps the raw reader, so it runs before the tokenizer.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new PatternReplaceCharFilter(
            pattern("\\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\\s+"),
            " ",
            reader);
    }
};

final TokenStream tokens = analyzer.tokenStream("", new StringReader("a - b"));
tokens.reset();
final CharTermAttribute termAtt = tokens.addAttribute(CharTermAttribute.class);
while (tokens.incrementToken()) {
    System.out.println("=== " +
        new String(Arrays.copyOf(termAtt.buffer(), termAtt.length())));
}

I end up with:
=== a
=== b


Now I define the same in my schema:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           multiValued="true" autoGeneratePhraseQueries="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="\s+[\u002d,\u2011,\u2012,\u2013,\u2014,\u2212]\s+"
                replacement=" ;"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<field name="myfield" type="text" indexed="true" stored="false"
       multiValued="true"/>

When I query, the input reaches PatternReplaceCharFilter's processPattern
method already split (e.g. a, -, b), so the regex cannot match, even though
the charFilter is defined before the tokenizer:
CharSequence processPattern(CharSequence input) ...




Here is the query:
SolrQuery solrQuery = new SolrQuery("a - b");
solrQuery.setRequestHandler("/select");
solrQuery.set("defType", "edismax");
solrQuery.set("qf", "myfield");
solrQuery.set(CommonParams.ROWS, 0);
solrQuery.set(CommonParams.DEBUG, true);
solrQuery.set(CommonParams.DEBUG_QUERY, true);
QueryResponse response = solrSvr.query(solrQuery);

System.out.println("parsedQtoString " +
    response.getDebugMap().get("parsedquery_toString"));
System.out.println("parsedQ " +
    response.getDebugMap().get("parsedquery"));

Output is:
parsedQtoString +((myfield:a) (myfield:-) (myfield:b))
parsedQ (+(DisjunctionMaxQuery((myfield:a))
DisjunctionMaxQuery((myfield:-)) DisjunctionMaxQuery((myfield:b))))/no_coord


Re: PatternReplaceCharFilter + solr.WhitespaceTokenizerFactory behaviour

2015-05-11 Thread Erick Erickson
This trips up _everybody_ at one point or another. The problem is that
the input goes through query _parsing_ before it gets to the
field analysis, and the parser is sensitive to spaces.

Consider the input (without quotes) of "my dog". That gets broken up into
default_field:my default_field:dog
and only _then_ does the analysis chain, including your
PatternReplaceCharFilterFactory, get applied to the individual tokens.

So your query input needs to escape the spaces, as in
whatever\ -\ somethingelse
or perhaps quote the input, although the latter has
other implications.
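
To make the ordering concrete, here is a rough, Solr-free sketch of the effect using only java.util.regex. Everything in it is a stand-in and an assumption for illustration: analyze() imitates the char filter plus whitespace tokenizer, parseThenAnalyze() imitates a parser that splits on unescaped whitespace before analysis, and escapeWhitespace() is a hypothetical helper. None of these are Solr or Lucene APIs, and the real edismax parser does far more than this. The pattern is the one from the question with the commas left out, since inside a character class they would match a literal comma.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class SplitBeforeAnalysis {
    // Whitespace, one dash-like character, whitespace.
    static final Pattern DASH = Pattern.compile(
        "\\s+[\\u002d\\u2011\\u2012\\u2013\\u2014\\u2212]\\s+");

    // Stand-in for the analysis chain: char filter first, then whitespace split.
    static List<String> analyze(String input) {
        String filtered = DASH.matcher(input).replaceAll(" ");
        List<String> tokens = new ArrayList<>();
        for (String t : filtered.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Stand-in for the query parser: split on whitespace that is not preceded
    // by a backslash, drop the escapes, then analyze each chunk on its own.
    static List<String> parseThenAnalyze(String query) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : query.trim().split("(?<!\\\\)\\s+")) {
            String unescaped = chunk.replaceAll("\\\\(.)", "$1");
            tokens.addAll(analyze(unescaped));
        }
        return tokens;
    }

    // Hypothetical helper: put a backslash in front of every whitespace character.
    static String escapeWhitespace(String input) {
        return input.replaceAll("(\\s)", "\\\\$1");
    }

    public static void main(String[] args) {
        // Whole string reaches the char filter, as in the unit test.
        System.out.println(analyze("a - b"));                            // [a, b]
        // Parser splits first, so the pattern never sees " - ".
        System.out.println(parseThenAnalyze("a - b"));                   // [a, -, b]
        // Escaped spaces keep the input in one chunk.
        System.out.println(parseThenAnalyze(escapeWhitespace("a - b"))); // [a, b]
    }
}
```

On the client side, SolrJ's ClientUtils.escapeQueryChars escapes whitespace along with the other query-syntax characters (including the hyphen), which may be more escaping than you want here.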

Best,
Erick
