Hi all

This looks like an interesting and very solid project you have here with
Solr. However, I have run into a problem.

I have an index containing a large number of "phrases" that I need to search
over. Each of these phrases is fairly small, about 4 words long on average.

The search terms that I am given to search these phrases are very long and
quite arbitrary; sometimes they are up to 25 words long.

As a result, the performance of my index when built naively is erratic: some
searches are very fast, but on average they are somewhat slower.

I have attempted to improve this situation by shingling both the phrases and
the related search queries. In my schema I have the following:


    <fieldType name="bigramed_phrase" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
                outputUnigramIfNoNgram="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.ShingleFilterFactory" outputUnigrams="false"
                outputUnigramIfNoNgram="true"/>
      </analyzer>
    </fieldType>

In the index, as seen with Luke, I do indeed have a large range of shingled
terms.

When I run the analyser for either query or index terms, I also see the
breakdown with the shingled terms correctly displayed.

However, when I attempt to use this field in a query, I do not see the
shingled terms in the debug output. For example, with the query
"short red evil fox" I would expect to see the shingles
'short_red' 'red_evil' 'evil_fox'

but instead I get the following:

"debug":{
  "rawquerystring":"short red evil fox",
  "querystring":"short red evil fox",
  "parsedquery":"+() ()",
  "parsedquery_toString":"+() ()",
  "explain":{},
  "QParser":"DisMaxQParser",
  "altquerystring":null,
  "boostfuncs":null,
  "filter_queries":["atomId:(8235 100000914 100000911 )"],
  "parsed_filter_queries":["atomId:8235 atomId:100000914 atomId:100000911"],
  "timing":{ ......

Does anyone know what I could be doing wrong here? Is it a bug in the debug
output, a stupid mistake or misconception on my part, or something else?
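
For reference, here is a minimal Python sketch of the query-side shingling I
am expecting. It is not Solr code and only illustrates the idea: whitespace
tokenizing, lowercasing, bigrams only, with a fallback to the single token
when no bigram can be formed. The function name query_shingles and the "_"
separator are assumptions that follow my notation above, not anything taken
from Solr itself.

```python
def query_shingles(text, size=2, sep="_"):
    """Illustrative stand-in for the query analyzer in the schema above.

    Hypothetical helper, not part of Solr. The "_" separator is an
    assumption that mirrors the notation used in this message.
    """
    # Whitespace tokenizer followed by lowercasing.
    tokens = text.lower().split()
    # Word n-grams of exactly `size` tokens (bigrams by default).
    grams = [sep.join(tokens[i:i + size])
             for i in range(len(tokens) - size + 1)]
    # No unigrams are emitted unless no n-gram could be formed at all.
    return grams or tokens


print(query_shingles("short red evil fox"))
```

For the example query this gives the three bigrams listed above, and a
one-word query falls back to the word itself.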


Many thanks

-- Greg Bowyer

