Re: quick question

Reece Mon, 18 Feb 2008 07:49:12 -0800

Hello Everyone,

I'm having some issues getting Solr to work with our data.  I'm using
it to index incident data for our technical support department.  The
three main issues:


1) As an example, searching for "binarydata_groupdocument_fk" returns
nothing, while searching for "BinaryData_GroupDocument_FK" returns
results.  I have the LowerCaseFilterFactory applied to both the index
and query analyzers.  Does this not actually set everything to lower
case?  From the wiki at
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, it says
"Creates tokens by lowercasing all letters and dropping non-letters"
but that does not seem to be happening here.

2) Some of our records are a single sentence; others are over 5 MB of
text.  When searching for a term, the one-sentence records come back
first because the fieldNorm values are so different (0.4 for one,
0.002 for others).  Is there a way to leave the fieldNorm out of the
score calculation?  An alternative I tried was posting parts of the
data as separate values of the field (multiple tags with that field
name in the add XML post), but that had no effect on the results;
even the query debug output showed the exact same calculation for the
search.  Does anyone know how to disable the fieldNorm, or how to have
the score computed by adding the scores from each value of a
multivalued field?

3) I discovered that searching for '"certificate not found"' (the
double quotes make it a phrase query) did not return any results, even
though the phrase exists in the data (and was lower case originally
too, so this is unrelated to my first issue).  The cause turned out to
be the stopword "not", yet the same StopFilterFactory is applied to
both the index and query analyzers.  Am I doing something wrong there?
As a workaround I have PHP strip stopwords from the query string
manually, which is a real pain.
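
On issue 2, a minimal sketch of how norms might be switched off in
schema.xml, assuming the omitNorms attribute is available in the Solr
version in use.  The field name "body" is hypothetical, and a full
reindex would likely be needed before the scores change:

    <!-- Hypothetical field; omitNorms="true" is the only point of interest.
         It drops length normalization (and index-time boosts) for this field. -->
    <field name="body" type="text" indexed="true" stored="true"
           multiValued="true" omitNorms="true"/>

With norms omitted, a one-sentence record and a 5 MB record should no
longer differ in score because of field length alone.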

Here is the fieldType I run the actual searches against:

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory"
                synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
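
On issue 1, one possible explanation is that WordDelimiterFilterFactory
runs before LowerCaseFilterFactory in the chain above, so it may split
"BinaryData_GroupDocument_FK" at the case changes (Binary, Data, ...)
while the all-lowercase query is split only at the underscores.  Below
is an untested sketch of a variant that lowercases first, so that both
spellings should yield the same tokens.  The fieldType name
"text_lcfirst" is hypothetical, and the filter settings are simply
copied from the query analyzer above:

    <!-- Hypothetical variant: a single analyzer used for both index and query.
         LowerCaseFilterFactory is moved ahead of WordDelimiterFilterFactory. -->
    <fieldType name="text_lcfirst" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <!-- lowercase first, so no case transitions remain to split on -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

The trade-off is that lowercasing first gives up splitting on case
changes, so "BinaryData" would stay one token instead of becoming
"binary" and "data".  Existing documents would need to be reindexed
for the change to take effect.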

Any help or advice would be greatly appreciated, thanks!

-Reece
