RE: Questions about filters and scoring

Lance Norskog Mon, 18 Feb 2008 13:51:21 -0800

> 3)
But then would not 'certificate anystopword found' match your phrase?  I
wound up making a separate index without stopwords just so that my phrase
lookups would work. (I do not have the luxury of re-indexing, so now I'm
stuck with this design even if there is a better one.)
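
A rough sketch of the concern above, assuming a toy whitespace analyzer and a made-up stopword list (this is not Solr's or Lucene's actual code): once stopwords are dropped, a phrase query can only compare the positions of the remaining terms, so swapping one stopword for another still matches. The `keep_gaps` flag is a hypothetical switch that models whether a removed stopword leaves a hole in the position sequence.

```python
# Toy model of phrase matching after stopword removal.
# STOPWORDS, analyze() and phrase_match() are illustrative names only.
STOPWORDS = {"not", "a", "the", "is"}

def analyze(text, keep_gaps):
    """Return (term, position) pairs after lowercasing and stopword removal.

    keep_gaps=True leaves a hole where a stopword was removed;
    keep_gaps=False closes the hole up.
    """
    out = []
    pos = 0
    for word in text.lower().split():
        if word in STOPWORDS:
            if keep_gaps:
                pos += 1  # remember that a token used to be here
            continue
        out.append((word, pos))
        pos += 1
    return out

def phrase_match(doc, query, keep_gaps):
    """True if the query terms occur in doc at the same relative positions."""
    doc_terms = analyze(doc, keep_gaps)
    query_terms = analyze(query, keep_gaps)
    if not query_terms:
        return False
    doc_positions = set(doc_terms)
    first_term, first_pos = query_terms[0]
    for term, start in doc_terms:
        if term != first_term:
            continue
        if all((t, start + (p - first_pos)) in doc_positions
               for t, p in query_terms):
            return True
    return False

# Any stopword in the middle slot matches, with or without gaps.
print(phrase_match("certificate not found", "certificate is found", True))
print(phrase_match("certificate not found", "certificate is found", False))
# Without gaps, even a document lacking the stopword matches.
print(phrase_match("certificate found", "certificate not found", False))
print(phrase_match("certificate found", "certificate not found", True))
```

In this model the only way to tell "certificate not found" from "certificate is found" is to keep the stopwords in the index, which is what the separate index does.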


I also made one with the phonetic DoubleMetaphone analyzer. This is really
useful, especially for spell checking.

Cheers,

Lance 

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, February 18, 2008 1:43 PM
To: solr-user@lucene.apache.org; Reece
Subject: Re: Questions about filters and scoring

On Feb 18, 2008 3:56 PM, Reece <[EMAIL PROTECTED]> wrote:
> Hello Everyone,
>
> First off, sorry about the thread hijack earlier, it was not intentional.
>
> Back to the point though, I'm having some issues getting Solr to work 
> with our dataset.  I'm using it to index ticket data for our technical 
> support department.  Below are a few of the problems I've been having, 
> and the wiki hasn't had much to say about them.
>
> 1) As an example, searching for "binarydata_groupdocument_fk" returns 
> nothing, while searching for "BinaryData_GroupDocument_FK" returns 
> results.  I have the LowerCaseFilterFactory applied to both the index 
> and query analyzers.  Does this not actually set everything to lower 
> case?  From the wiki at 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, it says 
> "Creates tokens by lowercasing all letters and dropping non-letters"
> but that does not seem to be happening here.  Am I forgetting to 
> configure something?

Did you re-index?

> 2) Some of our data is one sentence.  Some is over 5 MB of text.  When 
> searching for a term, it's returning the one sentence data first 
> because the fieldNorm is so different (0.4 for one, 0.002 for others).
>  Is there a way to disable using the fieldnorm in the score 
> calculation?

It's probably Lucene's default length normalization over-emphasizing short
fields.
You could use a better similarity for your data, or turn off length
normalization by setting omitNorms="true" for that field in the schema and
then re-indexing (make sure to delete the old index entirely first).
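
To see why the short documents win, here is a small sketch assuming the classic Lucene default length normalization of 1/sqrt(number of terms in the field). The term counts are made-up values chosen to land near the fieldNorm figures quoted above. The real index stores norms in a single byte, so reported values are coarser than this.

```python
import math

def length_norm(num_terms: int) -> float:
    """Classic default length normalization: 1 / sqrt(number of terms)."""
    return 1.0 / math.sqrt(num_terms)

# Assumed term counts, for illustration only.
short_field = length_norm(6)         # a one-sentence ticket
long_field = length_norm(250_000)    # a multi-megabyte ticket body

print(round(short_field, 3))               # 0.408
print(round(long_field, 3))                # 0.002
print(round(short_field / long_field))     # 204

# With norms omitted, every document is treated as if its norm were 1.0,
# so field length no longer changes the score.
```

Under these assumptions the one-sentence field gets roughly a 200x multiplier over the large one for the same term match, which is consistent with the short documents ranking first.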

>  An alternative I tried was posting parts of the data in as different 
> values of the field (so having multiple tags of that field-name in the 
> add xml post), but that appeared to have zero effect on the results - 
> even the querydebugger showed the exact same calculation for the 
> search.  Does anyone know how to disable the fieldnorm, or have the 
> score created from adding the scores from each value of a multivalued 
> field?
>
> 3) I discovered that searching for '"certificate not found"' (using 
> the double quotes for a phrase here) did not return any results, even 
> though the phrase did exist (and was lower case originally too, so 
> different than my first issue).  I discovered it was because of the 
> stopword "not", but the same StopFilterFactory was applied to both the 
> index and query analyzers.  Am I doing something wrong there?  As a 
> workaround I'm having PHP manually remove stopwords from the 
> querystring, which is a real pain.  I'm thinking my filters aren't 
> being applied correctly since this is similar to issue #1 but with a 
> different filter.

Hmmm, looks like a recent change in Lucene probably causes this bug.
Could you open a new Solr JIRA issue to report this bug?

-Yonik
