> 3) But then would not 'certificate anystopword found' match your phrase? I wound up making a separate index without stopwords just so that my phrase lookups would work. (I do not have the luxury of re-indexing, so now I'm stuck with this design even if there is a better one.)
I also made one with the phonetic DoubleMetaphone analyzer. This is really useful, especially for spell checking.

Cheers,
Lance

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Yonik Seeley
Sent: Monday, February 18, 2008 1:43 PM
To: solr-user@lucene.apache.org; Reece
Subject: Re: Questions about filters and scoring

On Feb 18, 2008 3:56 PM, Reece <[EMAIL PROTECTED]> wrote:
> Hello Everyone,
>
> First off, sorry about the thread hijack earlier, it was not intentional.
>
> Back to the point though, I'm having some issues getting SOLR to work
> with our dataset. I'm using it to index ticket data for our technical
> support department. Below are a few of the problems I've been having,
> and the wiki hasn't had much to say about them.
>
> 1) As an example, searching for "binarydata_groupdocument_fk" returns
> nothing, while searching for "BinaryData_GroupDocument_FK" returns
> results. I have the lowercasefilterfactory applied to both the index
> and query analyzers. Does this not actually set everything to lower
> case? From the wiki at
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, it says
> "Creates tokens by lowercasing all letters and dropping non-letters"
> but that does not seem to be happening here. Am I forgetting to
> configure something?

Did you re-index?

> 2) Some of our data is one sentence. Some is over 5 MB of text. When
> searching for a term, it's returning the one sentence data first
> because the fieldNorm is so different (0.4 for one, 0.002 for others).
> Is there a way to disable using the fieldnorm in the score
> calculation?

It's probably Lucene's default length normalization over-emphasizing short fields. You could use a better similarity for your data, or turn off length normalization by setting omitNorms="true" for that field in the schema and then re-indexing (make sure to delete the old index entirely first).
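
To put the two fieldNorm values quoted above in perspective, here is a minimal sketch. It assumes Lucene's classic default length normalization of 1/sqrt(number of terms in the field) with no index-time boost. The function names are made up for illustration, and since Lucene stores norms in a single lossy byte, the values it reports are only approximate.

```python
import math


def length_norm(num_terms: int) -> float:
    """Illustrative only: classic length norm, 1 / sqrt(number of terms)."""
    if num_terms <= 0:
        raise ValueError("num_terms must be positive")
    return 1.0 / math.sqrt(num_terms)


def terms_for_norm(norm: float) -> int:
    """Invert the formula to estimate the term count behind a given norm."""
    return round(1.0 / (norm * norm))


# Hypothetical field sizes chosen to land near the norms quoted in the thread.
short_doc = length_norm(6)         # roughly a one-sentence ticket
long_doc = length_norm(250_000)    # roughly a very large ticket

# How much more weight the short field gets from length alone.
ratio = short_doc / long_doc
```

Under that assumption, a norm of 0.4 corresponds to about six terms and 0.002 to about 250,000, so the one-sentence document receives roughly 200 times the length factor of the large one. Setting omitNorms="true" removes this factor from the score.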

> An alternative I tried was posting parts of the data in as different
> values of the field (so having multiple tags of that field-name in the
> add xml post), but that appeared to have zero effect on the results -
> even the querydebugger showed the exact same calculation for the
> search. Does anyone know how to disable the fieldnorm, or have the
> score created from adding the scores from each value of a multivalued
> field?
>
> 3) I discovered that searching for '"certificate not found"' (using
> the double quotes for a phrase here) did not return any results, even
> though the phrase did exist (and was lower case originally too, so
> different than my first issue). I discovered it was because of the
> stopword "not", but the same stopfilterfactory was applied to both the
> index and query analyzers. Am I doing something wrong there? As a
> workaround I'm having php manually removing stopwords from the
> querystring, which is a real pain. I'm thinking my filters aren't
> being applied correctly since this is similar to issue #1 but with a
> different filter.

Hmmm, looks like a recent change in lucene probably causes this bug. Could you open a new Solr JIRA issue to report this bug?

-Yonik
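
The phrase behaviour discussed in point 3 and in Lance's reply can be illustrated with a toy model. This is not Lucene's or Solr's code: the stopword list, `analyze`, and `phrase_matches` are all hypothetical. It assumes that a removed stopword can either leave a gap in token positions or not, and that a phrase matches only when the relative positions line up. The thread does not confirm that this is the actual cause of the bug, so treat it as one way such a mismatch could arise.

```python
STOPWORDS = {"not", "the", "a", "is"}  # hypothetical stopword list


def analyze(text, keep_positions):
    """Lowercase, split on whitespace, drop stopwords.

    Returns (term, position) pairs. With keep_positions=True a removed
    stopword still consumes a position, leaving a gap.
    """
    tokens = []
    pos = 0
    for word in text.lower().split():
        if word in STOPWORDS:
            if keep_positions:
                pos += 1  # leave a hole where the stopword was
            continue
        tokens.append((word, pos))
        pos += 1
    return tokens


def phrase_matches(doc_tokens, query_tokens):
    """True if the query terms occur in the doc at the same relative offsets."""
    if not query_tokens:
        return False
    positions = {}
    for term, pos in doc_tokens:
        positions.setdefault(term, set()).add(pos)
    first_term, first_pos = query_tokens[0]
    for start in positions.get(first_term, ()):
        if all(start + (p - first_pos) in positions.get(t, ())
               for t, p in query_tokens):
            return True
    return False


doc = analyze("certificate not found", keep_positions=True)

# Index and query both keep the gap: the phrase matches.
same_settings = phrase_matches(doc, analyze("certificate not found", True))

# Query side drops the gap: "found" is expected one position too early.
mismatched = phrase_matches(doc, analyze("certificate not found", False))

# Lance's point: with gaps kept, any stopword in the middle matches.
any_stopword = phrase_matches(analyze("certificate the found", True),
                              analyze("certificate not found", True))
```

In this model the phrase fails only when the index side and the query side disagree about gaps. When both keep them, "certificate the found" matches the query "certificate not found" as well, which is the trade-off Lance describes avoiding with a separate index without stopwords.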