RE: Solr configuration issue for sorting on title field

EL KASMI Hicham Mon, 25 Jan 2010 07:48:52 -0800

I see! Thanks a lot Eric.

==========================================
Hicham El Kasmi
Université Libre de Bruxelles - Libraries
Av. F.D. Roosevelt 50, CP 180
1050 BRUSSELS Belgium


Tel: + 32 2 650 25 30
Fax: + 32 2 650 23 91
========================================== 


-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, 25 January 2010 16:43
To: solr-user@lucene.apache.org
Subject: Re: Solr configuration issue for sorting on title field

Well, stop doing that <G>... The error message is a bit misleading; it
was probably written with the more frequent mistake in mind, which is
tokenizing a field and then trying to sort on it.

The problem is that sorting on a field with more than one token per
document is indeterminate. In your case, having more than one title per
document is a different manifestation of the same underlying issue. Say
you have three values in a field, "a", "b" and "c". What does sorting on
that field mean? Just use the first token? The second? Any answer is
wrong, so it's best to fail loudly.

By extension, more than one value per document (which you're getting
from multiple titles for at least some documents) will sometimes
produce incorrect (or at least puzzling) results.

So storing exactly one title per document (probably in another field)
should cure this problem...
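
A minimal sketch of that suggestion, assuming a hypothetical field name
"titleSort" (it is not in the original schema) and assuming the indexing
client sends only the document's main title to it:

```xml
<!-- Existing searchable field from the thread; it may receive several
     titles per document (document, series, journal, ...). -->
<field name="title" type="text" indexed="true" stored="true"
       termVectors="true" />

<!-- Hypothetical sort-only field: untokenized "string" type and
     explicitly single-valued. The indexer must supply exactly one
     value per document, e.g. the document's own title. -->
<field name="titleSort" type="string" indexed="true" stored="false"
       multiValued="false" />

<!-- Deliberately no copyField from "title" to "titleSort": copying a
     source that carries several values would put several values into
     the sort field again. -->
```

Sorting would then be requested with sort=titleSort asc instead of
sorting on a field that carries several titles. The index needs to be
rebuilt after a schema change like this.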

HTH
Erick

On Mon, Jan 25, 2010 at 10:30 AM, EL KASMI Hicham <hicham.el.ka...@ulb.ac.be
> wrote:

> Hi Eric,
>
> Yes, we're indexing more than one title per document (the document's title,
> the title of the series or of the journal in which the document was
> published, etc.).
> Normally, each time we modify the Solr config, we drop the old index and
> create a new one.
>
> Thanks,
>
>
> -----Original Message-----
> From: Erik Hatcher [mailto:erik.hatc...@gmail.com]
> Sent: Monday, 25 January 2010 15:03
> To: solr-user@lucene.apache.org
> Subject: Re: Solr configuration issue for sorting on title field
>
> Are you sending in more than one title per document, by chance?
>
> Have you changed your configuration without reindexing the entire
> collection, possibly?
>
>        Erik
>
>
> On Jan 25, 2010, at 8:32 AM, EL KASMI Hicham wrote:
>
> > Thanks Otis,
> >
> > The long message was an attempt to be as detailed as possible, so
> > that one could understand our tests. I'm afraid the problem we
> > wanted to describe didn't really come through.
> >
> > These are the relevant entries from our config:
> >
> > ======================================================================
> > <field name="title" type="text" indexed="true" stored="true"
> > termVectors="true" />
> > <field name="titleStr" type="string" indexed="true" stored="false" />
> >
> > <copyField source="title" dest="titleStr" />
> > ======================================================================
> >
> > We want to sort on titleStr; but we end up with the error message:
> >
> > "HTTP Status 500 - there are more terms than documents in field
> > "titleStr", but it's impossible to sort on tokenized fields".
> >
> > We don't understand this message, or what is wrong in our config. We
> > tried several other configs, as described in my first message, but
> > with no positive result.
> >
> > Thanks for any clarification you can provide us.
> >
> > Hicham.
> >
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:otis_gospodne...@yahoo.com]
> > Sent: Friday, 22 January 2010 6:17
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr configuration issue for sorting on title field
> >
> > Hi,
> >
> > Long message.  I skimmed through your configs.  It looks like your
> > main question is how changing the field type (or, really,
> > turning off "multiValued" on a field) can cause the number of documents
> > in your index to decrease, right?  Well, it can't or shouldn't.  I am
> > guessing you simply did something wrong, like not indexing all docs, or
> > getting errors while indexing that you didn't notice, or some such.
> >
> >
> > If all you changed is a field's type, this alone should not cause
> > your index to have fewer documents.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
> >
> >
> >
> > ----- Original Message ----
> >> From: EL KASMI Hicham <hicham.el.ka...@ulb.ac.be>
> >> To: solr-user@lucene.apache.org
> >> Sent: Thu, January 21, 2010 5:14:22 AM
> >> Subject: Solr configuration issue for sorting on title field
> >>
> >> Hello again,
> >>
> >> We have a problem with sorting on the title field in the Solr instance
> >> of our production repository; we get the error message:
> >>
> >> "HTTP Status 500 - there are more terms than documents in field
> >> "titleStr", but it's impossible to sort on tokenized fields".
> >>
> >> After some googling and searching in this listserv, we found that a
> >> sorting field has to be untokenized, but our sorting field "titleStr",
> >> which is a copy of the "title" field, already has a string type.
> >>
> >> These are the configs we tried in our schema.xml file:
> >>
> >> 1st config
> >> ++++++++++
> >>
> >>
> >> sortMissingLast="true" omitNorms="true"/>
> >>
> >>
> >>
> >> positionIncrementGap="100">
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >> ===============================
> >>
> >>
> >> termVectors="true"/>
> >>
> >>
> >> =================================
> >>
> >>
> >>
> >>
> >> As you can see, the title field has the termVectors property set to
> >> true; we dropped it in the second attempt at our config.
> >>
> >> 2nd attempt
> >> +++++++++++
> >>
> >>
> >> positionIncrementGap="100">
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >> ===============================
> >>
> >>
> >>
> >>
> >> =================================
> >>
> >>
> >>
> >>
> >> 3rd attempt
> >> +++++++++++
> >> Create a new field type named 'text_exact' which doesn't use the
> >> "WhitespaceTokenizer" tokenizer but instead uses the
> >> "KeywordTokenizer"
> >> tokenizer.
> >>
> >>
> >>
> >> positionIncrementGap="100">
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >> sortMissingLast="true" omitNorms="true">
> >>
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="1"
> >> catenateNumbers="1" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >>
> >>
> >> version="icu4j" composed="false" remove_diacritics="true"
> >> remove_modifiers="true" fold="true"/>
> >>
> >>
> >> synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >>
> >> words="stopwords.txt"/>
> >>
> >> generateWordParts="1" generateNumberParts="1" catenateWords="0"
> >> catenateNumbers="0" catenateAll="0"/>
> >>
> >>
> >> protected="protwords.txt"/>
> >>
> >>
> >>
> >>
> >>
> >> ===============================
> >>
> >>
> >>
> >> stored="false"/>
> >>
> >> =================================
> >>
> >> No copyField.
> >>
> >>
> >> 4th attempt
> >> +++++++++++
> >> Same config as the 3rd attempt, but explicitly setting the
> >> 'multiValued' property of the titleStr field (text_exact type)
> >> to false.
> >>
> >>
> >> multiValued="false"/>
> >>
> >> For this last config, we noticed that the number of documents in our
> >> index dropped from ~22500 records to ~17800 records! We don't
> >> understand this behavior of Solr/Lucene.
> >>
> >>
> >> For all these configs, we got the same error message. Please note
> >> that we encounter this issue on our production server
> >> http://difusion.ulb.ac.be/vufind/Search/Home?lookfor=&sort=pubdate+desc&submitButton=Recherche&type=general&sort=title
> >> (with ~22500 records),
> >>
> >> whereas with the same config (the first one) on our test server
> >> http://bib17.ulb.ac.be/vufind/Search/Home?lookfor=&sort=pubdate+desc&submitButton=Find&type=general&sort=title
> >> (with ~57700 records), sorting on title works fine!
> >>
> >> Thanks in advance for any time you can spend looking into this.
> >>
> >> Best regards,
> >>
> >> Hicham El Kasmi
> >
> >
>
>
>
