Re: Enabling/disabling docValues

Gus Heck Tue, 11 Jun 2019 06:52:09 -0700

On Mon, Jun 10, 2019 at 10:53 PM John Davis <johndavis925...@gmail.com>
wrote:

> You have made many assumptions which might not always be realistic a)
> TextField is always tokenized

Well, you could of course change configuration or code to do something else,
but this would be a very odd and misleading thing to do, and we would expect
you to have mentioned it.

> b) Users care about precise counts and

This is indeed use case dependent if you are talking about approximately
correct (150 vs 152 etc), but it's pretty reasonable to say that gross
errors (75 vs 153 or 0 vs 5 etc) more or less make faceting pointless.
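
For what it's worth, here is a toy sketch of how gross those errors get,
modelled on the three behaviours Erick walks through in his quoted message
further down. This is plain Python, not Lucene or Solr code: `facet_counts`,
the mode names and the whitespace tokenizer are all made up for
illustration, and I'm not claiming any of the modes is what Lucene really
does.

```python
from collections import Counter


def tokenize(text):
    # Whitespace split stands in for a real analysis chain (assumption).
    return text.lower().split()


def facet_counts(docs, mode):
    """Count facet values over (text, has_doc_values) pairs.

    mode is a hypothetical switch:
      "docvalues_only": only docs indexed with docValues contribute
      "tokens_only":    every doc contributes its individual tokens
      "mixed":          docValues where present, tokens otherwise
    """
    counts = Counter()
    for text, has_doc_values in docs:
        use_doc_values = has_doc_values and mode != "tokens_only"
        if mode not in ("docvalues_only", "tokens_only", "mixed"):
            raise ValueError(f"unknown mode: {mode}")
        if use_doc_values:
            counts[text] += 1                  # whole value is one facet term
        elif mode != "docvalues_only":
            counts.update(set(tokenize(text)))  # one count per distinct token
    return dict(counts)


# doc1 was indexed before the schema change, doc2 after it.
docs = [("my dog has fleas", False), ("my cat has fleas", True)]
wanted = {"my dog has fleas": 1, "my cat has fleas": 1}

for mode in ("docvalues_only", "tokens_only", "mixed"):
    result = facet_counts(docs, mode)
    print(mode, result, "matches wanted:", result == wanted)
```

None of the three modes produces the `wanted` counts, which is the whole
point: there is no useful answer to pick until everything is re-indexed.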

> c) Users have the luxury or ability to do a full re-index anytime.

Being unable to re-index is a state of affairs we consistently advise
against. The reason we give that advice is precisely that one cannot safely
change the schema out from under an existing index without rewriting the
index. Without extremely careful design on your side (avoiding certain
features and accepting high storage requirements), your index will not
retain enough information to rebuild itself. Therefore, it is a
long-standing bad practice not to have a separate canonical copy of the data
and a means to re-index it (or a design where only the very most recent data
is important, and a copy of that). There is a whole page dedicated to
reindexing in the ref guide:
https://lucene.apache.org/solr/guide/8_0/reindexing.html
Here's a relevant bit from the current version:

`There is no process in Solr for programmatically reindexing data. When we
say "reindex", we mean, literally, "index it again". However you got the
data into the index the first time, you will run that process again. It is
strongly recommended that Solr users index their data in a repeatable,
consistent way, so that the process can be easily repeated when the need
for reindexing arises.`
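
A rough sketch of what "repeatable" can look like, combined with Erick's
suggestion (quoted below) to index into a new collection and then switch
with CREATEALIAS. Everything named here is an assumption for illustration:
the `index_batch` callable stands in for however you actually send
documents to Solr, and the base URL, alias and collection names are
invented. The only Solr-specific part is the Collections API CREATEALIAS
action with its `name` and `collections` parameters; check the ref guide
for your version before relying on it.

```python
from urllib.parse import urlencode


def reindex(source_records, index_batch, batch_size=500):
    """Replay the canonical copy of the data through the indexing step.

    source_records: any iterable over the canonical data (assumption).
    index_batch:    hypothetical callable that sends one list of docs to Solr.
    Returns the number of records sent.
    """
    batch, total = [], 0
    for record in source_records:
        batch.append(record)
        if len(batch) == batch_size:
            index_batch(batch)
            total += len(batch)
            batch = []          # start a fresh list, never reuse the sent one
    if batch:                   # flush the final partial batch
        index_batch(batch)
        total += len(batch)
    return total


def create_alias_url(base_url, alias, collection):
    """Build the Collections API URL that points `alias` at `collection`."""
    params = {"action": "CREATEALIAS", "name": alias, "collections": collection}
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Usage with made-up names: index into products_v2, then repoint the alias.
sent = []
count = reindex(range(5), sent.append, batch_size=2)
url = create_alias_url("http://localhost:8983/solr", "products", "products_v2")
print(count, url)
```

The sketch only builds the URL rather than calling it, so nothing happens
to a live cluster until you decide to issue the request yourself.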

The ref guide has lots of nice info; maybe you should read it rather than
snubbing one of the nicest and most knowledgeable committers on the project
(who is helping you for free) by haughtily saying you'll go ask someone
else... And if you've been left in this situation (no ability to reindex)
by your predecessor, you have our deepest sympathies, but it doesn't change
the fact that you need to break it to management that your predecessor
has lost the data required to maintain the system, and that you still need
to re-index whatever you can salvage somehow, or start fresh.

When Erick says you shouldn't be asking that question, more than 90% of the
time you really shouldn't be, and if you do pursue it, you'll just waste a
lot of your own time.

> On Mon, Jun 10, 2019 at 10:55 AM Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > bq. Does lucene look at %docs in each state, or the first doc or
> > something else?
> >
> > Frankly I don’t care since no matter what, the results of faceting mixed
> > definitions are not useful.
> >
> > tl;dr;
> >
> > ‘When I use a word,’ Humpty Dumpty said in rather a scornful tone, ‘it
> > means just what I choose it to mean — neither more nor less.’
> >
> > So “undefined” in this case means “I don’t see any value at all in
> > chasing that info down” ;).
> >
> > Changing from regular text to SortableText means that the results will be
> > inaccurate no matter what. For example, I have a doc with the value “my
> > dog has fleas”. When NOT using SortableText, there are multiple tokens so
> > facet counts would be:
> >
> > my (1)
> > dog (1)
> > has (1)
> > fleas (1)
> >
> > But for SortableText they will be:
> >
> > my dog has fleas (1)
> >
> > Consider doc1 with “my dog has fleas” and doc2 with “my cat has fleas”.
> > doc1 was indexed before switching to SortableText and doc2 after.
> > Presumably the output you want is:
> >
> > my dog has fleas (1)
> > my cat has fleas (1)
> >
> > But you can’t get that output. There are three cases:
> >
> > 1> Lucene treats all documents as SortableText, faceting on the docValues
> > parts. No facets on doc1
> >
> > my cat has fleas (1)
> >
> > 2> Lucene treats all documents as tokenized, faceting on each individual
> > token. Faceting is performed on the tokenized content of both, docValues
> > in doc2 ignored
> >
> > my  (2)
> > dog (1)
> > has (2)
> > fleas (2)
> > cat (1)
> >
> >
> > 3> Lucene does the best it can, faceting on the tokens for docs without
> > SortableText and docValues if the doc was indexed with SortableText.
> > doc1 faceted on tokenized, doc2 on docValues
> >
> > my  (1)
> > dog (1)
> > has (1)
> > fleas (1)
> > my cat has fleas (1)
> >
> > Since none of those cases is what I want, there’s no point I can see in
> > chasing down what actually happens….
> >
> > Best,
> > Erick
> >
> > P.S. I _think_ Lucene tries to use the definition from the first segment,
> > but since the lists of segments to be merged are built without looking at
> > the field definitions at all, whether the first segment in the list has
> > SortableText or not will not be predictable in a general way, even within
> > a single run.
> >
> >
> > > On Jun 9, 2019, at 6:53 PM, John Davis <johndavis925...@gmail.com>
> > > wrote:
> > >
> > > Understood, however code is rarely random/undefined. Does lucene look
> > > at % docs in each state, or the first doc or something else?
> > >
> > > On Sun, Jun 9, 2019 at 1:58 PM Erick Erickson <erickerick...@gmail.com>
> > > wrote:
> > >
> > >> It’s basically undefined. When segments are merged that have
> > >> dissimilar definitions like this what can Lucene do? Consider:
> > >>
> > >> Faceting on a text (not sortable) means that each individual token in
> > >> the index is uninverted on the Java heap and the facets are computed
> > >> for each individual term.
> > >>
> > >> Faceting on a SortableText field just has a single term per document,
> > >> and that lives in the docValues structures as opposed to the inverted
> > >> index.
> > >>
> > >> Now you change the value and start indexing. At some point a segment
> > >> containing no docValues is merged with a segment containing docValues
> > >> for the field. The resulting mixed segment is in this state. If you
> > >> facet on the field, should the docs without docValues have each
> > >> individual term counted? Or just the SortableText values in the
> > >> docValues structure? Neither one is right.
> > >>
> > >> Also remember that Lucene has no notion of schema. That’s entirely
> > >> imposed on Lucene by Solr carefully constructing low-level analysis
> > >> chains.
> > >>
> > >> So I’d _strongly_ recommend you re-index your corpus to a new
> > >> collection with the current definition, then perhaps use CREATEALIAS
> > >> to seamlessly switch.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >>> On Jun 9, 2019, at 12:50 PM, John Davis <johndavis925...@gmail.com>
> > >>> wrote:
> > >>>
> > >>> Hi there,
> > >>> We recently changed a field from TextField + no docValues to
> > >>> SortableTextField which has docValues enabled by default. Once I did
> > >>> this I do not see any facet values for the field. I know that once all
> > >>> the docs are re-indexed facets should work again, however can someone
> > >>> clarify the current logic of lucene/solr how facets will be computed
> > >>> when schema is changed from no docValues to docValues and vice-versa?
> > >>>
> > >>> 1. Until ALL the docs are re-indexed, no facets will be returned?
> > >>> 2. Once certain fraction of docs are re-indexed, those facets will be
> > >>> returned?
> > >>> 3. Something else?
> > >>>
> > >>>
> > >>> Varun
> > >>
> > >>
> >
> >
>

-- 
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
