The bottom line for me is that I am not going to upgrade to Lucene 8 for a while.
The index migration would either cause a service interruption or take a while to implement. I have more urgent technical debt to deal with.

> On 21 Jun 2019, at 19:11, David Allouche <da...@allouche.net> wrote:
>
> Unfortunately, I cannot assume SolrCloud, because our software predates Solr.
>
> So I would either need to switch to Solr or reimplement a work-around for the
> lack of index migration. I am reluctant to switch to Solr because it
> increases the operational complexity.
>
> I understand the argument: if the algorithm fₙ() used to derive index data iₙ
> from the raw data rₙ changes [iₙ=fₙ(rₙ)], the index data iₙ₊₁ may not be
> derivable from iₙ [∃n∄g \ iₙ₊₁=g(iₙ)].
>
> On the application level, one could store non-tokenized content (I guess
> that's why ElasticSearch has .raw fields), then traverse the index. I already
> have index traversal code that I use for garbage collection of old entries.
> The non-tokenized content could be used to build a new index, and the
> progress of the conversion could be recorded as the index into
> LeafReader.getLiveDocs().
>
> https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--
>
> Alternatively, since I do not have all the non-tokenized content in the index
> now, I could use the external document id to retrieve the original document
> text.
>
> Is there a convenient place to store the getLiveDocs index across process
> interruptions? Or should I use something stupid like a file to store the
> counter?
>
> That is still a lot of hassle, but I understand how it makes sense for Lucene
> to consider that index migration should be handled up the stack.
>
>
>> On 21 Jun 2019, at 18:06, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Assuming SolrCloud, reindex from scratch into a new collection then use
>> collection aliasing when you are ready to switch. You don’t need to stop
>> your clients when you use CREATEALIAS.
>>
>> Prior to writing the marker, Lucene would appear to work with older indexes,
>> but there would be subtle errors because the information needed to score
>> docs just wasn’t there.
>>
>> Here are two quotes from people who know that crystalized the problem Lucene
>> faces for me:
>>
>> From Robert Muir:
>>
>> “I think the key issue here is Lucene is an index not a database. Because it
>> is a lossy index and does not retain all of the user's data, its not
>> possible to safely migrate some things automagically. In the norms case
>> IndexWriter needs to re-analyze the text ("re-index") and compute stats to
>> get back the value, so it can be re-encoded. The function is y = f(x) and if
>> x is not available its not possible, so lucene can't do it.”
>>
>> From Mike McCandless:
>>
>> “This really is the difference between an index and a database: we do not
>> store, precisely, the original documents. We store an efficient
>> derived/computed index from them. Yes, Solr/ES can add database-like
>> behavior where they hold the true original source of the document and use
>> that to rebuild Lucene indices over time. But Lucene really is just a
>> "search index" and we need to be free to make important improvements with
>> time.”
>>
>> Best,
>> Erick
>>
>>> On Jun 21, 2019, at 7:10 AM, David Allouche <da...@allouche.net> wrote:
>>>
>>> Wow. That is annoying. What is the reason for this?
>>>
>>> I assumed there was a smooth upgrade path, but apparently, by design, one
>>> has to rebuild the index at least once every two major releases.
>>>
>>> So, my question becomes, what is the recommended way of dealing with
>>> reindex-from-scratch without service interruption?
>>>
>>> So I guess the upgrade path looks something like:
>>> - Create Lucene6 index
>>> - Update Lucene6 index
>>> - Create Lucene7 index
>>> - Separately keep track of which documents are indexed in Lucene7 and
>>>   Lucene6 indexes
>>> - Make updates to Lucene6 index, concurrently build Lucene7 index from
>>>   scratch, use Lucene6 index for search.
>>> - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7
>>>   index for search.
>>>
>>> Rinse and repeat every major version.
>>>
>>> Really, isn't there something simpler already to handle Lucene major
>>> version upgrades?
>>>
>>>
>>>> On 17 Jun 2019, at 18:04, Erick Erickson <erickerick...@gmail.com> wrote:
>>>>
>>>> Let’s back up a bit. What version of Lucene are you using? Starting with
>>>> Lucene 8, any index that’s ever been touched by Lucene 6 will not open. It
>>>> does not matter if the index has been completely rewritten. It does not
>>>> matter if it’s been run through IndexUpgraderTool, which just does a
>>>> forceMerge to 1 segment. A marker is preserved when a segment is created,
>>>> and the earliest one is preserved across merges. So say you have two
>>>> segments, one created with 6 and one with 7. The Lucene 6 marker is
>>>> preserved when they are merged.
>>>>
>>>> Now, if any segment has the Lucene 6 marker, the index will not be opened
>>>> by Lucene.
>>>>
>>>> If you’re using Lucene 7, then this error implies that one or more of your
>>>> segments was created with Lucene 5 or earlier.
>>>>
>>>> So you probably need to re-index from scratch on whatever version of
>>>> Lucene you want to use.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>>
>>>>
>>>>> On Jun 17, 2019, at 8:41 AM, David Allouche <da...@allouche.net> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I use Lucene with PyLucene on a public-facing web application. We have a
>>>>> moderately large index (~24M documents, ~11GB index data), with a
>>>>> constant stream of new documents.
>>>>>
>>>>> I recently upgraded to PyLucene 7.
>>>>>
>>>>> When trying to test the new release of PyLucene 8, I encountered an
>>>>> IndexFormatTooOld error because my index conversion from Lucene6 to
>>>>> Lucene7 was not complete.
>>>>>
>>>>> I found IndexUpgrader, and I had a look at its implementation. I would
>>>>> very much like to avoid taking down the service during the index
>>>>> upgrade, so I believe I cannot use IndexUpgrader because I need the write
>>>>> lock to be held by the web application to index new documents.
>>>>>
>>>>> So I figure I could get the desired result with an
>>>>> IndexWriter.forceMerge(1). But the documentation says "This is a horribly
>>>>> costly operation, especially when you pass a small maxNumSegments;
>>>>> usually you should only call this if the index is static (will no longer
>>>>> be changed)."
>>>>> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-
>>>>>
>>>>> And indeed, forceMerge tends to be killed by the kernel OOM killer on my
>>>>> development VM. I want to avoid this failure mode in production. I could
>>>>> increase the VM's memory until it works, but I would rather have a less
>>>>> brutal approach to upgrading a live index. Something that could run in the
>>>>> background with reasonable amounts of anonymous memory.
>>>>>
>>>>> What is the recommended approach to upgrading a live index?
>>>>>
>>>>> How can I know from the code that the index needs upgrading at all? I
>>>>> could add a manual knob to start an upgrade, but it would be better if it
>>>>> occurred transparently when I upgrade PyLucene.
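
On the question of knowing from the code whether the index needs upgrading: going by the rule described in this thread (Lucene 8 refuses any index first created with Lucene 6, and Lucene 7 refuses anything created with Lucene 5 or earlier), the check reduces to comparing two major version numbers. Below is a minimal sketch of that comparison only. The function name needs_reindex is hypothetical, and how the creation major is obtained is an assumption: as far as I know, SegmentInfos.readLatestCommit(directory).getIndexCreatedVersionMajor() reports it (returning 6 for anything older than 7), but that call is not mentioned in this thread and should be checked against the Lucene javadocs.

```python
def needs_reindex(index_created_major, library_major):
    """Return True if the index must be rebuilt from scratch.

    Assumption taken from the thread: a Lucene release opens an index only
    if it was first created by the same major version or the one before it.
    Rewriting or force-merging the index does not change the creation marker.
    """
    return index_created_major < library_major - 1


# Hypothetical usage: the two numbers would come from the index metadata
# and from the Lucene library actually loaded by the application.
if needs_reindex(index_created_major=6, library_major=8):
    print("index was created too long ago: rebuild it from the source data")
```

Since the marker survives merges, this check says nothing about the state of individual segments; it only answers whether a full rebuild is unavoidable.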
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
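
On the question raised earlier in the thread about where to keep the conversion progress across process interruptions: a plain file is not a stupid choice, provided it is written atomically so that a crash never leaves a half-written counter. Below is a minimal sketch using only the Python standard library. The names save_checkpoint and load_checkpoint and the JSON layout are hypothetical, not part of Lucene or PyLucene. One caveat to keep in mind: Lucene's internal document numbers can change when segments are merged, so storing the last processed external document id is likely more robust than storing a position in getLiveDocs().

```python
import json
import os
import tempfile


def save_checkpoint(path, position):
    """Atomically record the conversion progress in a small JSON file.

    The data is written to a temporary file in the same directory, flushed
    to disk, and then renamed over the target, so a reader sees either the
    old checkpoint or the new one, never a partial write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump({"position": position}, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic rename on the same filesystem
    except BaseException:
        # Do not leave the temporary file behind if anything failed.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the last saved position, or None if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)["position"]
    except FileNotFoundError:
        return None
```

A conversion loop would call load_checkpoint once at startup, skip everything up to the returned position, and call save_checkpoint every few thousand documents rather than after each one, trading a little repeated work after a crash for far fewer fsync calls.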