The bottom line for me is that I am not going to upgrade to Lucene8 for a 
while.

The index migration would either cause a service interruption or would require 
a little while to implement.

I have more urgent technical debt to deal with.

> On 21 Jun 2019, at 19:11, David Allouche <da...@allouche.net> wrote:
> 
> Unfortunately, I cannot assume SolrCloud, because our software predates Solr.
> 
> So I would either need to switch to Solr or reimplement a work-around for the 
> lack of index migration. I am reluctant to switch to Solr because it 
> increases the operational complexity.
> 
> I understand the argument: if the algorithm fₙ() used to derive index data iₙ 
> from the raw data rₙ changes [iₙ=fₙ(rₙ)], the index data iₙ₊₁ may not be 
> derivable from iₙ [∃n ∄g : iₙ₊₁=g(iₙ)].
> 
> On the application level, one could store non-tokenized content (I guess 
> that's why ElasticSearch has .raw fields), traverse the index, and use the 
> non-tokenized content to build a new index. I already have index traversal 
> code that I use for garbage collection of old entries. The progress of the 
> conversion could then be recorded as the index into LeafReader.getLiveDocs().
> 
> https://lucene.apache.org/core/8_1_1/core/org/apache/lucene/index/LeafReader.html#getLiveDocs--
> 
> Alternatively, since I do not have all the non-tokenized content in the index 
> now, I could use the external document id to retrieve the original document 
> text.
> 
> Is there a convenient place to store the getLiveDocs index across process 
> interruptions? Or should I use something stupid like a file to store the 
> counter?
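
A plain file can be made safe against interruptions by writing the new value to a temporary file and renaming it over the old one. The following is a minimal sketch of that idea in Python, not something Lucene provides: the names `save_checkpoint`, `load_checkpoint` and the `last_external_id` key are hypothetical. It records an external document id rather than a raw position, on the assumption that positions in getLiveDocs() are per-segment and may shift when segments are merged.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Atomically replace the checkpoint file at `path` with `state` (a dict)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".checkpoint-")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        # os.replace swaps the file in one step: readers see either the old
        # checkpoint or the new one, never a half-written file.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint has been written yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None
```

A conversion loop would call `load_checkpoint` once at startup to find where to resume, then `save_checkpoint` every few thousand documents, for example with `{"last_external_id": "doc-42"}`.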
> 
> That is still a lot of hassle, but I understand how it makes sense for Lucene 
> to consider that index migration should be handled up the stack.
> 
> 
>> On 21 Jun 2019, at 18:06, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> Assuming SolrCloud, reindex from scratch into a new collection then use 
>> collection aliasing when you are ready to switch. You don’t need to stop 
>> your clients when you use CREATEALIAS.
>> 
>> Prior to writing the marker, Lucene would appear to work with older indexes, 
>> but there would be subtle errors because the information needed to score 
>> docs just wasn’t there.
>> 
>> Here are two quotes from people who know that crystallized the problem Lucene 
>> faces for me:
>> 
>> From Robert Muir: 
>> 
>> “I think the key issue here is Lucene is an index not a database. Because it 
>> is a lossy index and does not retain all of the user's data, its not 
>> possible to safely migrate some things automagically. In the norms case 
>> IndexWriter needs to re-analyze the text ("re-index") and compute stats to 
>> get back the value, so it can be re-encoded. The function is y = f(x) and if 
>> x is not available its not possible, so lucene can't do it.”
>> 
>> From Mike McCandless:
>> 
>> “This really is the difference between an index and a database: we do not 
>> store, precisely, the original documents.  We store an efficient 
>> derived/computed index from them.  Yes, Solr/ES can add database-like 
>> behavior where they hold the true original source of the document and use 
>> that to rebuild Lucene indices over time.  But Lucene really is just a 
>> "search index" and we need to be free to make important improvements with 
>> time.”
>> 
>> Best,
>> Erick
>> 
>>> On Jun 21, 2019, at 7:10 AM, David Allouche <da...@allouche.net> wrote:
>>> 
>>> Wow. That is annoying. What is the reason for this?
>>> 
>>> I assumed there was a smooth upgrade path, but apparently, by design, one 
>>> has to rebuild the index at least once every two major releases.
>>> 
>>> So, my question becomes, what is the recommended way of dealing with 
>>> reindex-from-scratch without service interruption? 
>>> 
>>> So I guess the upgrade path looks something like:
>>> - Create Lucene6 index
>>> - Update Lucene6 index
>>> - Create Lucene7 index
>>> - Separately keep track of which documents are indexed in Lucene7 and 
>>> Lucene6 indexes
>>> - Make updates to Lucene6 index, concurrently build Lucene7 index from 
>>> scratch, use Lucene6 index for search.
>>> - When Lucene7 index is fully built, remove Lucene6 index and use Lucene7 
>>> index for search.
>>> 
>>> Rinse and repeat every major version.
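
The bookkeeping in the steps above can be sketched without any Lucene code. The example below is only an illustration under loud assumptions: plain dicts stand in for the two indexes, the class `DualIndexMigration` and all its methods are hypothetical, and it writes each incoming update to both indexes, which is one possible way to keep track of what the new index already contains. Deletions and concurrency are left out.

```python
class DualIndexMigration:
    """Toy model: dicts of {doc_id: text} stand in for the old and new index."""

    def __init__(self, old_index, new_index):
        self.old = old_index      # keeps serving searches during the rebuild
        self.new = new_index      # built from scratch in the background
        self.active = old_index   # the index searches currently go to

    def update(self, doc_id, text):
        # New or changed documents go to both indexes, so the new index
        # never falls behind on documents that arrive during the rebuild.
        self.old[doc_id] = text
        self.new[doc_id] = text

    def backfill(self, source_docs):
        # Re-index documents from the original source. setdefault keeps a
        # fresher version that update() may already have written.
        for doc_id, text in source_docs:
            self.new.setdefault(doc_id, text)

    def is_complete(self, all_doc_ids):
        return all(doc_id in self.new for doc_id in all_doc_ids)

    def switch(self):
        # Point searches at the new index; the old one can then be removed.
        self.active = self.new

    def search(self, term):
        return [d for d, text in self.active.items() if term in text.split()]
```

Searches only ever read `active`, so the switch is a single reference change and clients do not need to stop.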
>>> 
>>> Really, isn't there something simpler already to handle Lucene major 
>>> version upgrades?
>>> 
>>> 
>>>> On 17 Jun 2019, at 18:04, Erick Erickson <erickerick...@gmail.com> wrote:
>>>> 
>>>> Let’s back up a bit. What version of Lucene are you using? Starting with 
>>>> Lucene 8, any index that’s ever been touched by Lucene 6 will not open. It 
>>>> does not matter if the index has been completely rewritten. It does not 
>>>> matter if it’s been run through IndexUpgrader, which just does a 
>>>> forceMerge to 1 segment. A marker is preserved when a segment is created, 
>>>> and the earliest one is preserved across merges. So say you have two 
>>>> segments, one created with 6 and one with 7. The Lucene 6 marker is 
>>>> preserved when they are merged.
>>>> 
>>>> Now, if any segment has the Lucene 6 marker, the index will not be opened 
>>>> by Lucene.
>>>> 
>>>> If you’re using Lucene 7, then this error implies that one or more of your 
>>>> segments was created with Lucene 5 or earlier.
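
The marker rule described here can be written down as a toy model. This is not Lucene's code: `Segment`, `merge` and `can_open` are hypothetical names, and the "one major version back" limit is inferred only from the two cases mentioned in this message (Lucene 8 rejecting a Lucene 6 marker, Lucene 7 rejecting Lucene 5 or earlier).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    min_major: int  # oldest Lucene major version that ever contributed data


def merge(name, segments):
    # The earliest marker survives the merge, however the data is rewritten.
    return Segment(name, min(s.min_major for s in segments))


def can_open(reader_major, segments):
    # Assumed rule: a reader accepts markers from its own major version and
    # the one before it, and refuses the index if any segment is older.
    return all(s.min_major >= reader_major - 1 for s in segments)
```

Merging a segment created with 6 and one created with 7 yields a segment whose marker is 6, so in this model Lucene 7 can still open it but Lucene 8 cannot. Only a segment built from scratch with 7 passes the check for 8.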
>>>> 
>>>> So you probably need to re-index from scratch on whatever version of 
>>>> Lucene you want to use.
>>>> 
>>>> Best,
>>>> Erick
>>>> 
>>>> 
>>>> 
>>>>> On Jun 17, 2019, at 8:41 AM, David Allouche <da...@allouche.net> wrote:
>>>>> 
>>>>> Hello,
>>>>> 
>>>>> I use Lucene with PyLucene on a public-facing web application. We have a 
>>>>> moderately large index (~24M documents, ~11GB index data), with a 
>>>>> constant stream of new documents.
>>>>> 
>>>>> I recently upgraded to PyLucene 7.
>>>>> 
>>>>> When trying to test the new release of PyLucene 8, I encountered an 
>>>>> IndexFormatTooOld error because my index conversion from Lucene6 to 
>>>>> Lucene7 was not complete.
>>>>> 
>>>>> I found IndexUpgrader, and I had a look at its implementation. I would 
>>>>> very much like to avoid putting down the service during the index 
>>>>> upgrade, so I believe I cannot use IndexUpgrader because I need the write 
>>>>> lock to be held by the web application to index new documents.
>>>>> 
>>>>> So I figure I could get the desired result with an 
>>>>> IndexWriter.forceMerge(1). But the documentation says "This is a horribly 
>>>>> costly operation, especially when you pass a small maxNumSegments; 
>>>>> usually you should only call this if the index is static (will no longer 
>>>>> be changed)." 
>>>>> https://lucene.apache.org/core/7_7_2/core/org/apache/lucene/index/IndexWriter.html#forceMerge-int-
>>>>> 
>>>>> And indeed, forceMerge tends to be killed by the kernel OOM killer on my 
>>>>> development VM. I want to avoid this failure mode in production. I could 
>>>>> increase the VM until it works, but I would rather have a less brutal 
>>>>> approach to upgrading a live index. Something that could run in the 
>>>>> background with reasonable amounts of anonymous memory.
>>>>> 
>>>>> What is the recommended approach to upgrading a live index?
>>>>> 
>>>>> How can I know from the code that the index needs upgrading at all? I 
>>>>> could add a manual knob to start an upgrade, but it would be better if it 
>>>>> occurred transparently when I upgrade PyLucene.
>>>>> 
>>>>> 
>>>>> 
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org