Re: Best strategy migrate indexes

Pablo Vázquez Blázquez Mon, 07 Nov 2022 03:08:56 -0800

Hi!

> I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.


I am doing the following steps:

- create an IndexReader with the lucene5 package over a lucene5 index
- create an IndexWriter with the lucene7 package
- iterate over reader.numDocs() to process each Document (lucene5)
    - convert each Document (lucene5) to a lucene7 Document
        - for each IndexableField (lucene5) of the Document (lucene5),
          create the equivalent IndexableField (lucene7)
            - create a SortedDocValuesField (lucene7) and add it to the
              Document (lucene7)
            - add the field to the Document (lucene7)
    - add each converted Document to the writer
- close the IndexReader and the IndexWriter
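
For illustration, the loop above can be modelled in plain Java. This is
only a sketch: OldField, NewField and MigrationSketch are hypothetical
stand-ins for the shaded lucene5/lucene7 classes, not real Lucene API, and
a BitSet stands in for the reader's live-docs bits. Two assumptions are
made explicit: document IDs run from 0 to maxDoc() - 1, so an index with
deletions needs a live-docs check rather than a loop bounded by numDocs(),
and only stored field values can be read back from the old index.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class MigrationSketch {

    // Converts one old document: every recoverable field yields a doc-values
    // copy followed by the regular field, in the order listed in the steps.
    static List<NewField> convert(List<OldField> oldDoc) {
        List<NewField> newDoc = new ArrayList<>();
        for (OldField field : oldDoc) {
            if (field.value == null) {
                continue; // nothing stored, so nothing to carry over
            }
            newDoc.add(new NewField(field.name, field.value, true));
            newDoc.add(new NewField(field.name, field.value, false));
        }
        return newDoc;
    }

    // Visits doc IDs 0..maxDoc-1 and skips deleted ones. A null liveDocs
    // models "no deletions" (assumption mirroring Lucene's live-docs Bits).
    static List<List<NewField>> migrate(List<List<OldField>> oldDocs, BitSet liveDocs) {
        List<List<NewField>> written = new ArrayList<>();
        int maxDoc = oldDocs.size();
        for (int docId = 0; docId < maxDoc; docId++) {
            if (liveDocs != null && !liveDocs.get(docId)) {
                continue; // deleted document
            }
            written.add(convert(oldDocs.get(docId)));
        }
        return written;
    }

    public static void main(String[] args) {
        List<List<OldField>> oldDocs = new ArrayList<>();
        oldDocs.add(List.of(new OldField("id", "a")));
        oldDocs.add(List.of(new OldField("id", "b"))); // marked deleted below
        oldDocs.add(List.of(new OldField("id", "c")));
        BitSet liveDocs = new BitSet();
        liveDocs.set(0);
        liveDocs.set(2);
        System.out.println(migrate(oldDocs, liveDocs).size() + " documents written");
    }
}

// Hypothetical stand-in for a field read from the old (lucene5) index.
// A null value models a field that was indexed but not stored.
final class OldField {
    final String name;
    final String value;

    OldField(String name, String value) {
        this.name = name;
        this.value = value;
    }
}

// Hypothetical stand-in for a field written to the new index.
final class NewField {
    final String name;
    final String value;
    final boolean docValues; // true for the SortedDocValuesField-like copy

    NewField(String name, String value, boolean docValues) {
        this.name = name;
        this.value = value;
        this.docValues = docValues;
    }
}
```

In a real tool, convert() would build lucene7 Field objects and migrate()
would pass each result to IndexWriter.addDocument.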

When I open the resulting migrated lucene7 index with Luke, I get an error:

org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="tests_small_index-7.x-migrator\segments_1"))): 9 (needs to be between 6 and 7)

When I use the tool "luceneupgrader"
<https://github.com/hakanai/luceneupgrader>, I get:

java -jar luceneupgrader-0.5.2-SNAPSHOT.jar info tests_small_index-7.x-migrator
Lucene index version: 7

What am I doing wrong, or what am I misunderstanding?

Thanks!

On Wed, 2 Nov 2022 at 21:13, Pablo Vázquez Blázquez (<pabl...@gmail.com>)
wrote:

> Hi,
>
>> Luckily we were already using lucenemigrator
>
> What do you mean by "lucenemigrator"? Is it a public tool?
>
> I am trying to create a tool to read docs from a lucene5 index and
> generate lucene9 documents from them (with docValues). That might work,
> right? I am shading both lucene5 and lucene9 to avoid package conflicts.
>
> Thanks!
>
> On Tue, 1 Nov 2022 at 0:35, Trejkaz (<trej...@trypticon.org>) wrote:
>
>> Well...
>>
>> There's a way, but I wouldn't necessarily recommend it.
>>
>> You can write custom migration code against some version of Lucene
>> which supports doc values, to create doc values fields. It's going to
>> involve writing a FilterCodecReader which wraps your real index and
>> then pretends to also have doc values, which you'll build in a custom
>> class which works similarly to UninvertingReader. Then you pass those
>> CodecReaders to IndexWriter.addIndexes to create a new index which
>> really has those doc values.
>>
>> We did that ourselves when we had the same issue. The only painful
>> thing about it is having to keep around older versions of lucene to do
>> that migration. Forever. Luckily we were already using lucenemigrator,
>> which has the older versions baked into it with package prefixes. So
>> that library will get fatter and fatter over time but at least our own
>> code only gets fatter at the rate migrations are added.
>>
>> The same approach works for any other kind of ad-hoc migration you
>> might want to perform. e.g., you might want to create points. Or
>> remove an index for a field. Or add an index for a field.
>>
>> TX
>>
>>
>> On Tue, 1 Nov 2022 at 02:57, Pablo Vázquez Blázquez <pabl...@gmail.com>
>> wrote:
>> >
>> > Hi all,
>> >
>> > Thank you all for your responses.
>> >
>> > So, when updating to a newer (major) Lucene version that modifies its
>> > codecs, there is no way to ensure everything keeps working properly
>> > short of re-indexing, right?
>> >
>> > Apart from not having some of the original sources that were indexed
>> > (which I will try to solve by using the *IndexUpgrader* tool), I have
>> > another problem: I was using
>> > org.apache.lucene.uninverting.UninvertingReader to perform queries
>> > against the index, mainly through the grouping API. But it has since
>> > been removed (as of Lucene 7.0). So, again, do I have any alternative
>> > to re-indexing in order to use docValues?
>> >
>> > To give you more context, I am a developer of a tool that multiple
>> > customers can use to index their data (currently, with Lucene 5.5.5). We
>> > are planning to upgrade to Lucene 9 (because of some vulnerabilities
>> > affecting Lucene 5.5.5) and I think asking them to reindex will not go
>> > down well :(
>> >
>> > Regards,
>> >
>> > On Sat, 29 Oct 2022 at 23:31, Matt Davis (<kryptonics...@gmail.com>)
>> > wrote:
>> >
>> > > Inside the Zulia search engine, the object being indexed is always a
>> > > JSON/BSON object, and we store the BSON as a stored byte field in the
>> > > index. This allows easy internal reindexing when the searchable fields
>> > > change, but also allows us to update to the latest Lucene version.
>> > > Combined with lucene-backward-codecs, an index older than the current
>> > > major version can be opened and reindexed. If you have stored all the
>> > > fields (or a JSON/BSON) in the index, it would be easy to reindex in
>> > > the new format. If you have not, maybe opening with
>> > > lucene-backward-codecs will be enough for your use case.
>> > >
>> > > Thanks,
>> > > Matt
>> > >
>> > > On Sat, Oct 29, 2022 at 2:30 PM Baris Kazar <baris.ka...@oracle.com>
>> > > wrote:
>> > >
>> > > > It is always good practice to retain the original, non-indexed
>> > > > data: whenever Lucene changes version, even a minor version,
>> > > > I always reindex.
>> > > >
>> > > > Best regards
>> > > > ________________________________
>> > > > From: Gus Heck <gus.h...@gmail.com>
>> > > > Sent: Saturday, October 29, 2022 2:17 PM
>> > > > To: java-user@lucene.apache.org <java-user@lucene.apache.org>
>> > > > Subject: Re: Best strategy migrate indexes
>> > > >
>> > > > Hi Pablo,
>> > > >
>> > > > The deafening silence is probably nobody wanting to give you the
>> > > > bad news. You are on a mission that may not be feasible, and even
>> > > > if you can get it to "work", the end result won't likely be
>> > > > equivalent to indexing the original data with Lucene 9.x. The
>> > > > indexing process is fundamentally lossy, and information
>> > > > originally used to produce non-stored fields will have been
>> > > > thrown out. A simple example is things like stopwords or anything
>> > > > analyzed with subclasses of FilteringTokenFilter. If the stop
>> > > > word list changed, or the details of one of these filters changed
>> > > > (bugfix?), you will end up with a different result than indexing
>> > > > with 9.x. This is just one example; another would be stemming,
>> > > > where the index likely only contains the stem, not the whole
>> > > > word. Other folks who are more interested in the details of our
>> > > > codecs than I am can probably provide further examples on a more
>> > > > fundamental level. Lucene is not a database, and the source
>> > > > documents should always be retained in a form that can be
>> > > > reindexed. If you have inherited a system where source material
>> > > > has not been retained, you have a difficult project and may have
>> > > > some potentially painful expectation setting to perform.
>> > > >
>> > > > Best,
>> > > > Gus
>> > > >
>> > > >
>> > > >
>> > > > On Fri, Oct 28, 2022 at 8:01 AM Pablo Vázquez Blázquez
>> > > > <pabl...@gmail.com> wrote:
>> > > >
>> > > > > Hi all,
>> > > > >
>> > > > > I have some indices indexed with Lucene 5.5.0. I have updated my
>> > > > > dependencies and code to Lucene 7 (but my final goal is to use
>> > > > > Lucene 9), and when trying to work with them I get this exception:
>> > > > >
>> > > > > org.apache.lucene.index.IndexFormatTooOldException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path=".......\tests\segments_b"))): this index is too old (version: 5.5.0). This version of Lucene only supports indexes created with release 6.0 and later.
>> > > > >
>> > > > > I want to migrate from Lucene 5.x to Lucene 9.x. What is the best
>> > > > > strategy? Is there any tool to migrate the indices? Is it mandatory
>> > > > > to reindex? If so, how can I deal with this when I do not have the
>> > > > > source documents that generated my current indices (I mean, I just
>> > > > > have the indices themselves)?
>> > > > >
>> > > > > Thanks,
>> > > > >
>> > > > > --
>> > > > > Pablo Vázquez
>> > > > > (pabl...@gmail.com)
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > > https://urldefense.com/v3/__http://www.needhamsoftware.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuio4iIYARA$ (work)
>> > > > https://urldefense.com/v3/__http://www.the111shift.com__;!!ACWV5N9M2RV99hQ!PVR-c0gAs5FpIrnotHWeo3sEWScxV8oFJrVpGdItGZictcDbRvnp5aZSqCRhglMCYqQsewQOuirxfFWpEQ$ (play)
>> > > >
>> > >
>> >
>> >
>> > --
>> > Pablo Vázquez
>> > (pabl...@gmail.com)
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>>
>>
>
> --
> Pablo Vázquez
> (pabl...@gmail.com)
>


-- 
Pablo Vázquez
(pabl...@gmail.com)
