Re: Fields, Index segments and docIds (second Try)

Jose Carlos Canova Tue, 29 Apr 2014 21:09:07 -0700

My suggestion is not to worry about the docId. In practice it is an
"internal lucene" id, quite similar to a rowId in a database. Each index
may generate a different docId for a translated document (that is the
index's problem), so you can use your own ID to relate one document to
another across different indexes. This matters mainly because, as you
mention, these are translated documents that in theory can be ranked
differently from language to language (a set of documents in different
languages is not obliged to follow the same rank order, but I am not 100%
sure about this).
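
A minimal sketch of the "use your own ID" idea, in plain Java with no Lucene
classes at all. The class ExternalIdLookup, its methods and the IDs "doc-a",
"doc-b", "doc-c" are all hypothetical names made up for illustration; each
map only stands in for one index that numbers its documents in its own
insertion order. In a real index the application ID would presumably be
stored as an ordinary field on every document and looked up by its value.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Toy model (assumption, not Lucene): every index numbers its documents on its own. */
class ExternalIdLookup {

    /** Builds a stand-in "index": application ID -> internal docId (insertion order). */
    static Map<String, Integer> buildIndex(List<String> externalIds) {
        Map<String, Integer> idToDoc = new HashMap<>();
        int docId = 0;
        for (String externalId : externalIds) {
            idToDoc.put(externalId, docId++);
        }
        return idToDoc;
    }

    /** Resolves the stable application ID to the docId of one particular index. */
    static Optional<Integer> resolve(Map<String, Integer> index, String externalId) {
        return Optional.ofNullable(index.get(externalId));
    }

    public static void main(String[] args) {
        // The same translated documents, added in a different order per language.
        Map<String, Integer> english = buildIndex(Arrays.asList("doc-a", "doc-b", "doc-c"));
        Map<String, Integer> german = buildIndex(Arrays.asList("doc-c", "doc-a"));

        // The internal docIds differ, but the application ID finds the document in both.
        System.out.println("EN docId of doc-c: " + resolve(english, "doc-c"));
        System.out.println("DE docId of doc-c: " + resolve(german, "doc-c"));
        // A document without a German translation simply resolves to nothing.
        System.out.println("DE docId of doc-b: " + resolve(german, "doc-b"));
    }
}
```

The point of the sketch is only that nothing in the application has to know
or remember an internal docId; the stable key is the one you control.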


The second reason is that they may change the internal structure of Lucene
without any guarantee, and then you lose forward compatibility.

I am not an expert on Lucene like Schindler, but from reading their
documentation I understood that they pay special attention to what is
"internal lucene" and "experimental lucene": internal means "compatibility
not guaranteed", and experimental means "may be removed".

For example, suppose they (apache-lucene) discover a "new manner" of
relating documents that is more efficient and change some mechanism. If
your application uses an internal mechanism that is highly coupled to
Lucene version xxx (marked as "internal-lucene"), you can get stuck on a
specific version and in the future have to rewrite some code. This might
cause some "management conflict" if your project follows continuous
integration and you are subordinate to a management structure (bad for
you).
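
One way to picture the coupling problem is a small sketch like the one
below, again in plain Java. Everything in it is hypothetical: the interface
TranslationSearch, the class InMemoryTranslationSearch and the method
findIdsContaining are names invented for this example, and the in-memory
implementation is only a placeholder for whatever Lucene-backed code you
would really write. The assumption is that application code sees only the
interface and only application IDs, never an internal docId.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical application-facing contract: no search-library types appear here. */
interface TranslationSearch {
    /** Returns the application IDs of the documents whose text contains the word. */
    List<String> findIdsContaining(String word);
}

/** Toy implementation; a Lucene-backed class could implement the same interface. */
class InMemoryTranslationSearch implements TranslationSearch {
    private final Map<String, String> textById = new LinkedHashMap<>();

    void add(String externalId, String text) {
        textById.put(externalId, text);
    }

    @Override
    public List<String> findIdsContaining(String word) {
        String wanted = word.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> entry : textById.entrySet()) {
            // Naive whitespace tokenization, good enough for the illustration only.
            for (String token : entry.getValue().toLowerCase().split("\\s+")) {
                if (token.equals(wanted)) {
                    hits.add(entry.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}

class DecouplingDemo {
    public static void main(String[] args) {
        InMemoryTranslationSearch impl = new InMemoryTranslationSearch();
        impl.add("doc-a", "to sleep");
        impl.add("doc-b", "to run");

        // Callers hold the interface type, so the implementation can be swapped.
        TranslationSearch search = impl;
        System.out.println(search.findIdsContaining("sleep"));
    }
}
```

If the library changes one of its internals, only the class behind the
interface has to be rewritten, not the whole application.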

I have seen this in several projects that use Lucene: they do not upgrade
their Lucene components in their new releases. One example, if I am not
wrong, still uses Lucene 3, and there are others I saw around (e.g. Luke).
To me this suggests that "the project was abandoned because the way they
integrated with Lucene was not fully functional".

Another interesting thing is that developing around Lucene, rather than
inside it, is more effective: you guarantee that your product will work and
they guarantee that Lucene works too. This is related to design by contract.

Regards.

On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <[email protected]> wrote:

> Hello.
>
> Sorry to bring this up again. I don't want to be rude and I mean no
> disrespect, but after thinking it through today,
> I need, and would really love, to have the answer to the following
> question:
>
> 1) At Lucene indexing time, is it possible to rewrite a read-only index so
> that some fields are only found in some segments (and how)?
>
>
> Uwe Schindler suggested using different indexes and a MultiReader for my
> needs, and it probably answers my second question, better formulated as "Is
> it possible to restrict an index to some of its segments?", as a
> CompositeReader with AtomicReaders (or a custom Directory) that reads the
> aforementioned segments might do the trick.
>
> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't solve
> my needs as I have around 300000 documents of the following kind :
>
> READ ONLY Document :
> // common fields shipped with the App that aren't language related
> A:
> B:
> C:
> // fields shipped with the English package (a zip)
> EN:
> EN_Words:
> EN_Sentences:
> some DocValues
> // fields shipped with the German package (a zip)
> DE:
> DE_Words:
> DE_Sentences:
> some DocValues
> ...
> There might be hundreds of language packages that my users might use
>
>
> If I use different indexes
> indexA for the common stuff,
> indexEN for the English package,
> indexDE for the German package,
>
> For sure, I will be able to make a big index out of those by using a
> MultiReader,
> BUT it really makes a union out of the three indexes (right?), which means
> I'll have 900000 documents,
> and the documents in indexA won't have any relation to the documents
> in indexEN (right?), except if I give each document an id in each index and
> make a join at query time, which is a big no-no because I use a QueryParser
> and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)"
>
> Or am I mistaken, and there is a way to create a document in three
> different indexes that stays related through the same docId?
>
>
> My solution if question 1 is possible :
>
> In contrast, if I am able to build my index so that my READ ONLY Documents
> are stored in
>
> SEGMENT 1
> // common fields shipped with the App that aren't language related
> A:
> B:
> C:
>
> SEGMENT 2
> // fields shipped with the English package (a zip)
> EN:
> EN_Words:
> EN_Sentences:
> some DocValues
>
> SEGMENT 3
> // fields shipped with the German package (a zip)
> DE:
> DE_Words:
> DE_Sentences:
> some DocValues
>
>
> I only need to ship SEGMENT 1 in the App and let users download SEGMENT 2
> or SEGMENT 3 depending on whether they want English or German,
> and use a composite reader with atomic readers (right?) to use my
> Frankenstein index at query time with a QueryParser.
>
>
> Also, in case question 1 is possible, I would really like to know whether
> it is possible to remap docIds at build time in a read-only index.
> An application of this would be:
>
> On day 1, I ship my app with 2 language packages: English and German
> (documents are uniquely identified by a docId... or by an external id,
> thanks to a docId <-> external id map).
>
> On day 2, I ship an additional language package (French), because I'm able
> to build an index with English, German and French with the exact same docIds
> for each document as in the index shipped on day 1.
