RE: Fields, Index segments and docIds

Uwe Schindler Mon, 28 Apr 2014 23:47:24 -0700

Hi Oliver,

It looks to me like you are making this far more complicated than it needs to 
be. It also seems that you have misunderstood join queries, which appears to be 
the root of your problem. Comments inline:


> My lucene Index is built and stored in a zip file (uncompressed) which is used
> as a read-only Directory.
> 
> 1) At lucene indexing time, is it possible to rewrite the index so that some
> fields are only found in some segments Say :
> 
> EnglishWords, EnglishVerbs go to Segment 1
> GermanWords, GermanSentences go to Segment 2
> French, frenchWines go to Segment 3
> ...

You can create exactly the same index structure yourself without touching 
Lucene internals. Just index each language into a separate index with its own 
IndexWriter. As those indexes are read-only, you can call forceMerge(1) after 
indexing so that each index has exactly one segment, i.e. every language ends 
up in one single segment.

The only difference is that you would need a separate ZIP file for every 
language (which is probably what you want anyway, since you plan to ship 
"language packs"). Alternatively, you would have to rewrite your ZIP-based 
Directory implementation to work on subdirectories inside a single ZIP file.
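
As a rough illustration of the second option, here is a minimal sketch in plain 
Java (no Lucene classes) of scoping file names to one subdirectory of a ZIP 
file. The class name ZipSubdirectoryView and the layout with one subdirectory 
per language ("en/", "de/", ...) are assumptions made for this example. A real 
Directory implementation would also have to open the files for reading; this 
only shows the name handling that a listAll()-style method would need.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

/** Hypothetical helper: views one subdirectory of a ZIP file as a flat file list. */
final class ZipSubdirectoryView {
    private final ZipFile zip;
    private final String prefix; // e.g. "en/" (assumed layout: one subdirectory per language)

    ZipSubdirectoryView(ZipFile zip, String subdirectory) {
        this.zip = zip;
        this.prefix = subdirectory.endsWith("/") ? subdirectory : subdirectory + "/";
    }

    /** Sorted names of the files directly inside the subdirectory, without the prefix. */
    List<String> listAll() {
        List<String> names = new ArrayList<>();
        Enumeration<? extends ZipEntry> entries = zip.entries();
        while (entries.hasMoreElements()) {
            ZipEntry entry = entries.nextElement();
            String name = entry.getName();
            if (entry.isDirectory() || !name.startsWith(prefix)) {
                continue; // belongs to another language pack, or is a directory entry
            }
            String relative = name.substring(prefix.length());
            // Skip anything nested deeper; an index directory is flat.
            if (!relative.isEmpty() && relative.indexOf('/') < 0) {
                names.add(relative);
            }
        }
        Collections.sort(names);
        return names;
    }
}
```

With such a view per language, each subdirectory can be treated like its own 
index directory while still shipping a single ZIP file.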

> 2) In what file is the index structure written (number of index,
> docValues...) ? And, is it possible, to tamper in some way with this Say, in a
> Directory implementation...at start of my application, to tell the lucene
> index to use this segment or not

If every language is a separate index, just use "new MultiReader(indexReader1, 
indexReader2, indexReader3)" to combine them and query the MultiReader. This is 
structurally identical to a single DirectoryReader (which is itself handled as 
a composite of its segments internally, just like a MultiReader) and therefore 
has no speed impact.
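
To make the "identical structure" point more concrete, here is a toy model in 
plain Java (not Lucene code; the class and method names are hypothetical) of 
how a composite reader addresses documents. Each sub-reader gets a doc base 
equal to the total document count of the sub-readers before it, and a global 
docID is the doc base plus the local docID. The same arithmetic applies 
whether the sub-readers are the segments of one index or several one-segment 
indexes combined in a MultiReader.

```java
/** Toy model (not Lucene code) of docID addressing in a composite reader. */
final class CompositeDocIds {
    private final int[] docBases; // docBases[i] = first global docID of sub-reader i
    private final int maxDoc;     // total number of documents across all sub-readers

    /** @param subMaxDocs number of documents in each sub-reader, in order */
    CompositeDocIds(int... subMaxDocs) {
        docBases = new int[subMaxDocs.length];
        int base = 0;
        for (int i = 0; i < subMaxDocs.length; i++) {
            docBases[i] = base;
            base += subMaxDocs[i];
        }
        maxDoc = base;
    }

    int maxDoc() {
        return maxDoc;
    }

    /** Index of the sub-reader that holds the given global docID. */
    int subIndex(int globalDocId) {
        if (globalDocId < 0 || globalDocId >= maxDoc) {
            throw new IndexOutOfBoundsException("docID out of range: " + globalDocId);
        }
        // Linear scan from the end keeps the sketch short and skips empty sub-readers.
        for (int i = docBases.length - 1; i > 0; i--) {
            if (globalDocId >= docBases[i]) {
                return i;
            }
        }
        return 0;
    }

    /** docID relative to the sub-reader that holds the document. */
    int localDocId(int globalDocId) {
        return globalDocId - docBases[subIndex(globalDocId)];
    }
}
```

For example, with three language indexes of 3, 5 and 2 documents, global docID 
8 resolves to the third index, local docID 0.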

> If 1, 2 were possible, I think that it would allow me to ship my index
> in a modular way in my apps (with language packs)
> and do join queries as regular queries, with no speed penalty

The "join" keyword seems to be your main misunderstanding. Join queries have 
nothing to do with multiple indexes. In Lucene, "join" queries join documents 
of different types within the same index! Querying multiple indexes together is 
not joining; it is simple and very fast (because this is how Lucene was 
designed): just use the MultiReader approach from above to query all indexes at 
the same time. As a MultiReader over many one-segment DirectoryReaders is 
identical to one large DirectoryReader with n segments, there is no difference 
at all.

This is something different:

> 3) At lucene indexing time, is it possible to remap the docId values  (I saw
> some MergeState.mapDocId method...) Say
>   0 -> 4
>   1 -> 3
>   2 -> 1
>   3 -> 0
>   4 -> 2
> 
> If 3 is possible, It would allow me to have some sort of
> forward/backward compatibilities with my shipped language packs
> and also to have fast implementations for some id related methods

What are you actually trying to achieve, and why do you want to do it this way? 
(Please see the XY problem: <https://people.apache.org/~hossman/#xyproblem>.)

Uwe

