Re: Fields, Index segments and docIds (second Try)

Shai Erera Fri, 02 May 2014 05:43:58 -0700

I don't think that you need to be concerned with the internal docIDs much.
Just imagine the indexes as a big table with multiple columns, where
columns are grouped together. Each group is a different index. If a
document does not have a value in one column, then you have an empty cell.
If a document doesn't have a value in an entire group of columns, then you
denote that by adding an empty document.
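
To make the table picture concrete, here is a minimal sketch in plain Python.
It is only a toy model and does not use the Lucene API: each index is a list,
the position in the list stands in for the internal docID, and build_index
and parallel_view are made-up helper names.

```python
# Toy model only: each "index" is a list, and a document's position in the
# list stands in for its internal Lucene docID. All names are hypothetical.

def build_index(doc_keys, fields_by_key):
    """Add one entry per document, in the given order; use an empty
    document ({}) when a document has no fields in this index."""
    return [dict(fields_by_key.get(key, {})) for key in doc_keys]


def parallel_view(*indexes):
    """Combine the rows found at the same position in every index."""
    if len({len(index) for index in indexes}) != 1:
        raise ValueError("indexes are not aligned: document counts differ")
    merged = []
    for rows in zip(*indexes):
        doc = {}
        for row in rows:
            doc.update(row)
        merged.append(doc)
    return merged


doc_keys = ["FirstDoc", "SecondDoc", "ThirdDoc"]
common = build_index(doc_keys, {k: {"A": k + "-a"} for k in doc_keys})
# SecondDoc has no German translation, so it gets an empty placeholder.
german = build_index(doc_keys, {
    "FirstDoc": {"DE_words": "eins"},
    "ThirdDoc": {"DE_words": "drei"},
})
view = parallel_view(common, german)
```

Dropping the placeholder for SecondDoc would leave the German list one entry
short, and every later row would pair up with the wrong document.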


Oh, and make sure to use a LogMergePolicy (e.g. LogDocMergePolicy, as the
ParallelCompositeReader jdocs suggest), so segments are merged in the
same order across all indexes.
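
Why the merge policy matters can be sketched with another toy model (again
plain Python, not Lucene code; segment_sizes and its parameters are made up
for illustration, and the merge rule is much simpler than what a real
LogMergePolicy does). The rule below looks only at document counts, never at
document contents, so two indexes fed the same number of documents in the
same order end up with the same segment layout.

```python
# Toy model only: a deliberately simplified, count-based merge rule.
# It is NOT the real LogDocMergePolicy algorithm.

def segment_sizes(docs, docs_per_flush=2, merge_factor=3):
    """Return the document count of each segment after indexing `docs`."""
    segments = []  # each entry is the number of documents in one segment
    for start in range(0, len(docs), docs_per_flush):
        # Flush a new segment holding the next batch of documents.
        segments.append(len(docs[start:start + docs_per_flush]))
        # Merge the trailing segments while `merge_factor` of them
        # have the same document count.
        while (len(segments) >= merge_factor
               and len(set(segments[-merge_factor:])) == 1):
            merged = sum(segments[-merge_factor:])
            del segments[-merge_factor:]
            segments.append(merged)
    return segments


# Large documents in one index, small or empty ones in the other.
common = [{"A": "x" * 100} for _ in range(14)]
german = [{"DE_words": "w"} if i % 2 == 0 else {} for i in range(14)]

common_layout = segment_sizes(common)  # [6, 6, 2]
german_layout = segment_sizes(german)  # [6, 6, 2]
```

A rule that looked at the size of the documents instead could pick different
merges for the two indexes, because their contents differ.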

And given that you rebuild the indexes every time, you can create them
one-by-one. You don't need to do that in parallel to all indexes, unless
it's more convenient for you.
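
If you build them one-by-one, a quick sanity check afterwards could compare
the per-segment document counts of each index. A sketch, assuming you have
already collected those counts somehow (for instance from maxDoc() of each
leaf reader); the dictionary below is made-up data and misaligned_indexes is
a hypothetical helper, not a Lucene call:

```python
# Toy check only: compares lists of per-segment document counts.

def misaligned_indexes(segment_counts, reference="common"):
    """Return the names of indexes whose segment layout differs from
    the layout of the reference index."""
    expected = segment_counts[reference]
    return sorted(name for name, counts in segment_counts.items()
                  if counts != expected)


# Every index holds documents 0-5, but German was flushed as two segments.
layout = {"common": [6], "english": [6], "german": [3, 3]}
bad = misaligned_indexes(layout)  # ["german"]
```

Here the German index has the right documents in the right order and still
fails the check, because its segments do not line up with the others.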

Shai


On Fri, May 2, 2014 at 9:28 AM, Olivier Binda <[email protected]> wrote:

> On 05/02/2014 06:05 AM, Shai Erera wrote:
>
>> If you're always rebuilding, let alone forceMerge, you shouldn't have too
>> much trouble implementing it. Just make sure that you add documents in the
>> same order to all indexes.
>>
>> If you're always rebuilding, how come you have deletions? Anyway, you must
>> also delete in all indexes.
>>
>
> Indeed, I don't have deletions and I'm mainly concerned with merges.
> But I just want to understand the whole docId remapping process,
> out of curiosity and also because obtaining a docId (and not losing it)
> seems to be the key to parallel indexes.
>
>  On May 2, 2014 1:57 AM, "Olivier Binda" <[email protected]> wrote:
>>
>>  On 05/01/2014 10:28 AM, Shai Erera wrote:
>>>
>>>> I'm glad it helped you. Good luck with the implementation.
>>>>
>>> Thanks. First I started looking at the lucene internal code, to understand
>>> when/where and why docIds are changing/need to be changed (in merge and
>>> doc deletions).
>>> I have always wanted to understand this and I think the understanding may
>>> help me somehow.
>>>
>>>> One thing I didn't mention (though it's in the jdocs) -- it's not enough to
>>>> have the documents of each index aligned, you also have to have the
>>>> segments aligned. That is, if both indexes have documents 0-5 aligned, but
>>>> one index contains a single segment and the other one 2 segments, that's
>>>> not going to work.
>>>>
>>> That's good to know.
>>>
>>>> It is possible to do w/ some care -- when you build the German index,
>>>> disable merges (use NoMergePolicy) and flush whenever you have indexed
>>>> enough documents to match an existing segment on e.g. the Common index.
>>>>
>>>> Or, if rebuilding all indexes won't take long, you can always rebuild all
>>>> of them.
>>>>
>>> Yes. That's what I am usually doing (it takes less than 1 minute).
>>> Yet, I usually do a forceMerge too, to only have 1 segment :/
>>>
>>>> Shai
>>>
>>>>
>>>> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <[email protected]> wrote:
>>>>
>>>>   On 04/30/2014 10:48 AM, Shai Erera wrote:
>>>>
>>>>>> I hope I got all the details right, if I didn't then please clarify. Also,
>>>>>> I haven't read the entire thread, so if someone already suggested this ...
>>>>>> well, it probably means it's the right solution :)
>>>>>>
>>>>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>>>>> already handles multiple IndexReaders that are aligned by their internal
>>>>>> document IDs. The way it would work, as far as I understand your scenario,
>>>>>> is something like the following table (columns denote different indexes).
>>>>>> Each index contains a subset of relevant fields, where common contains the
>>>>>> common fields, and each language index contains the respective language
>>>>>> fields.
>>>>>>
>>>>>> DocID        LuceneID  Common  English       German        ....
>>>>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>>>>                                EN_sentences  DE_sentences
>>>>>> "SecondDoc"  1         A,B,C
>>>>>> "ThirdDoc"   2         A,B,C
>>>>>>
>>>>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>>>>> not all documents have a value for the 'B' field in the 'common' index).
>>>>>> What's absolutely essential here though is that the indexes are
>>>>>> created very carefully, and if e.g. SecondDoc is not translated into
>>>>>> German, *you must still have an empty document* in the German index,
>>>>>> otherwise document IDs will not align.
>>>>>>
>>>>> That's exactly how I saw it and what I need to do. So, I'll have a very
>>>>> good look at ParallelCompositeReader.
>>>>>
>>>>>
>>>>>> Lucene does not offer a way to build those indexes though (patches
>>>>>> welcome!!).
>>>>>>
>>>>> This answers my question 1. Thanks.  :)
>>>>> I somehow hoped that there was already support for that kind of situation
>>>>> in Lucene, but well,
>>>>> now at least I know that I won't find a ready-made solution to my
>>>>> problem in the Lucene classes and that I will have to code one myself,
>>>>> taking inspiration from the Lucene classes that do similar processing.
>>>>>
>>>>>> We started some effort a very long time ago on LUCENE-1879
>>>>>> (there's a patch and a discussion for an alternative approach), and
>>>>>> there is also a very useful suggestion in ParallelCompositeReader's jdocs
>>>>>> (use LogDocMergePolicy).
>>>>>>
>>>>> Wow, priceless. This gives me a head start and some inspiration. :)
>>>>>>
>>>>>
>>>>>> One challenge is how to support multi-threaded indexing, but perhaps this
>>>>>> isn't a problem in your application? It sounds like, by you writing that a
>>>>>> user will "download the german index", that the indexes are built offline?
>>>>>>
>>>>> Indeed. The index is built offline, in a single thread, and once it is
>>>>> built, it is read only.
>>>>> Can't find an easier situation. :)
>>>>>
>>>>>
>>>>>> Another challenge is how to control segment merging, so that the *exact
>>>>>> same segments* are merged over the parallel indexes. Again, if your
>>>>>> application builds the indexes offline, then this should be easier to
>>>>>> accomplish.
>>>>>>
>>>>>> I assume though that when you index e.g. the German documents, then the
>>>>>> already indexed 'common' fields do not change for a document. If they do,
>>>>>> you will need to rebuild the 'common' index too.
>>>>>>
>>>>>> Once you achieve a correct parallel index, it is very easy to open a
>>>>>> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
>>>>>> Common+German, or Common+English+German and search it, since the internal
>>>>>> document IDs are perfectly aligned.
>>>>>>
>>>>>> Shai
>>>>>>
>>>>> Many thanks for the awesome answer and the help (I love you).
>>>>> As I really really really need this to happen, I'm going to start working
>>>>> on this really soon.
>>>>>
>>>>> I'm definitely not an expert on threads, filesystems and Lucene's inner
>>>>> workings, so I can't promise to contribute a miraculous patch though,
>>>>> especially since I won't work on the multi-thread aspect of the problem.
>>>>> But I'll do the best I can and contribute back whatever code I can
>>>>> produce.
>>>>>
>>>>> Many thanks, again. :)
>>>>>
>>>>>
>>>>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <[email protected]> wrote:
>>>>>>
>>>>>>> My suggestion is that you not worry about the docId. In practice it is an
>>>>>>> "internal lucene" id, quite similar to a rowId in a database. Each index
>>>>>>> may generate a different docId (that is its problem) for a translated
>>>>>>> document. You may use your own ID that relates one document to another
>>>>>>> in a different index, mainly because, as you mention, these are
>>>>>>> translated documents that in theory can be ranked differently from
>>>>>>> language to language (a set of documents in different languages is not
>>>>>>> obliged to keep the same rank order, but I am not 100% sure about this).
>>>>>>>
>>>>>>> The second reason is that they may change the internal structure of
>>>>>>> Lucene without any guarantee, and then you lose forward compatibility.
>>>>>>>
>>>>>>> I am not an expert on Lucene like Schindler, but from reading their
>>>>>>> documentation I understood that they pay special attention to
>>>>>>> "internal lucene" and "experimental lucene": internal means
>>>>>>> "compatibility not guaranteed", and experimental means "may be removed".
>>>>>>>
>>>>>>> For example, suppose they (apache-lucene) discover a "new manner" of
>>>>>>> relating documents that is more efficient and change some mechanism. If
>>>>>>> your application uses an internal mechanism that is tightly coupled to
>>>>>>> Lucene version xxx (marked as "internal-lucene"), you can get stuck on
>>>>>>> a specific version and later have to rewrite some code. This might
>>>>>>> cause some "management conflict" if your project follows continuous
>>>>>>> integration and you are subordinate to a management structure (bad for
>>>>>>> you).
>>>>>>>
>>>>>>> I saw this on several projects that use Lucene: they do not upgrade
>>>>>>> their Lucene components in their new releases. One example, if I am not
>>>>>>> wrong, still uses Lucene 3, and there are others that I saw around
>>>>>>> (e.g. Luke), which means that "The project was abandoned because the
>>>>>>> manner how they integrate with Lucene was not fully functional".
>>>>>>>
>>>>>>> Another interesting thing is that developing around Lucene is more
>>>>>>> effective: you guarantee that your product will work and they guarantee
>>>>>>> that Lucene works too. This is related to design by contract.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>>>>> disrespect, but after thinking it through today,
>>>>>>>> I need to and would really love to have the answer to the following
>>>>>>>> question:
>>>>>>>>
>>>>>>>> 1) At Lucene indexing time, is it possible to rewrite a read-only index
>>>>>>>> so that some fields are only found in some segments (and how?)
>>>>>>>>
>>>>>>>> Uwe Schindler suggested using different indexes and a MultiReader for my
>>>>>>>> needs, and it probably answers my second question, better formulated as
>>>>>>>> "Is it possible to restrict an index to some of its segments?", as a
>>>>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read the
>>>>>>>> aforementioned segments might do the trick.
>>>>>>>>
>>>>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't
>>>>>>>> solve my needs, as I have around 300000 documents of the following kind:
>>>>>>>
>>>>>>>> READ ONLY Document :
>>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>>> A:
>>>>>>>> B:
>>>>>>>> C:
>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>> EN:
>>>>>>>> EN_Words:
>>>>>>>> EN_Sentences:
>>>>>>>> some DocValues
>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>> DE:
>>>>>>>> DE_Words:
>>>>>>>> DE_Sentences:
>>>>>>>> some DocValues
>>>>>>>> ...
>>>>>>>> There might be hundreds of language packages that my users might use.
>>>>>>>>
>>>>>>>>
>>>>>>>> If I use different indexes
>>>>>>>> indexA for the common stuff,
>>>>>>>> indexEN for the English package,
>>>>>>>> indexDE for the german package,
>>>>>>>>
>>>>>>>> For sure, I will be able to make a big index out of those by using a
>>>>>>>> MultiReader,
>>>>>>>> BUT it really makes a union out of the three indexes (right?), which
>>>>>>>> means I'll have 900000 documents,
>>>>>>>> and the documents in indexA won't have any relation to the documents
>>>>>>>> in indexEN (right?) unless I give each document an id in each index and
>>>>>>>> make a join at query time, which is a big no-no, because I use a
>>>>>>>> queryParser and users may enter queries like
>>>>>>>> "A:gah AND (DE:schlaffen OR EN:sleep)"
>>>>>>>>
>>>>>>>> Or am I mistaken, and is there a way to create a document in three
>>>>>>>> different indexes that stays related through the same docId?
>>>>>>>>
>>>>>>>>
>>>>>>>> My solution if question 1 is possible :
>>>>>>>>
>>>>>>>> In contrast, if I am able to build my index so that my READ ONLY
>>>>>>>> Documents are stored in
>>>>>>>>
>>>>>>>> SEGMENT 1
>>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>>> A:
>>>>>>>> B:
>>>>>>>> C:
>>>>>>>>
>>>>>>>> SEGMENT 2
>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>> EN:
>>>>>>>> EN_Words:
>>>>>>>> EN_Sentences:
>>>>>>>> some DocValues
>>>>>>>>
>>>>>>>> SEGMENT 3
>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>> DE:
>>>>>>>> DE_Words:
>>>>>>>> DE_Sentences:
>>>>>>>> some DocValues
>>>>>>>>
>>>>>>>>
>>>>>>>> I only need to ship SEGMENT 1 in the App and let users download SEGMENT 2
>>>>>>>> or SEGMENT 3 depending on whether they want English or German,
>>>>>>>> and use a composite reader with atomic readers (right?) to use my
>>>>>>>> Frankenstein index at query time with a queryparser.
>>>>>>>>
>>>>>>>>
>>>>>>>> Also, in case question 1 is possible, I would really like to know too if
>>>>>>>> it is possible to remap docIds at build time in a read-only index.
>>>>>>>> An application of this would be:
>>>>>>>>
>>>>>>>> On day 1, I ship my app with 2 language packages: English and German
>>>>>>>> (documents are uniquely identified by a docId... or by an external id,
>>>>>>>> thanks to a docId <-> external id map).
>>>>>>>>
>>>>>>>> On day 2, I ship an additional language package (French), because I'm
>>>>>>>> able to build an index with English, German and French with the exact
>>>>>>>> same docIds for each document as in the index shipped on day 1.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> ------------------------------------------------------------
>>>>>>>> ---------
>>>>>>>> To unsubscribe, e-mail: [email protected]
>>>>>>>> For additional commands, e-mail: [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>
>
>
