I don't think you need to be concerned much with the internal docIDs. Just imagine the indexes as one big table with multiple columns, where the columns are grouped together and each group is a different index. If a document does not have a value in one column, you have an empty cell. If a document doesn't have a value in an entire group of columns, you denote that by adding an empty document to that index.
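To make the table picture concrete, here is a toy sketch in plain Java. It does not call Lucene at all: each "index" is just a list of maps, and `ParallelIndexSketch`, `buildGroup` and `row` are made-up names for illustration. It only shows the alignment rule, namely that every group gets one entry per source document, in the same order, even when that entry is empty. Segment alignment, which a real parallel index also needs, is not modeled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ParallelIndexSketch {

    /** Projects every source document onto one group of fields, one entry per document. */
    public static List<Map<String, String>> buildGroup(
            List<Map<String, String>> source, List<String> fields) {
        List<Map<String, String>> index = new ArrayList<>();
        for (Map<String, String> doc : source) {      // same iteration order for every group
            Map<String, String> projected = new LinkedHashMap<>();
            for (String field : fields) {
                String value = doc.get(field);
                if (value != null) {                   // a missing field is just an empty cell
                    projected.put(field, value);
                }
            }
            index.add(projected);                      // added even if empty: the placeholder document
        }
        return index;
    }

    /** Reads position i across several groups and merges the fields into one view. */
    public static Map<String, String> row(int i, List<List<Map<String, String>>> groups) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (List<Map<String, String>> group : groups) {
            merged.putAll(group.get(i));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Map<String, String>> source = List.of(
            Map.of("A", "a0", "B", "b0", "EN_Words", "sleep", "DE_Words", "schlafen"),
            Map.of("A", "a1", "EN_Words", "eat"),      // not translated into German
            Map.of("A", "a2", "DE_Words", "trinken"));

        List<Map<String, String>> common = buildGroup(source, List.of("A", "B", "C"));
        List<Map<String, String>> german = buildGroup(source, List.of("DE_Words", "DE_Sentences"));

        System.out.println(german.size() == common.size()); // both groups have three entries
        System.out.println(german.get(1).isEmpty());        // position 1 is an empty placeholder
        System.out.println(row(2, List.of(common, german)));
    }
}
```

If the placeholder at position 1 were skipped, the German entry for the third document would slide to position 1 and be paired with the wrong common fields.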
Oh, and make sure to use a LogMergePolicy, so segments are merged in the same order across all indexes.

And given that you rebuild the indexes every time, you can create them one-by-one. You don't need to do that in parallel to all indexes, unless it's more convenient for you.

Shai

On Fri, May 2, 2014 at 9:28 AM, Olivier Binda <olivier.bi...@wanadoo.fr> wrote:

> On 05/02/2014 06:05 AM, Shai Erera wrote:
>
>> If you're always rebuilding, let alone forceMerge, you shouldn't have too
>> much trouble implementing it. Just make sure that you add documents in the
>> same order to all indexes.
>>
>> If you're always rebuilding, how come you have deletions? Anyway, you must
>> also delete in all indexes.
>
> Indeed, I don't have deletions and I'm mainly concerned with merges.
> But I just want to understand the whole docId remapping process,
> out of curiosity and also because obtaining a docId (and not losing it)
> seems to be the key of parallel indexes
>
> On May 2, 2014 1:57 AM, "Olivier Binda" <olivier.bi...@wanadoo.fr> wrote:
>>
>> On 05/01/2014 10:28 AM, Shai Erera wrote:
>>>
>>>> I'm glad it helped you. Good luck with the implementation.
>>>
>>> Thanks. First I started looking at the lucene internal code. To understand
>>> when/where and why docIds are changing/need to be changed (in merge and doc
>>> deletions).
>>> I have always wanted to understand this and I think the understanding may
>>> help me somehow.
>>>
>>>> One thing I didn't mention (though it's in the jdocs) -- it's not enough to
>>>> have the documents of each index aligned, you also have to have the
>>>> segments aligned. That is, if both indexes have documents 0-5 aligned, but
>>>> one index contains a single segment and the other one 2 segments, that's
>>>> not going to work.
>>>
>>> That's good to know.
>>>
>>>> It is possible to do w/ some care -- when you build the German index,
>>>> disable merges (use NoMergePolicy) and flush whenever you indexed enough
>>>> documents to match an existing segment on e.g. the Common index.
>>>>
>>>> Or, if rebuilding all indexes won't take long, you can always rebuild all
>>>> of them.
>>>
>>> Yes. That's what I am usually doing (it takes less than 1 minute)
>>> Yet, I usually do a forceMerge too to only have 1 segment :/
>>>
>>>> Shai
>>>>
>>>> On Thu, May 1, 2014 at 12:00 AM, Olivier Binda <olivier.bi...@wanadoo.fr> wrote:
>>>>
>>>> On 04/30/2014 10:48 AM, Shai Erera wrote:
>>>>
>>>>>> I hope I got all the details right, if I didn't then please clarify. Also,
>>>>>> I haven't read the entire thread, so if someone already suggested this ...
>>>>>> well, it probably means it's the right solution :)
>>>>>>
>>>>>> It sounds like you could use Lucene's ParallelCompositeReader, which
>>>>>> already handles multiple IndexReaders that are aligned by their internal
>>>>>> document IDs. The way it would work, as far as I understand your scenario
>>>>>> is something like the following table (columns denote different indexes).
>>>>>> Each index contains a subset of relevant fields, where common contains the
>>>>>> common fields, and each language index contains the respective language
>>>>>> fields.
>>>>>>
>>>>>> DocID        LuceneID  Common  English       German        ....
>>>>>> "FirstDoc"   0         A,B,C   EN_words,     DE_words,
>>>>>>                                EN_sentences  DE_sentences
>>>>>> "SecondDoc"  1         A,B,C
>>>>>> "ThirdDoc"   2         A,B,C
>>>>>>
>>>>>> Each index can contain all relevant fields, or only a subset (e.g. maybe
>>>>>> not all documents have a value for the 'B' field in the 'common' index).
>>>>>> What's absolutely very important here though is that the indexes are
>>>>>> created very carefully, and if e.g.
>>>>>> SecondDoc is not translated into German, *you must still have an empty
>>>>>> document* in the German index, or otherwise, document IDs will not align.
>>>>>
>>>>> That's exactly how I saw it and what I need to do. So, I'll have a very
>>>>> good look at ParallelCompositeReader
>>>>>
>>>>>> Lucene does not offer a way to build those indexes though (patches
>>>>>> welcome!!).
>>>>>
>>>>> This answers my question 1. Thanks. :)
>>>>> I somehow hoped that there was already support for that kind of situation
>>>>> in lucene but well,
>>>>> now at least I know that I won't find an already made solution to my
>>>>> problem in the lucene classes and that I will have to code one myself,
>>>>> by taking inspiration in the lucene classes that do similar processing.
>>>>>
>>>>>> We've started some effort a very long time ago on LUCENE-1879
>>>>>> (there's a patch and a discussion for an alternative approach) as well as
>>>>>> there is a very useful suggestion in ParallelCompositeReader's jdocs (use
>>>>>> LogDocMergePolicy).
>>>>>
>>>>> Wow, priceless. This gives me some headstart and inspiration. :)
>>>>>
>>>>>> One challenge is how to support multi-threaded indexing, but perhaps this
>>>>>> isn't a problem in your application? It sounds like, by you writing that a
>>>>>> user will "download the german index", that the indexes are built offline?
>>>>>
>>>>> Indeed. The index is built offline, in a single thread, and once it is
>>>>> built, it is read only.
>>>>> Can't find an easier situation. :)
>>>>>
>>>>>> Another challenge is how to control segment merging, so that the *exact
>>>>>> same segments* are merged over the parallel indexes. Again, if your
>>>>>> application builds the indexes offline, then this should be easier to
>>>>>> accomplish.
>>>>>>
>>>>>> I assume though that when you index e.g.
>>>>>> the German documents, then the already indexed 'common' fields do not
>>>>>> change for a document. If they do, you will need to rebuild the 'common'
>>>>>> index too.
>>>>>>
>>>>>> Once you achieve a correct parallel index, it is very easy to open a
>>>>>> ParallelCompositeReader on any subset of the indexes, e.g. Common+English,
>>>>>> Common+German, or Common+English+German and search it, since the internal
>>>>>> document IDs are perfectly aligned.
>>>>>>
>>>>>> Shai
>>>>>
>>>>> Many thanks for the awesome answer and the help (I love you).
>>>>> As I really really really need this to happen, I'm going to start working
>>>>> on this really soon.
>>>>>
>>>>> I'm definitely not an expert on threads/filesystems/and lucene inner
>>>>> workings, so I can't promise to contribute a miraculous patch though.
>>>>> Especially since I won't work on the multi-thread aspect of the problem.
>>>>> But I'll do the best I can and contribute back whatever code I can produce.
>>>>>
>>>>> Many thanks, again.
>>>>> :)
>>>>>
>>>>> On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <jose.carlos.can...@gmail.com> wrote:
>>>>>
>>>>>>> My suggestion is you not worry about the docId, in practice it is an
>>>>>>> "internal lucene" id, quite similar with a rowId on a database, each index
>>>>>>> may generate a different docId (it is their problem) from a translated
>>>>>>> document, you may use your own ID that relates one document to another on
>>>>>>> different index mainly because like you mention are translated documents
>>>>>>> that on theory can be ranked differently from language to language (it is
>>>>>>> not an obligation that a set of documents on different languages spams the
>>>>>>> same rank order but I am not 100% sure about this),
>>>>>>>
>>>>>>> Second reason is that 'they may change the internal structure of lucene
>>>>>>> without warrant', and then you lose the forward compatibility.
>>>>>>>
>>>>>>> I am not an expert on Lucene like Schindler, but reading their
>>>>>>> documentation understood that they have a special attention on
>>>>>>> "internal lucene" and "experimental lucene" which means internal is "non
>>>>>>> warrant compatible", and experimental "may be removed".
>>>>>>>
>>>>>>> For example they (apache-lucene) discover a "new manner" to relate each
>>>>>>> document that is more efficient and change some mechanism, then your
>>>>>>> application uses an internal mechanism that is high coupled with lucene
>>>>>>> version xxx (marked as "internal-lucene") you can stuck on a specific
>>>>>>> version and on future have to rewrite some code because and this might
>>>>>>> cause some "management conflict" if your project follows a continuous
>>>>>>> integration and you are subordinated on a management structure (bad to
>>>>>>> you).
>>>>>>>
>>>>>>> I saw this on several projects that uses Lucene around they do not upgrade
>>>>>>> their lucene components on their new releases one example if I am not wrong
>>>>>>> still uses Lucene 3 and other that I saw around (e.g. Luke) which means
>>>>>>> that "The project was abandoned because the manner how they integrate with
>>>>>>> Lucene was not fully functional".
>>>>>>>
>>>>>>> Another interesting thing is that developing around Lucene is more
>>>>>>> effective, you guarantee that your product will work and they guarantee
>>>>>>> that Lucene works too. This is related with design by contract.
>>>>>>>
>>>>>>> Regards.
>>>>>>>
>>>>>>> On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.bi...@wanadoo.fr> wrote:
>>>>>>>
>>>>>>>> Hello.
>>>>>>>>
>>>>>>>> Sorry to bring this up again. I don't want to be rude and I mean no
>>>>>>>> disrespect, but after thinking it through today,
>>>>>>>> I need to and would really love to have the answer to the following
>>>>>>>> question :
>>>>>>>>
>>>>>>>> 1) At lucene indexing time, is it possible to rewrite a read-only index so
>>>>>>>> that some fields are only found in some segments (and how ?)
>>>>>>>>
>>>>>>>> Uwe Schindler suggested using different index and a MultiReader for my
>>>>>>>> needs and It probably answers my second question, better formulated as "Is
>>>>>>>> it possible to restrict an index to some of its segments ?" as a
>>>>>>>> CompositeReader with AtomicReaders (or a custom Directory) that read the
>>>>>>>> aforementioned segments might do the trick
>>>>>>>>
>>>>>>>> Yet, if I am not mistaken (please tell me if I am wrong), it doesn't solve
>>>>>>>> my needs as I have around 300000 documents of the following kind :
>>>>>>>>
>>>>>>>> READ ONLY Document :
>>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>>> A:
>>>>>>>> B:
>>>>>>>> C:
>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>> EN:
>>>>>>>> EN_Words:
>>>>>>>> EN_Sentences:
>>>>>>>> some DocValues
>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>> DE:
>>>>>>>> DE_Words:
>>>>>>>> DE_Sentences:
>>>>>>>> some DocValues
>>>>>>>> ...
>>>>>>>> There might be hundreds of language package that my users might use
>>>>>>>>
>>>>>>>> If I use different indexes
>>>>>>>> indexA for the common stuff,
>>>>>>>> indexEN for the English package,
>>>>>>>> indexDE for the german package,
>>>>>>>>
>>>>>>>> For sure, I will be able to make a big index out of those by using a
>>>>>>>> MultiReader
>>>>>>>> BUT it really makes a union out of the three index (right ?) which means
>>>>>>>> I'll have 900000 documents
>>>>>>>> and the documents in the indexA won't have any relations to the documents
>>>>>>>> in indexEN (right ?) except if I give each document an id in each index and
>>>>>>>> make a join at query time which is a big no no, because I use a queryParser
>>>>>>>> and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)"
>>>>>>>>
>>>>>>>> Or I am mistaken and there is a way to create a document in three
>>>>>>>> different index that stay in relations with the same docId ?
>>>>>>>>
>>>>>>>> My solution if question 1 is possible :
>>>>>>>>
>>>>>>>> In contrast, if I am able to build my index so that my READ ONLY Document
>>>>>>>> are stored in
>>>>>>>>
>>>>>>>> SEGMENT 1
>>>>>>>> // common fields shipped with the App that aren't language related
>>>>>>>> A:
>>>>>>>> B:
>>>>>>>> C:
>>>>>>>>
>>>>>>>> SEGMENT 2
>>>>>>>> // fields shipped with the English package (a zip)
>>>>>>>> EN:
>>>>>>>> EN_Words:
>>>>>>>> EN_Sentences:
>>>>>>>> some DocValues
>>>>>>>>
>>>>>>>> SEGMENT 3
>>>>>>>> // fields shipped with the German package (a zip)
>>>>>>>> DE:
>>>>>>>> DE_Words:
>>>>>>>> DE_Sentences:
>>>>>>>> some DocValues
>>>>>>>>
>>>>>>>> I only need to ship SEGMENT 1 in the App and let users download SEGMENT 2
>>>>>>>> or SEGMENT 3 whether they want english or german
>>>>>>>> and use a composite reader with atomic readers (right ?) to use my
>>>>>>>> frankenstein index at query time with a queryparser
>>>>>>>>
>>>>>>>> Also, In case question 1 is possible. I would really like to know too, if
>>>>>>>> it is possible to remap at build time docIds in a read-only index.
>>>>>>>> An application of this would be :
>>>>>>>>
>>>>>>>> At day 1, I ship my app with 2 languages packages : English and german
>>>>>>>> (documents are uniquely identified by a docId...
>>>>>>>> or by an external id
>>>>>>>> (thanks to a docId <-> external id map)
>>>>>>>>
>>>>>>>> At day 2, I ship an additional language package (French) because I'm able
>>>>>>>> to build an index with English, German, French with the same exact docIds
>>>>>>>> for each document that the index shipped at day 1
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>>>>>>>> For additional commands, e-mail: java-user-h...@lucene.apache.org