On Mon, Oct 2, 2017 at 2:25 PM, Dawid Weiss <dawid.we...@gmail.com> wrote:
> I think the delayed deletes might have to do w/ segment warming?
>
> I'll have to digest the scenario you described tomorrow. I didn't hit
> any exceptions when running those modified code snippets (which I'd be
> very grateful to see -- they'd provide an immediate proof something is
> wrong...).

Yeah, it's disappointing the test didn't fail when you removed it.

If my theory is right (and I'm not sure it is!), removing that code
would cause much higher NRT latency after a big merge finishes, because
the refresh thread, instead of the background merge thread, would pay
the price of going off and building the parallel index for the newly
merged segment.

> I am glad you're finding a use for this crazy class!
>
> It's super-useful for people who wish to low-level tweak the index
> format. I dreaded this for a long time, but for us it'd provide many
> benefits. We have a scenario where documents can be indexed once (and
> stay in the primary index) and certain derived indexes (features
> indexed on top of those documents) can be placed in the secondary
> index. The benefit here is that our data used to index features can
> change from time to time (as new documents emerge); then we can simply
> drop those existing secondary indexes and provide up-to-date ones.
> This saves disk I/O and is still fairly transparent to the rest of the
> application (because fields never clash between the primary and the
> secondary index and documents are always aligned).

Great! That's exactly what it should work well for!

> Your 'demo' class is a great example of how this can be done. The
> class is surely advanced. Read: it crams way too many aspects into one
> class :) Each of these could be a separate demo:

Sorry :) This is why it's a test class. If you have ideas to make it
easier to use, please refactor away! I think it can open up all sorts
of unexpected use cases for Lucene, letting you change your mind /
experiment later about how exactly to index your raw content.
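
To make the primary/secondary scenario quoted above concrete, here is a
toy sketch in plain Java. It deliberately uses no Lucene classes: maps
stand in for documents, list positions stand in for doc IDs, and every
name in it (AlignedIndexSketch, rebuildSecondary, and so on) is
hypothetical. It only illustrates the two invariants mentioned in the
thread, namely that documents stay aligned and that field names never
clash, plus the "drop and rebuild the secondary" step.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Toy model only: not Lucene code, and not how the test class works
// internally. List positions play the role of aligned doc IDs.
class AlignedIndexSketch {
    // Primary "index": documents are written once and never rewritten.
    final List<Map<String, String>> primary = new ArrayList<>();
    // Secondary "index": derived fields, one entry per primary document.
    List<Map<String, String>> secondary = new ArrayList<>();

    void addDocument(Map<String, String> doc) {
        primary.add(new HashMap<>(doc));
    }

    // Drop the old secondary index and derive a fresh one, in the same
    // order as the primary so that doc IDs stay aligned.
    void rebuildSecondary(
            Function<Map<String, String>, Map<String, String>> deriveFeatures) {
        List<Map<String, String>> fresh = new ArrayList<>();
        for (Map<String, String> doc : primary) {
            fresh.add(deriveFeatures.apply(doc));
        }
        secondary = fresh;
    }

    // Parallel view of one document: the union of its primary and
    // secondary fields. Fails if the secondary is stale or a field clashes.
    Map<String, String> document(int docId) {
        if (secondary.size() != primary.size()) {
            throw new IllegalStateException("secondary index is not aligned");
        }
        Map<String, String> view = new HashMap<>(primary.get(docId));
        for (Map.Entry<String, String> e : secondary.get(docId).entrySet()) {
            if (view.putIfAbsent(e.getKey(), e.getValue()) != null) {
                throw new IllegalStateException("field clash: " + e.getKey());
            }
        }
        return view;
    }
}
```

In this model, changing how features are computed means calling
rebuildSecondary again with a different function; the primary list is
never touched, which is the disk I/O saving described in the quoted
text.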
> - splitting indexes into parallel ones (primary/secondary), with
> automatic secondary index creation on merges and startup.
> - folding back secondary index data into the primary index on merges
> (we don't need it, but I imagine there exists a scenario for this),
> - keeping multiple versions of the secondary index (those "generations").

I agree these are separate concerns if we can tease them out.

> And probably lots more. It's a very interesting advanced use case.
>
> > And how did you find this test :)
>
> I've been looking at ParallelCompositeReader for some time; as I was
> scanning it internally for its use cases within the code I somehow
> came across that "demo" class which leveraged its lower-level
> internals. It did take me some time to go through the class's internal
> workings because of confusingly named variables (I ended up renaming
> them to 'primary' and 'secondary' index instead of the original
> 'parallel'). But hey, I don't complain -- it's still an awesome piece
> of code!

Thanks :) Keep up the renaming/refactoring!

I am still unsure why I tracked ref counts at the leaf reader level; did
this somehow enable re-using the parallel leaf readers on each refresh
vs. opening all leaves on each reopen?

Mike McCandless

http://blog.mikemccandless.com