Re: Question concerning refs on TestDemoParallelLeafReader
On Mon, Oct 2, 2017 at 2:25 PM, Dawid Weiss wrote:

> > I think the delayed deletes might have to do w/ segment warming?
>
> I'll have to digest the scenario you described tomorrow. I didn't hit any exceptions when running those modified code snippets (which I'd be very grateful to see -- they'd provide an immediate proof something is wrong...).

Yeah, it's disappointing the test didn't fail when you removed it.

If my theory is right (and I'm not sure it is!), removing that code would cause much higher NRT latency after a big merge finished, because the refresh thread, instead of the bg merge thread, would pay the price of going off and building the parallel index for the newly merged segment.

> > I am glad you're finding a use for this crazy class!
>
> It's super-useful for people who wish to tweak the index format at a low level. I dreaded this for a long time, but for us it'd provide many benefits. We have a scenario where documents can be indexed once (and stay in the primary index) and certain derived indexes (features indexed on top of those documents) can be placed in the secondary index. The benefit here is that our data used to index features can change from time to time (as new documents emerge); then we can simply drop those existing secondary indexes and provide up-to-date ones. This saves disk I/O and is still fairly transparent to the rest of the application (because fields never clash between the primary and the secondary index and documents are always aligned).

Great! That's exactly what it should work well for!

> Your 'demo' class is a great example of how this can be done. The class is surely advanced. Read: it crams way too many aspects into one class :) Each of these could be a separate demo:

Sorry :) This is why it's a test class. If you have ideas to make it easier to use, please refactor away!
I think it can open up all sorts of unexpected use cases for Lucene, letting you change your mind / experiment later about how exactly to index your raw content.

> - splitting indexes into parallel ones (primary/secondary), with automatic secondary index creation on merges and startup,
> - folding back secondary index data into the primary index on merges (we don't need it, but I imagine there exists a scenario for this),
> - keeping multiple versions of the secondary index (those "generations").

I agree these are separate concerns if we can tease them out.

> And probably lots more. It's a very interesting advanced use case.
>
> > And how did you find this test :)
>
> I've been looking at ParallelCompositeReader for some time; as I was scanning it internally for its use cases within the code I somehow came across that "demo" class which leveraged its lower-level internals. It did take me some time to go through the class's internal workings because of confusingly named variables (I ended up renaming them to 'primary' and 'secondary' index instead of the original 'parallel'). But hey, I don't complain -- it's still an awesome piece of code!

Thanks :) Keep up the renaming/refactoring!

I am still unsure why I tracked ref counts at the leaf reader level; did this somehow enable re-using the parallel leaf readers on each refresh vs. opening all leaves on each reopen?

Mike McCandless

http://blog.mikemccandless.com
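One way to picture the reuse Mike asks about is a ref-counted cache keyed by segment. The sketch below is plain Java with no Lucene dependency. `SegmentReaderStub` and `RefCountedReaderCache` are hypothetical names invented for illustration, not classes from the test, and the real bookkeeping in TestDemoParallelLeafReader may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a per-segment parallel reader; not a Lucene class. */
final class SegmentReaderStub {
  final String segmentId;
  private int refCount = 1; // the opener holds the first reference
  private boolean closed = false;

  SegmentReaderStub(String segmentId) {
    this.segmentId = segmentId;
  }

  boolean isClosed() {
    return closed;
  }

  void incRef() {
    if (closed) {
      throw new IllegalStateException("reader for " + segmentId + " is already closed");
    }
    refCount++;
  }

  /** Drops one reference; returns true if that was the last one and the reader closed. */
  boolean decRef() {
    if (closed) {
      throw new IllegalStateException("reader for " + segmentId + " is already closed");
    }
    refCount--;
    if (refCount == 0) {
      closed = true; // a real reader would release its files here
    }
    return closed;
  }
}

/** Hypothetical cache: one shared reader per segment, evicted when its count hits zero. */
final class RefCountedReaderCache {
  private final Map<String, SegmentReaderStub> cache = new HashMap<>();
  private int opens = 0;

  /** Returns the cached reader with one more reference, or opens a new one. */
  synchronized SegmentReaderStub acquire(String segmentId) {
    SegmentReaderStub reader = cache.get(segmentId);
    if (reader == null) {
      reader = new SegmentReaderStub(segmentId); // the expensive step in real life
      cache.put(segmentId, reader);
      opens++;
    } else {
      reader.incRef();
    }
    return reader;
  }

  /** Gives back one reference; the last release closes and evicts the reader. */
  synchronized void release(SegmentReaderStub reader) {
    if (reader.decRef()) {
      cache.remove(reader.segmentId);
    }
  }

  synchronized int openCount() {
    return opens;
  }
}
```

In this model, a reopen that still contains segments `_0` and `_1` only pays for the segments it has not seen before. The old point-in-time view can be released without closing readers the new view still holds.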
Re: Question concerning refs on TestDemoParallelLeafReader
Thanks again for the explanation, Mike. I understand it now.

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
Re: Question concerning refs on TestDemoParallelLeafReader
Hi Mike,

Thanks for the feedback.

> I think the delayed deletes might have to do w/ segment warming?

I'll have to digest the scenario you described tomorrow. I didn't hit any exceptions when running those modified code snippets (which I'd be very grateful to see -- they'd provide an immediate proof something is wrong...).

> I am glad you're finding a use for this crazy class!

It's super-useful for people who wish to tweak the index format at a low level. I dreaded this for a long time, but for us it'd provide many benefits. We have a scenario where documents can be indexed once (and stay in the primary index) and certain derived indexes (features indexed on top of those documents) can be placed in the secondary index. The benefit here is that our data used to index features can change from time to time (as new documents emerge); then we can simply drop those existing secondary indexes and provide up-to-date ones. This saves disk I/O and is still fairly transparent to the rest of the application (because fields never clash between the primary and the secondary index and documents are always aligned).

Your 'demo' class is a great example of how this can be done. The class is surely advanced. Read: it crams way too many aspects into one class :) Each of these could be a separate demo:

- splitting indexes into parallel ones (primary/secondary), with automatic secondary index creation on merges and startup,
- folding back secondary index data into the primary index on merges (we don't need it, but I imagine there exists a scenario for this),
- keeping multiple versions of the secondary index (those "generations").

And probably lots more. It's a very interesting advanced use case.

> And how did you find this test :)

I've been looking at ParallelCompositeReader for some time; as I was scanning it internally for its use cases within the code I somehow came across that "demo" class which leveraged its lower-level internals.
It did take me some time to go through the class's internal workings because of confusingly named variables (I ended up renaming them to 'primary' and 'secondary' index instead of the original 'parallel'). But hey, I don't complain -- it's still an awesome piece of code!

Dawid
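The primary/secondary arrangement described in this message can be modelled in a few lines of plain Java. This is only a sketch under loud assumptions: `ParallelIndexSketch`, its in-memory lists, and the `deriveFields` function are hypothetical illustrations, not Lucene APIs, and a real setup would work per segment on disk. It shows the two properties the message relies on: derived fields live at the same doc id as their source document, and the secondary side can be dropped and rebuilt without touching the primary.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical in-memory model of a primary index plus a disposable secondary index. */
final class ParallelIndexSketch {
  private final List<Map<String, String>> primary = new ArrayList<>();
  private List<Map<String, String>> secondary = new ArrayList<>();
  private int generation = 0;

  /** Indexes a document once; it stays in the primary index from then on. */
  void addDocument(Map<String, String> fields) {
    primary.add(new HashMap<>(fields));
  }

  /** Drops the current secondary index and derives a fresh one, in primary doc order. */
  void rebuildSecondary(Function<Map<String, String>, Map<String, String>> deriveFields) {
    List<Map<String, String>> fresh = new ArrayList<>();
    for (Map<String, String> doc : primary) {
      fresh.add(new HashMap<>(deriveFields.apply(doc)));
    }
    secondary = fresh; // the old generation is simply discarded
    generation++;
  }

  int generation() {
    return generation;
  }

  /** Combined view of one document: primary fields plus derived fields at the same doc id. */
  Map<String, String> document(int docId) {
    if (secondary.size() != primary.size()) {
      throw new IllegalStateException("secondary index is stale; rebuild it first");
    }
    Map<String, String> combined = new HashMap<>(primary.get(docId));
    for (Map.Entry<String, String> e : secondary.get(docId).entrySet()) {
      // Field names must never clash between the two sides.
      if (combined.putIfAbsent(e.getKey(), e.getValue()) != null) {
        throw new IllegalStateException("field clash: " + e.getKey());
      }
    }
    return combined;
  }
}
```

Calling `rebuildSecondary` with a different derivation function replaces every derived field in one step. The primary documents are read but never rewritten.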
Re: Question concerning refs on TestDemoParallelLeafReader
On Mon, Oct 2, 2017 at 9:34 AM Michael McCandless wrote:

> I am glad you're finding a use for this crazy class! I think it is a powerful way for Lucene to efficiently add "derived fields" at search time.

+1 agreed! Could be used for NRT updates as well. But it's very expert; it'd be nice if it were easier to use to achieve higher-level goals.

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com
Re: Question concerning refs on TestDemoParallelLeafReader
I think the delayed deletes might have to do w/ segment warming?

I.e., after a merge finishes, but before IW exposes that segment in the current SIS, it's warmed, at which point (via the merged segment warmer the test installs) we build its parallel index, but then I think (maybe!) its parallel reader is closed? But we don't want to rm its index directory, because on the next NRT refresh the merged segment becomes live and we will open that parallel index.

This ensures that it's the BG merge thread that pays the cost to build the parallel index, not the NRT reopen thread, keeping NRT reopen latency low (ish).

I am glad you're finding a use for this crazy class! I think it is a powerful way for Lucene to efficiently add "derived fields" at search time. Can you share any details on how you are using it?

And how did you find this test :)

Mike McCandless

http://blog.mikemccandless.com

On Sun, Oct 1, 2017 at 7:01 AM, Dawid Weiss wrote:

> > I'll have to think about the first 2 questions still, but MDW stands for MockDirectoryWrapper!
>
> Ah, sure thing. For what it's worth, I locally removed this delayed 'delete' list and removed the leaf folder immediately -- the tests passed without any problems on my Windows machine. Could be I didn't hit the corner case, so I'm interested in any follow-up you might have, Mike.
>
> Dawid
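The theory above can be sketched in plain Java, with no Lucene dependency. Everything here is an assumption made for illustration: `DelayedDeleteSketch` and its method names are hypothetical, the `onDisk` set stands in for real directories, and `prune` is assumed to run only with a live set that already includes the newly merged segment. It is not the test's actual code.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/** Hypothetical bookkeeping for per-segment parallel index directories; not Lucene code. */
final class DelayedDeleteSketch {
  /** Segment ids whose parallel index directory currently exists (stands in for the disk). */
  private final Set<String> onDisk = new HashSet<>();
  /** Segment ids whose parallel reader was closed: candidates for deletion, nothing more. */
  private final Set<String> closedSegments = new HashSet<>();
  /** How many times the refresh path had to build a parallel index itself. */
  private int buildsOnRefresh = 0;

  /** Merge-thread warmer: build the parallel index, open a reader on it, close it again. */
  synchronized void warmMergedSegment(String segId) {
    onDisk.add(segId);
    closedSegments.add(segId); // reader closed, directory deliberately kept
  }

  /** Refresh path: open the parallel reader, building the index only if it is missing. */
  synchronized void openOnRefresh(String segId) {
    if (onDisk.add(segId)) {
      buildsOnRefresh++; // the slow case the warmer is meant to avoid
    }
    closedSegments.remove(segId); // in use again, so no longer a deletion candidate
  }

  /** Reader close hook: only record the close; deletion happens later in prune(). */
  synchronized void onReaderClosed(String segId) {
    closedSegments.add(segId);
  }

  /** Deletes directories that are both closed and absent from the live segment set. */
  synchronized void prune(Set<String> liveSegments) {
    Iterator<String> it = closedSegments.iterator();
    while (it.hasNext()) {
      String segId = it.next();
      if (!liveSegments.contains(segId)) {
        onDisk.remove(segId);
        it.remove();
      }
    }
  }

  synchronized Set<String> directoriesOnDisk() {
    return new HashSet<>(onDisk);
  }

  synchronized int buildsOnRefresh() {
    return buildsOnRefresh;
  }
}
```

If the close hook deleted the directory immediately, the index built by `warmMergedSegment` would be gone before the refresh, and `openOnRefresh` would have to rebuild it. That extra work would show up as added latency, not as a test failure, which may be why removing the delayed delete list still passed.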
Re: Question concerning refs on TestDemoParallelLeafReader
> I'll have to think about the first 2 questions still, but MDW stands for MockDirectoryWrapper!

Ah, sure thing. For what it's worth, I locally removed this delayed 'delete' list and removed the leaf folder immediately -- the tests passed without any problems on my Windows machine. Could be I didn't hit the corner case, so I'm interested in any follow-up you might have, Mike.

Dawid
Re: Question concerning refs on TestDemoParallelLeafReader
Hi Dawid,

I'll have to think about the first 2 questions still, but MDW stands for MockDirectoryWrapper!

Mike McCandless

http://blog.mikemccandless.com

On Fri, Sep 29, 2017 at 5:55 AM, Dawid Weiss wrote:

> This one is probably for Mike since he originally wrote this "demo", but sending to dev@ for posterity.
>
> I'm looking at something similar to what is shown in TestDemoParallelLeafReader -- creating (or recreating) secondary segments on the fly based on primary segments' data. I spent some time going through the code in TestDemoParallelLeafReader to understand how it works and I get the gist of it, but there are certain things I don't quite grasp.
>
> 1) What's the purpose of handling all the refcounts on the secondary index LeafReaders? I get there is a cache of those LeafReaders and it sort of updates itself automatically on zero count, but why bother with refcounting at all -- wouldn't it be simpler to assume that when you acquire the ParallelLeafDirectoryReader wrapper, everything (primary and secondary leaf readers) is simply closed when the parent is closed? Is the refcounting an optimization for NRT-heavy reopens (where indeed I see the point where those caches may be handy)?
>
> 2) I don't get the purpose of keeping the closedSegments lookup. See this snippet:
>
> > dir.close();
> >
> > // Must do this after dir is closed, else another thread could "rm -rf" while we are closing
> > // (which makes MDW.close's checkIndex angry):
> > closedSegments.add(segIDGen);
>
> What's odd to me is that dir is closed in the code above (and so it is closed in the ParallelReaderClosed hook invoked on the leaf reader's close callback, before the segment is added to closedSegments). What does MDW refer to here?
>
> Dawid
Question concerning refs on TestDemoParallelLeafReader
This one is probably for Mike since he originally wrote this "demo", but sending to dev@ for posterity.

I'm looking at something similar to what is shown in TestDemoParallelLeafReader -- creating (or recreating) secondary segments on the fly based on primary segments' data. I spent some time going through the code in TestDemoParallelLeafReader to understand how it works and I get the gist of it, but there are certain things I don't quite grasp.

1) What's the purpose of handling all the refcounts on the secondary index LeafReaders? I get there is a cache of those LeafReaders and it sort of updates itself automatically on zero count, but why bother with refcounting at all -- wouldn't it be simpler to assume that when you acquire the ParallelLeafDirectoryReader wrapper, everything (primary and secondary leaf readers) is simply closed when the parent is closed? Is the refcounting an optimization for NRT-heavy reopens (where indeed I see the point where those caches may be handy)?

2) I don't get the purpose of keeping the closedSegments lookup. See this snippet:

> dir.close();
>
> // Must do this after dir is closed, else another thread could "rm -rf" while we are closing
> // (which makes MDW.close's checkIndex angry):
> closedSegments.add(segIDGen);

What's odd to me is that dir is closed in the code above (and so it is closed in the ParallelReaderClosed hook invoked on the leaf reader's close callback, before the segment is added to closedSegments). What does MDW refer to here?

Dawid