Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-03 Thread Michael McCandless
On Mon, Oct 2, 2017 at 2:25 PM, Dawid Weiss  wrote:

> I think the delayed deletes might have to do w/ segment warming?
>
> I'll have to digest the scenario you described tomorrow. I didn't hit
> any exceptions when running those modified code snippets (which I'd be
> very grateful to see -- they'd provide an immediate proof something is
> wrong...).


Yeah, it's disappointing the test didn't fail when you removed it.  If my
theory is right (and I'm not sure it is!), removing that code would cause
much higher NRT latency after a big merge finishes, because the refresh
thread, rather than the bg merge thread, would pay the price of going off
and building the parallel index for the newly merged segment.

> I am glad you're finding a use for this crazy class!
>
> It's super-useful for people who wish to low-level tweak the index
> format. I dreaded this for a long time, but for us it'd provide many
> benefits. We have a scenario where documents can be indexed once (and
> stay in the primary index) and certain derived indexes (features
> indexed on top of those documents) can be placed in the secondary
> index. The benefit here is that our data used to index features can
> change from time to time (as new documents emerge); then we can simply
> drop those existing secondary indexes and provide up-to-date ones.
> This saves disk I/O and is still fairly transparent to the rest of the
> application (because fields never clash between the primary and the
> secondary index and documents are always aligned).
>

Great!  That's exactly what it should work well for!


> Your 'demo' class is a great example of how this can be done. The
> class is surely advanced. Read: it crams way too many aspects into one
> class :) Each of these could be a separate demo:
>

Sorry :)  This is why it's a test class.

If you have ideas to make it easier to use, please refactor away!  I think
it can open up all sorts of unexpected use cases for Lucene, letting you
change your mind / experiment later about how exactly to index your raw
content.


> - splitting indexes into parallel ones (primary/ secondary), with
> automatic secondary index creation on merges and startup.
> - folding back secondary index data into the primary index on merges
> (we don't need it, but I imagine there exists a scenario for this),
> - keeping multiple versions of the secondary index (those "generations").
>

I agree these are separate concerns if we can tease them out.


> And probably lots more. It's a very interesting advanced use case.
>
> > And how did you find this test :)
>
> I've been looking at ParallelCompositeReader for some time; as I was
> scanning it internally for its use cases within the code I somehow
> came across that "demo" class which leveraged its lower-level
> internals. It did take me some time to go through the class's internal
> workings because of confusingly named variables (I ended up renaming
> them to 'primary' and 'secondary' index instead of the original
> 'parallel'). But hey, I don't complain -- it's still an awesome piece
> of code!


Thanks :)  Keep up the renaming/refactoring!

I am still unsure why I tracked ref counts at the leaf reader level; did
this somehow enable re-using the parallel leaf readers on each refresh vs.
opening all leaves on each reopen?
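
To make the reuse question concrete, here is a minimal sketch in plain Java
of how per-leaf ref counts could let successive reopens share secondary leaf
readers. It is not the test's actual code: SecondaryLeaf and
SecondaryLeafCache are hypothetical names, and the `opens` counter stands in
for the real cost of opening a parallel leaf reader.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an expensive-to-open secondary (parallel) leaf reader. */
class SecondaryLeaf {
  final String segmentId;
  private int refCount = 1; // the first acquirer holds the initial reference
  boolean closed = false;

  SecondaryLeaf(String segmentId) {
    this.segmentId = segmentId;
  }

  void incRef() {
    refCount++;
  }

  /** Drops one reference; returns true if this call closed the leaf. */
  boolean decRef() {
    if (--refCount == 0) {
      closed = true;
      return true;
    }
    return false;
  }
}

/** Cache keyed by segment id; an entry is evicted when its last reference is released. */
class SecondaryLeafCache {
  private final Map<String, SecondaryLeaf> cache = new HashMap<>();
  int opens = 0; // how many times a leaf actually had to be opened

  synchronized SecondaryLeaf acquire(String segmentId) {
    SecondaryLeaf leaf = cache.get(segmentId);
    if (leaf == null) {
      leaf = new SecondaryLeaf(segmentId); // cache miss: pay the open cost
      cache.put(segmentId, leaf);
      opens++;
    } else {
      leaf.incRef(); // cache hit: share the already-open leaf
    }
    return leaf;
  }

  synchronized void release(SecondaryLeaf leaf) {
    if (leaf.decRef()) {
      cache.remove(leaf.segmentId);
    }
  }

  synchronized int size() {
    return cache.size();
  }
}
```

In this model, a reader opened after a refresh that still contains segment
"_0" gets the same SecondaryLeaf instance as the previous reader, and the
leaf is closed only once both readers have released it. Without the counts,
closing the older reader would have to either close a leaf the newer reader
still uses or give up sharing altogether.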

Mike McCandless

http://blog.mikemccandless.com


Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-03 Thread Dawid Weiss
Thanks again for the explanation, Mike. I understand it now.

Dawid

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-02 Thread Dawid Weiss
Hi Mike,

Thanks for the feedback.

> I think the delayed deletes might have to do w/ segment warming?

I'll have to digest the scenario you described tomorrow. I didn't hit
any exceptions when running those modified code snippets (which I'd be
very grateful to see -- they'd provide an immediate proof something is
wrong...).

> I am glad you're finding a use for this crazy class!

It's super-useful for people who wish to low-level tweak the index
format. I dreaded this for a long time, but for us it'd provide many
benefits. We have a scenario where documents can be indexed once (and
stay in the primary index) and certain derived indexes (features
indexed on top of those documents) can be placed in the secondary
index. The benefit here is that our data used to index features can
change from time to time (as new documents emerge); then we can simply
drop those existing secondary indexes and provide up-to-date ones.
This saves disk I/O and is still fairly transparent to the rest of the
application (because fields never clash between the primary and the
secondary index and documents are always aligned).

Your 'demo' class is a great example of how this can be done. The
class is surely advanced. Read: it crams way too many aspects into one
class :) Each of these could be a separate demo:

- splitting indexes into parallel ones (primary/ secondary), with
automatic secondary index creation on merges and startup.
- folding back secondary index data into the primary index on merges
(we don't need it, but I imagine there exists a scenario for this),
- keeping multiple versions of the secondary index (those "generations").

And probably lots more. It's a very interesting advanced use case.

> And how did you find this test :)

I've been looking at ParallelCompositeReader for some time; as I was
scanning it internally for its use cases within the code I somehow
came across that "demo" class which leveraged its lower-level
internals. It did take me some time to go through the class's internal
workings because of confusingly named variables (I ended up renaming
them to 'primary' and 'secondary' index instead of the original
'parallel'). But hey, I don't complain -- it's still an awesome piece
of code!

Dawid




Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-02 Thread David Smiley
On Mon, Oct 2, 2017 at 9:34 AM Michael McCandless 
wrote:

> I am glad you're finding a use for this crazy class!  I think it is a
> powerful way for Lucene to efficiently add "derived fields" at search time.
>

+1 agreed!   Could be used for NRT updates as well.  But very expert; it'd
be nice if it were easier to use to achieve higher-level goals.
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-02 Thread Michael McCandless
I think the delayed deletes might have to do w/ segment warming?

I.e., after a merge finishes, but before IW exposes that segment in the
current SIS, it's warmed, at which point (via the merged segment warmer the
test installs) we build its parallel index, but then I think (maybe!) its
parallel reader is closed?  But we don't want to rm its index directory,
because on the next NRT refresh the merged segment becomes live and we will
open that parallel index.  This ensures that it's the BG merge thread that
pays the cost to build the parallel index, not the NRT reopen thread,
keeping NRT reopen latency low (ish).
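
The cost argument above can be illustrated with a toy model in plain Java.
This is only a sketch of the theory as stated, not the test's code:
ParallelIndexStore and its methods are hypothetical, a set of segment ids
stands in for the parallel index directories on disk, and the two counters
record which thread would have paid for a build.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model: a set of segment ids stands in for parallel index directories on disk. */
class ParallelIndexStore {
  private final Set<String> onDisk = new HashSet<>();
  int buildsInMergeThread = 0;
  int buildsInRefreshThread = 0;

  /** Models the merged-segment warmer, which runs in the background merge thread. */
  synchronized void warm(String segmentId) {
    if (onDisk.add(segmentId)) {
      buildsInMergeThread++; // the parallel index did not exist yet, so build it here
    }
  }

  /** Models the NRT refresh thread opening the parallel index of a now-live segment. */
  synchronized void openForRefresh(String segmentId) {
    if (onDisk.add(segmentId)) {
      buildsInRefreshThread++; // only reached if the directory is missing
    }
  }

  /** Models removing the parallel index directory. */
  synchronized void delete(String segmentId) {
    onDisk.remove(segmentId);
  }
}
```

If the directory survives between warm("_5") and openForRefresh("_5"), the
refresh finds it and builds nothing. If delete("_5") runs in between, as an
immediate delete on reader close would do in this model, the refresh thread
has to rebuild. That shows up as latency rather than as a test failure.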

I am glad you're finding a use for this crazy class!  I think it is a
powerful way for Lucene to efficiently add "derived fields" at search
time.  Can you share any details on how you are using it?
And how did you find this test :)

Mike McCandless

http://blog.mikemccandless.com

On Sun, Oct 1, 2017 at 7:01 AM, Dawid Weiss  wrote:

> > I'll have to think about the first 2 questions still, but MDW stands for
> > MockDirectoryWrapper!
>
> Ah, sure thing. For what it's worth, I locally removed this delayed
> 'delete' list and removed the leaf folder immediately -- the tests
> passed without any problems on my Windows machine. Could be I didn't
> hit the corner case, so I'm interested in any follow-up you might
> have, Mike.
>
> Dawid
>


Re: Question concerning refs on TestDemoParallelLeafReader

2017-10-01 Thread Dawid Weiss
> I'll have to think about the first 2 questions still, but MDW stands for
> MockDirectoryWrapper!

Ah, sure thing. For what it's worth, I locally removed this delayed
'delete' list and removed the leaf folder immediately -- the tests
passed without any problems on my Windows machine. Could be I didn't
hit the corner case, so I'm interested in any follow-up you might
have, Mike.

Dawid




Re: Question concerning refs on TestDemoParallelLeafReader

2017-09-30 Thread Michael McCandless
Hi Dawid,

I'll have to think about the first 2 questions still, but MDW stands for
MockDirectoryWrapper!

Mike McCandless

http://blog.mikemccandless.com

On Fri, Sep 29, 2017 at 5:55 AM, Dawid Weiss  wrote:

> This one is probably to Mike since he originally wrote this "demo",
> but sending to dev@ for posterity.
>
> I'm looking at something similar to what is shown in
> TestDemoParallelLeafReader  -- creating (or recreating) secondary
> segments on the fly based on primary segments' data. I spent some time
> going through the code in TestDemoParallelLeafReader to understand how
> it works and I get the gist of it, but there are certain things I
> don't quite grasp.
>
> 1) What's the purpose of handling all the refcounts on the secondary
> index LeafReaders? I get there is a cache of those LeafReaders and it
> sort of updates itself automatically on zero count, but why bother
> with refcounting at all -- wouldn't it be simpler to assume that when
> you acquire the ParallelLeafDirectoryReader wrapper everything
> (primary and secondary leaf readers) are simply closed when the parent
> is closed? Is the refcounting an optimization for NRT-heavy reopens
> (where indeed I see the point where those caches may be handy)?
>
> 2) I don't get the purpose of keeping closedSegments lookup. See this
> snippet:
>
> > dir.close();
> >
> > // Must do this after dir is closed, else another thread could "rm -rf"
> > // while we are closing (which makes MDW.close's checkIndex angry):
> > closedSegments.add(segIDGen);
>
> What's odd to me is that dir is closed in the code above (and so it is
> closed in ParallelReaderClosed hook invoked on leaf reader's close
> callback, before the segment is added to closedSegments). What does
> MDW refer to here?
>
> Dawid
>


Question concerning refs on TestDemoParallelLeafReader

2017-09-29 Thread Dawid Weiss
This one is probably to Mike since he originally wrote this "demo",
but sending to dev@ for posterity.

I'm looking at something similar to what is shown in
TestDemoParallelLeafReader  -- creating (or recreating) secondary
segments on the fly based on primary segments' data. I spent some time
going through the code in TestDemoParallelLeafReader to understand how
it works and I get the gist of it, but there are certain things I
don't quite grasp.

1) What's the purpose of handling all the refcounts on the secondary
index LeafReaders? I get there is a cache of those LeafReaders and it
sort of updates itself automatically on zero count, but why bother
with refcounting at all -- wouldn't it be simpler to assume that when
you acquire the ParallelLeafDirectoryReader wrapper everything
(primary and secondary leaf readers) are simply closed when the parent
is closed? Is the refcounting an optimization for NRT-heavy reopens
(where indeed I see the point where those caches may be handy)?

2) I don't get the purpose of keeping closedSegments lookup. See this snippet:

> dir.close();
>
> // Must do this after dir is closed, else another thread could "rm -rf"
> // while we are closing (which makes MDW.close's checkIndex angry):
> closedSegments.add(segIDGen);

What's odd to me is that dir is closed in the code above (and so it is
closed in ParallelReaderClosed hook invoked on leaf reader's close
callback, before the segment is added to closedSegments). What does
MDW refer to here?
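
For reference, the ordering in that snippet (finish closing, then publish the
segment id) can be modelled in plain Java as below. This is a sketch under
assumptions, not the test's implementation: ParallelSegmentJanitor, its
fields, and prune are hypothetical names, a Runnable stands in for
dir.close(), and a set of ids stands in for the directories on disk.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class ParallelSegmentJanitor {
  /** Stand-in for the on-disk parallel index directories, keyed by segment id. */
  final Set<String> dirsOnDisk = ConcurrentHashMap.newKeySet();
  /** Segments whose parallel reader and directory have been fully closed. */
  final Set<String> closedSegments = ConcurrentHashMap.newKeySet();

  /** Reader-closed hook: finish closing first, publish the id last. */
  void onParallelReaderClosed(String segmentId, Runnable closeDir) {
    closeDir.run(); // any checks done during close still see the files
    closedSegments.add(segmentId); // only now may a pruning thread remove them
  }

  /** Deletes only directories that are closed and not referenced by a live segment. */
  void prune(Set<String> liveSegmentIds) {
    for (String id : closedSegments) {
      if (!liveSegmentIds.contains(id)) {
        dirsOnDisk.remove(id);
        closedSegments.remove(id);
      }
    }
  }
}
```

Because prune only ever looks at closedSegments, a directory cannot be
removed by another thread while its close is still running. A closed
directory whose segment is still live is kept until a later prune call.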

Dawid
