Re: Is there a way to customize segment names?
Hi Mike, Robert,

Thanks for replying. The system is almost what Mike has described: one writer is primary, and the other is catching up and waiting. But in our internal discussion we found there is a small chance that the secondary mistakenly thinks it is primary (due to errors in another component) while the primary is still alive, and thus runs into the situation I described. Because we want to tolerate that error in case we can't prevent it from happening, we're looking at customizing filenames.

Thanks again for discussing this with me. I've learnt that playing with filenames can become quite troublesome, but still, even if only out of my own curiosity: is there some way we can control the segment names?

Best
Patrick

On Fri, Dec 16, 2022 at 6:36 AM Michael Sokolov wrote:
> +1, trying to coordinate multiple writers running independently will
> not work. My 2c for availability: you can have a single primary active
> writer with a backup one waiting, receiving all the segments from the
> primary. Then if the primary goes down, the secondary has the most
> recent commit replicated from the primary (identical commit, same
> segments etc.) and can pick up from there. You would need a mechanism
> to replay the writes the primary never had a chance to commit.
>
> On Fri, Dec 16, 2022 at 5:41 AM Robert Muir wrote:
>> You are still talking "multiple writers". Like I said, going down this
>> path (playing tricks with filenames) isn't going to work out well.
>>
>> On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai wrote:
>>> Hi Robert,
>>>
>>> Maybe I didn't explain it clearly, but we're not going to constantly
>>> switch between writers or share effort between writers; it's purely
>>> for availability: the second writer only kicks in when the first
>>> writer is not available for some reason.
>>> And as far as I know the replicator/nrt module has not provided a
>>> solution for when the primary node (main indexer) is down: how would
>>> we recover with a backup indexer?
>>>
>>> Thanks
>>> Patrick
>>>
>>> On Thu, Dec 15, 2022 at 7:16 PM Robert Muir wrote:
>>>> This multiple-writer setup isn't going to work, and customizing
>>>> names won't allow it anyway. Each file also contains a unique
>>>> identifier tied to its commit so that we know everything is intact.
>>>>
>>>> I would look at the segment replication in lucene/replicator and
>>>> not try to play games with files and mixing multiple writers.
>>>>
>>>> On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai wrote:
>>>>> Hi Folks,
>>>>>
>>>>> We're trying to build a search architecture using segment
>>>>> replication (indexer and searcher are separated, and the indexer
>>>>> ships new segments to searchers), and one of the problems we're
>>>>> facing is: for availability reasons we need to have multiple
>>>>> indexers running, and when the searcher switches from consuming
>>>>> one indexer to another, there are chances that segment names
>>>>> collide with each other (because segment names are count based)
>>>>> and the searcher has to reload the whole index.
>>>>> To avoid that, we're looking for a way to name the segments so
>>>>> that Lucene is able to tell the difference and load only the
>>>>> difference (by calling `openIfChanged`). I've checked the
>>>>> IndexWriter and the DocumentsWriter and it seems this is
>>>>> controlled by a private final method `newSegmentName()`, so it's
>>>>> likely not possible there. So I wonder whether there are any other
>>>>> ways people are aware of to control the segment names?
>>>>>
>>>>> An example of the situation described above:
>>>>> The searcher, previously consuming from indexer 1, has the
>>>>> following segments: _1, _2, _3, _4
>>>>> Indexer 2, previously sync'd from indexer 1, shares the first 3
>>>>> segments and produced its own 4th segment (denoted _4', but it
>>>>> shares the same "_4" name): _1, _2, _3, _4'
>>>>> Suddenly indexer 1 dies and the searcher switches from indexer 1
>>>>> to indexer 2; when it finishes downloading the segments and tries
>>>>> to refresh the reader, it will likely hit the exception here, and
>>>>> it seems all we can do right now is reload the whole index, which
>>>>> could be a potentially high cost.
>>>>>
>>>>> Sorry for the long email and thank you in advance for any replies!
>>>>>
>>>>> Best
>>>>> Patrick

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
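For context on why the names collide: Lucene hands out segment names from a per-writer counter rendered in base 36 behind the private `newSegmentName()` mentioned above. The following is a minimal illustrative sketch, not Lucene's actual code (the class name and starting counter are made up); it shows how two independent writers that resume from the same commit will both produce the same next name for different contents:

```java
// Sketch of count-based segment naming: a per-writer counter, printed in
// base 36 with a leading underscore (_0, _1, ... _z, _10, ...).
public class SegmentNames {
    private long counter = 4; // e.g. both writers resume after segment _3

    String newSegmentName() {
        return "_" + Long.toString(counter++, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        SegmentNames indexer1 = new SegmentNames();
        SegmentNames indexer2 = new SegmentNames();
        // Both writers emit "_4" next, for different data: this is the
        // collision the thread describes (_4 vs. _4').
        System.out.println(indexer1.newSegmentName());
        System.out.println(indexer2.newSegmentName());
    }
}
```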
Re: Heap Size Space and Span Queries
Spans seem to have the problem of creating huge "List"s during query iteration to track some stuff. I never understood the code, but to me it was always crazy to have Lists populated during execution. We replaced all SpanQueries by Intervals in patent search; speed is much faster and heap usage is tiny.

A span/phrase with inOrder=false can always be replaced by a phrase with slop. The slop is always without order, as it is an "edit distance" only (see documentation). If you need in-order matching, an interval is required. Phrases are only in order for slop=0. Compare to slop=1, which means "next to each other" and is no longer in order.

Uwe

On 15.12.2022 at 16:44, Mikhail Khludnev wrote:
> Michael, thanks for stepping in!
>> it seems that simple phrase queries would suffice here in place of
>> spanNear?
> I think it wouldn't. It seems to me 4 is slop, and false is inOrder.
> Sjoerd, can you comment on the particular span queries you use? Also,
> do you have any heap dump summary to confirm high memory consumption
> by spans?
>
> On Thu, Dec 15, 2022 at 5:33 PM Michael Gibney wrote:
>> I don't think that nested boolean disjunctions consisting of isolated
>> spanNear queries at the leaves should have memory issues (as opposed
>> to nested spanNear queries around disjunctions, which might well do).
>> Am I misreading the string representation of that query? A little
>> more explicit information about how the query is built, so that we
>> can be certain of what we're dealing with, would be helpful.
>>
>> It'd certainly be worth trying IntervalQuery -- but part of what
>> makes me think I must be missing something in interpreting the string
>> representation of the query provided: it seems that simple phrase
>> queries would suffice here in place of spanNear? Regarding SpanQuery
>> vs. IntervalQuery performance and characteristics, there's some
>> possibly-relevant discussion on LUCENE-9204:
>> https://issues.apache.org/jira/browse/LUCENE-9204?focusedCommentId=17352589#comment-17352589
>>
>> Michael
>>
>> On Wed, Dec 14, 2022 at 1:27 PM Mikhail Khludnev wrote:
>>> Developers,
>>> Is it expected for Spans? Can IntervalQuery help here?
>>>
>>> On Wed, Dec 14, 2022 at 5:41 PM Sjoerd Smeets wrote:
>>>> Hi,
>>>>
>>>> I've implemented a Span Query parser, and when running the query
>>>> below I'm seeing heap-space messages on certain shards:
>>>>
>>>> o.a.s.s.HttpSolrCall null:java.lang.RuntimeException:
>>>> java.lang.OutOfMemoryError: Java heap space
>>>>
>>>> The span query that I'm running is the following:
>>>>
>>>> ((spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false)
>>>> spanNear([unstemmed_text:pledge, unstemmed_text:account], 4, false))
>>>> spanNear([unstemmed_text:pledge, unstemmed_text:deposit], 4, false))
>>>> spanNear([unstemmed_text:charge, unstemmed_text:deposit], 4, false)
>>>>
>>>> The heap size at the moment is set to 48 GB. We are running 4
>>>> shards in 1 JVM, and the 4 shards combined have 24M docs evenly
>>>> distributed across the shards. We also use the collapse feature.
>>>>
>>>> This is on Solr 8.6.0.
>>>>
>>>> What are the considerations for running span queries and heap sizes?
>>>>
>>>> Any suggestions are welcome.
>>>>
>>>> Sjoerd

--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: u...@thetaphi.de
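To make Uwe's suggestion concrete, here is a rough, untested sketch of rewriting one of the spanNear clauses above with Lucene's intervals API (assuming Lucene 8.x's `org.apache.lucene.queries.intervals` package, which is what Solr 8.6 ships; note that `maxgaps` and spanNear slop are not defined identically, so matches can differ at the margins):

```java
import org.apache.lucene.queries.intervals.IntervalQuery;
import org.apache.lucene.queries.intervals.Intervals;
import org.apache.lucene.queries.intervals.IntervalsSource;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;

public class SpanToIntervals {
    // Roughly spanNear([field:a, field:b], 4, false): an unordered pair
    // of terms with at most 4 positions of gap between them.
    static Query near(String field, String a, String b) {
        IntervalsSource source = Intervals.maxgaps(4,
            Intervals.unordered(Intervals.term(a), Intervals.term(b)));
        return new IntervalQuery(field, source);
    }

    public static void main(String[] args) {
        String f = "unstemmed_text";
        // The original query is a disjunction of four spanNear clauses.
        BooleanQuery.Builder or = new BooleanQuery.Builder();
        or.add(near(f, "charge", "account"), BooleanClause.Occur.SHOULD);
        or.add(near(f, "pledge", "account"), BooleanClause.Occur.SHOULD);
        or.add(near(f, "pledge", "deposit"), BooleanClause.Occur.SHOULD);
        or.add(near(f, "charge", "deposit"), BooleanClause.Occur.SHOULD);
        System.out.println(or.build());
    }
}
```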
Re: Heap Size Space and Span Queries
> It seems to me 4 is slop, and false is inOrder

Yes, sorry, I misspoke; I was wondering whether it'd be possible to replace the uses of SpanNear in this case with something like `"term1 term2"~4` -- this should build a standard `PhraseQuery`, which does support the concept of slop, and I think the default (only) behavior of PhraseQuery analogous to SpanNear `inOrder` is equivalent to `inOrder=false`. But really I was asking because I'm wondering whether the SpanNears are wrapped in a SpanOr query or something (in a way that's not explicit from the provided string representation of the query)?

On Thu, Dec 15, 2022 at 5:01 PM Mikhail Khludnev wrote:
> Hi,
> I scratched a simple qparser plugin to experiment with intervals in
> Solr: https://github.com/mkhludnev/solr-flexible-qparser
> I pushed the jar under releases and described how to use it in
> README.md.
> Sjoerd, if spans really blow the heap, you can give intervals a try
> with this plugin. Notice the minimum Solr version required.
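For reference, the `"term1 term2"~4` rewrite described above corresponds to a sloppy `PhraseQuery`. A minimal sketch (assuming Lucene 8.x; the field and terms are taken from the query in this thread). One caveat on the "order" question: Lucene's slop is an edit distance, so swapping two adjacent terms costs 2 of the slop budget, not 1:

```java
import org.apache.lucene.search.PhraseQuery;

public class SloppyPhrase {
    public static void main(String[] args) {
        // Intent roughly comparable to
        // spanNear([unstemmed_text:charge, unstemmed_text:account], 4, false):
        // both terms within an edit distance of 4, order not strictly
        // required (an adjacent-term swap consumes 2 slop).
        PhraseQuery q =
            new PhraseQuery(4, "unstemmed_text", "charge", "account");
        System.out.println(q);
    }
}
```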
Re: Is there a way to customize segment names?
+1, trying to coordinate multiple writers running independently will not work. My 2c for availability: you can have a single primary active writer with a backup one waiting, receiving all the segments from the primary. Then if the primary goes down, the secondary has the most recent commit replicated from the primary (identical commit, same segments etc.) and can pick up from there. You would need a mechanism to replay the writes the primary never had a chance to commit.

On Fri, Dec 16, 2022 at 5:41 AM Robert Muir wrote:
> You are still talking "multiple writers". Like I said, going down this
> path (playing tricks with filenames) isn't going to work out well.
Re: Is there a way to customize segment names?
You are still talking "multiple writers". Like I said, going down this path (playing tricks with filenames) isn't going to work out well.

On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai wrote:
> Hi Robert,
>
> Maybe I didn't explain it clearly, but we're not going to constantly
> switch between writers or share effort between writers; it's purely
> for availability: the second writer only kicks in when the first
> writer is not available for some reason.
> And as far as I know the replicator/nrt module has not provided a
> solution for when the primary node (main indexer) is down: how would
> we recover with a backup indexer?
>
> Thanks
> Patrick
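The incremental reload discussed in this thread hinges on `DirectoryReader.openIfChanged`, which reuses the per-segment readers whose files are unchanged and only opens the new ones; a name collision like _4 vs. _4' defeats that reuse. A minimal sketch of the refresh pattern (the index path is a placeholder):

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IncrementalRefresh {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/index"));
        DirectoryReader reader = DirectoryReader.open(dir);

        // ... new segments are replicated into the directory ...

        // Returns null if nothing changed; otherwise a new reader that
        // shares the unchanged segment readers with the old one.
        DirectoryReader newer = DirectoryReader.openIfChanged(reader);
        if (newer != null) {
            reader.close();
            reader = newer;
        }
        reader.close();
    }
}
```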