Re: [VOTE] Release Lucene/Solr 5.4.1 RC2

2016-01-21 Thread Michael Froh
Should the Solr release notes reference the additional fixes that went in
there?

From your email to start the thread:

- SOLR-8496: multi-select faceting and getDocSet(List<Query>) can match
deleted docs
 - SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with
MLTHandler and Simple/CloudMLTQParser
 - SOLR-8561: Add fallback to ZkController.getLeaderProps for a mixed
5.4-pre-5.4 deployments

On Thu, Jan 21, 2016, 6:47 AM Adrien Grand  wrote:

> Thanks all for voting. This vote has passed, I will start releasing these
> artifacts now.
>
> I haven't had much feedback about the release notes. If someone could just
> check them out to make sure they make sense, I would appreciate:
>  - http://wiki.apache.org/lucene-java/ReleaseNote541
>  - http://wiki.apache.org/solr/ReleaseNote541
>
> On Wed, Jan 20, 2016 at 4:22 PM, Noble Paul wrote:
>
>> +1 on java 7
>> SUCCESS! [1:50:00.109516]
>>
>> On Wed, Jan 20, 2016 at 3:37 PM, Ahmet Arslan 
>> wrote:
>> >
>> > +1
>> > SUCCESS! [1:50:21.498224]
>> >
>> > On Wednesday, January 20, 2016 1:28 AM, Tomás Fernández Löbbe <
>> tomasflo...@gmail.com> wrote:
>> >
>> >
>> >
>> > +1
>> > SUCCESS! [1:27:55.987215]
>> >
>> >
>> >
>> > On Tue, Jan 19, 2016 at 12:25 PM, Yonik Seeley 
>> wrote:
>> >
>> > +1
>> >>
>> >>-Yonik
>> >>
>> >>
>> >>On Mon, Jan 18, 2016 at 9:38 AM, Adrien Grand 
>> wrote:
>> >>> Please vote for the RC2 release candidate for Lucene/Solr 5.4.1
>> >>>
>> >>> This release candidate contains 3 additional changes compared to the
>> RC1:
>> >>>  - SOLR-8496: multi-select faceting and getDocSet(List<Query>) can
>> match
>> >>> deleted docs
>> >>>  - SOLR-8418: Adapt to changes in LUCENE-6590 for use of boosts with
>> >>> MLTHandler and Simple/CloudMLTQParser
>> >>>  - SOLR-8561: Add fallback to ZkController.getLeaderProps for a mixed
>> >>> 5.4-pre-5.4 deployments
>> >>>
>> >>> The artifacts can be downloaded from:
>> >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-5.4.1-RC2-rev1725212
>> >>>
>> >>> You can run the smoke tester directly with this command:
>> >>> python3 -u dev-tools/scripts/smokeTestRelease.py
>> >>>
>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-5.4.1-RC2-rev1725212
>> >>>
>> >>> The smoke tester already passed for me both with the local and remote
>> >>> artifacts, so here is my +1.
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>> --
>> -
>> Noble Paul
>>
>>
>>


Dense union of doc IDs

2022-11-03 Thread Michael Froh
Hi,

I was recently poking around in the createWeight implementation for
MultiTermQueryConstantScoreWrapper to get to the bottom of some slow
queries, and I realized that the worst-case performance could be pretty
bad, but (maybe) possible to optimize for.

Imagine if we have a segment with N docs and our MultiTermQuery expands to
hit M terms, where each of the M terms matches N-1 docs. (If we matched all
N docs, then Greg Miller's recent optimization to replace the
MultiTermQuery with a TermQuery would kick in.) In this case, my
understanding is that we would iterate through all the terms and pass each
one's postings to a DocIdSetBuilder, which will iterate through the
postings to set bits. This whole thing would be O(MN), I think.

I was thinking that it would be cool if the DocIdSetBuilder could detect
long runs of set bits and advance() each DISI to skip over them (since
they're guaranteed not to contribute anything new to the union). In the
worst case that I described above, I think it would make the whole thing
O(M log N) (assuming advance() takes log time).

At the risk of overcomplicating things, maybe DocIdSetBuilder could use a
third ("dense") BulkAdder implementation that kicks in once enough bits are
set, to efficiently implement the "or" operation to skip over known
(sufficiently long) runs of set bits?
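
Roughly what I have in mind, as a minimal sketch (this is not DocIdSetBuilder's
actual API -- it uses java.util.BitSet for nextClearBit(), which FixedBitSet
would need an equivalent of, and the real logic would live inside the adder):

    import java.io.IOException;
    import java.util.BitSet;

    import org.apache.lucene.search.DocIdSetIterator;

    final class DenseOrSketch {
      // OR a postings iterator into an already-dense bit set, skipping over
      // runs of bits that are already set by advancing the iterator past them.
      static void orDense(BitSet bits, DocIdSetIterator disi) throws IOException {
        int doc = disi.nextDoc();
        while (doc != DocIdSetIterator.NO_MORE_DOCS) {
          if (bits.get(doc)) {
            // doc falls inside a run of already-set bits; none of those docs
            // can add anything new to the union, so jump past the whole run.
            doc = disi.advance(bits.nextClearBit(doc));
          } else {
            bits.set(doc);
            doc = disi.nextDoc();
          }
        }
      }
    }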

Would something like that be useful? Is the "dense union of doc IDs" case
common enough to warrant it?

Thanks,
Froh


Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
Hi all,

I was looking into a customer issue where they noticed some increased GC
time after upgrading from Lucene 7.x to 9.x. After taking some heap dumps
from both systems, the big difference was tracked down to the float[256]
allocated (as a norms cache) when creating a BM25Scorer (in
BM25Similarity.scorer()).

The change seems to have come in with
https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1,
which removed some of the special-case logic around the "non-scoring
similarity" embedded in IndexSearcher (returned in the false case from the
old IndexSearcher#scorer(boolean needsScores)).

While I really like that we no longer have that special-case logic in
IndexSearcher, we now have the issue that every time we create a new
TermWeight (or other Weight) it allocates a float[256], even if the
TermWeight doesn't need scores. Also, I think it's the exact same
float[256] for all non-scoring weights, since it's being computed using the
same "all 1s" CollectionStatistics and TermStatistics.

(For the record, yes, the queries in question have an obscene number of
TermQueries, so 1024 bytes times lots of TermWeights, times multiple
queries running concurrently makes lots of heap allocation.)

I'd like to submit a patch to fix this, but I'm wondering what approach to
take. One option I'm considering is precomputing a singleton float[256] for
the non-scoring case (where CollectionStatistics and TermStatistics are all
1s). That would have the least functional impact, but would let all
non-scoring clauses share the same array. Is there a better way to tackle
this?
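
For concreteness, the sharing part is roughly this (a sketch only -- the real
entries would come from BM25Similarity's norm-cache formula evaluated once
against the all-1s statistics, and the values themselves never matter for
non-scoring clauses):

    // Compute the non-scoring cache once and hand the same float[256] to every
    // non-scoring weight instead of allocating a fresh array per Weight.
    // The holder idiom gives lazy, thread-safe initialization.
    final class NonScoringNormCache {
      private NonScoringNormCache() {}

      private static class Holder {
        static final float[] CACHE = computeOnce();
      }

      static float[] get() {
        return Holder.CACHE;
      }

      private static float[] computeOnce() {
        float[] cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
          // Placeholder value -- the real entries would be BM25Similarity's
          // cache values computed from the "all 1s" statistics.
          cache[i] = 1f;
        }
        return cache;
      }
    }

That would keep scoring clauses untouched and only share the array where the
statistics are the constant all-1s ones.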

Thanks,
Froh


Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
> This seems ok if it isn't invasive. I still feel like something is
> "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> you have way too many segments?

From what I saw, it's 1KB per "leaf query" to create the
BM25Scorer instance (at the Weight level), but then that BM25Scorer is
shared across all scorer (DISI) instances for all segments. So it doesn't
scale with segment count. It looks like the old logic used to allocate a
SimScorer per segment, so this is a big improvement in that regard (for
scoring clauses, since the non-scoring clauses had a super-lightweight
SimScorer).

In this particular case, they're running these gnarly machine-generated
BooleanQuery trees with at least 512 non-scoring TermQuery clauses (across a
bunch of different fields, so TermInSetQuery isn't an option). From what I
can see, each of those TermQueries produces a TermWeight that holds a
BM25Scorer that holds yet another instance of this float[256] array, for
512KB+ of these caches per running query. It's definitely only going to be
an issue for folks who are flying close to the max clause count.

> One last thought: we should re-check if the cache is still needed :) I
> think decoding norms used to be more expensive in the past. This cache
> is now only precomputing part of the bm25 formula to save some
> add/multiply/divide.

Yeah -- when I saw the cache calculation, it reminded me of precomputed
tables of trigonometric functions in the demoscene.

I could try inlining those calculations and measuring the impact with the
luceneutil benchmarks.


On Tue, May 2, 2023 at 11:34 AM Robert Muir  wrote:

> On Tue, May 2, 2023 at 12:49 PM Michael Froh  wrote:
> >
> > Hi all,
> >
> > I was looking into a customer issue where they noticed some increased GC
> time after upgrading from Lucene 7.x to 9.x. After taking some heap dumps
> from both systems, the big difference was tracked down to the float[256]
> allocated (as a norms cache) when creating a BM25Scorer (in
> BM25Similarity.scorer()).
> >
> > The change seems to have come in with
> https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1,
> which removed some of the special-case logic around the "non-scoring
> similarity" embedded in IndexSearcher (returned in the false case from the
> old IndexSearcher#scorer(boolean needsScores)).
> >
> > While I really like that we no longer have that special-case logic in
> IndexSearcher, we now have the issue that every time we create a new
> TermWeight (or other Weight) it allocates a float[256], even if the
> TermWeight doesn't need scores. Also, I think it's the exact same
> float[256] for all non-scoring weights, since it's being computed using the
> same "all 1s" CollectionStatistics and TermStatistics.
> >
> > (For the record, yes, the queries in question have an obscene number of
> TermQueries, so 1024 bytes times lots of TermWeights, times multiple
> queries running concurrently makes lots of heap allocation.)
> >
> > I'd like to submit a patch to fix this, but I'm wondering what approach
> to take. One option I'm considering is precomputing a singleton float[256]
> for the non-scoring case (where CollectionStatistics and TermStatistics are
> all 1s). That would have the least functional impact, but would let all
> non-scoring clauses share the same array. Is there a better way to tackle
> this?
> >
>
> This seems ok if it isn't invasive. I still feel like something is
> "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> you have way too many segments?
>
> Originally (for various similar reasons) there was a place in the API
> to do this, so it would only happen per-Weight instead of per-Scorer,
> which was the SimWeight that got eliminated by the commit you point
> to. But I'd love if we could steer clear of that complexity:
> simplifying the API here was definitely the right move. Its been more
> than 5 years since this change was made, and this is the first
> complaint i've heard about the 1KB, which is why i asked about your
> setup.
>
>
>


Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer

2023-05-02 Thread Michael Froh
> So I'm actually still confused why this float[256] stands out in your
> measurements vs two long[128]'s. Maybe it's a profiler ghost?

Huh... that's a really good point.

I'm going to spend a bit more time digging and see if I can reliably
reproduce it on my own machine. I've just been comparing heap dumps from
production hosts so far, so I'll try measuring in an environment where I
can see what's going on.

On Tue, May 2, 2023 at 1:14 PM Robert Muir  wrote:

> On Tue, May 2, 2023 at 3:24 PM Michael Froh  wrote:
> >
> > > This seems ok if it isn't invasive. I still feel like something is
> > > "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> > > you have way too many segments?
> >
> > From what I saw, it's 1KB per "leaf query" to create the BM25Scorer
> instance (at the Weight level), but then that BM25Scorer is shared across
> all scorer (DISI) instances for all segments. So it doesn't scale with
> segment count. It looks like the old logic used to allocate a SimScorer per
> segment, so this is a big improvement in that regard (for scoring clauses,
> since the non-scoring clauses had a super-lightweight SimScorer).
> >
> > In this particular case, they're running these gnarly machine-generated
> BooleanQuery trees with at least 512 non-scoring TermQuery clauses (across a
> bunch of different fields, so TermInSetQuery isn't an option). From what I
> can see, each of those TermQueries produces a TermWeight that holds a
> BM25Scorer that holds yet another instance of this float[256] array, for
> 512KB+ of these caches per running query. It's definitely only going to be
> an issue for folks who are flying close to the max clause count.
> >
>
> Yeah, but the same situation could be said for buffers like this one:
>
> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java#L311-L312
> So I'm actually still confused why this float[256] stands out in your
> measurements vs two long[128]'s. Maybe it's a profiler ghost?
>
>
>


Re: Multimodal search

2023-10-12 Thread Michael Froh
We recently added multimodal search in OpenSearch:
https://github.com/opensearch-project/neural-search/pull/359

Since Lucene ultimately just cares about embeddings, does Lucene itself
really need to be multimodal? Wherever the embeddings come from, Lucene can
index the vectors and combine with textual queries, right?
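
Something like this sketch, for example (getEmbedding(), imageBytes, queryImage,
and the writer are stand-ins for whatever external model and indexing plumbing
you already have):

    // Index time: the embedding comes from whatever external (image/audio/text)
    // model you like; Lucene just stores the float[] next to the text fields.
    Document doc = new Document();
    doc.add(new TextField("caption", "red running shoes", Field.Store.NO));
    doc.add(new KnnFloatVectorField("image_vector", getEmbedding(imageBytes)));
    writer.addDocument(doc);

    // Query time: a nearest-neighbor clause on the embedding combined with an
    // ordinary textual clause in one BooleanQuery.
    Query query = new BooleanQuery.Builder()
        .add(new KnnFloatVectorQuery("image_vector", getEmbedding(queryImage), 10),
            BooleanClause.Occur.SHOULD)
        .add(new TermQuery(new Term("caption", "shoes")), BooleanClause.Occur.SHOULD)
        .build();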

Thanks,
Froh

On Thu, Oct 12, 2023 at 12:59 PM Michael Wechner 
wrote:

> Hi
>
> Did anyone of the Lucene committers consider making Lucene multimodal?
>
> With a quick Google search I found for example
>
> https://dl.acm.org/doi/abs/10.1145/3503161.3548768
>
> https://sigir-ecom.github.io/ecom2018/ecom18Papers/paper7.pdf
>
> Thanks
>
> Michael
>
>
>
>
>


Boolean field type

2023-11-08 Thread Michael Froh
Hey,

I've been musing about ideas for a "clever" Boolean field type on Lucene
for a while, and I think I might have an idea that could work. That said,
this popped into my head this afternoon and has not been fully-baked. It
may not be very clever at all.

My experience is that Boolean fields tend to be overwhelmingly true or
overwhelmingly false. I've had pretty good luck with using a keyword-style
field, where the only term represents the more sparse value. (For example,
I did a thing years ago with explicit tombstones, where versioned deletes
would have the field "deleted" with a value of "true", and live
documents didn't have the deleted field at all. Every query would add a
filter on "NOT deleted:true".)

That's great when you know up-front what the sparse value is going to be.
Working on OpenSearch, I just created an issue suggesting that we take a
hint from users for which value they think is going to be more common so we
only index the less common one:
https://github.com/opensearch-project/OpenSearch/issues/11143

At the Lucene level, though, we could index a Boolean field type as the
less common term when we flush (by counting the values and figuring out
which is less common). Then, per segment, we can rewrite any query for the
more common value as NOT the less common value.
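
Concretely, if a segment ended up indexing "false" as the sparse term for a
field "f", a query for the common value would rewrite per segment to something
like this (a sketch, assuming every doc has exactly one value -- more on that
assumption below):

    // Per-segment rewrite sketch for a query on the common value ("true"),
    // assuming this segment only indexed the sparse value ("false") and every
    // document has exactly one value for the field "f".
    Query rewritten = new BooleanQuery.Builder()
        .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
        .add(new TermQuery(new Term("f", "false")), BooleanClause.Occur.MUST_NOT)
        .build();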

You can compute upper/lower bounds on the value frequencies cheaply during
a merge, so I think you could usually write the doc IDs for the less common
value directly (without needing to count them first), even when input
segments disagree on which is the more common value.

If your Boolean field is not overwhelmingly lopsided, you might even want
to split segments to be 100% true or 100% false, such that queries against
the Boolean field become match-all or match-none. On a retail website,
maybe you have some toggle for "only show me results with property X" -- if
all your property X products are in one segment or a handful of segments,
you can drop the property X clause from the matching segments and skip the
other segments.

I guess one icky part of this compared to the usual Lucene field model is
that I'm assuming a Boolean field is never missing (or I guess missing
implies "false" by default?). Would that be a deal-breaker?

Thanks,
Froh


Re: Boolean field type

2023-11-10 Thread Michael Froh
Thanks Mikhail and Mike!

Mikhail, since you replied, I remembered your work on block joins in Solr
(thank you for that, by the way!), which reminded me that it's not unusual
for docs in a Lucene index to "mix" their schemata, like in parent/child
blocks. If 90% of parent docs are "true" on a Boolean field, but the field
doesn't exist for the child docs, my suggested approach would probably see
"true" as the sparse value (assuming there are at least as many children as
parents). Ideally, I would want to only track the "false" parents (and
leave the field off of the children).

Indeed the idea of a "required" field in Lucene is tricky (though Mike's
suggestion of missing defaults could help). Even worse, I think we'd also
need a way to enforce "exactly one value", since a "Boolean" term field can
really have four states -- true, false, neither, or both.

It feels like there's not a workable short-term solution to implement
something like this as a regular IndexableField in Lucene (or at least I'm
not seeing it).

I don't think I see enough value to pitch the idea of adding a new
field-like "thing" (that would have exactly one value for every doc, and
maybe could be counted relative to docs in a block). For now, I think it's
probably only practical to implement something like this as part of a
schema definition in one of the higher-level search servers.

Thanks for the discussion!
Froh

On Thu, Nov 9, 2023 at 5:01 AM Michael Sokolov  wrote:

> Can you require the user to specify missing: true or missing: false
> semantics. With that you can decide what to do with the missing values
>
> On Thu, Nov 9, 2023, 7:55 AM Mikhail Khludnev  wrote:
>
>> Hello Michael.
>> This optimization "NOT the less common value" assumes that boolean field
>> is required, but how to enforce this mandatory field constraint in Lucene?
>> I'm not aware of something like Solr schema or mapping.
>> If saying foo:true is common, it means that the posting list goes like
>> dense sequentially increasing numbers 1,2,3,4,5.. May it already be
>> compressed by codecs like
>> https://lucene.apache.org/core/9_2_0/core/org/apache/lucene/util/packed/MonotonicBlockPackedWriter.html
>> ?
>>
>> On Thu, Nov 9, 2023 at 3:31 AM Michael Froh  wrote:
>>
>>> Hey,
>>>
>>> I've been musing about ideas for a "clever" Boolean field type on Lucene
>>> for a while, and I think I might have an idea that could work. That said,
>>> this popped into my head this afternoon and has not been fully-baked. It
>>> may not be very clever at all.
>>>
>>> My experience is that Boolean fields tend to be overwhelmingly true or
>>> overwhelmingly false. I've had pretty good luck with using a keyword-style
>>> field, where the only term represents the more sparse value. (For example,
>>> I did a thing years ago with explicit tombstones, where versioned deletes
>>> would have the field "deleted" with a value of "true", and live
>>> documents didn't have the deleted field at all. Every query would add a
>>> filter on "NOT deleted:true".)
>>>
>>> That's great when you know up-front what the sparse value is going to
>>> be. Working on OpenSearch, I just created an issue suggesting that we take
>>> a hint from users for which value they think is going to be more common so
>>> we only index the less common one:
>>> https://github.com/opensearch-project/OpenSearch/issues/11143
>>>
>>> At the Lucene level, though, we could index a Boolean field type as the
>>> less common term when we flush (by counting the values and figuring out
>>> which is less common). Then, per segment, we can rewrite any query for the
>>> more common value as NOT the less common value.
>>>
>>> You can compute upper/lower bounds on the value frequencies cheaply
>>> during a merge, so I think you could usually write the doc IDs for the less
>>> common value directly (without needing to count them first), even when
>>> input segments disagree on which is the more common value.
>>>
>>> If your Boolean field is not overwhelmingly lopsided, you might even
>>> want to split segments to be 100% true or 100% false, such that queries
>>> against the Boolean field become match-all or match-none. On a retail
>>> website, maybe you have some toggle for "only show me results with property
>>> X" -- if all your property X products are in one segment or a handful of
>>> segments, you can drop the property X clause from the matching segments and
>>> skip the other segments.
>>>
>>> I guess one icky part of this compared to the usual Lucene field model
>>> is that I'm assuming a Boolean field is never missing (or I guess missing
>>> implies "false" by default?). Would that be a deal-breaker?
>>>
>>> Thanks,
>>> Froh
>>>
>>
>>
>> --
>> Sincerely yours
>> Mikhail Khludnev
>>
>


UTF-8 well-formedness for SimpleTextCodec

2023-12-18 Thread Michael Froh
Hi there,

I was recently writing up a short Lucene file format tutorial (
https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
using SimpleTextCodec for educational purposes.

I found that SimpleTextSegmentInfo tries to output the segment ID as raw
bytes, which will often result in malformed UTF-8 output. I wrote a little
fix to output it as the text representation of a byte array (
https://github.com/apache/lucene/pull/12897). I noticed that a similar issue
exists with binary doc values (where the bytes get written directly).

Is there any general desire for SimpleTextCodec to output well-formed UTF-8
where possible?

Thanks,
Froh


Computing weight.count() cheaply in the face of deletes?

2024-02-02 Thread Michael Froh
Hi,

On OpenSearch, we've been taking advantage of the various O(1)
Weight#count() implementations to quickly compute various aggregations
without needing to iterate over all the matching documents (at least when
the top-level query is functionally a match-all at the segment level). Of
course, from what I've seen, every clever Weight#count()
implementation falls apart (returns -1) in the face of deletes.

I was thinking that we could still handle small numbers of deletes
efficiently if only we could get a DocIdSetIterator for deleted docs.

Like suppose you're doing a date histogram aggregation, you could get the
counts for each bucket from the points tree (ignoring deletes), then
iterate through the deleted docs and decrement their contribution from the
relevant bucket (determined based on a docvalues lookup). Assuming the
number of deleted docs is small, it should be cheap, right?
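
In code, the decrement step would be something like this sketch (bucketFor() is
a stand-in for the timestamp-to-bucket mapping, and deletedDocs is the
hypothetical iterator over deleted docs):

    // Subtract deleted docs' contributions from per-bucket counts that were
    // computed from the points tree while ignoring deletes.
    static void subtractDeletes(long[] bucketCounts, DocIdSetIterator deletedDocs,
                                NumericDocValues timestamps) throws IOException {
      for (int doc = deletedDocs.nextDoc();
          doc != DocIdSetIterator.NO_MORE_DOCS;
          doc = deletedDocs.nextDoc()) {
        if (timestamps.advanceExact(doc)) {
          bucketCounts[bucketFor(timestamps.longValue())]--;
        }
      }
    }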

The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
not great for iteration. I'm imagining adding a supplementary "deleted docs
iterator" that could sit next to the FixedBitSet if and only if the number
of deletes is "small". Is there a better way that I should be thinking
about this?

Thanks,
Froh


Re: Computing weight.count() cheaply in the face of deletes?

2024-02-05 Thread Michael Froh
Thanks Adrien!

My thinking with a separate iterator was that nextClearBit() is relatively
expensive (O(maxDoc) to traverse everything, I think). The solution I was
imagining would involve an index-time change to output, say, an int[] of
deleted docIDs if the number is sufficiently small (like maybe less than
1000). Then the livedocs interface could optionally return a cheap deleted
docs iterator (i.e. only if the number of deleted docs is less than the
threshold). Technically, the cost would be O(1), since we set a constant
bound on the effort and fail otherwise. :)

I think 1000 doc value lookups would be cheap, but I don't know if the
guarantee is cheap enough to make it into Weight#count.

That said, I'm going to see if iterating with nextClearBit() is
sufficiently cheap. Hmm... precomputing that int[] for deleted docIDs on
refresh could be an option too.
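
The refresh-time version could be as simple as this sketch (the threshold is
arbitrary, and null just signals "too many deletes, fall back to the slow
path"):

    // Build a small int[] of deleted doc IDs from a segment's liveDocs at
    // refresh time, but only when the segment has few deletes.
    static int[] deletedDocs(LeafReader reader, int maxDeletes) {
      Bits liveDocs = reader.getLiveDocs();
      if (liveDocs == null) {
        return new int[0]; // no deletes in this segment
      }
      int numDeleted = reader.numDeletedDocs();
      if (numDeleted > maxDeletes) {
        return null;
      }
      int[] deleted = new int[numDeleted];
      int j = 0;
      for (int i = 0; i < reader.maxDoc(); i++) {
        if (!liveDocs.get(i)) {
          deleted[j++] = i;
        }
      }
      return deleted;
    }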

Thanks again,
Froh

On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand  wrote:

> Hi Michael,
>
> Indeed, only MatchAllDocsQuery knows how to produce a count when there are
> deletes.
>
> Your idea sounds good to me, do you actually need a side car iterator for
> deletes, or could you use a nextClearBit() operation on the bit set?
>
> I don't think we can fold it into Weight#count since there is an
> expectation that it is negligible compared with the cost of a naive count,
> but we may be able to do it in IndexSearcher#count or on the OpenSearch
> side.
>
> On Fri, Feb 2, 2024 at 11:50 PM, Michael Froh wrote:
>
>> Hi,
>>
>> On OpenSearch, we've been taking advantage of the various O(1)
>> Weight#count() implementations to quickly compute various aggregations
>> without needing to iterate over all the matching documents (at least when
>> the top-level query is functionally a match-all at the segment level). Of
>> course, from what I've seen, every clever Weight#count()
>> implementation falls apart (returns -1) in the face of deletes.
>>
>> I was thinking that we could still handle small numbers of deletes
>> efficiently if only we could get a DocIdSetIterator for deleted docs.
>>
>> Like suppose you're doing a date histogram aggregation, you could get the
>> counts for each bucket from the points tree (ignoring deletes), then
>> iterate through the deleted docs and decrement their contribution from the
>> relevant bucket (determined based on a docvalues lookup). Assuming the
>> number of deleted docs is small, it should be cheap, right?
>>
>> The current LiveDocs implementation is just a FixedBitSet, so AFAIK it's
>> not great for iteration. I'm imagining adding a supplementary "deleted docs
>> iterator" that could sit next to the FixedBitSet if and only if the number
>> of deletes is "small". Is there a better way that I should be thinking
>> about this?
>>
>> Thanks,
>> Froh
>>
>


Re: Improve testing

2024-05-24 Thread Michael Froh
Is your new test uncommitted?

The Gradle check will fail if you have uncommitted files, to avoid the
situation where it "works on my machine (because of a file that I forgot to
commit)".

The rough workflow is:

1. Develop stuff (code and/or tests).
2. Commit it.
3. Gradle check.
4. If Gradle check fails, then make changes and amend your commit. Go to 3.

Hope that helps,
Froh


On Fri, May 24, 2024 at 3:31 PM Chang Hank  wrote:

> After I added the new test case, I failed the ./gradlew check and it seems
> like the check failed because I added the new test case.
> Is there anything I need to do before executing ./gradlew check?
>
> Best,
> Hank
>
> > On May 24, 2024, at 12:53 PM, Chang Hank 
> wrote:
> >
> > Hi Robert,
> >
> > Thanks for your advice, will look into it!!
> >
> > Best,
> > Hank
> >> On May 24, 2024, at 12:46 PM, Robert Muir  wrote:
> >>
> >> On Fri, May 24, 2024 at 2:33 PM Chang Hank 
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I want to improve the code coverage for Lucene, which package would
> you recommend testing to do so? Do we need more coverage in the Core
> package?
> >>>
> >>
> >> Hello,
> >>
> >> I'd recommend looking at the help/tests.txt file, you can generate the
> >> coverage report easily and find untested code:
> >>
> >> https://github.com/apache/lucene/blob/main/help/tests.txt#L193
> >>
> >>
> >
>
>
>
>


Re: Can we import an HNSW graph into lucene index ?

2024-06-14 Thread Michael Froh
Hi Anand,

Interesting that you should bring this up!

There was a talk just this week at Berlin Buzzwords talking about using
cuVS with Lucene: https://www.youtube.com/watch?v=qiW7iIDFJC0

From that talk, it sounds like the folks at SearchScale have managed to
integrate cuVS as a custom codec under Lucene. There was also mention in
the talk that a CAGRA graph can be built using GPUs (to get some impressive
speedups) and then searched using CPU-based logic (so you don't
need to provision GPU hosts for your search fleet).

I wasn't able to find the code for the codec at
https://github.com/SearchScale/lucene-cuvs/tree/main, but Ishan and Noble
should be on this list and might be able to find it.

- Froh

On Fri, Jun 14, 2024 at 7:57 AM Benjamin Trent 
wrote:

> Anand,
>
> In short, I think it's feasible, but I don't think it's simple. I also
> don't think Lucene should directly provide an interface to the format
> that says "Give me the graph". You could have a custom writer that
> does this however.
>
> All formats are nominally based, so if your GPU merge format writes
> out the appropriate name and format, it should be readable.
>
> > One issue we have been running into is long build times with higher
> dimensional vectors.
>
> Are you building the graph with a single thread?
>
> What vector dimensions are you using?
>
> As an aside, building the graph via quantized vectors can help speed
> things up. Though I understand the desire to do graph building with a
> GPU.
>
> Very interesting ideas indeed Anand.
>
> Ben
>
> On Fri, Jun 14, 2024 at 4:49 AM Anand Kotriwal 
> wrote:
> >
> > Hi all,
> >
> > We extensively use Lucene and HNSW graph search capability for ANN
> searches.
> > One issue we have been running into is long build times with higher
> dimensional vectors. To address this, we are exploring ways where we can
> build the hnsw index on the GPU and merge it into an existing Lucene index
> to serve queries. For example, Nvidia's cuvs library supports building a
> CAGRA index and  transforming it into a hnswlib graph.
> >
> > My idea is - once the hnswgraph is built on the GPUs, we can import the
> graph. We need the graph vertices and their connections. We can then write
> it to a lucene compatible segment file format. We also map the docids to
> embeddings and update the fieldinfos.
> >
> > I would like feedback from the community on whether this sounds feasible
> and any implementation pointers you might have.
> >
> >
> > Thanks,
> > Anand Kotriwal
>
>
>


Re: IndexWriter.getReader speed (NRT)

2024-07-29 Thread Michael Froh
Hi David,

Great meeting you at Buzzwords last month!

(Sorry for the late reply -- I was on vacation for weeks.)

You can modify maxFullFlushMergeWaitMillis at the IndexWriterConfig level,
but that sets it for any caller who tries to open an IndexReader from the
IndexWriter.
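
That is, the only knob today is config-wide -- roughly this (analyzer and dir
assumed to already exist):

    // Disables merge-on-refresh for every reader opened from this writer;
    // there's no per-call override today.
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setMaxFullFlushMergeWaitMillis(0);
    IndexWriter writer = new IndexWriter(dir, iwc);
    DirectoryReader reader = DirectoryReader.open(writer);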

It sounds like you would like an IndexWriter that "normally" merges on
getReader calls (with the default 500ms wait), but can also return a reader
without merging if explicitly requested (in the getReader call).

Adding an overload to IndexWriter.getReader would be pretty easy, but that
method is package-private. The hairier part probably involves deciding
which open/openIfChanged methods to overload in DirectoryReader, since I
think that's where the parameter would ultimately need to be exposed to
users.

Is that a fair assessment of what you're looking for?

Thanks,
Froh

On Fri, Jul 5, 2024 at 8:00 PM David Smiley  wrote:

> In the context of looking at NRT search in Solr (the so-called
> realtime searcher), I was looking at IndexWriter.getReader to see what
> it does, as it seems to be the primary method for opening a new view
> quickly.  If Lucene has a better entry point for opening a view ASAP,
> let me know.  I see that getReader writes a segment.  A very long time
> ago in ~2013, I recall Michael Busch, then at Twitter, was
> contributing some amazing features to Lucene to search the index
> writing buffers in-place.  Do people remember what became of that?
> Maybe I'm wrong -- no contribution.  I don't see anything in
> CHANGES.txt from Michael that's pertinent.  I didn't yet look in JIRA.
> The presentation I saw (in person) by Michael was at Lucene/Solr
> Revolution in Boston, 2013 -- recording:
> https://www.youtube.com/watch?v=F2CHE4VyB3c
>
> getReader may also merge segments (configured to do so by default
> since 9.3), thus making this not very NRT friendly for a subset of
> use-cases that a search app may have.  In my experience these relate
> to indexing to do conditional updates based on index state or to
> return index statistics per batch.  Wouldn't it be useful for
> maxFullFlushMergeWaitMillis to be over-rideable as a parameter in case
> the caller wishes to get a view without waiting at all for merges but
> leaving open other use-cases to use the default/configured time?
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
>


AbstractMultiTermQueryConstantScoreWrapper cost estimates (https://github.com/apache/lucene/issues/13029)

2024-08-01 Thread Michael Froh
Hi there,

For a few months, some of us have been running into issues with the cost
estimate from AbstractMultiTermQueryConstantScoreWrapper. (
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300
)

In https://github.com/apache/lucene/issues/13029, the problem was raised in
terms of queries not being cached, because the estimated cost was too high.

We've also run into problems in OpenSearch, since we started wrapping
MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost
estimate, so IndexOrDocValuesQuery decides it should be a DV query, even
though the MTQ would really only match a handful of docs (and should be the
lead iterator).
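
The wrapping pattern looks roughly like this (a sketch; dvSideEquivalent() is a
stand-in for whatever doc-values-based query the caller builds for the same
field and pattern):

    // IndexOrDocValuesQuery picks a side based on the ScorerSupplier's cost(),
    // so an inflated estimate from the MultiTermQuery wrapper pushes it to the
    // doc-values side even when the MTQ only matches a handful of docs.
    MultiTermQuery mtq = new PrefixQuery(new Term("category", "rare-prefix"));
    Query query = new IndexOrDocValuesQuery(mtq, dvSideEquivalent("category", "rare-prefix"));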

I opened a PR back in March (https://github.com/apache/lucene/pull/13201)
to try to handle the case where a MultiTermQuery matches a small number of
terms. Since Mayya pulled the rewrite logic that expands up to 16 terms (to
rewrite as a Boolean disjunction) earlier in the workflow (in
https://github.com/apache/lucene/pull/13454), we get the better cost
estimate for MTQs on few terms "for free".

What do folks think?

Thanks,
Froh


Re: AbstractMultiTermQueryConstantScoreWrapper cost estimates (https://github.com/apache/lucene/issues/13029)

2024-08-02 Thread Michael Froh
Exactly!

My initial implementation added some potential cost. (I think I enumerated
up to 128 terms before giving up.) Now that Mayya moved the (probably tiny)
cost of expanding the first 16 terms upfront, my change is theoretically
"free".

Froh

On Fri, Aug 2, 2024 at 3:25 PM Greg Miller  wrote:

> Hey Froh-
>
> I got some time to look through your PR (most of the time was actually
> refreshing my memory on the change history leading up to your PR and
> digesting the issue described). I think this makes a ton of sense. If I'm
> understanding properly, the latest version of your PR essentially takes
> advantage of Mayya's recent change (
> https://github.com/apache/lucene/pull/13454) in the score supplier
> behavior that is now doing _some_ up-front work to iterate the first <= 16
> terms when building the scoreSupplier and computes a more
> accurate/reasonable cost based on that already-done work. Am I getting this
> right? If so, this seems like it has no downsides and all upside.
>
> I'll do a proper pass through the PR here shortly, but I love the idea
> (assuming I'm understanding it properly on a Friday afternoon after a
> long-ish week...).
>
> Cheers,
> -Greg
>
> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller  wrote:
>
>> Hi Froh-
>>
>> Thanks for raising this and sorry I missed your tag in GH#13201 back in
>> June (had some vacation and was generally away). I'd be interested to see
>> what others think as well, but I'll at least commit to looking through your
>> PR tomorrow or Monday to get a better handle on what's being proposed. We
>> went through a few iterations of this originally before we landed on the
>> current version. One promising approach was to have a more intelligent
>> query that would load some number of terms up-front to get a better cost
>> estimate before making a decision, but it required a custom query
>> implementation that generally didn't get favorable feedback (it's nice to
>> be able to use the existing IndexOrDocValuesQuery abstraction instead). I
>> can dig up some of that conversation if it's helpful, but I'll better
>> understand what you've got in mind first.
>>
>> Unwinding a bit though, I'm also in favor in general that we should be
>> able to do a better job estimating cost here. I think the tricky part is
>> how we go about doing that effectively. Thanks again for kicking off this
>> thread!
>>
>> Cheers,
>> -Greg
>>
>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh  wrote:
>>
>>> Hi there,
>>>
>>> For a few months, some of us have been running into issues with the cost
>>> estimate from AbstractMultiTermQueryConstantScoreWrapper. (
>>> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300
>>> )
>>>
>>> In https://github.com/apache/lucene/issues/13029, the problem was
>>> raised in terms of queries not being cached, because the estimated cost was
>>> too high.
>>>
>>> We've also run into problems in OpenSearch, since we started wrapping
>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost
>>> estimate, so IndexOrDocValuesQuery decides it should be a DV query, even
>>> though the MTQ would really only match a handful of docs (and should be the
>>> lead iterator).
>>>
>>> I opened a PR back in March (https://github.com/apache/lucene/pull/13201)
>>> to try to handle the case where a MultiTermQuery matches a small number of
>>> terms. Since Mayya pulled the rewrite logic that expands up to 16 terms (to
>>> rewrite as a Boolean disjunction) earlier in the workflow (in
>>> https://github.com/apache/lucene/pull/13454), we get the better cost
>>> estimate for MTQs on few terms "for free".
>>>
>>> What do folks think?
>>>
>>> Thanks,
>>> Froh
>>>
>>


Re: AbstractMultiTermQueryConstantScoreWrapper cost estimates (https://github.com/apache/lucene/issues/13029)

2024-08-02 Thread Michael Froh
Incidentally, speaking as someone with only a superficial understanding of
how the FSTs work, I'm wondering if there is risk of cost in expanding the
first few terms.

Say we have a million terms, but only one contains an 'a'. If someone
searches for '*a*', does that devolve into a term scan? Or can the FST
efficiently identify an arc with an 'a' and efficiently identify terms
containing that arc?

Thanks,
Froh

On Fri, Aug 2, 2024 at 3:50 PM Michael Froh  wrote:

> Exactly!
>
> My initial implementation added some potential cost. (I think I enumerated
> up to 128 terms before giving up.) Now that Mayya moved the (probably tiny)
> cost of expanding the first 16 terms upfront, my change is theoretically
> "free".
>
> Froh
>
> On Fri, Aug 2, 2024 at 3:25 PM Greg Miller  wrote:
>
>> Hey Froh-
>>
>> I got some time to look through your PR (most of the time was actually
>> refreshing my memory on the change history leading up to your PR and
>> digesting the issue described). I think this makes a ton of sense. If I'm
>> understanding properly, the latest version of your PR essentially takes
>> advantage of Mayya's recent change (
>> https://github.com/apache/lucene/pull/13454) in the score supplier
>> behavior that is now doing _some_ up-front work to iterate the first <= 16
>> terms when building the scoreSupplier and computes a more
>> accurate/reasonable cost based on that already-done work. Am I getting this
>> right? If so, this seems like it has no downsides and all upside.
>>
>> I'll do a proper pass through the PR here shortly, but I love the idea
>> (assuming I'm understanding it properly on a Friday afternoon after a
>> long-ish week...).
>>
>> Cheers,
>> -Greg
>>
>> On Thu, Aug 1, 2024 at 7:47 PM Greg Miller  wrote:
>>
>>> Hi Froh-
>>>
>>> Thanks for raising this and sorry I missed your tag in GH#13201 back in
>>> June (had some vacation and was generally away). I'd be interested to see
>>> what others think as well, but I'll at least commit to looking through your
>>> PR tomorrow or Monday to get a better handle on what's being proposed. We
>>> went through a few iterations of this originally before we landed on the
>>> current version. One promising approach was to have a more intelligent
>>> query that would load some number of terms up-front to get a better cost
>>> estimate before making a decision, but it required a custom query
>>> implementation that generally didn't get favorable feedback (it's nice to
>>> be able to use the existing IndexOrDocValuesQuery abstraction instead). I
>>> can dig up some of that conversation if it's helpful, but I'll better
>>> understand what you've got in mind first.
>>>
>>> Unwinding a bit though, I'm also in favor in general that we should be
>>> able to do a better job estimating cost here. I think the tricky part is
>>> how we go about doing that effectively. Thanks again for kicking off this
>>> thread!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Thu, Aug 1, 2024 at 5:58 PM Michael Froh  wrote:
>>>
>>>> Hi there,
>>>>
>>>> For a few months, some of us have been running into issues with the
>>>> cost estimate from AbstractMultiTermQueryConstantScoreWrapper. (
>>>> https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/AbstractMultiTermQueryConstantScoreWrapper.java#L300
>>>> )
>>>>
>>>> In https://github.com/apache/lucene/issues/13029, the problem was
>>>> raised in terms of queries not being cached, because the estimated cost was
>>>> too high.
>>>>
>>>> We've also run into problems in OpenSearch, since we started wrapping
>>>> MultiTermQueries in IndexOrDocValuesQuery. The MTQ gets an exaggerated cost
>>>> estimate, so IndexOrDocValuesQuery decides it should be a DV query, even
>>>> though the MTQ would really only match a handful of docs (and should be the
>>>> lead iterator).
>>>>
>>>> I opened a PR back in March (
>>>> https://github.com/apache/lucene/pull/13201) to try to handle the case
>>>> where a MultiTermQuery matches a small number of terms. Since Mayya pulled
>>>> the rewrite logic that expands up to 16 terms (to rewrite as a Boolean
>>>> disjunction) earlier in the workflow (in
>>>> https://github.com/apache/lucene/pull/13454), we get the better cost
>>>> estimate for MTQs on few terms "for free".
>>>>
>>>> What do folks think?
>>>>
>>>> Thanks,
>>>> Froh
>>>>
>>>


Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-02 Thread Michael Froh
I am currently working on migrating a project from an old version of Solr
to Elasticsearch, and came across a funny (to me at least) difference in
the "default" behavior of JapanesePartOfSpeechStopFilterFactory.

If JapanesePartOfSpeechStopFilterFactory is given empty args, it does
nothing. It doesn't load any stop tags, and just passes along the
TokenStream passed to create(). (By comparison, the Elasticsearch filter
will default to loading the stop tags shipped in the Kuromoji analyzer
JAR.) So, for many years, my project was not using
JapanesePartOfSpeechStopFilter, when I thought that it was.

I would like to create an issue and submit a patch, in case other users out
there are failing to use the filter factory correctly, but I'm not sure
what the best approach is, between:

1. If someone doesn't specify the tags argument, then throw an exception
(because the user probably doesn't know what they're doing).
2. If someone doesn't specify the tags argument, then load the default stop
tags (like JapaneseAnalyzer does).

I would lean more toward 1, to avoid a silent change in behavior.
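
For reference, option 2 would amount to roughly this in the factory's create()
(a sketch, not an actual patch; the stopTags field name is assumed):

    // If no "tags" arg was provided, fall back to the same default stop tags
    // that JapaneseAnalyzer uses, instead of silently doing nothing.
    @Override
    public TokenStream create(TokenStream stream) {
      Set<String> tags = (stopTags != null) ? stopTags : JapaneseAnalyzer.getDefaultStopTags();
      return new JapanesePartOfSpeechStopFilter(stream, tags);
    }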


Re: Default (no-args) behavior for JapanesePartOfSpeechStopFilterFactory

2020-10-07 Thread Michael Froh
Thanks!

I created an issue (https://issues.apache.org/jira/browse/LUCENE-9567) and
PR (https://github.com/apache/lucene-solr/pull/1961), and followed your
suggestion of using the default stop tags and modifying MIGRATE.md.

Given that the "do nothing" behavior has been around for years, I don't see
much need to change it in 8.x (though I'm happy to do that if someone asks).

On Fri, Oct 2, 2020 at 9:49 AM Michael McCandless 
wrote:

> +1 to make this less trappy.
>
> It looks like KoreanPartOfSpeechStopFilterFactory will fallback to default
> stop tags if no args were provided.  I think we should indeed make
> JapanesePartOfSpeechStopFilterFactory consistent.
>
> Maybe, we fix this only in next major release (9.0), add an entry to
> MIGRATE.txt explaining that, and go with option 2?  And possibly option 1
> for 8.x releases?  (Or maybe don't fix it in 8.x releases... not sure).
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Fri, Oct 2, 2020 at 12:10 PM Michael Froh  wrote:
>
>> I am currently working on migrating a project from an old version of Solr
>> to Elasticsearch, and came across a funny (to me at least) difference in
>> the "default" behavior of JapanesePartOfSpeechStopFilterFactory.
>>
>> If JapanesePartOfSpeechStopFilterFactory is given empty args, it does
>> nothing. It doesn't load any stop tags, and just passes along the
>> TokenStream passed to create(). (By comparison, the Elasticsearch filter
>> will default to loading the stop tags shipped in the Kuromoji analyzer
>> JAR.) So, for many years, my project was not using
>> JapanesePartOfSpeechStopFilter, when I thought that it was.
>>
>> I would like to create an issue and submit a patch, in case other users
>> out there are failing to use the filter factory correctly, but I'm not sure
>> what the best approach is, between:
>>
>> 1. If someone doesn't specify the tags argument, then throw an exception
>> (because the user probably doesn't know what they're doing).
>> 2. If someone doesn't specify the tags argument, then load the default
>> stop tags (like JapaneseAnalyzer does).
>>
>> I would lean more toward 1, to avoid a silent change in behavior.
>>
>


Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
I have some code that is kind of abusing IndexWriter.deleteAll(). In short,
I'm basically experimenting with using tiny (one block of joined
parent/child documents) indexes as a serialized format to index on one
fleet and then merge these tiny indexes on another fleet. I'm doing this by
indexing a block, committing, storing the contents of the index directory
in a zip file, invoking deleteAll(), and repeating. Believe it or not, the
performance is not terrible. (Currently getting about 20% of the throughput
I see with regular indexing.)
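
The loop looks roughly like this (DocumentBlock and zipDirectory() are
stand-ins for our own plumbing):

    // One iteration per parent/child block: index it, commit, archive the tiny
    // index, then wipe the writer for the next block.
    for (DocumentBlock block : blocks) {
      writer.addDocuments(block.docs());
      writer.commit();
      zipDirectory(directory, block.id());
      writer.deleteAll();
      writer.commit();
    }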

Regardless of my serialization shenanigans above, I've found that
performance degrades over time for the process, as it spends more time
allocating and freeing memory. Analyzing some heap dumps, it's because
FieldInfos.byNumber is getting bigger and bigger. IndexWriter.deleteAll()
doesn't truly reset state. Specifically, it calls
globalFieldNumberMap.clear(), which clears all of the FieldNumbers
collections, but it doesn't reset lowestUnassignedFieldNumber. So, that
number keeps counting up, and new instances of FieldInfos allocate larger
and larger arrays (and only use the top indices).

Has anyone else encountered this? Can I open an issue for resetting
lowestUnassignedFieldNumber in FieldNumbers.clear()? Is there any risk in
doing so?

(For my specific use-case, I would be okay with not clearing
globalFieldNumberMap at all, since the set of fields is bounded, but
assigning new field numbers is probably among the least of my costs.)


Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
I didn't try creating a new IndexWriter for each batch, but I was assuming
that would be heavier, as it would allocate a new DocumentsWriter, and
through that new DocumentsWriterPerThreads. Skimming through the code for
DWPT, it looks like there are various pools involved in creating each
DWPT's instance of DefaultIndexingChain, which might be expensive to create
frequently, rather than reusing on flush().

Also I was partly motivated by laziness. The production code I'm borrowing
for this prototype doesn't make it easy to recreate the IndexWriterConfig,
and IWC is not reusable across IndexWriter instances.

On Wed, Nov 18, 2020 at 12:25 PM Michael Sokolov  wrote:

> I'm curious if you tried creating a new IndexWriter for each batch?
>
> On Wed, Nov 18, 2020 at 1:18 PM Michael Froh  wrote:
> >
> > I have some code that is kind of abusing IndexWriter.deleteAll(). In
> short, I'm basically experimenting with using tiny (one block of joined
> parent/child documents) indexes as a serialized format to index on one
> fleet and then merge these tiny indexes on another fleet. I'm doing this by
> indexing a block, committing, storing the contents of the index directory
> in a zip file, invoking deleteAll(), and repeating. Believe it or not, the
> performance is not terrible. (Currently getting about 20% of the throughput
> I see with regular indexing.)
> >
> > Regardless of my serialization shenanigans above, I've found that
> performance degrades over time for the process, as it spends more time
> allocating and freeing memory. Analyzing some heap dumps, it's because
> FieldInfos.byNumber is getting bigger and bigger. IndexWriter.deleteAll()
> doesn't truly reset state. Specifically, it calls
> globalFieldNumberMap.clear(), which clears all of the FieldNumbers
> collections, but it doesn't reset lowestUnassignedFieldNumber. So, that
> number keeps counting up, and new instances of FieldInfos allocate larger
> and larger arrays (and only use the top indices).
> >
> > Has anyone else encountered this? Can I open an issue for resetting
> lowestUnassignedFieldNumber in FieldNumbers.clear()? Is there any risk in
> doing so?
> >
> > (For my specific use-case, I would be okay with not clearing
> globalFieldNumberMap at all, since the set of fields is bounded, but
> assigning new field numbers is probably among the least of my costs.)
>
>
>


Re: Possible resource leak in IndexWriter.deleteAll()/FieldNumbers.clear()

2020-11-18 Thread Michael Froh
Thanks David!

Created https://issues.apache.org/jira/browse/LUCENE-9617 and posted a PR:
https://github.com/apache/lucene-solr/pull/2088

On Wed, Nov 18, 2020 at 10:26 AM David Smiley  wrote:

> Thanks for sharing the background of your indexing serialization
> shenanigans :-) -- interesting.
>
> I think IndexWriter.deleteAll() should ultimately reset
> lowestUnassignedFieldNumber.  globalFieldNumberMap.clear() is only called
> by deleteAll, so this simple proposal makes sense to me.  File a JIRA issue.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Nov 18, 2020 at 1:17 PM Michael Froh  wrote:
>
>> I have some code that is kind of abusing IndexWriter.deleteAll(). In
>> short, I'm basically experimenting with using tiny (one block of joined
>> parent/child documents) indexes as a serialized format to index on one
>> fleet and then merge these tiny indexes on another fleet. I'm doing this by
>> indexing a block, committing, storing the contents of the index directory
>> in a zip file, invoking deleteAll(), and repeating. Believe it or not, the
>> performance is not terrible. (Currently getting about 20% of the throughput
>> I see with regular indexing.)
>>
>> Regardless of my serialization shenanigans above, I've found that
>> performance degrades over time for the process, as it spends more time
>> allocating and freeing memory. Analyzing some heap dumps, it's because
>> FieldInfos.byNumber is getting bigger and bigger. IndexWriter.deleteAll()
>> doesn't truly reset state. Specifically, it calls
>> globalFieldNumberMap.clear(), which clears all of the FieldNumbers
>> collections, but it doesn't reset lowestUnassignedFieldNumber. So, that
>> number keeps counting up, and new instances of FieldInfos allocate larger
>> and larger arrays (and only use the top indices).
>>
>> Has anyone else encountered this? Can I open an issue for resetting
>> lowestUnassignedFieldNumber in FieldNumbers.clear()? Is there any risk in
>> doing so?
>>
>> (For my specific use-case, I would be okay with not clearing
>> globalFieldNumberMap at all, since the set of fields is bounded, but
>> assigning new field numbers is probably among the least of my costs.)
>>
>


Processing query clause combinations at indexing time

2020-12-14 Thread Michael Froh
My team at work has a neat feature that we've built on top of Lucene that
has provided a substantial (20%+) increase in maximum qps and some
reduction in query latency.

Basically, we run a training process that looks at historical queries to
find frequently co-occurring combinations of required clauses, say "+A +B
+C +D". Then at indexing time, if a document satisfies one of these known
combinations, we add a new term to the doc, like "opto:ABCD". At query
time, we can then replace the required clauses with a single TermQuery for
the "optimized" term.

It adds a little bit of extra work at indexing time and requires the
offline training step, but we've found that it yields a significant boost
at query time.

We're interested in open-sourcing this feature. Is it something worth
adding to Lucene? Since it doesn't require any core changes, maybe as a
module?


Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
Huh... I didn't know about Luwak / the monitoring module. I spent some time
this morning going through it. It takes a very different approach to
matching at indexing time versus what we did, and looks more powerful.
Given that document-matching is one of the harder steps in the process, I'm
quite happy to leverage something that already exists.

The feature we built has two other parts -- an offline training piece and a
query-optimizing piece. They share a QueryVisitor that collects required
clauses. The training step identifies frequently co-occurring combinations
of required clauses (using an FP-Growth implementation) and the query
optimizer adds a matching TermQuery as a filter clause (and removes the
replaced clauses, if they're non-scoring). They're pretty lightweight
compared to document-matching, though.

On Tue, Dec 15, 2020 at 7:41 AM Michael Sokolov  wrote:

> I feel like there could be some considerable overlap with features
> provided by Luwak, which was contributed to Lucene fairly recently,
> and I think does the query inversion work required for this; maybe
> more of it already exists here? I don't know if that module handles
> the query rewriting, or the term indexing you're talking about though.
>
> On Mon, Dec 14, 2020 at 11:25 PM Atri Sharma  wrote:
> >
> > +1
> >
> > I would suggest that this be an independent project hosted on Github
> (there have been similar projects in the past that have seen success that
> way)
> >
> > On Tue, 15 Dec 2020, 09:37 David Smiley,  wrote:
> >>
> >> Great optimization!
> >>
> >> I'm dubious on it being a good contribution to Lucene itself however,
> because what you propose fits cleanly above Lucene.  Even at a ES/Solr
> layer (which I know you don't use, but hypothetically speaking), I'm
> dubious there as well.
> >>
> >> ~ David Smiley
> >> Apache Lucene/Solr Search Developer
> >> http://www.linkedin.com/in/davidwsmiley
> >>
> >>
> >> On Mon, Dec 14, 2020 at 2:37 PM Michael Froh  wrote:
> >>>
> >>> My team at work has a neat feature that we've built on top of Lucene
> that has provided a substantial (20%+) increase in maximum qps and some
> reduction in query latency.
> >>>
> >>> Basically, we run a training process that looks at historical queries
> to find frequently co-occurring combinations of required clauses, say "+A
> +B +C +D". Then at indexing time, if a document satisfies one of these
> known combinations, we add a new term to the doc, like "opto:ABCD". At
> query time, we can then replace the required clauses with a single
> TermQuery for the "optimized" term.
> >>>
> >>> It adds a little bit of extra work at indexing time and requires the
> offline training step, but we've found that it yields a significant boost
> at query time.
> >>>
> >>> We're interested in open-sourcing this feature. Is it something worth
> adding to Lucene? Since it doesn't require any core changes, maybe as a
> module?
>
>
>


Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
It's conceptually similar to CommonGrams in the single-field case, though
it doesn't require terms to appear in any particular positions.

It's also able to match across fields, which is where we get a lot of
benefit. We have frequently-occurring filters that get added by various
front-end layers before they hit us (which vary depending on where the
query comes from). In that regard, it's kind of like Solr's filter cache,
except that we identify the filters offline by analyzing query logs, find
common combinations of filters (especially ones where the intersection is
smaller than the smallest term's postings list), and cache the filters in
the index the next time we reindex.
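
To make the indexing-time side concrete, here is a minimal sketch of tagging
a document that satisfies a known combination. The field name "opto", the
term encoding, and the hard-coded combination are purely illustrative, not
what we actually ship:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;

public final class CombinationTagger {

  // If the document matches the known combination "+color:red +size:large",
  // add one synthetic term so the combination can later be answered with a
  // single TermQuery on the "opto" field.
  public static void tagIfMatches(Document doc) {
    boolean matchesCombination =
        "red".equals(doc.get("color")) && "large".equals(doc.get("size"));
    if (matchesCombination) {
      doc.add(new StringField("opto", "red_large", Field.Store.NO));
    }
  }
}

At query time, a request whose required clauses include color:red and
size:large can then be served by a single TermQuery on opto:red_large
instead of intersecting the two postings lists.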

On Tue, Dec 15, 2020 at 9:10 AM Robert Muir  wrote:

> See also commongrams which is a very similar concept:
>
> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
>
> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir  wrote:
> >
> > I wonder if it can be done in a fairly clean way. This sounds similar
> > to using a ShingleFilter to do this optimization, but adding some
> > conditionals so that the index is smaller? Now that we have
> > ConditionalTokenFilter (for branching), can the feature be implemented
> > cleanly?
> >
> > Ideally it wouldn't require a lot of new code, something like checking
> > a "set" + conditionaltokenfilter + shinglefilter?
> >
> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh  wrote:
> > >
> > > My team at work has a neat feature that we've built on top of Lucene
> that has provided a substantial (20%+) increase in maximum qps and some
> reduction in query latency.
> > >
> > > Basically, we run a training process that looks at historical queries
> to find frequently co-occurring combinations of required clauses, say "+A
> +B +C +D". Then at indexing time, if a document satisfies one of these
> known combinations, we add a new term to the doc, like "opto:ABCD". At
> query time, we can then replace the required clauses with a single
> TermQuery for the "optimized" term.
> > >
> > > It adds a little bit of extra work at indexing time and requires the
> offline training step, but we've found that it yields a significant boost
> at query time.
> > >
> > > We're interested in open-sourcing this feature. Is it something worth
> adding to Lucene? Since it doesn't require any core changes, maybe as a
> module?
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Re: Processing query clause combinations at indexing time

2020-12-15 Thread Michael Froh
We don't handle positional queries in our use-case, but that's just because
we don't happen to have many positional queries. But if we identify
documents at indexing time that match a given phrase/slop/etc. query,
then we can tag the documents with a term that indicates that (or, more
likely, tag documents that match that positional query AND some other
queries). We can identify documents that match a PhraseQuery, for example,
by appending a TokenFilter for the relevant field that "listens" for
the given phrase.

Our use-case has only needed TermQuery, numeric range queries, and
ToParentBlockJoinQuery clauses so far, though. For TermQuery, we can just
listen for individual terms (with a TokenFilter). For range queries, we
look at the IndexableField itself (typically an IntPoint) before submitting
the Document to the IndexWriter. For a ToParentBlockJoinQuery, we can just
apply the matching logic to each child document to detect a match before we
get to the parent. The downside is that for each Query type that we want to
be able to evaluate at indexing time, we need to add explicit support.
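
As a hedged sketch of the "listening" idea (class and method names are
invented, not our actual code), a pass-through TokenFilter can record
whether a target term was seen while a field is analyzed; the indexing code
keeps a reference to the filter and checks it after the field has been
consumed. Phrase listening works the same way, except the filter also has
to track the positions of consecutive matches:

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class TermListeningFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final String targetTerm;
  private boolean seen;

  public TermListeningFilter(TokenStream input, String targetTerm) {
    super(input);
    this.targetTerm = targetTerm;
  }

  // Pass every token through unchanged; just remember if the target showed up.
  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    if (termAtt.toString().equals(targetTerm)) {
      seen = true;
    }
    return true;
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    seen = false;
  }

  public boolean wasSeen() {
    return seen;
  }
}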

We're not scoring at matching time (relying on a static sort instead),
which allows us to remove the matched clauses altogether. That said, if the
match set of the conjunction of required clauses is small (at least smaller
than the match sets of the individual clauses), adding a "precomputed
intersection" filter should advance scorers more efficiently.

Does Lucene's filter caching match on subsets of required clauses? So, for
example, if some queries contain (somewhere in a BooleanQuery tree) clauses
that flatten to "+A +B +C", can I cache that and also have it kick in for a
BooleanQuery containing "+A +B +C +D", turning it into something like
"+cached('+A +B +C') +D" without having to explicitly do a cache lookup for
"+A +B +C"?

I guess another advantage of our approach is that it's effectively a
write-through cache, pushing the filter-matching burden to indexing time.
For read-heavy use-cases, that trade-off is worth it.




On Tue, Dec 15, 2020 at 3:42 PM Robert Muir  wrote:

> What are you doing with positional queries though? And how does the
> scoring work (it is unclear from your previous reply to me whether you
> are scoring).
>
> Lucene has filter caching too, so if you are doing this for
> non-scoring cases maybe something is off?
>
> On Tue, Dec 15, 2020 at 3:19 PM Michael Froh  wrote:
> >
> > It's conceptually similar to CommonGrams in the single-field case,
> though it doesn't require terms to appear in any particular positions.
> >
> > It's also able to match across fields, which is where we get a lot of
> benefit. We have frequently-occurring filters that get added by various
> front-end layers before they hit us (which vary depending on where the
> query comes from). In that regard, it's kind of like Solr's filter cache,
> except that we identify the filters offline by analyzing query logs, find
> common combinations of filters (especially ones where the intersection is
> smaller than the smallest term's postings list), and cache the filters in
> the index the next time we reindex.
> >
> > On Tue, Dec 15, 2020 at 9:10 AM Robert Muir  wrote:
> >>
> >> See also commongrams which is a very similar concept:
> >>
> https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/commongrams
> >>
> >> On Tue, Dec 15, 2020 at 12:08 PM Robert Muir  wrote:
> >> >
> >> > I wonder if it can be done in a fairly clean way. This sounds similar
> >> > to using a ShingleFilter to do this optimization, but adding some
> >> > conditionals so that the index is smaller? Now that we have
> >> > ConditionalTokenFilter (for branching), can the feature be implemented
> >> > cleanly?
> >> >
> >> > Ideally it wouldn't require a lot of new code, something like checking
> >> > a "set" + conditionaltokenfilter + shinglefilter?
> >> >
> >> > On Mon, Dec 14, 2020 at 2:37 PM Michael Froh 
> wrote:
> >> > >
> >> > > My team at work has a neat feature that we've built on top of
> Lucene that has provided a substantial (20%+) increase in maximum qps and
> some reduction in query latency.
> >> > >
> >> > > Basically, we run a training process that looks at historical
> queries to find frequently co-occurring combinations of required clauses,
> say "+A +B +C +D". Then at indexing time, if a document satisfies one of
> these known combinations, we add

Re: [VOTE] Release Lucene/Solr 8.6.0 RC1

2020-07-14 Thread Michael Froh
+1 (Non-binding)

Upgraded Amazon Product Search to this RC and found no issues.

On Fri, Jul 10, 2020 at 5:03 AM Namgyu Kim  wrote:

> +1 SUCCESS! [1:25:53.314724]
>
> On Fri, Jul 10, 2020 at 2:22 PM Tomás Fernández Löbbe <
> tomasflo...@gmail.com> wrote:
>
>> +1
>>
>> SUCCESS! [1:04:02.550893]
>>
>> On Thu, Jul 9, 2020 at 12:36 PM Michael Sokolov 
>> wrote:
>>
>>> +1
>>>
>>> SUCCESS! [0:59:20.777306]
>>>
>>> (tested on Graviton ARM processor)
>>>
>>> On Thu, Jul 9, 2020 at 1:10 PM Anshum Gupta 
>>> wrote:
>>> >
>>> > +1
>>> >
>>> > SUCCESS! [1:15:03.975368]
>>> >
>>> > On Wed, Jul 8, 2020 at 1:56 AM Bruno Roustant <
>>> bruno.roust...@gmail.com> wrote:
>>> >>
>>> >> Please vote for release candidate 1 for Lucene/Solr 8.6.0
>>> >>
>>> >> The artifacts can be downloaded from:
>>> >>
>>> >>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.0-RC1-reva9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc
>>> >>
>>> >> You can run the smoke tester directly with this command:
>>> >>
>>> >> python3 -u dev-tools/scripts/smokeTestRelease.py \
>>> >>
>>> https://dist.apache.org/repos/dist/dev/lucene/lucene-solr-8.6.0-RC1-reva9c5fb0da2dfc8c7375622c80dbf1a0cc26f44dc
>>> >>
>>> >> The vote will be open for at least 72 hours i.e. until 2020-07-11
>>> 09:00 UTC.
>>> >>
>>> >> [ ] +1  approve
>>> >> [ ] +0  no opinion
>>> >> [ ] -1  disapprove (and reason why)
>>> >>
>>> >> Here is my +1
>>> >
>>> >
>>> >
>>> > --
>>> > Anshum Gupta
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>>> For additional commands, e-mail: dev-h...@lucene.apache.org
>>>
>>>


[jira] [Created] (SOLR-5330) PerSegmentSingleValuedFaceting overwrites facet values

2013-10-10 Thread Michael Froh (JIRA)
Michael Froh created SOLR-5330:
--

 Summary: PerSegmentSingleValuedFaceting overwrites facet values
 Key: SOLR-5330
 URL: https://issues.apache.org/jira/browse/SOLR-5330
 Project: Solr
  Issue Type: Bug
Affects Versions: 4.2.1
Reporter: Michael Froh


I recently tried enabling facet.method=fcs for one of my indexes and found a 
significant performance improvement (with a large index, many facet values, and 
near-realtime updates). Unfortunately, the results were also wrong. 
Specifically, some facet values were being partially overwritten by other facet 
values. (That is, if I expected facet values like "abcdef" and "123", I would 
get a value like "123def".)

Debugging through the code, it looks like the problem was in 
PerSegmentSingleValuedFaceting, specifically in the getFacetCounts method, when 
BytesRef val is shallow-copied from the temporary per-segment BytesRef. The 
byte array assigned to val is shared with the byte array for seg.tempBR, and is 
overwritten a few lines down by the call to seg.tenum.next().

I managed to fix it locally by replacing the shallow copy with a deep copy.
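
For anyone unfamiliar with the BytesRef reuse contract, here is a minimal
sketch of the difference (the general pattern, not the actual fix): a
shallow copy keeps sharing the enum's backing array, so the next call to
seg.tenum.next() can clobber it, while BytesRef.deepCopyOf makes an
independent copy.

import org.apache.lucene.util.BytesRef;

class BytesRefCopyExample {

  // Unsafe: the returned value shares the reused byte[] and will be
  // overwritten as soon as the TermsEnum refills that buffer.
  static BytesRef shallowCopy(BytesRef reused) {
    BytesRef val = new BytesRef();
    val.bytes = reused.bytes;
    val.offset = reused.offset;
    val.length = reused.length;
    return val;
  }

  // Safe: an independent copy that survives further TermsEnum iteration.
  static BytesRef deepCopy(BytesRef reused) {
    return BytesRef.deepCopyOf(reused);
  }
}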

While I encountered this problem on Solr 4.2.1, I see that the code is 
identical in 4.5. Unless the behavior of TermsEnum.next() has changed, I 
believe this bug still exists.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (SOLR-5330) PerSegmentSingleValuedFaceting overwrites facet values

2013-10-10 Thread Michael Froh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SOLR-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Froh updated SOLR-5330:
---

Attachment: solr-5330.patch

Patch attached

> PerSegmentSingleValuedFaceting overwrites facet values
> --
>
> Key: SOLR-5330
> URL: https://issues.apache.org/jira/browse/SOLR-5330
> Project: Solr
>  Issue Type: Bug
>Affects Versions: 4.2.1
>    Reporter: Michael Froh
> Attachments: solr-5330.patch
>
>
> I recently tried enabling facet.method=fcs for one of my indexes and found a 
> significant performance improvement (with a large index, many facet values, 
> and near-realtime updates). Unfortunately, the results were also wrong. 
> Specifically, some facet values were being partially overwritten by other 
> facet values. (That is, if I expected facet values like "abcdef" and "123", I 
> would get a value like "123def".)
> Debugging through the code, it looks like the problem was in 
> PerSegmentSingleValuedFaceting, specifically in the getFacetCounts method, 
> when BytesRef val is shallow-copied from the temporary per-segment BytesRef. 
> The byte array assigned to val is shared with the byte array for seg.tempBR, 
> and is overwritten a few lines down by the call to seg.tenum.next().
> I managed to fix it locally by replacing the shallow copy with a deep copy.
> While I encountered this problem on Solr 4.2.1, I see that the code is 
> identical in 4.5. Unless the behavior of TermsEnum.next() has changed, I 
> believe this bug still exists.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2012-06-08 Thread Michael Froh (JIRA)
Michael Froh created SOLR-3526:
--

 Summary: Remove classfile dependency on ZooKeeper from 
CoreContainer
 Key: SOLR-3526
 URL: https://issues.apache.org/jira/browse/SOLR-3526
 Project: Solr
  Issue Type: Wish
  Components: SolrCloud
Affects Versions: 4.0
Reporter: Michael Froh


We are using Solr as a library embedded within an existing application, and are 
currently developing toward using 4.0 when it is released.

We are currently instantiating SolrCores with null CoreDescriptors (and hence 
no CoreContainer), since we don't need SolrCloud functionality (and do not want 
to depend on ZooKeeper).

A couple of months ago, SearchHandler was modified to try to retrieve a 
ShardHandlerFactory from the CoreContainer. I was able to work around this by 
specifying a dummy ShardHandlerFactory in the config.

Now UpdateRequestProcessorChain is inserting a DistributedUpdateProcessor into 
my chains, again triggering a NPE when trying to dereference the CoreDescriptor.

I would happily place the SolrCores in CoreContainers, except that 
CoreContainer imports and references org.apache.zookeeper.KeeperException, 
which we do not have (and do not want) in our classpath. Therefore, I get a 
ClassNotFoundException when loading the CoreContainer class.

Ideally (IMHO), ZkController should isolate the ZooKeeper dependency, and 
simply rethrow KeeperExceptions as 
org.apache.solr.common.cloud.ZooKeeperException (or some Solr-hosted checked 
exception). Then CoreContainer could remove the offending import/references.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2012-06-11 Thread Michael Froh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13292689#comment-13292689
 ] 

Michael Froh commented on SOLR-3526:


Oh, thanks a lot for pointing that out, Hoss! I had completely missed that part.

My wish for the removal of the KeeperException reference from CoreContainer 
still stands, but using NoOpDistributingUpdateProcessorFactory lets me remove 
my current hacky solution (adding a dummy org.apache.zookeeper.KeeperException 
in one of my libraries).

> Remove classfile dependency on ZooKeeper from CoreContainer
> ---
>
> Key: SOLR-3526
> URL: https://issues.apache.org/jira/browse/SOLR-3526
> Project: Solr
>  Issue Type: Wish
>  Components: SolrCloud
>Affects Versions: 4.0
>Reporter: Michael Froh
>
> We are using Solr as a library embedded within an existing application, and 
> are currently developing toward using 4.0 when it is released.
> We are currently instantiating SolrCores with null CoreDescriptors (and hence 
> no CoreContainer), since we don't need SolrCloud functionality (and do not 
> want to depend on ZooKeeper).
> A couple of months ago, SearchHandler was modified to try to retrieve a 
> ShardHandlerFactory from the CoreContainer. I was able to work around this by 
> specifying a dummy ShardHandlerFactory in the config.
> Now UpdateRequestProcessorChain is inserting a DistributedUpdateProcessor 
> into my chains, again triggering a NPE when trying to dereference the 
> CoreDescriptor.
> I would happily place the SolrCores in CoreContainers, except that 
> CoreContainer imports and references org.apache.zookeeper.KeeperException, 
> which we do not have (and do not want) in our classpath. Therefore, I get a 
> ClassNotFoundException when loading the CoreContainer class.
> Ideally (IMHO), ZkController should isolate the ZooKeeper dependency, and 
> simply rethrow KeeperExceptions as 
> org.apache.solr.common.cloud.ZooKeeperException (or some Solr-hosted checked 
> exception). Then CoreContainer could remove the offending import/references.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Created] (LUCENE-4185) CharFilters being added twice in Solr

2012-07-02 Thread Michael Froh (JIRA)
Michael Froh created LUCENE-4185:


 Summary: CharFilters being added twice in Solr
 Key: LUCENE-4185
 URL: https://issues.apache.org/jira/browse/LUCENE-4185
 Project: Lucene - Java
  Issue Type: Bug
  Components: modules/analysis
Affects Versions: 4.0
Reporter: Michael Froh


Debugging one of my test cases, I found that a TokenStream from an Analyzer 
constructed by Solr contains the configured chain of CharFilters twice.

While I may be mistaken, the fix for LUCENE-4142 appears to make the fix for 
LUCENE-3721 unnecessary, and the combination of the fixes results in the 
repeated application of the CharFilters.

I came across this with a test case involving an HTMLStripCharFilter, where the 
input string contains "&lt;h1&gt;". After passing through one HTMLStripCharFilter, 
it becomes "<h1>", and then the HTML is removed by the second filter.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-4185) CharFilters being added twice in Solr

2012-07-02 Thread Michael Froh (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-4185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Froh updated LUCENE-4185:
-

Affects Version/s: (was: 4.0)
   4.0-ALPHA

> CharFilters being added twice in Solr
> -
>
> Key: LUCENE-4185
> URL: https://issues.apache.org/jira/browse/LUCENE-4185
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 4.0-ALPHA
>Reporter: Michael Froh
>
> Debugging one of my test cases, I found that a TokenStream from an Analyzer 
> constructed by Solr contains the configured chain of CharFilters twice.
> While I may be mistaken, the fix for LUCENE-4142 appears to make the fix for 
> LUCENE-3721 unnecessary, and the combination of the fixes results in the 
> repeated application of the CharFilters.
> I came across this with a test case involving an HTMLStripCharFilter, where 
> the input string contains "&lt;h1&gt;". After passing through one 
> HTMLStripCharFilter, it becomes "<h1>", and then the HTML is removed by the 
> second filter.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Comment Edited] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014134#comment-15014134
 ] 

Michael Froh edited comment on SOLR-3526 at 11/19/15 6:53 PM:
--

3.5 years later, I decided to try taking a stab at this myself. In the 
meantime, references to KeeperException and ZooKeeper's Stat class have worked 
their way through more core Solr classes, including SolrCore and RequestParams.

After making these changes, I was able (from a clean work space) to 
successfully run the tests from TestEmbeddedSolrServerConstructors (removing 
the "extends SolrTestCaseJ4") without ZooKeeper on my classpath. (I couldn't 
extend SolrTestCaseJ4, since RevertDefaultThreadHandlerRule references 
org.apache.zookeeper.server.NIOServerCnxnFactory.)

I'm not sure how to add a test to the Solr build that will verify that someone 
is able to bring up an EmbeddedSolrServer and use core features without 
ZooKeeper. Does anyone have any suggestions?


was (Author: msfroh):
3.5 years later, I decided to try taking a stab at this myself. In the 
meantime, references to KeeperException and ZooKeeper's Stat class have worked 
there way through core Solr classes, including SolrCore and RequestParams.

After making these changes, I was able (from a clean work space) to 
successfully run the tests from TestEmbeddedSolrServerConstructors (removing 
the "extends SolrTestCaseJ4") without ZooKeeper on my classpath. (I couldn't 
extend SolrTestCaseJ4, since RevertDefaultThreadHandlerRule references 
org.apache.zookeeper.server.NIOServerCnxnFactory.)

I'm not sure how to add a test to the Solr build that will verify that someone 
is able to bring up an EmbeddedSolrServer and use core features without 
ZooKeeper. Does anyone have any suggestions?

> Remove classfile dependency on ZooKeeper from CoreContainer
> ---
>
> Key: SOLR-3526
> URL: https://issues.apache.org/jira/browse/SOLR-3526
> Project: Solr
>  Issue Type: Wish
>  Components: SolrCloud
>Affects Versions: 4.0-ALPHA
>Reporter: Michael Froh
>
> We are using Solr as a library embedded within an existing application, and 
> are currently developing toward using 4.0 when it is released.
> We are currently instantiating SolrCores with null CoreDescriptors (and hence 
> no CoreContainer), since we don't need SolrCloud functionality (and do not 
> want to depend on ZooKeeper).
> A couple of months ago, SearchHandler was modified to try to retrieve a 
> ShardHandlerFactory from the CoreContainer. I was able to work around this by 
> specifying a dummy ShardHandlerFactory in the config.
> Now UpdateRequestProcessorChain is inserting a DistributedUpdateProcessor 
> into my chains, again triggering a NPE when trying to dereference the 
> CoreDescriptor.
> I would happily place the SolrCores in CoreContainers, except that 
> CoreContainer imports and references org.apache.zookeeper.KeeperException, 
> which we do not have (and do not want) in our classpath. Therefore, I get a 
> ClassNotFoundException when loading the CoreContainer class.
> Ideally (IMHO), ZkController should isolate the ZooKeeper dependency, and 
> simply rethrow KeeperExceptions as 
> org.apache.solr.common.cloud.ZooKeeperException (or some Solr-hosted checked 
> exception). Then CoreContainer could remove the offending import/references.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014134#comment-15014134
 ] 

Michael Froh commented on SOLR-3526:


3.5 years later, I decided to try taking a stab at this myself. In the 
meantime, references to KeeperException and ZooKeeper's Stat class have worked 
their way through core Solr classes, including SolrCore and RequestParams.

After making these changes, I was able (from a clean work space) to 
successfully run the tests from TestEmbeddedSolrServerConstructors (removing 
the "extends SolrTestCaseJ4") without ZooKeeper on my classpath. (I couldn't 
extend SolrTestCaseJ4, since RevertDefaultThreadHandlerRule references 
org.apache.zookeeper.server.NIOServerCnxnFactory.)

I'm not sure how to add a test to the Solr build that will verify that someone 
is able to bring up an EmbeddedSolrServer and use core features without 
ZooKeeper. Does anyone have any suggestions?

> Remove classfile dependency on ZooKeeper from CoreContainer
> ---
>
> Key: SOLR-3526
> URL: https://issues.apache.org/jira/browse/SOLR-3526
> Project: Solr
>  Issue Type: Wish
>  Components: SolrCloud
>Affects Versions: 4.0-ALPHA
>Reporter: Michael Froh
>
> We are using Solr as a library embedded within an existing application, and 
> are currently developing toward using 4.0 when it is released.
> We are currently instantiating SolrCores with null CoreDescriptors (and hence 
> no CoreContainer), since we don't need SolrCloud functionality (and do not 
> want to depend on ZooKeeper).
> A couple of months ago, SearchHandler was modified to try to retrieve a 
> ShardHandlerFactory from the CoreContainer. I was able to work around this by 
> specifying a dummy ShardHandlerFactory in the config.
> Now UpdateRequestProcessorChain is inserting a DistributedUpdateProcessor 
> into my chains, again triggering a NPE when trying to dereference the 
> CoreDescriptor.
> I would happily place the SolrCores in CoreContainers, except that 
> CoreContainer imports and references org.apache.zookeeper.KeeperException, 
> which we do not have (and do not want) in our classpath. Therefore, I get a 
> ClassNotFoundException when loading the CoreContainer class.
> Ideally (IMHO), ZkController should isolate the ZooKeeper dependency, and 
> simply rethrow KeeperExceptions as 
> org.apache.solr.common.cloud.ZooKeeperException (or some Solr-hosted checked 
> exception). Then CoreContainer could remove the offending import/references.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (SOLR-3526) Remove classfile dependency on ZooKeeper from CoreContainer

2015-11-19 Thread Michael Froh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-3526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014167#comment-15014167
 ] 

Michael Froh commented on SOLR-3526:


Also worth highlighting -- the significant part of the change mostly involves 
decorating ZooKeeper calls in SolrZkClient to catch KeeperExceptions and 
rethrow them as appropriately-typed SolrZkCheckedExceptions. I decorated those 
calls by turning them into lambdas. 

So, the change can't easily be backported to 5.x. More significantly, I 
suppose, I changed the method signatures of about a hundred methods, which 
probably prevents backporting anyway. 
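
To make the shape of that decoration concrete, here is a rough sketch (the
names ZkOperation and SolrZkCheckedException follow the description above,
but the exact signatures here are illustrative, not the actual patch):

import org.apache.zookeeper.KeeperException;

// Illustrative stand-in for the wrapper exception type; the real change
// defines its own Solr-hosted exceptions.
class SolrZkCheckedException extends Exception {
  SolrZkCheckedException(String msg, Throwable cause) {
    super(msg, cause);
  }
}

final class ZkCallDecorator {

  // Functional interface so each ZooKeeper call can be passed in as a lambda.
  @FunctionalInterface
  interface ZkOperation<T> {
    T execute() throws KeeperException, InterruptedException;
  }

  // Decorates a ZooKeeper call: KeeperException never escapes this layer, so
  // callers do not need ZooKeeper classes on their classpath.
  static <T> T run(String description, ZkOperation<T> op) throws SolrZkCheckedException {
    try {
      return op.execute();
    } catch (KeeperException e) {
      throw new SolrZkCheckedException("ZooKeeper call failed: " + description, e);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new SolrZkCheckedException("Interrupted during: " + description, e);
    }
  }
}

// Hypothetical call site inside the ZK-aware layer:
//   byte[] data = ZkCallDecorator.run("getData " + path,
//       () -> zooKeeper.getData(path, null, null));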

> Remove classfile dependency on ZooKeeper from CoreContainer
> ---
>
> Key: SOLR-3526
> URL: https://issues.apache.org/jira/browse/SOLR-3526
> Project: Solr
>  Issue Type: Wish
>  Components: SolrCloud
>Affects Versions: 4.0-ALPHA
>Reporter: Michael Froh
>
> We are using Solr as a library embedded within an existing application, and 
> are currently developing toward using 4.0 when it is released.
> We are currently instantiating SolrCores with null CoreDescriptors (and hence 
> no CoreContainer), since we don't need SolrCloud functionality (and do not 
> want to depend on ZooKeeper).
> A couple of months ago, SearchHandler was modified to try to retrieve a 
> ShardHandlerFactory from the CoreContainer. I was able to work around this by 
> specifying a dummy ShardHandlerFactory in the config.
> Now UpdateRequestProcessorChain is inserting a DistributedUpdateProcessor 
> into my chains, again triggering a NPE when trying to dereference the 
> CoreDescriptor.
> I would happily place the SolrCores in CoreContainers, except that 
> CoreContainer imports and references org.apache.zookeeper.KeeperException, 
> which we do not have (and do not want) in our classpath. Therefore, I get a 
> ClassNotFoundException when loading the CoreContainer class.
> Ideally (IMHO), ZkController should isolate the ZooKeeper dependency, and 
> simply rethrow KeeperExceptions as 
> org.apache.solr.common.cloud.ZooKeeperException (or some Solr-hosted checked 
> exception). Then CoreContainer could remove the offending import/references.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org