Re: Query of Death Lucene/Solr 7.6

2019-02-08 Thread Michael Gibney
Hi Markus,
As of 7.6, LUCENE-8531 <https://issues.apache.org/jira/browse/LUCENE-8531>
reverted a graph/Spans-based phrase query implementation (introduced in 6.5
-- LUCENE-7699 <https://issues.apache.org/jira/browse/LUCENE-7699>) to an
implementation that builds a separate phrase query for each possible
enumerated path through the graph described by a parsed query.
The potential for combinatoric explosion of the enumerated approach was (as
far as I can tell) one of the main motivations for introducing the
Spans-based implementation. Some real-world use cases would be good to
explore. Markus, could you send (as an attachment) the debug toString() for
the queries with/without synonyms enabled? I'm also guessing you may have
WordDelimiterGraphFilter on the query analyzer?
As an alternative to disabling pf, LUCENE-8531 only reverts to the
enumerated approach for phrase queries where slop>0, so setting ps=0 would
probably also help.
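
For example (request params only; field names here are hypothetical -- ps=0
and pf are the relevant bits):

  q=...&defType=edismax&qf=title+content&pf=title+content&ps=0
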
Michael

On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma 
wrote:

> Hello (apologies for cross-posting),
>
> While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the
> remaining four, we stumbled upon a situation where the 7.6 nodes quickly
> succumb when a 'Query-of-Death' is issued; 7.2.1 up to 7.5 are all
> unaffected (tested and confirmed).
>
> Following Smiley's suggestion I used Eclipse MAT to find the problem in
> the heap dump I obtained; this fantastic tool revealed within minutes that
> a query thread ate 65% of all resources. In the class variables I could
> find the query, and reproduce the problem.
>
> The problematic query is 'dubbele dijk/rijke dijkproject in het dijktracé
> eemshaven-delfzijl'; on 7.6 this input produces a 40+ MB toString() output
> in edismax' newFieldQuery. If the node survives, it takes 2+ seconds for the
> query to run (150 ms otherwise). If I disable all query-time
> SynonymGraphFilters it still takes a second and produces just a 9 MB
> toString() for the query.
>
> I could not find anything like this in Jira. I did think of LUCENE-8479
> and LUCENE-8531, but they were about graphs; this problem looked related,
> though.
>
> I think I tracked it further down to LUCENE-8589 or SOLR-12243. When I
> leave Solr's edismax' pf parameter empty, everything runs fast. When all
> fields are configured for pf, the node dies.
>
> I am now unsure whether this is a Solr or a Lucene issue.
>
> Please let me know.
>
> Many thanks,
> Markus
>
> P.S. In Solr I even got an 'Impossible Exception', my first!
>


Re: Query of Death Lucene/Solr 7.6

2019-02-22 Thread Michael Gibney
Ah... I think there are two issues likely at play here. One is LUCENE-8531
<https://issues.apache.org/jira/browse/LUCENE-8531>, which reverted the
Spans-based implementation due to a bug related to SpanNearQuery semantics,
causing possible query paths to be enumerated up front. Setting ps=0
(although perhaps not appropriate for some use cases) should address
problems related to this issue.

The other (likely affecting Gregg, for whom ps=0 did not help) is SOLR-12243
<https://issues.apache.org/jira/browse/SOLR-12243>. Prior to 7.6,
SpanNearQueries (generated for relatively complex "graph" tokenized queries,
such as would be generated with WDGF, SynonymGraphFilter, etc.) were simply
getting dropped. This was surely a bug, in that pf did not contribute at
all to boosting such queries; but the silver lining was that performance
was great ;-)

Markus, Gregg, could you send examples (parsed query toString()) of problematic
queries (and perhaps relevant analysis chain configs)?
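
(For reference, that output can be captured by adding debug=query to a
request, e.g.:

  select?q=...&defType=edismax&debug=query

and copying the "parsedquery_toString" entry from the debug section.)
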

Michael



On Fri, Feb 22, 2019 at 11:00 AM Gregg Donovan  wrote:

> FWIW: we have also seen serious Query of Death issues after our upgrade to
> Solr 7.6. Are there any open issues we can watch? Are Markus' findings
> around `pf` our best guess? We've seen these issues even with ps=0. We also
> use the WDF.
>
> On Fri, Feb 22, 2019 at 8:58 AM Markus Jelsma 
> wrote:
>
> > Hello Michael,
> >
> > Sorry it took so long to get back to this, too many things to do.
> >
> > Anyway, yes, we have WDF on our query-time analysers. I uploaded two log
> > files, both the same query of death, with and without the synonym filter
> enabled.
> >
> > https://mail.openindex.io/export/solr-8983-console.log 23 MB
> > https://mail.openindex.io/export/solr-8983-console-without-syns.log 1.9
> MB
> >
> > Without the synonym we still see a huge number of entries. Many different
> > parts of our analyser chain contribute to the expansion of queries, but pf
> > itself really turns the problem on or off.
> >
> > Since SOLR-12243 is new in 7.6, does anyone know whether it could
> > have this side-effect?
> >
> > Thanks,
> > Markus
> >
> >
> > -----Original message-----
> > > From:Michael Gibney 
> > > Sent: Friday 8th February 2019 17:19
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Query of Death Lucene/Solr 7.6
> > >
> > > Hi Markus,
> > > As of 7.6, LUCENE-8531 <
> > https://issues.apache.org/jira/browse/LUCENE-8531>
> > > reverted a graph/Spans-based phrase query implementation (introduced in
> > 6.5
> > > -- LUCENE-7699 ) to
> > an
> > > implementation that builds a separate phrase query for each possible
> > > enumerated path through the graph described by a parsed query.
> > > The potential for combinatoric explosion of the enumerated approach was
> > (as
> > > far as I can tell) one of the main motivations for introducing the
> > > Spans-based implementation. Some real-world use cases would be good to
> > > explore. Markus, could you send (as an attachment) the debug toString()
> > for
> > > the queries with/without synonyms enabled? I'm also guessing you may
> have
> > > WordDelimiterGraphFilter on the query analyzer?
> > > As an alternative to disabling pf, LUCENE-8531 only reverts to the
> > > enumerated approach for phrase queries where slop>0, so setting ps=0
> > would
> > > probably also help.
> > > Michael
> > >
> > > On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma <
> markus.jel...@openindex.io
> > >
> > > wrote:
> > >
> > > > Hello (apologies for cross-posting),
> > > >
> > > > While working on SOLR-12743, using 7.6 on two nodes and 7.2.1 on the
> > > > remaining four, we stumbled upon a situation where the 7.6 nodes
> > quickly
> > > > succumb when a 'Query-of-Death' is issued, 7.2.1 up to 7.5 are all
> > > > unaffected (tested and confirmed).
> > > >
> > > > Following Smiley's suggestion i used Eclipse MAT to find the problem
> in
> > > > the heap dump i obtained, this fantastic tool revealed within minutes
> > that
> > > > a query thread ate 65 % of all resources, in the class variables i
> > could
> > > > find the the query, and reproduce the problem.
> > > >
> > > > The problematic query is 'dubbele dijk/rijke dijkproject in het
> > dijktracé
> > > > eemshaven-delfzijl', on 7.6 this input produces a 40+ MB toString()
> > output
> > > > in edismax' newFieldQuery. If the node survives it takes 2+ seconds
> > for the
> > > > query to run (150 ms otherwise). If i disable all query time
> > > > SynonymGraphFilters it still takes a second and produces just a 9 MB
> > > > toString() for the query.
> > > >
> > > > I could not find anything like this in Jira. I did think of
> LUCENE-8479
> > > > and LUCENE-8531 but they were about graphs, this problem looked
> related
> > > > though.
> > > >
> > > > I think i tracked it further down to LUCENE-8589 or SOLR-12243. When
> i
> > > > leave Solr's edismax' pf parameter empty, everything runs fast. When
> > all
> > > > fields are configured for pf, the node dies.
> > > >
> > > > I am now unsure whether this is a Solr or a Lucene issue.

Re: Is anyone using proxy caching in front of solr?

2019-02-25 Thread Michael Gibney
Tangentially related, possibly of interest regarding solr-internal cache
hit ratio (esp. with a lot of replicas):
https://issues.apache.org/jira/browse/SOLR-13257

On Mon, Feb 25, 2019 at 11:33 AM Walter Underwood 
wrote:

> Don’t worry about one- and two-character queries, because they will almost
> always be served from cache.
>
> There are only 26 one-letter queries (36 if you use numbers). Almost all
> of those will be in the query results cache and will be very fast with very
> little server load. The common two-letter queries will also be cached.
>
> An external HTTP cache can be effective, especially if you have a lot of
> replicas. The single cache will have a higher hit rate than the individual
> servers.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 25, 2019, at 7:57 AM, Edward Ribeiro 
> wrote:
> >
> > Maybe you could add a length filter factory to filter out queries with 2
> or
> > 3 characters using
> >
> https://lucene.apache.org/solr/guide/7_4/filter-descriptions.html#FilterDescriptions-LengthFilter
> > ?
> >
> > PS: this filter requires a max length too.
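> >
> > A minimal example (min/max values illustrative):
> >
> >   <filter class="solr.LengthFilterFactory" min="4" max="255"/>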
> >
> > Edward
> >
> > On Thu, Feb 21, 2019 at 04:52, Furkan KAMACI
> > wrote:
> >
> >> Hi Joakim,
> >>
> >> I suggest you to read these resources:
> >>
> >> http://lucene.472066.n3.nabble.com/Varnish-td4072057.html
> >> http://lucene.472066.n3.nabble.com/SolrJ-HTTP-caching-td490063.html
> >> https://wiki.apache.org/solr/SolrAndHTTPCaches
> >>
> >> which gives information about HTTP Caching including Varnish Cache,
> >> Last-Modified, ETag, Expires, Cache-Control headers.
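> >>
> >> For example, HTTP caching can be enabled in solrconfig.xml with something
> >> like (values illustrative):
> >>
> >>   <httpCaching never304="false" lastModifiedFrom="openTime" etagSeed="Solr">
> >>     <cacheControl>max-age=30, public</cacheControl>
> >>   </httpCaching>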
> >>
> >> Kind Regards,
> >> Furkan KAMACI
> >>
> >> On Wed, Feb 20, 2019 at 11:18 PM Joakim Hansson <
> >> joakim.hansso...@gmail.com>
> >> wrote:
> >>
> >>> Hello dear user list!
> >>> I work at a company in retail where we use Solr to perform searches as
> >>> you type. As soon as you type more than 1 character in the search
> >>> field, Solr starts serving hits.
> >>> Of course this generates a lot of "unnecessary" queries (in the sense
> >> that
> >>> they are never shown to the user) which is why I started thinking about
> >>> using something like squid or varnish to cache a bunch of these 2-4
> >>> character queries.
> >>>
> >>> It seems most stuff I find about it is from pretty old sources, but as
> >> far
> >>> as I know solrcloud doesn't have distributed cache support.
> >>>
> >>> Our indexes aren't updated that frequently, about 4 - 6 times a day. We
> >>> don't use a lot of shards and replicas (biggest index is split to 3
> >> shards
> >>> with 2 replicas). All shards/replicas are not on the same solr host.
> >>> Our solr setup handles around 80-200 queries per second during the day
> >> with
> >>> peaks at >1500 before holiday season and sales.
> >>>
> >>> I haven't really read up on the details yet, but it seems like I could
> >>> use ETags and Expires headers to work around having to do some of that
> >>> "unnecessary" work.
> >>>
> >>> Is anyone doing this? Why? Why not?
> >>>
> >>> - peace!
> >>>
> >>
>
>


Re: ExactStatsCache not working for distributed IDF

2019-03-14 Thread Michael Gibney
Are you basing your conclusion (that it's not working as expected) on the
scores as reported in the debug output? If you haven't already, try adding
"score" to the "fl" param -- if different (for a given doc) than the score
as reported in debug, then it's probably working as intended ... just a
little confusing in the debug output.
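
For example (any matching query will do):

  select?q=...&fl=id,score&debug=results

makes it easy to compare the returned score against the one shown in the
debug explanation.
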

On Thu, Mar 14, 2019 at 3:23 PM Arnold Bronley 
wrote:

> Hi,
>
> I am using ExactStatsCache in SolrCloud (7.7.1) by adding the following to
> the solrconfig.xml file for all collections. I restarted and indexed the
> documents of all collections after this change, just to be sure.
>
> <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
>
> However, when I do a multi-collection query, the scores do not change before
> and after adding ExactStatsCache. I can still see the docCount in the debug
> output coming from individual shards, not from the whole collection. I was
> expecting the docCount to be the sum of the docCounts of all collections
> included in the search query.
>
> Do you know what I might be doing wrong?
>


Re: Solr index slow response

2019-03-19 Thread Michael Gibney
I'll second Emir's suggestion to try disabling swap. "I doubt swap would
affect it since there is such huge free memory." -- sounds reasonable, but
has not been my experience, and the stats you sent indicate that swap is in
fact being used. Also, note that in many cases setting vm.swappiness=0 is
not equivalent to disabling swap (i.e., swapoff -a). If you're inclined to
try disabling swap, verify that it's successfully disabled by checking (and
re-checking) actual swap usage (that may sound obvious or trivial, but
relying on possibly-incorrect assumptions related to amount of free memory,
swappiness, etc. can be misleading). Good luck!
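
(On a typical Linux box, for example -- exact commands may vary by
distribution:

  sudo swapoff -a
  swapon --show   # should print nothing
  free -m         # the Swap row should show 0 used
)
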

On Tue, Mar 19, 2019 at 10:29 AM Walter Underwood 
wrote:

> Indexing is CPU bound. If you have enough RAM, SSD disks, and enough
> client threads, you should be able to drive CPU to over 90%.
>
> Start with two client threads per CPU. That allows one thread to be
> sending data over the network while another is waiting for Solr to process
> the batch.
>
> A couple of years ago, I was indexing a million docs per minute into a
> Solr Cloud cluster. I think that was four shards on instances with 16 CPUs,
> so it was 64 CPUs available for indexing. That was with Java 8, G1GC, and 8
> GB of heap.
>
> Your documents are averaging about 50 kbytes, which is pretty big. Our
> documents average about 3.5 kbytes. A lot of the indexing work is handling
> the text, so those larger documents would be at least 10X slower than ours.
>
> Are you doing atomic updates? That would slow things down a lot.
>
> If you want to use G1GC, use the configuration I sent earlier.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Mar 19, 2019, at 7:05 AM, Bernd Fehling <
> bernd.fehl...@uni-bielefeld.de> wrote:
> >
> > Isn't there something about large page tables which must be enabled
> > in Java and also supported by the OS for such huge heaps?
> >
> > Just a guess.
> >
> > Am 19.03.19 um 15:01 schrieb Jörn Franke:
> >> It could be an issue with JDK 8, which may not be suitable for such large
> heaps. Have more nodes with smaller heaps (e.g., 31 GB).
> >>> Am 18.03.2019 um 11:47 schrieb Aaron Yingcai Sun :
> >>>
> >>> Hello, Solr!
> >>>
> >>>
> >>> We are having a performance issue when we try to send documents to
> Solr for indexing. The response time is very slow and unpredictable at times.
> >>>
> >>>
> >>> Solr is running on a quite powerful server: 32 CPUs, 400 GB RAM,
> with 300 GB reserved for Solr. While this is happening, CPU usage is
> around 30% and memory usage 34%; I/O also looks OK according to iotop. SSD disk.
> >>>
> >>>
> >>> Our application sends 100 documents to Solr per request, JSON-encoded;
> the size is around 5 MB each time. Sometimes the response time is under 1
> second, sometimes it can be 300 seconds; the slow responses happen very
> often.
> >>>
> >>>
> >>> "Soft AutoCommit: disabled", "Hard AutoCommit: if uncommited for
> 360ms; if 100 uncommited docs"
> >>>
> >>>
> >>> There are around 100 clients sending those documents at the same time,
> but each client makes a blocking call which waits for the HTTP response
> before sending the next one.
> >>>
> >>>
> >>> I tried to make the number of documents in one request smaller, such
> as 20, but I still see slow response times, like 80 seconds.
> >>>
> >>>
> >>> Could you give some hints on how to improve the response time? Solr
> does not seem very loaded; there must be a way to make the responses faster.
> >>>
> >>>
> >>> BRs
> >>>
> >>> //Aaron
> >>>
> >>>
> >>>
>
>


Re: Query of death? Collapsing Query Parser - Solr 7.5

2019-03-26 Thread Michael Gibney
Would you be willing to share your query-time analysis chain config, and
perhaps the "debug=true" (or "debug=query") output for successful queries
of a similar nature to the problematic ones? Also, re: "only times out on
extreme queries" -- what do you consider to be an "extreme query", in this
context?

On Mon, Mar 25, 2019 at 10:06 PM IZaBEE_Keeper 
wrote:

> Hi..
>
> I'm wondering if I've found a query of death or just a really expensive
> query.. It's killing my solr with OOM..
>
> Collapsing query parser using:
> fq={!collapse field=domain nullPolicy=expand}
>
> Everything works fine using words & phrases.. However as soon as there are
> numbers involved it crashes out with OOM Killer..
>
> The server has nowhere near enough ram for the index of 800GB & 150M docs..
>
> But a dismax query like '1 2 s 2 s 3 e d 4 r f 3 e s 7 2 1 4 6 7 8 2 9 0 3'
> will make it crash..
>
> fq={!collapse field=domain nullPolicy=expand}
> PhraseFields( 'content^0.05 description^0.03 keywords^0.03 title^0.05
> url^0.06' )
> BoostQuery( 'host:"' . $q . '"^0.6 host:"twitter.com"^0.35 domain:"' . $q
> .
> '"^0.6' )
>
> Without the fq it works just fine and only times out on extreme queries..
> eventually it finds them..
>
> Do I just need more ram or is there another way to prevent solr from
> crashing?
>
> Solr 7.5 24GB ram 16gb heap with ssd lv..
>
>
>
> -
> Bee Keeper at IZaBEE.com
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: CommonTerms & slow queries

2019-03-29 Thread Michael Gibney
Can you post the query that's actually built for some of these inputs
("parsedquery" or "parsedquery_toString" output included for requests with
"debug=query" parameter)? What is performance like if you turn off pf
(i.e., no implicit phrase searching)?
Michael

On Fri, Mar 29, 2019 at 11:53 AM Erie Data Systems 
wrote:

> Using Solr 8.0.0, single instance, single core, 50m records (38gb  index)
> on one SSD, 96gb ram, 16 cores CPU
>
> Most queries run very fast (<1 sec); however, we have noticed that queries
> containing "common" words are quite slow, sometimes 10+ sec. We're currently
> using edismax with 2 text_general fields, qf and pf, qs=0, ps=0.
>
> I came across these which describe the issue.
>
> https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
>
>
> https://lucene.apache.org/core/5_5_3/queries/org/apache/lucene/queries/CommonTermsQuery.html
>
> Test queries with issues :
> 1. things to do in seattle with eric
> 2. year of the cat
> 3. time of my life
> 4. when will i be loved
> 5. once upon a time in the west
>
> Stopwords are not an option: in the case of #2, if 'of' and 'the' are removed
> it essentially destroys relevance. Is there a commonly suggested solution to
> what would seem to be a common issue, besides adding stopwords?
>
> Thank you.
> Craig Stadler
>


Re: CommonTerms & slow queries

2019-03-29 Thread Michael Gibney
You might take a look at CommonGramsFilter (
https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-CommonGramsFilter),
especially if you're either not using pf, or if ps=0. An absolute setting
of mm=2 strikes me as unusual (though quite possibly appropriate for your
use case). mm=2 would force scoring of all docs for which >=2 terms match,
which for any query containing the words "a" and "the" for example, could
easily be the majority of the index.
Another thought, re: single-core: sharding would allow you to effectively
parallelize query processing to a certain extent, which I expect might
speed things up for your use case.
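
A sketch of the relevant config (the words file name is hypothetical):

  <!-- index analyzer -->
  <filter class="solr.CommonGramsFilterFactory" words="commonwords.txt" ignoreCase="true"/>

  <!-- query analyzer -->
  <filter class="solr.CommonGramsQueryFilterFactory" words="commonwords.txt" ignoreCase="true"/>
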

On Fri, Mar 29, 2019 at 1:13 PM Erie Data Systems 
wrote:

> Michael,
>
>
> select/?&rows=12&qf=title+description&q=once+upon+a+time+in+the+west&fl=*&hl=true&hl.field=desc&hl.fragsize=250&hl.maxAnalyzedChars=20&ps=1&qs=1&df=title&mm=2&defType=edismax&debugQuery=off&indent=on&wt=json&debug=true
> "rawquerystring":"once upon a time in the west",
> "querystring":"once upon a time in the west",
> "parsedquery":"+(DisjunctionMaxQuery((description:once | title:once))
> DisjunctionMaxQuery((description:upon | title:upon))
> DisjunctionMaxQuery((description:a | title:a))
> DisjunctionMaxQuery((description:time | title:time))
> DisjunctionMaxQuery((description:in | title:in))
> DisjunctionMaxQuery((description:the | title:the))
> DisjunctionMaxQuery((description:west | title:west)))~2",
> "parsedquery_toString":"+(((description:once | title:once)
> (description:upon | title:upon) (description:a | title:a) (description:time
> | title:time) (description:in | title:in) (description:the | title:the)
> (description:west | title:west))~2)"
>
> Removing pf cuts the time almost in half, but it's still 5+ sec.
>
> Thank you for your help, more than happy to include more output..
> -Craig
>
>
> On Fri, Mar 29, 2019 at 12:24 PM Michael Gibney  >
> wrote:
>
> > Can you post the query that's actually built for some of these inputs
> > ("parsedquery" or "parsedquery_toString" output included for requests
> with
> > "debug=query" parameter)? What is performance like if you turn off pf
> > (i.e., no implicit phrase searching)?
> > Michael
> >
> > On Fri, Mar 29, 2019 at 11:53 AM Erie Data Systems <
> eriedata...@gmail.com>
> > wrote:
> >
> > > Using Solr 8.0.0, single instance, single core, 50m records (38gb
> index)
> > > on one SSD, 96gb ram, 16 cores CPU
> > >
> > > Most queries run very very fast <1 sec however we have noticed queries
> > > containing "common" words are quite slow sometimes 10+sec , currently
> > using
> > > edismax with 2 text_general fields,. qf, and pf, qs=0,ps=0
> > >
> > > I came across these which describe the issue.
> > >
> > >
> >
> https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> > >
> > >
> > >
> >
> https://lucene.apache.org/core/5_5_3/queries/org/apache/lucene/queries/CommonTermsQuery.html
> > >
> > > Test queries with issues :
> > > 1. things to do in seattle with eric
> > > 2. year of the cat
> > > 3. time of my life
> > > 4. when will i be loved
> > > 5. once upon a time in the west
> > >
> > > Stopwords are not an option as in the case of #2, if of and the are
> > removed
> > > it essentially destroys relevance.  Is there a common suggested
> solution
> > to
> > > what would seem to be a common issue besides adding stopwords.
> > >
> > > Thank you.
> > > Craig Stadler
> > >
> >
>


Re: Performance problems with extremely common terms in collection (Solr 7.4)

2019-04-08 Thread Michael Gibney
In addition to Toke's suggestions (and those in the linked article), some
more ideas:
If single-term, bare queries are slow, it might be productive to check
config/performance of your queryResultCache (I realize this doesn't
directly address the concern of slow queries, but might nonetheless be
helpful in practice).
If multi-term queries that include these terms are slow, maybe check your
mm config to make sure it's not more inclusive than necessary for your use
case (scoring over union of docSets/clauses). If multi-term queries get
faster by disabling pf, you could try disabling main-query pf, and invoke
implicit phrase search (pseudo-pf) using ReRankQParser?
If you're able to share your configs (built queries, indexing/fieldType
config (positions, payloads?), etc.), that might enable more specific
advice.
I'm assuming the query-times posted are for queries that isolate the
performance of main query only (i.e., no other components, like facets,
etc.)?
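
A rough sketch of that ReRank idea (field names and values hypothetical):

  q={!edismax qf='title keywords' v=$userQuery}
  &rq={!rerank reRankQuery=$rqq reRankDocs=1000 reRankWeight=3}
  &rqq={!edismax qf='title keywords' pf='title keywords' v=$userQuery}
  &userQuery=photography background

i.e., run the main query without pf, then re-score only the top reRankDocs
with the phrase boost.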
Michael

On Mon, Apr 8, 2019 at 3:28 AM Ash Ramesh  wrote:

> Hi Toke,
>
> Thanks for the prompt reply. I'm glad to hear that this is a common
> problem. In regards to stop words, I've been thinking about trying that
> out. In our business case, most of these terms are keywords related to
> stock photography, therefore it's natural for 'photography' or 'background'
> to appear commonly in a document's keyword list. It seems unlikely we can
> use the common grams solution for our business case.
>
> Regards,
>
> Ash
>
> On Mon, Apr 8, 2019 at 5:01 PM Toke Eskildsen  wrote:
>
> > On Mon, 2019-04-08 at 09:58 +1000, Ash Ramesh wrote:
> > > We have a corpus of 50+ million documents in our collection. I've
> > > noticed that some queries with specific keywords tend to be extremely
> > > slow.
> > > E.g. the q=`photography' or q='background'. After digging into the
> > > raw documents, I could see that these two terms appear in greater
> > > than 90% of all documents, which means solr has to score each of
> > > those documents.
> >
> > That is known behaviour, which can be remedied somewhat. Stop words is
> > a common approach, but your samples do not seem to fit well with
> > that. Instead you can look at Common Grams, where your high-frequency
> > words gets concatenated with surrounding words. This only works with
> > phrases though. There's a nice article at
> >
> >
> >
> https://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> >
> > - Toke Eskildsen, Royal Danish Library
> >
> >
> >
>


Re: cursorMark and shards? (6.6.2)

2020-02-10 Thread Michael Gibney
Possibly worth mentioning, although it might not be appropriate for
your use case: if the fields you're interested in are configured with
docValues, you could use streaming expressions (or directly handle
thread-per-shard connections to the /export handler) and get
everything in a single shot without paging of any kind. (I'm actually
working on something of this nature now; though not quite ready for
prime time, it's reliably exporting 68 million records to a 24G
compressed zip archive in 23 minutes -- 24 shards).
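
(A sketch of what that looks like -- collection and field names
hypothetical, and the fields must have docValues:

  search(myCollection, q="*:*", fl="id,title", sort="id asc", qt="/export")

submitted as the expr parameter of the collection's /stream endpoint.)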

On Mon, Feb 10, 2020 at 6:39 PM Erick Erickson  wrote:
>
> Any field that’s unique per doc would do, but yeah, that’s usually an ID.
>
> Hmmm, I don’t see why separate queries for 0-f are necessary if you’re firing
> at individual replicas. Each replica should have multiple UUIDs that start 
> with 0-f.
>
> Unless I misunderstand and you’re just firing off, say, 16 threads at the 
> entire
> collection rather than individual shards which would work too. But for 
> individual
> shards I think you need to look for all possible IDs...
>
> Erick
>
> > On Feb 10, 2020, at 5:37 PM, Walter Underwood  wrote:
> >
> >
> >> On Feb 10, 2020, at 2:24 PM, Walter Underwood  
> >> wrote:
> >>
> >> Not sure if range queries work on a UUID field, ...
> >
> > A search for id:0* took 260 ms, so it looks like they work just fine. I’ll 
> > try separate queries for 0-f.
> >
> > wunder
> > Walter Underwood
> > wun...@wunderwood.org
> > http://observer.wunderwood.org/  (my blog)
> >
>


Re: Phrase search and WordDelimiterGraphFilter not working as expected with mixed delimited and non-delimited tokens

2020-02-19 Thread Michael Gibney
There are many layers to this, but for the config you posted (applying
index-time WDGF configured to both split and catenate tokens), the
fundamental issue is that Lucene doesn't index positionLength, so the
graph structure (and token adjacency information) of the token stream
is lost when it's serialized to the index. Once the positionLength
information is discarded, it's impossible to restore/leverage it at
query time.

For now, if you use WGDF (or any analysis component capable of
generating "graph"-type output) at index-time, you'll have issues
unless you configure it such that it won't in practice generate graph
output. For WGDF this would mean either catenate output, or split
output, but not both on a single analysis chain. If you need both, one
option would be to index to (and search on) two fields: one for
catenated analysis, one for split analysis.
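
A sketch of the two-field approach (the point being one WDGF instance that
only splits and one that only catenates; all names hypothetical):

  <!-- index analyzer of the "split" fieldType: -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"/>

  <!-- index analyzer of the "catenate" fieldType: -->
  <filter class="solr.WordDelimiterGraphFilterFactory"
          generateWordParts="0" generateNumberParts="0"
          catenateWords="1" catenateNumbers="1" catenateAll="0"/>

with copyField directives routing the same source text to both fields, and
queries searching across both.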

Graph output *is* respected at query-time, so you have more options
configuring WGDF on a query-time analyzer. But in that case, it's
worth being aware of the potential for exponential query expansion
(see discussion at https://issues.apache.org/jira/browse/SOLR-13336,
which restores a safety valve for extreme instances of this case).

Some other potentially relevant issues/links:
https://issues.apache.org/jira/browse/LUCENE-4312
https://issues.apache.org/jira/browse/LUCENE-7398
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
(Lucene, so applies also to Solr)
https://michaelgibney.net/lucene/graph/

On Wed, Feb 19, 2020 at 10:27 AM Jeroen Steggink | knowsy
 wrote:
>
> Hi,
>
> I have a question regarding phrase search in combination with a
> WordDelimiterGraphFilter (Solr 8.4.1).
>
> Whenever I try to search using a phrase where the token combination consists
> of delimited and non-delimited tokens, I don't get any matches.
>
> This is the configuration:
>
> <fieldType name="..." class="solr.TextField">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>             generateWordParts="1"
>             generateNumberParts="1"
>             catenateWords="1"
>             catenateNumbers="0"
>             catenateAll="0"
>             splitOnCaseChange="1"
>             preserveOriginal="1"/>
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> <field name="..." type="..." omitTermFreqAndPositions="false" />
>
>
> Example document:
>
> {
>id: '1',
>text: 'mr. i.n.i.t. firstsirname secondsirname'
> }
>
> Queries and results:
>
> Query:
> "mr. i.n.i.t. firstsirname"
> -
> No result
>
> Query:
> "mr. i.n.i.t."
> -
> Result
>
> Query:
> "mr. i n i t"
> -
> Result
>
> Query:
> "mr. init"
> -
> Result
>
> Query:
> "mr init"
> -
> Result
>
> Query:
> "i.n.i.t. firstsirname"
> -
> No result
>
> Query:
> "init firstsirname"
> -
> No result
>
> Query:
> "i.n.i.t. firstsirname secondsirname"
> -
> No result
>
> Query:
> "init firstsirname secondsirname"
> -
> No result
>
>
> I don't quite understand why this is. Looking at the analyzer output, it
> works with either just delimited or just non-delimited tokens. However, as
> soon as a mixed combination of delimited and non-delimited tokens is
> searched, there is no match.
>
> Could someone explain? And is there a solution to make it work?
>
> Best regards,
>
> Jeroen
>
>


Re: Nested Document with replicas slow

2020-04-13 Thread Michael Gibney
Depending on how you're measuring performance (and whether your use case
benefits from caching), it might be worth looking into stable replica
routing (configured with the "replica.base" sub-parameter of the
shards.preference parameter).
With a single replica per shard, every request is routed to a single
replica for a given shard, ensuring effective use of replica-level caches.
With multiple replicas per shard, by default each request is routed
randomly to specific replicas. The more shards you have (and the more
replicas), the longer it takes to "warm" caches to the point where the user
actually perceives decreased latency. For replication factor > 1, stable
cache entries can be initialized by warming queries, but transient cache
entries (particularly the queryResultCache) can in some cases be rendered
effectively useless in combination with the default random replica routing.
More discussion can be found at SOLR-13257
<https://issues.apache.org/jira/browse/SOLR-13257>.

To be sure, this may not affect your case, but if you're seeing performance
degradation associated with adding replicas, it's probably worth
considering.
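
A hypothetical example (see the linked documentation for the exact
replica.base syntax; the param name here is made up):

  shards.preference=replica.location:local,replica.base:stable:dividend:routingPreference

where each request would also carry an integer routingPreference parameter
used to derive a stable, per-client-distinct replica ordering.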


On Mon, Apr 13, 2020 at 9:48 AM Jae Joo  wrote:

> I have multiple 100M documents using Nested Documents for joining. It is
> the fastest way of joining in a single replica. By adding more replicas (2
> or 3), performance slows down significantly (about 100x).
> Does anyone have the same experience?
>
> Jae
>


Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Hi Wei,

In considering this problem, I'm stumbling a bit on terminology
(particularly, where you mention "nodes", I think you're referring to
"replicas"?). Could you confirm that you have 10 TLOG replicas per
shard, for each of 6 shards? How many *nodes* (i.e., running solr
server instances) do you have, and what is the replica placement like
across those nodes? What, if any, non-TLOG replicas do you have per
shard (not that it's necessarily relevant, but just to get a complete
picture of the situation)?

If you're able without too much trouble, can you determine what the
behavior is like on Solr 8.3? (there were different changes introduced
to potentially relevant code in 8.3 and 8.4, and knowing whether the
behavior you're observing manifests on 8.3 would help narrow down
where to look for an explanation).

Michael

On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
>
> Update:  after I remove the shards.preference parameter from
> solrconfig.xml,  issue is gone and internal shard requests are now
> balanced. The same parameter works fine with solr 7.6.  Still not sure of
> the root cause, but I observed a strange coincidence: the nodes that are
> most frequently picked for shard requests are the first node in each shard
> returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> equally compared nodes when shards.preference is set.  Will report back if
> I find more.
>
> On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
>
> > Hi Erick,
> >
> > I am measuring the number of shard requests, and it's for query only, no
> > indexing requests.  I have an external load balancer and see each node
> > receive about an equal number of external queries. However, for the
> > internal shard queries, the distribution is uneven: 6 nodes (one in
> > each shard; some of them are leaders and some are non-leaders) get about
> > 80% of the shard requests, while the other 54 nodes get about 20% of the
> > shard requests.  I checked a few other parameters set:
> >
> > -Dsolr.disable.shardsWhitelist=true
> > shards.preference=replica.location:local,replica.type:TLOG
> >
> > Nothing seems to cause the strange behavior.  Any suggestions how to
> > debug this?
> >
> > -Wei
> >
> >
> > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > wrote:
> >
> >> Wei:
> >>
> >> How are you measuring utilization here? The number of incoming requests
> >> or CPU?
> >>
> >> The leader for each shard are certainly handling all of the indexing
> >> requests since they’re TLOG replicas, so that’s one thing that might
> >> skewing your measurements.
> >>
> >> Best,
> >> Erick
> >>
> >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> >> >
> >> > Hi everyone,
> >> >
> >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 6
> >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> >> one
> >> > of the replicas in each shard is handling most of the distributed shard
> >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> >> There
> >> > <shardHandlerFactory ... class="HttpShardHandlerFactory">
> >> >   <int name="...">3</int>
> >> >   <int name="...">3</int>
> >> >   <int name="...">500</int>
> >> > </shardHandlerFactory>
> >> >
> >> > 
> >> >
> >> >
> >> > What could cause the unbalanced internal distributed request?
> >> >
> >> >
> >> > Thanks in advance.
> >> >
> >> >
> >> >
> >> > Wei
> >>
> >>


Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
Wei, probably no need to answer my earlier questions; I think I see
the problem here, and believe it is indeed a bug, introduced in 8.3.
Will file an issue and submit a patch shortly.
Michael

On Mon, May 11, 2020 at 12:49 PM Michael Gibney
 wrote:
>
> Hi Wei,
>
> In considering this problem, I'm stumbling a bit on terminology
> (particularly, where you mention "nodes", I think you're referring to
> "replicas"?). Could you confirm that you have 10 TLOG replicas per
> shard, for each of 6 shards? How many *nodes* (i.e., running solr
> server instances) do you have, and what is the replica placement like
> across those nodes? What, if any, non-TLOG replicas do you have per
> shard (not that it's necessarily relevant, but just to get a complete
> picture of the situation)?
>
> If you're able without too much trouble, can you determine what the
> behavior is like on Solr 8.3? (there were different changes introduced
> to potentially relevant code in 8.3 and 8.4, and knowing whether the
> behavior you're observing manifests on 8.3 would help narrow down
> where to look for an explanation).
>
> Michael
>
> On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> >
> > Update:  after I remove the shards.preference parameter from
> > solrconfig.xml,  issue is gone and internal shard requests are now
> > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > the root cause, but I observed a strange coincidence: the nodes that are
> > most frequently picked for shard requests are the first node in each shard
> > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > equally compared nodes when shards.preference is set.  Will report back if
> > I find more.
> >
> > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> >
> > > Hi Eric,
> > >
> > > I am measuring the number of shard requests, and it's for query only, no
> > > indexing requests.  I have an external load balancer and see each node
> > > received about the equal number of external queries. However for the
> > > internal shard queries,  the distribution is uneven:6 nodes (one in
> > > each shard,  some of them are leaders and some are non-leaders ) gets 
> > > about
> > > 80% of the shard requests, the other 54 nodes gets about 20% of the shard
> > > requests.   I checked a few other parameters set:
> > >
> > > -Dsolr.disable.shardsWhitelist=true
> > > shards.preference=replica.location:local,replica.type:TLOG
> > >
> > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > debug this?
> > >
> > > -Wei
> > >
> > >
> > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > > wrote:
> > >
> > >> Wei:
> > >>
> > >> How are you measuring utilization here? The number of incoming requests
> > >> or CPU?
> > >>
> > >> The leader for each shard are certainly handling all of the indexing
> > >> requests since they’re TLOG replicas, so that’s one thing that might
> > >> skewing your measurements.
> > >>
> > >> Best,
> > >> Erick
> > >>
> > >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> > >> >
> > >> > Hi everyone,
> > >> >
> > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud has 
> > >> > 6
> > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed that
> > >> one
> > >> > of the replicas in each shard is handling most of the distributed shard
> > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > >> There
> > >> > is no change in shard handler configuration:
> > >> >
> > >> > <shardHandlerFactory ... class="HttpShardHandlerFactory">
> > >> >   <int name="...">3</int>
> > >> >   <int name="...">3</int>
> > >> >   <int name="...">500</int>
> > >> > </shardHandlerFactory>
> > >> >
> > >> >
> > >> > What could cause the unbalanced internal distributed request?
> > >> >
> > >> >
> > >> > Thanks in advance.
> > >> >
> > >> >
> > >> >
> > >> > Wei
> > >>
> > >>


Re: Unbalanced shard requests

2020-05-11 Thread Michael Gibney
FYI: https://issues.apache.org/jira/browse/SOLR-14471
Wei, assuming you have only TLOG replicas, your "last place" matches
(to which the random fallback ordering would not be applied -- see
above issue) would be the same as the "first place" matches selected
for executing distributed requests.


On Mon, May 11, 2020 at 1:49 PM Michael Gibney
 wrote:
>
> Wei, probably no need to answer my earlier questions; I think I see
> the problem here, and believe it is indeed a bug, introduced in 8.3.
> Will file an issue and submit a patch shortly.
> Michael
>
> On Mon, May 11, 2020 at 12:49 PM Michael Gibney
>  wrote:
> >
> > Hi Wei,
> >
> > In considering this problem, I'm stumbling a bit on terminology
> > (particularly, where you mention "nodes", I think you're referring to
> > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > server instances) do you have, and what is the replica placement like
> > across those nodes? What, if any, non-TLOG replicas do you have per
> > shard (not that it's necessarily relevant, but just to get a complete
> > picture of the situation)?
> >
> > If you're able without too much trouble, can you determine what the
> > behavior is like on Solr 8.3? (there were different changes introduced
> > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > behavior you're observing manifests on 8.3 would help narrow down
> > where to look for an explanation).
> >
> > Michael
> >
> > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > >
> > > Update:  after I remove the shards.preference parameter from
> > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > balanced. The same parameter works fine with solr 7.6.  Still not sure of
> > > the root cause, but I observed a strange coincidence: the nodes that are
> > > most frequently picked for shard requests are the first node in each shard
> > > returned from the CLUSTERSTATUS api.  Seems something wrong with shuffling
> > > equally compared nodes when shards.preference is set.  Will report back if
> > > I find more.
> > >
> > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > >
> > > > Hi Eric,
> > > >
> > > > I am measuring the number of shard requests, and it's for query only, no
> > > > indexing requests.  I have an external load balancer and see each node
> > > > received about the equal number of external queries. However for the
> > > > internal shard queries,  the distribution is uneven:6 nodes (one in
> > > > each shard,  some of them are leaders and some are non-leaders ) gets 
> > > > about
> > > > 80% of the shard requests, the other 54 nodes gets about 20% of the 
> > > > shard
> > > > requests.   I checked a few other parameters set:
> > > >
> > > > -Dsolr.disable.shardsWhitelist=true
> > > > shards.preference=replica.location:local,replica.type:TLOG
> > > >
> > > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > > debug this?
> > > >
> > > > -Wei
> > > >
> > > >
> > > > On Mon, Apr 27, 2020 at 5:42 PM Erick Erickson 
> > > > wrote:
> > > >
> > > >> Wei:
> > > >>
> > > >> How are you measuring utilization here? The number of incoming requests
> > > >> or CPU?
> > > >>
> > > >> The leader for each shard are certainly handling all of the indexing
> > > >> requests since they’re TLOG replicas, so that’s one thing that might
> > > >> skewing your measurements.
> > > >>
> > > >> Best,
> > > >> Erick
> > > >>
> > > >> > On Apr 27, 2020, at 7:13 PM, Wei  wrote:
> > > >> >
> > > >> > Hi everyone,
> > > >> >
> > > >> > I have a strange issue after upgrade from 7.6.0 to 8.4.1. My cloud 
> > > >> > has 6
> > > >> > shards with 10 TLOG replicas each shard.  After upgrade I noticed 
> > > >> > that
> > > >> one
> > > >> > of the replicas in each shard is handling most of the distributed 
> > > >> > shard
> > > >> > requests, so 6 nodes are heavily loaded while other nodes are idle.
> > > >> There
> > > >> > is no change in shard handler configuration:
> > > >> >
> > > >> > <shardHandlerFactory ... class="HttpShardHandlerFactory">
> > > >> >   <int name="...">3</int>
> > > >> >   <int name="...">3</int>
> > > >> >   <int name="...">500</int>
> > > >> > </shardHandlerFactory>
> > > >> >
> > > >> >
> > > >> > What could cause the unbalanced internal distributed request?
> > > >> >
> > > >> >
> > > >> > Thanks in advance.
> > > >> >
> > > >> >
> > > >> >
> > > >> > Wei
> > > >>
> > > >>


Re: Unbalanced shard requests

2020-05-15 Thread Michael Gibney
Hi Wei,
SOLR-14471 has been merged, so this issue should be fixed in 8.6.
Thanks for reporting the problem!
Michael

On Mon, May 11, 2020 at 7:51 PM Wei  wrote:
>
> Thanks Michael!  Yes in each shard I have 10 Tlog replicas,  no other type
> of replicas, and each Tlog replica is an individual solr instance on its
> own physical machine.  In the jira you mentioned 'when "last place matches"
> == "first place matches" – e.g. when shards.preference specified matches
> *all* available replicas'.   My setting is
> shards.preference=replica.location:local,replica.type:TLOG,
> I also tried just shards.preference=replica.location:local and it still has
> the issue. Can you explain a bit more?
>
> On Mon, May 11, 2020 at 12:26 PM Michael Gibney 
> wrote:
>
> > FYI: https://issues.apache.org/jira/browse/SOLR-14471
> > Wei, assuming you have only TLOG replicas, your "last place" matches
> > (to which the random fallback ordering would not be applied -- see
> > above issue) would be the same as the "first place" matches selected
> > for executing distributed requests.
> >
> >
> > On Mon, May 11, 2020 at 1:49 PM Michael Gibney
> >  wrote:
> > >
> > > Wei, probably no need to answer my earlier questions; I think I see
> > > the problem here, and believe it is indeed a bug, introduced in 8.3.
> > > Will file an issue and submit a patch shortly.
> > > Michael
> > >
> > > On Mon, May 11, 2020 at 12:49 PM Michael Gibney
> > >  wrote:
> > > >
> > > > Hi Wei,
> > > >
> > > > In considering this problem, I'm stumbling a bit on terminology
> > > > (particularly, where you mention "nodes", I think you're referring to
> > > > "replicas"?). Could you confirm that you have 10 TLOG replicas per
> > > > shard, for each of 6 shards? How many *nodes* (i.e., running solr
> > > > server instances) do you have, and what is the replica placement like
> > > > across those nodes? What, if any, non-TLOG replicas do you have per
> > > > shard (not that it's necessarily relevant, but just to get a complete
> > > > picture of the situation)?
> > > >
> > > > If you're able without too much trouble, can you determine what the
> > > > behavior is like on Solr 8.3? (there were different changes introduced
> > > > to potentially relevant code in 8.3 and 8.4, and knowing whether the
> > > > behavior you're observing manifests on 8.3 would help narrow down
> > > > where to look for an explanation).
> > > >
> > > > Michael
> > > >
> > > > On Fri, May 8, 2020 at 7:34 PM Wei  wrote:
> > > > >
> > > > > Update:  after I remove the shards.preference parameter from
> > > > > solrconfig.xml,  issue is gone and internal shard requests are now
> > > > > balanced. The same parameter works fine with solr 7.6.  Still not
> > sure of
> > > > > the root cause, but I observed a strange coincidence: the nodes that
> > are
> > > > > most frequently picked for shard requests are the first node in each
> > shard
> > > > > returned from the CLUSTERSTATUS api.  Seems something wrong with
> > shuffling
> > > > > equally compared nodes when shards.preference is set.  Will report
> > back if
> > > > > I find more.
> > > > >
> > > > > On Mon, Apr 27, 2020 at 5:59 PM Wei  wrote:
> > > > >
> > > > > > Hi Eric,
> > > > > >
> > > > > > I am measuring the number of shard requests, and it's for query
> > only, no
> > > > > > indexing requests.  I have an external load balancer and see each
> > node
> > > > > > received about the equal number of external queries. However for
> > the
> > > > > > internal shard queries,  the distribution is uneven:6 nodes
> > (one in
> > > > > > each shard,  some of them are leaders and some are non-leaders )
> > gets about
> > > > > > 80% of the shard requests, the other 54 nodes gets about 20% of
> > the shard
> > > > > > requests.   I checked a few other parameters set:
> > > > > >
> > > > > > -Dsolr.disable.shardsWhitelist=true
> > > > > > shards.preference=replica.location:local,replica.type:TLOG
> > > > > >
> > > > > > Nothing seems to cause the strange behavior.  Any suggestions how to
> > > > > > debug this?

Re: [EXTERNAL] - SolR OOM error due to query injection

2020-06-11 Thread Michael Gibney
Guilherme,
The answer is likely to be dependent on the query parser, query parser
configuration, and analysis chains. If you post those it could aid in
helping troubleshoot. One thing that jumps to mind is the asterisks
("*") -- if they're interpreted as wildcards, that could be
problematic? More generally, it's of course true that Solr won't parse
this input as SQL, but as Isabelle pointed out, there are still
potentially lots of meta-characters (in addition to quite a few short,
common terms).
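
If selective escaping is an option (I realize blanket escaping breaks the
field-search and id cases you mention), SolrJ ships a helper for it -- a
minimal sketch:

  // Escapes all Lucene/Solr query metacharacters in raw user input;
  // note this also disables user-supplied field: and wildcard syntax,
  // so it would have to be applied only to inputs failing a sanity check.
  String safe = ClientUtils.escapeQueryChars(rawInput);

(org.apache.solr.client.solrj.util.ClientUtils)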
Michael


On Thu, Jun 11, 2020 at 7:43 AM Guilherme Viteri  wrote:
>
> Hi Isabelle
> Thanks for your input.
> In fact SolR returns 30 results for these queries. Why does it behave in a
> way that causes OOM? Also, the commands are SQL commands; Solr would
> parse them as normal characters…
>
> Thanks
>
>
> > On 10 Jun 2020, at 22:50, Isabelle Giguere  
> > wrote:
> >
> > Hi Guilherme;
> >
> > The only thing I can think of right now is the number of non-alphanumeric 
> > characters.
> >
> > In the first 'q' in your examples, after resolving the character escapes, 
> > 1/3 of characters are non-alphanumeric (* / = , etc).
> >
> > Maybe filter-out queries that contain too many non-alphanumeric characters 
> > before sending the request to Solr ?  Whatever "too many" could be.
> >
> > Isabelle Giguère
> > Computational Linguist & Java Developer
> > Linguiste informaticienne & développeur java
> >
> >
> > 
> > De : Guilherme Viteri 
> > Envoyé : 10 juin 2020 16:57
> > À : solr-user@lucene.apache.org 
> > Objet : [EXTERNAL] - SolR OOM error due to query injection
> >
> > Hi,
> >
> > Environment: SolR 6.6.2, with org.apache.solr.solr-core:6.1.0. This setup 
> > has been running for at least 4 years without having OutOfMemory error. (it 
> > is never too late for an OOM…)
> >
> This week, our search tool has been attacked via 'SQL injection'-like
> requests, and that led to an OOM. These requests weren't aggressive ones that
> stressed the server with an excessive number of hits; however, 5 to 10
> requests of this nature were enough to crash the server.
> >
> I've come across this link
> https://stackoverflow.com/questions/26862474/prevent-from-solr-query-injections-when-using-solrj,
> however, that's not what I am after. In our case we do allow Lucene
> query and field search like title:Title, and our ids have dashes; if they get
> escaped, then the search won't work properly.
> >
> > Does anyone have an idea ?
> >
> > Cheers
> > G
> >
> > Here are some of the requests that appeared in the logs in relation to the 
> > attack (see below: sorry it is messy)
> > query?q=IPP%22%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28SELECT%2F%2A%2A%2F2%2A%28IF%28%28SELECT%2F%2A%2A%2F%2A%2F%2A%2A%2FFROM%2F%2A%2A%2F%28SELECT%2F%2A%2A%2FCONCAT%280x717a707871%2C%28SELECT%2F%2A%2A%2F%28ELT%283235%3D3235%2C1%29%29%29%2C0x717a626271%2C0x78%29%29s%29%2C%2F%2A%2A%2F8446744073709551610%2C%2F%2A%2A%2F8446744073709551610%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28%28%28%22YBXk%22%2F%2A%2A%2FLIKE%2F%2A%2A%2F%22YBXk&species=Homo%20sapiens&types=Reaction&types=Pathway&cluster=true
> >
> > q=IPP%22%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28SELECT%2F%2A%2A%2F2%2A%28IF%28%28SELECT%2F%2A%2A%2F%2A%2F%2A%2A%2FFROM%2F%2A%2A%2F%28SELECT%2F%2A%2A%2FCONCAT%280x717a707871%2C%28SELECT%2F%2A%2A%2F%28ELT%283235%3D3235%2C1%29%29%29%2C0x717a626271%2C0x78%29%29s%29%2C%2F%2A%2A%2F8446744073709551610%2C%2F%2A%2A%2F8446744073709551610%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28%28%28%22rDmG%22%3D%22rDmG&species=Homo%20sapiens&types=Reaction&types=Pathway&cluster=true
> >
> > q=IPP%22%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28SELECT%2F%2A%2A%2F3641%2F%2A%2A%2FFROM%28SELECT%2F%2A%2A%2FCOUNT%28%2A%29%2CCONCAT%280x717a707871%2C%28SELECT%2F%2A%2A%2F%28ELT%283641%3D3641%2C1%29%29%29%2C0x717a626271%2CFLOOR%28RAND%280%29%2A2%29%29x%2F%2A%2A%2FFROM%2F%2A%2A%2FINFORMATION_SCHEMA.PLUGINS%2F%2A%2A%2FGROUP%2F%2A%2A%2FBY%2F%2A%2A%2Fx%29a%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28%28%28%22dfkM%22%2F%2A%2A%2FLIKE%2F%2A%2A%2F%22dfkM&species=Homo%20sapiens&types=Reaction&types=Pathway&cluster=true
> >
> > q=IPP%22%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28SELECT%2F%2A%2A%2F3641%2F%2A%2A%2FFROM%28SELECT%2F%2A%2A%2FCOUNT%28%2A%29%2CCONCAT%280x717a707871%2C%28SELECT%2F%2A%2A%2F%28ELT%283641%3D3641%2C1%29%29%29%2C0x717a626271%2CFLOOR%28RAND%280%29%2A2%29%29x%2F%2A%2A%2FFROM%2F%2A%2A%2FINFORMATION_SCHEMA.PLUGINS%2F%2A%2A%2FGROUP%2F%2A%2A%2FBY%2F%2A%2A%2Fx%29a%29%2F%2A%2A%2FAND%2F%2A%2A%2F%28%28%28%22yBhx%22%3D%22yBhx&species=Homo%20sapiens&types=Reaction&types=Pathway&cluster=true
> >
> > q=IPP%22%29%29%29%2F%2A%2A%2FAND%2F%2A%2A%2F1695%3DCTXSYS.DRITHSX.SN%28

Re: Facet Performance

2020-06-17 Thread Michael Gibney
facet.method=enum works by executing a query (against indexed values)
for each indexed value in a given field (which, for indexed=false, is
"no values"). So that explains why facet.method=enum no longer works.
I was going to suggest that you might not want to set indexed=false on
the docValues facet fields anyway, since the indexed values are still
used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low
enough, setting docValues=false and indexed=true and using
facet.method=enum (with a sufficiently large filterCache) is
definitely a viable option, and will almost certainly be faster than
docValues-based faceting. (As an aside, noting for future reference:
high-cardinality facets over high-cardinality DocSet domains might be
able to benefit from a term facet count cache:
https://issues.apache.org/jira/browse/SOLR-13807)
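
For example (field name hypothetical; cache class and sizes illustrative):

  facet=true&facet.field=myFacetField&facet.method=enum

backed in solrconfig.xml by something like:

  <filterCache class="solr.CaffeineCache" size="8192" initialSize="512" autowarmCount="0"/>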

I think you didn't specifically mention whether you acted on Erick's
suggestion of setting "uninvertible=false" (I think Erick accidentally
said "uninvertible=true") to fail fast. I'd also recommend doing that,
perhaps even above all else -- it shouldn't actually *do* anything,
but will help ensure that things are behaving as you expect them to!
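
For instance (field name and type hypothetical):

  <field name="myFacetField" type="string" indexed="false" stored="false"
         docValues="true" uninvertible="false"/>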

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
 wrote:
>
> Thanks, I've implemented some queries that improve the first-hit execution 
> for faceting.
>
> Since turning off indexed on those fields, we've noticed that 
> facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to 
> facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson"  wrote:
>
> Ok, I see the disconnect... Necessary parts of the index are read from
> disk
> lazily. So your newSearcher or firstSearcher query needs to do whatever
> operation causes the relevant parts of the index to be read. In this case,
> probably just facet on all the fields you care about. I'd add sorting too
> if you sort on different fields.
>
> The *:* query without facets or sorting does virtually nothing due to some
> special handling...
>
> On Tue, Jun 16, 2020, 10:48 James Bodkin 
> wrote:
>
> > I've been trying to build a query that I can use in newSearcher based 
> off
> > the information in your previous e-mail. I thought you meant to build a 
> *:*
> > query as per Query 1 in my previous e-mail but I'm still seeing the
> > first-hit execution.
> > Now I'm wondering if you meant to create a *:* query with each of the
> > fields as part of the fl query parameters or a *:* query with each of 
> the
> > fields and values as part of the fq query parameters.
> >
> > At the moment I've been running these manually as I expected that I 
> would
> > see the first-execution penalty disappear by the time I got to query 4, 
> as
> > I thought this would replicate the actions of the newSeacher.
> > Unfortunately we can't use the autowarm count that is available as part 
> of
> > the filterCache/filterCache due to the custom deployment mechanism we 
> use
> > to update our index.
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 16/06/2020, 15:30, "Erick Erickson"  wrote:
> >
> > Did you try the autowarming like I mentioned in my previous e-mail?
> >
> > > On Jun 16, 2020, at 10:18 AM, James Bodkin <
> > james.bod...@loveholidays.com> wrote:
> > >
> > > We've changed the schema to enable docValues for these fields and
> > this led to an improvement in the response time. We found a further
> > improvement by also switching off indexed as these fields are used for
> > faceting and filtering only.
> > > Since those changes, we've found that the first-execution penalty for
> > queries is really noticeable. I thought this would be the filterCache,
> based
> > on what I saw in NewRelic; however, it is probably trying to read the
> > docValues from disk. How can we use the autowarming to improve this?
> > >
> > > For example, I've run the following queries in sequence and each
> > query has a first-execution penalty.
> > >
> > > Query 1:
> > >
> > > q=*:*
> > > facet=true
> > > facet.field=D_DepartureAirport
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 2:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2660)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > Query 3:
> > >
> > > q=*:*
> > > fq=D_DepartureAirport:(2661)
> > > facet=true
> > > facet.field=D_Destination
> > > facet.limit=-1
> > > rows=0
> > >
> > > 

Re: Facet Performance

2020-06-17 Thread Michael Gibney
To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson  wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
> use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point of 
> docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few
> unique values (for each term, read the list of docs that have it and flip 
> a bit).
>
> It doesn’t really make sense to set it on an index=false field, since 
> uninverting only happens on
> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
> That said, I frankly
> don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
> stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
> very few installations
> where people have that luxury though. Each entry in the filterCache can 
> occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
> I’m cheating
> a bit here: the size might be smaller if only a few docs match any
> particular entry. But that’s the worst-case you have to allow for ‘cause you 
> could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
> entire filter
> cache fills up with entries that size.
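>
> (To make that concrete: with maxDoc of 100 million, each entry is roughly
> 100,000,000 / 8 = 12.5 MB, so 8192 such entries could in the worst case
> approach 100 GB. Numbers picked purely for illustration.)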
>
> You’ll have some overhead to keep the cache at that size, but it sounds like 
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin  
> > wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. 
> > We have two fields over that with 150 unique values and 5300 unique values 
> > respectively.
> > At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > From the DocValues documentation 
> > (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> > this approach promises to make lookups for faceting, sorting and grouping 
> > much faster.
> > Hence I thought that using DocValues would be better than using Indexed and 
> > in turn improve our response times and possibly lower memory requirements. 
> > It sounds like this isn't the case if you are able to allocate enough 
> > memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting, I was looking at the 
> > documentation for this field earlier today.
> > Should we be setting uninvertible="false" if docValues="true" regardless of 
> > whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael 

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Michael Gibney
I agree with Shawn that the top contenders so far (from my
perspective) are "primary/secondary" and "publisher/subscriber", and
agree with Walter that whatever term pair is used should ideally be
usable *as a pair* (to identify a cluster type) in addition to
individually (to identify the individual roles in that cluster).

To take the "bikeshedding" metaphor in another direction, I'd submit
"hub/spoke"? It's a little overloaded, but afaict mainly in domains
other than cluster architecture. It's very usable as a pair; it
manages to convey the singular nature of the "hub" and the
equivalent/final nature of the "spokes" in a way that
"primary/secondary" doesn't really; and it avoids implying an active
role in cluster maintenance for the "hub" (cf. "publisher", which
could be misleading in this regard).

Michael

On Wed, Jun 17, 2020 at 9:12 PM Scott Cote  wrote:
>
> Perhaps Apache could provide a nomenclature suggestion that the projects
> could adopt. This would stand well for the whole Apache community in
> regard to BLM.
> My two cents as a “user”
> Good luck.
>
>
> Sent from Yahoo Mail for iPhone
>
>
> On Wednesday, June 17, 2020, 6:00 PM, Shawn Heisey  
> wrote:
>
> On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > 2) TLOG - which can only serve in the role of follower
>
> This is inaccurate.  TLOG can become leader.  If that happens, then it
> functions exactly like an NRT leader.
>
> I'm aware that saying the following is bikeshedding ... but I do think
> it would be as mistake to use any existing SolrCloud terminology for
> non-cloud deployments, including the word "replica".  The top contenders
> I have seen to replace master/slave in Solr are primary/secondary and
> publisher/subscriber.
>
> It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Thanks,
> Shawn
>
>
>


Re: Solr 8.3.1 longer query latency over 6.4.2

2020-08-19 Thread Michael Gibney
Hi Elaine,
I'm curious what happens if you remove "pf" (phrase field) setting
from your edismax config?

This question brought to mind
https://issues.apache.org/jira/browse/SOLR-12243?focusedCommentId=16836448#comment-16836448
and https://issues.apache.org/jira/browse/LUCENE-8531. This *could*
have directly explained the behavior you're observing, except for the
fact that pre-6.5.0, analyzeGraphPhrase(...) generated a
fully-enumerated Lucene "GraphQuery" (since removed, but afaict
similar to MultiPhraseQuery). But the direct topic of SOLR-12243 was
that SpanNearQuery, nevermind its performance characteristics, was
getting completely ignored by edismax. Curious about your case, I
looked at ExtendedDismaxQParser for 6.4.2, and it appears that
GraphQuery was similarly ignored?:

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.2/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java#L1219-L1252

If this is in fact the case (and I could well be overlooking
something), then it's possible that 6.4.2 was more performant mainly
because edismax was completely ignoring the more complex phrase
queries generated by analyzeGraphPhrase(...).

I'll be curious to hear what you find, and eager to be corrected if
the above speculation is off-base!

Michael


On Wed, Aug 19, 2020 at 10:56 AM Elaine Cario  wrote:
>
> Hi Solr experts,
>
> We're in the process of upgrading SolrCloud from 6.4.2 to 8.3.1, and our
> performance testing is consistently showing search latencies are measurably
> higher in 8.3.1; for certain kinds of queries, it may be as much as 200 ms
> higher on average.
>
> We've seen this now in 2 different environments.  In one environment, we
> effectively doubled the OS memory for Solr 8 (by removing a replica set),
> and saw little improvement.
>
> The specs on the VM's we're using are the same from Solr 6 and 8, and the
> index sizes and shard distribution are also the same.  We reviewed garbage
> collection logs, and didn't see any red flags there.  We're still using
> Java 8 (sorry!).  Content was re-fed into Solr 8 from scratch.
>
> We re-ran queries removing all the usual suspects for high latencies:
> grouping, faceting, highlighting. We saw some improvement (as we would
> expect), but nothing approaching the average Solr 6 latencies with all
> those features turned on.
>
> We've narrowed the largest overall latencies to queries which contain many
> terms OR'd together (essentially synonyms we add to the query ourselves);
> there may be anywhere from 0 to 38 or more quoted phrases OR'd together.
> Latencies increase the more synonyms we add (we always knew this), but it
> seems much worse in Solr 8. (It is an unfortunate quirk of our content that
> these terms often have pretty high frequencies).  But it's not clear if
> this is just amplifying an underlying issue, or if something fundamental
> changed in the way Solr (or Lucene) resolves queries with OR'd terms.  We
> use a custom variant of edismax (but we also modified the queries to enable
> use of OOTB edismax, and still saw no improvement).
>
> We also noted that 0-term queries (*:*) with lots of facets perform as well
> as Solr 6, so it definitely seems related to searching for terms.
>
> I'm out of ideas here.  Has anyone experienced similar degradation from
> older Solr versions?
>
> Thanks in advance for any help you can provide.


Re: Solr 8.3.1 longer query latency over 6.4.2

2020-08-21 Thread Michael Gibney
Hmm... if you're manually constructing phrase queries during
pre-parsing, and those are set sow=true,
autogeneratePhraseQueries=true, then despite lack of pf, phrase
queries could still be a key to this. Would any of the phrase queries
> explicitly introduced by your pre-parsing actually trigger
autogeneratePhraseQueries to kick in? (i.e., would any of the
whitespace-separated tokens in your phrases be further split by your
Solr-internal analysis chain -- WordDelimiter, (Solr-internal)
Synonym, etc.?). Would you be able to share the analysis chain on the
relevant fields, and perhaps (forgiving readability challenges) an
example of pre-parsed input that suffers particularly from performance
degradation?

On Thu, Aug 20, 2020 at 2:28 PM Elaine Cario  wrote:
>
> Thanks Michael, I took a look, but we don't have any pf or pf1,2,3 phrase
> params set at all.  Also, we don't add synonyms through Solr filters,
> rather we parse the user's query in our own application and add synonyms
> there, before it gets to Solr.
>
> Some additional info:  we have sow=true (to be compatible with Solr 6), and
> autogeneratePhraseQueries=true.  In our A/B testing, we didn't see any
> difference in search results (aside from some minor scoring variations), so
> functionally everything is working fine.
>
> I compared the debugQuery results between Solr 6 and 8 on a somewhat
> simplified query (they quickly become unreadable otherwise):
>
> Solr 6:
>   (+(DisjunctionMaxQuery((wkxmlsource:"new york" |
> title:"new york")~1.0) DisjunctionMaxQuery((wkxmlsource:ny | title:ny)~1.0)
> DisjunctionMaxQuery((wkxmlsource:"big apple" | title:"big
> apple")~1.0)))/no_coord
>   +((wkxmlsource:"new york" | title:"new
> york")~1.0 (wkxmlsource:ny | title:ny)~1.0 (wkxmlsource:"big apple" |
> title:"big apple")~1.0)
>
> Solr 8:
>   +(DisjunctionMaxQuery((wkxmlsource:"new york" |
> title:"new york")~1.0) DisjunctionMaxQuery((wkxmlsource:ny | title:ny)~1.0)
> DisjunctionMaxQuery((wkxmlsource:"big apple" | title:"big
> apple")~1.0))
>   +((wkxmlsource:"new york" | title:"new
> york")~1.0 (wkxmlsource:ny | title:ny)~1.0 (wkxmlsource:"big apple" |
> title:"big apple")~1.0)
>
> The only substantial difference is the removal of /no_coord (which is
> probably a result of LUCENE-7347 and likely accounts also for scoring
> variations).
>
> We do see generally higher CPU load with Solr 8 (although it is well within
> tolerance), and we do see much higher thread count (60 for Solr 6 vs 150
> for Solr 8 on average) even on a relatively quiet system.  That seems an
> interesting statistic, but not really sure what it signifies.  We mostly
> take the OOTB defaults for most everything, and config changes were
> minimal, mostly to maintain Solr 6 query behavior (uf=*_query_, sow=true).
>
> On Wed, Aug 19, 2020 at 5:46 PM Michael Gibney 
> wrote:
>
> > Hi Elaine,
> > I'm curious what happens if you remove "pf" (phrase field) setting
> > from your edismax config?
> >
> > This question brought to mind
> >
> > https://issues.apache.org/jira/browse/SOLR-12243?focusedCommentId=16836448#comment-16836448
> > and https://issues.apache.org/jira/browse/LUCENE-8531. This *could*
> > have directly explained the behavior you're observing, except for the
> > fact that pre-6.5.0, analyzeGraphPhrase(...) generated a
> > fully-enumerated Lucene "GraphQuery" (since removed, but afaict
> > similar to MultiPhraseQuery). But the direct topic of SOLR-12243 was
> > that SpanNearQuery, nevermind its performance characteristics, was
> > getting completely ignored by edismax. Curious about your case, I
> > looked at ExtendedDismaxQParser for 6.4.2, and it appears that
> > GraphQuery was similarly ignored?:
> >
> >
> > https://github.com/apache/lucene-solr/blob/releases/lucene-solr/6.4.2/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java#L1219-L1252
> >
> > If this is in fact the case (and I could well be overlooking
> > something), then it's possible that 6.4.2 was more performant mainly
> > because edismax was completely ignoring the more complex phrase
> > queries generated by analyzeGraphPhrase(...).
> >
> > I'll be curious to hear what you find, and eager to be corrected if
> > the above speculation is off-base!
> >
> > Michael
> >
> >
> > On Wed, Aug 19, 2020 at 10:56 AM Elaine Cario  wrote:
> > >
> > > Hi Solr experts,
> > >
> > > We're in the process of upgrading Sol

Re: Faceting on indexed=false stored=false docValues=true fields

2020-10-19 Thread Michael Gibney
As you've observed, it is indeed possible to facet on fields with
docValues=true, indexed=false; but in almost all cases you should
probably set indexed=true. 1. for distributed facet count refinement,
the "indexed" approach is used to look up counts by value; 2. assuming
you're wanting to do something usual, e.g. allow users to apply
filters based on facet counts, the filter application would use the
"indexed" approach as well. Where indexed=false, if either filtering
or distributed refinement is attempted, I'm not 100% sure what
happens. It might fail, or lead to inconsistent results, or attempt to
look up results via the equivalent of a "table scan" over docValues (I
think the last of these is what actually happens, fwiw) ... but none
of these options is likely desirable.
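
For a typical facet field, that generally means both, e.g. (a sketch; field
name and type hypothetical):

<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>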

Michael

On Mon, Oct 19, 2020 at 1:42 PM uyilmaz  wrote:
>
> Thanks! This also contributed to my confusion:
>
> https://lucene.apache.org/solr/guide/8_4/faceting.html#field-value-faceting-parameters
>
> "If you want Solr to perform both analysis (for searching) and faceting on 
> the full literal strings, use the copyField directive in your Schema to 
> create two versions of the field: one Text and one String. Make sure both are 
> indexed="true"."
>
> On Mon, 19 Oct 2020 13:08:00 -0400
> Alexandre Rafalovitch  wrote:
>
> > I think this is all explained quite well in the Ref Guide:
> > https://lucene.apache.org/solr/guide/8_6/docvalues.html
> >
> > DocValues is a different way to index/store values. Faceting is a
> > primary use case where docValues are better than what 'indexed=true'
> > gives you.
> >
> > Regards,
> >Alex.
> >
> > On Mon, 19 Oct 2020 at 12:51, uyilmaz  wrote:
> > >
> > >
> > > Hey all,
> > >
> > > From my little experiments, I see that (if I didn't make a stupid
> > > mistake) we can facet on fields marked as both indexed and stored being
> > > false:
> > >
> > > <field name="..." type="..." indexed="false" stored="false" docValues="true"/>
> > >
> > > I'm suprised by this, I thought I would need to index it. Can you confirm 
> > > this?
> > >
> > > Regards
> > >
> > > --
> > > uyilmaz 
>
>
> --
> uyilmaz 


Re: json.facet floods the filterCache

2020-10-22 Thread Michael Gibney
Damien,
Are you able to share the actual json.facet request that you're using
(at least just the json.facet part)? I'm having a hard time being
confident that I'm correctly interpreting when you say "a json.facet
query on nested facets terms".
Michael

On Thu, Oct 22, 2020 at 3:52 AM Christine Poerschke (BLOOMBERG/
LONDON)  wrote:
>
> Hi Damien,
>
> You mention about JSON term facets, I haven't explored w.r.t. that but we 
> have observed what you describe for JSON range facets and I've started 
> https://issues.apache.org/jira/browse/SOLR-14939 about it.
>
> Hope that helps.
>
> Regards,
> Christine
>
> From: solr-user@lucene.apache.org  At: 10/22/20 01:07:59  To:
> solr-user@lucene.apache.org
> Subject: json.facet floods the filterCache
>
> Hi,
>
> I'm using a json.facet query on nested facets terms and am seeing very high
> filterCache usage. Is it possible to somehow control this? With a fq it's
> possible to specify fq={!cache=false}... but I don't see a similar thing
> json.facet.
>
> Kind regards,
> Damien
>
>


Re: json.facet floods the filterCache

2020-10-26 Thread Michael Gibney
Damien, I gathered that you're using "nested facet"; but there are a
lot of different ways to do that, with different implications. e.g.,
nesting terms facet within terms facet, query facet within terms,
terms within query, different stats, sorting, overrequest/overrefine
(and for that matter, refine:none|simple, or even distributed vs.
non-distributed), etc. I was wondering if you could share an example
of an actual json facet specification.

Pending more information, I can say that I've been independently
looking into this also. I think high filterCache usage can result if
you're using terms faceting that results in a lot of refinement
requests (either a high setting for overrefine, or
low/unevenly-distributed facet counts (as might happen with
high-cardinality fields). I think nested terms could also magnify the
effect of high-cardinality fields, increasing the number of buckets
needing refinement. You could see if setting refine:none helps (though
of course it could have undesirable effects on the actual results).
But afaict every term specified in a refinement request currently hits
the filterCache:
https://github.com/apache/lucene-solr/blob/40e2122/solr/core/src/java/org/apache/solr/search/facet/FacetProcessor.java#L418
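
Concretely, disabling refinement on a nested terms facet would look
something like the following (field names hypothetical; usual caveat that
distributed counts may then be less accurate):

json.facet={
  byCat: {
    type: terms,
    field: category,
    refine: none,
    facet: {
      bySubCat: { type: terms, field: subCategory }
    }
  }
}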

A word of caution regarding the JSON facet `cacheDf` param: although
it's currently undocumented in the refGuide, I believe it's only
respected at all in FacetFieldProcessorByEnumTermsStream, which is
only invoked under certain circumstances (and only when sort=index).
So this is unlikely to help (though it's impossible to say without
more specific information about the actual requests you're trying to
run).

Michael

On Fri, Oct 23, 2020 at 12:52 AM  wrote:
>
> I'm doing a nested facet (
> https://lucene.apache.org/solr/guide/8_6/json-facet-api.html#nested-facets)
> or sub-facets, and am using the 'terms' facet.
>
> Digging around more looks like I can set 'cacheDf=-1' to disable the use of
> the cache.
>
> On Fri, 23 Oct 2020 at 00:14, Michael Gibney 
> wrote:
>
> > Damien,
> > Are you able to share the actual json.facet request that you're using
> > (at least just the json.facet part)? I'm having a hard time being
> > confident that I'm correctly interpreting when you say "a json.facet
> > query on nested facets terms".
> > Michael
> >
> > On Thu, Oct 22, 2020 at 3:52 AM Christine Poerschke (BLOOMBERG/
> > LONDON)  wrote:
> > >
> > > Hi Damien,
> > >
> > > You mention about JSON term facets, I haven't explored w.r.t. that but
> > we have observed what you describe for JSON range facets and I've started
> > https://issues.apache.org/jira/browse/SOLR-14939 about it.
> > >
> > > Hope that helps.
> > >
> > > Regards,
> > > Christine
> > >
> > > From: solr-user@lucene.apache.org  At: 10/22/20 01:07:59  To:
> > > solr-user@lucene.apache.org
> > > Subject: json.facet floods the filterCache
> > >
> > > Hi,
> > >
> > > I'm using a json.facet query on nested facets terms and am seeing very
> > high
> > > filterCache usage. Is it possible to somehow control this? With a fq it's
> > > possible to specify fq={!cache=false}... but I don't see a similar thing
> > > json.facet.
> > >
> > > Kind regards,
> > > Damien
> > >
> > >
> >


Re: Simulate facet.exists for json query facets

2020-10-28 Thread Michael Gibney
Separately, and in parallel to Erick's question: indeed I'm not aware
of any way to do this currently, but I *can* imagine cases where this
would be useful. I have a sense this could be cleanly implemented as a
stat facet function
(https://lucene.apache.org/solr/guide/8_6/json-facet-api.html#stat-facet-functions),
e.g.:

curl http://localhost:8983/solr/portal/select -d \
"q=*:*\
&json.facet={
  tour: \"exists(+categoryId:6000 -categoryId:(6061 21493 8510))\"
}\
&rows=0"

The return value of the `exists` function could be boolean, which
would be semantically clearer than capping count to 1, as I gather
`facet.exists` does. For the same reason, implementing this as a
function would probably be better than adding this functionality to
the `query` facet type, which carries certain useful assumptions (the
meaning of the "count" attribute in the response, the ability to nest
stats and subfacets, etc.) ... just thinking out loud at the moment
...

On Wed, Oct 28, 2020 at 9:17 AM Erick Erickson  wrote:
>
> This really sounds like an XY problem. The whole point of facets is
> to count the number of documents that have a value in some
> number of buckets. So trying to stop your facet query as soon
> as it matches a hit for the first time seems like an odd thing to do.
>
> So what’s the “X”? In other words, what is the problem you’re trying
> to solve at a high level? Perhaps there’s a better way to figure this
> out.
>
> Best,
> Erick
>
> > On Oct 28, 2020, at 3:48 AM, michael dürr  wrote:
> >
> > Hi,
> >
> > I use json facets of type 'query'. As these queries are pretty slow and I'm
> > only interested in whether there is a match or not, I'd like to restrict
> > the query execution similar to the standard facetting (like with the
> > facet.exists parameter). My simplified query looks something like this (in
> > reality *:* may be replaced by a complex edismax query and multiple
> > subfacets similar to "tour" occur):
> >
> > curl http://localhost:8983/solr/portal/select -d \
> > "q=*:*\
> > &json.facet={
> >  tour:{
> >type : query,
> > q: \"+(+categoryId:6000 -categoryId:(6061 21493 8510))\"
> >  }
> > }\
> > &rows=0"
> >
> > Is there any possibility to modify my request to ensure that the facet
> > query stops as soon as it matches a hit for the first time?
> >
> > Thanks!
> > Michael
>


Re: Avoiding duplicate entry for a multivalued field

2020-10-29 Thread Michael Gibney
If I understand correctly what you're trying to do, docValues for a
number of field types are (at least in their multivalued incarnation)
backed by SortedSetDocValues, which inherently deduplicate values
per-document. In your case it sounds like you could maybe rely on that
behavior as a feature, set stored=false, docValues=true,
useDocValuesAsStored=true, and achieve the desired behavior?
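
E.g., a sketch (field name hypothetical):

<field name="tags" type="string" multiValued="true" stored="false"
       docValues="true" useDocValuesAsStored="true"/>

One thing to be aware of: values will come back deduplicated but also in
sorted order, not insertion order.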
Michael

On Thu, Oct 29, 2020 at 6:17 AM Srinivas Kashyap
 wrote:
>
> Thanks Dwane,
>
> I have a doubt: according to the javadoc, the duplicates still continue to
> exist in the field. Maybe during query time the field returns only unique
> values? Am I right with my assumption?
>
> And also, what is the performance overhead for this UniqFieldsUpdateProcessorFactory?
>
> Thanks,
> Srinivas
>
> From: Dwane Hall 
> Sent: 29 October 2020 14:33
> To: solr-user@lucene.apache.org
> Subject: Re: Avoiding duplicate entry for a multivalued field
>
> Srinivas, this is possible by adding a unique-fields update processor to the
> update processor chain you are using to perform your updates (/update, 
> /update/json, /update/json/docs, .../a_custom_one)
>
> The Javadocs explain its use nicely
> (https://lucene.apache.org/solr/8_6_0//solr-core/org/apache/solr/update/processor/UniqFieldsUpdateProcessorFactory.html)
>  or there are articles on stack overflow addressing this exact problem 
> (https://stackoverflow.com/questions/37005747/how-to-remove-duplicates-from-multivalued-fields-in-solr#37006655)
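>
> A minimal sketch of such a chain in solrconfig.xml (field name hypothetical;
> note that with atomic updates you may need to position the processor after
> the distributed update processor so that it sees the fully merged document):
>
> <updateRequestProcessorChain name="uniq-fields">
>   <processor class="solr.UniqFieldsUpdateProcessorFactory">
>     <str name="fieldName">myMultiValuedField</str>
>   </processor>
>   <processor class="solr.LogUpdateProcessorFactory"/>
>   <processor class="solr.RunUpdateProcessorFactory"/>
> </updateRequestProcessorChain>
>
> (referenced as the default chain on your update handler, or per-request via
> update.chain=uniq-fields)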
>
> Thanks,
>
> Dwane
> 
> From: Srinivas Kashyap 
> mailto:srini...@bamboorose.com.INVALID>>
> Sent: Thursday, 29 October 2020 3:49 PM
> To: solr-user@lucene.apache.org 
> mailto:solr-user@lucene.apache.org>>
> Subject: Avoiding duplicate entry for a multivalued field
>
> Hello,
>
> Say, I have a schema field which is multivalued. Is there a way to maintain 
> distinct values for that field even though I continue to add duplicate values 
> through atomic update via solrj?
>
> Is there some property setting to have only unique values in a multi valued 
> fields?
>
> Thanks,
> Srinivas


Re: Simulate facet.exists for json query facets

2020-10-30 Thread Michael Gibney
Michael, sorry for the confusion; I was positing a *hypothetical*
"exists()" function that doesn't currently exist, that *is* an
aggregate function, and that *does* stop early. I didn't account for
the fact that there's already an "exists()" function *query* that
behaves very differently. So yes, definitely confusing :-). I guess
choosing a different name for the proposed aggregate function would
make sense. I was suggesting it mostly as an alternative to extending
the syntax of JSON Facet "query" facet type, and to say that I think
the implementation of such an aggregate function would be pretty
straightforward.

On Fri, Oct 30, 2020 at 3:44 AM michael dürr  wrote:
>
> @Erick
>
> Sorry! I chose a simple example as I wanted to reduce complexity.
> In detail:
> * We have distinct contents like tours, offers, events, etc which
> themselves may be categorized: A tour may be a hiking tour, a
> mountaineering tour, ...
> * We have hundreds of customers that want to facet their searches to that
> content types but often with distinct combinations of categories, i.e.
> customer A wants his facet "tours" to only count hiking tours, customer B
> only mountaineering tours, customer C a combination of both, etc
> * We use "query" facets as each facet request will be build dynamically (it
> is not feasible to aggregate certain categories and add them as an
> additional solr schema field as we have hundreds of different combinations).
> * Anyways, our ui only requires adding a toggle to filter for (for example)
> "tours" in case a facet result is present. We do not care about the number
> of tours.
> * As we have millions of contents and dozens of content types (and dozens
> of categories per content type) such queries may take a very long time.
>
> A complex example may look like this:
>
> q=*:*&json.facet={
>   tour:      { type: query, q: "+categoryId:(21450 21453)" },
>   guide:     { type: query, q: "+categoryId:(21105 21401 21301 21302 21303 21304 21305 21403 21404)" },
>   story:     { type: query, q: "+categoryId:21515" },
>   condition: { type: query, q: "+categoryId:21514" },
>   hut:       { type: query, q: "+categoryId:8510" },
>   skiresort: { type: query, q: "+categoryId:21493" },
>   offer:     { type: query, q: "+categoryId:21462" },
>   lodging:   { type: query, q: "+categoryId:6061" },
>   event:     { type: query, q: "+categoryId:21465" },
>   poi:       { type: query, q: "+(+categoryId:6000 -categoryId:(6061 21493 8510))" },
>   authors:   { type: query, q: "+categoryId:(21205 21206)" },
>   partners:  { type: query, q: "+categoryId:21200" },
>   list:      { type: query, q: "+categoryId:21481" }
> }&rows=0
>
> @Michael
>
> Thanks for your suggestion but this does not work as
> * the facet module expects an aggregate function (which i simply added by
> wrapping your call in sum(...))
> * and (please correct me if I am wrong) the exists() function does not stop
> on the first match, but counts the number of results for which the query
> matches a document.


Re: Simulate facet.exists for json query facets

2020-10-30 Thread Michael Gibney
> If all of those facet queries are _known_ to be a performance hit,
> you might be able to do something custom. That would require
> custom code though and I wouldn’t go there unless you can
> demonstrate need.

Yeah ... indeed if those facet queries are relatively static (and thus
cacheable ... even if there are a lot of them), an appropriately-sized
filterCache would allow them to be cached to good effect and then the
performance hit should be negligible. Knowing what the queries are up
front, you could even add them to your warming queries.

It'd also be unusual (though possible, sure?) to run these kinds of
facet queries with no intention of ever conditionally following up in
a way that would want the actual results/docSet -- even if the
initial/more common query only cares about boolean existence.

The case in which this type of functionality really might be indicated is:
1. only care about boolean result (obvious, ok)
2. dynamic (i.e., not-particularly-cacheable) queries
3. never intend to follow up with a request that calls for full results

If both of the first two conditions hold, and especially if the third
also holds, there would in principle definitely be efficiency to be
gained by early termination (and avoiding the creation of a DocSet,
which at the moment happens unconditionally for every facet query).
I'm also thinking about this through the lens of bringing the JSON
Facet API to parity with the legacy facet API, fwiw ...

On Fri, Oct 30, 2020 at 9:02 AM Erick Erickson  wrote:
>
> I don’t think there’s anything that does what you’re asking OOB.
>
> If all of those facet queries are _known_ to be a performance hit,
> you might be able to do something custom. That would require
> custom code though and I wouldn’t go there unless you can
> demonstrate need.
>
> If you issue a debug=timing you’ll see the time each component
> takes,  and there’s a separate entry for faceting so that’ll give you
> a clue whether it’s worth the effort.
>
> Best,
> Erick
>
> > On Oct 30, 2020, at 8:10 AM, Michael Gibney  
> > wrote:
> >
> > Michael, sorry for the confusion; I was positing a *hypothetical*
> > "exists()" function that doesn't currently exist, that *is* an
> > aggregate function, and the *does* stop early. I didn't account for
> > the fact that there's already an "exists()" function *query* that
> > behaves very differently. So yes, definitely confusing :-). I guess
> > choosing a different name for the proposed aggregate function would
> > make sense. I was suggesting it mostly as an alternative to extending
> > the syntax of JSON Facet "query" facet type, and to say that I think
> > the implementation of such an aggregate function would be pretty
> > straightforward.
> >
> > On Fri, Oct 30, 2020 at 3:44 AM michael dürr  wrote:
> >>
> >> @Erick
> >>
> >> Sorry! I chose a simple example as I wanted to reduce complexity.
> >> In detail:
> >> * We have distinct contents like tours, offers, events, etc which
> >> themselves may be categorized: A tour may be a hiking tour, a
> >> mountaineering tour, ...
> >> * We have hundreds of customers that want to facet their searches to that
> >> content types but often with distinct combinations of categories, i.e.
> >> customer A wants his facet "tours" to only count hiking tours, customer B
> >> only mountaineering tours, customer C a combination of both, etc
> >> * We use "query" facets as each facet request will be build dynamically (it
> >> is not feasible to aggregate certain categories and add them as an
> >> additional solr schema field as we have hundreds of different 
> >> combinations).
> >> * Anyways, our ui only requires adding a toggle to filter for (for example)
> >> "tours" in case a facet result is present. We do not care about the number
> >> of tours.
> >> * As we have millions of contents and dozens of content types (and dozens
> >> of categories per content type) such queries may take a very long time.
> >>
> >> A complex example may look like this:
> >> [...]

Re: Multiple Facets on Same Field

2020-11-17 Thread Michael Gibney
Answering a slightly different question perhaps, but you can
definitely do this with the "JSON Facet" API, where there's much
cleaner separation between different facets (and output is assigned to
arbitrary keys).
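
For instance (a sketch), two terms facets over the same field with different
mincount/limit:

curl http://localhost:8983/solr/yourCollection/query -d '
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "f1_top":    { "type": "terms", "field": "f1", "limit": 10 },
    "f1_common": { "type": "terms", "field": "f1", "mincount": 5, "limit": -1 }
  }
}'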
Michael

On Tue, Nov 17, 2020 at 9:36 AM Jason Gerlowski  wrote:
>
> Hi all,
>
> Is it possible to have multiple facets on the same field with
> different parameters (mincount, limit, prefix, etc.) on each?
>
> The ref-guide describes these per-facet parameters as being settable
> on a "per-field basis" with syntax of
> "f..facet." [1].  But I wasn't sure whether to
> take that at face value, or hope that the "" value there
> could be something more flexible (like the value of facet.field which
> can take local params).
>
> I've been trying variations of
> "facet=true&facet.field=f1&f.f1.facet.mincount=5&facet.field={!key=someOutputKey}f1",
> but without luck.  "mincount" is always applied to both of the
> facet.field's being computed.
>
> Best,
>
> Jason


Re: Multiple Facets on Same Field

2020-11-17 Thread Michael Gibney
Ah, ok! The Jira issue mentioned in the mail thread you cited above
has some further discussion/detail. (I don't think adding a "{!terms}"
query filter would necessarily work ... it'd need to be a group of
facets of type "query", sorted client-side ... unless I'm missing
something?)
https://issues.apache.org/jira/browse/SOLR-14921


On Tue, Nov 17, 2020 at 11:00 AM Jason Gerlowski  wrote:
>
> Thanks Michael,
>
> I agree - JSON Facets is a better candidate for the functionality I'm
> looking for.  In my case specifically though, I think I'm pegged to
> traditional facets because I also want to use the "terms" local params
> support that doesn't have a native equivalent in JSON Faceting (yet:
> SOLR-14921).
>
> If no one has other ideas here, maybe my best bet is to switch to
> using JSON Faceting and adding an explicit "{!terms}" query as a
> filter.  I see you suggested that as a workaround here [1].
>
> Jason
>
> [1] 
> http://mail-archives.apache.org/mod_mbox/lucene-dev/202010.mbox/%3CCAF%3DheHGKwGtvq%3DgAndmVrgvo1cxKmzP0neGi17_eoVhubpaBZA%40mail.gmail.com%3E
>
> On Tue, Nov 17, 2020 at 10:02 AM Michael Gibney
>  wrote:
> >
> > Answering a slightly different question perhaps, but you can
> > definitely do this with the "JSON Facet" API, where there's much
> > cleaner separation between different facets (and output is assigned to
> > arbitrary keys).
> > Michael
> >
> > On Tue, Nov 17, 2020 at 9:36 AM Jason Gerlowski  
> > wrote:
> > >
> > > Hi all,
> > >
> > > Is it possible to have multiple facets on the same field with
> > > different parameters (mincount, limit, prefix, etc.) on each?
> > >
> > > The ref-guide describes these per-facet parameters as being settable
> > > on a "per-field basis" with syntax of
> > > "f..facet." [1].  But I wasn't sure whether to
> > > take that at face value, or hope that the "" value there
> > > could be something more flexible (like the value of facet.field which
> > > can take local params).
> > >
> > > I've been trying variations of
> > > "facet=true&facet.field=f1&f.f1.facet.mincount=5&facet.field={!key=someOutputKey}f1",
> > > but without luck.  "mincount" is always applied to both of the
> > > facet.field's being computed.
> > >
> > > Best,
> > >
> > > Jason


Re: nested facets of query and terms type in JSON format

2020-12-03 Thread Michael Gibney
Arturas,
I think your syntax is wrong for the range subfacet? -- the configuration
of the range facet should be directly under the `tt` key, rather than
nested under `t_buckets` in the request. (The response introduces a
"buckets" attribute that is not part of the request syntax).
Michael

On Thu, Dec 3, 2020 at 3:47 AM Arturas Mazeika  wrote:

> Hi Solr Team,
>
> I am trying to check how I can formulate facet queries using JSON format. I
> can successfully formulate query, range, term queries, as well as nested
> term queries. How can I formulate a nested facet query involving "query" as
> well as "range" formulations? The following does not work:
>
>
> GET http://localhost:/solr/db/query HTTP/1.1
> content-type: application/json
>
> {
> "query"  : "*:*",
> "limit"  : 0,
> "facet": {
> "a1": { "query":  "cfname2:1" },
> "a2": { "query":  "cfname2:2" },
> "a3": { "field":  "cfname2", "type":"terms", "prefix":"3" },
> "a4": { "query":  "cfname2:4" },
> "a5": { "query":  "cfname2:5" },
> "a6": { "query":  "cfname2:6" },
>
> "tt": {
> "t_buckets": {
> "type":  "range",
> "field": "t",
> "sort": { "t": "asc" },
> "start": "2018-05-02T17:00:00.000Z",
> "end":   "2020-11-16T21:00:00.000Z",
> "gap":   "+1HOUR"
> }
> }
> }
> }
>
> Single (non-nested) facets -- on the individual queries as well as for the
> range -- work with flying colors.
>
> Cheers,
> Arturas
>


Re: nested facets of query and terms type in JSON format

2020-12-03 Thread Michael Gibney
I think the first "error" case in your set of examples above is closest to
being correct. For "query" facet type, I think you want to explicitly
specify `"type":"query"`, and specify the query itself in the `"q"` param,
i.e.:
{
"query"  : "*:*",
"limit"  : 0,

"facet": {
"aip": {
"type":  "query",
"q":  "cfname2:aip",
"facet": {
"t_buckets": {
"type":  "range",
"field": "t",
"sort": { "t": "asc" },
"start": "2018-05-02T17:00:00.000Z",
"end":   "2020-11-16T21:00:00.000Z",
"gap":   "+1HOUR"
"limit": 1
}
}
}
}
}

On Thu, Dec 3, 2020 at 12:59 PM Arturas Mazeika  wrote:

> Hi Michael,
>
> Thanks for helping me to figure this out.
>
> If I fire:
>
> {
> "query"  : "*:*",
> "limit"  : 0,
>
> "facet": {
> "aip": { "query":  "cfname2:aip", }
>
> }
> }
>
> I get
>
> "response": { "numFound": 20560849, "start": 0, "numFoundExact": true,
> "docs": [] }, "facets": { "count": 20560849, "aip": { "count": 2307 } } }
>
> (works). If I fire
>
>
> {
> "query"  : "*:*",
> "limit"  : 0,
>
> "facet": {
> "t_buckets": {
> "type":  "range",
> "field": "t",
> "sort": { "t": "asc" },
> "start": "2018-05-02T17:00:00.000Z",
> "end":   "2020-11-16T21:00:00.000Z",
> "gap":   "+1HOUR"
> "limit": 1
> }
> }
> }
>
> I get
>
> "response": { "numFound": 20560849, "start": 0, "numFoundExact": true,
> "docs": [] }, "facets": { "count": 20560849, "t_buckets": { "buckets": [ {
> "val": "2018-05-02T17:00:00Z", "count": 150 },
>
> (works). If I fire:
>
> {
> "query"  : "*:*",
> "limit"  : 0,
>
> "facet": {
> "aip": { "query":  "cfname2:aip",
>
> "facet": {
> "t_buckets": {
> "type":  "range",
> "field": "t",
> "sort": { "t": "asc" },
> "start": "2018-05-02T17:00:00.000Z",
> "end":   "2020-11-16T21:00:00.000Z",
> "gap":   "+1HOUR"
> "limit": 1
>     }
> }
> }
> }
> }
>
> I get
>
> "error": { "metadata": [ "error-class",
> "org.apache.solr.common.SolrException", "root-error-class",
> "org.apache.solr.common.SolrException" ], "msg": "expected facet/stat type
> name, like {type:range, field:price, ...} but got null , path=/facet",
> "code": 400 } }
>
> If I fire
>
> {
> "query"  : "*:*",
> "limit"  : 0,
>
> "facet": {
> "aip": { "query":  "cfname2:aip",
>
> "facet": {
> "type":  "range",
> "field": "t",
> "sort": { "t": "asc" },
> "start": "2018-05-02T17:00:00.000Z",
> "end":   "2020-11-16T21:00:00.000Z",
> "gap":   "+1HOUR"
> "limit": 1
> }
> }
> }
> }
>
> I get
>
> "error": { "metadata": [ "error-class",
> "org.apache.solr.common.SolrException", "root-error-class",
> "org.apache.solr.common.SolrException" ], "msg&

Re: Antw: Re: Behaviour of punctuation marks in phrase queries

2019-05-17 Thread Michael Gibney
The SpanNearQuery in association with "a.b." input and WDGF is
expected behavior, since WDGF causes the query to search ("ab")|("a"
"b"), as 1 or 2 tokens, respectively. The "a. b." input
(whitespace-separated) is tokenized simply as "a" "b" (2 tokens) so
sticks with the more straightforward PhraseQuery implementation.

That said, the problem you're encountering is related to a couple of issues:
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312

For this case specifically, the problem is that NearSpansOrdered
lazily returns one match per position *for the first subclause*. The
or clause ("ab"|"a" "b"), because positionLength is not indexed, will
always return "ab" first (implicit positionLength of 1). Again because
"ab"'s actual positionLength of 2 from index-time WDGF is not stored
in the index, the implicit positionLength of 1 at query-time gives the
impression of a gap between "ab" and "isar", violating the "slop=0"
constraint.

Because NearSpansOrdered.nextStartPosition() always advances by
calling nextStartPosition() on the first subclause (without exploring
for variant matches in other subclauses), the top-level
NearSpansOrdered advances after one attempt at matching, and the valid
match is missed.

Pending fixes to address the underlying issue (there is a candidate
patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
you could mitigate the problem to some extent by either forcing slop>0
(which as of 7.6 will be expanded into MultiPhraseQuery -- see
https://issues.apache.org/jira/browse/LUCENE-8531), or you could set
preserveOriginal=true on both index-time and query-time WDGF and
upgrade to 8.1 (which would prevent the extreme case of an *exact*
character-for-character matching query turning up no results -- see
https://issues.apache.org/jira/browse/LUCENE-8730).
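
On both analyzers, the preserveOriginal option would look something like
(other parameters as in your existing chain):

<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1"
        generateWordParts="1" catenateWords="1"/>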

On Fri, May 17, 2019 at 11:47 AM Erick Erickson  wrote:
>
> I’ll leave that explanation to someone who understands query parsers ;)
>
> > On May 17, 2019, at 7:57 AM, Doris Peter  
> > wrote:
> >
> > Thanks a lot! I tried the debug parameter, which shows interesting 
> > differences:
> >
> > debug": {
> >
> >"rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >"querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> >"parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> >"parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> >"QParser": "LuceneQParser"
> > }
> >
> > debug": {
> >"rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >"querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> >"parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, 
> > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > 0, true)]), all_places_txt:donau], 0, true))",
> >"parsedquery_toString": "spanNear([all_places_txt:neuburg, 
> > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > 0, true)]), all_places_txt:donau], 0, true)",
> >"QParser": "LuceneQParser"
> >}
> >
> >
> > Something seems to go wrong here, as the parsedquery contains the 
> > SpanNearQuery instead of a PhraseQuery.
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > >>> Erick Erickson  5/17/2019 4:27 PM >>>
> > Three things:
> >
> > 1> WordDelimiterGraphFilterFactory requires FlattenGraphFilterFactory after 
> > it in the index config
> >
> > 2> It is usually unnecessary to have the exact same parameters at both 
> > query and index time for WDGFF. If you’ve split parts up at index time then 
> > mashed them all back together, you can usually only split them up at query 
> > time.
> >
> > 3> try adding &debug=query to the query and see what the results show for 
> > the parsed query. That usually gives you a clue what is really happening 
> > .vs. what you think is happening.
> >
> > Best,
> > Erick
> >
> >> On May 17, 2019, at 12:59 AM, Doris Peter  
> >> wrote:
> >>
> >> Hello,
> >>
> >> We use Solr 7.6.0 to build our index, and I have got a Question about
> >> Phrase Queries:
> >>
> >> We use the following configuration in schema.xml:
> >>
> >> <fieldType name="..." class="solr.TextField"
> >>     positionIncrementGap="1000" sortMissingLast="true"
> >>     autoGeneratePhraseQueries="true">
> >>   <analyzer type="index">
> >>     <charFilter class="solr.MappingCharFilterFactory"
> >>         mapping="mapping-FoldToASCII.txt"/>
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         protected="protectedword.txt"
> >>         preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
> >>         catenateWords="1" catenateNumbers="1" catenateAll="1"
> >>         generateWordParts="1" generateNumberParts="1"
> >>         stemEnglishPossessive="1"
> >>         types="wdfftypes.txt"/>
> >>     <filter class="..." max="2147483647"/>
> >>   </analyzer>
> >>   <analyzer type="query">
> >>     <charFilter class="solr.MappingCharFilterFactory"
> >>         mapping="mapping-FoldToASCII.txt"/>
> >>     <tokenizer class="..."/>
> >>     <filter class="solr.WordDelimiterGraphFilterFactory"
> >>         protected="protectedword.txt"
> >>         preserveOriginal="0" splitOnNumerics="1" splitOnCaseChange="0"
> >>         catenateWords="1" catenateNu

Re: Antw: Re: Behaviour of punctuation marks in phrase queries

2019-05-17 Thread Michael Gibney
After further reflection, I think that upgrading to 8.1 (LUCENE-8730)
would actually not help in this case. It doesn't matter whether "a.b."
or "ab" would be indexed or evaluated first; they'd both have implied
positionLength 1 (as read from the index at query time), and would
both be evaluated before ("a" "b"), leaving the impression of a gap
between tokens, causing the match to be missed.

On Fri, May 17, 2019 at 12:29 PM Michael Gibney
 wrote:
>
> The SpanNearQuery in association with "a.b." input and WDGF is
> expected behavior, since WDGF causes the query to search ("ab")|("a"
> "b"), as 1 or 2 tokens, respectively. The "a. b." input
> (whitespace-separated) is tokenized simply as "a" "b" (2 tokens), so it
> sticks with the more straightforward PhraseQuery implementation.
>
> That said, the problem you're encountering is related to a couple of issues:
> https://issues.apache.org/jira/browse/LUCENE-7398
> https://issues.apache.org/jira/browse/LUCENE-4312
>
> For this case specifically, the problem is that NearSpansOrdered
> lazily returns one match per position *for the first subclause*. The
> or clause ("ab"|"a" "b"), because positionLength is not indexed, will
> always return "ab" first (implicit positionLength of 1). Again because
> "ab"'s actual positionLength of 2 from index-time WDGF is not stored
> in the index, the implicit positionLength of 1 at query-time gives the
> impression of a gap between "ab" and "isar", violating the "slop=0"
> constraint.
>
> Because NearSpansOrdered.nextStartPosition() always advances by
> calling nextStartPosition() on the first subclause (without exploring
> for variant matches in other subclauses), the top-level
> NearSpansOrdered advances after one attempt at matching, and the valid
> match is missed.
>
> Pending fixes to address the underlying issue (there is a candidate
> patch for LUCENE-7398 that incorporates a workaround for LUCENE-4312),
> you could mitigate the problem to some extent by either forcing slop>0
> (which as of 7.6 will be expanded into MultiPhraseQuery -- see
> https://issues.apache.org/jira/browse/LUCENE-8531), or you could set
> preserveOriginal=true on both index-time and query-time WDGF and
> upgrade to 8.1 (which would prevent the extreme case of an *exact*
> character-for-character matching query turning up no results -- see
> https://issues.apache.org/jira/browse/LUCENE-8730).
>
> On Fri, May 17, 2019 at 11:47 AM Erick Erickson  
> wrote:
> >
> > I’ll leave that explanation to someone who understands query parsers ;)
> >
> > > On May 17, 2019, at 7:57 AM, Doris Peter  
> > > wrote:
> > >
> > > Thanks a lot! I tried the debug parameter, which shows interesting 
> > > differences:
> > >
> > > debug": {
> > >
> > >"rawquerystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >"querystring": "all_places_txt:\"Neuburg a. d. Donau\"",
> > >"parsedquery": "PhraseQuery(all_places_txt:\"neuburg a d donau\")",
> > >"parsedquery_toString": "all_places_txt:\"neuburg a d donau\"",
> > >"QParser": "LuceneQParser"
> > > }
> > >
> > > debug": {
> > >"rawquerystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >"querystring": "all_places_txt:\"Neuburg a.d. Donau\"",
> > >"parsedquery": "SpanNearQuery(spanNear([all_places_txt:neuburg, 
> > > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > > 0, true)]), all_places_txt:donau], 0, true))",
> > >"parsedquery_toString": "spanNear([all_places_txt:neuburg, 
> > > spanOr([all_places_txt:ad, spanNear([all_places_txt:a, all_places_txt:d], 
> > > 0, true)]), all_places_txt:donau], 0, true)",
> > >"QParser": "LuceneQParser"
> > >}
> > >
> > >
> > > Something seems to go wrong here, as the parsedquery contains the 
> > > SpanNearQuery instead of a PhraseQuery.
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >>>> Erick Erickson  5/17/2019 4:27 PM >>>
> > > Three things:
> > >
> > > 1> WordDelimiterGrap

Re: Query of Death Lucene/Solr 7.6

2019-05-30 Thread Michael Gibney
Very likely: https://issues.apache.org/jira/browse/SOLR-13336
Individual queries should still fail, but should fail fast, without
the broader impact seen prior to 8.1.
Does that describe the behavior you're seeing now with 8.1.1?
Michael

On Thu, May 30, 2019 at 11:55 AM Markus Jelsma
 wrote:
>
> Hello,
>
> When upgrading to 8.1.1 i took some time to quickly test this problem. Good
> news, it has disappeared, for me at least. I can immediately reproduce it on a
> local 7.7 node; it died immediately. But it runs smoothly on the production
> 7.5, and local 8.1.1 node! The problem still exists in 8.0.0.
>
> So i went through Lucene and Solr's CHANGELOG again but could not find any 
> ticket about the problem. Does anyone have an idea which ticket could be 
> responsible for fixing this?
>
> Anyway, many thanks!
> Markus
>
> -Original message-
> > From:Michael Gibney 
> > Sent: Friday 22nd February 2019 17:22
> > To: solr-user@lucene.apache.org
> > Subject: Re: Query of Death Lucene/Solr 7.6
> >
> > Ah... I think there are two issues likely at play here. One is LUCENE-8531
> > , which reverts a bug
> > related to SpanNearQuery semantics, causing possible query paths to be
> > enumarated up front. Setting ps=0 (although perhaps not appropriate for
> > some use cases) should address problems related to this issue.
> >
> > The other (likely affecting Gregg, for whom ps=0 did not help) is SOLR-12243
> > . Prior to 7.6,
> > SpanNearQuery (generated for relatively complex "graph" tokenized queries,
> > such as would be generated with WDGF, SynonymGraphFilter, etc.) were simply
> > getting dropped. This was surely a bug, in that pf did not contribute at
> > all to boosting such queries; but the silver lining was that performance
> > was great ;-)
> >
> > Markus, Gregg, could send examples (parsed query toString()) of problematic
> > queries (and perhaps relevant analysis chain configs)?
> >
> > Michael
> >
> >
> >
> > On Fri, Feb 22, 2019 at 11:00 AM Gregg Donovan  wrote:
> >
> > > FWIW: we have also seen serious Query of Death issues after our upgrade to
> > > Solr 7.6. Are there any open issues we can watch? Is Markus' findings
> > > around `pf` our best guess? We've seen these issues even with ps=0. We 
> > > also
> > > use the WDF.
> > >
> > > On Fri, Feb 22, 2019 at 8:58 AM Markus Jelsma 
> > > wrote:
> > >
> > > > Hello Michael,
> > > >
> > > > Sorry it took so long to get back to this, too many things to do.
> > > >
> > > > Anyway, yes, we have WDF on our query-time analysers. I uploaded two log
> > > > files, both the same query of death with and without synonym filter
> > > enabled.
> > > >
> > > > https://mail.openindex.io/export/solr-8983-console.log 23 MB
> > > > https://mail.openindex.io/export/solr-8983-console-without-syns.log 1.9
> > > MB
> > > >
> > > > Without the synonym we still see a huge number of entries. Many 
> > > > different
> > > > parts of our analyser chain contribute to the expansion of queries, but
> > > pf
> > > > itself really turns the problem on or off.
> > > >
> > > > Since SOLR-12243 is new in 7.6, does anyone know that SOLR-12243 could
> > > > have this side-effect?
> > > >
> > > > Thanks,
> > > > Markus
> > > >
> > > >
> > > > -Original message-
> > > > > From:Michael Gibney 
> > > > > Sent: Friday 8th February 2019 17:19
> > > > > To: solr-user@lucene.apache.org
> > > > > Subject: Re: Query of Death Lucene/Solr 7.6
> > > > >
> > > > > Hi Markus,
> > > > > As of 7.6, LUCENE-8531 <
> > > > https://issues.apache.org/jira/browse/LUCENE-8531>
> > > > > reverted a graph/Spans-based phrase query implementation (introduced 
> > > > > in
> > > > 6.5
> > > > > -- LUCENE-7699 ) to
> > > > an
> > > > > implementation that builds a separate phrase query for each possible
> > > > > enumerated path through the graph described by a parsed query.
> > > > > The potential for combinatoric explosion of the enumerated approach 
> > > > > was
> > > > (as
> > > > > far as I can tell) one of the main motivations for introducing the
> > > > > Spans-based implementation. Some real-world use cases would be good to
> > > > > explore. Markus, could you send (as an attachment) the debug 
> > > > > toString()
> > > > for
> > > > > the queries with/without synonyms enabled? I'm also guessing you may
> > > have
> > > > > WordDelimiterGraphFilter on the query analyzer?
> > > > > As an alternative to disabling pf, LUCENE-8531 only reverts to the
> > > > > enumerated approach for phrase queries where slop>0, so setting ps=0
> > > > would
> > > > > probably also help.
> > > > > Michael
> > > > >
> > > > > On Fri, Feb 8, 2019 at 5:57 AM Markus Jelsma <
> > > markus.jel...@openindex.io
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Hello (apologies for cross-posting),
> > > > > >
> > > > > > While working on SOLR-12743, using 7.6 on two nodes and 7.

Re: HttpShardHandlerFactory

2019-08-19 Thread Michael Gibney
Mark,

Another thing to check is that I believe the configuration you posted may
not actually be taking effect. Unless I'm mistaken, I think the correct
element name to configure the shardHandler is "shardHandler*Factory*", not
"shardHandler" ... as in, '...'

The element name is documented correctly in the refGuide page for "Format
of solr.xml":
https://lucene.apache.org/solr/guide/8_1/format-of-solr-xml.html#the-shardhandlerfactory-element

... but the incorrect (?) element name is included in the refGuide page for
"Distributed Requests":
https://lucene.apache.org/solr/guide/8_1/distributed-requests.html#configuring-the-shardhandlerfactory

Michael

On Fri, Aug 16, 2019 at 9:40 AM Shawn Heisey  wrote:

> On 8/16/2019 3:51 AM, Mark Robinson wrote:
> > I am trying to understand the socket time out and connection time out in
> > the HttpShardHandlerFactory:-
> >
> > <shardHandler class="HttpShardHandlerFactory">
> >   <int name="connTimeout">10</int>
> >   <int name="socketTimeout">20</int>
> > </shardHandler>
>
> The shard handler is used when that Solr instance needs to make
> connections to another Solr instance (which could be itself, as odd as
> that might sound).  It does not apply to the requests that you make from
> outside Solr.
>
> > 1. Could someone please help me understand the effect of using such low
> >    values of 10 ms and 20 ms as given above inside my /select handler?
>
> A connection timeout of 10 milliseconds *might* result in connections
> not establishing at all.  This is translated down to the TCP socket as
> the TCP connection timeout -- the time limit imposed on making the TCP
> connection itself.  Which as I understand it, is the completion of the
> "SYN", "SYN/ACK", and "ACK" sequence.  If the two endpoints of the
> connection are on a LAN, you might never see a problem from this -- LAN
> connections are very low latency.  But if they are across the Internet,
> they might never work.
>
> The socket timeout of 20 milliseconds means that if the connection goes
> idle for 20 milliseconds, it will be forcibly closed.  So if it took 25
> milliseconds for the remote Solr instance to respond, this Solr instance
> would have given up and closed the connection.  It is extremely common
> for requests to take 100, 500, 2000, or more milliseconds to respond.
>
> > 2. What is the guidelines for setting these parameters? Should they be
> low
> > or high
>
> I would probably use a value of about 5000 (five seconds) for the
> connection timeout if everything's on a local LAN.  I might go as high
> as 15 seconds if there's a high latency network between them, but five
> seconds is probably long enough too.
>
> For the socket timeout, you want a value that's considerably longer than
> you expect requests to ever take.  Probably somewhere between two and
> five minutes.
>
> > 3. How can I test the effect of this chunk of code after adding it to my
> > /select handler ie I want to
> >   make sure the above code snippet is working. That is why I gave
> such
> > low values and
> >   thought when I fire a query I would get both time out errors in the
> > logs. But did not!
> >   Or is it that within the above time frame (10 ms, 20ms) if no
> request
> > comes the socket will
> >   time out and the connection will be lost. So to test this should I
> > give a say 100 TPS load with
> >   these low values and then increase the values to maybe 1000 ms and
> > 1500 ms respectively
> >   and see lesser time out error messages?
>
> If you were running a multi-server SolrCloud setup (or a single-server
> setup with multiple shards and/or replicas), you probably would see
> problems from values that low.  But if Solr never has any need to make
> connections to satisfy a request, then the values will never take effect.
>
> If you want to control these values for requests made from outside Solr,
> you will need to do it in your client software that is making the request.
>
> Thanks,
> Shawn
>


Re: Query on autoGeneratePhraseQueries

2019-10-16 Thread Michael Gibney
Going back to the initial question, the wording is a little ambiguous
and it occurs to me that it's possible there's a misunderstanding of what
autoGeneratePhraseQueries does. It really only auto-generates phrase
*subqueries*. To use the example from the initial request, a query like
(black company) would always generate a non-phrase query (respecting mm,
q.op, etc. -- but in any case not a top-level phrase query), regardless of
the setting of autoGeneratePhraseQueries.

autoGeneratePhraseQueries (when set to true) only kicks in (in different
ways depending on analysis chain, and setting of "sow") for a query like
(the black-company manufactures), which would be transformed to something
more like (the "black company" manufactures). The idea is that there's some
extra indication that the two words should be bundled together for purposes
of querying.
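
For reference, a minimal sketch of a fieldType where this kicks in (names
are illustrative; assumes sow=true, and any graph-producing query analyzer
such as WDGF would behave similarly):

  <fieldType name="text_agpq" class="solr.TextField"
             autoGeneratePhraseQueries="true" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterGraphFilterFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Here WDGF splits "black-company" into two tokens at query time, and
autoGeneratePhraseQueries bundles those tokens into the phrase subquery
"black company".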

If you want to auto-generate a top-level phrase query, some other approach
would be called for.

Apologies if this is obvious and/or not helpful, Shubham!

On Wed, Oct 16, 2019 at 10:10 AM Shawn Heisey  wrote:

> On 10/16/2019 7:14 AM, Shubham Goswami wrote:
> > I have implemented the sow=false property with eDismax Query parser but
> > still it does not have any effect
> > on the query, as it is still parsed as separate terms instead of a phrased
> > one.
>
> We have seen reports that when sow=false, which is the default setting
> since Solr 7.0, autoGeneratePhraseQueries does not work.  Try setting
> sow=true and see whether you get the results you expect.
>
> I do not know whether this behavior is a bug or if it is expected.
>
> Thanks,
> Shawn
>


Re: FlattenGraphFilter Eliminates Tokens - Can't match "Can't"

2019-12-05 Thread Michael Gibney
I wonder if this might be similar/related to the underlying problem
that is intended to be addressed by
https://issues.apache.org/jira/browse/LUCENE-8985?

btw, I think you only want to use FlattenGraphFilter *once* in the
indexing analysis chain, towards the end (after all components that
emit graphs). ...though that's probably *not* what's causing the
problem (based on the fact that the extra FGF doesn't seem to modify
any attributes).



On Mon, Nov 25, 2019 at 2:19 PM Eric Buss  wrote:
>
> Hi all,
>
> I have been trying to solve an issue where FlattenGraphFilter (FGF) removes
> tokens produced by WordDelimiterGraphFilter (WDGF) - consequently searches 
> that
> contain the contraction "can't" do not match.
>
> This is on Solr version 7.7.1.
>
> The field in question is defined as follows:
>
> <field name="..." type="text_general" ... />
>
> And the relevant fieldType "text_general":
>
> <fieldType name="text_general" class="solr.TextField"
>  positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>  stemEnglishPossessive="0" preserveOriginal="1" catenateAll="1"
>  splitOnCaseChange="0"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>     <filter class="solr.SynonymGraphFilterFactory"
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>     <filter class="solr.FlattenGraphFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="solr.WordDelimiterGraphFilterFactory"
>  stemEnglishPossessive="0" preserveOriginal="0" catenateAll="0"
>  splitOnCaseChange="0"/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>     <filter class="org.apache.lucene.analysis.icu.ICUFoldingFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Finally, the relevant entries in synonyms.txt are:
>
> can,cans
> cants,cant
>
> Using the Solr console Analysis and "can't" as the Field Value, the following
> tokens are produced (find the verbose output at the bottom of this email):
>
> Index
> ST| can't
> SF| can't
> WDGF  | cant | can't | can | t
> FGF   | cant | can't | can | t
> SGF   | cants | cant | can't | | cans | can | t
> ICUFF | cants | cant | can't | | cans | can | t
> FGF   | cants | cant | can't | | t
>
> Query
> ST| can't
> SF| can't
> WDGF  | can | t
> SF| can | t
> ICUFF | can | t
>
> As you can see, after the final FGF the tokens "can" and "cans" are pruned, so the 
> query
> does not match. Is there a reasonable way to preserve these tokens?
>
> My key concern is that I want the "fix" for this to have as little impact on
> other queries as possible.
>
> Some things I have checked/tried:
>
> Searching for similar problems I found this thread:
> https://lucene.472066.n3.nabble.com/Questions-for-SynonymGraphFilter-and-WordDelimiterGraphFilter-td4420154.html
> Here it is suggested that FGF is not necessary (without any supporting
> evidence). This goes directly against the documentation that states "If you 
> use
> [the SynonymGraphFilter] during indexing, you must follow it with a Flatten
> Graph Filter":
> https://lucene.apache.org/solr/guide/7_0/filter-descriptions.html
> Despite this warning I tried out removing the FGF on a local
> cluster and indeed it still runs and this search now works, however I am
> paranoid that this will break far more things than it fixes.
>
> I have tried adding the FGF as a filter to the query. This does not eliminate
> the "can" term in the query analysis.
>
> I have tested other contracted words. Some have this issue as well - others do
> not. "haven't", "shouldn't", "couldn't", "I'll", "weren't", "ain't" all
> preserve their tokens "won't" does not. I believe the pattern here is that
> whenever part of the contraction has synonyms this problem manifests.
>
> Eliminating WDGF is not viable as we rely on this functionality for other uses
> of delimiters (such as wi-fi -> wi fi).
>
> Performing WDGF after synonyms is also not viable as in the case that we have
> the data "historical-text" we want this to match the search "history text".
>
> The hacky solution I have found is to use the PatternReplaceFilterFactory to
> replace "can't" with "cant". Though this technically solves the issue, I hope 
> it
> is obvious why this does not feel like an ideal solution.
>
> Has anyone encountered this type of issue before? Any advice on how the filter
> use here could be improved to handle this case?
>
> Thanks,
> Eric Buss
>
>
> PS. The verbose output from Analysis of "can't"
>
> Index
>
> ST| text  | can't|
>   | raw_bytes | [63 61 6e 27 74] |
>   | start | 0|
>   | end   | 5|
>   | positionLength| 1|
>   | type  ||
>   | termFrequency | 1|
>   | position  | 1|
> SF| text  | can't|
>   | raw_bytes | [63 61 6e 27 74] |
>   | start | 0|
>   | end   | 5|
>   | positionLength| 1|
>   | type  ||
>   | termFrequency | 1|
>   | position  | 1|
> WDGF  | text  | cant  | can't| can

Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-18 Thread Michael Gibney
This is related to this issue:
https://issues.apache.org/jira/browse/SOLR-13336

Also tangentially relevant:
https://issues.apache.org/jira/browse/LUCENE-8531
https://issues.apache.org/jira/browse/SOLR-12243

I think your options include:
1. setting slop=0, which restores SpanNearQuery as the graph phrase
query implementation (see LUCENE-8531, and the example below this list)
2. downgrading to 7.5 would avoid the OOM, but would cause graph
phrase queries to be effectively ignored (see SOLR-12243)
3. upgrade to 8.0, which will restore the failsafe maxBooleanClauses,
avoiding OOM but returning an error code for affected queries (which
in your case sounds like most queries?) (see SOLR-13336)
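
For reference, assuming edismax, option 1 is just a request param, e.g.:

  &defType=edismax&q=...&pf=ngs_title&ps=0

... where ps=0 sets the slop on the pf-generated phrase queries to zero.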

Michael

On Tue, Dec 17, 2019 at 4:16 PM Nick D  wrote:
>
> Hello All,
>
> We recently upgraded from Solr 6.6 to Solr 7.7.2 and soon after had spikes in
> memory that eventually caused either an OOM or almost 100% utilization of
> the available memory. After trying a few things, increasing the JVM heap,
> making sure docValues were set for all Sort, facet fields (thought maybe
> the fieldCache was blowing up), I was able to isolate a single query that
> would cause the used memory to become fully exhausted and effectively
> render the instance dead. After applying a timeAllowed value to the query
> and reducing the query phrase (the system would crash without throwing the
> warning on longer queries containing synonyms), I was able to identify the
> following warning in the logs:
>
> o.a.s.s.SolrIndexSearcher Query: <very long synonym expansion>
>
> the request took too long to iterate over terms. Timeout: timeoutAt:
> 812182664173653 (System.nanoTime(): 812182715745553),
> TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441
>
> I have narrowed the problem down to the following:
> the way synonyms are being expanded along with phrase slop.
>
> With a ps=5 I get 4096 possible permutations of the phrase being searched
> with, because of synonyms, looking similar to:
> ngs_title:"bereavement leave type build bereavement leave type data p"~5
>  ngs_title:"bereavement leave type build bereavement bereavement type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement jury duty type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement maternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build bereavement paternity type data
> p"~5
>  ngs_title:"bereavement leave type build bereavement paternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build bereavement adoption leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty maternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty paternity type data p"~5
>  ngs_title:"bereavement leave type build jury duty paternity leave type
> data p"~5
>  ngs_title:"bereavement leave type build jury duty adoption leave type data
> p"~5
>  ngs_title:"bereavement leave type build jury duty absence type data p"~5
>  ngs_title:"bereavement leave type build maternity leave leave type data
> p"~5
>  ngs_title:"bereavement leave type build maternity leave bereavement type
> data p"~5
>  ngs_title:"bereavement leave type build maternity leave jury duty type
> data p"~5
>
> 
>
> Previously in Solr 6 that same query, with the same synonyms (and query
> analysis chain) would produce a parsedQuery like when using a &ps=5:
> DisjunctionMaxQuery(((ngs_field_description:\"leave leave type build leave
> leave type data ? p leave leave type type.enabled\"~5)^3.0 |
> (ngs_title:\"leave leave type build leave leave type data ? p leave leave
> type type.enabled\"~5)^10.0)
>
> The expansion wasn't being applied to the added disjunctionMaxQuery to when
> adjusting rankings with phrase slop.
>
> In general the parsed queries between 6 and 7 are different, with some new
> `spanNears` showing, but they don't create the memory consumption issues
> that I have seen when a large synonym expansion is happening along w/ using
> a PS parameter.
>
> I didn't see much in the release notes about synonym changes
> (outside of sow=false being the default as of version 7).
>
> The field being operated on has the following query analysis chain:
>
> <analyzer type="query">
>   <tokenizer class="..."/>
>   <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
>   <filter class="solr.SynonymGraphFilterFactory"
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>   <filter class="..."/>
> </analyzer>
>
> Not sure if there is a change in phrase slop that now takes synonyms into
> account, and if there is a way to disable that kind of expansion or not. I am
> not sure if it is related to SOLR-10980
> <https://issues.apache.org/jira/browse/SOLR-10980> or
> not; it does seem to be related, but that issue referenced Solr 6, which does
> not do the expansion.
>
> Any help would be greatly appreciated.
>
> Nick


Re: Synonym expansions w/ phrase slop exhausting memory after upgrading to SOLR 7

2019-12-19 Thread Michael Gibney
Solution 2 (downgrade to 7.5) fixes the problem by reverting to
building proximity (SpanNear) queries that do not explode
exponentially like MultiPhraseQuery does; but note that
SpanNearQueries in 7.5 are dropped (SOLR-12243), so they have
literally no effect whatsoever (aside from the minimal cost of
building them before they get GC'd).

You are correct that there's still a problem wrt exponential graph
query expansion (a case often associated with multi-term synonyms and
WordDelimiterGraphFilter). In 8.1 and later, the parser still
*attempts* to build the exact same queries, but will be prevented from
doing so (and return an error response code) when expansion hits the
effective maxBooleanClauses threshold. So individual requests will
(for certain analysis/input combinations) still error out, but the
entire system is no longer affected. Some related issues (having to do
with graph token streams) that might be of interest:
https://issues.apache.org/jira/browse/LUCENE-8544
https://issues.apache.org/jira/browse/LUCENE-7398
https://issues.apache.org/jira/browse/LUCENE-4312


I interpret the documentation
(https://lucene.apache.org/solr/guide/8_3/query-settings-in-solrconfig.html#maxbooleanclauses)
to indicate that the maxBooleanClauses setting in solr.xml takes
priority as a hard upper bound; so the individual settings in
solrconfig.xml may be used per-collection to *decrease* the global
limit. If you attempt to set maxBooleanClauses larger in
solrconfig.xml than the systemwide setting (configured via solr.xml),
a warning message will be logged, but the attempted configuration will
otherwise have no effect (the lower systemwide value will still be the
effective limit).
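
To illustrate the two places this is configured (values shown are the
defaults):

  <!-- solr.xml: the systemwide hard upper bound -->
  <int name="maxBooleanClauses">1024</int>

  <!-- solrconfig.xml, inside the <query> section: the per-collection limit,
       effective only if it does not exceed the solr.xml value -->
  <maxBooleanClauses>1024</maxBooleanClauses>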

Michael


On Wed, Dec 18, 2019 at 9:48 PM Nick D  wrote:
>
> Michael,
>
> Thank you so much, that was extremely helpful. My googlefu wasn't good
> enough I guess.
>
> 1. Was my initial fix just to stop it from exploding.
>
> 2. Will be the perm solutions for now until we can get some things squared
> away for 8.0.
>
> Sounds like even in 8 there is a problem with any graph query expansion
> potential still growing rather large but it just won't consume all
> available memory, is that correct?
>
> One final question, why would the maxbooleanqueries value in the solrconfig
> still apply? Reading through all the jiras I thought that was supposed to
> still be a fail safe, did I miss something?
>
> Thanks again for your help,
>
> Nick
>
> On Wed, Dec 18, 2019, 8:10 AM Michael Gibney 
> wrote:
>
> > This is related to this issue:
> > https://issues.apache.org/jira/browse/SOLR-13336
> >
> > Also tangentially relevant:
> > https://issues.apache.org/jira/browse/LUCENE-8531
> > https://issues.apache.org/jira/browse/SOLR-12243
> >
> > I think your options include:
> > 1. setting slop=0, which restores SpanNearQuery as the graph phrase
> > query implementation (see LUCENE-8531)
> > 2. downgrading to 7.5 would avoid the OOM, but would cause graph
> > phrase queries to be effectively ignored (see SOLR-12243)
> > 3. upgrade to 8.0, which will restore the failsafe maxBooleanClauses,
> > avoiding OOM but returning an error code for affected queries (which
> > in your case sounds like most queries?) (see SOLR-13336)
> >
> > Michael
> >
> > On Tue, Dec 17, 2019 at 4:16 PM Nick D  wrote:
> > >
> > > Hello All,
> > >
> > > We recently upgraded from Solr 6.6 to Solr 7.7.2 and soon after had spikes
> > in
> > > memory that eventually caused either an OOM or almost 100% utilization of
> > > the available memory. After trying a few things, increasing the JVM heap,
> > > making sure docValues were set for all Sort, facet fields (thought maybe
> > > the fieldCache was blowing up), I was able to isolate a single query that
> > > would cause the used memory to become fully exhausted and effectively
> > > render the instance dead. After applying a timeAllowed value to the query
> > > and reducing the query phrase (the system would crash without throwing the
> > > warning on longer queries containing synonyms), I was able to identify the
> > > following warning in the logs:
> > >
> > > o.a.s.s.SolrIndexSearcher Query: <very long synonym expansion>
> > >
> > > the request took too long to iterate over terms. Timeout: timeoutAt:
> > > 812182664173653 (System.nanoTime(): 812182715745553),
> > > TermsEnum=org.apache.lucene.codecs.blocktree.SegmentTermsEnum@7a0db441
> > >
> > > I have narrowed the problem down to the following:
> > > the way synonyms are being expanded along with phrase slop.
> > >
> > > With a 

Re: Solr 7.7 heap space is getting full

2020-01-22 Thread Michael Gibney
Rajdeep, you say that "suddenly" heap space is getting full ... does
this mean that some variant of this configuration was working for you
at some point, or just that the failure happens quickly?

If heap space and faceting are indeed the bottleneck, you might make
sure that you have docValues enabled for your facet field fieldTypes,
and perhaps set uninvertible=false.
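
For example, a minimal schema sketch (field and type names are illustrative):

  <fieldType name="string_dv" class="solr.StrField"
             docValues="true" uninvertible="false"/>
  <field name="category" type="string_dv" indexed="true" stored="true"/>

Note that adding docValues to an existing field only applies to newly
indexed documents, so a full reindex may be needed for it to help.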

I'm not seeing where large numbers of facets initially came from in
this thread? But on that topic this is perhaps relevant, regarding the
potential utility of a facet cache:
https://issues.apache.org/jira/browse/SOLR-13807

Michael

On Wed, Jan 22, 2020 at 7:16 AM Toke Eskildsen  wrote:
>
> On Sun, 2020-01-19 at 21:19 -0500, Mehai, Lotfi wrote:
> > I had a similar issue with a large number of facets. There is no way
> > (at least that I know of) you can get an acceptable response time from a
> > search engine with a high number of facets.
>
> Just for the record then it is doable under specific circumstances
> (static single-shard index, only String fields, Solr 4 with patch,
> fixed list of facet fields):
> https://sbdevel.wordpress.com/2013/03/20/over-9000-facet-fields/
>
> More usable for the current case would be to play with facet.threads
> and throw hardware with many CPU-cores after the problem.
>
> - Toke Eskildsen, Royal Danish Library
>
>


Re: Solr using all available CPU and becoming unresponsive

2021-01-11 Thread Michael Gibney
Hi Jeremy,
Can you share your analysis chain configs? (SOLR-13336 can manifest in a
similar way, and would affect 7.3.1 with a susceptible config, given the
right (wrong?) input ...)
Michael

On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:

> Hello all,
>  We have been struggling with an issue where solr will intermittently
> use all available CPU and become unresponsive.  It will remain in this
> state until we restart.  Solr will remain stable for some time, usually a
> few hours to a few days, before this happens again.  We've tried adjusting
> the caches and adding memory to both the VM and JVM, but we haven't been
> able to solve the issue yet.
>
> Here is some info about our server:
> Solr:
>   Solr 7.3.1, running on Java 1.8
>   Running in cloud mode, but there's only one core
>
> Host:
>   CentOS7
>   8 CPU, 56GB RAM
>   The only other processes running on this VM are two zookeepers, one for
> this Solr instance, one for another Solr instance
>
> Solr Config:
>  - One Core
>  - 36 Million documents (Max Doc), 28 million (Num Docs)
>  - ~15GB
>  - 10-20 Requests/second
>  - The schema is fairly large (~100 fields) and we allow faceting and
> searching on many, but not all, of the fields
>  - Data are imported once per minute through the DataImportHandler, with a
> hard commit at the end.  We usually index ~100-500 documents per minute,
> with many of these being updates to existing documents.
>
> Cache settings:
>   size="256"
>  initialSize="256"
>  autowarmCount="8"
>  showItems="64"/>
>
>size="256"
>   initialSize="256"
>   autowarmCount="0"/>
>
> size="1024"
>initialSize="1024"
>autowarmCount="0"/>
>
> For the filterCache, we have tried sizes as low as 128, which caused our
> CPU usage to go up and didn't solve our issue.  autowarmCount used to be
> much higher, but we have reduced it to try to address this issue.
>
>
> The behavior we see:
>
> Solr is normally using ~3-6GB of heap and we usually have ~20GB of free
> memory.  Occasionally, though, solr is not able to free up memory and the
> heap usage climbs.  Analyzing the GC logs shows a sharp incline of usage
> with the GC (the default CMS) working hard to free memory, but not
> accomplishing much.  Eventually, it fills up the heap, maxes out the CPUs,
> and never recovers.  We have tried to analyze the logs to see if there are
> particular queries causing issues or if there are network issues to
> zookeeper, but we haven't been able to find any patterns.  After the issues
> start, we often see session timeouts to zookeeper, but it doesn't appear​
> that they are the cause.
>
>
>
> Does anyone have any recommendations on things to try or metrics to look
> into or configuration issues I may be overlooking?
>
> Thanks,
> Jeremy
>
>


Re: Solr using all available CPU and becoming unresponsive

2021-01-12 Thread Michael Gibney
Ahh ok. If those are your only fieldType definitions, and most of your
config is copied from the default, then SOLR-13336 is unlikely to be the
culprit. Looking at more general options, off the top of my head:
1. make sure you haven't allocated all physical memory to heap (leave a
decent amount for OS page cache)
2. disable swap, if you can (this is esp. important if using network
storage as swap). There are potential downsides to this (so proceed with
caution); but if part of your heap gets swapped out (and it almost
certainly will, with a sufficiently large heap) full GCs lead to a swap
storm that compounds the problem. (fwiw, this is probably the first thing
I'd recommend looking into and trying, because it's so easy, and can in
some cases yield a dramatic improvement. N.b., I'm talking about `swapoff
-a`, not `sysctl -w vm.swappiness=0` -- I find that the latter does *not*
eliminate swapping in the way that's needed to achieve the desired goal in
this case. Again, exercise caution in doing this, discuss, research, etc.).
Related documentation was added in 8.5, but absolutely applies to 7.3.1 as
well:
https://lucene.apache.org/solr/guide/8_7/taking-solr-to-production.html#avoid-swapping-nix-operating-systems
-- the note there about "lowering swappiness" being an acceptable
alternative contradicts my experience, but I suppose ymmv?
3. if you're faceting on fields -- especially high-cardinality fields (many
values) -- make sure that you have `docValues=true, uninvertible=false`
configured (to ensure that you're not building large on-heap data
structures when there's an alternative that doesn't require it).

These are all recommendations that are explained in more detail by others
elsewhere; I think they should all apply to 7.3.1; fwiw, I would recommend
upgrading if you have the (human) bandwidth to do so. Good luck!

Michael

On Tue, Jan 12, 2021 at 8:39 AM Jeremy Smith  wrote:

> Thanks Michael,
>  SOLR-13336 seems intriguing.  I'm not a solr expert, but I believe
> these are the relevant sections from our schema definition:
>
> <fieldType name="..." class="solr.TextField"
>  positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="..." class="solr.TextField"
>  positionIncrementGap="100" multiValued="false">
>   <analyzer type="index">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
>     <filter class="..."/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="..."/>
>     <filter class="solr.StopFilterFactory" words="stopwords.txt" />
>     <filter class="solr.SynonymGraphFilterFactory"
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="..."/>
>   </analyzer>
> </fieldType>
>
> Our other fieldTypes don't have any analyzers attached to them.
>
>
> If SOLR-13336 is the cause of the issue is the best remedy to upgrade to
> solr 8?  It doesn't look like the fix was back patched to 7.x.
>
> Our schema has some issues arising from not fully understanding Solr and
> just copying existing structures from the defaults.  In this case,
> stopwords.txt is completely empty and synonyms.txt is just the default
> synonyms.txt, which seems not useful at all for us.  Could I just take out
> the StopFilterFactory and SynonymGraphFilterFactory from the query section
> (and maybe the StopFilterFactory from the index section as well)?
>
> Thanks again,
> Jeremy
>
> 
> From: Michael Gibney 
> Sent: Monday, January 11, 2021 8:30 PM
> To: solr-user@lucene.apache.org 
> Subject: Re: Solr using all available CPU and becoming unresponsive
>
> Hi Jeremy,
> Can you share your analysis chain configs? (SOLR-13336 can manifest in a
> similar way, and would affect 7.3.1 with a susceptible config, given the
> right (wrong?) input ...)
> Michael
>
> On Mon, Jan 11, 2021 at 5:27 PM Jeremy Smith  wrote:
>
> > Hello all,
> >  We have been struggling with an issue where solr will intermittently
> > use all available CPU and become unresponsive.  It will remain in this
> > state until we restart.  Solr will remain stable for some time, usually a
> > few hours to a few days, before this happens again.  We've tried
> adjusting
> > the caches and adding memory to both the VM and JVM, but we haven't been
> > able to solve the issue yet.
> >
> > Here is some info about our server:
> > Solr:
> >   Solr 7.3.1, running on Java 1.8
> >   Running in cloud mode, but there's only one core
> >
> > Host:
> >   CentOS7
> >   8 CPU, 56GB RAM
> >   The only other processes running on this VM are two zookeepers, one for
> > this Solr instance, one for another Solr instance
> >
> > Solr Config:
> >  - One Core
> >  - 36 Million documents (Max Doc), 28 million (Num D

Re: Handling acronyms

2021-01-15 Thread Michael Gibney
The equivalent terms on the right-hand side of the `=>` operator in the
example you sent should be separated by a comma. You mention you already
tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
and that that yielded unexpected results as well. I would recommend
pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
applying the synonym filter _after_ case normalization in the analysis
chain (there are other ways you could do, but the key point being that you
need to pay attention to case and how it interacts with the order in which
filters are applied).
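
To make the comma point concrete, here's a sketch of the two synonyms.txt
rule forms, using one of the entries from your list (lower-cased per the
above):

  # equivalent terms: with expand="true", each side expands to all alternatives
  srn, stroke research network

  # explicit mapping: the left-hand side is replaced by the right-hand side
  srn => srn, stroke research network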

Re: Charlie's recommendation to apply these at index-time, a word of
caution (and it's possible that this is in fact the underlying cause of
some of the unexpected behavior you're observine?): be careful if you're
using term _expansion_ at index-time (i.e., mapping single terms to
multiple terms, which I note appears to be what you're trying to do in the
example lines you provided). Multi-term index-time synonyms can lead to
unexpected results for positional queries (either explicit phrase queries,
or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
least two good overviews of this topic, one by Mike McCandless focusing on
Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
relevant.

One way to work around this is to "collapse" (rather than expand) synonyms,
at both index and query time. Another option would be to apply synonym
expansion only at query-time. It's also worth noting that increasing phrase
slop (`ps` param, etc.) can cause the issues with index-time synonym
expansion to "fly under the radar" a little, wrt the most blatant "false
negative" manifestations of index-time synonym issues for phrase queries.

[1]
https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
[2]
https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
[3] https://issues.apache.org/jira/browse/LUCENE-4312

On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
ch...@opensourceconnections.com> wrote:

> I'm wondering if you should be using these acronyms at index time, not
> search time. It will make your index bigger and you'll have to re-index
> to add new synonyms (as they may apply to old documents) but this could
> be an occasional task, and in the meantime you could use query-time
> synonyms for the new ones.
>
> Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
>
> Cheers
>
> Charlie
>
> On 15/01/2021 09:48, Shaun Campbell wrote:
> > I have a medical journals search application and I've a list of some
> 9,000
> > acronyms like this:
> >
> > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > SRN=>SRN Stroke Research Network
> > IGBP=>IGBP isolated gastric bypass
> > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> Obstructive
> > sleep apnoea–hypopnoea
> > SRM=>SRM standardised response mean
> > SRT=>SRT substrate reduction therapy
> > SRS=>SRS Sexual Rating Scale
> > SRU=>SRU stroke rehabilitation unit
> > T2w=>T2w T2-weighted
> > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > MSOA=>MSOA middle-layer super output area
> > SSA=>SSA site-specific assessment
> > SSC=>SSC Study Steering Committee
> > SSB=>SSB short-stretch bandage
> > SSE=>SSE sum squared error
> > SSD=>SSD social services department
> > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> >
> > I tried to put them in a synonyms file, either just with a comma between,
> > or with an arrow in between and the acronym repeated on the right like
> > above, and no matter what I try I'm getting really strange search
> results.
> > It's like words in one acronym are matching with the same word in another
> > acronym and then searching with that acronym which is completely
> unrelated.
> >
> > I don't think Solr can handle this, but does anyone know of any crafty
> > tricks in Solr to handle this situation where I can either search by the
> > acronym or by the text?
> >
> > Shaun
> >
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> 
> Founding member of The Search Network 
> and co-author of Searching the Enterprise
> 
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
Shaun,

I'm not 100% sure, but don't give up on this just yet:

> For example if I enter diabetes it finds the acronym DM for diabetes
mellitus

I think the behavior you're observing may simply be a side-effect of a
misconfiguration of synonyms.txt. In the example you posted, the equivalent
terms are separated by commas (as they should be), which would lead to
treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
mellitus", which as you point out is clearly not what you want. Do you see
similar results for `DM, diabetes mellitus` (which should be parsed as
meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?

(see the note about ensuring proper comma-separation in my earlier response)

Michael


On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell 
wrote:

> Hi Michael
>
> Thanks for that I'll have a study later.  It's just reminded me of the
> expand option which I meant to have a look at.
>
> Thanks
> Shaun
>
> On Fri, 15 Jan 2021 at 14:33, Michael Gibney 
> wrote:
>
> > The equivalent terms on the right-hand side of the `=>` operator in the
> > example you sent should be separated by a comma. You mention you already
> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> > and that that yielded unexpected results as well. I would recommend
> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case),
> and
> > applying the synonym filter _after_ case normalization in the analysis
> > chain (there are other ways you could do, but the key point being that
> you
> > need to pay attention to case and how it interacts with the order in
> which
> > filters are applied).
> >
> > Re: Charlie's recommendation to apply these at index-time, a word of
> > caution (and it's possible that this is in fact the underlying cause of
> > some of the unexpected behavior you're observing?): be careful if you're
> > using term _expansion_ at index-time (i.e., mapping single terms to
> > multiple terms, which I note appears to be what you're trying to do in
> the
> > example lines you provided). Multi-term index-time synonyms can lead to
> > unexpected results for positional queries (either explicit phrase
> queries,
> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of
> at
> > least two good overviews of this topic, one by Mike McCandless focusing
> on
> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
> > issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
> > relevant.
> >
> > One way to work around this is to "collapse" (rather than expand)
> synonyms,
> > at both index and query time. Another option would be to apply synonym
> > expansion only at query-time. It's also worth noting that increasing
> phrase
> > slop (`ps` param, etc.) can cause the issues with index-time synonym
> > expansion to "fly under the radar" a little, wrt the most blatant "false
> > negative" manifestations of index-time synonym issues for phrase queries.
> >
> > [1]
> >
> >
> https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> > [2]
> >
> >
> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
> >
> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
> > ch...@opensourceconnections.com> wrote:
> >
> > > I'm wondering if you should be using these acronyms at index time, not
> > > search time. It will make your index bigger and you'll have to re-index
> > > to add new synonyms (as they may apply to old documents) but this could
> > > be an occasional task, and in the meantime you could use query-time
> > > synonyms for the new ones.
> > >
> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> > me.
> > >
> > > Cheers
> > >
> > > Charlie
> > >
> > > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > > I have a medical journals search application and I've a list of some
> > > 9,000
> > > > acronyms like this:
> > > >
> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening
> > Questionnaire
> > > > SRN=>SRN Stroke Research Network
> > > > IGBP=>IGBP isolated gastric bypass
> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for
> > > Obstructive
> > > 

Re: Solrcloud - Reads on specific nodes

2021-01-15 Thread Michael Gibney
I know you're asking about nodes, not replicas; but depending on what
you're trying to achieve you might be as well off routing requests based on
replica. Have you considered the various options available via the
`shards.preference` param [1]? For instance, you could set up your "write"
replicas as `NRT`, and your "read" replicas as `PULL`, then use the
`replica.type` property of the `shards.preference` param to route "select"
requests to the `PULL` replicas.
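
For example (collection name is illustrative):

  curl 'http://localhost:8983/solr/mycoll/select?q=*:*&shards.preference=replica.type:PULL'

Multiple preferences can be combined, comma-separated, in order of
precedence, e.g. 'shards.preference=replica.type:PULL,replica.location:local'.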

It might also be worth looking at the options for stable routing provided
by the relatively new `replica.base` property (of `shards.preference`
param). If you have varying workloads with distinct cache usage patterns,
for instance, this could be useful to you.

To tie this back to nodes (your original question, if a replica-focused
solution is not sufficient): you could still use replica types and the
`shards.preference` param to control request routing, and implicitly route
by node by paying extra attention to careful replica placement on
particular nodes. As it happens, I'm actually doing a very simple variant
of this -- but not in a general-purpose enough way to feel I'm in a
position to make any specific recommendations.

[1]
https://lucene.apache.org/solr/guide/8_7/distributed-requests.html#shards-preference-parameter

On Fri, Jan 15, 2021 at 9:56 AM Doss  wrote:

> Dear All,
>
> 1. Suppose we have 10 node SOLR Cloud setup, is it possible to dedicate 4
> nodes for writes and 6 nodes for selects?
>
> 2. We have a SOLR cloud setup for our customer facing applications, and we
> would like to have two more SOLR nodes for some backend jobs. Is it a good
> idea to set these nodes up as slave nodes, making one node in the cloud the
> master?
>
> Thanks!
> Mohandoss.
>


Re: Handling acronyms

2021-01-15 Thread Michael Gibney
EDIT: "the equivalent terms are separated by commas (as they should be)" =>
"the equivalent terms are _not_ separated by commas (as they should be)"

On Fri, Jan 15, 2021 at 10:09 AM Michael Gibney 
wrote:

> Shaun,
>
> I'm not 100% sure, but don't give up on this just yet:
>
> > For example if I enter diabetes it finds the acronym DM for diabetes
> mellitus
>
> I think the behavior you're observing may simply be a side-effect of a
> misconfiguration of synonyms.txt. In the example you posted, the equivalent
> terms are separated by commas (as they should be), which would lead to
> treating line `DM diabetes mellitus` as effectively "DM == diabetes ==
> mellitus", which as you point out is clearly not what you want. Do you see
> similar results for `DM, diabetes mellitus` (which should be parsed as
> meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?
>
> (see the note about ensuring proper comma-separation in my earlier
> response)
>
> Michael
>
>
> On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell 
> wrote:
>
>> Hi Michael
>>
>> Thanks for that I'll have a study later.  It's just reminded me of the
>> expand option which I meant to have a look at.
>>
>> Thanks
>> Shaun
>>
>> On Fri, 15 Jan 2021 at 14:33, Michael Gibney 
>> wrote:
>>
>> > The equivalent terms on the right-hand side of the `=>` operator in the
>> > example you sent should be separated by a comma. You mention you already
>> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research
>> Network`)
>> > and that that yielded unexpected results as well. I would recommend
>> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case),
>> and
>> > applying the synonym filter _after_ case normalization in the analysis
>> > chain (there are other ways you could do, but the key point being that
>> you
>> > need to pay attention to case and how it interacts with the order in
>> which
>> > filters are applied).
>> >
>> > Re: Charlie's recommendation to apply these at index-time, a word of
>> > caution (and it's possible that this is in fact the underlying cause of
>> > some of the unexpected behavior you're observing?): be careful if you're
>> > using term _expansion_ at index-time (i.e., mapping single terms to
>> > multiple terms, which I note appears to be what you're trying to do in
>> the
>> > example lines you provided). Multi-term index-time synonyms can lead to
>> > unexpected results for positional queries (either explicit phrase
>> queries,
>> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of
>> at
>> > least two good overviews of this topic, one by Mike McCandless focusing
>> on
>> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The
>> underlying
>> > issue is related LUCENE-4312 [3], so both posts (ES- & Solr-related) are
>> > relevant.
>> >
>> > One way to work around this is to "collapse" (rather than expand)
>> synonyms,
>> > at both index and query time. Another option would be to apply synonym
>> > expansion only at query-time. It's also worth noting that increasing
>> phrase
>> > slop (`ps` param, etc.) can cause the issues with index-time synonym
>> > expansion to "fly under the radar" a little, wrt the most blatant "false
>> > negative" manifestations of index-time synonym issues for phrase
>> queries.
>> >
>> > [1]
>> >
>> >
>> https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
>> > [2]
>> >
>> >
>> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
>> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
>> >
>> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
>> > ch...@opensourceconnections.com> wrote:
>> >
>> > > I'm wondering if you should be using these acronyms at index time, not
>> > > search time. It will make your index bigger and you'll have to
>> re-index
>> > > to add new synonyms (as they may apply to old documents) but this
>> could
>> > > be an occasional task, and in the meantime you could use query-time
>> > > synonyms for the new ones.
>> > >
>> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy
>> to
>> 

Re: DocValued SortableText Field is slower than Non DocValued String Field for Facet

2021-01-28 Thread Michael Gibney
I'm not sure about _performance_, but I'm pretty sure you don't want to be
faceting on docValued SortableTextField (and faceting on non-docValued
SortableTextField, though I think technically possible, works against
uninverted _indexed_values, so ends up doing something entirely different):
https://issues.apache.org/jira/browse/SOLR-13056.

TL;DR: with SortableTextField bulk faceting happens over docValues (which
for SortableTextField contains the full sort value string) and refinement
happens against indexed values (which are tokenized). So it can behave very
strangely, at least in multi-shard collections. See also:
https://issues.apache.org/jira/browse/SOLR-8362

Quick clarification, you say "non Docvalued String Field" ... I'm assuming
you're talking about "StrField", not "TextField".

wrt performance difference, I'm willing to bet (though not certain) that
you're really simply noticing a discrepancy between docValues and
non-docValues faceting -- accordingly, for your use case I'd expect
faceting against StrField _with_ docValues to have similar performance to
SortableTextField with docValues. Further possibly-relevant discussion can
be found in the following thread:

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/202006.mbox/%3CCAF%3DheHFd6GBABzKzDQPTfpYUUQJXxYwue4OC86QOm_AR0X3_ZQ%40mail.gmail.com%3E

On Thu, Jan 28, 2021 at 7:25 PM Jae Joo  wrote:

> I am wondering that the performance of facet of DocValued SortableText
> Field is slower than non Docvalued String Field.
>
> Does anyone know why?
>
>
> Thanks,
>
> Jae
>


Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
> Performance and resource is still affected by 30M unique values of T
right?
Yes. The main performance issue would be the per-request allocation of a
30M-element `long[]` for "dv" or "uif" methods (which are by far the most
common methods in practice). With low enough request volume and large
enough heap you might not actually perceive a difference in performance;
but if you encounter problems for the use case you describe, this array
allocation would likely be the cause. (also note that the relevant field
cardinality is the _per-shard_ cardinality, so in a multi-shard collection
the size of the allocated arrays might be somewhat less than the overall
field cardinality)

I'm reasonably sure that "dvhash" is _not_ auto-picked by "smart" at the
moment, but rather must be specified explicitly:
https://github.com/apache/lucene-solr/blob/6ff4a9b395a68d9b0d9e259537e3f5daf0278d51/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L124-L128

The code snippet above indicates some other restrictions that you're
probably already aware of (doesn't work with prefixes or mincount==0, or
for multi-valued or numeric types); otherwise though (for non-numeric
single-valued field) I think the situation you describe (high-cardinality
field, known low-cardinality for the particular domain) sounds like a
perfect use-case for dvhash.
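
Adapting your example, explicitly requesting it would look something like
this (note that each facet needs a name; "t_facet" is illustrative, and the
JSON request API uses "query" where your example had "q"):

  {
    "query": "some query",
    "facet": {
      "t_facet": {
        "type": "terms",
        "field": "T",
        "limit": 15,
        "method": "dvhash"
      }
    }
  }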

Michael

On Fri, Feb 5, 2021 at 11:56 AM ufuk yılmaz 
wrote:

> Hello,
>
> I’m using Solr 8.4. Very excited about performance improvements in 8.8:
> http://joelsolr.blogspot.com/2021/01/optimizations-coming-to-solr.html
>
> As I understand it, the main determinant of performance and RAM usage of a
> terms facet is the cardinality of the field in the whole collection, not the
> cardinality of the field in the query result.
>
> I have a collection with 100M docs, T field has 30M unique values in
> entire collection. But my query result returns only docs with 2 different T
> values,
>
> {
> “q”: “some query”, //whose result has only 2 different T values
> “facet”: {
> “type”: “terms”,
> “field”: “T”,
> “limit”: 15
> }
>
> Performance and resource is still affected by 30M unique values of T right?
>
> If this is correct, can/how “method”: “dvhash” help in this case?
> If yes, does the default method “smart” take this into account and use the
> dvhash, so I shouldn’t to set it explicitly?
>
> Nice weekends
> ~ufuk
>


Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
Correction!: wrt "dvhash" and numeric types, it looks like I had it exactly
backwards! single-valued numeric types _do_ use (even default to) "dvhash"
... sorry about that! I stand by the rest of the previous message though,
which applies at a minimum to string-like fields.

On Fri, Feb 5, 2021 at 12:49 PM Michael Gibney 
wrote:

> > Performance and resource is still affected by 30M unique values of T
> right?
> Yes. The main performance issue would be the per-request allocation of a
> 30M-element `long[]` for "dv" or "uif" methods (which are by far the most
> common methods in practice). With low enough request volume and large
> enough heap you might not actually perceive a difference in performance;
> but if you encounter problems for the use case you describe, this array
> allocation would likely be the cause. (also note that the relevant field
> cardinality is the _per-shard_ cardinality, so in a multi-shard collection
> the size of the allocated arrays might be somewhat less than the overall
> field cardinality)
>
> I'm reasonably sure that "dvhash" is _not_ auto-picked by "smart" at the
> moment, but rather must be specified explicitly:
>
> https://github.com/apache/lucene-solr/blob/6ff4a9b395a68d9b0d9e259537e3f5daf0278d51/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L124-L128
>
> The code snippet above indicates some other restrictions that you're
> probably already aware of (doesn't work with prefixes or mincount==0, or
> for multi-valued or numeric types); otherwise though (for non-numeric
> single-valued field) I think the situation you describe (high-cardinality
> field, known low-cardinality for the particular domain) sounds like a
> perfect use-case for dvhash.
>
> Michael
>
> On Fri, Feb 5, 2021 at 11:56 AM ufuk yılmaz 
> wrote:
>
>> Hello,
>>
>> I’m using Solr 8.4. Very excited about performance improvements in 8.8:
>> http://joelsolr.blogspot.com/2021/01/optimizations-coming-to-solr.html
>>
>> As I understand it, the main determinant of performance and RAM usage of a
>> terms facet is the cardinality of the field in the whole collection, not the
>> cardinality of the field in the query result.
>>
>> I have a collection with 100M docs, T field has 30M unique values in
>> entire collection. But my query result returns only docs with 2 different T
>> values,
>>
>> {
>> “q”: “some query”, //whose result has only 2 different T values
>> “facet”: {
>> “type”: “terms”,
>> “field”: “T”,
>> “limit”: 15
>> }
>>
>> Performance and resource is still affected by 30M unique values of T
>> right?
>>
>> If this is correct, can/how “method”: “dvhash” help in this case?
>> If yes, does the default method “smart” take this into account and use
>> the dvhash, so I shouldn’t to set it explicitly?
>>
>> Nice weekends
>> ~ufuk
>>
>


Re: Clarification on term facet method dvhash

2021-02-05 Thread Michael Gibney
Happy to help! If I'm correctly reading the block of code linked to above,
"dvhash" is silently ignored for multi-valued fields. So probably not much
performance difference there ;-)

On Fri, Feb 5, 2021 at 2:12 PM ufuk yılmaz 
wrote:

> This is a huge help Mr. Gibney thank you!
>
> One thing I can add is I tried dvhash with a string multi-valued field, it
> worked and didn’t throw any error but I don’t know if it got silently
> ignored or just worked.
>
> Sent from Mail for Windows 10
>
> From: Michael Gibney
> Sent: 05 February 2021 20:52
> To: solr-user@lucene.apache.org
> Subject: Re: Clarification on term facet method dvhash
>
> Correction!: wrt "dvhash" and numeric types, it looks like I had it exactly
> backwards! single-valued numeric types _do_ use (even default to) "dvhash"
> ... sorry about that! I stand by the rest of the previous message though,
> which applies at a minimum to string-like fields.
>
> On Fri, Feb 5, 2021 at 12:49 PM Michael Gibney 
> wrote:
>
> > > Performance and resource is still affected by 30M unique values of T
> > right?
> > Yes. The main performance issue would be the per-request allocation of a
> > 30M-element `long[]` for "dv" or "uif" methods (which are by far the most
> > common methods in practice). With low enough request volume and large
> > enough heap you might not actually perceive a difference in performance;
> > but if you encounter problems for the use case you describe, this array
> > allocation would likely be the cause. (also note that the relevant field
> > cardinality is the _per-shard_ cardinality, so in a multi-shard
> collection
> > the size of the allocated arrays might be somewhat less than the overall
> > field cardinality)
> >
> > I'm reasonably sure that "dvhash" is _not_ auto-picked by "smart" at the
> > moment, but rather must be specified explicitly:
> >
> >
> https://github.com/apache/lucene-solr/blob/6ff4a9b395a68d9b0d9e259537e3f5daf0278d51/solr/core/src/java/org/apache/solr/search/facet/FacetField.java#L124-L128
> >
> > The code snippet above indicates some other restrictions that you're
> > probably already aware of (doesn't work with prefixes or mincount==0, or
> > for multi-valued or numeric types); otherwise though (for non-numeric
> > single-valued field) I think the situation you describe (high-cardinality
> > field, known low-cardinality for the particular domain) sounds like a
> > perfect use-case for dvhash.
> >
> > Michael
> >
> > On Fri, Feb 5, 2021 at 11:56 AM ufuk yılmaz  >
> > wrote:
> >
> >> Hello,
> >>
> >> I’m using Solr 8.4. Very excited about performance improvements in 8.8:
> >> http://joelsolr.blogspot.com/2021/01/optimizations-coming-to-solr.html
> >>
> >> As I understand it, the main determinant of performance and RAM usage of a
> >> terms facet is the cardinality of the field in the whole collection, not the
> >> cardinality of the field in the query result.
> >>
> >> I have a collection with 100M docs, T field has 30M unique values in
> >> entire collection. But my query result returns only docs with 2
> different T
> >> values,
> >>
> >> {
> >> “q”: “some query”, //whose result has only 2 different T values
> >> “facet”: {
> >> “type”: “terms”,
> >> “field”: “T”,
> >> “limit”: 15
> >> }
> >>
> >> Performance and resource is still affected by 30M unique values of T
> >> right?
> >>
> >> If this is correct, can/how “method”: “dvhash” help in this case?
> >> If yes, does the default method “smart” take this into account and use
> >> the dvhash, so I shouldn’t to set it explicitly?
> >>
> >> Nice weekends
> >> ~ufuk
> >>
> >
>
>


Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
`resultId` sounds like it might be a relatively high-cardinality field
(lots of unique values)? What's your number of shards, and replicas per
shard? SOLR-15008 (note: not a bug) describes a situation that may be
fundamentally similar to yours (though it's impossible to say for sure
without more information):
https://issues.apache.org/jira/browse/SOLR-15008?focusedCommentId=17236213#comment-17236213

In particular, the explanation and troubleshooting advice on the linked
comment might be relevant?

"dvhash" is _not_ mentioned on that SOLR-15008, but if the `processId` main
query significantly reduces the domain -- or more specifically, if
`resultId` is high-cardinality overall, but the cardinality of `resultId`
values _associated with a particular query_ is low -- you might consider
trying `"method"="dvhash"` (which should bypass OrdinalMap creation and
array allocation, if either/both of those contribute to the latency you're
finding).
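
Concretely, that would just mean adding "method" to your existing request,
e.g.:

  curl 'http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0' -d '
  json.facet={
    categories:{
      "type": "terms",
      "field": "resultId",
      "limit": 1,
      "method": "dvhash"
    }
  }'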

Michael

On Fri, Feb 5, 2021 at 4:42 PM mmb1234  wrote:

> Hello,
>
> I am seeing very slow response from json faceting against a single core
> (though the core is a shard leader in a collection).
>
> Fields processId and resultId are non-multivalued, indexed and docvalues
> string (not text).
>
> Soft Commit = 5sec (opensearcher=true) and Hard Commit = 10sec because new
> docs are constantly being indexed with 95% new and 5% overwritten
> (overwrite=true; no atomic update). Caches are not considered useful due to
> commit frequency.
>
> Solr is v8.7.0 on openjdk11.
>
> Is there any way to improve json facet QTime?
>
> ## query only
> curl
> '
> http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0
> '
> -d '
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":552,
> "params":{
>   "q":"processId:-xxx-xxx-xxx-x",
>   "cache":"false",
>   "rows":"0"}},
>   "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[]
>   }}
>
> ## json facet takes 46secs
> curl
> '
> http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0
> '
> -d '
> json.facet={
> categories:{
>   "type": "terms",
>   "field" : "resultId",
>   "limit" : 1
> }
> }'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":46972,
> "params":{
>   "q":"processId:-xxx-xxx-xxx-x",
>   "json.facet":"{categories:{  \"type\": \"terms\",
> \"field\" : \"resultId\",  \"limit\" : 1}}",
>   "rows":"0"}},
>   "response":{"numFound":231311,"start":0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
> "count":231311,
> "categories":{
>   "buckets":[{
>   "val":"x",
>   "count":943}]}}}
>
>
> ## visualvm CPU sampling almost all time spent in lucene:
>
> org.apache.lucene.util.PriorityQueue.downHeap() 23,009 ms
>
> org.apache.lucene.codecs.lucene80.Lucene80DocValuesProducer$TermsDict.next()
> 13,268 ms
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
Apologies, I missed deducing from the request url that you're already
talking strictly about single-shard requests (so everything I was
suggesting about shards.preference etc. is not applicable). "dvhash" is
still worth a try though, esp. with `numFound` being 943 (out of 185
million!). Does this happen on a warm searcher (are subsequent requests
with no intervening updates _ever_ fast?)?

On Fri, Feb 5, 2021 at 6:13 PM mmb1234  wrote:

> Ok. I'll try that. Meanwhile, a query on resultId gets a subsecond response,
> but the immediate next query, for faceting, takes 40+ secs. The core has
> 185 million docs and a 63GB index size.
>
> curl
> '
> http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x&rows=0
> '
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":558,
> "params":{
>   "q":"resultId:x",
>   "cache":"false",
>   "rows":"0"}},
>   "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[]
>   }}
>
>
> curl
> '
> http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=resultId:x&rows=0
> '
> -d '
> json.facet={
> categories:{
>   "type": "terms",
>   "field" : "resultId",
>   "limit" : 1
> }
> }'
> {
>   "responseHeader":{
> "zkConnected":true,
> "status":0,
> "QTime":43834,
> "params":{
>   "q":"resultId:x",
>   "json.facet":"{\ncategories:{\n  \"type\": \"terms\",\n
> \"field\" : \"resultId\",\n  \"limit\" : 1\n}\n}",
>   "cache":"false",
>   "rows":"0"}},
>   "response":{"numFound":943,"start":0,"numFoundExact":true,"docs":[]
>   },
>   "facets":{
> "count":943,
> "categories":{
>   "buckets":[{
>   "val":"x",
>   "count":943}]}}}
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>


Re: Json Faceting Performance Issues on solr v8.7.0

2021-02-05 Thread Michael Gibney
Ah! That's significant. The latency is likely due to building the
OrdinalMap (which maps segment ords to global ords) ... "dvhash" (assuming
the relevant fields are not multivalued) will very likely work; "dvhash"
doesn't map to global ords, so doesn't need to build the OrdinalMap (which
gets built the first time it's needed per-field per-searcher).

If "dvhash" doesn't work for some reason (multivalued fields, needs to work
over broader domains, etc.?) you could probably achieve a decent result by
configuring a static warming query (newSearcher) to issue a request that
facets on the relevant fields. That will delay the opening of each new
searcher, but will ensure that user requests don't block.
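
A sketch of that warming approach, if you end up needing it (this goes in
solrconfig.xml; the facet params mirror your earlier request):

  <listener event="newSearcher" class="solr.QuerySenderListener">
    <arr name="queries">
      <lst>
        <str name="q">*:*</str>
        <str name="rows">0</str>
        <str name="json.facet">{categories:{type:terms,field:resultId,limit:1}}</str>
      </lst>
    </arr>
  </listener>

This pre-builds the per-searcher OrdinalMap for resultId each time a new
searcher opens, at the cost of slower searcher turnaround.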

SOLR-15008 _was_ actually pretty similar, with the added wrinkle of
involving distributed (multi-shard) requests (and iirc "dvhash" wouldn't
have worked in that case?)

On Fri, Feb 5, 2021 at 8:00 PM mmb1234  wrote:

> > Does this happen on a warm searcher (are subsequent requests with no
> intervening updates _ever_ fast?)?
>
> Subsequent response times are very fast if the searcher remains open. As a
> control test, I faceted on the same field that I used in the q param.
> test, I faceted on the same field that I used in the q param.
>
> 1. Start solr
>
> 2. Execute q=resultId:x&rows=0
> =>  500ms
>
> 3. Execute q=resultId:x&rows=0&json.facet-on-resultId
> => 40,000ms
>
> 4. Execute q=resultId:x&rows=0&json.facet-on-resultId
> =>  150ms
>
> 5. Execute q=processId:x&rows=0&json.facet-on-processId
> =>   2,500ms
>
> 6. Execute q=processId:x&rows=0&json.facet-on-processId
> => 200ms
>
>
> curl
> '
> http://localhost:8983/solr/TestCollection_shard1_replica_t3/query?q=processId:-xxx-xxx-xxx-x&rows=0
> '
> -d '
> json.facet={
> categories:{
>   "type": "terms",
>   "field" : "processId",
>   "limit" : 1
> }
> }
>
>
>
> --
> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>