Neil: Yeah, the attachment-stripping catches everyone the first time; we’re so used to just adding anything we want to an e-mail…
I don’t know enough about the query parsing to answer off the top of my head. I do know one thing that’s changed is that “Split on Whitespace” now defaults to false rather than true, so it’d be interesting to add &sow=true to the query. Beyond that, take a look at what adding &debug=query to the URL returns. My guess is that it’ll be identical but it’s worth a look.

Sorry I can’t be more help here
Erick

> On Mar 21, 2019, at 1:11 AM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> wrote:
>
> Hello Erick,
>
> This is the first time I've had reason to use the mailing list, so I wasn't aware of the behaviour around attachments. See below, links to the images that I originally sent as attachments, both are screenshots from within Eclipse MAT looking at a SOLR heap dump.
>
> LargeQueryStructure.png - https://drive.google.com/open?id=1SkRYav2iV6Z1znmzr4KKJzMcXzNF0_Wg
> LargeNumberClauses.png - https://drive.google.com/open?id=1CaySU2HzyvHsdbIW_n0190ofjPS3hAeN
>
> The LargeQueryStructure image shows a single thread with retained set of 4.8GB, with the biggest items being a BooleanWeight object of just over 1.8GB and a BooleanQuery object of just under 1.8GB
>
> The LargeNumberClauses image shows a drilldown into the BooleanQuery object, where a subquery is taking around 0.9GB and contains a BooleanClause[524288] array of clauses (not shown: each of these 524288 is actually a subquery with multiple clauses). The array is taking 0.6GB, and there is a second instance of the same array in another subquery (also not shown).
>
>
> Since the last email we have had some success with a reconfiguration of the fieldType that I referenced in my original email below.
> Where it was originally:
>
> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>   </analyzer>
> </fieldType>
>
> We have now reconfigured to:
>
> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="8" consumeAllTokens="false" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="true"/>
>   </analyzer>
> </fieldType>
>
> After the reconfiguration, the huge memory effect of the queries in Solr 7.1 is gone. We could kill test instances of Solr with a single query in the original configuration. After reconfiguration we can run multiple similar queries in parallel, and the Solr process responds in 50-150ms with only approx. 100MB added to the heap.
>
> This may well be sufficient for our purposes, as I don't think end users will notice the difference in practice & queries that were previously failing now return normally.
>
> However I am still curious as to how this performs so differently in Solr 4.6 - the performance in 4.6 without reconfiguration is very similar to Solr 7.1 after the reconfiguration.
> It is almost as if something within Solr 4.6 is causing it to behave as though the number of tokens is limited (although I can see in the admin pages for Solr 4.6 that the query and index analyser setup both have the original config with the maxShingleSize=30 setting). Do you have any thoughts about this?
>
>
> Many Thanks,
> Neil
>
> On 20/03/2019, 16:13, "Erick Erickson" <erickerick...@gmail.com> wrote:
>
>     The Apache mail server aggressively strips attachments, so yours didn’t come through. People often provide links to images stored somewhere else....
>
>     As to why this is behaving this way, I’m pretty clueless. A _complete_ shot in the dark is the query parsing changed its default for split on whitespace from true to false, perhaps try specifying "&sow=true". Here’s some background:
>     https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
>
>     I have no actual, you know, _knowledge_ that it’s related but it’d be super-easy to try and might give a clue.
>
>     Best,
>     Erick
>
>> On Mar 20, 2019, at 2:00 AM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> wrote:
>>
>> Hello All,
>>
>> We have a recently upgraded system that went from Solr 4.6 to Solr 7.1 (used as part of an ecommerce application). In the upgraded version we are seeing frequent issues with very high Solr memory usage for certain types of query, but the older 4.6 version does not produce the same response.
>>
>> Having taken a heap dump and investigated, we can see instances of individual Solr threads where the retained set is 4GB to 5GB in size. Drilling into this we can see a particular subquery with over 500,000 clauses. Screenshots below are from Eclipse MAT viewing a heap dump from the SOLR process. In observations of the 4.6 version we can see memory increments of 100-200 MB for the same query, rather than 4-5 GB.
>>
>> In both systems the index has around 2 million documents, with average size around 8KB.
>>
>> The subquery with a very large set of clauses relates to a particular field setup to use ShingleFilter (with maxShingleSize=30, and outputUnigrams=true). Schema.xml definitions for this field are:
>>
>> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>     <filter class="solr.StandardFilterFactory" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>>   </analyzer>
>> </fieldType>
>>
>> <field name="productdetails_tokens_en" type="lowercase_tokens" indexed="true" stored="false" multiValued="true"/>
>>
>> <copyField source="supercategoryname_text_en" dest="productdetails_tokens_en" />
>> <copyField source="supercategorydescription_text_en" dest="productdetails_tokens_en" />
>> <copyField source="productNameAndDescription_text_en" dest="productdetails_tokens_en" />
>> <copyField source="code_string" dest="productdetails_tokens_en" />
>>
>> The issue happens when the user search contains large numbers of tokens. In the example screenshots above the user search text had 20 tokens. The Solr query for that thread was as below (formatting/indentation added by me, the original is one long string).
>> This specific query contains tabs, however the same behaviour happens when spaces are used as well:
>> (
>>   +(
>>     fulltext_en:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>     OR productdetails_tokens_en:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>     OR codePartial:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>   )
>> )
>> AND
>> (
>>   (
>>     (
>>       (productChannelVisibility_string_mv:ALL OR productChannelVisibility_string_mv:EBUSINESS OR productChannelVisibility_string_mv:INTERNET OR productChannelVisibility_string_mv:INTRANET)
>>       AND
>>       !productChannelVisibility_string_mv:NOTVISIBLE
>>     )
>>     AND
>>     (
>>       +(
>>         fulltext_en:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>         OR productdetails_tokens_en:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>         OR codePartial:(9611444500 9611444520 9611444530 9611444540 9611414550 9612194002 9612194002 9612194002 9612194003 9612194007 9611416470 9611416470 9611416470 9611416480 9611416480 9613484402 9613484402 9613484402 9613484402 9613484402)
>>       )
>>     )
>>   )
>> )
>>
>> In the heap
>> dump we can see the subqueries relating to fulltext_en/codePartial fields both have just 20 clauses. However the two subqueries relating to productdetails_tokens_en both have 524288 clauses & each of those clauses is a subquery with up to 20 clauses (each of which seems to be a different shingled combination of the original tokens). For example, selecting an arbitrary single entry from the 524288 clauses, we can see a subquery with the following clauses:
>>
>> Occur.MUST, productdetails_tokens_en: 9611444500
>> Occur.MUST, productdetails_tokens_en: 9611416470 9611416480
>> Occur.MUST, productdetails_tokens_en: 9611444520
>> Occur.MUST, productdetails_tokens_en: 9611444540
>> Occur.MUST, productdetails_tokens_en: 9612194007
>> Occur.MUST, productdetails_tokens_en: 9611444530
>> Occur.MUST, productdetails_tokens_en: 9612194002 9612194002
>> Occur.MUST, productdetails_tokens_en: 9612194002
>> Occur.MUST, productdetails_tokens_en: 9611416480
>> Occur.MUST, productdetails_tokens_en: 9611416470
>> Occur.MUST, productdetails_tokens_en: 9613484402
>> Occur.MUST, productdetails_tokens_en: 9612194003
>> Occur.MUST, productdetails_tokens_en: 9611414550
>> Occur.MUST, productdetails_tokens_en: 9613484402 9613484402 9613484402
>>
>>
>> So the question has two parts:
>> - Is the observed behaviour expected in Solr 7.1 given the setup/query described above? (It seems to me that the answer is probably yes, because this is the purpose of the ShingleFilter)
>> - Why is the same behaviour not in evidence in Solr 4.6? Are there major differences in the way that the query is constructed in the earlier version? If so, can we change Solr 7.1 config to behave more like Solr 4.6?
>>
>> Many Thanks,
>> Neil
>>
>>
>> Neil Hubert-Price
>> Senior Consultant, SAP CX Success and Services, Northern Europe
>>
>> neil.hubert-pr...@sap.com
>> M: +44 7788 368767
>>
>>
>> SAP (UK) Limited, Registered in England No. 2152073.
>> Registered Office: Clockhouse Place, Bedfont Road, Feltham, Middlesex, TW14 8HD
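
A note on the 524288 figure from the heap dump: it equals 2^19, which is the number of ways to cut a run of 20 tokens into contiguous shingles when maxShingleSize is at least 20 (each of the 19 gaps between tokens is either a cut or not). The Python sketch below is only an illustration of that arithmetic, not Solr code. count_segmentations is a hypothetical helper, and the link to the clause count rests on the assumption that the 7.1 query parser turns every such segmentation into its own clause, which nothing in this thread confirms.

```python
def count_segmentations(num_tokens: int, max_shingle_size: int) -> int:
    """Count the ways to split num_tokens consecutive tokens into
    contiguous shingles of 1..max_shingle_size tokens each."""
    # ways[i] = number of segmentations of the first i tokens
    ways = [0] * (num_tokens + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) segmentation
    for end in range(1, num_tokens + 1):
        # the last shingle covers `size` tokens ending at position `end`
        for size in range(1, min(max_shingle_size, end) + 1):
            ways[end] += ways[end - size]
    return ways[num_tokens]


# Original query analysis: 20 tokens, maxShingleSize=30
print(count_segmentations(20, 30))  # 524288, i.e. 2**19

# Reconfigured query analyzer: at most 8 tokens, maxShingleSize=8
print(count_segmentations(8, 8))    # 128, i.e. 2**7
```

Under the same assumption, the reconfigured query analyzer (LimitTokenCountFilterFactory with maxTokenCount="8", then maxShingleSize="8") would produce at most 128 combinations instead of 524288, which would fit the much smaller heap growth reported after the change.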