Re: Use of ShingleFilter causing very large BooleanQuery structures in Solr 7.1

Erick Erickson Thu, 21 Mar 2019 08:04:18 -0700

Neil:

Yeah, the attachment-stripping catches everyone the first time; we’re so used 
to just adding anything we want to an e-mail…


I don’t know enough about the query parsing to answer off the top of my head. I 
do know that the default for “Split on Whitespace” (sow) has changed from true 
to false, so it’d be interesting to add &sow=true to the query and see whether 
the old behaviour comes back.

Beyond that, take a look at what &debug=query added to the URL returns. My 
guess is that it’ll be identical but it’s worth a look.
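
If it helps, here is a minimal sketch of building such a request. The host, core name and sample query are placeholders made up for illustration; sow, debug and rows are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_debug_url(base_url, query, sow):
    """Return a /select URL with an explicit sow setting and debug=query."""
    params = {
        "q": query,
        "sow": "true" if sow else "false",  # force split-on-whitespace either way
        "debug": "query",                   # ask Solr to return the parsed query
        "rows": 0,                          # only the parse matters, not the hits
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical host and core name; substitute your own.
url = build_debug_url(
    "http://localhost:8983/solr/mycore",
    "productdetails_tokens_en:(a b c)",
    sow=True,
)
print(url)
```

Running it once with sow=True and once with sow=False, then comparing the parsed query in the two debug sections, should show whether the parameter changes how the field is parsed.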

Sorry I can’t be more help here.
Erick

> On Mar 21, 2019, at 1:11 AM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> 
> wrote:
> 
> Hello Erick,
> 
> This is the first time I've had reason to use the mailing list, so I wasn't 
> aware of the behaviour around attachments.  See below, links to the images 
> that I originally sent as attachments, both are screenshots from within 
> Eclipse MAT looking at a SOLR heap dump.
> 
> LargeQueryStructure.png - 
> https://drive.google.com/open?id=1SkRYav2iV6Z1znmzr4KKJzMcXzNF0_Wg 
> LargeNumberClauses.png - 
> https://drive.google.com/open?id=1CaySU2HzyvHsdbIW_n0190ofjPS3hAeN
> 
> The LargeQueryStructure image shows a single thread with a retained set of 
> 4.8GB, with the biggest items being a BooleanWeight object of just over 1.8GB 
> and a BooleanQuery object of just under 1.8GB.
> 
> The LargeNumberClauses image shows a drilldown into the BooleanQuery object, 
> where a subquery is taking around 0.9GB and contains a BooleanClause[524288] 
> array of clauses (not shown: each of these 524288 is actually a subquery with 
> multiple clauses).  The array is taking 0.6GB, and there is a second instance 
> of the same array in another subquery (also not shown).
> 
> 
> Since the last email we have had some success with a reconfiguration of the 
> fieldType that I referenced in my original email below.  Where it was 
> originally:
> 
> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>   </analyzer>
> </fieldType>
> 
> We have now reconfigured to:
> 
> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>     <filter class="solr.StandardFilterFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>     <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="8" consumeAllTokens="false" />
>     <filter class="solr.ShingleFilterFactory" maxShingleSize="8" outputUnigrams="true"/>
>   </analyzer>
> </fieldType>
> 
> After the reconfiguration, the huge memory effect of the queries in Solr 7.1 
> is gone.  We could kill test instances of Solr with a single query in the 
> original configuration. After reconfiguration we can run multiple similar 
> queries in parallel, and the Solr process responds in 50-150ms with only 
> approx. 100MB added to the heap.
> 
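To put rough numbers on that reconfiguration, here is a small sketch that mimics the two query-side chains in plain Python. It is only an approximation of what ShingleFilter and LimitTokenCountFilter emit (unigrams plus every run of adjacent tokens up to the maximum size, after truncating to the first N tokens), not the Lucene code itself, and the token names are made up.

```python
def shingles(tokens, max_shingle_size, max_token_count=None):
    """Approximate word shingles: every run of 1..max_shingle_size adjacent tokens."""
    if max_token_count is not None:
        tokens = tokens[:max_token_count]  # mimic truncating to the first N tokens
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_shingle_size + 1):
            if start + size > len(tokens):
                break  # a shingle cannot run past the end of the input
            out.append(" ".join(tokens[start:start + size]))
    return out


tokens = [f"t{i}" for i in range(20)]  # stand-in for a 20-token user search

original = shingles(tokens, max_shingle_size=30)
limited = shingles(tokens, max_shingle_size=8, max_token_count=8)

print(len(original), len(limited))  # 210 36
```

So for a 20-token search the original chain produces about 210 overlapping shingles, while the limited chain produces 36, before any query is built from them.
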
> This may well be sufficient for our purposes, as I don't think end users will 
> notice the difference in practice & queries that were previously failing now 
> return normally.
> 
> However I am still curious as to how this performs so differently in Solr 4.6 
> - the performance in 4.6 without reconfiguration is very similar to Solr 7.1 
> after the reconfiguration.  It is almost as if something within Solr 4.6 is 
> causing it to behave as though the number of tokens is limited (although I 
> can see in the admin pages for Solr 4.6 that the query and index analyser 
> setup both have original config with maxShingleSize=30 setting).  Do you have 
> any thoughts about this?
> 
> 
> Many Thanks,
> Neil
> 
> On 20/03/2019, 16:13, "Erick Erickson" <erickerick...@gmail.com> wrote:
> 
>    The Apache mail server aggressively strips attachments, so yours didn’t 
> come through. People often provide links to images stored somewhere else....
> 
>    As to why this is behaving this way, I’m pretty clueless. A _complete_ 
> shot in the dark is the query parsing changed its default for split on 
> whitespace from true to false, perhaps try specifying "&sow=true". Here’s 
> some background: 
> https://lucidworks.com/2017/04/18/multi-word-synonyms-solr-adds-query-time-support/
> 
>    I have no actual, you know, _knowledge_ that it’s related but it’d be 
> super-easy to try and might give a clue.
> 
>    Best,
>    Erick
> 
>> On Mar 20, 2019, at 2:00 AM, Hubert-Price, Neil <neil.hubert-pr...@sap.com> 
>> wrote:
>> 
>> Hello All,
>> 
>> We have a recently upgraded system that went from Solr 4.6 to Solr 7.1 (used 
>> as part of an ecommerce application).  In the upgraded version we are seeing 
>> frequent issues with very high Solr memory usage for certain types of query, 
>> but the older 4.6 version does not produce the same response.
>> 
>> Having taken a heap dump and investigated, we can see instances of 
>> individual Solr threads where the retained set is 4GB to 5GB in size.  
>> Drilling into this we can see a particular subquery with over 500,000 
>> clauses.  Screenshots below are from Eclipse MAT viewing a heap dump from 
>> the SOLR process. In the 4.6 version we observe memory increments of 
>> 100-200 MB for the same query, rather than 4-5 GB.
>> 
>> In both systems the index has around 2 million documents, with average size 
>> around 8KB.
>> 
>> The subquery with a very large set of clauses relates to a particular field 
>> set up to use ShingleFilter (with maxShingleSize=30 and 
>> outputUnigrams=true). The schema.xml definitions for this field are:
>> 
>> <fieldType name="lowercase_tokens" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer type="index">
>>     <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>     <filter class="solr.StandardFilterFactory" />
>>     <filter class="solr.LowerCaseFilterFactory" />
>>     <filter class="solr.ShingleFilterFactory" maxShingleSize="30" outputUnigrams="true"/>
>>   </analyzer>
>> </fieldType>
>> 
>> <field name="productdetails_tokens_en" type="lowercase_tokens" indexed="true" stored="false" multiValued="true"/>
>> 
>> <copyField source="supercategoryname_text_en" dest="productdetails_tokens_en" />
>> <copyField source="supercategorydescription_text_en" dest="productdetails_tokens_en" />
>> <copyField source="productNameAndDescription_text_en" dest="productdetails_tokens_en" />
>> <copyField source="code_string" dest="productdetails_tokens_en" />
>> 
>> The issue happens when the user search contains large numbers of tokens.  In 
>> the example screenshots above the user search text had 20 tokens. The Solr 
>> query for that thread was as below (formatting/indentation added by me, the 
>> original is one long string).  This specific query contains tabs, however 
>> the same behaviour happens when spaces are used as well:
>> (
>> +(
>>  fulltext_en:(9611444500            9611444520       9611444530       
>> 9611444540       9611414550 9612194002                9612194002       
>> 9612194002       9612194003       9612194007 9611416470             
>> 9611416470       9611416470                9611416480       9611416480 
>> 9613484402             9613484402       9613484402       9613484402       
>> 9613484402)
>>  OR productdetails_tokens_en:(9611444500       9611444520       9611444530   
>>     9611444540       9611414550 9612194002       9612194002       9612194002 
>>       9612194003       9612194007 9611416470             9611416470          
>>       9611416470       9611416480       9611416480 9613484402             
>> 9613484402       9613484402       9613484402                9613484402)
>>  OR codePartial:(9611444500     9611444520       9611444530       9611444540 
>>       9611414550 9612194002                9612194002       9612194002       
>> 9612194003       9612194007 9611416470             9611416470       
>> 9611416470                9611416480       9611416480 9613484402             
>> 9613484402       9613484402       9613484402       9613484402)
>> )
>> )
>> AND
>> (
>> (
>>  (
>>   (productChannelVisibility_string_mv:ALL OR 
>> productChannelVisibility_string_mv:EBUSINESS OR 
>> productChannelVisibility_string_mv:INTERNET OR 
>> productChannelVisibility_string_mv:INTRANET)
>>   AND
>>   !productChannelVisibility_string_mv:NOTVISIBLE
>>  )
>>  AND
>>  (
>>   +(
>>    fulltext_en:(9611444500          9611444520       9611444530       
>> 9611444540       9611414550 9612194002                9612194002       
>> 9612194002       9612194003       9612194007 9611416470             
>> 9611416470       9611416470                9611416480       9611416480 
>> 9613484402             9613484402       9613484402       9613484402       
>> 9613484402)
>>    OR productdetails_tokens_en:(9611444500     9611444520       9611444530   
>>     9611444540       9611414550 9612194002       9612194002       9612194002 
>>       9612194003       9612194007 9611416470             9611416470          
>>       9611416470       9611416480       9611416480 9613484402             
>> 9613484402       9613484402       9613484402                9613484402)
>>    OR codePartial:(9611444500  9611444520       9611444530       9611444540  
>>      9611414550 9612194002                9612194002       9612194002       
>> 9612194003       9612194007 9611416470             9611416470       
>> 9611416470                9611416480       9611416480 9613484402             
>> 9613484402       9613484402       9613484402       9613484402)
>>   )
>>  )
>> )
>> )
>> 
>> In the heap dump we can see the subqueries relating to 
>> fulltext_en/codePartial fields both have just 20 clauses.  However the two 
>> subqueries relating to productdetails_tokens_en both have 524288 clauses & 
>> each of those clauses is a subquery with up to 20 clauses (each of which 
>> seems to be a different shingled combination of the original tokens). For 
>> example, selecting an arbitrary single entry from the 524288 clauses, we can 
>> see a subquery with the following clauses:
>> 
>> Occur.MUST, productdetails_tokens_en: 9611444500
>> Occur.MUST, productdetails_tokens_en: 9611416470 9611416480
>> Occur.MUST, productdetails_tokens_en: 9611444520
>> Occur.MUST, productdetails_tokens_en: 9611444540
>> Occur.MUST, productdetails_tokens_en: 9612194007
>> Occur.MUST, productdetails_tokens_en: 9611444530
>> Occur.MUST, productdetails_tokens_en: 9612194002 9612194002
>> Occur.MUST, productdetails_tokens_en: 9612194002
>> Occur.MUST, productdetails_tokens_en: 9611416480
>> Occur.MUST, productdetails_tokens_en: 9611416470
>> Occur.MUST, productdetails_tokens_en: 9613484402
>> Occur.MUST, productdetails_tokens_en: 9612194003
>> Occur.MUST, productdetails_tokens_en: 9611414550
>> Occur.MUST, productdetails_tokens_en: 9613484402 9613484402 9613484402       
>>      
>> 
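One arithmetic observation about that 524288 figure: it is exactly 2^19, which is the number of ways to cut a 20-token sequence into consecutive shingles when any shingle length is allowed. The sketch below just does that counting. It is not a description of Solr internals, and whether the 7.1 query parser really enumerates every such path is an assumption, but the match with the heap dump is suggestive.

```python
def count_segmentations(num_tokens, max_shingle_size):
    """Count ways to split num_tokens into consecutive shingles of size 1..max_shingle_size."""
    ways = [0] * (num_tokens + 1)
    ways[0] = 1  # one way to cover zero tokens: the empty split
    for covered in range(1, num_tokens + 1):
        # The last shingle can have any allowed size that fits.
        for size in range(1, min(max_shingle_size, covered) + 1):
            ways[covered] += ways[covered - size]
    return ways[num_tokens]


print(count_segmentations(20, 30))  # 524288, i.e. 2**19
print(count_segmentations(8, 8))    # 128, i.e. 2**7
```

If that is what is happening, each extra token in the search doubles the number of top-level clauses, whereas capping the query side at 8 tokens keeps the count at 128.
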
>> 
>> So the question has two parts:
>> -          Is the observed behaviour expected in Solr 7.1 given the 
>> setup/query described above? (It seems to me that the answer is probably 
>> yes, because this is the purpose of the ShingleFilter)
>> -          Why is the same behaviour not in evidence in Solr 4.6?  Are there 
>> major differences in the way that the query is constructed in the earlier 
>> version?  If so, can we change Solr 7.1 config to behave more like Solr 4.6?
>> 
>> Many Thanks,
>> Neil
>> 
>> 
>> 
>> 
>> Neil Hubert-Price
>> Senior Consultant, SAP CX Success and Services, Northern Europe
>> 
>> neil.hubert-pr...@sap.com
>> M: +44 7788 368767
>> 
>> 
>> SAP (UK) Limited, Registered in England No. 2152073. Registered Office: 
>> Clockhouse Place, Bedfont Road, Feltham, Middlesex, TW14 8HD
> 
> 
>
