[ 
https://issues.apache.org/jira/browse/SOLR-9764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965394#comment-15965394
 ] 

David Smiley commented on SOLR-9764:
------------------------------------

I'm looking at this more closely again.  Comments/questions:
* The change in {{DocSetBase.getBits}} from '64' to 'size()' seems odd to 
me.  Wouldn't, say, {{Math.max(64, size())}} (or perhaps a larger floor like 
1024) make more sense?  size() alone is almost certainly too small, no?
* Perhaps {{DocSetCollector.getDocSet}} should return {{DocSet.EMPTY}}?  Or 
perhaps this should be the job of {{DocSetUtil.getDocSet}}, since it already 
optimizes to a shared reference for the live docs.  That is quite minor, 
though; it's cheap and lightweight.
* {{SolrIndexSearcher.getDocSetBits}} will call {{getDocSet}}, which ensures 
the query gets put into the filter cache.  Yet it also upgrades the result to 
a {{BitDocSet}} if it isn't one already and puts it in the cache again, 
overwriting the existing SortedIntDocSet (if that's what it is).  Why?  And 
what if it's a match-no-docs?  If this is deliberate it deserves a comment; 
if not, it's probably a minor perf bug.
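To illustrate the sizing question in the first bullet, here's a sketch using plain {{java.util.BitSet}} rather than Solr's internal classes (the method and class names here are hypothetical stand-ins, not Solr's actual API):

```java
import java.util.BitSet;

public class GetBitsSizing {
    // Hypothetical stand-in for DocSetBase.getBits: allocate the backing
    // bit set with a floor so small doc sets don't start from a zero or
    // near-zero initial capacity.
    static BitSet getBits(int[] docs) {
        // Math.max(64, size) rather than a bare size(): the 64-bit floor
        // costs one long word and avoids a too-small initial allocation,
        // since size() counts docs rather than the max doc id.
        BitSet bits = new BitSet(Math.max(64, docs.length));
        for (int doc : docs) {
            bits.set(doc);
        }
        return bits;
    }

    public static void main(String[] args) {
        BitSet bits = getBits(new int[] {3, 70, 1000});
        System.out.println(bits.cardinality()); // 3
        System.out.println(bits.get(1000));     // true
    }
}
```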

The main thing I'm investigating, however, is how the filterCache's 
{{maxRamMB}} setting might avoid over-counting the shared liveDocs: counting 
it zero times or once are both fine possibilities, but definitely not more 
than once.  Without resorting to the cache knowing about live docs (ugh; 
pretty ugly), I think this requires a MatchAll instance like the one Michael 
has since created.  The match-all (live docs) set can easily be a common 
cache entry for range faceting on time, especially with time-based shards.
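A minimal sketch of the match-all idea (hypothetical class and method names, not Solr's actual API): a match-all entry that reports only a small constant to a maxRamMB-style RAM accounting, so the shared live-docs bits are never charged per cache entry.

```java
// Hypothetical sketch: a match-all doc set whose reported RAM usage is a
// small constant, so maxRamMB-style accounting never charges the shared
// live-docs bit array more than once (here: zero times per entry).
public class MatchAllDocSet {
    private final int numDocs; // number of live docs in the index

    public MatchAllDocSet(int numDocs) {
        this.numDocs = numDocs;
    }

    public int size() {
        return numDocs; // matches everything, so size == numDocs
    }

    public boolean exists(int docId) {
        return docId >= 0 && docId < numDocs; // every live doc matches
    }

    // What a RamUsageEstimator-style accounting would see: roughly an
    // object header plus one int field, not a numDocs/8-byte bit array.
    public long ramBytesUsed() {
        return 16 + Integer.BYTES;
    }
}
```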

> Design a memory efficient DocSet if a query returns all docs
> ------------------------------------------------------------
>
>                 Key: SOLR-9764
>                 URL: https://issues.apache.org/jira/browse/SOLR-9764
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Michael Sun
>            Assignee: Yonik Seeley
>             Fix For: 6.5, master (7.0)
>
>         Attachments: SOLR_9764_no_cloneMe.patch, SOLR-9764.patch, 
> SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch, 
> SOLR-9764.patch, SOLR-9764.patch, SOLR-9764.patch
>
>
> In some use cases, particularly time series use cases that use a collection 
> alias and partition data into multiple small collections by timestamp, a 
> filter query can match all documents in a collection. Currently a BitDocSet 
> is used, which contains a large array of long integers with every bit set 
> to 1. After querying, the resulting DocSet saved in the filter cache is 
> large and becomes one of the main memory consumers in these use cases.
> For example, suppose a Solr setup has 14 collections for the last 14 days 
> of data, one day per collection. A filter query for the last week of data 
> would result in at least six DocSets in the filter cache, each matching all 
> documents in one of six collections.
> This issue is to design a new DocSet that is memory efficient for such use 
> cases. The new DocSet removes the large array, reducing memory usage and GC 
> pressure without losing the advantages of a large filter cache.
> The gain can be especially large for time series use cases that use a 
> collection alias and partition data into multiple small collections by 
> timestamp.
> For further optimization, it may be helpful to design a DocSet with 
> run-length encoding. Thanks [~mmokhtar] for the suggestion. 
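To put rough numbers on the memory cost described in the issue (back-of-envelope arithmetic, not figures from the issue itself; the collection size is an assumed example): a long[]-backed bit set over maxDoc documents costs about maxDoc/8 bytes per cache entry, all set to 1 in the match-all case.

```java
public class BitDocSetCost {
    // Approximate bytes used by a long[]-backed bit set over maxDoc docs.
    static long bitSetBytes(long maxDoc) {
        long words = (maxDoc + 63) / 64; // one 64-bit word per 64 docs
        return words * Long.BYTES;
    }

    public static void main(String[] args) {
        // Assumed example: 10M docs per daily collection.
        long perCollection = bitSetBytes(10_000_000L);
        // ~1.25 MB of all-ones bits per match-all entry; six such entries
        // (as in the one-week example above) hold ~7.5 MB in the cache.
        System.out.println(perCollection);
        System.out.println(6 * perCollection);
    }
}
```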



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
