Darn, spoke too soon. Field collapsing throws off my facet counts when facet 
field values differ between documents within a group, since only each group's 
representative document gets counted.

Back to the drawing board. FWIW, I tried the hyperloglog function for the JSON 
facet aggregate counts and it has the same issue as unique() when used as the 
facet sort parameter: while reasonably fast, it uses masses of memory.
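
Concretely, what I tried was my original json.facet with the aggregate swapped 
for hll(), i.e. something like:

&json.facet={
  "people": {
      "type": "terms",
      "field": "person_id",
      "facet": {
          "grouped_count": "hll(item_id)"
      },
      "sort": "grouped_count desc"
  }
}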

Cheers,
~Mike

------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 18:53, Bryant, Michael <michael.bry...@kcl.ac.uk> wrote:

Hi Tom,

Well, the collapsing query parser is… a much better solution to my problems! 
Thanks for cluing me in to this; I love it when you can delete a load of hacks 
for something both simpler and faster.

Best,
~Mike


------
Mike Bryant

Research Associate
Department of Digital Humanities
King’s College London

On 10 Feb 2017, at 14:37, Tom Evans <tevans...@googlemail.com> wrote:

Hi Mike,

Looks like you are trying to get a list of the distinct item ids in a
result set, ordered by the most frequent item ids?

Can you use the collapsing query parser for this instead? It should be much quicker.

https://cwiki.apache.org/confluence/display/solr/Collapse+and+Expand+Results

Every document with the same item_id would need to be on the same
shard for this to work, and I'm not sure whether you can actually get
the count of collapsed documents, if that is something you need.
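
Roughly something like this (an untested sketch: collapse on your
item_id field, then facet over the collapsed result set):

&q=*:*
&fq={!collapse field=item_id}
&facet=true
&facet.field=person_id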


Another option might be to use hyperloglog function - hll() - instead
of unique(), which should give slightly better performance.
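
E.g., in your json.facet that would just mean swapping

  "grouped_count": "unique(item_id)"

for

  "grouped_count": "hll(item_id)"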

Cheers

Tom

On Thu, Feb 9, 2017 at 11:58 AM, Bryant, Michael
<michael.bry...@kcl.ac.uk> wrote:
Hi all,

I'm converting my legacy facets to JSON facets and am seeing much better 
performance, especially with high cardinality facet fields. However, the one 
issue I can't seem to resolve is excessive memory usage (and OOM errors) when 
trying to simulate the effect of "group.facet" to sort facets according to a 
grouping field.

My situation, slightly simplified, is:

*   Solr 4.6.1
*   Doc set: ~200,000 docs
*   Grouping by item_id, an indexed, stored, single-value string field with 
~50,000 unique values and ~4 docs per item
*   Faceting by person_id, an indexed, stored, multi-value string field with 
~50,000 values (with a very skewed distribution)
*   No docValues fields

Each document here is a description of an item, and there are several 
descriptions per item in multiple languages.

With legacy facets I use group.field=item_id and group.facet=true, which gives 
me facet counts reflecting the number of items rather than descriptions, 
correctly sorted by descending item count.
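
For reference, that request looks roughly like this (a sketch using the 
standard grouping and faceting parameters):

&q=*:*
&group=true
&group.field=item_id
&group.facet=true
&facet=true
&facet.field=person_id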

With JSON facets I'm doing the equivalent like so:

&json.facet={
  "people": {
      "type": "terms",
      "field": "person_id",
      "facet": {
          "grouped_count": "unique(item_id)"
      },
      "sort": "grouped_count desc"
  }
}

This works, and is somewhat faster than legacy faceting, but it also produces a 
massive spike in memory usage when (and only when) the sort parameter is set to 
the aggregate field. A server that runs happily with a 512MB heap OOMs unless I 
give it a 4GB heap. With sort set to (the default) "count desc" there is no 
memory usage spike.

I would be curious if anyone has experienced this kind of memory usage when 
sorting JSON facets by stats and if there’s anything I can do to mitigate it. 
I’ve tried reindexing with docValues enabled on the relevant fields and it 
seems to make no difference in this respect.

Many thanks,
~Mike

