Re: Facet Performance
queryResultCache doesn’t really help with faceting, even if it’s hit for the main query. That cache only stores a subset of the hits, and to facet properly you need the entire result set.
Re: Facet Performance
We've noticed that the filterCache uses a significant amount of memory, as we've assigned 8GB Heap per instance. In total, we have 32 shards with 2 replicas, hence (8*32*2) 512GB of heap space alone. Further memory is required to ensure the index is always memory mapped for performance reasons.

Ideally I would like to be able to reduce the amount of memory assigned to the heap by using docValues instead of indexed, but it doesn't seem possible. The QTime (after warming) for facet.method=enum is around 150-250ms, whereas the QTime for facet.method=fc is around 1000-1200ms. As we require the results in real time for customers searching on our website, the latter QTime of 1000-1200ms is too slow for us to be able to use.

Our facet queries change as the customer selects different search criteria, and hence the number of potential queries makes it very difficult for the query result cache to be effective. We already have a custom implementation in which we check our redis cache for queries before they are sent to our aggregators, which runs at a 30% hit rate.

Kind Regards,

James Bodkin
Re: Facet Performance
To expand a bit on what Erick said regarding performance: my sense is that the RefGuide assertion that "docValues=true" makes faceting "faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion is not comparing to "facet.method=enum". For low-cardinality fields, if you have the heap space, and are very intentional about configuring your filterCache (and monitoring it as access patterns might change), "facet.method=enum" will likely be as fast as you can get (at least for "legacy" facets or whatever -- not sure about "enum" method in JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the main benefit is that the "uninverted" data structures are serialized on disk, so you're avoiding the need to uninvert each facet field on-heap for every new indexSearcher, which is generally high-latency -- user perception of this latency can be mitigated using warming queries, but it can still be problematic, esp. for frequent index updates. On-heap uninversion also inherently consumes a lot of heap space, which has general implications wrt GC, etc ... so in that respect even if faceting per se might not be "faster" with "docValues=true", your overall system may in many cases perform better.

(and Anthony, I'm pretty sure that tag/ex on facets should be orthogonal to the "facet.method=enum"/filterCache discussion, as tag/ex only affects the DocSet domain over which facets are calculated ... I think that step is pretty cleanly separated from the actual calculation of the facets. I'm not 100% sure on that, so proceed with caution, but it could definitely be worth evaluating for your use case!)

Michael
Re: Facet Performance
Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ use a docValues=false field for faceting/grouping/sorting/function queries. The primary point of docValues=true is twofold:

1> reduce Java heap requirements by using the OS memory to hold it

2> uninverting can be expensive CPU wise too, although not with just a few unique values (for each term, read the list of docs that have it and flip a bit).

It doesn’t really make sense to set it on an indexed=false field, since uninverting only happens on indexed=true docValues=false fields. OTOH, I don’t think it would do any harm either. That said, I frankly don’t know how that interacts with facet.method=enum.

As far as speed… yeah, you’re in the edge cases. All things being equal, stuffing these into the filterCache is the fastest way to facet if you have the memory. I’ve seen very few installations where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 bytes plus some overhead. If maxDoc is very large, this’ll chew up an enormous amount of memory. I’m cheating a bit here, since an entry can be smaller if only a few docs match it. But that’s the worst case you have to allow for, ‘cause you could theoretically hit the perfect storm where, due to some particular sequence of queries, your entire filter cache fills up with entries that size.

You’ll have some overhead to keep the cache at that size, but it sounds like it’s worth it.

Best,
Erick
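Erick's worst-case estimate above (maxDoc/8 bytes per filterCache entry) is easy to turn into a rough number. The following is only a back-of-the-envelope sketch: the 10 million document shard is a made-up figure, the 8192-entry cache size is the one James reports elsewhere in this thread, and per-entry overhead is ignored.

```python
def filter_cache_worst_case_bytes(max_doc: int, cache_size: int) -> int:
    """Worst case: every cached entry is a full bitset of max_doc bits."""
    # One bit per document, rounded up to whole bytes; per-entry overhead is ignored.
    bytes_per_entry = (max_doc + 7) // 8
    return bytes_per_entry * cache_size


# Hypothetical shard size (assumption); cache size taken from the thread.
worst_case = filter_cache_worst_case_bytes(max_doc=10_000_000, cache_size=8192)
print(f"{worst_case / 1024**3:.2f} GiB")  # prints "9.54 GiB"
```

As Erick notes, entries that match only a few docs can be smaller, so this is an upper bound to plan against rather than a prediction.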
Re: Facet Performance
The large majority of the relevant fields have fewer than 20 unique values. We have two fields over that, with 150 unique values and 5300 unique values respectively. At the moment, our filterCache is configured with a maximum size of 8192.

The DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that this approach promises to make lookups for faceting, sorting and grouping much faster. Hence I thought that using DocValues would be better than using Indexed and in turn improve our response times and possibly lower memory requirements. It sounds like this isn't the case if you are able to allocate enough memory to the filterCache.

I haven't yet tried changing the uninvertible setting; I was looking at the documentation for this field earlier today. Should we be setting uninvertible="false" if docValues="true" regardless of whether indexed is true or false?

Kind Regards,

James Bodkin
Re: Facet Performance
Ah, interesting! So if the number of possible values is low (like <= 10), it is faster to *not* use docvalues on that (indexed) faceted field? Does this hold true even when using faceting techniques like tag and exclusion?

Thanks,
Anthony
Re: Facet Performance
I strongly recommend setting indexed=true on a field you facet on for the purposes of efficient refinement (fq=field:value). But it isn't strictly required, as you have discovered.

~ David

On Wed, Jun 17, 2020 at 9:02 AM Michael Gibney wrote:

> facet.method=enum works by executing a query (against indexed values)
> for each indexed value in a given field (which, for indexed=false, is
> "no values"). So that explains why facet.method=enum no longer works.
> I was going to suggest that you might not want to set indexed=false on
> the docValues facet fields anyway, since the indexed values are still
> used for facet refinement (assuming your index is distributed).
>
> What's the number of unique values in the relevant fields? If it's low
> enough, setting docValues=false and indexed=true and using
> facet.method=enum (with a sufficiently large filterCache) is
> definitely a viable option, and will almost certainly be faster than
> docValues-based faceting. (As an aside, noting for future reference:
> high-cardinality facets over high-cardinality DocSet domains might be
> able to benefit from a term facet count cache:
> https://issues.apache.org/jira/browse/SOLR-13807)
>
> I think you didn't specifically mention whether you acted on Erick's
> suggestion of setting "uninvertible=false" (I think Erick accidentally
> said "uninvertible=true") to fail fast. I'd also recommend doing that,
> perhaps even above all else -- it shouldn't actually *do* anything,
> but will help ensure that things are behaving as you expect them to!
>
> Michael
>
> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin wrote:
> >
> > Thanks, I've implemented some queries that improve the first-hit
> > execution for faceting.
> >
> > Since turning off indexed on those fields, we've noticed that
> > facet.method=enum no longer returns the facets when used.
> > Using facet.method=fc/fcs is significantly slower compared to
> > facet.method=enum for us. Why do these two differences exist?
> >
> > On 16/06/2020, 17:52, "Erick Erickson" wrote:
> >
> >     Ok, I see the disconnect... Necessary parts of the index are read
> >     from disk lazily. So your newSearcher or firstSearcher query needs
> >     to do whatever operation causes the relevant parts of the index to
> >     be read. In this case, probably just facet on all the fields you
> >     care about. I'd add sorting too if you sort on different fields.
> >
> >     The *:* query without facets or sorting does virtually nothing due
> >     to some special handling...
> >
> >     On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com> wrote:
> >
> > > I've been trying to build a query that I can use in newSearcher
> > > based off the information in your previous e-mail. I thought you
> > > meant to build a *:* query as per Query 1 in my previous e-mail but
> > > I'm still seeing the first-hit execution.
> > > Now I'm wondering if you meant to create a *:* query with each of
> > > the fields as part of the fl query parameters or a *:* query with
> > > each of the fields and values as part of the fq query parameters.
> > >
> > > At the moment I've been running these manually as I expected that I
> > > would see the first-execution penalty disappear by the time I got to
> > > query 4, as I thought this would replicate the actions of the
> > > newSearcher.
> > > Unfortunately we can't use the autowarm count that is available as
> > > part of the filterCache/filterCache due to the custom deployment
> > > mechanism we use to update our index.
> > >
> > > Kind Regards,
> > >
> > > James Bodkin
> > >
> > > On 16/06/2020, 15:30, "Erick Erickson" wrote:
> > >
> > >     Did you try the autowarming like I mentioned in my previous
> > >     e-mail?
> > >
> > > > On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
> > > >
> > > > We've changed the schema to enable docValues for these fields and
> > > > this led to an improvement in the response time. We found a
> > > > further improvement by also switching off indexed as these fields
> > > > are used for faceting and filtering only.
> > > > Since those changes, we've found that the first-execution for
> > > > queries is really noticeable. I thought this would be the
> > > > filterCache based on what I saw in NewRelic however it is probably
> > > > trying to read the docValues from disk. How can we use the
> > > > autowarming to improve this?
> > > >
> > > > For example, I've run the following queries in sequence and each
> > > > query has a first-execution penalty.
> > > >
> > > > Query 1:
> > > >
> > > > q=*:*
> > > > facet=true
> > > > facet.field=D_DepartureAirport
> > > > facet.field=D_Destination
> > > > facet.limit=-1
> > > > rows=0
Re: Facet Performance
facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works. I was going to suggest that you might not want to set indexed=false on the docValues facet fields anyway, since the indexed values are still used for facet refinement (assuming your index is distributed).

What's the number of unique values in the relevant fields? If it's low enough, setting docValues=false and indexed=true and using facet.method=enum (with a sufficiently large filterCache) is definitely a viable option, and will almost certainly be faster than docValues-based faceting. (As an aside, noting for future reference: high-cardinality facets over high-cardinality DocSet domains might be able to benefit from a term facet count cache: https://issues.apache.org/jira/browse/SOLR-13807)

I think you didn't specifically mention whether you acted on Erick's suggestion of setting "uninvertible=false" (I think Erick accidentally said "uninvertible=true") to fail fast. I'd also recommend doing that, perhaps even above all else -- it shouldn't actually *do* anything, but will help ensure that things are behaving as you expect them to!

Michael

On Wed, Jun 17, 2020 at 4:31 AM James Bodkin wrote:
>
> Thanks, I've implemented some queries that improve the first-hit execution for faceting.
>
> Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used.
> Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these two differences exist?
>
> On 16/06/2020, 17:52, "Erick Erickson" wrote:
>
> Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read.
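Michael's point about needing "a sufficiently large filterCache" for facet.method=enum can be put into rough numbers. The sketch below is an editorial back-of-envelope estimate, not something posted in the thread. It assumes each unique facet value ends up as one filterCache entry, and that each entry is a bitset with one bit per document (maxDoc / 8 bytes). That is a commonly cited upper bound; Solr may store small result sets more compactly. The document count is the numFound from the original post, while the figure of 500 unique values is purely hypothetical.

```python
def bitset_bytes(max_doc: int) -> int:
    """Bytes needed for a bitset holding one bit per document (rounded up)."""
    return (max_doc + 7) // 8


def enum_filter_cache_upper_bound(max_doc: int, unique_values: int) -> int:
    """Rough upper bound in bytes: one full bitset per unique facet value."""
    return unique_values * bitset_bytes(max_doc)


if __name__ == "__main__":
    max_doc = 18_979_503   # numFound reported in the original post
    unique_values = 500    # hypothetical cardinality, for illustration only
    total = enum_filter_cache_upper_bound(max_doc, unique_values)
    print(f"per entry: {bitset_bytes(max_doc) / 1024 ** 2:.2f} MiB")
    print(f"upper bound for {unique_values} values: {total / 1024 ** 3:.2f} GiB")
```

An estimate like this only helps decide whether the cache could plausibly fit in the heap; actual usage would need to be checked against the cache statistics Solr reports.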
Re: Facet Performance
Thanks, I've implemented some queries that improve the first-hit execution for faceting. Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used. Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these two differences exist? On 16/06/2020, 17:52, "Erick Erickson" wrote: Ok, I see the disconnect... Necessary parts if the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields. The *:* query without facets or sorting does virtually nothing due to some special handling... On Tue, Jun 16, 2020, 10:48 James Bodkin wrote: > I've been trying to build a query that I can use in newSearcher based off > the information in your previous e-mail. I thought you meant to build a *:* > query as per Query 1 in my previous e-mail but I'm still seeing the > first-hit execution. > Now I'm wondering if you meant to create a *:* query with each of the > fields as part of the fl query parameters or a *:* query with each of the > fields and values as part of the fq query parameters. > > At the moment I've been running these manually as I expected that I would > see the first-execution penalty disappear by the time I got to query 4, as > I thought this would replicate the actions of the newSeacher. > Unfortunately we can't use the autowarm count that is available as part of > the filterCache/filterCache due to the custom deployment mechanism we use > to update our index. > > Kind Regards, > > James Bodkin > > On 16/06/2020, 15:30, "Erick Erickson" wrote: > > Did you try the autowarming like I mentioned in my previous e-mail? 
Re: Facet Performance
Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields.

The *:* query without facets or sorting does virtually nothing due to some special handling...

On Tue, Jun 16, 2020, 10:48 James Bodkin wrote:

> I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail but I'm still seeing the first-hit execution.
> Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters or a *:* query with each of the fields and values as part of the fq query parameters.
>
> At the moment I've been running these manually as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher.
> Unfortunately we can't use the autowarm count that is available as part of the filterCache/filterCache due to the custom deployment mechanism we use to update our index.
>
> Kind Regards,
>
> James Bodkin
>
> On 16/06/2020, 15:30, "Erick Erickson" wrote:
>
> Did you try the autowarming like I mentioned in my previous e-mail?
>
> > On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
> >
> > We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed as these fields are used for faceting and filtering only.
> > Since those changes, we've found that the first-execution for queries is really noticeable.
Re: Facet Performance
I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail but I'm still seeing the first-hit execution.
Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters or a *:* query with each of the fields and values as part of the fq query parameters.

At the moment I've been running these manually as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher.
Unfortunately we can't use the autowarm count that is available as part of the filterCache/filterCache due to the custom deployment mechanism we use to update our index.

Kind Regards,

James Bodkin

On 16/06/2020, 15:30, "Erick Erickson" wrote:

    Did you try the autowarming like I mentioned in my previous e-mail?

    > On Jun 16, 2020, at 10:18 AM, James Bodkin wrote:
    >
    > We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed as these fields are used for faceting and filtering only.
    > Since those changes, we've found that the first-execution for queries is really noticeable. I thought this would be the filterCache based on what I saw in NewRelic however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?
    >
    > For example, I've run the following queries in sequence and each query has a first-execution penalty.
Re: Facet Performance
Did you try the autowarming like I mentioned in my previous e-mail? > On Jun 16, 2020, at 10:18 AM, James Bodkin > wrote: > > We've changed the schema to enable docValues for these fields and this led to > an improvement in the response time. We found a further improvement by also > switching off indexed as these fields are used for faceting and filtering > only. > Since those changes, we've found that the first-execution for queries is > really noticeable. I thought this would be the filterCache based on what I > saw in NewRelic however it is probably trying to read the docValues from > disk. How can we use the autowarming to improve this? > > For example, I've run the following queries in sequence and each query has a > first-execution penalty. > > Query 1: > > q=*:* > facet=true > facet.field=D_DepartureAirport > facet.field=D_Destination > facet.limit=-1 > rows=0 > > Query 2: > > q=*:* > fq=D_DepartureAirport:(2660) > facet=true > facet.field=D_Destination > facet.limit=-1 > rows=0 > > Query 3: > > q=*:* > fq=D_DepartureAirport:(2661) > facet=true > facet.field=D_Destination > facet.limit=-1 > rows=0 > > Query 4: > > q=*:* > fq=D_DepartureAirport:(2660+OR+2661) > facet=true > facet.field=D_Destination > facet.limit=-1 > rows=0 > > We've kept the field type as a string, as the value is mapped by application > that accesses Solr. In the examples above, the values are mapped to airports > and destinations. > Is it possible to prewarm the above queries without having to define all the > potential filters manually in the auto warming? > > At the moment, we update and optimise our index in a different environment > and then copy the index to our production instances by using a rolling > deployment in Kubernetes. > > Kind Regards, > > James Bodkin > > On 12/06/2020, 18:58, "Erick Erickson" wrote: > >I question whether fiterCache has anything to do with it, I suspect what’s > really happening is that first time you’re reading the relevant bits from > disk into memory. 
Re: Facet Performance
We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed as these fields are used for faceting and filtering only.
Since those changes, we've found that the first-execution for queries is really noticeable. I thought this would be the filterCache based on what I saw in NewRelic however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?

For example, I've run the following queries in sequence and each query has a first-execution penalty.

Query 1:

q=*:*
facet=true
facet.field=D_DepartureAirport
facet.field=D_Destination
facet.limit=-1
rows=0

Query 2:

q=*:*
fq=D_DepartureAirport:(2660)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 3:

q=*:*
fq=D_DepartureAirport:(2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

Query 4:

q=*:*
fq=D_DepartureAirport:(2660+OR+2661)
facet=true
facet.field=D_Destination
facet.limit=-1
rows=0

We've kept the field type as a string, as the value is mapped by the application that accesses Solr. In the examples above, the values are mapped to airports and destinations.
Is it possible to prewarm the above queries without having to define all the potential filters manually in the auto warming?

At the moment, we update and optimise our index in a different environment and then copy the index to our production instances by using a rolling deployment in Kubernetes.

Kind Regards,

James Bodkin

On 12/06/2020, 18:58, "Erick Erickson" wrote:

    I question whether filterCache has anything to do with it. I suspect what’s really happening is that the first time you’re reading the relevant bits from disk into memory. And to double check, you should have docValues enabled for all these fields. The “uninverting” process can be very expensive, and docValues bypasses that.
Re: Facet Performance
I question whether filterCache has anything to do with it. I suspect what’s really happening is that the first time you’re reading the relevant bits from disk into memory. And to double check, you should have docValues enabled for all these fields. The “uninverting” process can be very expensive, and docValues bypasses that.

As of Solr 7.6, you can define “uninvertible=true” to your field(Type) to “fail fast” if Solr needs to uninvert the field.

But that’s an aside. In either case, my claim is that first-time execution does “something”, either reads the serialized docValues from disk or uninverts the field on Solr’s heap.

You can have this autowarmed by any combination of
1> specifying an autowarm count on your queryResultCache. That’s hit or miss, as it replays the most recent N queries which may or may not contain the sorts. That said, specifying 10-20 for autowarm count is usually a good idea, assuming you’re not committing more than, say, every 30 seconds. I’d add the same to filterCache too.

2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The difference is that newSearcher is fired every time a commit happens, while firstSearcher is only fired when Solr starts, the theory being that there’s no cache autowarming available when Solr first powers up. Usually, people don’t bother with firstSearcher or just make it the same as newSearcher. Note that a query doesn’t have to be “real” at all. You can just add all the facet fields to a *:* query in a single go.

BTW, Trie fields will stay around for a long time even though deprecated. Or at least until we find something to replace them with that doesn’t have this penalty, so I’d feel pretty safe using those and they’ll be more efficient than strings.

Best,
Erick

> On Jun 12, 2020, at 12:39 PM, James Bodkin wrote:
>
> We've run the performance test after changing the fields to be of the type string. We're seeing improved performance, especially after the first time the query has run.
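Erick's second option, a single *:* warming query that carries every facet field, could be assembled along these lines. This is only a sketch: the helper name is made up for illustration, the field names are the ones James listed in his example queries, and the resulting parameters would still have to be placed in a newSearcher or firstSearcher listener in solrconfig.xml by hand.

```python
from urllib.parse import urlencode


def build_warming_params(facet_fields):
    """Return Solr request parameters for one *:* query faceting on all fields.

    The query is not meant to be "real": it exists only to make Solr touch
    the data needed for each facet field.
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.limit", "-1"),
    ]
    # facet.field is repeated once per field, as in the queries in the thread.
    params += [("facet.field", name) for name in facet_fields]
    return params


if __name__ == "__main__":
    fields = ["D_DepartureAirport", "D_Destination"]
    print(urlencode(build_warming_params(fields)))
```

Running the same parameters manually against a freshly opened searcher is one way to check whether the first-execution penalty actually goes away before wiring them into the config.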
Re: Facet Performance
We've run the performance test after changing the fields to be of the type string. We're seeing improved performance, especially after the first time the query has run. The first run is taking around 1-2 seconds rather than 6-8 seconds and when the filter cache is present, the response time is around 400ms.
Do you have any more suggestions that we could try in order to optimise the performance?

On 11/06/2020, 14:49, "Erick Erickson" wrote:

    There’s a lot of confusion about using points-based fields for faceting, see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.

    Two options you might try:
    1> copyField to a string field and facet on that (won’t work, of course, for any kind of interval/range facet)
    2> use the deprecated Trie field instead. You could use the copyField to a Trie field for this too.

    Best,
    Erick
Re: Facet Performance
Could you explain why the performance is an issue for points-based fields? I've looked through the referenced issue (which is fixed in the version we are running) but I'm missing the link between the two. Is there an issue to improve this for points-based fields?
We're going to change the field type to a string, as our queries are always looking for a specific value (and not intervals/ranges) and rerun our load test.

Kind Regards,

James Bodkin

On 11/06/2020, 14:49, "Erick Erickson" wrote:

    There’s a lot of confusion about using points-based fields for faceting, see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.

    Two options you might try:
    1> copyField to a string field and facet on that (won’t work, of course, for any kind of interval/range facet)
    2> use the deprecated Trie field instead. You could use the copyField to a Trie field for this too.

    Best,
    Erick
Re: Facet Performance
There’s a lot of confusion about using points-based fields for faceting, see: https://issues.apache.org/jira/browse/SOLR-13227 for instance.

Two options you might try:
1> copyField to a string field and facet on that (won’t work, of course, for any kind of interval/range facet)
2> use the deprecated Trie field instead. You could use the copyField to a Trie field for this too.

Best,
Erick

> On Jun 11, 2020, at 9:39 AM, James Bodkin wrote:
>
> We’ve been running a load test against our index and have noticed that the facet queries are significantly slower than we would like.
> Currently these types of queries are taking several seconds to execute and we are wondering if it would be possible to speed these up.
> Repeating the same query over and over does not improve the response time so does not appear to utilise any caching.
> Ideally we would like to be targeting a response time around tens or hundreds of milliseconds if possible.
>
> An example query that is taking around 2-3 seconds to execute is:
>
> q=*.*
> facet=true
> facet.field=D_UserRatingGte
> facet.mincount=1
> facet.limit=-1
> rows=0
>
> "response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]}
> "facet_counts":{
>   "facet_queries":{},
>   "facet_fields":{
>     "D_UserRatingGte":[
>       "1575",16614238,
>       "1576",16614238,
>       "1577",16614238,
>       "1578",16065938,
>       "1579",12079545,
>       "1580",458799]},
>   "facet_ranges":{},
>   "facet_intervals":{},
>   "facet_heatmaps":{}}}
>
> I have also tried the equivalent query using the JSON Facet API with the same outcome of slow response time.
> Additionally I have tried changing the facet method (on both facet apis) with the same outcome of slow response time.
>
> The underlying field for the above query is configured as a solr.IntPointField with docValues, indexed and multiValued set to true.
> The index has just under 19 million documents and the physical size on disk is 10.95GB. The index is read-only and consists of 4 segments with 0 deletions.
> We’re running standalone Solr 8.3.1 with a 8GB Heap and the underlying Google > Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and > 100GB SSD. > > Would anyone be able to point me in a direction to either improve the > performance or understand the current performance is expected? > > Kind Regards, > > James Bodkin
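Erick's first option can be sketched roughly as follows. This is only an illustration under assumptions: the destination field name `D_UserRatingGte_s` and the schema lines in the comment are hypothetical and do not come from the thread. The Python part just builds the query string for the same facet request, pointed at the string copy instead of the points-based field.

```python
from urllib.parse import urlencode

# Schema side (hypothetical names, lives in the Solr schema, not in Python):
#   <field name="D_UserRatingGte_s" type="string" indexed="true" stored="false"
#          docValues="true" multiValued="true"/>
#   <copyField source="D_UserRatingGte" dest="D_UserRatingGte_s"/>


def facet_params(field, limit=-1, mincount=1):
    """Build the parameters for a plain field facet over all documents."""
    return [
        ("q", "*:*"),
        ("rows", "0"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.mincount", str(mincount)),
        ("facet.limit", str(limit)),
    ]


# Facet on the string copy rather than on the IntPointField itself.
query_string = urlencode(facet_params("D_UserRatingGte_s"))
print(query_string)
```

A full reindex would be needed after adding the copyField, and, as Erick notes, a string copy is of no use for interval or range facets.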
Facet Performance
We’ve been running a load test against our index and have noticed that the facet queries are significantly slower than we would like. Currently these types of queries are taking several seconds to execute and we are wondering if it would be possible to speed these up. Repeating the same query over and over does not improve the response time so it does not appear to utilise any caching. Ideally we would like to be targeting a response time around tens or hundreds of milliseconds if possible.

An example query that is taking around 2-3 seconds to execute is:

q=*:*
facet=true
facet.field=D_UserRatingGte
facet.mincount=1
facet.limit=-1
rows=0

"response":{"numFound":18979503,"start":0,"maxScore":1.0,"docs":[]}
"facet_counts":{
  "facet_queries":{},
  "facet_fields":{
    "D_UserRatingGte":[
      "1575",16614238,
      "1576",16614238,
      "1577",16614238,
      "1578",16065938,
      "1579",12079545,
      "1580",458799]},
  "facet_ranges":{},
  "facet_intervals":{},
  "facet_heatmaps":{}}}

I have also tried the equivalent query using the JSON Facet API with the same outcome of slow response time. Additionally I have tried changing the facet method (on both facet APIs) with the same outcome of slow response time.

The underlying field for the above query is configured as a solr.IntPointField with docValues, indexed and multiValued set to true. The index has just under 19 million documents and the physical size on disk is 10.95GB. The index is read-only and consists of 4 segments with 0 deletions. We’re running standalone Solr 8.3.1 with an 8GB heap and the underlying Google Cloud Virtual Machine in our load test environment has 6 vCPUs, 32G RAM and 100GB SSD.

Would anyone be able to point me in a direction to either improve the performance or understand whether the current performance is expected?

Kind Regards,

James Bodkin
Re: Facet performance problem
On 2/20/2018 1:18 AM, LOPEZ-CORTES Mariano-ext wrote:
> We return a facet list of values in "motifPresence" field (person status).
>
> Status:
> [ ] status1
> [x] status2
> [x] status3
>
> The user then selects one or more statuses (this is the step we called "facet filtering").
>
> Query is then re-executed with fq=motifPresence:(status2 OR status3)
>
> We use fq in order to not alter the score in main query.
>
> We've read that facet fields should have docValues=true. Do we also need indexed=true?

Facets, grouping, and sorting are more efficient with docValues, but searches aren't helped by docValues. Without indexed="true", searches on the field will be VERY slow. A filter query is still a search. The "filter" in filter query just refers to the fact that it's separate from the main query, and that it does not affect relevancy scoring.

Thanks,
Shawn
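Since the filter query is an ordinary search, the request the application sends after a status selection might look something like the sketch below. The helper name `status_filter` and the `q=*:*` main query are placeholders (the thread never shows the real main query); only the field name and the status values come from the messages above.

```python
from urllib.parse import urlencode


def status_filter(field, selected):
    """Return an fq clause such as motifPresence:("status2" OR "status3")."""
    if not selected:
        raise ValueError("at least one value must be selected")
    # Quote each value so spaces or special characters do not break the query.
    quoted = [
        '"{}"'.format(value.replace("\\", "\\\\").replace('"', '\\"'))
        for value in selected
    ]
    return "{}:({})".format(field, " OR ".join(quoted))


fq = status_filter("motifPresence", ["status2", "status3"])

# The fq restricts the result set without touching scoring; the facet is
# still requested on the same field so the status list can be redrawn.
params = urlencode([
    ("q", "*:*"),  # placeholder main query
    ("fq", fq),
    ("facet", "true"),
    ("facet.field", "motifPresence"),
])
print(fq)
```

With indexed="true" on the field, this fq resolves through the inverted index; with docValues only, the same clause works but degrades to a scan, which matches the slowness reported in the thread.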
RE: Facet performance problem
Our query looks like this:

...facet=true&facet.field=motifPresence

We return a facet list of values in "motifPresence" field (person status).

Status:
[ ] status1
[x] status2
[x] status3

The user then selects one or more statuses (this is the step we called "facet filtering").

Query is then re-executed with fq=motifPresence:(status2 OR status3)

We use fq in order to not alter the score in main query.

We've read that facet fields should have docValues=true. Do we also need indexed=true?

Is there any other problem in our solution?

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Monday, 19 February 2018 18:18
To: solr-user
Subject: Re: Facet performance problem

I'm confused here. What do you mean by "facet filtering"? Your examples have no facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do with it. This is one of the tricky bits of docValues. While it's _possible_ to search on a field that's defined as above, it's very inefficient since there's no "inverted index" for the field, you specified 'indexed="false" '. So the docValues are searched, and it's essentially a table scan.

If you mean to search against this field, set indexed="true". You'll have to completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote:
> Hi
>
> We have the following environment:
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 million documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> indexed="false" stored="true" required="false"/>
>
> Once the user selects a motifPresence filter we execute the search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow and its response
> time is greater than that of the main search (without facet filtering).
>
> Thanks in advance!
Re: Facet performance problem
I'm confused here. What do you mean by "facet filtering"? Your examples have no facets at all, just a _filter query_.

I'll assume you want to use filter query (fq), and faceting has nothing to do with it. This is one of the tricky bits of docValues. While it's _possible_ to search on a field that's defined as above, it's very inefficient since there's no "inverted index" for the field, you specified 'indexed="false" '. So the docValues are searched, and it's essentially a table scan.

If you mean to search against this field, set indexed="true". You'll have to completely reindex your corpus of course.

If you intend to facet, group or sort on this field, you should _also_ have docValues="true".

Best,
Erick

On Mon, Feb 19, 2018 at 7:47 AM, MOUSSA MZE Oussama-ext wrote:
> Hi
>
> We have the following environment:
>
> 3 nodes cluster
> 1 shard
> Replication factor = 2
> 8GB per node
>
> 29 million documents
>
> We're faceting over the field "motifPresence", defined as follows:
>
> stored="true" required="false"/>
>
> Once the user selects a motifPresence filter we execute the search again with:
>
> fq: (value1 OR value2 OR value3 OR ...)
>
> The problem is: during facet filtering the query is too slow and its response
> time is greater than that of the main search (without facet filtering).
>
> Thanks in advance!
Facet performance problem
Hi

We have the following environment:

3 nodes cluster
1 shard
Replication factor = 2
8GB per node

29 million documents

We're faceting over the field "motifPresence", defined as follows:

Once the user selects a motifPresence filter we execute the search again with:

fq: (value1 OR value2 OR value3 OR ...)

The problem is: during facet filtering the query is too slow and its response time is greater than that of the main search (without facet filtering).

Thanks in advance!
Re: Really slow facet performance in 6.6
On Mon, Oct 23, 2017 at 3:06 PM, John Davis wrote:
> Hello,
>
> We are seeing really slow facet performance with new solr release. This is
> on an index of 2M documents. A few things we've tried:

What happens when you run this facet request again? The first time a UIF faceting method runs for a field on a changed index, the data structure needs to be rebuilt (i.e. it's not good for NRT). Maybe that build time is being included. Otherwise I've never seen faceting so slow and there is something else going on here.

-Yonik
Re: Really slow facet performance in 6.6
have a look for more background information:
https://issues.apache.org/jira/browse/SOLR-8096

it's not only related to version 6.6. It's a question of design since 5.x

Günter

On 23.10.2017 21:06, John Davis wrote:
> Hello,
>
> We are seeing really slow facet performance with new solr release. This is on an index of 2M documents. A few things we've tried:
>
> 1. method=uif however that didn't help much (the facet fields have docValues=false since they are multi-valued). Debug info below.
>
> 2. changing query (q=) that selects what documents to compute facets on didn't help a lot, except repeating the same query was fast presumably due to exact cache hits.
>
> Sample debug info:
>
> "timing": {
>   "prepare": {
>     "debug": { "time": 0.0 },
>     "expand": { "time": 0.0 },
>     "facet": { "time": 0.0 },
>     "facet_module": { "time": 0.0 },
>     "highlight": { "time": 0.0 },
>     "mlt": { "time": 0.0 },
>     "query": { "time": 0.0 },
>     "stats": { "time": 0.0 },
>     "terms": { "time": 0.0 },
>     "time": 0.0
>   },
>   "process": {
>     "debug": { "time": 87.0 },
>     "expand": { "time": 0.0 },
>     "facet": { "time": 9814.0 },
>     "facet_module": { "time": 0.0 },
>     "highlight": { "time": 0.0 },
>     "mlt": { "time": 0.0 },
>     "query": { "time": 20.0 },
>     "stats": { "time": 0.0 },
>     "terms": { "time": 0.0 },
>     "time": 9922.0
>   },
>   "time": 9923.0
> }
> },
> "facet-debug": {
>   "elapse": 8310,
>   "sub-facet": [
>     {
>       "action": "field facet",
>       "elapse": 8310,
>       "maxThreads": 2,
>       "processor": "SimpleFacets",
>       "sub-facet": [
>         {},
>         { "appliedMethod": "UIF", "field": "school", "inputDocSetSize": 476, "requestedMethod": "UIF" },
>         { "appliedMethod": "UIF", "elapse": 2575, "field": "work", "inputDocSetSize": 476, "requestedMethod": "UIF" },
>         { "appliedMethod": "UIF", "elapse": 8310, "field": "level", "inputDocSetSize": 476, "requestedMethod": "UIF" }
>       ]
>     }
>
> Thanks
> John

--
Günter Hipler
Universität Basel | Universitätsbibliothek | Projekt swissbib
Schönbeinstrasse 18-20 | 4056 Basel | Schweiz
Tel +41 61 207 31 12 | Fax +41 61 207 31 03
E-Mail guenter.hip...@unibas.ch | http://www.ub.unibas.ch | https://www.swissbib.ch
Re: Really slow facet performance in 6.6
John Davis wrote:
> We are seeing really slow facet performance with new solr release.
> This is on an index of 2M documents.

I am currently running some performance experiments on simple String faceting, comparing Solr 4 & 6. There is definitely a performance difference, but it is not trivial to pinpoint where it is.

My first thought was that it was tied to the Solr version, with Solr 6 being markedly slower than Solr 4. However, looking at segment count, I can see that Solr 6 has twice as many segments as Solr 4 for my test setup. I tried optimizing down to 10 segments, which flipped the result: Suddenly Solr 6 was faster than Solr 4.

I'm still poking at this, but I guess my takeaway for now is to be sure to compare on fair terms. The strategy for creating segments can be tweaked and (guessing a lot here) it seems that Solr 6 defaults lean towards faster indexing (by having more small segments) at the cost of faceting performance.

These JIRAs seem relevant:
https://issues.apache.org/jira/browse/SOLR-8096
https://issues.apache.org/jira/browse/SOLR-9599

> 1. method=uif however that didn't help much (the facet fields have
> docValues=false since they are multi-valued). Debug info below.

docValues works fine with multi-values (at least for Strings).

- Toke Eskildsen
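To put two installations on comparable footing before benchmarking, as Toke describes, one could merge both indexes down to the same segment count. The sketch below only builds the request URL; the base URL and the core name `test` are placeholders, and it assumes the update handler accepts the `optimize` and `maxSegments` request parameters, which is worth checking against the reference guide for the Solr version in use.

```python
from urllib.parse import urlencode


def optimize_url(base_url, core, max_segments):
    """Build an update-handler URL asking Solr to merge down to max_segments."""
    if max_segments < 1:
        raise ValueError("max_segments must be at least 1")
    query = urlencode({"optimize": "true", "maxSegments": max_segments})
    return "{}/{}/update?{}".format(base_url.rstrip("/"), core, query)


# Placeholder host and core; 10 segments mirrors the experiment above.
url = optimize_url("http://localhost:8983/solr", "test", 10)
print(url)
```

Running the same facet request against both versions only after each index has been merged to the same segment count avoids attributing a segment-layout effect to the Solr version.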
Really slow facet performance in 6.6
Hello,

We are seeing really slow facet performance with new solr release. This is on an index of 2M documents. A few things we've tried:

1. method=uif however that didn't help much (the facet fields have docValues=false since they are multi-valued). Debug info below.

2. changing query (q=) that selects what documents to compute facets on didn't help a lot, except repeating the same query was fast presumably due to exact cache hits.

Sample debug info:

"timing": {
  "prepare": {
    "debug": { "time": 0.0 },
    "expand": { "time": 0.0 },
    "facet": { "time": 0.0 },
    "facet_module": { "time": 0.0 },
    "highlight": { "time": 0.0 },
    "mlt": { "time": 0.0 },
    "query": { "time": 0.0 },
    "stats": { "time": 0.0 },
    "terms": { "time": 0.0 },
    "time": 0.0
  },
  "process": {
    "debug": { "time": 87.0 },
    "expand": { "time": 0.0 },
    "facet": { "time": 9814.0 },
    "facet_module": { "time": 0.0 },
    "highlight": { "time": 0.0 },
    "mlt": { "time": 0.0 },
    "query": { "time": 20.0 },
    "stats": { "time": 0.0 },
    "terms": { "time": 0.0 },
    "time": 9922.0
  },
  "time": 9923.0
}
},
"facet-debug": {
  "elapse": 8310,
  "sub-facet": [
    {
      "action": "field facet",
      "elapse": 8310,
      "maxThreads": 2,
      "processor": "SimpleFacets",
      "sub-facet": [
        {},
        { "appliedMethod": "UIF", "field": "school", "inputDocSetSize": 476, "requestedMethod": "UIF" },
        { "appliedMethod": "UIF", "elapse": 2575, "field": "work", "inputDocSetSize": 476, "requestedMethod": "UIF" },
        { "appliedMethod": "UIF", "elapse": 8310, "field": "level", "inputDocSetSize": 476, "requestedMethod": "UIF" }
      ]
    }

Thanks
John
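Debug output like the above is easier to read once it is reduced to "which component, and which field, took the time". The following sketch does that over hand-copied fragments of the numbers in this message; the helper names are made up for illustration, and the dictionaries are trimmed copies rather than a parse of a real response.

```python
def slowest_component(process_timing):
    """Return (name, millis) for the slowest component in a timing section."""
    components = {
        name: entry["time"]
        for name, entry in process_timing.items()
        if isinstance(entry, dict)  # skip the scalar "time" total
    }
    name = max(components, key=components.get)
    return name, components[name]


def slowest_field(sub_facets):
    """Return (field, elapse) among per-field entries that report an elapse."""
    timed = [(f["field"], f["elapse"]) for f in sub_facets if "elapse" in f]
    return max(timed, key=lambda pair: pair[1])


# Values copied from the "process" section and the inner "sub-facet" list.
process = {
    "debug": {"time": 87.0},
    "facet": {"time": 9814.0},
    "query": {"time": 20.0},
    "time": 9922.0,
}
sub_facets = [
    {},
    {"appliedMethod": "UIF", "field": "school", "inputDocSetSize": 476},
    {"appliedMethod": "UIF", "elapse": 2575, "field": "work", "inputDocSetSize": 476},
    {"appliedMethod": "UIF", "elapse": 8310, "field": "level", "inputDocSetSize": 476},
]

print(slowest_component(process))
print(slowest_field(sub_facets))
```

On these numbers the facet component accounts for nearly all of the request time, and the `level` field alone for most of that, even though the input DocSet has only 476 documents.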
Re: JSON facet performance for aggregations
Hi Yonik,

I like your work on Solr very much, and I'm hoping it can deliver what we are looking to achieve here. Apologies for the direct approach, but I don't think I have a choice: I submitted the request below to the mailing list and still haven't had a reply. Part of me wonders whether that's because I have missed something very obvious, or because my approach to the problem uses the wrong technology.

The mailing list is not allowing me to send you a direct link to the issue unless you want to see my message with a lot of XML 😊 so I'm pasting the contents of my message below.

Thanks,

~

I have an English book whose contents I have successfully indexed into a field called 'content', with the following properties:

So if I need to return the number of occurrences of a specific term pattern, e.g. '*olomo*', then for my document the result should be 'Solomon' with a term frequency of 2.

I've tried going through the term vector section in the reference and various other posts on the internet, but I still haven't managed to figure out how. The nearest I found is the following:

http://localhost:8983/solr/test/tvrh?q=content:[*%20TO%20*]&indent=true&tv.tf=true&tv.df=true

This brings my PC to a near halt for a couple of minutes and then returns the term frequency of every term! But I only need the term frequency of a particular pattern/regex. Is there a way to narrow it down to just one pattern, e.g. *thing*, so it will find soothing, something, everything, each with its number of occurrences for the document?

Thanks,

~

From: Yonik Seeley
Sent: 24 May 2017 10:45
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed. A quick test shows about a 30x speedup when faceting on a string/numeric docvalues field with 100K unique values and doing a simple aggregation on another numeric field (and when the limit:-1).

-Yonik
Re: JSON facet performance for aggregations
On Mon, May 8, 2017 at 11:27 AM, Yonik Seeley wrote:
> I opened https://issues.apache.org/jira/browse/SOLR-10634 to address
> this performance issue.

OK, this has been committed. A quick test shows about a 30x speedup when faceting on a string/numeric docvalues field with 100K unique values and doing a simple aggregation on another numeric field (and when the limit:-1).

-Yonik
Re: JSON facet performance for aggregations
On Mon, May 8, 2017 at 3:55 AM, Mikhail Ibraheem wrote:
> Thanks Yonik.
> It is double because our use case allows to group by any field of any type.

Grouping in Solr does not require a double type, so I'm not sure how that logically follows. Perhaps it's a limitation in the system using Solr?

> According to your below valuable explanation, is it better at this case to
> use flat faceting instead of JSON faceting?

I don't think it would help. I opened https://issues.apache.org/jira/browse/SOLR-10634 to address this performance issue.

> Indexing the field should give us better performance than flat faceting?

Indexing the studentId field should give better performance wherever you need to search for or filter by specific student ids.

-Yonik

> Indexing the field should give us better performance than flat faceting?
> Do you recommend streaming at that case?
>
> Please advise.
>
> Thanks
> Mikhail
RE: JSON facet performance for aggregations
Thanks Yonik.

It is double because our use case allows to group by any field of any type.

According to your below valuable explanation, is it better at this case to use flat faceting instead of JSON faceting?

Indexing the field should give us better performance than flat faceting?

Do you recommend streaming at that case?

Please advise.

Thanks
Mikhail

-Original Message-
From: Yonik Seeley [mailto:ysee...@gmail.com]
Sent: Sunday, May 07, 2017 6:25 PM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from a total of N. When one asks to return the top 10 buckets when there are potentially millions of buckets, it makes sense to defer calculating other metrics for those buckets until we know which ones they are. After we identify the top 10 buckets, we calculate the domain for that bucket and use that to calculate the remaining metrics.

The current method is obviously much slower when one is requesting *all* buckets. We might as well just calculate all metrics in the first pass rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not indexed. In the second phase, finding the domain for a bucket is a field query. For an indexed field, this would involve a single term lookup. For a non-indexed docValues field, this involves a full column scan.

If you ever want to do quick lookups on studentId, it would make sense for it to be indexed (and why is it a double, anyway?)

I'll open up a JIRA issue for the first problem (don't defer metrics if we're going to return all buckets anyway)

-Yonik
Re: JSON facet performance for aggregations
OK, so I think I know what's going on.

The current code is more optimized for finding the top K buckets from a total of N. When one asks to return the top 10 buckets when there are potentially millions of buckets, it makes sense to defer calculating other metrics for those buckets until we know which ones they are. After we identify the top 10 buckets, we calculate the domain for that bucket and use that to calculate the remaining metrics.

The current method is obviously much slower when one is requesting *all* buckets. We might as well just calculate all metrics in the first pass rather than trying to defer them.

This inefficiency is compounded by the fact that the fields are not indexed. In the second phase, finding the domain for a bucket is a field query. For an indexed field, this would involve a single term lookup. For a non-indexed docValues field, this involves a full column scan.

If you ever want to do quick lookups on studentId, it would make sense for it to be indexed (and why is it a double, anyway?)

I'll open up a JIRA issue for the first problem (don't defer metrics if we're going to return all buckets anyway)

-Yonik

On Sun, Apr 30, 2017 at 8:58 AM, Mikhail Ibraheem wrote:
> Hi Yonik,
> We are using Solr 6.5
> Both studentId and grades are double:
> stored="true" docValues="true" multiValued="false" required="false"/>
>
> We have 1.5 million records.
>
> Thanks
> Mikhail
>
> -Original Message-
> From: Yonik Seeley [mailto:ysee...@gmail.com]
> Sent: Sunday, April 30, 2017 1:04 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON facet performance for aggregations
>
> It is odd there would be quite such a big performance delta.
> What version of solr are you using?
> What is the fieldType of "grades"?
> -Yonik
>
> On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem wrote:
>> 1-
>> studentId has docValue = true . it is of type double which is
>> stored="true" docValues="true" multiValued="false" required="false"/>
>>
>> 2- If we just facet without aggregation it finishes in good time 60ms:
>>
>> json.facet={
>>   studentId:{
>>     type:terms,
>>     limit:-1,
>>     field:" studentId "
>>   }
>> }
>>
>> Thanks
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 10:44 AM
>> To: solr-user@lucene.apache.org
>> Subject: RE: JSON facet performance for aggregations
>>
>> Please enable doc values and try.
>> There is a bug in the source code which causes json facet on string field to
>> run very slow. On numeric fields it runs fine with doc value enabled.
>>
>> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem" wrote:
>>
>>> Hi Vijay,
>>> It is already numeric field.
>>> It is huge difference between json and flat here. Do you know the
>>> reason for this? Is there a way to improve it ?
>>>
>>> -Original Message-
>>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>>> Sent: Sunday, April 30, 2017 9:58 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: JSON facet performance for aggregations
>>>
>>> Json facet on string fields run lot slower than on numeric fields.
>>> Try and see if you can represent studentid as a numeric field.
>>>
>>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem" wrote:
>>>
>>> > Hi,
>>> >
>>> > I am trying to do aggregation with JSON faceting but performance is
>>> > very bad for one of the requests:
>>> >
>>> > json.facet={
>>> >   studentId:{
>>> >     type:terms,
>>> >     limit:-1,
>>> >     field:"studentId",
>>> >     facet:{
>>> >       x:"sum(grades)"
>>> >     }
>>> >   }
>>> > }
>>> >
>>> > This request finishes in 250 seconds, and we can't paginate for
>>> > this service for functional reason so we have to use limit:-1, and
>>> > the cardinality of the studentId is 7500.
>>> >
>>> > If I try the same with flat facet it finishes in 3 seconds :
>>> > stats=true&facet=true&stats.field={!tag=piv1
>>> > sum=true}grades&facet.pivot={!stats=piv1}studentId
>>> >
>>> > We are hoping to use one approach json or flat for all our services.
>>> > JSON facet performance is better for many case.
>>> >
>>> > Please advise on why the performance for this is so bad and if we
>>> > can improve it. Also what is the default algorithm used for json facet.
>>> >
>>> > Thanks
>>> >
>>> > Mikhail
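The cost difference Yonik describes can be illustrated with a toy model. This is not Solr's code: the two functions below are simplified stand-ins written for this note, operating on a handful of made-up `(studentId, grade)` pairs. The deferred variant counts buckets first and then re-scans the whole column once per returned bucket, which is what a per-bucket field query on a non-indexed docValues field amounts to; the single-pass variant accumulates the metric while collecting.

```python
from collections import defaultdict

# Made-up (studentId, grade) pairs standing in for two docValues columns.
docs = [(1.0, 80.0), (2.0, 55.0), (1.0, 90.0), (3.0, 70.0), (2.0, 65.0)]


def deferred(docs, limit):
    """Phase 1 counts buckets; phase 2 rescans the column per kept bucket."""
    counts = defaultdict(int)
    for student, _ in docs:
        counts[student] += 1
    top = sorted(counts, key=lambda s: (-counts[s], s))
    if limit >= 0:
        top = top[:limit]
    scans = 0
    result = {}
    for student in top:
        scans += 1  # one full column scan per bucket without an inverted index
        result[student] = sum(g for s, g in docs if s == student)
    return result, scans


def single_pass(docs):
    """Accumulate the metric for every bucket while collecting documents."""
    sums = defaultdict(float)
    for student, grade in docs:
        sums[student] += grade
    return dict(sums)


all_buckets, scans = deferred(docs, limit=-1)
print(all_buckets, scans)
print(single_pass(docs))
```

With `limit=-1` the deferred variant performs one scan per distinct studentId, so at a cardinality of 7500 over 1.5 million records the second phase dominates, while the single pass touches each document once. With a small limit the deferred approach only pays for the few buckets it keeps, which is the case it was designed for.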
RE: JSON facet performance for aggregations
Hi Yonik,

We are using Solr 6.5

Both studentId and grades are double:

We have 1.5 million records.

Thanks
Mikhail

-Original Message-
From: Yonik Seeley [mailto:ysee...@gmail.com]
Sent: Sunday, April 30, 2017 1:04 PM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

It is odd there would be quite such a big performance delta.
What version of solr are you using?
What is the fieldType of "grades"?
-Yonik
Re: JSON facet performance for aggregations
It is odd there would be quite such a big performance delta.
What version of solr are you using?
What is the fieldType of "grades"?
-Yonik

On Sun, Apr 30, 2017 at 5:15 AM, Mikhail Ibraheem wrote:
> 1-
> studentId has docValue = true . it is of type double which is
> name="double" class="solr.TrieDoubleField" indexed="false" stored="true"
> docValues="true" multiValued="false" required="false"/>
>
> 2- If we just facet without aggregation it finishes in good time 60ms:
>
> json.facet={
>   studentId:{
>     type:terms,
>     limit:-1,
>     field:" studentId "
>   }
> }
>
> Thanks
>
> -Original Message-
> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
> Sent: Sunday, April 30, 2017 10:44 AM
> To: solr-user@lucene.apache.org
> Subject: RE: JSON facet performance for aggregations
>
> Please enable doc values and try.
> There is a bug in the source code which causes json facet on string field to
> run very slow. On numeric fields it runs fine with doc value enabled.
>
> On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem" wrote:
>
>> Hi Vijay,
>> It is already numeric field.
>> It is huge difference between json and flat here. Do you know the
>> reason for this? Is there a way to improve it ?
>>
>> -Original Message-
>> From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
>> Sent: Sunday, April 30, 2017 9:58 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: JSON facet performance for aggregations
>>
>> Json facet on string fields run lot slower than on numeric fields. Try
>> and see if you can represent studentid as a numeric field.
>>
>> On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem" wrote:
>>
>> > Hi,
>> >
>> > I am trying to do aggregation with JSON faceting but performance is
>> > very bad for one of the requests:
>> >
>> > json.facet={
>> >   studentId:{
>> >     type:terms,
>> >     limit:-1,
>> >     field:"studentId",
>> >     facet:{
>> >       x:"sum(grades)"
>> >     }
>> >   }
>> > }
>> >
>> > This request finishes in 250 seconds, and we can't paginate for this
>> > service for functional reason so we have to use limit:-1, and the
>> > cardinality of the studentId is 7500.
>> >
>> > If I try the same with flat facet it finishes in 3 seconds :
>> > stats=true&facet=true&stats.field={!tag=piv1
>> > sum=true}grades&facet.pivot={!stats=piv1}studentId
>> >
>> > We are hoping to use one approach json or flat for all our services.
>> > JSON facet performance is better for many case.
>> >
>> > Please advise on why the performance for this is so bad and if we
>> > can improve it. Also what is the default algorithm used for json facet.
>> >
>> > Thanks
>> >
>> > Mikhail
RE: JSON facet performance for aggregations
1- studentId has docValues=true. It is of type double, which is
name="double" class="solr.TrieDoubleField" indexed="false" stored="true" docValues="true" multiValued="false" required="false"/>

2- If we just facet without aggregation, it finishes in good time (60ms):

json.facet={
  studentId:{
    type:terms,
    limit:-1,
    field:"studentId"
  }
}

Thanks

-----Original Message-----
From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
Sent: Sunday, April 30, 2017 10:44 AM
To: solr-user@lucene.apache.org
Subject: RE: JSON facet performance for aggregations

Please enable doc values and try.
There is a bug in the source code which causes json facet on string field to run very slow. On numeric fields it runs fine with doc value enabled.
RE: JSON facet performance for aggregations
Please enable doc values and try.
There is a bug in the source code which causes JSON facets on string fields to run very slowly. On numeric fields it runs fine with doc values enabled.

On Apr 30, 2017 1:41 PM, "Mikhail Ibraheem" wrote:
> Hi Vijay,
> It is already numeric field.
> It is huge difference between json and flat here. Do you know the reason
> for this? Is there a way to improve it ?
RE: JSON facet performance for aggregations
Hi Vijay,
It is already a numeric field.
There is a huge difference between JSON and flat faceting here. Do you know the reason for this? Is there a way to improve it?

-----Original Message-----
From: Vijay Tiwary [mailto:vijaykr.tiw...@gmail.com]
Sent: Sunday, April 30, 2017 9:58 AM
To: solr-user@lucene.apache.org
Subject: Re: JSON facet performance for aggregations

Json facet on string fields run lot slower than on numeric fields. Try and see if you can represent studentid as a numeric field.
Re: JSON facet performance for aggregations
JSON facets on string fields run a lot slower than on numeric fields. Try and see if you can represent studentId as a numeric field.

On Apr 30, 2017 1:19 PM, "Mikhail Ibraheem" wrote:
> Hi,
>
> I am trying to do aggregation with JSON faceting but performance is very
> bad for one of the requests:
>
> json.facet={
>   studentId:{
>     type:terms,
>     limit:-1,
>     field:"studentId",
>     facet:{
>       x:"sum(grades)"
>     }
>   }
> }
>
> This request finishes in 250 seconds, and we can't paginate for this
> service for functional reason so we have to use limit:-1, and the
> cardinality of the studentId is 7500.
>
> If I try the same with flat facet it finishes in 3 seconds :
> stats=true&facet=true&stats.field={!tag=piv1 sum=true}grades&facet.pivot={!stats=piv1}studentId
>
> We are hoping to use one approach json or flat for all our services. JSON
> facet performance is better for many case.
>
> Please advise on why the performance for this is so bad and if we can
> improve it. Also what is the default algorithm used for json facet.
>
> Thanks
>
> Mikhail
JSON facet performance for aggregations
Hi,

I am trying to do aggregation with JSON faceting but performance is very bad for one of the requests:

json.facet={
  studentId:{
    type:terms,
    limit:-1,
    field:"studentId",
    facet:{
      x:"sum(grades)"
    }
  }
}

This request finishes in 250 seconds, and we can't paginate for this service for functional reasons so we have to use limit:-1, and the cardinality of studentId is 7500.

If I try the same with flat facets it finishes in 3 seconds:

stats=true&facet=true&stats.field={!tag=piv1 sum=true}grades&facet.pivot={!stats=piv1}studentId

We are hoping to use one approach, JSON or flat, for all our services. JSON facet performance is better in many cases.

Please advise on why the performance for this is so bad and whether we can improve it. Also, what is the default algorithm used for JSON facets?

Thanks
Mikhail
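
For anyone who wants to time the two request styles side by side, here is a minimal sketch that builds both parameter sets from the message above. The q=*:* and rows=0 values and the helper names are assumptions added for illustration; the message does not show the main query.

```python
import json
from urllib.parse import urlencode, parse_qs


def json_facet_params(bucket_field, metric_field):
    """Build parameters for a terms facet with a nested sum() aggregation."""
    facet = {
        bucket_field: {
            "type": "terms",
            "field": bucket_field,
            "limit": -1,  # return every bucket, as in the original request
            "facet": {"x": "sum(%s)" % metric_field},
        }
    }
    # q and rows are assumptions; the original message does not show them.
    return {"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}


def flat_facet_params(bucket_field, metric_field):
    """Build the stats + pivot parameters quoted in the original message."""
    return {
        "q": "*:*",  # assumption, see above
        "rows": 0,   # assumption, see above
        "stats": "true",
        "facet": "true",
        "stats.field": "{!tag=piv1 sum=true}" + metric_field,
        "facet.pivot": "{!stats=piv1}" + bucket_field,
    }


json_query = urlencode(json_facet_params("studentId", "grades"))
flat_query = urlencode(flat_facet_params("studentId", "grades"))
```

Appending either string to the select handler URL of the collection gives two requests that differ only in the faceting style, which makes the QTime comparison easier to repeat.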
Re: prefix facet performance
In SimpleFacets.getFacetTermEnumCounts, we seek to the first term matching the prefix using the index, and then for each term after that we compare the prefix until it no longer matches.

-Yonik

On Mon, Apr 24, 2017 at 5:04 AM, alessandro.benedetti wrote:
> Thanks Yonik and Maria.
> It make sense, if we reduce the number of terms, term enum becomes a very
> good solution.
> @Yonik : do we still check the prefix on the term dictionary one by one, or
> an FST is used to identify the set of candidate terms ?
>
> I will check the code later,
>
> Regards
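
The seek-then-compare behaviour described above can be sketched on a plain sorted list. This is only an illustration of the idea, not Solr's code: the function name is made up, and a Python list with bisect stands in for the Lucene term dictionary.

```python
from bisect import bisect_left


def terms_with_prefix(sorted_terms, prefix):
    """Return the terms that start with prefix from a sorted term list."""
    matches = []
    # "Seek": jump to the first term that sorts on or after the prefix.
    i = bisect_left(sorted_terms, prefix)
    # Compare the prefix term by term until it no longer matches.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


terms = sorted(["A/B", "A/C", "A/D", "C/E", "M/F"])
matching = terms_with_prefix(terms, "A/")
```

Because the list is sorted, the loop can stop at the first term that does not carry the prefix; terms before the seek position and after the last match are never inspected.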
Re: prefix facet performance
Thanks Yonik and Maria.
It makes sense: if we reduce the number of terms, term enum becomes a very good solution.
@Yonik: do we still check the prefix on the term dictionary one by one, or is an FST used to identify the set of candidate terms?

I will check the code later.

Regards

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331553.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: prefix facet performance
I see. Once I specify a prefix the number of terms is MUCH smaller.

Thank you again for all your help.

Maria

On Fri, Apr 21, 2017 at 1:46 PM, Yonik Seeley wrote:
> Ah, the fact that you specify a facet.prefix makes this perfectly
> aligned for the "enum" method, which can skip directly to the first
> term on-or-after "A/"
> facet.method=enum goes term-by-term, calculating the intersection with
> the facet domain.
> In this case, it's the number of terms that start with "A/" that
> matters, not the number of terms in the entire field (hence the
> speedup).
>
> -Yonik
Re: prefix facet performance
On Fri, Apr 21, 2017 at 4:25 PM, Maria Muslea wrote:
> The field is:
>
> and using unique() I found that it has 700K+ unique values.
>
> The query before (that takes ~10s):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
> the query after (that is almost instant):
>
> wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum

Ah, the fact that you specify a facet.prefix makes this perfectly aligned for the "enum" method, which can skip directly to the first term on-or-after "A/".
facet.method=enum goes term-by-term, calculating the intersection with the facet domain.
In this case, it's the number of terms that start with "A/" that matters, not the number of terms in the entire field (hence the speedup).

-Yonik
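
To make the "term-by-term intersection with the facet domain" concrete, here is a rough sketch, assuming a toy index where each term maps to a set of document ids. The names postings, domain and enum_prefix_counts are hypothetical and the doc ids are made up; real posting lists and DocSets are not Python sets.

```python
from bisect import bisect_left


def enum_prefix_counts(postings, domain, prefix):
    """Count, for each term with the prefix, the docs that are also in domain.

    postings: dict mapping term -> set of doc ids (toy stand-in for the index)
    domain:   set of doc ids matched by the main query (the facet domain)
    """
    terms = sorted(postings)
    counts = {}
    visited = 0
    i = bisect_left(terms, prefix)  # skip directly to the first candidate term
    while i < len(terms) and terms[i].startswith(prefix):
        counts[terms[i]] = len(postings[terms[i]] & domain)
        visited += 1
        i += 1
    return counts, visited


postings = {
    "A/B": {1, 2, 3},
    "A/C": {2, 4},
    "C/E": {1, 4, 5},
    "M/F": {3},
}
counts, visited = enum_prefix_counts(postings, domain={1, 2, 4}, prefix="A/")
```

The visited counter equals the number of terms carrying the prefix, not the number of terms in the field, which is the point made in the message: with 700K+ unique values but few starting with "A/", only those few intersections are computed.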
Re: prefix facet performance
The field is:

and using unique() I found that it has 700K+ unique values.

The query before (that takes ~10s):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/

the query after (that is almost instant):

wt=json&indent=true&q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/&facet.method=enum

Maria

On Fri, Apr 21, 2017 at 8:59 AM, alessandro.benedetti wrote:
> That is quite interesting !
> You can use the stats module ( in association with the Json facets if you
> need it) to calculate an accurate approximation of the unique values [1] [2].
>
> Good to know it improved your scenario, I may need to update my knowledge of
> term enum internals!
> Can you describe your schema configuration for the field and the way you
> were faceting before in comparison to the way you facet now ( with the
> related benefit)
>
> [1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> [2] http://yonik.com/solr-count-distinct/
Re: prefix facet performance
That is quite interesting!
You can use the stats module (in association with the JSON facets if you need it) to calculate an accurate approximation of the unique values [1] [2].

Good to know it improved your scenario, I may need to update my knowledge of term enum internals!
Can you describe your schema configuration for the field and the way you were faceting before, in comparison to the way you facet now (with the related benefit)?

[1] https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
[2] http://yonik.com/solr-count-distinct/

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331309.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: prefix facet performance
Actually using facet.method=enum made a HUGE difference even in my case where I have many unique values. I am happy with the query response time now.

Is there a way in SOLR to count the unique values for a field? If not, I could run the reindexing and count the unique values while I add them to give you a more accurate count of how many I have (there is a good chance that I have more than 500K).

Thanks,
Maria

On Fri, Apr 21, 2017 at 1:16 AM, alessandro.benedetti wrote:
> Hi Maria,
> If you have 100-500.000 unique values for the field you are interested in,
> and the cardinality of your search results is actually quite small in
> comparison, I am not that sure term enum will help you that much ...
>
> To simplify, with the term enum approach, you iterate over each unique
> value, if it matches the prefix and then you count the intersection of the
> result set with the posting list for that term.
> In your case, your result set is likely to be much smaller than the number
> of unique values.
> I would assume you are using the fc approach, which in my opinion was not a
> bad idea.
> Let's start from the algorithm you are using and the schema config for your
> field,
>
> Cheers
Re: prefix facet performance
Hi Maria,
If you have 100,000-500,000 unique values for the field you are interested in, and the cardinality of your search results is actually quite small in comparison, I am not that sure term enum will help you that much ...

To simplify: with the term enum approach, you iterate over each unique value and, if it matches the prefix, you count the intersection of the result set with the posting list for that term.
In your case, your result set is likely to be much smaller than the number of unique values.
I would assume you are using the fc approach, which in my opinion was not a bad idea.
Let's start from the algorithm you are using and the schema config for your field.

Cheers

---
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
View this message in context: http://lucene.472066.n3.nabble.com/prefix-facet-performance-tp4330684p4331221.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: prefix facet performance
Hmmm, not sure. Probably in the range of 100K-500K.

Before writing the email I was just looking at: http://yonik.com/facet-performance/

Wow, using facet.method=enum makes a big difference. I will read up on it to understand what it does.

Thank you so much.

Maria

On Tue, Apr 18, 2017 at 5:21 PM, Yonik Seeley wrote:
> How many unique values in the index?
> You could try facet.method=enum
>
> -Yonik
Re: prefix facet performance
How many unique values in the index?
You could try facet.method=enum

-Yonik

On Tue, Apr 18, 2017 at 8:16 PM, Maria Muslea wrote:
> Hi,
>
> I have ~40K documents in SOLR (not many) and a multivalued facet field that
> contains at least 2K values per document.
>
> The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and
> I use facet.prefix.
>
> q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/
>
> with "concept" defined as:
>
> This generates the output that I am looking for, but it takes more than 10
> seconds per query.
>
> Is there any way that I could improve the facet query performance for this
> example?
>
> Thank you,
>
> Maria
prefix facet performance
Hi,

I have ~40K documents in SOLR (not many) and a multivalued facet field that contains at least 2K values per document.

The values of the facet field look like: A/B, A/C, A/D, C/E, M/F, etc, and I use facet.prefix.

q=*:*&rows=0&facet=true&facet.field=concept&facet.prefix=A/

with "concept" defined as:

This generates the output that I am looking for, but it takes more than 10 seconds per query.

Is there any way that I could improve the facet query performance for this example?

Thank you,
Maria
Re: 5.4 facet performance thumbs-up
Awesome, thanks for the feedback!

-Yonik

On Tue, Dec 22, 2015 at 5:36 PM, Aigner, Max wrote:
> I'm happy to report that we are seeing significant speed-ups in our queries
> with Json facets on 5.4 vs regular facets on 5.1. Our queries contain mostly
> terms facets, many of them with exclusion tags and prefix filtering.
> Nice work!
5.4 facet performance thumbs-up
I'm happy to report that we are seeing significant speed-ups in our queries with Json facets on 5.4 vs regular facets on 5.1. Our queries contain mostly terms facets, many of them with exclusion tags and prefix filtering. Nice work!
Re: Reply: (Issue) How improve solr facet performance
Alice,

RE grouping, try Solr 4.8’s new “collapse” qparser w/ “expand” SearchComponent. The ref guide has the docs. It’s usually a faster equivalent approach to group=true.

Do you care to comment further on NewEgg’s apparent switch from Endeca to Solr? (confirm true/false and rationale)

~ David Smiley
Freelance Apache Lucene/Solr Search Consultant/Developer
http://www.linkedin.com/in/davidwsmiley

On Tue, May 27, 2014 at 4:17 AM, Alice.H.Yang (mis.cnsh04.Newegg) 41493 <alice.h.y...@newegg.com> wrote:
> Hi, Token
>
> 1.
> I set the 3 fields with hundreds of values uses fc and the rest
> uses enum, the performance is improved 2 times compared with no parameter,
> and then I add facet.method=20 , the performance is improved about 4 times
> compared with no parameter.
> And I also tried setting 9 facet field to one copyfield, I test
> the performance, it is improved about 2.5 times compared with no parameter.
> So, It is improved a lot under your advice, thanks a lot.
> 2.
> Now I have another performance issue, It's the group performance.
> The number of data is as same as facet performance scenario.
> When the keyword search hits about one million documents, the QTime is
> about 600ms.(It doesn't query the first time, it's in cache)
>
> Query url:
>
> select?fl=item_catalog&q=default_search:paramter&defType=edismax&rows=50&group=true&group.field=item_group_id&group.ngroups=true&group.sort=stock4sort%20desc,final_price%20asc,is_selleritem%20asc&sort=score%20desc,default_sort%20desc
>
> It need Qtime about 600ms.
>
> This query have two parameter:
> 1. fl one field
> 2. group=true,
> group.ngroups=true
>
> If I set group=false,, the QTime is only 1 ms.
> But I need do group and group.ngroups, How can I improve the group
> performance under this demand. Do you have some advice for me. I'm looking
> forward to your reply.
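
As a starting point for trying the collapse/expand suggestion against the grouping query quoted above, the sketch below swaps the group.* parameters for a collapse filter query and expand=true. It is only a sketch under assumptions: the helper name is invented, item_group_id is assumed to be a field type the collapse parser accepts, and the group.sort ordering from the original query is not reproduced here.

```python
from urllib.parse import urlencode, parse_qs


def collapse_expand_params(query, group_field, rows=50):
    """Build parameters that collapse on group_field instead of group=true."""
    return {
        "q": query,
        "defType": "edismax",
        "fl": "item_catalog",
        "rows": rows,
        # Collapse keeps one document per distinct value of group_field.
        "fq": "{!collapse field=%s}" % group_field,
        # The expand component returns the other members of each group.
        "expand": "true",
    }


params = collapse_expand_params("default_search:paramter", "item_group_id")
query_string = urlencode(params)
```

Whether this is actually faster than group=true with group.ngroups=true on this index is something only a measurement on the real data can show.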
Reply: (Issue) How improve solr facet performance
Hi Toke,

1. I set the 3 fields with hundreds of values to use fc and the rest to use enum; performance improved 2 times compared with no parameter. Then I added facet.method=20, and performance improved about 4 times compared with no parameter.
   I also tried copying the 9 facet fields into one copyField; in my test, performance improved about 2.5 times compared with no parameter.
   So it improved a lot under your advice, thanks a lot.

2. Now I have another performance issue: group performance. The amount of data is the same as in the facet performance scenario.
   When the keyword search hits about one million documents, the QTime is about 600ms. (This is not the first run of the query; it is already in cache.)

Query url:

select?fl=item_catalog&q=default_search:paramter&defType=edismax&rows=50&group=true&group.field=item_group_id&group.ngroups=true&group.sort=stock4sort%20desc,final_price%20asc,is_selleritem%20asc&sort=score%20desc,default_sort%20desc

It needs a QTime of about 600ms.

This query has two parameters:
1. fl with one field
2. group=true, group.ngroups=true

If I set group=false, the QTime is only 1 ms.
But I need both group and group.ngroups. How can I improve the group performance under this requirement? Do you have some advice for me? I'm looking forward to your reply.

Best Regards,
Alice Yang
+86-021-51530666*41493
Floor 19,KaiKai Plaza,888,Wanhandu Rd,Shanghai(200042)

-----Original Message-----
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: May 24, 2014 15:17
To: solr-user@lucene.apache.org
Subject: RE: (Issue) How improve solr facet performance

Have you tried fine-tuning your fc/enum selection, so that the 3 fields with hundreds of values uses fc and the rest uses enum? That might halve your response time.

Besides the fine-grained fc/enum-selection above, you could try collapsing all 9 facet-fields into a single field.
RE: (Issue) How improve solr facet performance
Alice.H.Yang (mis.cnsh04.Newegg) 41493 [alice.h.y...@newegg.com] wrote:
> 1. I'm sorry, I have made a mistake, the total number of documents is 32
> Million, not 320 Million.
> 2. The system memory is large for solr index, OS total has 256G, I set the
> solr tomcat HEAPSIZE="-Xms25G -Xmx100G"

100G is a very high number. What special requirements dictate such a large heap size?

> Reply: 9 fields I facet on.

Solr treats each facet separately and with facet.method=fc and 10M hits, this means that it will iterate 9*10M = 90M document IDs and update the counters for those.

> Reply: 3 facet fields have one hundred unique values, other 6 facet fields'
> unique values are between 3 to 15.

So very low cardinality. This is confirmed by your low response time of 6ms for 2925 hits.

> And we test this scenario: If the number of facet fields' unique values is
> less we add facet.method=enum, there is a little to improve performance.

That is a shame: enum is normally the simple answer to a setup like yours. Have you tried fine-tuning your fc/enum selection, so that the 3 fields with hundreds of values use fc and the rest use enum? That might halve your response time.

Since the number of unique facets is so low, I do not think that DocValues can help you here. Besides the fine-grained fc/enum-selection above, you could try collapsing all 9 facet-fields into a single field. The idea behind this is that for facet.method=fc, performing faceting on a field with (for example) 300 unique values takes practically the same amount of time as faceting on a field with 1000 unique values: Faceting on a single slightly larger field is much faster than faceting on 9 smaller fields. After faceting with facet.limit=-1 on the single super-facet-field, you must match the returned values back to their original fields:

If you have the facet-fields

field0: 34
field1: 187
field2: 78432
field3: 3
...

then collapse them by or-ing a field-specific mask that is bigger than the max in any field, then put it all into a single field:

fieldAll: 0xA000 | 34
fieldAll: 0xA100 | 187
fieldAll: 0xA200 | 78432
fieldAll: 0xA300 | 3
...

perform the facet request on fieldAll with facet.limit=-1 and split the resulting counts with

for (entry: facetResultAll) {
  switch (0xFF00 & entry.value) {
    case 0xA000:
      field0.add(entry.value, entry.count);
      break;
    case 0xA100:
      field1.add(entry.value, entry.count);
      break;
    ...
  }
}

Regards,
Toke Eskildsen, State and University Library, Denmark
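
The combine-and-split idea can be tried outside Solr with a few lines. This is a minimal sketch, not the code from the message: it tags each value with a field index shifted by 32 bits instead of the 0xA000-style masks, on the assumption that every raw value fits in 32 bits, so a value such as 78432 cannot run into the tag bits. The counts are made-up sample data.

```python
from collections import defaultdict

FIELD_SHIFT = 32  # assumption: every raw facet value fits in 32 bits
VALUE_MASK = (1 << FIELD_SHIFT) - 1


def pack(field_index, value):
    """Tag a raw value with the index of the field it came from."""
    return (field_index << FIELD_SHIFT) | value


def split_counts(combined_counts, field_names):
    """Route counts from the single combined field back to their fields."""
    per_field = defaultdict(dict)
    for packed, count in combined_counts.items():
        field = field_names[packed >> FIELD_SHIFT]     # recover the tag
        per_field[field][packed & VALUE_MASK] = count  # recover the value
    return dict(per_field)


field_names = ["field0", "field1", "field2", "field3"]
# Hypothetical facet counts, as they might come back for the combined field.
combined = {
    pack(0, 34): 120,
    pack(1, 187): 45,
    pack(2, 78432): 7,
    pack(3, 3): 300,
}
result = split_counts(combined, field_names)
```

The packed values are what would be indexed into the single combined field; after the one facet request with facet.limit=-1, split_counts does the job of the switch statement above.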
fw: (Issue) How improve solr facet performance
Hi, Solr Developer Thanks very much for your timely reply. 1. I'm sorry, I have made a mistake, the total number of documents is 32 Million, not 320 Million. 2. The system memory is large for solr index, OS total has 256G, I set the solr tomcat HEAPSIZE="-Xms25G -Xmx100G" -How many fields are you faceting on? Reply: 9 fields I facet on. - How many unique values does your facet fields have (approximately)? Reply: 3 facet fields have one hundred unique values, other 6 facet fields' unique values are between 3 to 15. - What is the content of your facets (Strings, numbers?) Reply: 9 fields are all numbers. - Which facet.method do you use? Reply: Used the default facet.method=fc And we test this scenario: If the number of facet fields' unique values is less we add facet.method=enum, there is a little to improve performance. - What is the response time with faceting and a few thousand hits? Reply: QTime is 6 Best Regards, Alice Yang +86-021-51530666*41493 Floor 19,KaiKai Plaza,888,Wanhandu Rd,Shanghai(200042) -Original Message- From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] Sent: Friday, May 23, 2014 8:08 PM To: d...@lucene.apache.org Subject: Re: (Issue) How improve solr facet performance On Fri, 2014-05-23 at 11:45 +0200, Alice.H.Yang (mis.cnsh04.Newegg) 41493 wrote: > We are blocked by solr facet performance when query hits many > documents. (about 10,000,000) [320M documents, immediate response for plain search with 1M hits] > But when we add several facet.field to do facet ,QTime increaseto > 220ms or more. It is not clear whether your observation of increased response time is due to many hits or faceting in itself. - How many fields are you faceting on? - How many unique values does your facet fields have (approximately)? - What is the content of your facets (Strings, numbers?) - Which facet.method do you use? - What is the response time with faceting and a few thousand hits? 
> Do you have some advice on how improve the facet performance when hit > many documents. That depends on whether your bottleneck is the hitcount itself, the number of unique facet values or something third like I/O. - Toke Eskildsen, State and University Library, Denmark
RE: Facet performance
On Tue, October 22, 2013 5:23 PM Michael Lemke wrote:
>On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote:
>>On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
>>> QTime fc:
>>>never returns, webserver restarts itself after 30 min with 100% CPU
>>> load
>>
>>It might be because it dies due to garbage collection. But since more
>>memory (as your test server presumably has) just leads to the too many
>>values-error, there isn't much to do.
>
>Essentially, fc is out then.
>
>>
>>> QTime=41205 facet.prefix=q=frequent_word
>>> numFound=44532
>>>
>>> Same query repeated:
>>> QTime=225810 facet.prefix=q=ottomotor
>>> numFound=909
>>> QTime=199839 facet.prefix=q=ottomotor
>>> numFound=909
>>
>>I am stumped on this, sorry. I do not understand why the 'ottomotor'
>>query can take 5 times as long as the 'frequent_word'-one.
>
>I looked into this some more this morning. I noticed the java process was
>doing a lot of I/O as shown in Process Explorer. For the frequent_word it read
>about 180MB, for ottomotor is was about seven times as much, ~ 1,200 MB.
>

Got another observation today. The response time for q=ottomotor depends on facet.limit:

QTime=59300   facet.limit=2
QTime=69395   facet.limit=4
QTime=85208   facet.limit=6
QTime=158150  facet.limit=8
QTime=186276  facet.limit=10
QTime=231763  facet.limit=15
QTime=260437  facet.limit=20
QTime=312268  facet.limit=30

For q=frequent_word the effect is much less pronounced and shows only for facet.limit >= 15:

QTime=0      facet.limit=0
QTime=20535  facet.limit=1
QTime=13456  facet.limit=2
QTime=13925  facet.limit=4
QTime=13705  facet.limit=6
QTime=13924  facet.limit=8
QTime=13799  facet.limit=10
QTime=14361  facet.limit=15
QTime=14704  facet.limit=20
QTime=15189  facet.limit=30
QTime=16783  facet.limit=50
QTime=57128  facet.limit=500

It looks to me as if, in order to collect enough facets to fulfill the limit constraint, Solr has to read much more of the index in the case of the infrequent word.
>jconsole didn't show anything unusual according to our more experienced Java >experts here. Nor was the machine swapping. > >Is it possible to screw up an index such that this sort of faceting leads to >constant reading of the index? Something like full table scans in a db? > Michael
RE: Facet performance
On Tue, 2013-10-22 at 17:25 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
> On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote:
> >> This is with Solr 1.4.
> >Really ?
> >This sound really outdated to me.
> >Have you tried a tried more recent version, 4.5 just went out ?
>
> Sorry, can't. Too much `grown' stuff.

I did not see that. I guess I parsed it as 4.1. Well, that rules out DocValues and fcs (as far as I remember). I am a bit surprised that the limit on #terms with fc is also in 1.4. I thought it was introduced in a later version.

We too have been in a position where upgrading was hard due to homegrown addons. We even scrapped some DidYouMean-like functionality when going from 3.x to 4.x, but 4.x was so much better that there was little choice.

Last suggestion for using fc: Create 2 or more CONTENT-fields and choose between them randomly when indexing. Facet on all the CONTENT fields and merge the results. It will take a bit more RAM though, so it is still out on your (assumedly) 32 bit machine.

Regards,
Toke Eskildsen, State and University Library, Denmark
RE: Facet performance
On Tue, October 22, 2013 11:54 AM Andre Bois-Crettez wrote: > >> This is with Solr 1.4. >Really ? >This sound really outdated to me. >Have you tried a tried more recent version, 4.5 just went out ? Sorry, can't. Too much `grown' stuff. Michael
RE: Facet performance
On Tue, October 22, 2013 9:23 AM Toke Eskildsen wrote: >On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote: >> QTime fc: >>never returns, webserver restarts itself after 30 min with 100% CPU >> load > >It might be because it dies due to garbage collection. But since more >memory (as your test server presumably has) just leads to the too many >values-error, there isn't much to do. Essentially, fc is out then. > >> QTime=41205 facet.prefix=q=frequent_word >> numFound=44532 >> >> Same query repeated: >> QTime=225810 facet.prefix=q=ottomotor >> numFound=909 >> QTime=199839 facet.prefix=q=ottomotor >> numFound=909 > >I am stumped on this, sorry. I do not understand why the 'ottomotor' >query can take 5 times as long as the 'frequent_word'-one. I looked into this some more this morning. I noticed the java process was doing a lot of I/O as shown in Process Explorer. For the frequent_word it read about 180MB, for ottomotor is was about seven times as much, ~ 1,200 MB. jconsole didn’t show anything unusual according to our more experienced Java experts here. Nor was the machine swapping. Is it possible to screw up an index such that this sort of faceting leads to constant reading of the index? Something like full table scans in a db? Michael
Re: Facet performance
> This is with Solr 1.4.

Really? This sounds really outdated to me. Have you tried a more recent version? 4.5 just came out.

--
André Bois-Crettez
Software Architect
Search Developer
http://www.kelkoo.com/
RE: Facet performance
On Mon, 2013-10-21 at 16:57 +0200, Lemke, Michael SZ/HZA-ZSW wrote:
> QTime enum:
> 1st call: 1200
> subsequent calls: 200

Those numbers seem fine.

> QTime fc:
>never returns, webserver restarts itself after 30 min with 100% CPU
> load

It might be because it dies due to garbage collection. But since more memory (as your test server presumably has) just leads to the too many values-error, there isn't much to do.

> QTime=41205 facet.prefix=q=frequent_word
> numFound=44532
>
> Same query repeated:
> QTime=225810 facet.prefix=q=ottomotor
> numFound=909
> QTime=199839 facet.prefix=q=ottomotor
> numFound=909

I am stumped on this, sorry. I do not understand why the 'ottomotor' query can take 5 times as long as the 'frequent_word'-one.

> QTime=185948 facet.prefix=q=ottomotor
> numFound=909
>
> QTime=3344 facet.prefix=d q=ottomotor
> numFound=909

Fits with expectations.

> >- Documents in your index
> 13,434,414
>
> >- Unique values in the CONTENT field
> Not sure how to get this. In luke I find
> 21,797,514 term count CONTENT

Those are the relevant numbers for faceting. There is a limit of 2^24 (16M) terms for facet.method=fc, although I am a bit unsure if that is for the whole index or per segment.

Come to think of it, if you have a multi-segmented index, you might want to try facet.method=fcs. It should have faster startup than fc and better performance than enum for fields with a large number of unique values. Memory requirements should be between fc and enum.

> >- Xmx
> The maximum the system allows me to get: 1612m
>
> Maybe I have a hopelessly under-dimensioned server for this sort of things?

Well, 1612m should be enough for the faceting in itself; it is the startup that is the killer. A rule of thumb for fc is that the internal structure takes at least

  #docs*log(#references) + #references*log(#unique_values) bytes

If your content field is a description, let's say that each description has 40 words, which gives us 500M references from documents to facet values.
This translates to 13M*log(500M) + 500M*log(22M) bytes ~= 13M*29 + 500M*25 bytes ~= 380MB. Taking into account that building the structure has an overhead of 2-3 times that, we are approaching the memory limit of 1612m. If the index is updated, a new facet structure is built all over again while the old structure is still in memory.

If you need better performance on your large field I would suggest, in order of priority:
- facet.method=fcs
- facet.method=fcs with DocValues
- Shard your index and use facet.method=fc
- SOLR-2412 (https://issues.apache.org/jira/browse/SOLR-2412)

SOLR-2412 is a last resort, but it does have the same speed as facet.method=fc only without the 16M unique values limitation.

Regards,
Toke Eskildsen, State and University Library, Denmark
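For anyone who wants to plug their own numbers into the rule of thumb from the message above, here is a small Python sketch. It is not from the thread: the function name is hypothetical, and it assumes the logarithms are base 2 and rounded up (which reproduces the 29 and 25 used in the message). It returns the two terms separately, because they differ a lot in size: with the thread's numbers the first term alone is close to the quoted 380M, while the second is considerably larger, so the total depends heavily on whether the result is read as bits or bytes. Treat it as an order-of-magnitude aid only.

```python
import math


def fc_structure_estimate(docs: int, references: int, unique_values: int) -> tuple[int, int]:
    """Return the two terms of docs*log2(references) + references*log2(unique_values).

    Assumption: logarithms are base 2 and rounded up, i.e. the number of
    bits needed to address one reference or one unique value.
    """
    bits_per_reference = math.ceil(math.log2(references))
    bits_per_value = math.ceil(math.log2(unique_values))
    return docs * bits_per_reference, references * bits_per_value


# Numbers taken from the thread: 13M documents, ~500M references, ~22M unique terms.
docs_term, refs_term = fc_structure_estimate(
    docs=13_000_000, references=500_000_000, unique_values=22_000_000
)
total = docs_term + refs_term
print(f"docs term:       {docs_term:,}")
print(f"references term: {refs_term:,}")
print(f"total read as bits: {total / 8 / 1e6:,.0f} MB")
```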
RE: Facet performance
On Mon, October 21, 2013 10:04 AM, Toke Eskildsen wrote: >On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: >> Toke Eskildsen wrote: >> > Unfortunately the enum-solution is normally quite slow when there >> > are enough unique values to trigger the "too many > values"-exception. >> > [...] >> >> [...] And yes, the fc method was terribly slow in a case where it did >> work. Something like 20 minutes whereas enum returned within a few >> seconds. > >Err.. What? That sounds _very_ strange. You have millions of unique >values so fc should be a lot faster than enum, not the other way around. > >I assume the 20 minutes was for the first call. How fast does subsequent >calls return for fc? QTime enum: 1st call: 1200 subsequent calls: 200 QTime fc: never returns, webserver restarts itself after 30 min with 100% CPU load This is on the test system, the production system managed to return with "... Too many values for UnInvertedField faceting ...". However, I also have different faceting queries I played with today. One complete example: q=ottomotor&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 These are the results, all with facet.method=enum (fc doesn't work). They were executed in the sequence shown on an otherwise unused server: QTime=41205 facet.prefix=q=frequent_word numFound=44532 Same query repeated: QTime=225810 facet.prefix=q=ottomotor numFound=909 QTime=199839 facet.prefix=q=ottomotor numFound=909 QTime=0 facet.prefix=q=ottomotor jkdhwjfh numFound=0 QTime=0 facet.prefix=q=jkdhwjfh numFound=0 QTime=185948 facet.prefix=q=ottomotor numFound=909 QTime=3344 facet.prefix=d q=ottomotor numFound=909 QTime=3078 facet.prefix=d q=ottomotor numFound=909 QTime=3141 facet.prefix=d q=ottomotor numFound=909 The response time is obviously not dependent on the number of documents found. Caching doesn't kick in either. > > >Maybe you could provide some approximate numbers? I'll try, see below. 
Thanks for asking and having a closer look. > >- Documents in your index 13,434,414 >- Unique values in the CONTENT field Not sure how to get this. In luke I find 21,797,514 term count CONTENT Is that what you mean? >- Hits are returned from a typical query Hm, that can be anything between 0 and 40,000 or more. Or do you mean from the facets? Or do my tests above answer it? >- Xmx The maximum the system allows me to get: 1612m Maybe I have a hopelessly under-dimensioned server for this sort of things? Thanks a lot for your help, Michael
RE: Facet performance
On Fri, 2013-10-18 at 18:30 +0200, Lemke, Michael SZ/HZA-ZSW wrote: > Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: > > Unfortunately the enum-solution is normally quite slow when there > > are enough unique values to trigger the "too many > values"-exception. > > [...] > > [...] And yes, the fc method was terribly slow in a case where it did > work. Something like 20 minutes whereas enum returned within a few > seconds. Err.. What? That sounds _very_ strange. You have millions of unique values so fc should be a lot faster than enum, not the other way around. I assume the 20 minutes was for the first call. How fast does subsequent calls return for fc? Maybe you could provide some approximate numbers? - Documents in your index - Unique values in the CONTENT field - Hits are returned from a typical query - Xmx Regards, Toke Eskildsen, State and University Library, Denmark
RE: Facet performance
: >> 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: >> 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
: >
: >> The only difference is am empty facet.prefix in the first query.
: >If you index was just opened when you issued your queries, the first
: request will be notably slower than the second as the facet values might
: not be in the disk cache.
:
: I know but it shouldn't be orders of magnitudes as in this example, should it?

In and of itself: it can be, if your index is large enough and none of the disk pages are in the file system buffer.

More significantly, however, depending on how big your filterCache is, the first request could easily be caching all of the filters needed for the second query. At a minimum it's definitely caching your main query, which will be re-used and save a lot of time independent of the faceting.

-Hoss
Re: Facet performance
DocValues is the new black http://wiki.apache.org/solr/DocValues Otis -- Solr & ElasticSearch Support -- http://sematext.com/ SOLR Performance Monitoring -- http://sematext.com/spm On Fri, Oct 18, 2013 at 12:30 PM, Lemke, Michael SZ/HZA-ZSW wrote: > Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: >>Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: >>> 1. >>> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >>> 2. >>> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >> >>> The only difference is am empty facet.prefix in the first query. >> >>> The first query returns after some 20 seconds (QTime 2 in the result) >>> while >>> the second one takes only 80 msec (QTime 80). Why is this? >> >>If you index was just opened when you issued your queries, the first request >>will be notably slower than the second as the facet values might not be in > the disk cache. > > I know but it shouldn't be orders of magnitudes as in this example, should it? > >> >>Furthermore, for enum the difference between no prefix and some prefix is >>huge. As enum iterates values first (as opposed to fc that iterates hits >>first), limiting to only the values that starts with 'a' ought to speed up >>retrieval by a factor 10 or more. > > Thanks. That is what we sort of figured but it's good to know for sure. Of > course it begs the question if there is a way to speed this up? > >> >>> And as side note: facet.method=fc makes the queries run 'forever' and >>> eventually >>> fail with org.apache.solr.common.SolrException: Too many values for >>> UnInvertedField faceting on field CONTENT. >> >>An internal memory structure optimization in Solr limits the amount of >>possible unique values when using fc. It is not a bug as such, but more a >>consequence of a choice. 
Unfortunately the enum-solution is normally quite >>slow when there are enough unique values to trigger the "too many >>values"-exception. I know too little about the structures for DocValues to >>say if they will help here, but you might want to take a look at those. > > What is DocValues? Haven't heard of it yet. And yes, the fc method was > terribly slow in a case where it did work. Something like 20 minutes whereas > enum returned within a few seconds. > > Michael >
RE: Facet performance
Toke Eskildsen [mailto:t...@statsbiblioteket.dk] wrote: >Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote: >> 1. >> q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 >> 2. >> q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0 > >> The only difference is am empty facet.prefix in the first query. > >> The first query returns after some 20 seconds (QTime 2 in the result) >> while >> the second one takes only 80 msec (QTime 80). Why is this? > >If you index was just opened when you issued your queries, the first request >will be notably slower than the second as the facet values might not be in the disk cache. I know but it shouldn't be orders of magnitudes as in this example, should it? > >Furthermore, for enum the difference between no prefix and some prefix is >huge. As enum iterates values first (as opposed to fc that iterates hits >first), limiting to only the values that starts with 'a' ought to speed up >retrieval by a factor 10 or more. Thanks. That is what we sort of figured but it's good to know for sure. Of course it begs the question if there is a way to speed this up? > >> And as side note: facet.method=fc makes the queries run 'forever' and >> eventually >> fail with org.apache.solr.common.SolrException: Too many values for >> UnInvertedField faceting on field CONTENT. > >An internal memory structure optimization in Solr limits the amount of >possible unique values when using fc. It is not a bug as such, but more a >consequence of a choice. Unfortunately the enum-solution is normally quite >slow when there are enough unique values to trigger the "too many >values"-exception. I know too little about the structures for DocValues to say >if they will help here, but you might want to take a look at those. What is DocValues? Haven't heard of it yet. And yes, the fc method was terribly slow in a case where it did work. 
Something like 20 minutes whereas enum returned within a few seconds. Michael
RE: Facet performance
Lemke, Michael SZ/HZA-ZSW [lemke...@schaeffler.com] wrote:
> 1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
> 2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

> The only difference is am empty facet.prefix in the first query.

> The first query returns after some 20 seconds (QTime 2 in the result) while
> the second one takes only 80 msec (QTime 80). Why is this?

If your index was just opened when you issued your queries, the first request will be notably slower than the second as the facet values might not be in the disk cache.

Furthermore, for enum the difference between no prefix and some prefix is huge. As enum iterates values first (as opposed to fc that iterates hits first), limiting to only the values that start with 'a' ought to speed up retrieval by a factor 10 or more.

> And as side note: facet.method=fc makes the queries run 'forever' and eventually
> fail with org.apache.solr.common.SolrException: Too many values for
> UnInvertedField faceting on field CONTENT.

An internal memory structure optimization in Solr limits the number of possible unique values when using fc. It is not a bug as such, but more a consequence of a choice. Unfortunately the enum-solution is normally quite slow when there are enough unique values to trigger the "too many values"-exception. I know too little about the structures for DocValues to say if they will help here, but you might want to take a look at those.

- Toke Eskildsen
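The difference between "values first" and "hits first" described above can be illustrated with a toy model in Python. This is not how Solr implements either method; the data, the function names and the visit counters are all made up for illustration. It only shows why a prefix reduces the work of a values-first loop (fewer terms to intersect), while a hits-first loop does work proportional to the number of hits.

```python
from collections import Counter

# Toy data (hypothetical): document id -> terms in the faceted field.
docs = {
    1: ["apple", "banana"],
    2: ["apple", "cherry"],
    3: ["avocado", "banana"],
    4: ["cherry"],
}
hits = {1, 2, 3}  # documents matching the main query

# Inverted view used by the values-first loop: term -> set of documents.
postings: dict[str, set[int]] = {}
for doc_id, terms in docs.items():
    for term in terms:
        postings.setdefault(term, set()).add(doc_id)


def facet_values_first(prefix: str = "") -> tuple[Counter, int]:
    """enum-style: visit every candidate term and intersect it with the hits."""
    counts, terms_visited = Counter(), 0
    for term, term_docs in postings.items():
        if not term.startswith(prefix):
            continue  # a prefix prunes terms before any intersection is done
        terms_visited += 1
        overlap = len(term_docs & hits)
        if overlap:
            counts[term] = overlap
    return counts, terms_visited


def facet_hits_first(prefix: str = "") -> tuple[Counter, int]:
    """fc-style: visit every hit and count the terms it references."""
    counts, docs_visited = Counter(), 0
    for doc_id in hits:
        docs_visited += 1
        for term in docs[doc_id]:
            if term.startswith(prefix):
                counts[term] += 1
    return counts, docs_visited


all_counts, visited_all = facet_values_first()
a_counts, visited_a = facet_values_first("a")
fc_counts, visited_docs = facet_hits_first()
print(visited_all, visited_a, visited_docs)
```

With millions of unique terms, the values-first loop without a prefix has to touch every term regardless of how few documents matched, which is consistent with the slow empty-prefix query in the thread.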
Facet performance
I am working with Solr facet fields and have come across a performance problem I don't understand. Consider these two queries:

1. q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
2. q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0

The only difference is an empty facet.prefix in the first query.

The first query returns after some 20 seconds (QTime 2 in the result) while the second one takes only 80 msec (QTime 80). Why is this?

And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT.

This is with Solr 1.4.
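To reproduce the comparison, the two requests can be generated from one parameter list so that facet.prefix is the only thing that varies. A minimal Python sketch follows; the base URL is a placeholder (the thread gives no host, port or core name), and `facet_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; the thread does not give a host or core name.
BASE_URL = "http://localhost:8983/solr/select"


def facet_query_url(prefix: str) -> str:
    """Build the facet request from the message, varying only facet.prefix."""
    params = [
        ("q", "word"),
        ("facet.field", "CONTENT"),
        ("facet", "true"),
        ("facet.prefix", prefix),
        ("facet.limit", "10"),
        ("facet.mincount", "1"),
        ("facet.method", "enum"),
        ("rows", "0"),
    ]
    return f"{BASE_URL}?{urlencode(params)}"


url_no_prefix = facet_query_url("")
url_prefix_a = facet_query_url("a")
print(url_no_prefix)
print(url_prefix_a)
```

When timing the two, it helps to run each several times and to alternate their order, since the first request after opening the index also pays for cold disk and cache state.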
Re: Multivalued fields and facet performance
Otis,

The reason I ask is that I run a number of sites on Solr, some with 10 million+ docs faceting on similar types of data, and have not seen anywhere near this length of initial delay. The main difference is that these sites facet on single-valued fields rather than multivalued ones, and that this site is searching on 3 times the volume of data. Would switching to single-valued fields (I'd rather not) make much of a difference?

I've also noticed that multivalued fields aren't populating the Lucene field cache. Is this the correct behaviour?

Regards
Howard

On 10 January 2011 14:55, Otis Gospodnetic wrote: > Hi Howard, > > This is normal. Your first query is reading a bunch of index data from > disk and > your RAM is then caching it. If your first query involves sorting, some > more > data for FieldCache is being read and stored. If there are multiple sort > fields, one such thing for each. If facets are involves, more of that > stuff. > If you are optimizing your index you are likely to be forcing more disk > IO > > Otis > > Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch > Lucene ecosystem search :: http://search-lucene.com/ > > > > - Original Message > > From: Howard Lee > > To: solr-user@lucene.apache.org > > Sent: Mon, January 10, 2011 8:59:03 AM > > Subject: Multivalued fields and facet performance > > > > Hi, > > > > I'd appreciate some explanation on what may be going on in the following > > scenario using multivalued fields and facets. > > > > Solr version: 1.5 > > > > Our index contains 35 million docs, and our search is using 2 > multivalued > > fields as facets. There are approx 5 million different values in one > field > > and 5000 in the other. We are seeing the following, and I'm curious as > what > > is actually happening in the background. > > > > The first search can take up to 5 minutes, all subsequent queries of any > q > > return in under a second. This is fine unless you are the first search > or > > new searcher.
> > > > I plan on adding a first searcher and new searcher in the config to > avoid > > long delays every time the index is updated (once a day) but I have > concerns > > of the length of the delay in launching a new searcher, and whether this > is > > causing too much overhead. > > > > Can someone explain to me what processes are going on in the backgroud > that > > cause this behaviour so I can understand the implications or make some > > adjustments in the config to compensate. > > > > thanx > > > > Howard > > > -- WORKDIGITAL LTD workdigital.co.uk 32-34 Broadwick Street W1A 2HG London, UK Howard Lee CEO M +44(0)7931 476 766 E how...@workdigital.co.uk workhound.co.uk - salarytrack.co.uk - twitterjobsearch.com - dreamjobalert.co.uk - recruitmentadnetwork.com
Re: Multivalued fields and facet performance
Hi Howard,

This is normal. Your first query is reading a bunch of index data from disk and your RAM is then caching it. If your first query involves sorting, some more data for FieldCache is being read and stored. If there are multiple sort fields, one such thing for each. If facets are involved, more of that stuff. If you are optimizing your index you are likely to be forcing more disk IO.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

- Original Message > From: Howard Lee > To: solr-user@lucene.apache.org > Sent: Mon, January 10, 2011 8:59:03 AM > Subject: Multivalued fields and facet performance > > Hi, > > I'd appreciate some explanation on what may be going on in the following > scenario using multivalued fields and facets. > > Solr version: 1.5 > > Our index contains 35 million docs, and our search is using 2 multivalued > fields as facets. There are approx 5 million different values in one field > and 5000 in the other. We are seeing the following, and I'm curious as what > is actually happening in the background. > > The first search can take up to 5 minutes, all subsequent queries of any q > return in under a second. This is fine unless you are the first search or > new searcher. > > I plan on adding a first searcher and new searcher in the config to avoid > long delays every time the index is updated (once a day) but I have concerns > of the length of the delay in launching a new searcher, and whether this is > causing too much overhead. > > Can someone explain to me what processes are going on in the backgroud that > cause this behaviour so I can understand the implications or make some > adjustments in the config to compensate. > > thanx > > Howard >
Multivalued fields and facet performance
Hi,

I'd appreciate some explanation of what may be going on in the following scenario using multivalued fields and facets.

Solr version: 1.5

Our index contains 35 million docs, and our search is using 2 multivalued fields as facets. There are approx 5 million different values in one field and 5000 in the other. We are seeing the following, and I'm curious as to what is actually happening in the background.

The first search can take up to 5 minutes; all subsequent queries for any q return in under a second. This is fine unless you are the first search or a new searcher.

I plan on adding a first searcher and new searcher in the config to avoid long delays every time the index is updated (once a day), but I have concerns about the length of the delay in launching a new searcher, and whether this is causing too much overhead.

Can someone explain to me what processes are going on in the background that cause this behaviour, so I can understand the implications or make some adjustments in the config to compensate?

Thanks

Howard
facet performance when number of values is large
I have a facet field whose values are created by users, so potentially there could be a very large number of values. Is that going to be a problem performance-wise?

A few more questions to help me understand how faceting works:

- After the filter cache has warmed up, will any performance problems caused by a large number of facet values go away? I thought that would be the case, but according to the benchmark here: http://wiki.apache.org/solr/HierarchicalFaceting SOLR-64 still had very poor performance even after the filter caches were warmed.
- The wiki states that facet.method=fc is excellent for situations where the number of indexed values for the field is high. Would that be the solution?
Re: facet performance tips
Right, I haven't used SOLR-475 yet and am more familiar with Bobo. I believe there are differences but I haven't gone into them yet. As I'm using Solr 1.4 now, maybe I'll test the UnInvertedField modality. Feel free to report back results as I don't think I've seen much yet? On Thu, Aug 13, 2009 at 10:51 AM, Fuad Efendi wrote: > SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to > be); check this > http://issues.apache.org/jira/browse/SOLR-475 > (and probably http://issues.apache.org/jira/browse/SOLR-711) > > -Original Message- > From: Jason Rutherglen > > Yeah we need a performance comparison, I haven't had time to put > one together. If/when I do I'll compare Bobo performance against > Solr bitset intersection based facets, compare memory > consumption. > > For near realtime Solr needs to cache and merge bitsets at the > SegmentReader level, and Bobo needs to be upgraded to work with > Lucene 2.9's searching at the segment level (currently it uses a > MultiSearcher). > > Distributed search on either should be fairly straightforward? > > On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: >> It seems BOBO-Browse is alternate faceting engine; would be interesting to >> compare performance with SOLR... Distributed? >> >> >> -Original Message- >> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] >> Sent: August-12-09 6:12 PM >> To: solr-user@lucene.apache.org >> Subject: Re: facet performance tips >> >> For your fields with many terms you may want to try Bobo >> http://code.google.com/p/bobo-browse/ which could work well with your >> case. >> >> >> >> >> > > >
RE: facet performance tips
SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to be); check this http://issues.apache.org/jira/browse/SOLR-475 (and probably http://issues.apache.org/jira/browse/SOLR-711) -Original Message- From: Jason Rutherglen Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
Re: facet performance tips
Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
RE: facet performance tips
Interesting, it has "BoboRequestHandler implements SolrRequestHandler" - easy to try it; and shards support [Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? [Jason Rutherglen] For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? -Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: August-12-09 6:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
I took 1.4 from trunk three days ago, it seems Ok for production (at least for my Master instance which is doing writes-only). I use the same config files. 500 000 terms are Ok too; I am using several millions with pre-1.3 SOLR taken from trunk. However, do not try to "facet" (probably outdated term after SOLR-475) on generic queries such as [* TO *] (with huge resultset). For smaller query results (100,000 instead of 100,000,000) "counting terms" is fast enough (few milliseconds at http://www.tokenizer.org) -Original Message- From: Jérôme Etévé [mailto:jerome.et...@gmail.com] Sent: August-13-09 5:38 AM To: solr-user@lucene.apache.org Subject: Re: facet performance tips Thanks everyone for your advices. I increased my filterCache, and the faceting performances improved greatly. My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work. So I guess my best move would be to upgrade to the soon to be 1.4 version of solr to benefit from its new faceting method. I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files? Thanks ! Jerome. 2009/8/13 Stephen Duncan Jr : > Note that depending on the profile of your field (full text and how many > unique terms on average per document), the improvements from 1.4 may not > apply, as you may exceed the limits of the new faceting technique in Solr > 1.4. > -Stephen > > On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > >> Yes, increasing the filterCache size will help with Solr 1.3 performance. >> >> Do note that trunk (soon Solr 1.4) has dramatically improved faceting >> performance. 
>> >>Erik >> >> >> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: >> >> Hi everyone, >>> >>> I'm using some faceting on a solr index containing ~ 160K documents. >>> I perform facets on multivalued string fields. The number of possible >>> different values is quite large. >>> >>> Enabling facets degrades the performance by a factor 3. >>> >>> Because I'm using solr 1.3, I guess the facetting makes use of the >>> filter cache to work. My filterCache is set >>> to a size of 2048. I also noticed in my solr stats a very small ratio >>> of cache hit (~ 0.01%). >>> >>> Can it be the reason why the faceting is slow? Does it make sense to >>> increase the filterCache size so it matches more or less the number >>> of different possible values for the faceted fields? Would that not >>> make the memory usage explode? >>> >>> Thanks for your help ! >>> >>> -- >>> Jerome Eteve. >>> >>> Chat with me live at http://www.eteve.net >>> >>> jer...@eteve.net >>> >> >> > > > -- > Stephen Duncan Jr > www.stephenduncanjr.com > -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
Re: facet performance tips
Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly. My faceted field can have at the moment ~4 different terms, so I set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work. So I guess my best move would be to upgrade to the soon to be 1.4 version of solr to benefit from its new faceting method. I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files? Thanks ! Jerome. 2009/8/13 Stephen Duncan Jr : > Note that depending on the profile of your field (full text and how many > unique terms on average per document), the improvements from 1.4 may not > apply, as you may exceed the limits of the new faceting technique in Solr > 1.4. > -Stephen > > On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > >> Yes, increasing the filterCache size will help with Solr 1.3 performance. >> >> Do note that trunk (soon Solr 1.4) has dramatically improved faceting >> performance. >> >>Erik >> >> >> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: >> >> Hi everyone, >>> >>> I'm using some faceting on a solr index containing ~ 160K documents. >>> I perform facets on multivalued string fields. The number of possible >>> different values is quite large. >>> >>> Enabling facets degrades the performance by a factor 3. >>> >>> Because I'm using solr 1.3, I guess the faceting makes use of the >>> filter cache to work. My filterCache is set >>> to a size of 2048. I also noticed in my solr stats a very small ratio >>> of cache hit (~ 0.01%). >>> >>> Can it be the reason why the faceting is slow? 
Does it make sense to >>> increase the filterCache size so it matches more or less the number >>> of different possible values for the faceted fields? Would that not >>> make the memory usage explode? >>> >>> Thanks for your help ! >>> >>> -- >>> Jerome Eteve. >>> >>> Chat with me live at http://www.eteve.net >>> >>> jer...@eteve.net >>> >> >> > > > -- > Stephen Duncan Jr > www.stephenduncanjr.com > -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
Re: facet performance tips
Note that depending on the profile of your field (full text and how many unique terms on average per document), the improvements from 1.4 may not apply, as you may exceed the limits of the new faceting technique in Solr 1.4. -Stephen On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote: > Yes, increasing the filterCache size will help with Solr 1.3 performance. > > Do note that trunk (soon Solr 1.4) has dramatically improved faceting > performance. > >Erik > > > On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > > Hi everyone, >> >> I'm using some faceting on a solr index containing ~ 160K documents. >> I perform facets on multivalued string fields. The number of possible >> different values is quite large. >> >> Enabling facets degrades the performance by a factor 3. >> >> Because I'm using solr 1.3, I guess the facetting makes use of the >> filter cache to work. My filterCache is set >> to a size of 2048. I also noticed in my solr stats a very small ratio >> of cache hit (~ 0.01%). >> >> Can it be the reason why the faceting is slow? Does it make sense to >> increase the filterCache size so it matches more or less the number >> of different possible values for the faceted fields? Would that not >> make the memory usage explode? >> >> Thanks for your help ! >> >> -- >> Jerome Eteve. >> >> Chat with me live at http://www.eteve.net >> >> jer...@eteve.net >> > > -- Stephen Duncan Jr www.stephenduncanjr.com
Re: facet performance tips
For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case. On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendi wrote: > I am currently faceting on tokenized multi-valued field at > http://www.tokenizer.org (25 mlns simple docs) > > It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and > non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667) > > Average "faceting" on query results: 0.2 - 0.3 seconds; without those > patches - 20-50 seconds. > > I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 & SOLR-667) and > to compare results... > > > > > P.S. > Avoid faceting on a field with heavy distribution of terms (such as few > millions of terms in my case); It won't work in SOLR 1.3. > > TIP: use non-tokenized single-valued field for faceting, such as > non-tokenized "country" field. > > > > P.P.S. > Would be nice to load/stress > http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against > putting CPU in a spin loop ConcurrentHashMap. > > > > -Original Message- > From: Erik Hatcher [mailto:ehatc...@apache.org] > Sent: August-12-09 2:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > Yes, increasing the filterCache size will help with Solr 1.3 > performance. > > Do note that trunk (soon Solr 1.4) has dramatically improved faceting > performance. > > Erik > > On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > >> Hi everyone, >> >> I'm using some faceting on a solr index containing ~ 160K documents. >> I perform facets on multivalued string fields. The number of possible >> different values is quite large. >> >> Enabling facets degrades the performance by a factor 3. >> >> Because I'm using solr 1.3, I guess the facetting makes use of the >> filter cache to work. My filterCache is set >> to a size of 2048. I also noticed in my solr stats a very small ratio >> of cache hit (~ 0.01%). 
>> >> Can it be the reason why the faceting is slow? Does it make sense to >> increase the filterCache size so it matches more or less the number >> of different possible values for the faceted fields? Would that not >> make the memory usage explode? >> >> Thanks for your help ! >> >> -- >> Jerome Eteve. >> >> Chat with me live at http://www.eteve.net >> >> jer...@eteve.net > > > >
RE: facet performance tips
I am currently faceting on a tokenized multi-valued field at http://www.tokenizer.org (25 million simple docs). It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a non-synchronized cache (similar to LingPipe's FastCache, SOLR-665, SOLR-667). Average "faceting" time on query results: 0.2 - 0.3 seconds; without those patches, 20-50 seconds. I am going to upgrade to SOLR-1.4 from trunk (with SOLR-475 & SOLR-667) and compare results... P.S. Avoid faceting on a field with a heavy distribution of terms (such as a few million terms in my case); it won't work in SOLR 1.3. TIP: use a non-tokenized single-valued field for faceting, such as a non-tokenized "country" field. P.P.S. Would be nice to load/stress http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html against putting CPU in a spin loop ConcurrentHashMap. -Original Message- From: Erik Hatcher [mailto:ehatc...@apache.org] Sent: August-12-09 2:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips Yes, increasing the filterCache size will help with Solr 1.3 performance. Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance. Erik On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: > Hi everyone, > > I'm using some faceting on a solr index containing ~ 160K documents. > I perform facets on multivalued string fields. The number of possible > different values is quite large. > > Enabling facets degrades the performance by a factor 3. > > Because I'm using solr 1.3, I guess the faceting makes use of the > filter cache to work. My filterCache is set > to a size of 2048. I also noticed in my solr stats a very small ratio > of cache hit (~ 0.01%). > > Can it be the reason why the faceting is slow? Does it make sense to > increase the filterCache size so it matches more or less the number > of different possible values for the faceted fields? Would that not > make the memory usage explode? > > Thanks for your help ! > > -- > Jerome Eteve. 
> > Chat with me live at http://www.eteve.net > > jer...@eteve.net
Re: facet performance tips
Yes, increasing the filterCache size will help with Solr 1.3 performance. Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance. Erik On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote: Hi everyone, I'm using some faceting on a solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large. Enabling facets degrades the performance by a factor 3. Because I'm using solr 1.3, I guess the facetting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my solr stats a very small ratio of cache hit (~ 0.01%). Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode? Thanks for your help ! -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
RE: facet performance tips
Jerome, Yes, you need to increase the filterCache size to something close to the number of unique facet values. But also consider the RAM required to accommodate the increase. I did see a significant performance gain by increasing the filterCache size. Thanks, Kalyan Manepalli -Original Message- From: Jérôme Etévé [mailto:jerome.et...@gmail.com] Sent: Wednesday, August 12, 2009 12:31 PM To: solr-user@lucene.apache.org Subject: facet performance tips Hi everyone, I'm using some faceting on a solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large. Enabling facets degrades the performance by a factor 3. Because I'm using solr 1.3, I guess the faceting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my solr stats a very small ratio of cache hit (~ 0.01%). Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode? Thanks for your help ! -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
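To put a rough number on the RAM question raised here, the following is a back-of-the-envelope sketch in Python. It assumes every cached filter is held as a bitset with one bit per document in the index; that is a worst case, since sparse filters may be stored more compactly. The function names are made up for illustration and are not part of Solr.

```python
def bitset_bytes(max_doc: int) -> int:
    """Bytes for one filter stored as a bitset with one bit per document."""
    return (max_doc + 7) // 8


def worst_case_cache_bytes(max_doc: int, entries: int) -> int:
    """Upper bound: assumes every cached filter is a full-size bitset."""
    return bitset_bytes(max_doc) * entries


if __name__ == "__main__":
    max_doc = 160_000  # "~ 160K documents" from the original question
    # 2048 is the cache size in the question; 500 000 is the term count
    # mentioned later in the thread.
    for entries in (2_048, 500_000):
        mib = worst_case_cache_bytes(max_doc, entries) / 2**20
        print(f"{entries:>7} filters -> about {mib:,.0f} MiB worst case")
```

Under that assumption a 2048-entry cache tops out near 39 MiB, while one entry per term for 500 000 terms could approach 9.3 GiB, which is why the heap has to be considered before simply matching the cache size to the term count.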
facet performance tips
Hi everyone, I'm using some faceting on a solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large. Enabling facets degrades the performance by a factor of 3. Because I'm using solr 1.3, I guess the faceting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my solr stats a very low cache hit ratio (~ 0.01%). Could that be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode? Thanks for your help ! -- Jerome Eteve. Chat with me live at http://www.eteve.net jer...@eteve.net
Re: Facet Performance
Hoss, This is still an extremely interesting area for possible improvements; I simply don't want the topic to die: http://www.nabble.com/Facet-Performance-td7746964.html http://issues.apache.org/jira/browse/SOLR-665 http://issues.apache.org/jira/browse/SOLR-667 http://issues.apache.org/jira/browse/SOLR-669 I am currently faceting on a single-valued _tokenized_ field with a huge number of documents, using an _unsynchronized_ version of FIFOCache; average response time is 1.5 seconds (for faceted queries only!). I think we could use an additional cache for facet results (to store calculated values!); Lucene's FieldCache can be used only for non-tokenized, single-valued, non-Boolean fields. -Fuad hossman_lucene wrote: > > > : Unfortunately which strategy will be chosen is currently undocumented > : and control is a bit oblique: If the field is tokenized or multivalued > : or Boolean, the FilterQuery method will be used; otherwise the > : FieldCache method. I expect I or others will improve that shortly. > > Bear in mind, what's provided out of the box is "SimpleFacets" ... it's > designed to meet simple faceting needs ... when you start talking about > 100s or thousands of constraints per facet, you are getting outside the > scope of what it was intended to serve efficiently. > > At a certain point the only practical thing to do is write a custom > request handler that makes the best choices for your data. > > For the record: a really simple patch someone could submit would be to > add an optional field-based param indicating which type of faceting > (termenum/fieldcache) should be used to generate the list of terms and > then make SimpleFacets.getFacetFieldCounts use that and call the > appropriate method instead of calling getTermCounts -- that way you could > force one or the other if you know it's better for your data/query. 
> > > > -Hoss > > > -- View this message in context: http://www.nabble.com/Facet-Performance-tp7746964p18756500.html Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet Performance
Erik Hatcher wrote: On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote: My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it name on the UI, with various roles like author, painter, etc). You can see this in action here: In our data, we don't have a unique author for each record ... so let's say out of the 500,000 records ... we have 200,000 authors. What I am trying to display is the top 10 authors from the results of a search. So I do a search for title:"Gone with the wind" and I would like to see the top 10 matching authors from these results. But no worries, I have written my own facet handler and I am now back to under a second with faceting! Thanks for everyone's help and keep up the good work! Andrew
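The "top 10 authors from the results of a search" idea can be sketched in a few lines of Python. This is not Andrew's facet handler, which is not shown in the thread; it is only a minimal illustration that assumes each hit is a dict with a multivalued "author" list, and the sample data is invented.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_authors(hits: Iterable[Dict[str, List[str]]],
                n: int = 10) -> List[Tuple[str, int]]:
    """Count authors over the matching documents and keep the n most common."""
    counts: Counter = Counter()
    for doc in hits:
        # set() so a document that lists the same author twice counts once
        counts.update(set(doc.get("author", [])))
    return counts.most_common(n)


# Invented sample hits, standing in for the documents matched by a query.
hits = [
    {"author": ["Author A"]},
    {"author": ["Author A", "Author B"]},
    {"author": []},
]
print(top_authors(hits))  # [('Author A', 2), ('Author B', 1)]
```

The cost grows with the number of hits rather than with the number of distinct authors in the whole index, which fits the case of a narrow query against a field with a very large set of values.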
Re: Facet Performance
: Unfortunately which strategy will be chosen is currently undocumented : and control is a bit oblique: If the field is tokenized or multivalued : or Boolean, the FilterQuery method will be used; otherwise the : FieldCache method. I expect I or others will improve that shortly. Bear in mind, what's provided out of the box is "SimpleFacets" ... it's designed to meet simple faceting needs ... when you start talking about 100s or thousands of constraints per facet, you are getting outside the scope of what it was intended to serve efficiently. At a certain point the only practical thing to do is write a custom request handler that makes the best choices for your data. For the record: a really simple patch someone could submit would be to add an optional field-based param indicating which type of faceting (termenum/fieldcache) should be used to generate the list of terms and then make SimpleFacets.getFacetFieldCounts use that and call the appropriate method instead of calling getTermCounts -- that way you could force one or the other if you know it's better for your data/query. -Hoss
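A per-field switch of the kind described here is available in Solr 1.4 and later as the facet.method request parameter, with enum for the term-enumeration/filter approach and fc for the field-cache approach, and a per-field form f.<field>.facet.method. The Python sketch below only builds such a query string; the helper name is invented, and the field names are borrowed from elsewhere in this thread purely as examples.

```python
from typing import Dict
from urllib.parse import parse_qs, urlencode


def facet_params(q: str, methods: Dict[str, str]) -> str:
    """Build a query string that facets on each field with a chosen method."""
    params = [("q", q), ("rows", "0"), ("facet", "true")]
    for field, method in methods.items():
        if method not in ("enum", "fc"):
            raise ValueError(f"unsupported facet method: {method}")
        params.append(("facet.field", field))
        # Per-field override of the faceting method
        params.append((f"f.{field}.facet.method", method))
    return urlencode(params)


# Few distinct values -> enum; many distinct values -> fc (example choice only).
query_string = facet_params('title:"Gone with the wind"',
                            {"language": "enum", "author": "fc"})
print(query_string)
```

Which method is faster still depends on the data, so the mapping above is a starting point to measure against, not a recommendation.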
Re: Facet Performance
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote: My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it name on the UI, with various roles like author, painter, etc). You can see this in action here: http://www.nines.org/collex -- type "da" into the name field, for example. I developed a custom request handler in Solr for returning these types of suggest interfaces complete with facet counts. My code is very specific to our fields, so it's not usable in a general sense, but maybe this gives you some ideas on where to go with these large sets of facet values. Erik
Re: Facet Performance
J.J. Larrea wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly. Good to hear, because I can't really get away with not having a multi-valued field for author. I'm really excited by Solr and really impressed so far. Thanks! Andrew
Re: Facet Performance
On 12/8/06, J.J. Larrea <[EMAIL PROTECTED]> wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. If anyone had time some of this could be documented here: http://wiki.apache.org/solr/SimpleFacetParameters The wiki is open to all. Or perhaps a new top level FacetedSearching page that references SimpleFacetParameters -Yonik
Re: Facet Performance
Andrew Nagy, ditto on what Yonik said. Here is some further elaboration: I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds. When I changed the field to "string", i.e. solr.StrField, the faceting code recognized it as untokenized and used the FieldCache approach. Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results. Quite a difference. The strategy must be chosen on a field-by-field basis. While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author. Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly. - J.J. At 2:58 PM -0500 12/8/06, Yonik Seeley wrote: >Right, if any of these are tokenized, then you could make them >non-tokenized (use "string" type). If they really need to be >tokenized (author for example), then you could use copyField to make >another copy to a non-tokenized field that you can use for faceting. > >After that, as Hoss suggests, run a single faceting query with all 4 >fields and look at the filterCache statistics. Take the "lookups" >number and multiply it by, say, 1.5 to leave some room for future >growth, and use that as your cache size. You probably want to bump up >both initialSize and autowarmCount as well. > >The first query will still be slow. The second should be relatively fast. >You may hit an OOM error. Increase the JVM heap size if this happens. > >-Yonik
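The selection rule J.J. describes can be written down as a small Python sketch. This only restates the rule as given in the message, not Solr's actual source; the class and function names are hypothetical, and "tokenized" here follows his observation that any solr.TextField counts as tokenized, whatever tokenizer it uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldInfo:
    """Hypothetical summary of a schema field, for illustration only."""
    name: str
    tokenized: bool = False
    multivalued: bool = False
    boolean: bool = False


def facet_strategy(field: FieldInfo) -> str:
    """Restate the rule from the message: filters unless the field is a
    plain single-valued, untokenized, non-Boolean field."""
    if field.tokenized or field.multivalued or field.boolean:
        return "filter-per-term"   # one cached filter per unique term
    return "fieldcache"            # one pass over the hits


# A TextField with KeywordTokenizerFactory is still treated as tokenized.
author_text = FieldInfo("author", tokenized=True)
author_str = FieldInfo("author_facet")
print(facet_strategy(author_text), facet_strategy(author_str))
```

Read this way, the copyField suggestion quoted above amounts to keeping a tokenized field for searching and a separate single-valued string copy that lands in the "fieldcache" branch for faceting.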
Re: Facet Performance
Yonik Seeley wrote: Are they multivalued, and do they need to be? Anything that is of type "string" and not multivalued will use the lucene FieldCache rather than the filterCache. The author field is multivalued. Will this be a strong performance issue? I could make multiple author fields so as not to have a multivalued field, and then only facet on the first author. Thanks Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: Chris Hostetter wrote: >: Could you suggest a better configuration based on this? > >If that's what your stats look like after a single request, then i would >guess you would need to make your cache size at least 1.6 million in order >for it to be of any use in improving your facet speed. > > Would this have any strong impact on my system? Should I just set it to an even 2 million to allow for growth? Change the following in solrconfig.xml, and you should be fine with a higher setting. true to false That will prevent the filterCache from being used for anything but filters and faceting, so if you set it too high, it won't be utilized anyway. >: My data is 492,000 records of book data. I am faceting on 4 fields: >: author, subject, language, format. >: Format and language are fairly simple as there are only a few unique >: terms. Author and subject however are much different in that there are >: thousands of unique terms. > >by the looks of it, you have a lot more than a few thousand unique terms >in those two fields ... are you tokenizing on these fields? that's >probably not what you want for fields you're going to facet on. > > All of these fields are set as "string" in my schema Are they multivalued, and do they need to be? Anything that is of type "string" and not multivalued will use the lucene FieldCache rather than the filterCache. -Yonik
Re: Facet Performance
Chris Hostetter wrote: : Could you suggest a better configuration based on this? If that's what your stats look like after a single request, then i would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed. Would this have any strong impact on my system? Should I just set it to an even 2 million to allow for growth? : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as there are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. All of these fields are set as "string" in my schema, so if I understand the fields correctly, they are not being tokenized. I also have an author field that is set as "text" for searching. Thanks Andrew
Re: Facet Performance
On 12/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as there are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. Right, if any of these are tokenized, then you could make them non-tokenized (use "string" type). If they really need to be tokenized (author for example), then you could use copyField to make another copy to a non-tokenized field that you can use for faceting. After that, as Hoss suggests, run a single faceting query with all 4 fields and look at the filterCache statistics. Take the "lookups" number and multiply it by, say, 1.5 to leave some room for future growth, and use that as your cache size. You probably want to bump up both initialSize and autowarmCount as well. The first query will still be slow. The second should be relatively fast. You may hit an OOM error. Increase the JVM heap size if this happens. -Yonik
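Yonik's sizing rule of thumb is simple arithmetic; here it is as a Python sketch applied to the lookups figure of 1,530,036 that Andrew reports elsewhere in this thread. The function name and the default headroom argument are illustrative choices only.

```python
import math


def suggested_cache_size(lookups: int, headroom: float = 1.5) -> int:
    """Lookups seen for one full faceting query, scaled up for growth."""
    return math.ceil(lookups * headroom)


print(suggested_cache_size(1_530_036))  # 2295054
```

That result sits above the "at least 1.6 million" floor Hoss gives and above the "even 2 million" Andrew proposes, so the 1.5 multiplier is the more generous of the suggestions in the thread; whether the heap can hold that many filters is a separate question.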
Re: Facet Performance
: Here are the stats, I'm still a newbie to SOLR, so I'm not totally sure : what this all means: : lookups : 1530036 : hits : 2 : hitratio : 0.00 : inserts : 1530035 : evictions : 1504435 : size : 25600 those numbers are telling you that your cache is capable of holding 25,600 items. you have attempted to lookup something in the cache 1,530,036 times, and only 2 of those times did you get a hit. you have added 1,530,035 items to the cache, and 1,504,435 items have been removed from your cache to make room for newer items. in short: your cache isn't really helping you at all. : Could you suggest a better configuration based on this? If that's what your stats look like after a single request, then i would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed. : My data is 492,000 records of book data. I am faceting on 4 fields: : author, subject, language, format. : Format and language are fairly simple as there are only a few unique : terms. Author and subject however are much different in that there are : thousands of unique terms. by the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing on these fields? that's probably not what you want for fields you're going to facet on. -Hoss
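The reading of the cache statistics above can be checked mechanically. The Python sketch below plugs in the numbers from the message; the class and property names are invented for illustration and do not correspond to any Solr API.

```python
from dataclasses import dataclass


@dataclass
class CacheStats:
    lookups: int
    hits: int
    inserts: int
    evictions: int
    size: int

    @property
    def hit_ratio(self) -> float:
        """Fraction of lookups that were answered from the cache."""
        return self.hits / self.lookups if self.lookups else 0.0

    @property
    def retained(self) -> int:
        """Entries still cached: everything inserted minus everything evicted."""
        return self.inserts - self.evictions


stats = CacheStats(lookups=1_530_036, hits=2, inserts=1_530_035,
                   evictions=1_504_435, size=25_600)
print(f"hit ratio: {stats.hit_ratio:.6f}")       # hit ratio: 0.000001
print(f"entries retained: {stats.retained}")     # entries retained: 25600
```

The retained count equals the reported size of 25,600, so the cache is completely full and nearly every insert pushes an older entry out, which matches the conclusion that it is not helping.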
Re: Facet Performance
Yonik Seeley wrote: On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: I changed the filterCache to the following: However a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio? Here are the stats, I'm still a newbie to SOLR, so I'm not totally sure what this all means: lookups : 1530036 hits : 2 hitratio : 0.00 inserts : 1530035 evictions : 1504435 size : 25600 cumulative_lookups : 1530036 cumulative_hits : 2 cumulative_hitratio : 0.00 cumulative_inserts : 1530035 cumulative_evictions : 1504435 Could you suggest a better configuration based on this? Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast. Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it? My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple as there are only a few unique terms. Author and subject however are much different in that there are thousands of unique terms. Thanks for your help! Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote: I changed the filterCache to the following: However a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio? Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast. Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it? I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records and develop my own facets. I have in fact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow? It's computationally expensive to get exact facet counts for a large number of hits, and that is what the current faceting code is designed to do. No single method will be appropriate *and* fast for all scenarios. Another method that hasn't been implemented is some statistical faceting based on the top hits, using stored fields or stored term vectors. -Yonik
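To see why no single method suits every field, here is a toy Python sketch of the two exact-counting approaches discussed in this thread: one set intersection per unique term (the filter approach) versus one pass over the hits using a per-document value lookup (the field-cache approach). It is a simplified model with invented data and function names, not Solr's implementation, and it only covers a single-valued field.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Set


def counts_by_term_filters(hits: Set[int],
                           term_to_docs: Dict[str, Set[int]]) -> Counter:
    """Filter approach: work grows with the number of unique terms."""
    counts: Counter = Counter()
    for term, docs in term_to_docs.items():
        n = len(hits & docs)  # one intersection per term
        if n:
            counts[term] = n
    return counts


def counts_by_field_cache(hits: Iterable[int],
                          doc_to_term: List[Optional[str]]) -> Counter:
    """Field-cache approach: work grows with the number of hits."""
    counts: Counter = Counter()
    for doc in hits:
        term = doc_to_term[doc]  # single value per document
        if term is not None:
            counts[term] += 1
    return counts


# Invented index: document id -> language value.
doc_to_term: List[Optional[str]] = ["en", "en", "fr", "de", "en"]
term_to_docs: Dict[str, Set[int]] = {}
for doc_id, term in enumerate(doc_to_term):
    if term is not None:
        term_to_docs.setdefault(term, set()).add(doc_id)

hits = {0, 2, 4}
print(counts_by_term_filters(hits, term_to_docs))
print(counts_by_field_cache(hits, doc_to_term))
```

Both functions return the same counts; they differ in what drives the cost. With a handful of terms the filter approach is cheap and its filters are easy to cache, while with hundreds of thousands of terms every query pays for every term, which is the situation the 74-second query appears to be in.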
Re: Facet Performance
Yonik Seeley wrote: 1) facet on single-valued strings if you can 2) if you can't do (1) then enlarge the fieldcache so that the number of filters (one per possible term in the field you are filtering on) can fit. I changed the filterCache to the following: However a search that normally takes .04s is taking 74 seconds once I use the facets since I am faceting on 4 fields. Can you suggest a better configuration that would solve this performance issue, or should I not use faceting? I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records and develop my own facets. I have in fact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow? Andrew