Re: Facet Performance
Hoss,

This is still an extremely interesting area for possible improvements; I simply don't want the topic to die:
http://www.nabble.com/Facet-Performance-td7746964.html
http://issues.apache.org/jira/browse/SOLR-665
http://issues.apache.org/jira/browse/SOLR-667
http://issues.apache.org/jira/browse/SOLR-669

I am currently using faceting on a single-valued _tokenized_ field with a huge number of documents, and an _unsynchronized_ version of FIFOCache; 1.5 seconds average response time (for faceted queries only!).

I think we can use an additional cache for facet results (to store calculated values!); Lucene's FieldCache can be used only for non-tokenized, single-valued, non-boolean fields.

-Fuad

hossman_lucene wrote:
> : Unfortunately which strategy will be chosen is currently undocumented
> : and control is a bit oblique: If the field is tokenized or multivalued
> : or Boolean, the FilterQuery method will be used; otherwise the
> : FieldCache method. I expect I or others will improve that shortly.
>
> Bear in mind, what's provided out of the box is "SimpleFacets" ... it's
> designed to meet simple faceting needs ... when you start talking about
> hundreds or thousands of constraints per facet, you are getting outside the
> scope of what it was intended to serve efficiently.
>
> At a certain point the only practical thing to do is write a custom
> request handler that makes the best choices for your data.
>
> For the record: a really simple patch someone could submit would be to
> add an optional field-based param indicating which type of faceting
> (termenum/fieldcache) should be used to generate the list of terms, and
> then make SimpleFacets.getFacetFieldCounts use that and call the
> appropriate method instead of calling getTermCounts -- that way you could
> force one or the other if you know it's better for your data/query.
> -Hoss

--
View this message in context: http://www.nabble.com/Facet-Performance-tp7746964p18756500.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> In September there was a thread [1] on this list about heterogeneous
> facets and their performance. I am having a similar issue and am unclear
> as to the resolution of this thread. I performed a search against my
> dataset (492,000 records) and got the results I am looking for in .3
> seconds. I then set facet to true and got results in 16 seconds, and the
> facets include data that is not in my result set; it is from the entire
> set. How do I limit the faceting to my result set and speed up the
> results?

1) facet on single-valued strings if you can
2) if you can't do (1), then enlarge the fieldcache so that the number of filters (one per possible term in the field you are filtering on) can fit.
3) facet counts are limited to the results of the query, filtered by any filters. Is there a reason you think they are not?

-Yonik
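For reference, Yonik's points map onto ordinary request parameters. A request like the following (host, core, and field names are illustrative, not taken from Andrew's schema) asks Solr to facet on a single-valued string field and reports counts restricted to the query's result set:

```
http://localhost:8983/solr/select?q=title:wind&rows=20
    &facet=true
    &facet.field=author_exact
    &facet.limit=10
```

`facet.field` must name an indexed field; `facet.limit` caps how many constraints come back per facet.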
Re: Facet Performance
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1), then enlarge the fieldcache so that the number of
> filters (one per possible term in the field you are filtering on) can fit.

I will try this out.

> 3) facet counts are limited to the results of the query, filtered by any
> filters. Is there a reason you think they are not?

No, you are right. I was thrown off at 1st.

One complaint about the faceting though: why is the element that is returned called "1st"? This seems like a poor choice for an element name. Why not just name the element what is in the "name" attribute? It would make parsing much easier!

Thanks!
Andrew
Re: Facet Performance
On 12/7/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> One complaint about the faceting though: why is the element that is
> returned called "1st"?

I think maybe you are seeing "lst" (it starts with an L, not a one). It is short for NamedList, an ordered list whose elements are named.

> This seems like a poor choice for an element name. Why not just name the
> element what is in the "name" attribute? It would make parsing much
> easier!

When the XML was first conceived, there was a preference for limiting the number of tags. The structure could have been inverted so that ...

-Yonik
Re: Facet Performance
: > This seems like a poor choice for an element
: > name. Why not just name the element what is in the "name" attribute?
: > It would make parsing much easier!
:
: When the XML was first conceived, there was a preference for limiting
: the number of tags.
: The structure could have been inverted so that ...

...but then we couldn't support arbitrary field names, and it would be impossible to validate the XML docs independent of the schema. See this previous explanation:

http://www.nabble.com/Default-XML-Output-Schema-tf2312439.html#a643

-Hoss
Re: Facet Performance
Yonik Seeley wrote:
> 1) facet on single-valued strings if you can
> 2) if you can't do (1), then enlarge the fieldcache so that the number of
> filters (one per possible term in the field you are filtering on) can fit.

I changed the filterCache to the following:

However, a search that normally takes .04s is taking 74 seconds once I use the facets, since I am faceting on 4 fields. Can you suggest a better configuration that would solve this performance issue, or should I not use faceting?

I figure I could run the query twice, once limited to 20 records and then again with the limit set to the total number of records, and develop my own facets. I have in fact done this before with a different back-end and my code is processed in under .01 seconds. Why is faceting so slow?

Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> I changed the filterCache to the following:
> However, a search that normally takes .04s is taking 74 seconds once I
> use the facets, since I am faceting on 4 fields.

The first time or subsequent times? Is your filterCache big enough yet? What do you see for evictions and hit ratio?

> Can you suggest a better configuration that would solve this performance
> issue, or should I not use faceting?

Faceting isn't something that will always be fast... one often needs to design things in a way that it can be fast.

Can you give some examples of your faceted queries? Can you show the field and fieldtype definitions for the fields you are faceting on? For each field that you are faceting on, how many different terms are in it?

> I figure I could run the query twice, once limited to 20 records and then
> again with the limit set to the total number of records, and develop my
> own facets. I have in fact done this before with a different back-end and
> my code is processed in under .01 seconds. Why is faceting so slow?

It's computationally expensive to get exact facet counts for a large number of hits, and that is what the current faceting code is designed to do. No single method will be appropriate *and* fast for all scenarios. Another method that hasn't been implemented is some statistical faceting based on the top hits, using stored fields or stored term vectors.

-Yonik
Re: Facet Performance
Yonik Seeley wrote:
> On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
>> I changed the filterCache to the following:
>> However, a search that normally takes .04s is taking 74 seconds once I
>> use the facets, since I am faceting on 4 fields.
>
> The first time or subsequent times? Is your filterCache big enough yet?
> What do you see for evictions and hit ratio?

Here are the stats. I'm still a newbie to SOLR, so I'm not totally sure what this all means:

lookups : 1530036
hits : 2
hitratio : 0.00
inserts : 1530035
evictions : 1504435
size : 25600
cumulative_lookups : 1530036
cumulative_hits : 2
cumulative_hitratio : 0.00
cumulative_inserts : 1530035
cumulative_evictions : 1504435

Could you suggest a better configuration based on this?

>> Can you suggest a better configuration that would solve this performance
>> issue, or should I not use faceting?
>
> Faceting isn't something that will always be fast... one often needs to
> design things in a way that it can be fast.
>
> Can you give some examples of your faceted queries? Can you show the
> field and fieldtype definitions for the fields you are faceting on? For
> each field that you are faceting on, how many different terms are in it?

My data is 492,000 records of book data. I am faceting on 4 fields: author, subject, language, format. Format and language are fairly simple, as there are only a few unique terms. Author and subject, however, are much different in that there are thousands of unique terms.

Thanks for your help!
Andrew
Re: Facet Performance
: Here are the stats. I'm still a newbie to SOLR, so I'm not totally sure
: what this all means:
: lookups : 1530036
: hits : 2
: hitratio : 0.00
: inserts : 1530035
: evictions : 1504435
: size : 25600

Those numbers are telling you that your cache is capable of holding 25,600 items. You have attempted to look something up in the cache 1,530,036 times, and only 2 of those times did you get a hit. You have added 1,530,035 items to the cache, and 1,504,435 items have been removed from your cache to make room for newer items.

In short: your cache isn't really helping you at all.

: Could you suggest a better configuration based on this?

If that's what your stats look like after a single request, then I would guess you would need to make your cache size at least 1.6 million in order for it to be of any use in improving your facet speed.

: My data is 492,000 records of book data. I am faceting on 4 fields:
: author, subject, language, format.
: Format and language are fairly simple, as there are only a few unique
: terms. Author and subject, however, are much different in that there are
: thousands of unique terms.

By the looks of it, you have a lot more than a few thousand unique terms in those two fields ... are you tokenizing these fields? That's probably not what you want for fields you're going to facet on.

-Hoss
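Hoss's sizing advice corresponds to the filterCache entry in solrconfig.xml. A minimal sketch, with the caveat that the numbers are illustrative; the size should be derived from your own "lookups" statistic, and a cache this large needs matching JVM heap:

```xml
<!-- solrconfig.xml: filterCache sized so one faceting request's filters
     all fit; 1600000 follows Hoss's estimate for this index, not a
     general-purpose default -->
<filterCache
    class="solr.LRUCache"
    size="1600000"
    initialSize="512"
    autowarmCount="256"/>
```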
Re: Facet Performance
On 12/8/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : My data is 492,000 records of book data. I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple, as there are only a few unique
> : terms. Author and subject, however, are much different in that there are
> : thousands of unique terms.
>
> By the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing these fields? That's
> probably not what you want for fields you're going to facet on.

Right, if any of these are tokenized, then you could make them non-tokenized (use the "string" type). If they really need to be tokenized (author, for example), then you could use copyField to make another copy to a non-tokenized field that you can use for faceting.

After that, as Hoss suggests, run a single faceting query with all 4 fields and look at the filterCache statistics. Take the "lookups" number and multiply it by, say, 1.5 to leave some room for future growth, and use that as your cache size. You probably want to bump up both initialSize and autowarmCount as well.

The first query will still be slow. The second should be relatively fast. You may hit an OOM error; increase the JVM heap size if this happens.

-Yonik
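Yonik's copyField suggestion might look like this in schema.xml (the field names here are hypothetical, chosen to match the thread's example):

```xml
<!-- schema.xml: tokenized field, used for searching -->
<field name="author" type="text" indexed="true" stored="true"
       multiValued="true"/>

<!-- untokenized "string" copy, used only for faceting -->
<field name="author_exact" type="string" indexed="true" stored="false"
       multiValued="true"/>

<!-- index every author value into both fields -->
<copyField source="author" dest="author_exact"/>
```

Queries then search on `author` but facet with `facet.field=author_exact`.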
Re: Facet Performance
Chris Hostetter wrote:
> : Could you suggest a better configuration based on this?
>
> If that's what your stats look like after a single request, then I would
> guess you would need to make your cache size at least 1.6 million in
> order for it to be of any use in improving your facet speed.

Would this have any strong impacts on my system? Should I just set it to an even 2 million to allow for growth?

> : My data is 492,000 records of book data. I am faceting on 4 fields:
> : author, subject, language, format.
> : Format and language are fairly simple, as there are only a few unique
> : terms. Author and subject, however, are much different in that there
> : are thousands of unique terms.
>
> By the looks of it, you have a lot more than a few thousand unique terms
> in those two fields ... are you tokenizing these fields? That's
> probably not what you want for fields you're going to facet on.

All of these fields are set as "string" in my schema, so if I understand the fields correctly, they are not being tokenized. I also have an author field that is set as "text" for searching.

Thanks
Andrew
Re: Facet Performance
On 12/8/06, Andrew Nagy <[EMAIL PROTECTED]> wrote:
> Chris Hostetter wrote:
> > : Could you suggest a better configuration based on this?
> >
> > If that's what your stats look like after a single request, then I
> > would guess you would need to make your cache size at least 1.6 million
> > in order for it to be of any use in improving your facet speed.
>
> Would this have any strong impacts on my system? Should I just set it to
> an even 2 million to allow for growth?

Change the following setting in solrconfig.xml from true to false, and you should be fine with a higher setting:

That will prevent the filterCache from being used for anything but filters and faceting, so if you set it too high, it won't be utilized anyway.

> > : My data is 492,000 records of book data. I am faceting on 4 fields:
> > : author, subject, language, format.
> > : Format and language are fairly simple, as there are only a few unique
> > : terms. Author and subject, however, are much different in that there
> > : are thousands of unique terms.
> >
> > By the looks of it, you have a lot more than a few thousand unique
> > terms in those two fields ... are you tokenizing these fields? That's
> > probably not what you want for fields you're going to facet on.
>
> All of these fields are set as "string" in my schema

Are they multivalued, and do they need to be? Anything that is of type "string" and not multivalued will use the Lucene FieldCache rather than the filterCache.

-Yonik
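The XML element itself was lost when the archive stripped markup. Based on Yonik's description (stop the filterCache being used for sorted queries), the setting he means is most likely useFilterForSortedQuery from the example solrconfig.xml of that era; treat the exact name as an assumption:

```xml
<!-- solrconfig.xml: assumed to be the setting Yonik refers to;
     false keeps sorted queries from consuming filterCache entries -->
<useFilterForSortedQuery>false</useFilterForSortedQuery>
```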
Re: Facet Performance
Yonik Seeley wrote:
> Are they multivalued, and do they need to be? Anything that is of type
> "string" and not multivalued will use the Lucene FieldCache rather than
> the filterCache.

The author field is multivalued. Will this be a strong performance issue? I could make multiple author fields so as not to have the multivalued field, and then only facet on the first author.

Thanks
Andrew
Re: Facet Performance
Andrew Nagy, ditto on what Yonik said. Here is some further elaboration:

I am doing much the same thing (faceting on Author etc.). When my Author field was defined as a solr.TextField, even using solr.KeywordTokenizerFactory so it wasn't actually tokenized, the faceting code chose the QueryFilter approach, and faceting on Author for 100k+ documents took about 4 seconds.

When I changed the field to "string", e.g. solr.StrField, the faceting code recognized it as untokenized and used the FieldCache approach. Times have dropped to about 120ms for the first query (when the FieldCache is generated) and < 10ms for subsequent queries returning a few thousand results. Quite a difference.

The strategy must be chosen on a field-by-field basis. While QueryFilter is excellent for fields with a small set of enumerated values such as Language or Format, it is inappropriate for large value sets such as Author.

Unfortunately, which strategy will be chosen is currently undocumented, and control is a bit oblique: if the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. I expect I or others will improve that shortly.

- J.J.

At 2:58 PM -0500 12/8/06, Yonik Seeley wrote:
>Right, if any of these are tokenized, then you could make them
>non-tokenized (use "string" type). If they really need to be
>tokenized (author for example), then you could use copyField to make
>another copy to a non-tokenized field that you can use for faceting.
>
>After that, as Hoss suggests, run a single faceting query with all 4
>fields and look at the filterCache statistics. Take the "lookups"
>number and multiply it by, say, 1.5 to leave some room for future
>growth, and use that as your cache size. You probably want to bump up
>both initialSize and autowarmCount as well.
>
>The first query will still be slow. The second should be relatively fast.
>You may hit an OOM error. Increase the JVM heap size if this happens.
>
>-Yonik
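J.J.'s observation can be made concrete in schema terms: SimpleFacets picks its strategy from the field's declared properties, not from what the analyzer actually emits. A sketch of the two definitions being contrasted (type and field names are illustrative):

```xml
<!-- solr.TextField with KeywordTokenizerFactory: emits one token per
     value, but the field is still *typed* as tokenized, so the slower
     QueryFilter path is chosen -->
<fieldType name="keywordText" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- solr.StrField ("string"): recognized as untokenized, so the fast
     FieldCache path is eligible (if also single-valued, non-Boolean) -->
<field name="author_facet" type="string" indexed="true" stored="true"/>
```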
Re: Facet Performance
On 12/8/06, J.J. Larrea <[EMAIL PROTECTED]> wrote: Unfortunately which strategy will be chosen is currently undocumented and control is a bit oblique: If the field is tokenized or multivalued or Boolean, the FilterQuery method will be used; otherwise the FieldCache method. If anyone had time some of this could be documented here: http://wiki.apache.org/solr/SimpleFacetParameters The wiki is open to all. Or perhaps a new top level FacetedSearching page that references SimpleFacetParameters -Yonik
Re: Facet Performance
J.J. Larrea wrote:
> Unfortunately, which strategy will be chosen is currently undocumented,
> and control is a bit oblique: if the field is tokenized or multivalued
> or Boolean, the FilterQuery method will be used; otherwise the
> FieldCache method. I expect I or others will improve that shortly.

Good to hear, because I can't really get away with not having a multi-valued field for author. I'm really excited by Solr and really impressed so far.

Thanks!
Andrew
Re: Facet Performance
On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
> My data is 492,000 records of book data. I am faceting on 4 fields:
> author, subject, language, format. Format and language are fairly simple,
> as there are only a few unique terms. Author and subject, however, are
> much different in that there are thousands of unique terms.

When encountering difficult issues, I like to think in terms of the user interface. Surely you're not presenting 400k+ authors to the users in one shot. In Collex, we have put an AJAX drop-down that shows the author facet (we call it "name" in the UI, with various roles like author, painter, etc). You can see this in action here:

http://www.nines.org/collex

Type "da" into the name field, for example.

I developed a custom request handler in Solr for returning these types of suggest interfaces, complete with facet counts. My code is very specific to our fields, so it's not usable in a general sense, but maybe this gives you some ideas on where to go with these large sets of facet values.

Erik
Re: Facet Performance
: Unfortunately which strategy will be chosen is currently undocumented
: and control is a bit oblique: If the field is tokenized or multivalued
: or Boolean, the FilterQuery method will be used; otherwise the
: FieldCache method. I expect I or others will improve that shortly.

Bear in mind, what's provided out of the box is "SimpleFacets" ... it's designed to meet simple faceting needs ... when you start talking about hundreds or thousands of constraints per facet, you are getting outside the scope of what it was intended to serve efficiently.

At a certain point the only practical thing to do is write a custom request handler that makes the best choices for your data.

For the record: a really simple patch someone could submit would be to add an optional field-based param indicating which type of faceting (termenum/fieldcache) should be used to generate the list of terms, and then make SimpleFacets.getFacetFieldCounts use that and call the appropriate method instead of calling getTermCounts -- that way you could force one or the other if you know it's better for your data/query.

-Hoss
Re: Facet Performance
Erik Hatcher wrote:
> On Dec 8, 2006, at 2:15 PM, Andrew Nagy wrote:
>> My data is 492,000 records of book data. I am faceting on 4 fields:
>> author, subject, language, format. Format and language are fairly
>> simple, as there are only a few unique terms. Author and subject,
>> however, are much different in that there are thousands of unique terms.
>
> When encountering difficult issues, I like to think in terms of the user
> interface. Surely you're not presenting 400k+ authors to the users in
> one shot. In Collex, we have put an AJAX drop-down that shows the author
> facet (we call it "name" in the UI, with various roles like author,
> painter, etc). You can see this in action here:

In our data, we don't have unique authors for each record ... so let's say out of the 500,000 records we have 200,000 authors. What I am trying to display is the top 10 authors from the results of a search. So I do a search for title:"Gone with the wind" and I would like to see the top 10 matching authors from these results.

But no worries, I have written my own facet handler and I am now back to under a second with faceting!

Thanks for everyone's help, and keep up the good work!
Andrew
RE: facet performance tips
Jerome,

Yes, you need to increase the filterCache size to something close to the number of unique facet elements. But also consider the RAM required to accommodate the increase. I did see a significant performance gain by increasing the filterCache size.

Thanks,
Kalyan Manepalli

-----Original Message-----
From: Jérôme Etévé [mailto:jerome.et...@gmail.com]
Sent: Wednesday, August 12, 2009 12:31 PM
To: solr-user@lucene.apache.org
Subject: facet performance tips

Hi everyone,

I'm using some faceting on a Solr index containing ~ 160K documents. I perform facets on multivalued string fields. The number of possible different values is quite large.

Enabling facets degrades the performance by a factor of 3.

Because I'm using Solr 1.3, I guess the faceting makes use of the filter cache to work. My filterCache is set to a size of 2048. I also noticed in my Solr stats a very small ratio of cache hits (~ 0.01%).

Can it be the reason why the faceting is slow? Does it make sense to increase the filterCache size so it matches more or less the number of different possible values for the faceted fields? Would that not make the memory usage explode?

Thanks for your help!

--
Jerome Eteve.
Chat with me live at http://www.eteve.net
jer...@eteve.net
RE: facet performance tips
I am currently faceting on a tokenized multi-valued field at http://www.tokenizer.org (25 million simple docs).

It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a non-synchronized cache (similar to LingPipe's FastCache; SOLR-665, SOLR-667).

Average "faceting" on query results: 0.2 - 0.3 seconds; without those patches: 20-50 seconds.

I am going to upgrade to SOLR 1.4 from trunk (with SOLR-475 & SOLR-667) and compare results...

P.S.
Avoid faceting on a field with a heavy distribution of terms (such as the few million terms in my case); it won't work in SOLR 1.3.

TIP: use a non-tokenized single-valued field for faceting, such as a non-tokenized "country" field.

P.P.S.
It would be nice to load/stress-test
http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
against ConcurrentHashMap, which can put the CPU in a spin loop.

-----Original Message-----
From: Erik Hatcher [mailto:ehatc...@apache.org]
Sent: August-12-09 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Yes, increasing the filterCache size will help with Solr 1.3 performance.

Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance.

Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
> Hi everyone,
>
> I'm using some faceting on a solr index containing ~ 160K documents.
> I perform facets on multivalued string fields. The number of possible
> different values is quite large.
>
> Enabling facets degrades the performance by a factor 3.
>
> Because I'm using solr 1.3, I guess the faceting makes use of the
> filter cache to work. My filterCache is set
> to a size of 2048. I also noticed in my solr stats a very small ratio
> of cache hits (~ 0.01%).
>
> Can it be the reason why the faceting is slow? Does it make sense to
> increase the filterCache size so it matches more or less the number
> of different possible values for the faceted fields? Would that not
> make the memory usage explode?
>
> Thanks for your help !
>
> --
> Jerome Eteve.
> Chat with me live at http://www.eteve.net
> jer...@eteve.net
Re: facet performance tips
Yes, increasing the filterCache size will help with Solr 1.3 performance.

Do note that trunk (soon Solr 1.4) has dramatically improved faceting performance.

Erik

On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
> Hi everyone,
>
> I'm using some faceting on a solr index containing ~ 160K documents.
> I perform facets on multivalued string fields. The number of possible
> different values is quite large.
>
> Enabling facets degrades the performance by a factor 3.
>
> Because I'm using solr 1.3, I guess the faceting makes use of the
> filter cache to work. My filterCache is set to a size of 2048. I also
> noticed in my solr stats a very small ratio of cache hits (~ 0.01%).
>
> Can it be the reason why the faceting is slow? Does it make sense to
> increase the filterCache size so it matches more or less the number
> of different possible values for the faceted fields? Would that not
> make the memory usage explode?
>
> Thanks for your help !
>
> --
> Jerome Eteve.
> Chat with me live at http://www.eteve.net
> jer...@eteve.net
Re: facet performance tips
For your fields with many terms you may want to try Bobo, http://code.google.com/p/bobo-browse/, which could work well in your case.

On Wed, Aug 12, 2009 at 12:02 PM, Fuad Efendi wrote:
> I am currently faceting on a tokenized multi-valued field at
> http://www.tokenizer.org (25 million simple docs).
>
> It uses some home-made quick fixes similar to SOLR-475 (SOLR-711) and a
> non-synchronized cache (similar to LingPipe's FastCache; SOLR-665,
> SOLR-667).
>
> Average "faceting" on query results: 0.2 - 0.3 seconds; without those
> patches: 20-50 seconds.
>
> I am going to upgrade to SOLR 1.4 from trunk (with SOLR-475 & SOLR-667)
> and compare results...
>
> P.S.
> Avoid faceting on a field with a heavy distribution of terms (such as the
> few million terms in my case); it won't work in SOLR 1.3.
>
> TIP: use a non-tokenized single-valued field for faceting, such as a
> non-tokenized "country" field.
>
> P.P.S.
> It would be nice to load/stress-test
> http://alias-i.com/lingpipe/docs/api/com/aliasi/util/FastCache.html
> against ConcurrentHashMap, which can put the CPU in a spin loop.
>
> -----Original Message-----
> From: Erik Hatcher [mailto:ehatc...@apache.org]
> Sent: August-12-09 2:12 PM
> To: solr-user@lucene.apache.org
> Subject: Re: facet performance tips
>
> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>
> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
> performance.
>
> Erik
>
> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>> Hi everyone,
>>
>> I'm using some faceting on a solr index containing ~ 160K documents.
>> I perform facets on multivalued string fields. The number of possible
>> different values is quite large.
>>
>> Enabling facets degrades the performance by a factor 3.
>>
>> Because I'm using solr 1.3, I guess the faceting makes use of the
>> filter cache to work. My filterCache is set
>> to a size of 2048. I also noticed in my solr stats a very small ratio
>> of cache hits (~ 0.01%).
>>
>> Can it be the reason why the faceting is slow? Does it make sense to
>> increase the filterCache size so it matches more or less the number
>> of different possible values for the faceted fields? Would that not
>> make the memory usage explode?
>>
>> Thanks for your help !
>>
>> --
>> Jerome Eteve.
>> Chat with me live at http://www.eteve.net
>> jer...@eteve.net
Re: facet performance tips
Note that depending on the profile of your field (full text, and how many unique terms on average per document), the improvements from 1.4 may not apply, as you may exceed the limits of the new faceting technique in Solr 1.4.

-Stephen

On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>
> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
> performance.
>
> Erik
>
> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>> Hi everyone,
>>
>> I'm using some faceting on a solr index containing ~ 160K documents.
>> I perform facets on multivalued string fields. The number of possible
>> different values is quite large.
>>
>> Enabling facets degrades the performance by a factor 3.
>>
>> Because I'm using solr 1.3, I guess the faceting makes use of the
>> filter cache to work. My filterCache is set
>> to a size of 2048. I also noticed in my solr stats a very small ratio
>> of cache hits (~ 0.01%).
>>
>> Can it be the reason why the faceting is slow? Does it make sense to
>> increase the filterCache size so it matches more or less the number
>> of different possible values for the faceted fields? Would that not
>> make the memory usage explode?
>>
>> Thanks for your help !
>>
>> --
>> Jerome Eteve.
>> Chat with me live at http://www.eteve.net
>> jer...@eteve.net

--
Stephen Duncan Jr
www.stephenduncanjr.com
Re: facet performance tips
Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly.

My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon-to-be 1.4 version of Solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files?

Thanks!
Jerome.

2009/8/13 Stephen Duncan Jr:
> Note that depending on the profile of your field (full text, and how many
> unique terms on average per document), the improvements from 1.4 may not
> apply, as you may exceed the limits of the new faceting technique in Solr
> 1.4.
> -Stephen
>
> On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
>> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>>
>> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
>> performance.
>>
>> Erik
>>
>> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>>> Hi everyone,
>>>
>>> I'm using some faceting on a solr index containing ~ 160K documents.
>>> I perform facets on multivalued string fields. The number of possible
>>> different values is quite large.
>>>
>>> Enabling facets degrades the performance by a factor 3.
>>>
>>> Because I'm using solr 1.3, I guess the faceting makes use of the
>>> filter cache to work. My filterCache is set
>>> to a size of 2048. I also noticed in my solr stats a very small ratio
>>> of cache hits (~ 0.01%).
>>>
>>> Can it be the reason why the faceting is slow? Does it make sense to
>>> increase the filterCache size so it matches more or less the number
>>> of different possible values for the faceted fields? Would that not
>>> make the memory usage explode?
>>>
>>> Thanks for your help !
>>>
>>> --
>>> Jerome Eteve.
>>> Chat with me live at http://www.eteve.net
>>> jer...@eteve.net
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com

--
Jerome Eteve.
Chat with me live at http://www.eteve.net
jer...@eteve.net
RE: facet performance tips
I took 1.4 from trunk three days ago; it seems OK for production (at least for my Master instance, which is doing writes only). I use the same config files.

500 000 terms are OK too; I am using several million with pre-1.3 SOLR taken from trunk.

However, do not try to "facet" (probably an outdated term after SOLR-475) on generic queries such as [* TO *] (with a huge resultset). For smaller query results (100,000 instead of 100,000,000), "counting terms" is fast enough (a few milliseconds at http://www.tokenizer.org).

-----Original Message-----
From: Jérôme Etévé [mailto:jerome.et...@gmail.com]
Sent: August-13-09 5:38 AM
To: solr-user@lucene.apache.org
Subject: Re: facet performance tips

Thanks everyone for your advice. I increased my filterCache, and the faceting performance improved greatly.

My faceted field can have at the moment ~4 different terms, so I did set a filterCache size of 5 and it works very well. However, I'm planning to increase the number of terms to maybe around 500 000, so I guess this approach won't work anymore, as I doubt a 500 000 sized fieldCache would work.

So I guess my best move would be to upgrade to the soon-to-be 1.4 version of Solr to benefit from its new faceting method.

I know this is a bit off-topic, but do you have a rough idea about when 1.4 will be an official release? As well, is the current trunk OK for production? Is it compatible with 1.3 configuration files?

Thanks!
Jerome.

2009/8/13 Stephen Duncan Jr:
> Note that depending on the profile of your field (full text, and how many
> unique terms on average per document), the improvements from 1.4 may not
> apply, as you may exceed the limits of the new faceting technique in Solr
> 1.4.
> -Stephen
>
> On Wed, Aug 12, 2009 at 2:12 PM, Erik Hatcher wrote:
>> Yes, increasing the filterCache size will help with Solr 1.3 performance.
>>
>> Do note that trunk (soon Solr 1.4) has dramatically improved faceting
>> performance.
>>
>> Erik
>>
>> On Aug 12, 2009, at 1:30 PM, Jérôme Etévé wrote:
>>> Hi everyone,
>>>
>>> I'm using some faceting on a solr index containing ~ 160K documents.
>>> I perform facets on multivalued string fields. The number of possible
>>> different values is quite large.
>>>
>>> Enabling facets degrades the performance by a factor 3.
>>>
>>> Because I'm using solr 1.3, I guess the faceting makes use of the
>>> filter cache to work. My filterCache is set
>>> to a size of 2048. I also noticed in my solr stats a very small ratio
>>> of cache hits (~ 0.01%).
>>>
>>> Can it be the reason why the faceting is slow? Does it make sense to
>>> increase the filterCache size so it matches more or less the number
>>> of different possible values for the faceted fields? Would that not
>>> make the memory usage explode?
>>>
>>> Thanks for your help !
>>>
>>> --
>>> Jerome Eteve.
>>> Chat with me live at http://www.eteve.net
>>> jer...@eteve.net
>
> --
> Stephen Duncan Jr
> www.stephenduncanjr.com
RE: facet performance tips
It seems BOBO-Browse is an alternate faceting engine; it would be interesting to compare performance with SOLR... Distributed? -Original Message- From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] Sent: August-12-09 6:12 PM To: solr-user@lucene.apache.org Subject: Re: facet performance tips For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
RE: facet performance tips
Interesting - it has "BoboRequestHandler implements SolrRequestHandler", so it's easy to try; and it supports shards [Fuad Efendi] It seems BOBO-Browse is alternate faceting engine; would be interesting to compare performance with SOLR... Distributed? [Jason Rutherglen] For your fields with many terms you may want to try Bobo http://code.google.com/p/bobo-browse/ which could work well with your case.
Re: facet performance tips
Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, and compare memory consumption. For near realtime, Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
RE: facet performance tips
SOLR-1.4-trunk seems to use term counting instead of bitset intersections; check this http://issues.apache.org/jira/browse/SOLR-475 (and probably http://issues.apache.org/jira/browse/SOLR-711) -Original Message- From: Jason Rutherglen Yeah we need a performance comparison, I haven't had time to put one together. If/when I do I'll compare Bobo performance against Solr bitset intersection based facets, compare memory consumption. For near realtime Solr needs to cache and merge bitsets at the SegmentReader level, and Bobo needs to be upgraded to work with Lucene 2.9's searching at the segment level (currently it uses a MultiSearcher). Distributed search on either should be fairly straightforward? On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: > It seems BOBO-Browse is alternate faceting engine; would be interesting to > compare performance with SOLR... Distributed? > > > -Original Message- > From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] > Sent: August-12-09 6:12 PM > To: solr-user@lucene.apache.org > Subject: Re: facet performance tips > > For your fields with many terms you may want to try Bobo > http://code.google.com/p/bobo-browse/ which could work well with your > case. > > > > >
Re: facet performance tips
Right, I haven't used SOLR-475 yet and am more familiar with Bobo. I believe there are differences but I haven't gone into them yet. As I'm using Solr 1.4 now, maybe I'll test the UnInvertedField modality. Feel free to report back results as I don't think I've seen much yet? On Thu, Aug 13, 2009 at 10:51 AM, Fuad Efendi wrote: > SOLR-1.4-trunk uses terms counting instead of bitset intersects (seems to > be); check this > http://issues.apache.org/jira/browse/SOLR-475 > (and probably http://issues.apache.org/jira/browse/SOLR-711) > > -Original Message- > From: Jason Rutherglen > > Yeah we need a performance comparison, I haven't had time to put > one together. If/when I do I'll compare Bobo performance against > Solr bitset intersection based facets, compare memory > consumption. > > For near realtime Solr needs to cache and merge bitsets at the > SegmentReader level, and Bobo needs to be upgraded to work with > Lucene 2.9's searching at the segment level (currently it uses a > MultiSearcher). > > Distributed search on either should be fairly straightforward? > > On Thu, Aug 13, 2009 at 9:55 AM, Fuad Efendi wrote: >> It seems BOBO-Browse is alternate faceting engine; would be interesting to >> compare performance with SOLR... Distributed? >> >> >> -Original Message- >> From: Jason Rutherglen [mailto:jason.rutherg...@gmail.com] >> Sent: August-12-09 6:12 PM >> To: solr-user@lucene.apache.org >> Subject: Re: facet performance tips >> >> For your fields with many terms you may want to try Bobo >> http://code.google.com/p/bobo-browse/ which could work well with your >> case. >> >> >> >> >> > > >
Re: Facet performance with heterogeneous 'facets'?
Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - if there's one document in the result set, or if I do a query that returns all 13 documents. It seems something isn't right... it looks like solr is doing faceted search on the whole index, no matter what the result set is, when doing facets on a string field. I must be doing something wrong? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Michael Imbeault wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles; this is from a huge (15 million articles) database and names of authors are rare and heterogeneous. On a query that takes (without facets) 0.1 seconds, it jumps to ~20 seconds with just 1% of the documents indexed (I've been getting java.lang.OutOfMemoryError with the full index). ~40 seconds for a faceted search on 2 (string) fields. Range queries on a slong field are more acceptable (even with a dozen of them, query time is still in the subsecond range). Am I trying to do something which isn't what faceted search was made for? It would be understandable; after all, I guess the facets engine has to check every doc in the index and sort... which shouldn't yield good performance no matter what, sadly. Is there any other way I could achieve what I'm trying to do? Just a list of the most frequent (top 5) authors present in the results of a query. Thanks,
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Been playing around with the new 'facets search' and it works very well, but it's really slow for some particular applications. I've been trying to use it to display the most frequent authors of articles I noticed this too, and have been thinking about ways to fix it. The root of the problem is that lucene, like all full-text search engines, uses inverted indices. It's fast and easy to get all documents for a particular term, but getting all terms for a document is either not possible, or not fast (assuming many documents match a query). For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. -Yonik
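Yonik's point about inverted indices can be illustrated with a toy sketch (plain Java, no Lucene; the names and data here are illustrative only): looking up the documents for a term is one map access, but recovering the terms of a single document means walking every posting list.

```java
import java.util.*;

public class InvertedIndex {
    // term -> sorted list of doc ids containing it (a toy posting list)
    static final Map<String, List<Integer>> postings = Map.of(
        "hiv",   List.of(0, 2, 5),
        "blood", List.of(2, 3));

    // Fast: one map lookup per term.
    static List<Integer> docsFor(String term) {
        return postings.getOrDefault(term, List.of());
    }

    // Slow: recovering a document's terms means scanning every posting list.
    static List<String> termsFor(int doc) {
        List<String> terms = new ArrayList<>();
        for (Map.Entry<String, List<Integer>> e : postings.entrySet())
            if (e.getValue().contains(doc)) terms.add(e.getKey());
        return terms;
    }
}
```

With a real index holding hundreds of thousands of terms, that second loop is exactly the work faceting must somehow avoid or amortize.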
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik
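The strategy Yonik describes amounts to one bitset intersection per facet term; a minimal stand-alone sketch, using java.util.BitSet in place of Solr's DocSet (class and method names are mine, not Solr's API):

```java
import java.util.*;

public class IntersectionFacets {
    // For each facet term, count how many query-matching docs contain it:
    // intersection_count(docs_matching_query, docs_matching_term)
    static Map<String, Integer> facetCounts(BitSet queryDocs,
                                            Map<String, BitSet> termDocs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, BitSet> e : termDocs.entrySet()) {
            BitSet x = (BitSet) queryDocs.clone();
            x.and(e.getValue());                     // one intersection per term...
            counts.put(e.getKey(), x.cardinality()); // ...even for terms absent from the results
        }
        return counts;
    }
}
```

Note that the loop runs over every term in the field, which is why the cost is roughly the same whether the query matches 1 document or 130 000 - exactly what Michael observed.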
Re: Facet performance with heterogeneous 'facets'?
Yonik Seeley wrote: I noticed this too, and have been thinking about ways to fix it. The root of the problem is that lucene, like all full-text search engines, uses inverted indicies. It's fast and easy to get all documents for a particular term, but getting all terms for a document documents is either not possible, or not fast (assuming many documents match a query). Yeah that's what I've been thinking; the index isn't built to handle such searches, sadly :( It would be very nice to be able to rapidly search by most frequent author, journal, etc. For cases like "author", if there is only one value per document, then a possible fix is to use the field cache. If there can be multiple occurrences, there doesn't seem to be a good way that preserves exact counts, except maybe if the number of documents matching a query is low. I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on first and last authors fields). How would I use the field cache to fix my problem? Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? Would this be faster / require less memory? I guess that yes, and I'll test that when I have the time. Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik So more memory would fix the problem? 
Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why it's intersecting from the whole document set at first, and not docs_matching_query like you said. Thanks for the support, Michael
Re: Facet performance with heterogeneous 'facets'?
Another followup: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400docs/sec!). However, I still don't have an idea of what these values represent, and how I should estimate what values to set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was the number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, are there any plans to add an option not to run a facet search if the result set is too big? To avoid 40-second queries if the docset is too large... Thanks, Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Just a little follow-up - I did a little more testing, and the query takes 20 seconds no matter what - If there's one document in the results set, or if I do a query that returns all 13 documents. Yes, currently the same strategy is always used. intersection_count(docs_matching_query, docs_matching_author1) intersection_count(docs_matching_query, docs_matching_author2) intersection_count(docs_matching_query, docs_matching_author3) etc... Normally, the docsets will be cached, but since the number of authors is greater than the size of the filtercache, the effective cache hit rate will be 0% -Yonik
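For reference, the cache attributes in solrconfig.xml count cached entries (one entry per cached filter DocSet), not kilobytes; a typical declaration looks like this (the numbers below are illustrative, not recommendations):

```xml
<!-- size / initialSize / autowarmCount are numbers of cache entries,
     not kilobytes; autowarmCount entries are copied into the new cache
     when a new searcher is opened -->
<filterCache
  class="solr.LRUCache"
  size="16384"
  initialSize="4096"
  autowarmCount="4096"/>
```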
Re: Facet performance with heterogeneous 'facets'?
On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Yonik Seeley wrote: > For cases like "author", if there is only one value per document, then > a possible fix is to use the field cache. If there can be multiple > occurrences, there doesn't seem to be a good way that preserves exact > counts, except maybe if the number of documents matching a query is > low. > I have one value per document (I have fields for authors, last_author and first_author, and I'm doing faceted search on first and last authors fields). How would I use the field cache to fix my problem? Unless you want to dive into Solr development, you don't :-) It requires extensive changes to the faceting code and doing things a different way in some cases. The FieldCache is the fastest way to "uninvert" single valued fields... it's currently only used for Sorting, where one needs to quickly know the field value given the document id. The downside is high memory use, and that it's not a general solution... it can't handle fields with multiple tokens (tokenized fields or multi-valued fields). So the strategy would be to step through the documents, get the value for the field from the FieldCache, increment a counter for that value, then find the top counters when we are done. Also, would it be better to store a unique number (for each possible author) in an int field along with the string, and do the faceted searching on the int field? It won't really help. It wouldn't be faster, and it would require only slightly less memory. >> Just a little follow-up - I did a little more testing, and the query >> takes 20 seconds no matter what - If there's one document in the results >> set, or if I do a query that returns all 13 documents. > > Yes, currently the same strategy is always used. > intersection_count(docs_matching_query, docs_matching_author1) > intersection_count(docs_matching_query, docs_matching_author2) > intersection_count(docs_matching_query, docs_matching_author3) > etc... 
> > Normally, the docsets will be cached, but since the number of authors > is greater than the size of the filtercache, the effective cache hit > rate will be 0% > > -Yonik So more memory would fix the problem? Yes, if your collection size isn't that large... it's not a practical solution for many cases though. Also, I was under the impression that it was only searching / sorting for authors that it knows are in the result set... That's the problem... it's not necessarily easy to know *what* authors are in the result set. If we could quickly determine that, we could just count them and not do any intersections or anything at all. in the case of only one document (1 result), it seems strange that it takes the same time as for 130 000 results. It should just check the results, see that there's only one author, and return that? And in the case of 2 documents, just sort 2 authors (or 1 if they're the same)? I understand your answer (it does intersections), but I wonder why its intersecting from the whole document set at first, and not docs_matching_query like you said. It is just intersecting docs_matching_query. The problem is that it's intersecting that set with all possible author sets since it doesn't know ahead of time what authors are in the docs that match the query. There could be optimizations when docs_matching_query.size() is small, so we start somehow with terms in the documents rather than terms in the index. That requires termvectors to be stored (medium speed), or requires that the field be stored and that we re-analyze it (very slow). More optimization of special cases hasn't been done simply because no one has done it yet... (as you note, faceting is a new feature). -Yonik
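The FieldCache approach Yonik sketches above - step through the matching docs, look up each doc's single value, bump a counter, then take the top counters - might look like this in outline (plain Java; the fieldValues array stands in for Lucene's FieldCache, and all names are illustrative):

```java
import java.util.*;

public class FieldCacheFacets {
    // fieldValues[docId] plays the role of the FieldCache for a single-valued
    // field: one stored value per document, addressable by doc id.
    static List<Map.Entry<String, Integer>> topCounts(int[] matchingDocs,
                                                      String[] fieldValues,
                                                      int limit) {
        Map<String, Integer> counts = new HashMap<>();
        for (int doc : matchingDocs) {
            counts.merge(fieldValues[doc], 1, Integer::sum); // one counter per value
        }
        List<Map.Entry<String, Integer>> top = new ArrayList<>(counts.entrySet());
        top.sort((a, b) -> b.getValue() - a.getValue());     // highest count first
        return top.subList(0, Math.min(limit, top.size()));
    }
}
```

The cost here is proportional to the number of matching documents, not to the number of unique terms in the index - which is why it suits the many-unique-authors case.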
Re: Facet performance with heterogeneous 'facets'?
Michael Imbeault wrote: Also, is there any plans to add an option not to run a facet search if the result set is too big? To avoid 40 seconds queries if the docset is too large... You could run one query with facet=false, check the result size and then run it again (should be fast because it is cached) with facet=true&rows=0 to get facet results only. I would think that the decision to run/not run facets would be highly custom to your collection and not easily developed as a configurable feature. --Joachim
Re: Facet performance with heterogeneous 'facets'?
I just updated the comments in solrconfig.xml: On 9/18/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Another followup: I bumped all the caches in solrconfig.xml to size="1600384" initialSize="400096" autowarmCount="400096" It seemed to fix the problem on a very small index (facets on last and first author fields, + 12 range date facets, sub 0.3 seconds for queries). I'll check on the full index tomorrow (it's indexing right now, 400docs/sec!). However, I still don't have an idea what are these values representing, and how I should estimate what values I should set them to. Originally I thought it was the size of the cache in kb, and someone on the list told me it was number of items, but I don't quite get it. Better documentation on that would be welcomed :) Also, is there any plans to add an option not to run a facet search if the result set is too big? To avoid 40 seconds queries if the docset is too large... I'd like to speed up certain corner cases, but you can always set timeouts in whatever frontend is making the request to Solr too. -Yonik
Re: Facet performance with heterogeneous 'facets'?
Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You'll probably see a sharp increase in performance if you have a single untokenized author field containing the full name and you facet on that -- there will be a lot fewer unique terms to use when computing DocSets and intersections. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" page) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization i anticipated from the beginning would probably be useful in the situation Michael is describing ... 
if there is a "long tail" of authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to get from TermEnum.docFreq). when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of the list and the lowest constraint count is higher than the total doc count of the last author in the list, then we know we don't need to bother testing any other author, because no other author can possibly have a higher facet constraint count than the ones on our list (since they haven't even written that many documents) -Hoss
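Hoss's early-termination idea can be sketched as follows (plain Java, with BitSet standing in for DocSets; this is my reading of the proposal, not Solr code). The key invariant: a term's intersection count can never exceed its docFreq, so once the current facet.limit-th best count is at least the next term's docFreq, no remaining term can displace it:

```java
import java.util.*;

public class DocFreqPruning {
    static final class Term {
        final String name; final BitSet docs;
        Term(String name, BitSet docs) { this.name = name; this.docs = docs; }
    }

    // byDocFreq must be pre-sorted by docs.cardinality(), highest first.
    static List<String> topFacets(BitSet queryDocs, List<Term> byDocFreq, int limit) {
        // min-heap over intersection counts: the head is the weakest of the
        // current best `limit` candidates
        PriorityQueue<Map.Entry<String, Integer>> best =
            new PriorityQueue<>((a, b) -> a.getValue() - b.getValue());
        for (Term t : byDocFreq) {
            int docFreq = t.docs.cardinality();
            // prune: nothing later can beat the weakest current candidate
            if (best.size() == limit && best.peek().getValue() >= docFreq) break;
            BitSet x = (BitSet) queryDocs.clone();
            x.and(t.docs);                            // docs_matching_query ∩ docs_matching_term
            best.offer(Map.entry(t.name, x.cardinality()));
            if (best.size() > limit) best.poll();
        }
        List<String> out = new ArrayList<>();
        while (!best.isEmpty()) out.add(0, best.poll().getKey()); // strongest first
        return out;
    }
}
```

As Hoss notes, the pruning pays off when the query matches many documents, so the top intersection counts quickly exceed the docFreq of the terms further down the list.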
Re: Facet performance with heterogeneous 'facets'?
On 9/19/06, Chris Hostetter <[EMAIL PROTECTED]> wrote: Quick Question: did you say you are faceting on the first name field seperately from the last name field? ... why? You'll probably see a sharp increase in performacne if you have a single untokenized author field containing hte full name and you facet on that -- there will be a lot less unique terms to use when computing DocSets and intersections. Second: you mentioned increasing hte size of your filterCache significantly, but we don't really know how heterogenous your index is ... once you made that cahnge did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" patge) : > Also, I was under the impression : > that it was only searching / sorting for authors that it knows are in : > the result set... : : That's the problem... it's not necessarily easy to know *what* authors : are in the result set. If we could quickly determine that, we could : just count them and not do any intersections or anything at all. another way to look at it is that by looking at all the authors, the work done for generating the facet counts for query A can be completely reused for the next query B -- presuming your filterCache is large enough to hold all of the author filters. : There could be optimizations when docs_matching_query.size() is small, : so we start somehow with terms in the documents rather than terms in : the index. That requires termvectors to be stored (medium speed), or : requires that the field be stored and that we re-analyze it (very : slow). : : More optimization of special cases hasn't been done simply because no : one has done it yet... (as you note, faceting is a new feature). the optimization optimization i anticipated from teh begining, would probably be usefull in the situation Michael is describing ... 
if there is a "long tail" oif authors (and in my experience, there typically is) we can cache an ordered list of the top N most prolific authors, along with the count of how many documents they have in the index (this info is easy to getfrom TermEnum.docFreq). Yeah, I've thought about a fieldInfoCache too. It could also cache the total number of terms in order to make decisions about what faceting strategy to follow. when we facet on the authors, we start with that list and go in order, generating their facet constraint count using the DocSet intersection just like we currently do ... if we reach our facet.limit before we reach the end of hte list and the lowest constraint count is higher then the total doc count of the last author in the list, then we know we don't need to bother testing any other Author, because no other author an possibly have a higher facet constraint count then the ones on our list This works OK if the intersection counts are high (as a percentage of the facet sets). I'm not sure how often this will be the case though. Another tradeoff is to allow getting inexact counts with multi-token fields by: - simply faceting on the most popular values OR - do some sort of statistical sampling by reading term vectors for a fraction of the matching docs. -Yonik
Re: Facet performance with heterogeneous 'facets'?
: > when we facet on the authors, we start with : > that list and go in order, generating their facet constraint count using : > the DocSet intersection just like we currently do ... if we reach our : > facet.limit before we reach the end of hte list and the lowest constraint : > count is higher then the total doc count of the last author in the list, : > then we know we don't need to bother testing any other Author, because no : > other author an possibly have a higher facet constraint count then the : > ones on our list : : This works OK if the intersection counts are high (as a percentage of : the facet sets). I'm not sure how often this will be the case though. well, keep in mind "N" could be very big, big enough to store the full list of Terms sorted in docFreq order (it shouldn't take up much space since it's just the Term and an int) ... for any query that returns a "large" number of results, you probably won't need to reach the end of the list before you can tell that all the remaining Terms have a lower docFreq than the current last constraint count in your facet.limit list. For queries that return a "small" number of results, it wouldn't be as useful, but that's where a switch could be flipped to start with the values mapped to the docs (using FieldCache -- assuming single-value fields) : Another tradeoff is to allow getting inexact counts with multi-token fields by: : - simply faceting on the most popular values :OR : - do some sort of statistical sampling by reading term vectors for a : fraction of the matching docs. i loathe inexact counts ... i think of them as "Astrology" to the Astronomy of true Faceted Searching ... but i'm sure they would be "good enough" for some people's use cases. -Hoss
Re: Facet performance with heterogeneous 'facets'?
: I just updated the comments in solrconfig.xml: I've tweaked the SolrCaching wiki page to include some of this info as well, feel free to add any additional info you think would be helpful to other people (or ask any questions about it if any of it still doesn't seem clear to you)... http://wiki.apache.org/solr/SolrCaching : > now, 400docs/sec!). However, I still don't have an idea what are these : > values representing, and how I should estimate what values I should set : > them to. Originally I thought it was the size of the cache in kb, and : > someone on the list told me it was number of items, but I don't quite : > get it. Better documentation on that would be welcomed :) -Hoss
Re: Facet performance with heterogeneous 'facets'?
Thanks for all the great answers. Quick Question: did you say you are faceting on the first name field separately from the last name field? ... why? You misunderstood. I'm doing faceting on the first author, and the last author of the list. Life science papers have author lists, and the first one is usually the guy who did most of the work, and the last one is usually the boss of the lab. I already have untokenized author fields for that using copyField. Second: you mentioned increasing the size of your filterCache significantly, but we don't really know how heterogeneous your index is ... once you made that change did your filterCache hitrate increase? .. do you have any evictions (you can check on the "Statistics" page) It was at the default (16000) and it hit the ceiling so to speak. I did maxSize=1600 (for testing purposes) and now size : 17038 and 0 evictions. For a single facet field (journal name) with a limit of 5 and 12 faceted query fields (range on publication date), I now have 0.5-second searches, which is not too bad. The filtercache size is pretty much constant no matter how many queries I do. However, if I try to add another facet field (such as first_author), something strange happens. 99% CPU, the filter cache is filling up really fast, hitratio goes to hell, no disk activity, and it can stay that way for at least 30 minutes (didn't test longer, no point really). It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so I might only do journal_name facets. Any reason why faceting tries to preload every term in the field? I have noticed that facets are not cached. Facets off, cached queries take 0.01 seconds. Facets on, uncached and cached queries take 0.7 seconds. Any plans for a facets cache? I know that facets are still a very early feature, but they're already awesome; maybe my application is unrealistic. Thanks, Michael
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so i might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
Dude, stop being so awesome (and the whole Solr team). Seriously! Every problem / request (MoreLikeThis class, change AND/OR preference programmatically, etc) I've submitted to this mailing list has received a quick, more-than-I-ever-expected answer. I'll subscribe to the dev list (been reading it off and on), but I'm afraid I couldn't code my way out of a paper bag in Java. I'll contribute to the Solr wiki (the SolrPHP part in particular) as soon as I can. That's the least I can do! Btw, any plans for a facets cache? Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212 Yonik Seeley wrote: On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: It turns out that journal_name has 17038 different tokens, which is manageable, but first_author has > 400 000. I don't think this will ever yield good performance, so i might only do journal_name facets. Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): http://www.nabble.com/big-faceting-speedup-for-single-valued-fields-tf2308153.html -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Btw, Any plans for a facets cache? Maybe a partial one (like caching top terms to implement some other optimizations). My general philosophy on caching in Solr has been to cache things the client can't: elemental things, or *parts* of requests to make many different requests faster (most bang-for-the-buck). Caching complete requests/responses is generally less useful since it requires even more memory, has a worse hit ratio, and can be done anyway by the client or a separate process like squid. -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
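The warming hook Yonik refers to is configured in solrconfig.xml via a QuerySenderListener; a sketch (the field name and query here are placeholders for your own):

```xml
<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <str name="q">*:*</str>
      <str name="rows">0</str>
      <str name="facet">true</str>
      <str name="facet.field">first_author</str>
    </lst>
  </arr>
</listener>
```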
Re: Facet performance with heterogeneous 'facets'?
I upgraded to the most recent Solr build (9-22) and sadly it's still really slow: an 800-second query with a single facet on first_author, 15 million documents total, and the query returns 180 results. Maybe I'm doing something wrong? Also, this is on my personal desktop, not on a server. Still, I'm getting 0.1-second queries without facets, so I don't think that's the cause. In the admin panel I can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize).

Here's the field i'm using in schema.xml :

This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false

I'll do more testing on the weekend,

Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote: On 9/21/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: Hang in there Michael, a fix is on the way for your scenario (and subscribe to solr-dev if you want to stay on the bleeding edge): OK, the optimization has been checked in. You can checkout from svn and build Solr, or wait for the 9-22 nightly build (after 8:30 EDT). I'd be interested in hearing your results with it. The first facet request on a field will take longer than subsequent ones because the FieldCache entry is loaded on demand. You can use a firstSearcher/newSearcher hook in solrconfig.xml to send a facet request so that a real user would never see this slower query. -Yonik
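(The field definition after "Here's the field i'm using in schema.xml" was stripped by the list archive. Based on the surrounding discussion — type "string", single-valued — it would have looked something like this sketch; attribute values here are assumptions, not the actual posted definition:)

```xml
<!-- Assumed reconstruction of the stripped field definition -->
<field name="first_author" type="string" indexed="true" stored="true"/>
```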
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. 800 seconds query with a single facet on first_author, 15 millions documents total, the query return 180. Maybe i'm doing something wrong? Also, this is on my personal desktop; not on a server. Still, I'm getting 0.1 seconds queries without facets, so I don't think thats the cause. In the admin panel i can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize). The fact that you see all the filtercache usage means that the optimization didn't kick in for some reason. Here's the field i'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"? Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
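(The version Yonik is asking about is the attribute on the root element of schema.xml — Solr's schema-format version, not the user's own version number. A sketch of what to check for:)

```xml
<!-- schema.xml root element: version="1.1" enables per-field
     multiValued checking; 1.0 treated every field as multiValued -->
<schema name="example" version="1.1">
  <!-- types and fields go here -->
</schema>
```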
Re: Facet performance with heterogeneous 'facets'?
Excellent news; as you guessed, my schema was (for some reason) set to version 1.0. This also caused some of the problems I had with the original SolrPHP (parsing the wrong response). But better yet, the 800-second query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering I'm on a test Windows desktop box and not a server). The only problem is that if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will go away on a server with more than the current 500 megs I can allocate to Tomcat.

Michael Imbeault CHUL Research Center (CHUQ) 2705 boul. Laurier Ste-Foy, QC, Canada, G1V 4G2 Tel: (418) 654-2705, Fax: (418) 654-2212

Yonik Seeley wrote: On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: I upgraded to the most recent Solr build (9-22) and sadly it's still really slow. 800 seconds query with a single facet on first_author, 15 millions documents total, the query return 180. Maybe i'm doing something wrong? Also, this is on my personal desktop; not on a server. Still, I'm getting 0.1 seconds queries without facets, so I don't think thats the cause. In the admin panel i can still see the filtercache doing millions of lookups (and tons of evictions once it hits the maxsize). The fact that you see all the filtercache usage means that the optimization didn't kick in for some reason. Here's the field i'm using in schema.xml : That looks fine... This is the query : q="hiv red blood"&start=0&rows=20&fl=article_title+authors+journal_iso+pubdate+pmid+score&qt=standard&facet=true&facet.field=first_author&facet.limit=5&facet.missing=false&facet.zeros=false That looks OK too. I assume that you didn't change the fieldtype definition for "string", and that the schema has version="1.1"?
Before 1.1, all fields were assumed to be multiValued (there was no checking or enforcement). -Yonik
Re: Facet performance with heterogeneous 'facets'?
On 9/22/06, Michael Imbeault <[EMAIL PROTECTED]> wrote: Excellent news; as you guessed, my schema was (for some reason) set to version 1.0.

Yeah, I just realized that having "version" right next to "name" would lead people to think it's "their" version number, when it's really Solr's version number. I've added a comment to the example schema to clarify that.

But better yet, the 800 seconds query is now running in 0.5-2 seconds! Amazing optimization! I can now do faceting on journal title (17 000 different titles) and last author (>400 000 authors), + 12 date range queries, in a very reasonable time (considering im on a test windows desktop box and not a server). The only problem is if I add first author, I get a java.lang.OutOfMemoryError: Java heap space. I'm sure this problem will get away on a server with more than the current 500 megs I can allocate to Tomcat.

Yes, the Lucene FieldCache takes up a lot of memory. It basically holds the entire field in a non-inverted form: http://lucene.apache.org/java/docs/api/org/apache/lucene/search/FieldCache.StringIndex.html It's currently also used for sorting, which also needs fast document->fieldvalue lookups, rather than the inverted term->documents_containing_that_term mapping. -Yonik
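(To make the non-inverted layout concrete, here is a minimal, self-contained Java sketch of the idea behind FieldCache.StringIndex: one ordinal per document plus a term lookup table, so facet counting for a single-valued field is a single pass over an int array. The class, data, and variable names are illustrative — this is not Lucene's actual code.)

```java
public class StringIndexSketch {
    public static void main(String[] args) {
        // lookup[ord] is the term for ordinal ord; index 0 is reserved
        // for documents with no value in the field
        String[] lookup = {null, "Imbeault M", "Seeley Y"};

        // order[doc] is the ordinal of the field value in document doc.
        // This is the "non-inverted" form: doc -> value, not value -> docs.
        int[] order = {1, 2, 1, 1, 0, 2};

        // Facet counting: one array increment per document
        int[] counts = new int[lookup.length];
        for (int ord : order) {
            counts[ord]++;
        }

        // prints: Imbeault M: 3  then  Seeley Y: 2
        for (int ord = 1; ord < lookup.length; ord++) {
            System.out.println(lookup[ord] + ": " + counts[ord]);
        }
    }
}
```

This also shows why the memory cost scales with the whole field (every document gets an entry in order[], every distinct term an entry in lookup[]), which is the source of the OutOfMemoryError above.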