Re: facet.method enum vs fc

2013-04-19 Thread Joel Bernstein
Faceting on a high cardinality string field, like url, on a 120 million
record index is going to be very memory intensive.

You will very likely need to shard the index to get the performance that
you need.

In Solr 4.2, you can make the url field a Disk based DocValue and shift the
memory from Solr to the file system cache. But to run efficiently this is
still going to take a lot of memory in the OS file cache.




On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

 20G is allocated to Solr already.

 Ming


 On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk
 wrote:

  On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
   I am doing faceting on an index of 120M documents,
   on the field of url[...]
 
  I would guess that you would need 3-4GB for that.
  How much memory do you allocate to Solr?
 
  - Toke Eskildsen
 
 




-- 
Joel Bernstein
Professional Services LucidWorks


Re: facet.method enum vs fc

2013-04-19 Thread Mingfeng Yang
Joel,

Thanks for your kind reply.  The problem is solved by sharding and using
facet.method=enum.  I am curious about the difference between enum and fc,
and why enum works but fc does not.  Do you know anything about this?

Thank you!

Regards,
Ming


On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote:

 Faceting on a high cardinality string field, like url, on a 120 million
 record index is going to be very memory intensive.

 You will very likely need to shard the index to get the performance that
 you need.

 In Solr 4.2, you can make the url field a Disk based DocValue and shift the
 memory from Solr to the file system cache. But to run efficiently this is
 still going to take a lot of memory in the OS file cache.




 On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com
 wrote:

  20G is allocated to Solr already.
 
  Ming
 
 
  On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk
  wrote:
 
   On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
    I am doing faceting on an index of 120M documents,
    on the field of url[...]
  
   I would guess that you would need 3-4GB for that.
   How much memory do you allocate to Solr?
  
   - Toke Eskildsen
  
  
 



 --
 Joel Bernstein
 Professional Services LucidWorks



Re: facet.method enum vs fc

2013-04-19 Thread Chris Hostetter

: Thanks for your kind reply.  The problem is solved by sharding and using
: facet.method=enum.  I am curious about the difference between enum and fc,
: and why enum works but fc does not.  Do you know anything about this?

method=fc/fcs uses the field caches (or uninverted fields if they are
multivalued) to build a large data structure that is reusable across
many requests and allows faceting to happen very quickly even when the
number of terms is large.

enum causes Solr to walk the term enum for the field and generate a DocSet
for each term, which is then intersected with the main results -- basically
doing facet.field just like facet.query with simple term queries.

These DocSets from using facet.method=enum will be cached in the
filterCache, so there is some performance savings there if/when people
filter on these facet constraints, but the regular rules about cache
evictions apply.

So in a situation where the heap size is big enough not to matter,
method=fc should be faster and take up less RAM than method=enum with a
filterCache sized big enough to hold all of the DocSets involved without
cache evictions.

In most cases, the only motivation for using method=enum is if you know
the cardinality of your set of constraints is relatively small and fixed
(e.g. there are only 50 states in the US, so you might find that faceting
on a state field with method=enum is just as fast as using method=fc and
takes less RAM -- this is why boolean fields default to method=enum, the
cardinality is guaranteed to be 2).  But in some less common cases, you
might care more about saving RAM than speed, or you might be trying to
facet on a huge index with fields containing lots of terms (e.g. full text)
so that method=fc just won't work with any conceivable amount of RAM, so it
could make sense to use method=enum with the filterCache disabled.
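
As a rough illustration of the two strategies described above, here is a toy
sketch. It is not Solr's actual code: the class and method names are made up
for this example, java.util.BitSet stands in for a DocSet, and an int array
of term ordinals stands in for the field cache of a single-valued field.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of the two faceting strategies; not Solr code.
final class FacetMethodSketch {

    // Builds one BitSet of doc ids per term from a doc -> term-ordinal array.
    static Map<String, BitSet> invert(int[] termOrdPerDoc, String[] terms) {
        Map<String, BitSet> docsPerTerm = new LinkedHashMap<>();
        for (String t : terms) {
            docsPerTerm.put(t, new BitSet());
        }
        for (int doc = 0; doc < termOrdPerDoc.length; doc++) {
            docsPerTerm.get(terms[termOrdPerDoc[doc]]).set(doc);
        }
        return docsPerTerm;
    }

    // enum-style: walk every term and intersect its doc set with the results.
    static Map<String, Integer> countByEnum(Map<String, BitSet> docsPerTerm, BitSet results) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, BitSet> e : docsPerTerm.entrySet()) {
            BitSet hits = (BitSet) e.getValue().clone();
            hits.and(results);
            counts.put(e.getKey(), hits.cardinality());
        }
        return counts;
    }

    // fc-style: walk the matching docs and bump a counter per term ordinal.
    static Map<String, Integer> countByFieldCache(int[] termOrdPerDoc, String[] terms, BitSet results) {
        int[] tally = new int[terms.length];
        for (int doc = results.nextSetBit(0); doc >= 0; doc = results.nextSetBit(doc + 1)) {
            tally[termOrdPerDoc[doc]]++;
        }
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            counts.put(terms[i], tally[i]);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] terms = {"a.example/1", "b.example/2"};   // made-up url values
        int[] termOrdPerDoc = {0, 1, 0, 0, 1, 1};          // doc id -> index into terms
        BitSet results = new BitSet();
        results.set(0);
        results.set(2);
        results.set(4);
        System.out.println(countByEnum(invert(termOrdPerDoc, terms), results));
        System.out.println(countByFieldCache(termOrdPerDoc, terms, results));
    }
}
```

Both methods return the same counts. What differs is the cost: the enum-style
loop does work per term, while the fc-style loop does work per matching
document and needs the per-document array in memory.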


-Hoss


Re: facet.method enum vs fc

2013-04-18 Thread Toke Eskildsen
On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
 I am doing faceting on an index of 120M documents, 
 on the field of url[...]

I would guess that you would need 3-4GB for that.
How much memory do you allocate to Solr?

- Toke Eskildsen



Re: facet.method enum vs fc

2013-04-18 Thread Mingfeng Yang
20G is allocated to Solr already.

Ming


On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen
t...@statsbiblioteket.dk wrote:

 On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
  I am doing faceting on an index of 120M documents,
  on the field of url[...]

 I would guess that you would need 3-4GB for that.
 How much memory do you allocate to Solr?

 - Toke Eskildsen




Re: facet.method enum vs fc

2013-04-17 Thread Timothy Potter
What are your results when using facet.method=fcs?


On Wed, Apr 17, 2013 at 12:06 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

 I am doing faceting on an index of 120M documents, on the url field,
 using the following two queries.  Note that the only difference between the
 two queries is that the first one uses the default facet.method, and the
 second one uses facet.method=enum.  (Each document in the index contains a
 review we extracted from the internet with multiple fields, and the url
 field holds the link to the original web page.  About 5.3 million
 documents match.)


 http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0


 http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0&facet.method=enum

 The first query gives me an out-of-memory error (ERROR 500: Java heap space
  java.lang.OutOfMemoryError: Java heap space), but the second one runs fine,
 though very slowly (163 seconds).

 According to the wiki and Solr documentation, the default facet.method=fc
 uses less memory than facet.method=enum, doesn't it?

 Thanks,
 Ming



Re: facet.method enum vs fc

2013-04-17 Thread Mingfeng Yang
Does Solr 3.6 have facet.method=fcs?   I tried it anyway, and got

ERROR 500: GC overhead limit exceeded  java.lang.OutOfMemoryError: GC
overhead limit exceeded.


On Wed, Apr 17, 2013 at 12:38 PM, Timothy Potter thelabd...@gmail.com wrote:

 What are your results when using facet.method=fcs?


 On Wed, Apr 17, 2013 at 12:06 PM, Mingfeng Yang mfy...@wisewindow.com
 wrote:

  I am doing faceting on an index of 120M documents, on the url field,
  using the following two queries.  Note that the only difference between the
  two queries is that the first one uses the default facet.method, and the
  second one uses facet.method=enum.  (Each document in the index contains a
  review we extracted from the internet with multiple fields, and the url
  field holds the link to the original web page.  About 5.3 million
  documents match.)
 
 
 
  http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0


  http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0&facet.method=enum
 
  The first query gives me an out-of-memory error (ERROR 500: Java heap space
   java.lang.OutOfMemoryError: Java heap space), but the second one runs fine,
  though very slowly (163 seconds).

  According to the wiki and Solr documentation, the default facet.method=fc
  uses less memory than facet.method=enum, doesn't it?
 
  Thanks,
  Ming
 



Re: facet.method: enum vs. fc

2010-10-11 Thread Erick Erickson
Yep, that was probably the best choice

It's a classic time/space tradeoff. The enum method creates a bitset for
*each* unique facet value. The bitset is (maxdocs / 8) bytes in size (I'm
ignoring some overhead here). So if your facet field has 10 unique values
and 8M documents, you'll use up 10M bytes or so. 20 unique values will use
up 20M bytes, and so on. But this is very, very fast.
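
The arithmetic above can be sketched as follows. This is only a
back-of-the-envelope estimate under the stated (maxdocs / 8) assumption; the
class and method names are made up, and real overhead is ignored as in the
text.

```java
// Hypothetical helper for the estimate above; not part of Solr.
final class EnumBitsetEstimate {

    // One bitset per unique value, each roughly maxDoc / 8 bytes (rounded up).
    static long bitsetBytes(long maxDoc, long uniqueValues) {
        long bytesPerBitset = (maxDoc + 7) / 8;
        return bytesPerBitset * uniqueValues;
    }

    public static void main(String[] args) {
        System.out.println(bitsetBytes(8_000_000L, 10)); // 10000000
        System.out.println(bitsetBytes(8_000_000L, 20)); // 20000000
    }
}
```

The estimate grows linearly with the number of unique values, which is why
enum is usually only considered for fields with few distinct terms.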

fc, on the other hand, eats up cache for storing the string value for each
unique value, plus various counter arrays (several bytes/doc). For most
cases, it will use less memory than enum, but will be slower.

I'd stick with fc for the time being and think about enum if (1) you have a
good idea of what the number of unique terms is or (2) you start to need to
finely tune your speed.

HTH
Erick

On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna 
castagna.li...@googlemail.com wrote:

 Hi,
 I am using Solr v1.4 and I am not sure which facet.method I should use.

 What should I use if I do not know in advance if the number of values
 for a given field will be high or low?

 What are the pros/cons of using facet.method=enum vs. facet.method=fc?

 When should I use enum vs. fc?

 I have found some comments and suggestions here:

  enum enumerates all terms in a field, calculating the set intersection
  of documents that match the term with documents that match the query.
  This was the default (and only) method for faceting multi-valued fields
  prior to Solr 1.4.
  fc (stands for field cache), the facet counts are calculated by
  iterating over documents that match the query and summing the terms
  that appear in each document. This was the default method for single
  valued fields prior to Solr 1.4.
  The default value is fc (except for BoolField) since it tends to use
  less memory and is faster when a field has many unique terms in the
  index.
  -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

  facet.method=enum [...] this is excellent for fields where there is
  a small set of distinct values. The average number of values per
  document does not matter.
  facet.method=fc [...] this is excellent for situations where the
  number of indexed values for the field is high, but the number of
  values per document is low. For multi-valued fields, a hybrid approach
  is used that uses term filters from the filterCache for terms that
  match many documents.
  -- http://wiki.apache.org/solr/SolrFacetingOverview

  If you are faceting on a field that you know only has a small number
  of values (say less than 50), then it is advisable to explicitly set
  this to enum. When faceting on multiple fields, remember to set this
  for the specific fields desired and not universally for all facets.
  The request handler configuration is a good place to put this.
  -- Book: Solr 1.4 Enterprise Search Server, pag. 148
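
  Setting the method per field, as the book excerpt suggests, uses Solr's
  f.<fieldname>.facet.method parameter form. A minimal sketch of assembling
  such a query string follows; the class name and the field names "state"
  and "url" are made up for illustration, and the field names are assumed
  to need no URL encoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that builds per-field facet.method parameters.
final class PerFieldFacetMethod {

    // Emits facet.field plus a per-field facet.method override for each field.
    static String facetParams(String[] enumFields, String[] fcFields) {
        List<String> params = new ArrayList<>();
        params.add("facet=true");
        for (String field : enumFields) {
            params.add("facet.field=" + field);
            params.add("f." + field + ".facet.method=enum");
        }
        for (String field : fcFields) {
            params.add("facet.field=" + field);
            params.add("f." + field + ".facet.method=fc");
        }
        return String.join("&", params);
    }

    public static void main(String[] args) {
        // Low-cardinality field via enum, high-cardinality field via fc.
        System.out.println(facetParams(new String[] {"state"}, new String[] {"url"}));
    }
}
```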

 This is the part of the Solr code which deals with the facet.method
 parameter:

  if (enumMethod) {
    counts = getFacetTermEnumCounts([...]);
  } else {
    if (multiToken) {
      UnInvertedField uif = [...]
      counts = uif.getCounts([...]);
    } else {
      [...]
      if (per_segment) {
        [...]
        counts = ps.getFacetCounts([...]);
      } else {
        counts = getFieldCacheCounts([...]);
      }
    }
  }
  --
 https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java

 See also:

  -
 http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values

 In the end, since I do not know in advance the number of different
 values for my fields, I went for facet.method=fc. Does this seem
 reasonable to you?

 Thank you,
 Paolo



Re: facet.method: enum vs. fc

2010-10-11 Thread Paolo Castagna

Thank you Erick, your explanation was helpful.
I'll stick with fc and come back to this later if I need further tuning.

Paolo

Erick Erickson wrote:

Yep, that was probably the best choice

It's a classic time/space tradeoff. The enum method creates a bitset for
*each* unique facet value. The bitset is (maxdocs / 8) bytes in size (I'm
ignoring some overhead here). So if your facet field has 10 unique values
and 8M documents, you'll use up 10M bytes or so. 20 unique values will use
up 20M bytes, and so on. But this is very, very fast.

fc, on the other hand, eats up cache for storing the string value for each
unique value, plus various counter arrays (several bytes/doc). For most
cases, it will use less memory than enum, but will be slower.

I'd stick with fc for the time being and think about enum if (1) you have a
good idea of what the number of unique terms is or (2) you start to need to
finely tune your speed.

HTH
Erick

On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna 
castagna.li...@googlemail.com wrote:


Hi,
I am using Solr v1.4 and I am not sure which facet.method I should use.

What should I use if I do not know in advance if the number of values
for a given field will be high or low?

What are the pros/cons of using facet.method=enum vs. facet.method=fc?

When should I use enum vs. fc?

I have found some comments and suggestions here:

 enum enumerates all terms in a field, calculating the set intersection
 of documents that match the term with documents that match the query.
 This was the default (and only) method for faceting multi-valued fields
 prior to Solr 1.4.
 fc (stands for field cache), the facet counts are calculated by
 iterating over documents that match the query and summing the terms
 that appear in each document. This was the default method for single
 valued fields prior to Solr 1.4.
 The default value is fc (except for BoolField) since it tends to use
 less memory and is faster when a field has many unique terms in the
 index.
 -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method

 facet.method=enum [...] this is excellent for fields where there is
 a small set of distinct values. The average number of values per
 document does not matter.
 facet.method=fc [...] this is excellent for situations where the
 number of indexed values for the field is high, but the number of
 values per document is low. For multi-valued fields, a hybrid approach
 is used that uses term filters from the filterCache for terms that
 match many documents.
 -- http://wiki.apache.org/solr/SolrFacetingOverview

 If you are faceting on a field that you know only has a small number
 of values (say less than 50), then it is advisable to explicitly set
 this to enum. When faceting on multiple fields, remember to set this
 for the specific fields desired and not universally for all facets.
 The request handler configuration is a good place to put this.
 -- Book: Solr 1.4 Enterprise Search Server, pag. 148

This is the part of the Solr code which deals with the facet.method
parameter:

 if (enumMethod) {
   counts = getFacetTermEnumCounts([...]);
 } else {
   if (multiToken) {
     UnInvertedField uif = [...]
     counts = uif.getCounts([...]);
   } else {
     [...]
     if (per_segment) {
       [...]
       counts = ps.getFacetCounts([...]);
     } else {
       counts = getFieldCacheCounts([...]);
     }
   }
 }
 --
https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java

See also:

 -
http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values

In the end, since I do not know in advance the number of different
values for my fields, I went for facet.method=fc. Does this seem
reasonable to you?

Thank you,
Paolo