Re: facet.method enum vs fc
Faceting on a high-cardinality string field such as url, on a 120-million-record index, is going to be very memory intensive. You will very likely need to shard the index to get the performance that you need.

In Solr 4.2, you can make the url field a disk-based DocValues field and shift the memory from the Solr heap to the file system cache. To run efficiently, though, this is still going to take a lot of memory in the OS file cache.

On Thu, Apr 18, 2013 at 12:00 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

> 20G is allocated to Solr already.
>
> Ming

--
Joel Bernstein
Professional Services
LucidWorks
Re: facet.method enum vs fc
Joel,

Thanks for your kind reply. The problem is solved with sharding and using facet.method=enum. I am curious about the difference between enum and fc, and why enum works but fc does not. Do you know something about this?

Thank you!

Regards,
Ming

On Fri, Apr 19, 2013 at 6:18 AM, Joel Bernstein joels...@gmail.com wrote:

> Faceting on a high cardinality string field, like url, on a 120 million
> record index is going to be very memory intensive. You will very likely
> need to shard the index to get the performance that you need.
>
> In Solr 4.2, you can make the url field a Disk based DocValue and shift
> the memory from Solr to the file system cache. But to run efficiently
> this is still going to take a lot of memory in the OS file cache.
Re: facet.method enum vs fc
: Thanks for your kind reply. The problem is solved with sharding and using
: facet.method=enum. I am curious about what's the difference between enum
: and fc, so that enum works but fc does not. Do you know something about
: this?

method=fc/fcs uses the field caches (or uninverted fields if they are multivalued) to build a large data structure that is reusable across many requests and allows faceting to happen very quickly even when the number of terms is large.

enum causes solr to walk the term enum for the field and generate a DocSet for each term, which is then intersected with the main results -- basically doing facet.field just like facet.query with simple term queries. these DocSets from using facet.method=enum will be cached in the filterCache, so there is some performance savings there if/when people filter on these facet constraints, but the regular rules about cache evictions apply.

So in a situation where the heap size is big enough not to matter, method=fc should be faster, and it should take up less ram than method=enum would if you sized your filterCache big enough to hold all of the DocSets involved without cache evictions.

In most cases, the only motivation for using method=enum is if you know the cardinality of your set of constraints is relatively small and fixed (eg: there are only 50 states in the US, so you might find that faceting on a state field with method=enum is just as fast as using method=fc and takes less ram -- this is why boolean fields default to method=enum, the cardinality is guaranteed to be 2).

But in some less common cases, you might care more about saving ram than speed, or you might be trying to facet on a huge index with fields containing lots of terms (eg: full text) so that method=fc just won't work with any conceivable amount of ram, so it could make sense to use method=enum with the filterCache disabled.

-Hoss
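The two strategies described above can be sketched on a toy in-memory index. This is only an illustration of the counting logic, not Solr's implementation: the data, the function names, and the plain dict standing in for the filterCache are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical toy index: doc id -> term of a single-valued facet field.
# Doc ids are assumed to be dense (0..n-1), as in a Lucene segment.
DOC_TERMS = {0: "ca", 1: "ny", 2: "ca", 3: "tx", 4: "ny", 5: "ca"}


def facet_enum(doc_terms, matching_docs, filter_cache=None):
    """enum-style: walk every term, build (or reuse) its doc set,
    and intersect that doc set with the main result set."""
    filter_cache = {} if filter_cache is None else filter_cache
    counts = {}
    for term in sorted(set(doc_terms.values())):
        if term not in filter_cache:
            # One doc set per unique term: memory grows with the term count.
            filter_cache[term] = frozenset(
                d for d, t in doc_terms.items() if t == term
            )
        hits = len(filter_cache[term] & matching_docs)
        if hits:
            counts[term] = hits
    return counts


def build_field_cache(doc_terms):
    """fc-style setup: one reusable doc -> term-ordinal array for the index."""
    terms = sorted(set(doc_terms.values()))
    ordinal = {t: i for i, t in enumerate(terms)}
    ords = [ordinal[doc_terms[d]] for d in range(len(doc_terms))]
    return terms, ords


def facet_fc(terms, ords, matching_docs):
    """fc-style: iterate over matching docs and bump a counter per ordinal."""
    tally = Counter(ords[d] for d in matching_docs)
    return {terms[o]: n for o, n in tally.items()}


if __name__ == "__main__":
    matching = frozenset({0, 1, 2, 5})
    print(facet_enum(DOC_TERMS, matching))
    terms, ords = build_field_cache(DOC_TERMS)
    print(facet_fc(terms, ords, matching))
```

Both functions return the same counts; what differs is where the memory goes. The enum sketch keeps one doc set per unique term, while the fc sketch keeps one array sized by the number of documents plus the term list.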
Re: facet.method enum vs fc
On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:

> I am doing faceting on an index of 120M documents, on the field of url[...]

I would guess that you would need 3-4GB for that. How much memory do you allocate to Solr?

- Toke Eskildsen
Re: facet.method enum vs fc
20G is allocated to Solr already.

Ming

On Wed, Apr 17, 2013 at 11:56 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

> On Wed, 2013-04-17 at 20:06 +0200, Mingfeng Yang wrote:
> > I am doing faceting on an index of 120M documents, on the field of url[...]
>
> I would guess that you would need 3-4GB for that. How much memory do
> you allocate to Solr?
>
> - Toke Eskildsen
Re: facet.method enum vs fc
What are your results when using facet.method=fcs?

On Wed, Apr 17, 2013 at 12:06 PM, Mingfeng Yang mfy...@wisewindow.com wrote:

> I am doing faceting on an index of 120M documents, on the field of url,
> using the following two queries. Note that the only difference between
> the two queries is that the first one uses the default facet.method, and
> the second one uses facet.method=enum. (Each document in the index
> contains a review we extracted from the internet with multiple fields,
> and the url field stands for the link to the original web page. The
> number of matching documents is about 5.3 million.)
>
> http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0
>
> http://autos-solr-api.wisewindow.com:8995/solr/select?q=*:*&indent=on&version=2.2&fq=language:english&start=0&rows=1&facet.mincount=1&facet=true&wt=json&fq=search_source:%22Video%22&sort=date%20desc&fl=topic&facet.limit=25&facet.field=url&facet.offset=0&facet.method=enum
>
> The first query gives me an out-of-memory error (ERROR 500: Java heap
> space java.lang.OutOfMemoryError: Java heap space), but the second one
> runs fine, though very slowly (163 seconds).
>
> According to the wiki and the Solr documentation, the default
> facet.method=fc uses less memory than facet.method=enum, doesn't it?
>
> Thanks,
> Ming
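A minimal sketch of assembling such a facet request in Python, with every parameter properly separated and escaped. The function name and the localhost base URL are hypothetical; the parameter names are the ones used in the queries above, plus Solr's per-field override form f.<field>.facet.method.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Hypothetical endpoint; substitute the real Solr select handler URL.
BASE_URL = "http://localhost:8983/solr/select"


def build_facet_url(base_url, facet_field, method=None, per_field=False):
    """Build a facet query URL; `method` is fc, fcs or enum, or None for the default."""
    params = [
        ("q", "*:*"),
        ("fq", "language:english"),
        ("fq", 'search_source:"Video"'),
        ("rows", "1"),
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", facet_field),
        ("facet.mincount", "1"),
        ("facet.limit", "25"),
    ]
    if method is not None:
        # Per-field form applies the method to one facet field only.
        key = f"f.{facet_field}.facet.method" if per_field else "facet.method"
        params.append((key, method))
    # urlencode percent-escapes values and joins the pairs with '&'.
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    url = build_facet_url(BASE_URL, "url", method="enum")
    print(url)
    print(parse_qs(urlsplit(url).query)["facet.method"])
```

Passing per_field=True scopes the method to the url field, which matters once a request facets on several fields with different cardinalities.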
Re: facet.method enum vs fc
Does Solr 3.6 have facet.method=fcs? I tried it anyway, and got:

ERROR 500: GC overhead limit exceeded java.lang.OutOfMemoryError: GC overhead limit exceeded.

On Wed, Apr 17, 2013 at 12:38 PM, Timothy Potter thelabd...@gmail.com wrote:

> What are your results when using facet.method=fcs?
Re: facet.method: enum vs. fc
Yep, that was probably the best choice.

It's a classic time/space tradeoff. The enum method creates a bitset for *each* unique facet value. The bitset is (maxdocs / 8) bytes in size (I'm ignoring some overhead here). So if your facet field has 10 unique values and 8M documents, you'll use up 10M bytes or so; 20 unique values will use up 20M bytes, and so on. But this is very, very fast.

fc, on the other hand, eats up cache for storing the string value for each unique value, plus various counter arrays (several bytes/doc). For most cases, it will use less memory than enum, but will be slower.

I'd stick with fc for the time being and think about enum if (1) you have a good idea of what the number of unique terms is or (2) you start to need to finely tune your speed.

HTH
Erick

On Mon, Oct 11, 2010 at 11:30 AM, Paolo Castagna castagna.li...@googlemail.com wrote:

> Hi,
> I am using Solr v1.4 and I am not sure which facet.method I should use.
>
> What should I use if I do not know in advance if the number of values
> for a given field will be high or low?
>
> What are the pros/cons of using facet.method=enum vs. facet.method=fc?
>
> When should I use enum vs. fc?
>
> I have found some comments and suggestions here:
>
> "enum enumerates all terms in a field, calculating the set intersection
> of documents that match the term with documents that match the query.
> This was the default (and only) method for faceting multi-valued fields
> prior to Solr 1.4.
> fc (stands for field cache), the facet counts are calculated by
> iterating over documents that match the query and summing the terms
> that appear in each document. This was the default method for single
> valued fields prior to Solr 1.4.
> The default value is fc (except for BoolField) since it tends to use
> less memory and is faster when a field has many unique terms in the
> index."
> -- http://wiki.apache.org/solr/SimpleFacetParameters#facet.method
>
> "facet.method=enum [...] this is excellent for fields where there is a
> small set of distinct values. The average number of values per document
> does not matter.
> facet.method=fc [...] this is excellent for situations where the number
> of indexed values for the field is high, but the number of values per
> document is low. For multi-valued fields, a hybrid approach is used that
> uses term filters from the filterCache for terms that match many
> documents."
> -- http://wiki.apache.org/solr/SolrFacetingOverview
>
> "If you are faceting on a field that you know only has a small number
> of values (say less than 50), then it is advisable to explicitly set
> this to enum. When faceting on multiple fields, remember to set this for
> the specific fields desired and not universally for all facets. The
> request handler configuration is a good place to put this."
> -- Book: Solr 1.4 Enterprise Search Server, p. 148
>
> This is the part of the Solr code which deals with the facet.method
> parameter:
>
>   if (enumMethod) {
>     counts = getFacetTermEnumCounts([...]);
>   } else {
>     if (multiToken) {
>       UnInvertedField uif = [...]
>       counts = uif.getCounts([...]);
>     } else {
>       [...]
>       if (per_segment) {
>         [...]
>         counts = ps.getFacetCounts([...]);
>       } else {
>         counts = getFieldCacheCounts([...]);
>       }
>     }
>   }
>
> -- https://svn.apache.org/repos/asf/lucene/dev/trunk/solr/src/java/org/apache/solr/request/SimpleFacets.java
>
> See also:
>
> - http://stackoverflow.com/questions/2902680/how-well-does-solr-scale-over-large-number-of-facet-values
>
> In the end, since I do not know in advance the number of different
> values for my fields, I went for facet.method=fc. Does this seem
> reasonable to you?
>
> Thank you,
> Paolo
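Erick's back-of-the-envelope estimate for enum (one bitset of maxdocs / 8 bytes per unique facet value) can be written out as a small calculation. This is a rough sketch under that same simplification: the function name is made up for the example, and it ignores per-object overhead and any more compact representation used for sparse doc sets.

```python
def enum_bitset_bytes(max_docs, unique_values):
    """Approximate bytes needed to hold one full bitset per unique facet value."""
    bytes_per_bitset = (max_docs + 7) // 8  # one bit per document, rounded up
    return bytes_per_bitset * unique_values


if __name__ == "__main__":
    # The figures from the message above: 8M documents.
    print(enum_bitset_bytes(8_000_000, 10))   # 10 unique values
    print(enum_bitset_bytes(8_000_000, 20))   # 20 unique values
```

With 8M documents this gives 10,000,000 bytes for 10 unique values and 20,000,000 bytes for 20, matching the numbers in the message. The cost grows linearly with the number of unique values, which is why the estimate only looks attractive for low-cardinality fields.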
Re: facet.method: enum vs. fc
Thank you Erick, your explanation was helpful. I'll stick with fc and come back to this later if I need further tuning.

Paolo