Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.
I built a fresh server with Elasticsearch v0.90.9 and all my problems went away without making a single change to my index or code. It's running on a much smaller server (CPU and memory) than my previous install, with no problems. The new install even loaded an index from Amazon S3 that was originally created by a server running 0.20.x. This is great; Elasticsearch impresses and surprises me again.

I will probably still investigate converting the text I'm faceting on to integer IDs that I'll convert back to strings for users on the app end of things. It sounds like this will further increase performance.

On Thursday, December 19, 2013 7:00:13 AM UTC-8, Alexander Reelsen wrote:
> Hey,
>
> can you test with a more recent version of Elasticsearch first? There were
> some dramatic improvements regarding faceting.
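The integer-ID conversion described above could be handled entirely client-side. A minimal Python sketch, assuming a taxonomy that is known in advance; the lookup table and both function names are hypothetical, not from this thread:

```python
# Hypothetical client-side lookup: map facet strings to integer IDs before
# indexing, and map the IDs in facet results back to display strings.
INDUSTRY_IDS = {
    "software": 1,
    "manufacturing": 2,
    "retail": 3,
}
ID_TO_INDUSTRY = {v: k for k, v in INDUSTRY_IDS.items()}

def encode_industries(names):
    """Replace industry strings with integer IDs for indexing."""
    return [INDUSTRY_IDS[n.lower()] for n in names]

def decode_facet_terms(terms):
    """Map facet entries ({'term': id, 'count': n}) back to display strings."""
    return [(ID_TO_INDUSTRY[t["term"]], t["count"]) for t in terms]

encoded = encode_industries(["Software", "Retail"])       # [1, 3]
decoded = decode_facet_terms([{"term": 1, "count": 42}])  # [("software", 42)]
```

The app would index the integer list in place of the strings and run the terms facet on the integer field, translating only the handful of returned terms back to text.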
Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.
I completely agree about upgrading. Elasticsearch 0.90 introduced numerous memory improvements, including one that directly affects you: with the previous versions (0.20 and earlier), high-cardinality fields, such as your industries field, would use inefficient data structures to load the faceted values. Your situation would be greatly improved with 0.90+. Easily.

On your last point, I would use numeric values whenever possible. My taxonomy is known in advance, so I do client-side lookups for the numeric values. I do not have statistics for how much of an improvement it is, but you can do the simple math of how much can be saved. In Java, ints are 4 bytes, while each character in UTF-8 can take 1 to 4 bytes. It all adds up.

Cheers,

Ivan

On Thu, Dec 19, 2013 at 7:00 AM, Alexander Reelsen wrote:
> Hey,
>
> can you test with a more recent version of Elasticsearch first? There were
> some dramatic improvements regarding faceting.
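The "simple math" Ivan mentions can be sketched in a few lines. Every figure below (the sample value, document count, and values per document) is an illustrative assumption, not a measurement from this thread:

```python
# Rough per-value comparison: a UTF-8 string term vs. a 4-byte Java int.
value = "business consulting services"        # an assumed industry value (ASCII)
string_bytes = len(value.encode("utf-8"))     # 1 byte per ASCII character -> 28
int_bytes = 4                                 # size of a Java int
savings_per_value = string_bytes - int_bytes  # 24 bytes saved per stored value

# Scaled across a large index, the difference adds up:
num_docs = 1_000_000          # assumed index size
avg_values_per_doc = 5        # assumed values per document
total_saved = savings_per_value * num_docs * avg_values_per_doc
print(f"~{total_saved / 1024 ** 2:.0f} MB saved")  # ~114 MB
```

Since faceting loads field values into memory (the fielddata Alexander mentions), per-value savings of this kind translate fairly directly into heap headroom.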
Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.
Hey,

can you test with a more recent version of Elasticsearch first? There were some dramatic improvements regarding faceting.

Also, you should explain your setup a bit more. Faceting can need a lot of memory with lots of documents, as it uses so-called fielddata, so you should configure and monitor Elasticsearch appropriately.

See

http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html#field-data

--Alex

On Wed, Dec 18, 2013 at 10:51 PM, Brian Jones wrote:
> I'm using the Terms Facet with Elasticsearch v0.20.2. The server has 8 x
> Intel Xeon E5-2680 v2 processors and 15GB of memory.
Terms Facet never returns a response when the number of documents increases past tens of thousands.
I'm using the Terms Facet with Elasticsearch v0.20.2. The server has 8 x Intel Xeon E5-2680 v2 processors and 15GB of memory.

My Terms Facet queries work great as long as the number of documents in the index is small (e.g. fewer than 20,000). When the system grows past that, into the hundreds of thousands or millions of documents, my Terms Facets never return results. Watching the server, I initially see a few Java processes using a lot of CPU, but within a few seconds this is reduced to a half dozen processes each using ~2% CPU. I never see memory usage increase on the server as a result of these queries. When these queries fail to return results, they also sometimes seem to "freeze" Elasticsearch, and I often have to restart the ES server or even reboot the physical server to get ES back online for other simple queries.

The fields I'm trying to facet on exist for nearly every document and can have anywhere from 0 to hundreds of different values across the dataset. All values are text strings, and I'm using a custom analyzer that reduces them to lowercase. I realize that increasing the number of potential values in a field will dramatically increase the resources needed for the Terms Facet query. Even so, I would expect some of the smaller fields to work fine at scale with millions of documents.

Questions:

1.) My test field (industries) can have no more than 32 unique values. Each document could have none or all 32 values. Each value can be from 10 to 100 characters of text. This Terms Facet never returns a result at scale. Any thoughts on what is happening? Is my setup flawed?

2.) Will I ever be able to run a facet on a field that can have millions of unique text values? I have some data analysis cases like this where I'd like to use Elasticsearch faceting.

3.) Would reducing the fields I'm faceting on to integers (and then translating back to text outside ES) make a big difference in performance and required resources?
Test Query:

curl -X POST "http://remote_host:9200/companies/company/_search?pretty=true" -d '
{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "industries" : {
      "terms" : {
        "field" : "industries.term.keyword_lowercase",
        "size" : 100
      }
    }
  },
  "size" : 0
}
'

Index Configuration:

{
  "index" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1,
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"]
        },
        "html_strip" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"],
          "char_filter" : "html_strip"
        },
        "keyword_lowercase" : {
          "tokenizer" : "keyword",
          "filter" : "lowercase"
        }
      }
    }
  }
}

Company Document Mapping:

** I've removed irrelevant fields

{
  "company" : {
    "type" : "object",
    "include_in_all" : false,
    "path" : "full",
    "dynamic" : "strict",
    "properties" : {
      "name" : {
        "type" : "multi_field",
        "fields" : {
          "name" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 10.0 },
          "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase", "include_in_all" : false }
        }
      },
      "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 6.0 },
      "industries" : {
        "type" : "nested",
        "include_in_root" : true,
        "properties" : {
          "term" : {
            "type" : "multi_field",
            "fields" : {
              "term" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 3.0 },
              "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase" }
            }
          },
          "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true },
          "score" : { "type" : "integer" },
          "verified" : { "type" : "boolean" }
        }
      }
    }
  }
}

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3e608b31-8569-49d3-b9fa-20d3a1e4a597%40googlegroups.com. For more options, visit https://groups.google.com/groups/opt_out.
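A note on the keyword_lowercase analyzer in the configuration above: it keeps each field value as a single token, which is why the facet returns whole lowercased phrases rather than individual words. A Python sketch of the equivalent behavior (the sample value is made up):

```python
def keyword_lowercase(value: str) -> list[str]:
    """Mimic the keyword tokenizer + lowercase filter: the entire input
    becomes one lowercased token; nothing is split on whitespace."""
    return [value.lower()]

print(keyword_lowercase("Software & IT Services"))  # ['software & it services']
```

This is also why each unique industry phrase counts as one distinct term for fielddata purposes, at its full string length.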