Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.

2014-01-06 Thread Brian Jones
I built a fresh server with Elasticsearch v0.90.9 and all my problems went 
away without making a single change to my index or code.  It's running on a 
much smaller server ( CPU and memory ) than my previous install as well, 
with no problems.

The new install even loaded an index from Amazon S3 that was originally 
created by a server running 0.20.x.

This is great.  Elasticsearch impresses and surprises me again.

I will probably still investigate converting the text I'm faceting on to 
integer IDs that I'll convert back to strings for users on the app end of 
things.  It sounds like this will further increase performance.
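For anyone curious, that conversion can be sketched as a plain client-side lookup table. This is a hypothetical example (the industry names and the function names are made up, not from my real taxonomy): index the integer IDs, facet on them, and translate back in the app.

```python
# Client-side mapping between facet strings and compact integer IDs.
# The integer IDs are what gets indexed and faceted on in Elasticsearch;
# the application translates them back to display strings.

INDUSTRIES = [
    "software development",
    "financial services",
    "health care",
]

# Forward and reverse lookup tables.
industry_to_id = {name: i for i, name in enumerate(INDUSTRIES)}
id_to_industry = {i: name for i, name in enumerate(INDUSTRIES)}

def encode(names):
    """Convert display strings to the integer IDs stored in the index."""
    return [industry_to_id[n] for n in names]

def decode(ids):
    """Convert facet results (integer IDs) back to display strings."""
    return [id_to_industry[i] for i in ids]

print(encode(["health care", "software development"]))  # [2, 0]
print(decode([1]))  # ['financial services']
```

The only real constraint is that the table has to stay stable across reindexes, since the IDs are baked into the documents.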






Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.

2013-12-19 Thread Ivan Brusic
I completely agree about upgrading. Elasticsearch 0.90 introduced numerous
memory improvements, including one that directly affects you. In the
previous versions (0.20 and prior), high-cardinality fields, such as your
industry field, would use inefficient data structures to load the faceted
values. Your situation should improve greatly with 0.90+.

In terms of your last point, I would use numerical values whenever
possible. My taxonomy is known in advance, so I do client-side lookups for
the numerical values. I do not have statistics on how much of an improvement
it is, but you can do the simple math of how much can be saved: in Java, an
int is 4 bytes, while each character in a UTF-8 string can take 1 to 4
bytes. It all adds up.
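The back-of-the-envelope math looks something like this (the strings are made-up illustrative values, not real taxonomy entries):

```python
# Rough per-value size comparison: a 4-byte int vs. a UTF-8 string.

INT_BYTES = 4  # a Java int is always 4 bytes

values = ["financial services", "health care", "software development"]

for v in values:
    utf8_bytes = len(v.encode("utf-8"))
    print(f"{v!r}: {utf8_bytes} bytes as UTF-8 vs {INT_BYTES} bytes as int")

# Multiplied across millions of documents, the difference adds up quickly.
total_string = sum(len(v.encode("utf-8")) for v in values)
total_int = INT_BYTES * len(values)
print(f"total: {total_string} bytes as strings vs {total_int} bytes as ints")
```

For ASCII values the ratio is roughly string-length-over-four, before even counting the per-entry overhead of the field data structures themselves.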

Cheers,

Ivan



Re: Terms Facet never returns a response when the number of documents increases past tens of thousands.

2013-12-19 Thread Alexander Reelsen
Hey,

can you test with a more recent version of Elasticsearch first? There were
some dramatic improvements regarding faceting.
Also, you should explain your setup a bit more. Faceting can need a lot of
memory with lots of documents because it uses so-called fielddata, so you
should configure and monitor Elasticsearch appropriately.

See
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-stats.html#field-data
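As a concrete starting point, something along these lines (a sketch only; the heap size and host are placeholders you would adjust for your machine, per the setup docs above):

```shell
# Give the JVM an explicit heap before starting Elasticsearch
# (commonly around half the machine's RAM), and never let it swap.
export ES_HEAP_SIZE=8g

# After running the problem query, pull node stats and look at the
# "fielddata" section to see how much memory faceting has loaded.
curl 'http://localhost:9200/_nodes/stats?pretty'
```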


--Alex



Terms Facet never returns a response when the number of documents increases past tens of thousands.

2013-12-18 Thread Brian Jones
I'm using the Terms Facet with Elasticsearch v0.20.2.  The server has 8 x 
Intel Xeon E5-2680 v2 processors and 15GB of memory.

My Terms Facet queries work great as long as the number of documents in the 
index is small ( e.g. fewer than 20,000 ).  When the system grows past that, 
pushing into the hundreds of thousands or millions of documents, my Terms 
Facets never return results.  Watching the server, I initially see a few 
Java processes using a lot of CPU, but within a few seconds this is 
reduced to a half dozen processes each using ~2% CPU.  I never see memory 
usage increase on the server as a result of these queries.  When these 
queries fail to return results, they also sometimes seem to "freeze" 
Elasticsearch, and I often have to restart the ES server or even reboot the 
physical machine to get ES back online for other simple queries.

The fields I'm trying to facet on exist for nearly every document and can have 
anywhere from zero to hundreds of different values across the dataset.  All 
values are text strings, and I'm using a custom analyzer that reduces them 
to lowercase.  I realize that increasing the number of potential values in 
a field dramatically increases the resources needed for the Terms Facet 
query.  Even so, I would expect some of the smaller fields to work 
fine at scale with millions of documents.
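For context, the custom analyzer behaves roughly like this Python sketch (an illustrative model, not Elasticsearch code): the keyword tokenizer keeps each field value as one token, and the lowercase filter normalizes its case.

```python
# Rough model of a "keyword" tokenizer + "lowercase" filter:
# the whole field value stays a single token; only case changes.

def keyword_lowercase(value: str) -> str:
    return value.lower()

print(keyword_lowercase("Financial Services"))  # financial services
```

So the facet counts whole lowercase strings, not individual words.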



Questions:

1.) My test field ( industries ) can have no more than 32 unique values. 
 Each document could have anywhere from none to all 32 values.  Each value 
can be from 10 to 100 characters of text.  This Terms Facet never returns a 
result at scale.  Any thoughts on what is happening?  Is my setup flawed?

2.) Will I ever be able to run a facet on a field that can have millions of 
unique text values?  I have some data analysis cases like this where I'd 
like to use Elasticsearch faceting.

3.) Would reducing the fields I'm faceting on to integers ( and then 
translating back to text outside ES ) make a big difference in performance 
and required resources?



Test Query:

curl -X POST "http://remote_host:9200/companies/company/_search?pretty=true" -d '
{
  "query" : {
    "match_all" : { }
  },
  "facets" : {
    "industries" : {
      "terms" : {
        "field" : "industries.term.keyword_lowercase",
        "size" : 100
      }
    }
  },
  "size" : 0
}
'




Index Configuration:

{
  "index" : {
    "number_of_shards" : 5,
    "number_of_replicas" : 1,
    "analysis" : {
      "analyzer" : {
        "default" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"]
        },
        "html_strip" : {
          "tokenizer" : "standard",
          "filter" : ["standard", "word_delimiter", "lowercase", "stop"],
          "char_filter" : "html_strip"
        },
        "keyword_lowercase" : {
          "tokenizer" : "keyword",
          "filter" : "lowercase"
        }
      }
    }
  }
}




Company Document Mapping:

** i've removed irrelevant fields

{
  "company" : {
    "type" : "object",
    "include_in_all" : false,
    "path" : "full",
    "dynamic" : "strict",
    "properties" : {
      "name" : {
        "type" : "multi_field",
        "fields" : {
          "name" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 10.0 },
          "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase", "include_in_all" : false }
        }
      },
      "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 6.0 },
      "industries" : {
        "type" : "nested",
        "include_in_root" : true,
        "properties" : {
          "term" : {
            "type" : "multi_field",
            "fields" : {
              "term" : { "type" : "string", "index" : "analyzed", "include_in_all" : true, "boost" : 3.0 },
              "keyword_lowercase" : { "type" : "string", "index" : "analyzed", "analyzer" : "keyword_lowercase" }
            }
          },
          "description" : { "type" : "string", "index" : "analyzed", "include_in_all" : true },
          "score" : { "type" : "integer" },
          "verified" : { "type" : "boolean" }
        }
      }
    }
  }
}

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/3e608b31-8569-49d3-b9fa-20d3a1e4a597%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.