Re: Size of index to use shard
@Erick: Thanks for the detailed explanation. On that note, we have 75GB of *.fdt and *.fdx files out of a 99GB index. Search is still not that fast when the cache sizes are small, but giving the caches more memory led to OOMs. Partitioning into shards is not an option either, since at the moment we are trying to run on as few machines as possible.

@Vadim: Thanks for the info! With a 6GB heap I assume your caches are not that big? We had a filterCache (used heavily compared to the other cache types in both facet and non-facet queries, according to our measurements) on the order of 20 thousand entries with a 22GB heap, and observed OOMs. So we decided to lower the cache parameters substantially.

Dmitry
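For a rough sense of why 20 thousand filterCache entries can exhaust even a 22GB heap on an index this size, here is a back-of-envelope sketch in Python. It assumes the worst case where every cached filter is a full bitset of maxDoc bits (Solr keeps sparse filters in a more compact form, so this is an upper bound), and it borrows the ~200 mil. document figure mentioned further down the thread.

# Worst-case filterCache memory estimate for Solr.
# Assumption: every cached filter is stored as a bitset of maxDoc bits,
# i.e. maxDoc/8 bytes per entry. Sparse filters use compact int sets,
# so this is an upper bound, not an exact number.

def filter_cache_worst_case_bytes(max_doc: int, entries: int) -> int:
    """Upper bound on heap used by a bitset-backed filterCache."""
    bytes_per_entry = max_doc // 8   # one bit per document in the index
    return bytes_per_entry * entries

if __name__ == "__main__":
    max_doc = 200_000_000            # ~200 mil. docs in a ~100GB index
    entries = 20_000                 # filterCache size that triggered the OOMs
    total = filter_cache_worst_case_bytes(max_doc, entries)
    print(f"per entry : {(max_doc // 8) / 2**20:.1f} MB")   # ~23.8 MB
    print(f"worst case: {total / 2**30:.1f} GB")            # ~465 GB, far above a 22GB heap

Even if only a fraction of the entries are dense bitsets, the arithmetic explains why trimming the cache parameters helped more than adding heap would have.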
Re: Size of index to use shard
Hi, it depends on your hardware. Read this: http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/ Think about your cache config (few updates, big caches) and a good hardware infrastructure. In my case I can handle a 250GB index with 100 mil. docs on an i7 machine with RAID10 and 24GB RAM, with query times under 1 sec.

Regards
Vadim
Re: Size of index to use shard
Hi, the article you gave mentions a 13GB index, which is quite small from our perspective. We have noticed that at least Solr 3.4 has some sort of choking point with respect to growing index size: it becomes substantially slower than what we need (a query taking more than 3-4 seconds on average) once the index size crosses a magic level (about 80GB, going by our practical observations). We try to keep our indices at around 60-70GB for fast searches and above 100GB for slow ones. We also route the majority of user queries to the fast indices. Yes, caching may help, but we cannot necessarily afford adding more RAM for the bigger indices. BTW, our documents are very small, so in a 100GB index we can have around 200 mil. documents.

It would be interesting to see how you manage to ensure q-times under 1 sec with an index of 250GB. How many documents / facets do you ask for at a time, max? FYI, we ask for a thousand facets in one go.

Regards,
Dmitry
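As an illustration of what "a thousand facets in one go" looks like as a request, here is a sketch of a plain Solr facet query over HTTP. The host, core layout and field name are placeholders; only the standard facet parameters are assumed.

# Sketch of a Solr facet query asking for up to 1000 facet values at once.
# Host and field name ("category") are hypothetical; the parameters are
# standard Solr facet parameters.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

params = {
    "q": "*:*",
    "rows": 0,                  # only the facet counts are wanted, not documents
    "facet": "true",
    "facet.field": "category",  # hypothetical facet field
    "facet.limit": 1000,        # a thousand facet values in a single request
    "facet.mincount": 1,
    "wt": "json",
}

url = "http://localhost:8983/solr/select?" + urlencode(params)
with urlopen(url) as resp:
    data = json.load(resp)

# Solr returns a flat [value, count, value, count, ...] list per facet field.
counts = data["facet_counts"]["facet_fields"]["category"]
pairs = list(zip(counts[::2], counts[1::2]))
print(f"got {len(pairs)} facet values; top: {pairs[:5]}")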
Re: Size of index to use shard
Apparently it is not so easy to determine when to break the content into pieces. I'll investigate further the number of documents, the size of each document, and what kind of search is being used. It seems I will have to do a load test to identify the cutoff point at which to start using the shard strategy. Thanks
Re: Size of index to use shard
Talking about index size can be very misleading. Take a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names. Note that the *.fdt and *.fdx files are used for stored fields, i.e. the verbatim copy of the data put into the index when you specify stored="true". These files have virtually no impact on search speed. So if your *.fdx and *.fdt files are 90G out of a 100G index, it is a much different thing than if these files are 10G out of a 100G index. And this doesn't even mention the peculiarities of your query mix, nor does it say a thing about whether your cheapest alternative is to add more memory.

Anderson's method is about the only reliable one: you just have to test with your index and real queries. At some point you'll find your tipping point, typically when you come under memory pressure. And it's a balancing act between how much memory you allocate to the JVM and how much you leave for the op system.

Bottom line: no hard and fast numbers. And you should periodically re-test the empirical numbers you *do* arrive at...

Best
Erick
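A quick way to check what share of an index directory is stored-field data (*.fdt/*.fdx) versus the files that actually drive search speed is to sum file sizes by extension. A minimal sketch, with the index path as a placeholder:

# Sum a Lucene/Solr index directory's size by file extension, to see how much
# of it is stored fields (*.fdt/*.fdx) versus everything else.
import os
from collections import defaultdict

def sizes_by_extension(index_dir: str) -> dict:
    totals = defaultdict(int)
    for root, _dirs, files in os.walk(index_dir):
        for name in files:
            ext = os.path.splitext(name)[1] or name   # '.fdt', '.tis', ... or 'segments_2'
            totals[ext] += os.path.getsize(os.path.join(root, name))
    return totals

if __name__ == "__main__":
    index_dir = "/path/to/solr/data/index"            # placeholder path
    totals = sizes_by_extension(index_dir)
    total_bytes = sum(totals.values()) or 1
    stored = totals.get(".fdt", 0) + totals.get(".fdx", 0)
    print(f"stored fields: {stored / 2**30:.1f} GB of {total_bytes / 2**30:.1f} GB "
          f"({100.0 * stored / total_bytes:.0f}%)")
    for ext, size in sorted(totals.items(), key=lambda kv: -kv[1]):
        print(f"{ext:12s} {size / 2**30:6.2f} GB")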
Re: Size of index to use shard
Thanks for the explanation, Erick :)
Re: Size of index to use shard
@Erick: thanks :) I'm with you on that; my load tests show the same.

@Dmitry: my docs are small too, I think about 3-15KB per doc. I update my index all the time and I have an average of 20-50 requests per minute (20% facet queries, 80% large boolean queries with wildcard/fuzzy). How many docs at a time? It depends on the chosen filters, from 10 up to all 100 mil. I work with very small caches (strangely, when my index is under 100GB I need larger caches, over 100GB smaller caches...). My JVM has 6GB, with 18GB left for I/O. With only a few updates a day I would configure very big caches, like Tom Burton-West (see HathiTrust's blog).

Regards
Vadim
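The 6GB heap / 18GB "for I/O" split is the balancing act Erick mentions: whatever RAM the JVM does not claim stays available to the operating system's page cache for the index files. A trivial budgeting sketch using the 24GB / 250GB figures from this thread (treating "RAM minus heap" as page cache is a simplification, since the OS and other processes also take a share):

# Rough RAM budgeting for a single Solr node: RAM not claimed by the JVM heap
# is left for the OS page cache, which is what keeps index reads fast.
def ram_budget(total_ram_gb: float, jvm_heap_gb: float, index_size_gb: float):
    page_cache_gb = total_ram_gb - jvm_heap_gb        # simplification: ignores OS overhead
    cached_fraction = page_cache_gb / index_size_gb   # share of the index that can stay in RAM
    return page_cache_gb, cached_fraction

page_cache, fraction = ram_budget(total_ram_gb=24, jvm_heap_gb=6, index_size_gb=250)
print(f"page cache ~= {page_cache:.0f} GB, covering ~= {fraction:.0%} of the index")
# page cache ~= 18 GB, covering ~= 7% of the index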
Size of index to use shard
Hi, is there some index size (or number of docs) at which it becomes necessary to break the index into shards? I have an index of 100GB that grows by about 10GB per year (I don't have information on how many docs it holds), and the docs will never be deleted. Thinking 30 years ahead, the index will be about 400GB in size. I think it is not required to break it into shards, because I don't consider this a large index. Am I correct? What is a really large index?

Thanks
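The 400GB figure is a simple linear projection; for completeness, the arithmetic:

# Linear growth projection behind the 400GB estimate in the question.
def projected_index_size(current_gb: float, growth_gb_per_year: float, years: int) -> float:
    return current_gb + growth_gb_per_year * years

print(projected_index_size(current_gb=100, growth_gb_per_year=10, years=30))  # 400.0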