Re: Size of index to use shard

2012-01-26 Thread Dmitry Kan
@Erick:
Thanks for the detailed explanation. On this note, we have 75GB of *.fdt
and *.fdx files out of a 99GB index. The search is still not that fast if the
cache size is small, but giving the caches more memory led to OOMs.
Partitioning into shards is not an option either, as at the moment we try to
run as few machines as possible.

@Vadim:
Thanks for the info! With a 6GB heap I assume your caches are not that big?
We had a filterCache (used heavily compared to the other cache types in both
facet and non-facet queries, according to our measurements) on the order of
20 thousand entries with a 22GB heap and observed OOMs, so we decided to
lower the cache parameters substantially.
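
Roughly, the change amounts to shrinking the filterCache entry in
solrconfig.xml; the values below are only illustrative, not the exact
settings we ended up with:

  <query>
    <!-- much smaller filterCache than the original ~20k entries,
         to reduce heap pressure; the sizes here are placeholders -->
    <filterCache class="solr.FastLRUCache"
                 size="512"
                 initialSize="512"
                 autowarmCount="128"/>
  </query>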

Dmitry

Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
Hi,
it depends on your hardware.
Read this:
http://www.derivante.com/2009/05/05/solr-performance-benchmarks-single-vs-multi-core-index-shards/
Think about your cache config (few updates, big caches) and good
HW infrastructure.
In my case I can handle a 250GB index with 100 million docs on an i7
machine with RAID10 and 24GB RAM, with q-times under 1 sec.
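
For a rarely updated index, "big caches" in solrconfig.xml would look
something like this sketch (the cache classes are standard Solr ones, the
sizes are only placeholders, not tuned values):

  <query>
    <queryResultCache class="solr.LRUCache"
                      size="16384" initialSize="4096" autowarmCount="4096"/>
    <documentCache    class="solr.LRUCache"
                      size="16384" initialSize="4096"/>
  </query>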
Regards
Vadim



Re: Size of index to use shard

2012-01-24 Thread Dmitry Kan
Hi,

The article you gave mentions a 13GB index. That is quite small from our
perspective. We have noticed that at least Solr 3.4 has some sort of choking
point with respect to growing index size: it just becomes substantially
slower than what we need (a query taking more than 3-4 seconds on average)
once the index size crosses a magic level (about 80GB, based on our
practical observations). We try to keep our indices at around 60-70GB for
fast searches and above 100GB for slow ones, and we route the majority of
user queries to the fast indices. Yes, caching may help, but we cannot
necessarily afford adding more RAM for bigger indices. BTW, our documents
are very small, so in a 100GB index we can have around 200 million
documents. It would be interesting to see how you manage to ensure q-times
under 1 sec with an index of 250GB. How many documents / facets do you ask
for at most at a time? FYI, we ask for a thousand facets in one go.
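
If that means up to a thousand values per facet field, the request is
roughly of this shape (the field name is made up, and the URL is split over
lines for readability):

  http://localhost:8983/solr/select?q=*:*&facet=true
      &facet.field=category
      &facet.limit=1000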

Regards,
Dmitry

Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Apparently it is not so easy to determine when to break the content into
pieces. I'll investigate further the number of documents, the size of each
document, and what kind of search is being used. It seems I will have to do
a load test to identify the cutoff point at which to start using shards.
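
On the query side, sharding in Solr 3.x amounts to adding a shards
parameter that lists the cores to fan out to; a sketch with made-up host
names (the application still has to decide at index time which documents go
to which shard):

  http://host1:8983/solr/select?q=some+query
      &shards=host1:8983/solr,host2:8983/solr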

Thanks

Re: Size of index to use shard

2012-01-24 Thread Erick Erickson
Talking about index size can be very misleading. Take
a look at http://lucene.apache.org/java/3_5_0/fileformats.html#file-names.
Note that the *.fdt and *.fdx files are used for stored fields, i.e.
the verbatim copy of the data you put in the index when you specify
stored="true". These files have virtually no impact on search
speed.
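
As an illustration (the field names and types are made up, not from anyone's
actual schema), the difference is just the stored attribute on a field in
schema.xml:

  <!-- searchable but not stored: does not add to *.fdt/*.fdx -->
  <field name="body"  type="text" indexed="true" stored="false"/>

  <!-- stored verbatim for retrieval: this is what fills *.fdt/*.fdx -->
  <field name="title" type="text" indexed="true" stored="true"/>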

So, if your *.fdx and *.fdt files are 90G out of a 100G index
it is a much different thing than if these files are 10G out of
a 100G index.

And this doesn't even mention the peculiarities of your query mix.
Nor does it say a thing about whether your cheapest alternative
is to add more memory.

Anderson's method is about the only reliable one: you just have
to test with your index and real queries. At some point you'll
find your tipping point, typically when you come under memory
pressure. And it's a balancing act between how much memory
you allocate to the JVM and how much you leave for the operating
system.

Bottom line: No hard and fast numbers. And you should periodically
re-test the empirical numbers you *do* arrive at...

Best
Erick

Re: Size of index to use shard

2012-01-24 Thread Anderson vasconcelos
Thanks for the explanation Erick :)

Re: Size of index to use shard

2012-01-24 Thread Vadim Kisselmann
@Erick
Thanks :)
I agree with you; my load tests show the same.

@Dmitry
My docs are small too, roughly 3-15KB per doc.
I update my index all the time and have an average of 20-50 requests
per minute (20% facet queries, 80% large boolean queries with
wildcard/fuzzy). How many docs at a time depends on the chosen
filters, from 10 up to all 100 million.
I work with very small caches (strangely, if my index is under
100GB I need larger caches; over 100GB, smaller caches..).
My JVM has 6GB; 18GB is left for I/O (the OS cache).
With only a few updates a day I would configure very big caches, like Tom
Burton-West (see HathiTrust's blog).
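
Concretely, that split is just a matter of capping the JVM heap when
starting Solr and leaving the rest of the RAM to the OS page cache; with the
example Jetty launcher it would look something like this (flags and launcher
are illustrative, adjust to your own setup):

  java -Xms6g -Xmx6g -jar start.jar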

Regards Vadim



Size of index to use shard

2012-01-23 Thread Anderson vasconcelos
Hi
Is there some index size (or number of docs) at which it becomes necessary
to break the index into shards?
I have an index of 100GB. This index grows by 10GB per year
(I don't have information on how many docs it has) and the docs will never
be deleted. Thinking 30 years ahead, the index will be 400GB in size.

I think breaking it into shards is not required, because I don't consider
this a large index. Am I correct? What is a real large index?


Thanks