On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
Hi Shawn;
You say that:
*... your documents are about 50KB each. That would translate to an index
that's at least 25GB*
I know we cannot give an exact size, but in your experience, what is the
approximate ratio of document size to index size?
If you store the fields, that takes the actual size of the data plus a
small amount of overhead. Starting with Solr 4.1, stored fields are
compressed; I believe it uses LZ4 compression. Some people store all
fields, some store only a few or just one - an ID field. The size of
stored fields does have an impact on how much OS disk cache you need,
but not as much as the other parts of an index.
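A rough back-of-envelope for the stored-fields part can be sketched like
this; the compression ratio and per-document overhead are illustrative
assumptions, not measured Solr figures:

```python
# Rough estimate of on-disk size for stored fields. The 0.5 compression
# ratio and 64-byte per-document overhead are assumed for illustration,
# not measured from Solr's LZ4-compressed stored-field format.

def estimated_stored_size(num_docs, avg_stored_bytes_per_doc,
                          compression_ratio=0.5, overhead_bytes_per_doc=64):
    """Return an estimated on-disk size in bytes for stored fields."""
    return num_docs * (avg_stored_bytes_per_doc * compression_ratio
                       + overhead_bytes_per_doc)

# 500,000 documents at ~50 KB of stored data each:
print(estimated_stored_size(500_000, 50_000))  # -> 12532000000.0 (~12.5 GB)
```

Plug in your own compression ratio once you have a test index to measure.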
It's been my experience that termvectors take up almost as much space as
stored data for the same fields, and sometimes more. Starting with Solr
4.2, termvectors are also compressed.
Adding docValues (new in 4.2) to the schema will also make the index
larger. The requirements here are similar to stored fields. I do not
know whether this data gets compressed, but I don't think it does.
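For reference, all three of these options are switched on per field in
schema.xml; this is a minimal fragment with hypothetical field names and
types, just to show where each size-adding attribute lives:

```xml
<!-- Sketch of a schema.xml fragment; field names and types are
     hypothetical. Each attribute below adds its own on-disk structure. -->
<field name="id"    type="string"       indexed="true" stored="true"/>
<field name="title" type="text_general" indexed="true" stored="true"
       termVectors="true"/>   <!-- term vectors: roughly stored-field size -->
<field name="price" type="tfloat"       indexed="true" stored="false"
       docValues="true"/>     <!-- docValues: similar to stored fields -->
```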
As for the indexed data, this is where I am less clear about the storage
ratios, but I think you can count on it needing almost as much space as
the original data. If the schema uses types or filters that produce a
lot of information, the indexed data might be larger than the original
input. Examples of data explosions in a schema: trie fields with a
non-zero precisionStep, the EdgeNGram filter, and the shingle filter.
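To make the edge n-gram case concrete, here is a sketch of the term
expansion; it is an illustration of the idea, not Solr's EdgeNGramFilter
API (which is configured in the analyzer chain, not called directly):

```python
# One input token produces one indexed term per prefix length, so the
# indexed data can grow well past the original input. Illustration only;
# Solr's EdgeNGramFilter is configured in schema.xml, not called like this.

def edge_ngrams(token, min_gram=1, max_gram=None):
    """Return the edge n-grams (prefixes) that would be indexed for a token."""
    if max_gram is None:
        max_gram = len(token)
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("solr"))  # -> ['s', 'so', 'sol', 'solr']
```

A four-character token becomes four terms here; multiply that across every
token in every document and the indexed data easily exceeds the input.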
Thanks,
Shawn