On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
Hi Shawn;

You say that:

*... your documents are about 50KB each.  That would translate to an index
that's at least 25GB*

I know we cannot give an exact size, but what is the approximate ratio of
document size to index size, in your experience?

If you store the fields, that is the actual size plus a small amount of overhead. Starting with Solr 4.1, stored fields are compressed; I believe it uses LZ4 compression. Some people store all fields, others store only a few, or just an ID field. The size of the stored fields does have an impact on how much OS disk cache you need, but not as much as the other parts of the index.
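As a rough illustration, here is a hypothetical schema.xml fragment (the field names are made up, not from any real schema) showing the common pattern of storing only an ID field while indexing the rest:

```xml
<!-- Hypothetical schema.xml fragment: store only the id, index the body. -->
<!-- stored="true" data is what gets LZ4-compressed as of Solr 4.1. -->
<field name="id"   type="string"       indexed="true" stored="true" required="true"/>
<field name="body" type="text_general" indexed="true" stored="false"/>
```

With stored="false" on the large field, the stored-fields portion of the index stays small regardless of document size.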

It's been my experience that termvectors take up almost as much space as stored data for the same fields, and sometimes more. Starting with Solr 4.2, termvectors are also compressed.

Adding docValues (new in 4.2) to the schema will also make the index larger. The space requirements here are similar to those for stored fields. I do not know whether this data gets compressed, but I don't think it does.
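For reference, docValues is enabled per-field in the schema; a hypothetical example (field name invented for illustration) for a field used in sorting or faceting:

```xml
<!-- Hypothetical example: column-oriented docValues for sorting/faceting. -->
<field name="price" type="tfloat" indexed="true" stored="false" docValues="true"/>
```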

As for the indexed data, this is where I am less clear about the storage ratios, but I think you can count on it needing almost as much space as the original data. If the schema uses field types or filters that produce a lot of extra terms, the indexed data can end up larger than the original input. Examples of data explosions in a schema: trie fields with a non-zero precisionStep, the EdgeNGram filter, the shingle filter.
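To see why an EdgeNGram filter inflates the index, here is a small Python sketch (not Solr code, just an approximation of what EdgeNGramFilterFactory emits) showing how one input token becomes many indexed terms:

```python
# Rough sketch of edge n-gram expansion: one token in, many terms out.
# min_gram/max_gram mirror the factory's minGramSize/maxGramSize settings.
def edge_ngrams(token, min_gram=2, max_gram=10):
    """Return the leading-edge n-grams of a token, like EdgeNGramFilterFactory."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

terms = edge_ngrams("search")
print(terms)       # ['se', 'sea', 'sear', 'searc', 'search']
print(len(terms))  # 5 terms indexed instead of 1
```

Every token in every document multiplies like this, which is why such filters can make the indexed data larger than the source.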

Thanks,
Shawn
