Manuel Le Normand, I am sorry, but I want to learn something. You said you
have 40 dedicated servers. What are your total document count, total
document size, and total shard size?

2013/4/11 Manuel Le Normand <manuel.lenorm...@gmail.com>

> Hi,
> We have different working hours, sorry for the reply delay. Your assumed
> numbers are right: about 25-30KB per doc, giving a total of 15GB per
> shard, with two shards per server (+2 slaves that should normally do no
> work).
> An average query has about 30 conditions (mixed OR and AND), most of
> them textual and a small part on dateTime fields. They are simple
> queries only (no facets, filters, etc.), as the set is taken from the
> actual query log of my enterprise, which runs an old search engine.
>
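> For illustration only (field names invented here), a query of that
> shape might look like:
>
>   q=(body:alpha OR body:beta) AND (title:gamma OR body:delta) AND
>     date:[2012-01-01T00:00:00Z TO 2013-01-01T00:00:00Z] ...
>
> with about 30 such clauses in total.
>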
> As we said, if the shards in collection1 and collection2 each have the
> same number of docs (and the same RAM & CPU per shard), it is apparently
> not a slow-IO issue, right? So the fact that not all of my index is
> cached doesn't seem to be the bottleneck. Moreover, I do store the
> fields, but my queries request only the IDs and rarely snippets, so I'd
> assume the extra RAM I'd give the OS wouldn't make any difference, as
> these *.fdt files don't need to get cached.
>
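> To be concrete, a typical request has the shape (host and rows value
> made up):
>
>   http://host:8983/solr/collection1/select?q=<clauses>&fl=id&rows=100
>
> so the *.fdt stored-field files are rarely read.
>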
> The conclusion I come to is that the merging step is the problem, and
> the only way to outsmart it is to distribute across far fewer shards,
> meaning I'd be back to a few million docs per shard, where response time
> grows roughly linearly with the number of docs per shard. That should
> improve, though, if I give each server much more RAM.
>
> I'll try tweaking my schema a bit and making better use of the Solr
> caches (filter queries, for example), but something tells me the problem
> might be elsewhere. My main clue is that merging seems like a simple CPU
> task, yet tests show it takes a long time even with a small number of
> responses (and the merge of a few docs is clearly a very short task).
>
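> As a sketch of that idea (field names invented), the dateTime clause
> could move out of q and into a filter query:
>
>   q=(body:alpha OR body:beta ...)
>   fq=date:[2012-01-01T00:00:00Z TO *]
>
> so repeated date ranges get served from the filterCache instead of
> being re-scored on every query.
>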
>
> On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
>
> > On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
> >
> >> Hi Shawn;
> >>
> >> You say that:
> >>
> >> *... your documents are about 50KB each.  That would translate to an
> >> index that's at least 25GB*
> >>
> >> I know we cannot state an exact size, but what is the approximate
> >> ratio of document size to index size, in your experience?
> >>
> >
> > If you store the fields, that is actual size plus a small amount of
> > overhead.  Starting with Solr 4.1, stored fields are compressed.  I
> > believe that it uses LZ4 compression.  Some people store all fields,
> > some people store only a few or one - an ID field.  The size of stored
> > fields does have an impact on how much OS disk cache you need, but not
> > as much as the other parts of an index.
> >
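> > For example, stored data is a per-field switch in schema.xml (field
> > names here are just placeholders):
> >
> >   <field name="id"   type="string" indexed="true" stored="true"/>
> >   <field name="body" type="text_general" indexed="true" stored="false"/>
> >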
> > It's been my experience that termvectors take up almost as much space as
> > stored data for the same fields, and sometimes more.  Starting with Solr
> > 4.2, termvectors are also compressed.
> >
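> > Term vectors are likewise enabled per field, e.g.:
> >
> >   <field name="body" type="text_general" indexed="true" stored="true"
> >          termVectors="true"/>
> >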
> > Adding docValues (new in 4.2) to the schema will also make the index
> > larger.  The requirements here are similar to stored fields.  I do not
> > know whether this data gets compressed, but I don't think it does.
> >
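> > docValues is another per-field attribute (this sketch assumes a plain
> > string field, one of the types that supports it in 4.2):
> >
> >   <field name="manu" type="string" indexed="true" stored="false"
> >          docValues="true"/>
> >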
> > As for the indexed data, this is where I am less clear about the
> > storage ratios, but I think you can count on it needing almost as much
> > space as the original data.  If the schema uses types or filters that
> > produce a lot of information, the indexed data might be larger than the
> > original input.  Examples of data explosions in a schema: trie fields
> > with a non-zero precisionStep, the EdgeNGram filter, and the Shingle
> > filter.
> >
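> > Hypothetical schema.xml snippets for those three cases:
> >
> >   <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8"/>
> >
> >   <filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
> >           maxGramSize="15"/>
> >
> >   <filter class="solr.ShingleFilterFactory" maxShingleSize="3"/>
> >
> > Each indexed value can expand into many terms, which is why the index
> > grows.
> >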
> > Thanks,
> > Shawn
> >
> >
>
