Manuel Le Normand, I am sorry, but I want to learn something. You said you have 40 dedicated servers. What are your total document count, total document size, and total shard size?
2013/4/11 Manuel Le Normand <manuel.lenorm...@gmail.com>

> Hi,
> We have different working hours, sorry for the reply delay. Your assumed
> numbers are right: about 25-30KB per doc, giving a total of 15GB per shard,
> and there are two shards per server (+2 slaves that should normally do no
> work).
>
> An average query has about 30 conditions (OR and AND mixed), most of them
> textual and a small part on dateTime fields. These are only simple queries
> (no facets, filters, etc.), since they are taken from the actual query set
> of my enterprise, which works with an old search engine.
>
> As we said, if the shards in collection1 and collection2 each have the same
> number of docs (and the same RAM & CPU per shard), it is apparently not a
> slow-IO issue, right? So the fact that not all of my index is cached doesn't
> seem to be the bottleneck. Moreover, I do store the fields, but my query set
> requests only the ids and rarely snippets, so I'd assume that the plenty of
> RAM I'd give the OS wouldn't make much difference, as those *.fdt files
> don't need to get cached.
>
> The conclusion I come to is that the merging of per-shard responses is the
> problem, and the only way to outsmart it is to distribute across many fewer
> shards, meaning I'd be back to a few million docs per shard, where query
> time grows roughly linearly with the number of docs per shard. That should
> improve, though, if I give each server much more RAM.
>
> I'll try tweaking my schema a bit and making better use of the Solr caches
> (filter queries, for example), but something tells me the problem might be
> elsewhere. My main clue is that merging seems like a simple CPU task, yet
> tests show that it takes a long time even with a small number of responses
> (and clearly merging a handful of docs should be a very short task).
>
>
> On Wed, Apr 10, 2013 at 2:50 AM, Shawn Heisey <s...@elyograg.org> wrote:
>
> > On 4/9/2013 3:50 PM, Furkan KAMACI wrote:
> >
> >> Hi Shawn;
> >>
> >> You say that:
> >>
> >> *... your documents are about 50KB each. That would translate to an
> >> index that's at least 25GB*
> >>
> >> I know we cannot give an exact size, but what is the approximate ratio
> >> of document size to index size, in your experience?
> >>
> >
> > If you store the fields, that is the actual size plus a small amount of
> > overhead. Starting with Solr 4.1, stored fields are compressed. I believe
> > that it uses LZ4 compression. Some people store all fields, some people
> > store only a few or one - an ID field. The size of stored fields does
> > have an impact on how much OS disk cache you need, but not as much as
> > the other parts of an index.
> >
> > It's been my experience that termvectors take up almost as much space as
> > stored data for the same fields, and sometimes more. Starting with Solr
> > 4.2, termvectors are also compressed.
> >
> > Adding docValues (new in 4.2) to the schema will also make the index
> > larger. The requirements here are similar to stored fields. I do not
> > know whether this data gets compressed, but I don't think it does.
> >
> > As for the indexed data, this is where I am less clear about the storage
> > ratios, but I think you can count on it needing almost as much space as
> > the original data. If the schema uses types or filters that produce a
> > lot of information, the indexed data might be larger than the original
> > input. Examples of data explosions in a schema: trie fields with a
> > non-zero precisionStep, the edgengram filter, the shingle filter.
> >
> > Thanks,
> > Shawn
> >
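
[Editorial note: a rough sketch of the filter-query idea Manuel mentions above. Since his queries only need ids and part of each query is a dateTime condition, that condition can be moved from q into fq so it is served from Solr's filterCache when it repeats. The collection URL and the field names (body, title, timestamp, id) below are hypothetical, not Manuel's actual schema; this is SolrJ 4.x style, not a drop-in for his setup.]

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class FilterQueryExample {
        public static void main(String[] args) throws SolrServerException {
            // Hypothetical Solr node and collection name.
            HttpSolrServer server =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery();
            // Keep the textual conditions (the part that varies per query) in q.
            query.setQuery("body:(alpha OR beta) AND title:gamma");
            // Move the dateTime condition into fq so it can be answered from the
            // filterCache when the same range repeats across queries.
            query.addFilterQuery(
                "timestamp:[2013-01-01T00:00:00Z TO 2013-04-01T00:00:00Z]");
            // The query set only needs ids, so avoid fetching other stored fields.
            query.setFields("id");
            query.setRows(100);

            QueryResponse response = server.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }

The point of the split is that filter queries are cached by their exact string, so this only pays off if the same fq value recurs across queries (e.g. a range rounded to a day rather than the current millisecond).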