As stated previously, using Field Collapsing (the group parameters) tends to 
significantly slow down queries.  In my experience, search response gets even 
worse when:
- Requesting facets, which I do in my query formulation more often than not
- Asking for the facet counts to be computed on the groups via the 
group.facet=true parameter (much worse in some of my use cases that had many 
distinct values for at least one of the facets)
- Queries match many hits, i.e. large individual counts (hundreds of thousands 
or more in our case) and total group counts (in the few thousands)
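For reference, a grouped-and-faceted request of the kind described above could be built roughly like this (a sketch only; the collection and field names productId/category are hypothetical placeholders, while the parameter names come from Solr's Result Grouping API):

```python
from urllib.parse import urlencode

# Build the query string for a grouped, faceted Solr request.
# Field names (productId, category) are illustrative placeholders.
params = urlencode([
    ("q", "*:*"),
    ("group", "true"),
    ("group.field", "productId"),   # field to collapse on
    ("group.ngroups", "true"),      # return the total group count (extra cost)
    ("facet", "true"),
    ("facet.field", "category"),
    ("group.facet", "true"),        # compute facet counts per group -- the slow case above
])
url = "http://localhost:8983/solr/mycollection/select?" + params
print(url)
```

Each of the parameters above (group.ngroups, group.facet, the facets themselves) adds cost on top of the base query, which is why the combinations listed tend to hurt the most.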

As someone else noted, switching to the CollapsingQParserPlugin will likely 
reduce the response time significantly, given its different implementation.  
Using the CollapsingQParserPlugin means that you:

1- Have to change how the query gets created
2- May need to change how you consume the Solr response (depending on what you 
are using today)
3- Will not have the total number of individual hits (the before-collapsing 
count), because the numFound returned with the CollapsingQParserPlugin 
represents the total number of groups (like group.ngroups does)
4- May find that facet value counts are not exact in the 
CollapsingQParserPlugin response
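For comparison, the same collapse is expressed as a filter query rather than a set of group parameters when using the CollapsingQParserPlugin (again a sketch; the field name is a placeholder):

```python
from urllib.parse import urlencode

# The collapse is expressed as a filter query. numFound in the response
# then counts groups, not individual documents (point 3 above).
params = urlencode([
    ("q", "*:*"),
    ("fq", "{!collapse field=productId}"),  # placeholder field name
    ("expand", "true"),  # optionally return the collapsed members per group
])
url = "http://localhost:8983/solr/mycollection/select?" + params
print(url)
```

This is also why point 1 above applies: the grouping moves from dedicated group.* parameters into an fq, so the query-construction code has to change.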

With respect to sharding, there are multiple considerations.  The most 
relevant, given your need for grouping, is to implement custom routing of 
documents to shards so that all members of a group are indexed in the same 
shard, if you can.  Otherwise your grouping across shards will have some 
issues (particularly with counts, I believe).
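With SolrCloud's default compositeId router, that co-location can be achieved by prefixing each document id with its group key; a minimal sketch (the id scheme shown is hypothetical):

```python
# With the compositeId router, an id of the form "groupKey!docId" is
# hashed on groupKey, so all documents sharing a group key land on the
# same shard.
def routed_id(group_key: str, doc_id: str) -> str:
    return f"{group_key}!{doc_id}"

ids = [routed_id("groupA", str(n)) for n in range(3)]
print(ids)  # ['groupA!0', 'groupA!1', 'groupA!2']
```

Documents indexed with these ids will collapse correctly per shard, since no group ever spans two shards.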

CARLOS MAROTO       
http://www.searchtechnologies.com/
M +1 626 354 7750

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Friday, June 19, 2015 12:08 PM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Also, since you are tuning for relative times, you can tune on the smaller 
index.  You will surely want to test at scale, but tuning query, analyzer, or 
schema options is usually easier to do on a smaller index.  If you get a 3x 
improvement at small scale, it may only be 2.5x at full scale.

E.g., storing the group field as docValues is one option that can help 
grouping performance in some cases (at least according to this list; I haven't 
tried it yet).
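For reference, enabling docValues on the group field is a schema.xml change along these lines (the field name and type are placeholders, and the change requires reindexing, as noted below):

```xml
<!-- Hypothetical grouping field; docValues="true" stores a column-oriented
     copy of the values, which grouping/sorting can use efficiently. -->
<field name="productId" type="string" indexed="true" stored="true" docValues="true"/>
```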

The number of distinct values of the grouping field is important as well.  If 
there are very many, you may want to try CollapsingQParserPlugin.     

The point being, some of these options may require reindexing!   So, again, it 
is a much easier and faster process to tune on a smaller index.

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Friday, June 19, 2015 2:33 PM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

Do be aware that turning on &debug=query adds load.  I've seen the debug 
component take 90% of the query time (to be fair, it usually takes a much 
smaller percentage).

But if you set debug=all, you'll see a section at the end of the response with 
the time each component took, so you'll have a sense of the relative time used 
by each component.
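As a sketch, that per-component timing section can be read out of the JSON response roughly like this (the structure and numbers below are illustrative of Solr's debug output, not captured from a real run):

```python
# Illustrative shape of the debug section returned with debug=all&wt=json.
# Values are made up for the example.
response = {
    "debug": {
        "timing": {
            "time": 950.0,
            "process": {
                "time": 940.0,
                "query": {"time": 120.0},
                "facet": {"time": 300.0},
                "group": {"time": 500.0},
                "debug": {"time": 20.0},
            },
        }
    }
}

process = response["debug"]["timing"]["process"]
for component, stats in process.items():
    if component != "time":  # "time" is the total, not a component
        print(f"{component}: {stats['time']} ms")
```

In a shape like this, the grouping component dominating the total would point the tuning effort at the group parameters rather than the base query.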

Best,
Erick

On Fri, Jun 19, 2015 at 11:06 AM, Wenbin Wang <wwang...@gmail.com> wrote:
> As for now, the index size is 6.5 M records, and the performance is 
> good enough. I will re-build the index for all the records (14 M) and 
> test it again with debug turned on.
>
> Thanks
>
>
> On Fri, Jun 19, 2015 at 12:10 PM, Erick Erickson 
> <erickerick...@gmail.com>
> wrote:
>
>> First and most obvious thing to try:
>>
>> bq: the Solr was started with maximal 4G for JVM, and index size is < 
>> 2G
>>
>> Bump your JVM to 8G, perhaps 12G. The size of the index on disk is 
>> very loosely coupled to JVM requirements. It's quite possible that 
>> you're spending all your time in GC cycles. Consider gathering GC 
>> characteristics, see:
>> http://lucidworks.com/blog/garbage-collection-bootcamp-1-0/
>>
>> As Charles says, on the face of it the system you describe should 
>> handle quite a load, so it feels like things can be tuned and you 
>> won't have to resort to sharding.
>> Sharding inevitably imposes some overhead so it's best to go there last.
>>
>> From my perspective, this is, indeed, an XY problem. You're assuming 
>> that sharding is your solution. But you really haven't identified the 
>> _problem_ other than "queries are too slow". Let's nail down the 
>> reason queries are taking a second before jumping into sharding. I've 
>> just spent too much of my life fixing the wrong thing ;)
>>
>> It would be useful to see a couple of sample queries so we can get a 
>> feel for how complex they are. Especially if you append, as Charles 
>> mentions, "debug=true"
>>
>> Best,
>> Erick
>>
>> On Fri, Jun 19, 2015 at 7:02 AM, Reitzel, Charles 
>> <charles.reit...@tiaa-cref.org> wrote:
>> > Grouping does tend to be expensive.   Our regular queries typically
>> return in 10-15ms while the grouping queries take 60-80ms in a test 
>> environment (< 1M docs).
>> >
>> > This is ok for us, since we wrote our app to take the grouping 
>> > queries
>> out of the critical path (async query in parallel with two primary queries
>> and some work in middle tier).   But this approach is unlikely to work for
>> most cases.
>> >
>> > -----Original Message-----
>> > From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org]
>> > Sent: Friday, June 19, 2015 9:52 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: RE: How to do a Data sharding for data in a database table
>> >
>> > Hi Wenbin,
>> >
>> > To me, your instance appears well provisioned.  Likewise, your 
>> > analysis
>> of test vs. production performance makes a lot of sense.  Perhaps 
>> your time would be well spent tuning the query performance for your 
>> app before resorting to sharding?
>> >
>> > To that end, what do you see when you set debugQuery=true?  Where does
>> Solr spend the time?  My guess would be in the grouping and sorting steps,
>> but which?  Sometimes the schema details matter for performance.  Folks on
>> this list can help with that.
>> >
>> > -Charlie
>> >
>> > -----Original Message-----
>> > From: Wenbin Wang [mailto:wwang...@gmail.com]
>> > Sent: Friday, June 19, 2015 7:55 AM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: How to do a Data sharding for data in a database table
>> >
>> > I have enough RAM (30G) and hard disk (1000G). It is not I/O-bound or
>> disk-bound. In addition, Solr was started with a maximum of 4G for the
>> JVM, and the index size is < 2G. In a typical test, I made sure enough
>> free RAM (10G) was available. I have not tuned any parameter in the
>> configuration; it is the default configuration.
>> >
>> > The number of fields for each record is around 10, and the number of
>> results to be returned per page is 30. So the response time should not
>> be affected by network traffic, and it is tested on the same machine.
>> The query has a list of 4 search parameters, and each parameter takes a
>> list of values or a date range. The results will also be grouped and
>> sorted. The response time of a typical single request is around 1
>> second. It can be > 1 second with more demanding requests.
>> >
>> > In our production environment, we have 64 cores, and we need to support
>> more than 300 concurrent users, that is, about 300 concurrent requests
>> per second. Each core will have to process about 5 requests per second.
>> The response time under this load will not be 1 second any more. My
>> estimate is that an average response time of 200 ms for a single request
>> would be able to handle 300 concurrent users in production. There is no
>> plan to increase the total number of cores 5 times.
>> >
>> > In a previous test, a search index of around 6M documents was able to
>> handle more than 5 requests per second on each core of my 8-core machine.
>> >
>> > By sharding the data from one single index of 13M into 2 indexes of 6
>> or 7M each, I am expecting a much faster response time that can meet the
>> demand of the production environment. That is the motivation for doing
>> data sharding. However, I am also open to a solution that can improve
>> the performance of the 13M-to-14M index so that I do not need to do data
>> sharding.
>> >
>> >
>> >
>> >
>> >
>> > On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <
>> erickerick...@gmail.com>
>> > wrote:
>> >
>> >> You've repeated your original statement. Shawn's observation is 
>> >> that 10M docs is a very small corpus by Solr standards. You either 
>> >> have very demanding document/search combinations or you have a 
>> >> poorly tuned Solr installation.
>> >>
>> >> On reasonable hardware I expect 25-50M documents to have 
>> >> sub-second response time.
>> >>
>> >> So what we're trying to do is be sure this isn't an "XY" problem, 
>> >> from Hossman's apache page:
>> >>
>> >> Your question appears to be an "XY Problem" ... that is: you are 
>> >> dealing with "X", you are assuming "Y" will help you, and you are
>> asking about "Y"
>> >> without giving more details about the "X" so that we can 
>> >> understand the full issue.  Perhaps the best solution doesn't involve "Y" 
>> >> at all?
>> >> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>> >>
>> >> So again, how would you characterize your documents? How many fields?
>> >> What do queries look like? How much physical memory on the machine?
>> >> How much memory have you allocated to the JVM?
>> >>
>> >> You might review:
>> >> http://wiki.apache.org/solr/UsingMailingLists
>> >>
>> >>
>> >> Best,
>> >> Erick
>> >>
>> >> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
>> >> > The query without load is still under 1 second. But under load, 
>> >> > response
>> >> time
>> >> > can be much longer due to the queued up query.
>> >> >
>> >> > We would like to shard the data to something like 6 M / shard, 
>> >> > which will still give a under 1 second response time under load.
>> >> >
>> >> > What are some best practice to shard the data? for example, we 
>> >> > could
>> >> shard
>> >> > the data by date range, but that is pretty dynamic, and we could 
>> >> > shard
>> >> data
>> >> > by some other properties, but if the data is not evenly 
>> >> > distributed, you
>> >> may
>> >> > not be able shard it anymore.
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
>> >> > Sent from the Solr - User mailing list archive at Nabble.com.
>> >>
>> >
>> > *************************************************************************
>> > This e-mail may contain confidential or privileged information.
>> > If you are not the intended recipient, please notify the sender
>> > immediately and then delete it.
>> >
>> > TIAA-CREF
>> > *************************************************************************
>> >
>>

