RE: How to do a Data sharding for data in a database table

Reitzel, Charles Fri, 19 Jun 2015 07:04:31 -0700

Grouping does tend to be expensive.   Our regular queries typically return in 
10-15ms while the grouping queries take 60-80ms in a test environment (< 1M 
docs).

This is ok for us, since we wrote our app to take the grouping queries out of 
the critical path (async query in parallel with two primary queries and some 
work in middle tier).   But this approach is unlikely to work for most cases.
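
For anyone who wants to try the same arrangement, here is a minimal sketch of the idea, not our actual code. The four callables (primary_a, primary_b, grouping, middle_tier_work) are hypothetical placeholders for whatever issues the queries in your application; none of them is a Solr API.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_page(primary_a, primary_b, grouping, middle_tier_work):
    """Run two primary queries and one grouping query in parallel.

    The first three arguments are hypothetical zero-argument callables that
    each issue one query; middle_tier_work combines the two primary results.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        # Start the slow grouping query first so it overlaps everything else.
        grouping_future = pool.submit(grouping)
        a_future = pool.submit(primary_a)
        b_future = pool.submit(primary_b)

        # The critical path waits only on the fast primary queries.
        a = a_future.result()
        b = b_future.result()
        combined = middle_tier_work(a, b)

        # Collect the grouping result last; it has had the most time to finish.
        groups = grouping_future.result()
    return combined, groups
```

This only helps if the page can do useful work before the grouped results are needed, which is the part that may not carry over to other applications.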

-----Original Message-----
From: Reitzel, Charles [mailto:charles.reit...@tiaa-cref.org] 
Sent: Friday, June 19, 2015 9:52 AM
To: solr-user@lucene.apache.org
Subject: RE: How to do a Data sharding for data in a database table

Hi Wenbin,

To me, your instance appears well provisioned.  Likewise, your analysis of test 
vs. production performance makes a lot of sense.  Perhaps your time would be 
well spent tuning the query performance for your app before resorting to 
sharding?   

To that end, what do you see when you set debugQuery=true?   Where does Solr 
spend the time?   My guess would be in the grouping and sorting steps, but 
which?   Sometimes the schema details matter for performance.   Folks on this 
list can help with that.
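
A minimal sketch of one way to read that output, assuming the JSON response format and a hypothetical base URL and parameter set. The debug section of the response carries a "timing" block broken down by search component for the prepare and process phases. The numbers in the sample are invented for illustration, and the breakdown is per component, so it may not separate grouping from sorting on its own.

```python
from urllib.parse import urlencode


def debug_query_url(base_url, params):
    """Return a /select URL for the given params with debugQuery=true added."""
    query = dict(params, debugQuery="true", wt="json")
    return base_url.rstrip("/") + "/select?" + urlencode(query)


def component_times(debug_timing):
    """Flatten the debug "timing" block into (phase, component, ms) rows.

    Rows are sorted slowest first. Phase-level totals are skipped.
    """
    rows = []
    for phase in ("prepare", "process"):
        for name, value in debug_timing.get(phase, {}).items():
            if isinstance(value, dict):  # the phase's own "time" total is a number
                rows.append((phase, name, value.get("time", 0.0)))
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical core URL and query parameters, for illustration only.
url = debug_query_url(
    "http://localhost:8983/solr/collection1",
    {"q": "*:*", "group": "true", "group.field": "some_field"},
)

# Invented timing numbers in the shape of a debug "timing" block.
sample_timing = {
    "time": 950.0,
    "prepare": {"time": 2.0, "query": {"time": 2.0}},
    "process": {
        "time": 948.0,
        "query": {"time": 900.0},
        "facet": {"time": 0.0},
        "debug": {"time": 48.0},
    },
}
slowest = component_times(sample_timing)[0]
```

Posting the timing block along with the query and the relevant schema fields gives the list something concrete to work with.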

-Charlie

-----Original Message-----
From: Wenbin Wang [mailto:wwang...@gmail.com]
Sent: Friday, June 19, 2015 7:55 AM
To: solr-user@lucene.apache.org
Subject: Re: How to do a Data sharding for data in a database table

I have enough RAM (30G) and hard disk (1000G). It is not I/O bound or disk 
bound. In addition, Solr was started with a maximum of 4G for the JVM, and the 
index size is < 2G. In a typical test, I made sure at least 10G of free RAM 
was available. I have not tuned any parameters in the configuration; it is the 
default configuration.

The number of fields for each record is around 10, and the number of results to 
be returned per page is 30. So the response time should not be affected by 
network traffic, and it is tested on the same machine. The query has a list of 
4 search parameters, and each parameter takes a list of values or a date range. 
The results will also be grouped and sorted. The response time of a typical 
single request is around 1 second. It can be > 1 second with more demanding 
requests.

In our production environment, we have 64 cores, and we need to support > 300 
concurrent users, that is, about 300 concurrent requests per second. Each 
core will have to process about 5 requests per second. The response time under 
this load will not be 1 second any more. My estimate is that an average 
response time of 200 ms for a single request would be able to handle 300 
concurrent users in production. There is no plan to increase the total 
number of cores 5 times.
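
The arithmetic behind that estimate can be written out as a short sketch. It rests on a simplifying assumption that is not stated above: each request occupies exactly one core for its whole response time, with no waiting on I/O and no parallelism inside a request. The function names are only for illustration.

```python
def required_rate_per_core(total_requests_per_sec, cores):
    """Requests per second each core must absorb if load is spread evenly."""
    return total_requests_per_sec / cores


def sustainable_rate_per_core(response_time_sec):
    """Requests per second one core can serve if a request holds it that long."""
    return 1.0 / response_time_sec


needed = required_rate_per_core(300, 64)    # 4.6875, i.e. about 5 per core
at_200_ms = sustainable_rate_per_core(0.2)  # about 5 requests/sec per core
at_1_sec = sustainable_rate_per_core(1.0)   # 1 request/sec per core

# Cores needed to serve 300 requests/sec at 1 second each: 5 times the
# cores needed at 200 ms each.
cores_at_1_sec = 300 / at_1_sec
cores_at_200_ms = 300 / at_200_ms
```

Under that assumption, a 200 ms average leaves each core with slightly more capacity than the roughly 4.7 requests per second it has to handle, while a 1 second average would need about 5 times as many cores.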

In a previous test, a search index with a data size of around 6M was able to 
handle > 5 requests per second on each core of my 8-core machine.

By sharding the data from one single index of 13M into 2 indexes of 6M or 7M 
each, I am expecting a much faster response time that can meet the demand of 
the production environment. That is the motivation for sharding the data.
However, I am also open to solutions that can improve the performance of the 
13M to 14M index so that I do not need to shard the data.

On Fri, Jun 19, 2015 at 12:39 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> You've repeated your original statement. Shawn's observation is that 
> 10M docs is a very small corpus by Solr standards. You either have 
> very demanding document/search combinations or you have a poorly tuned 
> Solr installation.
>
> On reasonable hardware I expect 25-50M documents to have sub-second 
> response time.
>
> So what we're trying to do is be sure this isn't an "XY" problem, from 
> Hossman's apache page:
>
> Your question appears to be an "XY Problem" ... that is: you are 
> dealing with "X", you are assuming "Y" will help you, and you are asking 
> about "Y" without giving more details about the "X" so that we can 
> understand the full issue.  Perhaps the best solution doesn't involve "Y" 
> at all?
> See Also: http://www.perlmonks.org/index.pl?node_id=542341
>
> So again, how would you characterize your documents? How many fields? 
> What do queries look like? How much physical memory on the machine? 
> How much memory have you allocated to the JVM?
>
> You might review:
> http://wiki.apache.org/solr/UsingMailingLists
>
>
> Best,
> Erick
>
> On Thu, Jun 18, 2015 at 3:23 PM, wwang525 <wwang...@gmail.com> wrote:
> > The query without load is still under 1 second. But under load, 
> > response time can be much longer due to the queued-up queries.
> >
> > We would like to shard the data to something like 6 M / shard, which 
> > will still give an under-1-second response time under load.
> >
> > What are some best practices for sharding the data? For example, we 
> > could shard the data by date range, but that is pretty dynamic, and we 
> > could shard the data by some other properties, but if the data is not 
> > evenly distributed, you may not be able to shard it anymore.
> >
> >
> >
> > --
> > View this message in context:
> > http://lucene.472066.n3.nabble.com/How-to-do-a-Data-sharding-for-data-in-a-database-table-tp4212765p4212803.html
> > Sent from the Solr - User mailing list archive at Nabble.com.
>

*************************************************************************
This e-mail may contain confidential or privileged information.
If you are not the intended recipient, please notify the sender immediately and 
then delete it.

TIAA-CREF
*************************************************************************

