In my experience with big-data analytics, query concurrency is very low in most cases, so running each query chunk on two slaves is fine (the second slave would otherwise be idle anyway).
On the other hand, for public reporting systems (like Google Analytics) the extra load could be a problem. We could solve this by letting the client specify the redundancy level, so that users decide whether they want better reliability or less load.

On Thu, Sep 13, 2012 at 3:00 AM, Constantine Peresypkin <[email protected]> wrote:

> You're absolutely correct.
> My point was that even less than 2 is sufficient.
>
>
> On Thu, Sep 13, 2012 at 1:40 AM, Ted Dunning <[email protected]> wrote:
>
>> It isn't a doubling. It is a power.
>>
>> If probability of exceeding the SLA is p, then the probability that two
>> independent resources will exceed the SLA is p^2. For three, the
>> probability is p^3.
>>
>> To be concrete, I just did a simulation with a mixture of two log-normal
>> distributions. Using a mixture distribution here is important to emulate
>> the long-tailed nature of response time distributions ... it doesn't
>> suffice to use normal distributions.
>>
>> With a long tailed distribution that has a median of 20 ms response, the
>> raw distribution has about a 2% chance of having a response > 50ms. Using
>> the lesser of two responses gives a probability of > 50 ms response of
>> 0.04%. Three responses gives a probability of 0.0008%. For most
>> applications, the difference between 2 and 3 replicated queries is nil.
>>
>> Moreover, if the second query has an artificial delay of a few ms, you get
>> nearly the same improvements in probability of meeting the SLA, but you pay
>> much lower average cost because you rarely invoke the redundant queries.
>>
>> So the reason that 2 are used instead of 3 is that 2 helps a lot while 3
>> only improves things slightly more.
>>
>> On Wed, Sep 12, 2012 at 1:01 PM, Constantine Peresypkin <
>> [email protected]> wrote:
>>
>> > If you do a double query you're increasing your chances to success by
>> > factor of 2 only.
>> > Why not triple or quadruple?
>> >
>> > On Wed, Sep 12, 2012 at 10:14 PM, Ted Dunning <[email protected]>
>> > wrote:
>> >
>> > > Heavens.... we can easily satisfy both needs.
>> > >
>> > > Just have a parameter that can be set to 0 (= universal double query) or
>> > > Integer.MAX_INTEGER to get no backups at all.
>> > >
>> > > On Wed, Sep 12, 2012 at 11:47 AM, Constantine Peresypkin <
>> > > [email protected]> wrote:
>> > >
>> > > > > The PowerDrill paper also mentions a variant of this where each query
>> > > > > fragment is sent to two machines, and the results for that fragment are
>> > > > > used from whatever machine responds first.
>> > > >
>> > > >
>> > > > To send each query or request twice cluster load will be increased by
>> > > > 100%.

--
Vladimir Klimontovich
Cell: +7-926-890-2349, skype: klimontovich
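
The trade-off discussed in this thread (number of redundant copies, optional backup delay, SLA miss rate, extra load) can be explored with a small Monte Carlo sketch. Everything below is an assumption made for illustration: the mixture weights and sigmas, the function names, and the rule that all backups are sent together once the primary has been silent for backup_delay_ms. These are not the parameters of the simulation quoted above, so the numbers it prints will differ, and how much load a delayed backup saves depends heavily on the assumed distribution and the delay chosen.

```python
import math
import random

# Hypothetical mixture parameters (assumptions, not the values used in the thread):
# 97% of responses are "fast" (median 20 ms), 3% are "slow" (median 60 ms).
FAST_WEIGHT = 0.97
FAST_MEDIAN_MS, FAST_SIGMA = 20.0, 0.3
SLOW_MEDIAN_MS, SLOW_SIGMA = 60.0, 0.5


def sample_latency_ms(rng):
    """Draw one response time from the two-component log-normal mixture."""
    if rng.random() < FAST_WEIGHT:
        return rng.lognormvariate(math.log(FAST_MEDIAN_MS), FAST_SIGMA)
    return rng.lognormvariate(math.log(SLOW_MEDIAN_MS), SLOW_SIGMA)


def hedged_latency_ms(rng, copies, backup_delay_ms=0.0):
    """Return (latency, backups_sent) for one query fragment.

    The primary copy is sent at time 0. If it has not answered after
    backup_delay_ms, the remaining (copies - 1) backups are sent, and
    the first answer to arrive wins.
    """
    primary = sample_latency_ms(rng)
    if copies < 2 or primary <= backup_delay_ms:
        return primary, 0
    backups = [backup_delay_ms + sample_latency_ms(rng) for _ in range(copies - 1)]
    return min(primary, min(backups)), copies - 1


def simulate(copies, backup_delay_ms=0.0, sla_ms=50.0, trials=50_000, seed=1):
    """Estimate (SLA miss probability, average backups sent per fragment)."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    misses = 0
    backups_sent = 0
    for _ in range(trials):
        latency, sent = hedged_latency_ms(rng, copies, backup_delay_ms)
        if latency > sla_ms:
            misses += 1
        backups_sent += sent
    return misses / trials, backups_sent / trials


if __name__ == "__main__":
    # (copies, backup delay in ms): no redundancy, double, triple, delayed double.
    for copies, delay in [(1, 0.0), (2, 0.0), (3, 0.0), (2, 25.0)]:
        miss, extra = simulate(copies, delay)
        print(f"copies={copies} delay={delay:4.1f} ms  "
              f"P(miss SLA)={miss:.5f}  extra queries/fragment={extra:.2f}")
```

Here copies plays the role of a client-specified redundancy level: copies=1 means no backups, and raising backup_delay_ms trades some reliability for lower load.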
