Cool, thanks for sharing the info!

On Fri, May 6, 2016 at 3:48 AM, Ravikumar Govindarajan <
[email protected]> wrote:

> We migrated about 30% of our data in Blur.
>
> But our servers were struggling to index the remaining 60% of data coupled
> with new incoming documents for the already migrated 30% user-base...
>
> We wanted to isolate migration (data-pumping) from normal indexing/search
> requests..
>
> Blur made it very easy for me.
>
> In MasterBasedFactory, I reserve a shard-range for migration & make sure
> only a certain set of machines serve it. Once the shards reach a desired
> size (15-20GB), I de-reserve it (normal machines pick it up & start serving
> incoming requests immediately). Newer reserved shards are allocated to
> continue our data-pumping uninterrupted.
>
> This was a big problem for us & the hidden gems of Blur continue to
> surprise & help us so much.
>
> Thanks a lot...
>
> --
> Ravi
>
> On Wed, Mar 19, 2014 at 12:07 PM, Ravikumar Govindarajan <
> [email protected]> wrote:
>
> > Sure, will take up the 0.2.2 codebase. Thanks for all your help
> >
> >
> > On Tue, Mar 18, 2014 at 4:24 PM, Aaron McCurry <[email protected]> wrote:
> >
> >> On Tue, Mar 18, 2014 at 1:30 AM, Ravikumar Govindarajan <
> >> [email protected]> wrote:
> >>
> >> > >
> >> > > If I understand this one. Favor the primary response until a certain
> >> > > amount of time has passed then fall back to the secondary response
> >> > > assuming it's available to return.
> >> >
> >> >
> >> > Exactly. This is one such option. Another option is the
> >> > first-past-the-post
> >> >
> >> > Buffer cache? Are you referring to block cache?
> >> >
> >> >
> >> > Yup. Was referring to the block-cache here. But like you said, we can
> >> > just let it fall off the LRU
> >> >
> >> > The interesting thing here is that Blur is fully committed to disk (HDFS)
> >> > upon each mutate
> >> >
> >> > I think this is a new feature that I have missed in Blur. Will for sure
> >> > check it out. This auto-solves the stale-read issue also
> >> >
> >> > The problem now is, I am doing quite low-level changes on top of Blur.
> >> > Some of them are..
> >> >
> >> > 1. Online Shard-Creation
> >> > 2. Externalizing RowId->Shard mapping via BlurPartitioner
> >> > 3. Splitting shards upon reaching configured size
> >> > 4. Secondary read-only shard for availability...
> >> >
> >>
> >> I would love to hear more about the details of the implementations of
> >> these. :-)
> >>
> >>
> >> >
> >> > and many more such things needed for our app
> >> >
> >> > Hope to share and get feedback for these changes from the Blur community
> >> > once the system survives a couple of production-cycles.
> >> >
> >>
> >> That would be awesome. Based on your other email, I would strongly
> >> recommend you take a look at the 0.2.2 codebase. It has MANY fixes,
> >> performance improvements, and stability enhancements. Let us know if you
> >> have any questions.
> >>
> >> Aaron
> >>
> >>
> >> >
> >> > --
> >> > Ravi
> >> >
> >> >
> >> > On Mon, Mar 17, 2014 at 7:17 PM, Aaron McCurry <[email protected]>
> >> > wrote:
> >> >
> >> > > On Sat, Mar 15, 2014 at 12:57 PM, Ravikumar Govindarajan <
> >> > > [email protected]> wrote:
> >> > >
> >> > > > Aaron,
> >> > > >
> >> > > > I was thinking about another way of utilizing read-only shards
> >> > > >
> >> > > > Instead of logic/intelligence for finding a primary replica
> >> > > > struggling/down, can we opt for pushing the logic to the client-side?
> >> > > >
> >> > > > We can take a few approaches as below
> >> > > >
> >> > > > 1. Query both primary/secondary shards in parallel and return
> >> > > > whichever comes first
> >> > > >
> >> > > > 2. Query both primary/secondary shards in parallel. Wait for the
> >> > > > primary response as per a configured delay. If not forthcoming,
> >> > > > return the secondary's response
> >> > > >
> >> > >
> >> > > If I understand this one. Favor the primary response until a certain
> >> > > amount of time has passed then fall back to the secondary response
> >> > > assuming it's available to return.
> >> > >
> >> > > >
> >> > > > These are useful only when the client agrees to a "stale-read" scenario.
> >> > > > "stale-read" in this case will be the last-commit of the index.
> >> > > >
> >> > > > What I am aiming at is, in the case of layout-conscious apps [layout
> >> > > > does not change when a VM update/crash/hang is restarted], we can
> >> > > > always fall back on replica reads, resulting in greater availability
> >> > > > but lesser consistency
> >> > > >
> >> > > > A secondary-replica layout needs to be present in ZK. Replica-shards
> >> > > > should always be served from a server other than the primary. Maybe
> >> > > > we can switch off buffer-cache for replica reads, as it is used only
> >> > > > temporarily
> >> > > >
> >> > >
> >> > > Buffer cache? Are you referring to block cache? Or a query cache? Just
> >> > > as a FYI, Blur's query cache is currently disabled. As for the block
> >> > > cache, maybe. The block cache seems to help performance quite a bit and
> >> > > usually does so at little cost. Also, we could flush the secondary
> >> > > shard from the cache from time to time. Or we could just let it fall
> >> > > out of the LRU.
> >> > >
> >> > > >
> >> > > > 95% of apps queue their indexing operations and can always retry
> >> > > > after the primary comes back online.
> >> > > >
> >> > >
> >> > > The interesting thing here is that Blur is fully committed to disk
> >> > > (HDFS) upon each mutate. So assuming that the secondary shard has
> >> > > refreshed, the primary shard being down just means that you can't write
> >> > > to that shard. Reads should be in the same state.
> >> > >
> >> > > >
> >> > > > Please let me know your views on this
> >> > > >
> >> > >
> >> > > I like all these ideas, the only thing I would add is that we would
> >> > > need to build these sorts of options into Blur on a configured
> >> > > per-table basis. Querying both primary and secondary shards at the same
> >> > > time could produce the most consistent response times but at the cost
> >> > > of CPU resources (obviously).
> >> > >
> >> > > Thanks for the thoughts and ideas! I like it!
> >> > >
> >> > > Aaron
> >> > >
> >> > > >
> >> > > > --
> >> > > > Ravi
> >> > > >
> >> > > >
> >> > > > On Sat, Mar 8, 2014 at 8:56 PM, Aaron McCurry <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > On Fri, Mar 7, 2014 at 5:42 AM, Ravikumar Govindarajan <
> >> > > > > [email protected]> wrote:
> >> > > > >
> >> > > > > > >
> >> > > > > > > Well it works that way for OOMs and for when the process drops
> >> > > > > > > hard (Think kill -9). However when a shard server is shutdown
> >> > > > > > > it currently ends its session in ZooKeeper, thus triggering a
> >> > > > > > > layout change.
> >> > > > > >
> >> > > > > >
> >> > > > > > Yes, maybe we can have a config to determine whether it should
> >> > > > > > end/maintain the session in ZK when doing a normal shutdown and
> >> > > > > > then a subsequent restart. That way, both MTTR-conscious and
> >> > > > > > layout-conscious settings can be supported.
> >> > > > > >
> >> > > > >
> >> > > > > That's a neat idea. Once we have shards being served on multiple
> >> > > > > servers we should definitely take a look at this. When we implement
> >> > > > > the multi-shard serving I would guess that there will be 2 layout
> >> > > > > strategies (they might be implemented together).
> >> > > > >
> >> > > > > 1. Would be to get the N replicas online on different servers.
> >> > > > > 2. Would be the writing leader for the shard, assuming that it's needed.
> >> > > > >
> >> > > > >
> >> > > > > >
> >> > > > > > How do you think we can detect that a particular shard-server is
> >> > > > > > struggling/shut-down and hence incoming search-requests need to
> >> > > > > > go to some other server?
> >> > > > > >
> >> > > > > > I am listing a few paths off the top of my head
> >> > > > > >
> >> > > > > > 1. Process baby-sitters like supervisord, alerting controllers
> >> > > > > > 2. Tracking the first network-exception in the controller and
> >> > > > > > diverting to the read-only instance. Periodically maybe re-try
> >> > > > > > 3. Take a statistics-based decision, based on previous response
> >> > > > > > times etc..
> >> > > > > >
> >> > > > >
> >> > > > > Adding to this one, and this may be obvious, but measuring the
> >> > > > > response time in comparison with other shards. Meaning if the
> >> > > > > entire cluster is experiencing an increase in load and all response
> >> > > > > times are increasing we wouldn't want to start killing off shard
> >> > > > > servers inadvertently. Looking for outliers.
> >> > > > >
> >> > > > > > 4. Build some kind of leasing mechanism in ZK etc...
> >> > > > > >
> >> > > > >
> >> > > > > I think that all of these are good approaches. Likely, to determine
> >> > > > > that a node is misbehaving and should be killed/not used anymore we
> >> > > > > would want multiple ways to measure that condition and then vote on
> >> > > > > the need to kick it out.
> >> > > > >
> >> > > > > Aaron
> >> > > > >
> >> > > > > >
> >> > > > > > --
> >> > > > > > Ravi
> >> > > > > >
> >> > > > > >
> >> > > > > > On Fri, Mar 7, 2014 at 8:01 AM, Aaron McCurry <[email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > On Thu, Mar 6, 2014 at 6:30 AM, Ravikumar Govindarajan <
> >> > > > > > > [email protected]> wrote:
> >> > > > > > >
> >> > > > > > > > I came to know about the zk.session.timeout variable just
> >> > > > > > > > now, while reading more about this problem.
> >> > > > > > > >
> >> > > > > > > > This will only trigger the dead-node notification after the
> >> > > > > > > > configured timeout exceeds. Setting it to 3-4 mins must be
> >> > > > > > > > fine for OOMs and rolling-restarts.
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > Well it works that way for OOMs and for when the process drops
> >> > > > > > > hard (Think kill -9). However when a shard server is shutdown
> >> > > > > > > it currently ends its session in ZooKeeper, thus triggering a
> >> > > > > > > layout change.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > The only extra stuff I am looking for is to divert search
> >> > > > > > > > calls to a read-only shard instance during this 3-4 mins time
> >> > > > > > > > to avoid mini-outages
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > Yes, and I think that the controllers will automatically spread
> >> > > > > > > the queries across those servers that are online. The
> >> > > > > > > BlurClient class already takes a list of connection strings and
> >> > > > > > > treats all connections as equals. For example, its current use
> >> > > > > > > is to provide the client with all the controllers' connection
> >> > > > > > > strings. Internally if any one of the controllers goes down or
> >> > > > > > > has a network issue another controller is automatically retried
> >> > > > > > > without the user having to do anything. There is back-off,
> >> > > > > > > ping, and pooling logic in the BlurClientManager that the
> >> > > > > > > BlurClient utilizes.
> >> > > > > > >
> >> > > > > > > Aaron
> >> > > > > > >
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Ravi
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > On Thu, Mar 6, 2014 at 3:34 PM, Ravikumar Govindarajan <
> >> > > > > > > > [email protected]> wrote:
> >> > > > > > > >
> >> > > > > > > > > What do you think of giving some extra leeway for
> >> > > > > > > > > shard-server failover cases?
> >> > > > > > > > >
> >> > > > > > > > > Ex: Whenever a shard-server process gets killed, the
> >> > > > > > > > > controller-node does not immediately update the layout, but
> >> > > > > > > > > rather marks it as a suspect.
> >> > > > > > > > >
> >> > > > > > > > > When we have a read-only back-up of the shard, searches can
> >> > > > > > > > > continue unhindered. Indexing during this time can be
> >> > > > > > > > > diverted to a queue, which will store and retry ops when the
> >> > > > > > > > > shard-server comes online again.
> >> > > > > > > > >
> >> > > > > > > > > Over a configured number of attempts/time, if the
> >> > > > > > > > > shard-server does not come up, then one controller-server
> >> > > > > > > > > can authoritatively mark it as down and update the layout.
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > > Ravi
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
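
A minimal sketch of the shard-range reservation idea from the May 2016 note above, assuming a custom layout helper along the lines of Ravi's MasterBasedFactory rather than any actual Blur API: reserved shards are served only by dedicated migration machines, and a shard is de-reserved once it reaches the desired size so that the normal machines pick it up. All class, method, and field names here are hypothetical.

// Hypothetical layout helper: pins a reserved shard range to dedicated
// "migration" servers and releases shards back to the normal pool once they
// are de-reserved. Illustrative only, not a Blur API.
import java.util.List;
import java.util.Set;

public class ReservedRangeLayout {

  private final Set<Integer> reservedShards;     // shards currently used for data-pumping
  private final List<String> migrationServers;   // machines dedicated to migration
  private final List<String> normalServers;      // machines serving live traffic

  public ReservedRangeLayout(Set<Integer> reservedShards,
                             List<String> migrationServers,
                             List<String> normalServers) {
    this.reservedShards = reservedShards;
    this.migrationServers = migrationServers;
    this.normalServers = normalServers;
  }

  /** Pick the server that should host the given shard. */
  public String serverFor(int shardId) {
    List<String> pool = reservedShards.contains(shardId) ? migrationServers : normalServers;
    return pool.get(shardId % pool.size());
  }

  /**
   * De-reserve a shard once it reaches the desired size (e.g. 15-20 GB);
   * on the next layout pass the normal servers pick it up and start serving it.
   */
  public void deReserveIfLarge(int shardId, long shardSizeBytes, long thresholdBytes) {
    if (shardSizeBytes >= thresholdBytes) {
      reservedShards.remove(shardId);
    }
  }
}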
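
The "Externalizing RowId->Shard mapping via BlurPartitioner" item could be pictured roughly as follows: an explicit row-to-shard override consulted first (for example during an online shard split), with a hash fallback in the spirit of Blur's hash-based partitioning. This wrapper is a hypothetical illustration, not Blur code.

// Hypothetical externalized RowId -> shard mapping: explicit overrides win,
// otherwise fall back to a stable hash across the configured shard count.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ExternalizedPartitioner {

  private final Map<String, Integer> rowIdToShard = new ConcurrentHashMap<>();
  private final int numberOfShards;

  public ExternalizedPartitioner(int numberOfShards) {
    this.numberOfShards = numberOfShards;
  }

  /** Pin a row id to a specific shard (e.g. while splitting an oversized shard). */
  public void pin(String rowId, int shard) {
    rowIdToShard.put(rowId, shard);
  }

  /** Resolve the shard for a row id: explicit mapping first, hash otherwise. */
  public int shardFor(String rowId) {
    Integer pinned = rowIdToShard.get(rowId);
    if (pinned != null) {
      return pinned;
    }
    // Same spirit as hash-based partitioning: stable and evenly spread.
    return (rowId.hashCode() & Integer.MAX_VALUE) % numberOfShards;
  }
}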
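
The two client-side read options from the Mar 15 message (first-past-the-post, and favoring the primary for a configured delay before falling back to the secondary) can be sketched with plain java.util.concurrent primitives. The query suppliers below stand in for whatever per-replica search call the application makes; they are assumptions, not Blur APIs.

// Sketch of the two hedged-read strategies discussed in the thread.
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class HedgedReads {

  private static final ExecutorService POOL = Executors.newCachedThreadPool();

  /** Option 1: first-past-the-post, return whichever replica answers first. */
  @SuppressWarnings("unchecked")
  public static <T> T firstPastThePost(Supplier<T> queryPrimary, Supplier<T> querySecondary)
      throws Exception {
    CompletableFuture<T> primary = CompletableFuture.supplyAsync(queryPrimary, POOL);
    CompletableFuture<T> secondary = CompletableFuture.supplyAsync(querySecondary, POOL);
    return (T) CompletableFuture.anyOf(primary, secondary).get();
  }

  /** Option 2: query both in parallel, favor the primary for a configured delay. */
  public static <T> T primaryPreferred(Supplier<T> queryPrimary, Supplier<T> querySecondary,
                                       long delayMillis) throws Exception {
    CompletableFuture<T> primary = CompletableFuture.supplyAsync(queryPrimary, POOL);
    CompletableFuture<T> secondary = CompletableFuture.supplyAsync(querySecondary, POOL);
    try {
      // Stale-read caveat from the thread: the secondary only reflects the last commit.
      return primary.get(delayMillis, TimeUnit.MILLISECONDS);
    } catch (TimeoutException slowPrimary) {
      return secondary.get();
    }
  }
}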
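
Aaron's suggestion to look for latency outliers rather than absolute slowness might look something like the rough sketch below: a shard server is flagged only when its response time deviates from the cluster-wide distribution, so a cluster-wide load increase does not get individual servers kicked out. The threshold is illustrative.

// Flag a server only when it is an outlier relative to the cluster, not when
// the whole cluster is slow.
import java.util.Collection;

public class LatencyOutlierDetector {

  private final double deviations; // e.g. 3.0 => flag beyond 3 standard deviations

  public LatencyOutlierDetector(double deviations) {
    this.deviations = deviations;
  }

  public boolean isOutlier(double serverLatencyMillis, Collection<Double> clusterLatenciesMillis) {
    double mean = clusterLatenciesMillis.stream()
        .mapToDouble(Double::doubleValue).average().orElse(0.0);
    double variance = clusterLatenciesMillis.stream()
        .mapToDouble(l -> (l - mean) * (l - mean)).average().orElse(0.0);
    double stdDev = Math.sqrt(variance);
    // If everything is slow, the mean rises too and no single server is singled out.
    return serverLatencyMillis > mean + deviations * stdDev;
  }
}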
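
The "end vs. maintain the ZooKeeper session on a normal shutdown" idea reduces to something like the sketch below, written against the plain ZooKeeper client API. The endSessionOnShutdown flag is hypothetical and not an actual Blur configuration property.

// Conditionally end the ZooKeeper session on a clean shutdown.
import org.apache.zookeeper.ZooKeeper;

public class ShutdownHandler {

  private final ZooKeeper zooKeeper;
  private final boolean endSessionOnShutdown; // hypothetical config flag

  public ShutdownHandler(ZooKeeper zooKeeper, boolean endSessionOnShutdown) {
    this.zooKeeper = zooKeeper;
    this.endSessionOnShutdown = endSessionOnShutdown;
  }

  public void shutdown() throws InterruptedException {
    if (endSessionOnShutdown) {
      // MTTR-conscious: ephemeral registrations vanish immediately and the
      // controllers re-layout right away.
      zooKeeper.close();
    }
    // Layout-conscious: skip the close so the session only expires after the
    // configured session timeout, leaving the layout untouched long enough
    // for a quick restart.
  }
}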
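
Finally, the "mark it as a suspect" grace period from the Mar 6 message, with indexing diverted to a retry queue while searches continue against the read-only backup, might be outlined as below. This is a hypothetical controller-side sketch, not how Blur's controllers are implemented.

// Track suspect shard servers, queue writes during the grace period, and only
// update the layout after the configured time has elapsed.
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class SuspectTracker {

  private final long graceMillis;                       // e.g. 3-4 minutes
  private final Map<String, Long> suspects = new ConcurrentHashMap<>();
  private final Queue<Runnable> pendingWrites = new ConcurrentLinkedQueue<>();

  public SuspectTracker(long graceMillis) {
    this.graceMillis = graceMillis;
  }

  /** Called when a shard server's process dies or its ZooKeeper session drops. */
  public void markSuspect(String shardServer) {
    suspects.putIfAbsent(shardServer, System.currentTimeMillis());
  }

  /** Called when the server re-registers; queued writes are replayed. */
  public void clearSuspect(String shardServer) {
    suspects.remove(shardServer);
    Runnable write;
    while ((write = pendingWrites.poll()) != null) {
      write.run();
    }
  }

  /** Writes are stored for retry instead of failing while the server is suspect. */
  public void queueWrite(Runnable write) {
    pendingWrites.add(write);
  }

  /** Only after the grace period does a controller authoritatively update the layout. */
  public boolean shouldUpdateLayout(String shardServer) {
    Long since = suspects.get(shardServer);
    return since != null && System.currentTimeMillis() - since >= graceMillis;
  }
}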
