Good info; thanks for taking the time to respond!
I can use now/0 and shard on the mod of the microseconds to spread
the writes out at sub-second granularity.
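
Roughly what I have in mind (untested sketch; the shard count and key
layout are placeholders, not anything Riak prescribes):

  %% derive a shard suffix from the sub-second part of the clock;
  %% now/0 works too, os:timestamp/0 has the same {Mega, Sec, Micro} shape
  shard_key(BaseKey, NShards) ->
      {_Mega, _Sec, Micro} = os:timestamp(),
      Shard = Micro rem NShards,
      iolist_to_binary([BaseKey, "-", integer_to_list(Shard)]).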

The only thing stopping me from doing so was that I'm lazy and
didn't want to write the correlation logic across that many buckets
if I could avoid it.  Enough other things to do.  :)

Rock on.

-mox

On Tue, Oct 4, 2011 at 6:18 AM, Ryan Zezeski <rzeze...@basho.com> wrote:
>
>
> On Tue, Oct 4, 2011 at 12:07 AM, Mike Oxford <moxf...@gmail.com> wrote:
>>
>> SSDs are an option, sure.  I have one in my laptop; we have a bunch
>> of X25s on the way already for the servers.  Yes, they're good.  But
>> IOPS is not the core issue since the whole thing can sit in RAM
>> which is faster yet.  Disk-flush "later" isn't time critical.  Getting the
>> data into the buckets is.
>
> If you're writing to bitcask, which I assumed, then IOPS is very much an
> issue.  If you're using bitcask and the I/O throughput isn't there, you're
> going to have major backups in the vnode mailbox.  If there are any
> selective receives in the vnode implementation, things will really get
> nasty.  Are you saying you're using an in-memory backend for these keys?
>
>>
>> 5k per second per key, over multiple concurrent writers (3-6 initially,
>> possibly more later).  Pre-cache+flush doesn't work because you
>> lose the interleave from the multiple writers.  NTP's resolution is only
>> "so good." :)
>
> So for each key you have 3-6 concurrent writers averaging around 5kw/s.  How
> many keys do you have like this?
>
>>
>> The buckets can be cycled/sharded based on time, so slicing it into
>> "5-second buckets of children" is possible, but this is just a
>> specialization of the sharding ideology.
>
> I assume you mean that every 5s you would change the bucket name to avoid
> overloading the same key space?  Yea, that would probably help but I still
> think you'll have trouble with 5kw/s on a single key if using a durable
> backend.
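>
> Something like this is the kind of thing I mean by rotating the bucket
> name (untested sketch; the naming scheme is made up):
>
>   %% bucket name for the current time window; Prefix is a binary or
>   %% string, WindowSecs would be 5 in your case
>   time_bucket(Prefix, WindowSecs) ->
>       {Mega, Sec, _Micro} = os:timestamp(),
>       Secs = Mega * 1000000 + Sec,
>       Window = Secs - (Secs rem WindowSecs),
>       iolist_to_binary([Prefix, "-", integer_to_list(Window)]).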
>
>>
>> Point being: if it's basically used as an append-only bucket (throw it
>> all in, sort it out later), how painful, underneath, is the child
>> resolution vs. the traditional "get it, write it" and then dealing with
>> children ANYWAY when you do get collisions (which, at 5k/sec, you ARE
>> going to end up with)?
>
> Yea, I agree that either way you'll end up with children.  I would imagine
> you'd have faster writes without the get/modify/put cycle, but I've also
> never seen anyone explode siblings that high on purpose, so for all I know
> it will be worse.  I'd be curious to see how Riak handles large sibling
> counts like that, but my gut says it won't do so well.
>>
>> It was touched on that lists are used underneath.  Given high-end modern
>> hardware (6-core CPUs, SSDs, etc.), ballpark, where would you guess the
>> red line is?  10k children? 25k? 100k?  I won't hold anyone to it, but if
>> you say "hell no, children are really expensive" then I'll abort the idea
>> right here, versus "they're pretty efficient underneath, it might be
>> doable."
>
> I think it's a bad idea, no matter what the practical limit is.  Siblings,
> when possible, are to be avoided.  They only exist because when you write a
> distributed application like Riak there are certain scenarios where they
> can't be avoided.  You can certainly try to use them as you describe, but I
> can tell you the code was not written with that in mind.  Like I said, I'd
> be curious to see the results.
>>
>> I'm familiar with all the HA/clustering "normal stuff," but I'm curious
>> about Riak in particular: while Riak isn't built to be fast, how much load
>> can you push through a ring before the underlying architecture stresses?
>
> In most cases we expect Riak to be I/O bound.  If you're not stressing I/O,
> then my first instinct would be to raise the ring size so that each node
> has more partitions.  There is no hard and fast rule about how many
> partitions a node should have; it depends on the type of disk you have.
> Obviously, SSDs and the like will handle more.  We even have some people
> who run SSDs in RAID 0.
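>
> To be concrete, ring size is the ring_creation_size setting under
> riak_core in app.config (128 below is just an example value, not a
> recommendation):
>
>   {riak_core, [
>       %% total number of partitions in the ring; divide by node count
>       %% to get partitions per node
>       {ring_creation_size, 128}
>   ]}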
> Also, since ring size is something you can't change once a cluster has
> been created, you need to do some capacity planning ahead of time to
> guess the best node/partition ratio.  In 1.0 we did some work to make
> better use of I/O without relying on the ring size (such as async folding
> and whatnot), but I'm not sure on all the details and I'm hoping one of my
> colleagues can help me out if I'm missing something.
>>
>> I know Yammer was putting some load on theirs; something around 4k
>> per sec over a few boxes but not to a single key.
>
> The key part of that sentence: _not to a single key_.  Writing to a single
> key is serialized and therefore it can only move as fast as the vnodes that
> map to it.
>
>>
>> The big "problem" is that you have to have "knowledge of the buckets"
>> to later correlate them. Listing buckets is expensive.  I don't want to
>> hard-code bucket names into the application space if I can help it.
>> Writing "list of buckets" to another key simply moves the bottleneck
>> from one key to another.  Shifting buckets based on time works, but
>> it's obnoxious to have to correlate at 10 second intervals ....
>> 8640 buckets worth of obnoxious.  Every day.  Much easier to sort a
>> large dataset all at once from a single bucket.
>
> I'm not sure if you realize this, but "bucket" is really just a namespace
> in the key.  Said another way, <REAL KEY> = <BUCKET>/<KEY>.  The <REAL KEY>
> is what's hashed and determines the ring position.  For the most part there
> are no special provisions for a bucket (one exception I can think of is
> custom bucket properties, which get stored in the gossiped ring).  So while
> 8640 buckets seems wrong, it really shouldn't make much of a difference to
> Riak.  However, for the places where we do treat buckets specially, the
> code may not be optimized for a large number of buckets.  Once again,
> something to try and measure.
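>
> Rough sketch of what the hashing amounts to (not the exact riak_core
> code, just the shape of it):
>
>   ring_position(Bucket, Key) ->
>       %% the {Bucket, Key} pair is what gets hashed; the digest is the
>       %% object's position on the ring
>       crypto:hash(sha, term_to_binary({Bucket, Key})).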
>
>>
>> Assuming an entry size of 300 bytes, that works out to ~130G per day,
>> which will fit in RAM for the boxes.  Correlation can be done on separate
>> boxes later.  GigE cards bonded, etc.
>>
>> Hardware limitations aside, any guesses on where Riak itself will curl up
>> in a corner, sob, and not come out?
>>
>> If you had to do it, what suggestions would you all propose?
>> (Yes, I know I could just memcache with backup writes to
>> secondary/tertiary copies and flush later ... I'm interested in Riak.  :)
>
> I think, in general, writing to a single key 5k times a second will be
> problematic.  Riak simply was not designed to modify a single key in a tight
> loop. I'd love to be proven wrong.  I would either find a way to distribute
> these writes across the key space better or batch them locally at the
> application layer and send them in chunks that can be reconstructed later.
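> If you go the batching route, something like this on the client side is
> what I'm picturing (untested sketch; names, chunk size, and the actual
> Riak put are placeholders):
>
>   -module(batch_writer).
>   -export([loop/0]).
>
>   -define(MAX_BATCH, 500).
>
>   loop() -> loop([], 0).
>
>   %% accumulate entries, flushing on size or on a 1s lull
>   loop(Acc, N) when N >= ?MAX_BATCH ->
>       flush(lists:reverse(Acc)),
>       loop([], 0);
>   loop(Acc, N) ->
>       receive
>           {entry, E} -> loop([E | Acc], N + 1)
>       after 1000 ->
>           flush(lists:reverse(Acc)),
>           loop([], 0)
>       end.
>
>   flush([]) -> ok;
>   flush(Entries) ->
>       Chunk = term_to_binary(Entries),
>       %% one Riak put per chunk instead of one per entry; swap in
>       %% whatever client call you're actually using here
>       store_chunk(Chunk).
>
>   store_chunk(Bin) ->
>       io:format("would write ~p bytes~n", [byte_size(Bin)]).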
> -Ryan

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
