Les,

maybe it's worth looking into Beetle [1], an HA messaging solution built 
on RabbitMQ and Redis. It supports multiple brokers and de-duplicates 
messages using Redis. It's written in Ruby, but it should still give you 
some inspiration on how something like this could be achieved.
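To illustrate the de-duplication trick, here's a rough Python sketch of the
first-claim-wins pattern; the key prefix, TTL, and the FakeRedis stand-in are
made up for illustration, and with a real Redis you'd pass a redis-py client
instead (this is a sketch of the general idea, not of Beetle's actual code):

```python
DEDUP_TTL = 3600  # seconds to remember a message ID (made-up value)

def claim_message(client, message_id):
    """Return True if this consumer is the first to claim message_id.

    Relies on an atomic SET with NX (only set if absent) and EX (expiry),
    so two concurrent consumers can never both claim the same ID.
    """
    return bool(client.set("dedup:%s" % message_id, 1, nx=True, ex=DEDUP_TTL))

class FakeRedis:
    """In-memory stand-in for the one redis-py call used above."""
    def __init__(self):
        self.store = {}

    def set(self, key, value, nx=False, ex=None):
        if nx and key in self.store:
            return None  # redis-py returns None when NX fails
        self.store[key] = value
        return True

if __name__ == "__main__":
    r = FakeRedis()
    print(claim_message(r, "tweet-42"))  # True: first claim wins
    print(claim_message(r, "tweet-42"))  # False: duplicate, skip it
```

The expiry keeps the dedup set from growing without bound, at the cost of
only catching duplicates that arrive within the TTL window.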

On a different note, you could draw some inspiration from existing 
implementations that build on top of Riak Core, such as riak_zab [2] or 
riak_id [3]. You'd benefit from everything Riak itself is built on, and 
you could fashion your own implementation on top of that, in effect faking 
a queue system with Riak KV as the storage backend.
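A faked queue on a KV store could look roughly like the following sketch, 
where an in-memory dict stands in for a Riak bucket; with real Riak you'd 
use a client plus key listing or secondary indexes instead (with the usual 
caveats about key-listing cost), and the key format here is made up:

```python
import time
import uuid

class KVStore:
    """In-memory stand-in for a Riak bucket."""
    def __init__(self):
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def keys(self):
        return sorted(self.data)  # time-ordered thanks to the key format

    def get(self, key):
        return self.data[key]

    def delete(self, key):
        self.data.pop(key, None)

def enqueue(store, payload):
    # Timestamp prefix keeps keys roughly in arrival order; the UUID
    # suffix avoids collisions between concurrent writers.
    key = "%020.6f:%s" % (time.time(), uuid.uuid4())
    store.put(key, payload)
    return key

def drain(store):
    """Process entries oldest-first and remove them."""
    out = []
    for key in store.keys():
        out.append(store.get(key))
        store.delete(key)
    return out
```

The writes inherit whatever replication the underlying store gives you, 
which is the whole point of parking the queue in Riak KV.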

As a final thought, there's a plugin for RabbitMQ [4] that stores messages 
directly in Riak, again benefiting from Riak's fault tolerance, but baked 
right into your messaging system. You could run multiple Rabbits, all 
writing messages directly to Riak, and fail over should one of the 
Rabbits go down.
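The client-side failover part can be as simple as trying brokers in order; 
here's a minimal Python sketch where the broker names and the connect 
callable are placeholders (with RabbitMQ you'd attempt e.g. a pika 
connection per host instead):

```python
def connect_with_failover(brokers, connect):
    """Return (broker, connection) for the first broker that accepts
    a connection; raise if they are all down."""
    last_error = None
    for broker in brokers:
        try:
            return broker, connect(broker)
        except Exception as exc:  # connection refused, timeout, ...
            last_error = exc
    raise RuntimeError("all brokers down: %r" % last_error)

if __name__ == "__main__":
    def fake_connect(host):
        # Stand-in for a real broker connection attempt.
        if host == "rabbit1":
            raise IOError("connection refused")
        return "conn-to-%s" % host

    broker, conn = connect_with_failover(["rabbit1", "rabbit2"], fake_connect)
    print(broker)  # rabbit2
```

Since every Rabbit writes through to Riak, a failed broker loses no 
already-stored messages; the client just reconnects elsewhere.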

Mathias Meyer
Developer Advocate, Basho Technologies

[1] http://xing.github.com/beetle/
[2] https://github.com/jtuple/riak_zab
[3] https://github.com/seancribbs/riak_id
[4] https://github.com/jbrisbin/riak-exchange


On Wednesday, June 22, 2011 at 00:07, Les Mikesell wrote:

> I'd like to have fully redundant feeds with no single point of failure, 
> but avoid the work of indexing the duplicate copy and having it written 
> to a bitcask even if it would eventually be cleaned up.
> 
> 
> On 6/21/2011 4:43 PM, Sylvain Niles wrote:
> > Why not write to a queue bucket with a timestamp and have a queue
> > processor move writes to the "final" bucket once they're over a
> > certain age? It can dedup/validate at that point too.
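(That mover could look roughly like this Python sketch, with plain dicts 
standing in for the two buckets; the age threshold and the 'id'/'ts' field 
names are made up for illustration:

```python
import time

def move_aged(queue_bucket, final_bucket, min_age, now=None):
    """Move entries older than min_age seconds into the final bucket,
    de-duplicating on each entry's id as they cross over."""
    now = time.time() if now is None else now
    for key, entry in list(queue_bucket.items()):
        if now - entry["ts"] >= min_age:
            final_bucket.setdefault(entry["id"], entry)  # first copy wins
            del queue_bucket[key]
```

Only the final bucket would carry the expensive indexing, which is exactly 
what Les wants to avoid doing twice.)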
> > 
> > 
> > On Tue, Jun 21, 2011 at 2:26 PM, Les Mikesell <lesmikes...@gmail.com> wrote:
> > > Where can I find the redis hacks that get close to clustering? Would
> > > membase work with synchronous replication on a pair of nodes for a reliable
> > > atomic 'check and set' operation to dedup redundant data before writing to
> > > riak? Conceptually I like the 'smart client' fault tolerance of
> > > memcache/membase and restricting it to a pair of machines would keep the
> > > client configuration reasonable.
> > > 
> > >  -Les
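(For the arbitration Les describes, a memcached-style atomic 'add' -- fail 
if the key already exists -- is arguably enough; the full gets/cas dance is 
only needed for read-modify-write. A Python sketch with an in-memory 
stand-in for the store, key prefix made up:

```python
class MemStore:
    """In-memory stand-in for a memcached/membase-style store."""
    def __init__(self):
        self.data = {}

    def add(self, key, value):
        """Atomic add: succeed only if the key does not exist yet."""
        if key in self.data:
            return False
        self.data[key] = value
        return True

def should_write(store, item_id):
    # First feed to register the item's ID wins; the duplicate
    # feed skips its Riak write.
    return store.add("seen:%s" % item_id, 1)
```

Whether membase's replication is synchronous enough to make this safe 
across a node failure is the part worth verifying.)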
> > > 
> > > 
> > > On 6/18/2011 6:54 PM, John D. Rowell wrote:
> > > > 
> > > > The "real" queues like HornetQ and others can take care of this without
> > > > a single point of failure but it's a pain (in my opinion) to set them up
> > > > that way, and usually with all the cluster and failover features active
> > > > they get quite slow for writes. We use Redis for this because it's
> > > > simpler and lightweight. The problem is that there is no real clustering
> > > > option for Redis today, even though there are some hacks that get
> > > > close. When we cannot afford a single point of failure or any downtime,
> > > > we tend to use MongoDB for simple queues. It has full cluster support
> > > > and the performance is pretty close to what you get with Redis in this
> > > > use case.
> > > > 
> > > > OTOH you could keep it all Riak and setup a separate small cluster with
> > > > a RAM backend and use that as a queue, probably with similar
> > > > performance. The idea here is that you can scale these clusters (the
> > > > "queue" and the indexed production data) independently in response to
> > > > your load patterns, and have optimum hardware and I/O specs for the
> > > > different cluster nodes.
> > > > 
> > > > -jd
> > > > 
> > > > 2011/6/18 Les Mikesell <lesmikes...@gmail.com>
> > > > 
> > > >  Is there a good way to handle something like this with redundancy
> > > >  all the way through? On simple key/value items you could have two
> > > >  readers write the same things to riak and let bitcask cleanup
> > > >  eventually discard one, but with indexing you probably need to use
> > > >  some sort of failover approach up front. Do any of those queue
> > > >  managers handle that without adding their own single point of
> > > >  failure? Assuming there are unique identifiers in the items being
> > > >  written, you might use the CAS feature of redis to arbitrate writes
> > > >  into its queue, but what happens when the redis node fails?
> > > > 
> > > >  -Les
> > > > 
> > > > 
> > > > 
> > > >  On 6/17/11 11:48 PM, John D. Rowell wrote:
> > > > 
> > > >  Why not decouple the twitter stream processing from the
> > > >  indexing? More than
> > > >  likely you have a single process consuming the spritzer stream,
> > > >  so you can put
> > > >  the fetched results in a queue (hornetq, beanstalk, or even a
> > > >  simple Redis
> > > >  queue) and then have workers pull from the queue and insert into
> > > >  Riak. You could
> > > >  run one worker per node and thus insert in parallel into all
> > > >  nodes. If you need
> > > >  free CPU (e.g. for searches), just throttle the workers to some
> > > >  sane level. If
> > > >  you see the queue getting bigger, add another Riak node (and
> > > >  thus another local
> > > >  worker).
> > > > 
> > > >  -jd
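(The throttled worker John describes can be sketched in a few lines of 
Python; the rate, timeout, and the insert callable are placeholders for a 
real Riak write:

```python
import time
from queue import Queue, Empty

def run_worker(q, insert, max_per_second, stop_after=None):
    """Pull items from q and hand them to insert(), sleeping between
    items so the node keeps spare CPU for searches."""
    interval = 1.0 / max_per_second
    processed = 0
    while stop_after is None or processed < stop_after:
        try:
            item = q.get(timeout=0.1)
        except Empty:
            break  # queue drained
        insert(item)
        processed += 1
        time.sleep(interval)  # crude throttle
    return processed
```

Lower max_per_second to free CPU; if the queue keeps growing anyway, that's 
the signal to add another node and worker.)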
> > > > 
> > > > 2011/6/13 Steve Webb <sw...@gnip.com>
> > > > 
> > > > 
> > > >  Ok, I've changed my two VMs to each have:
> > > > 
> > > >  3 CPUs, 1GB ram, 120GB disk
> > > > 
> > > >  I'm ingesting the twitter spritzer stream (about 10-20
> > > >  tweets per second,
> > > >  approx 2k of data per tweet). One bucket is storing the
> > > >  non-indexed tweets
> > > >  in full. Another bucket is storing the indexed tweet
> > > >  string, id, date and
> > > >  username. A maximum of 20 clients can be hitting the
> > > >  'cluster' at any one time.
> > > > 
> > > >  I'm using n_val=2 so there is replication going on behind
> > > >  the scenes.
> > > > 
> > > >  I'm using a hardware load-balancer to distribute the work
> > > >  amongst the two
> > > >  nodes and now I'm seeing about 75% CPU usage as opposed to
> > > >  100% on one node
> > > >  and 50% on the replicating-only node.
> > > > 
> > > >  I've monitored the VM over the last few days and it seems to
> > > >  be mostly
> > > >  CPU-bound. The disk I/O is low. The Network I/O is low.
> > > > 
> > > >  Q: Can I change the pre-commit to a post-commit trigger or
> > > >  something perhaps
> > > >  or will that make any difference at all? I'm ok if the
> > > >  tweet stuff doesn't
> > > >  get indexed immediately and there's a slight lag in indexing
> > > >  if it saves on CPU.
> > > > 


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
