Hi Christian,
At
http://docs.basho.com/riak/1.3.0/references/appendices/MapReduce-Implementation/
one can read: "...any Riak node can also coordinate a MapReduce query
by sending a map-step evaluation request directly to the node
responsible for maintaining the input data. Map-step results are sent
back to the coordinating node, where reduce-step processing can produce
a unified result."
What you wrote implies that the above description is purely theoretical,
since if there is any problem accessing the data on a node, the
MapReduce job fails. We have also seen that deleting a key while a
MapReduce job is running makes the job run forever, which suggests your
description is accurate. For the documentation to hold in practice, it
seems one must first make sure that reading the input data will never
trigger any kind of error, otherwise the MapReduce job will fail (or get
stuck). Please correct me if I've misunderstood!
Now, if I want to split the processing of a list of keys across the
cluster, is there a way to know which node is supposed to hold at least
one copy of a given K/V pair?
If so, we could set up our own kind of MapReduce by sending each subset
of keys to a node known to hold at least one replica of those pairs (a
rough sketch of the idea follows below). With R==2, each key would then
need one local read on the node receiving the subset and only one more
read on another node that holds a copy. This distributed processing
could handle read-repair, aggregate the data and send the result to the
coordinating node.
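For illustration only (this is not from the thread), here is a toy
Python sketch of that scheme. The ring, node names and key encoding are
made up: Riak 1.3 does not expose preference lists to clients, so a real
implementation would need the ring state from the cluster. The
interesting part is grouping keys by their presumed primary node before
dispatching subsets.

    import hashlib
    from collections import defaultdict

    # Toy stand-in for Riak's ring: Riak hashes the bucket/key pair
    # (SHA-1) onto a 160-bit ring of partitions, each with a preference
    # list of nodes. RING_SIZE, NODES and the key encoding below are
    # all illustrative and do NOT locate real replicas.
    RING_SIZE = 64
    NODES = ['riak@node1', 'riak@node2', 'riak@node3']  # placeholders
    N_VAL = 2                                           # replicas per key

    def preflist(bucket, key):
        """Nodes assumed to hold replicas of (bucket, key), primary
        first, using the toy ring above."""
        digest = hashlib.sha1(('%s/%s' % (bucket, key)).encode()).digest()
        start = int.from_bytes(digest, 'big') % RING_SIZE
        return [NODES[(start + i) % len(NODES)] for i in range(N_VAL)]

    def group_keys_by_primary(bucket, keys):
        """Group keys by their primary node so each subset can be sent
        to a node that should already hold the data locally."""
        groups = defaultdict(list)
        for key in keys:
            groups[preflist(bucket, key)[0]].append(key)
        return groups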
Best Regards,
Bernard
On 28/02/2013 10:32, Christian Dahlqvist wrote:
Hi Boris,
Apart from not scaling quite as well as straight K/V access, emulating
multi-GET through MapReduce has another significant drawback. MapReduce
has no concept of quorum reads and only works on a single copy of the
data, which can be thought of essentially as a read with R=1 that does
not trigger read-repair. It can therefore give inconsistent or
incorrect results if the replicas do not all hold the same data. It is
worth noting that MapReduce was designed as a way to spread compute
work efficiently across the cluster; repurposing it for data collection
is not its intended use.
The recommended way to implement an efficient multi-GET is to perform
normal GET operations in parallel. If you are retrieving 20 objects,
you don't necessarily need to do all 20 GETs in parallel; you could
set it up to use perhaps 3 or 4 connections. If you then pair this
with a connection pool that can grow and shrink in size (perhaps
between a minimum and a maximum value) as load requires, you should be
able to retrieve the objects in a reasonable time without overloading
the cluster.
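A minimal Python sketch of this pattern, for illustration only: it
assumes the official Riak Python client's basic RiakClient/bucket.get
interface (details vary by client version), and the bucket and key
names are placeholders.

    import riak
    from concurrent.futures import ThreadPoolExecutor

    # Assumes the client can be shared across a few threads; if not,
    # create one client per worker instead.
    client = riak.RiakClient()               # defaults to a local node
    bucket = client.bucket('pages')          # placeholder bucket name

    def fetch(key):
        # A normal GET: coordinated with the configured R value and
        # able to trigger read-repair, unlike a MapReduce multi-GET.
        return key, bucket.get(key)

    def multi_get(keys, concurrency=4):
        """Fetch a batch of keys with a small, bounded number of
        parallel GETs (the 3 or 4 connections suggested above) rather
        than one MapReduce job or 20 simultaneous requests."""
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            return dict(pool.map(fetch, keys))

    # e.g. one 20-key "page":
    # page = multi_get(['key-%d' % i for i in range(20)])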
Best regards,
Christian
On 27 Feb 2013, at 02:18, Boris Okner <[email protected]> wrote:
Thanks Christian,
The problem I'm trying to solve is to find a way to retrieve values
for a limited number of keys with the best possible latency (or at
least decent latency balanced against decent throughput). Let's say we
have keys stored in some cache on top of Riak, and we want to retrieve
values 20 at a time to implement pagination. An alternative to
MapReduce would be to send multiple asynchronous GETs, but then we'd
have to worry about the connection pool being exhausted if there are
too many such "page" requests. So what would be the proper way to
handle the situation where we need to emulate multiple-key retrieval?
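One way to address the pool-exhaustion concern (purely illustrative,
not from the thread) is to cap the number of in-flight GETs across all
concurrent page requests with a shared semaphore; the limit and helper
names below are made up, and fetch is any per-key GET callable.

    import threading
    from concurrent.futures import ThreadPoolExecutor

    # Cap the total number of in-flight GETs across all concurrent
    # "page" requests so a burst of pages cannot exhaust the pool.
    MAX_INFLIGHT_GETS = 16                       # illustrative limit
    _gate = threading.BoundedSemaphore(MAX_INFLIGHT_GETS)

    def guarded_fetch(fetch, key):
        """Run a single GET only when a global slot is free."""
        with _gate:
            return fetch(key)

    def fetch_page(fetch, keys, workers_per_page=4):
        """Fetch one page of keys with a small thread pool; the shared
        semaphore bounds the cluster-wide concurrency across pages."""
        with ThreadPoolExecutor(max_workers=workers_per_page) as pool:
            return list(pool.map(lambda k: guarded_fetch(fetch, k), keys))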
On Tue, Feb 26, 2013 at 1:57 AM, Christian Dahlqvist
<[email protected]> wrote:
Hi Boris,
MapReduce is a very flexible and powerful way of querying Riak
and allows processing to be performed locally where the data
resides, which allows for efficient processing of larger data
sets. A consequence of this is that every MapReduce job requires a
covering set of vnodes (all vnodes that hold the data required
for processing) to participate, meaning that it puts considerably
more load on the system than straight K/V access and
therefore does not scale quite as well. It is primarily designed
for batch-type processing over reasonably large amounts of data
and scales well with increased data volumes as new nodes are
added. We do not, however, usually recommend using it as an
interface for realtime queries where low and predictable
latencies are required and the concurrency level, and therefore
the load on the cluster, cannot be controlled.
I am not sure I understand what you mean by the performance
degrading with the number of nodes, unless you are strictly
measuring latency rather than throughput. As the number of nodes
increases, it becomes more and more likely that multiple physical
nodes will be involved in the job, which adds to the amount
of communication and coordination required between the nodes,
thereby increasing latency. Could you please explain in more
detail what you are trying to achieve?
Best regards,
Christian
On 25 Feb 2013, at 16:41, Boris Okner <[email protected]> wrote:
Hello,
I'm experimenting with 2 Riak 1.3.0 nodes (both are "bare
metal"), and it looks like MapReduce performs better when one of
the nodes is down. The MapReduce requests are running on 20-key
blocks. So am I doing something wrong, or is this expected
behaviour, i.e. does MapReduce degrade as the number of nodes
increases? If the former, could you give me some pointers on how
to set it up to take advantage of multiple nodes?
Thanks in advance for your help,
Boris
_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com