The most common issue with large MapReduce jobs is one of the nodes starting to swap (typically the node the request was made on).
Streaming results and pre-reduction[0] (where applicable) can often lower the memory overhead of running MapReduce jobs on large numbers of objects. That would be the first thing I would check. Also potentially applicable: http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/ 0: http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/, pre-reduce is down at the bottom. On Sun, Apr 14, 2013 at 6:47 PM, Chris Corbyn <[email protected]> wrote: > All, > > Just copying this from my stackoverflow post, as the riak tag doesn't get > much love over there :) It's fine to just outright say that Riak is never > going to work efficiently in this case because I'm inherently depending on > MapReduce. > > Everywhere I read, people say you shouldn't use Riak's MapReduce over an > entire bucket and that there are other ways of achieving your goals. I'm not > sure how, though. I'm also not clear on why using an entire bucket is slow, > if you only have one bucket in the entire system, so either way, you need to > go over all the entries. Passing in a list of key-bucket pairs has the same > effect. Maybe the rule should be "don't use MapReduce with more than a > handful of keys". Which makes me wonder (apart from link traversal) what use > it really has in the real world. > > I have a list of 500K+ documents that represent sales data. I need to view > this data in different ways: for example, how much revenue was made in each > month the business was operating? How much revenue did each product raise? > How many of each product were sold in a given month? I always thought > MapReduce was supposed to be good at solving these types of aggregate > problems. That sounds like a myth now though, unless we're just looking at > Hadoop. > > My documents are all in a bucket named 'sales' and they are records with the > following fields (as native Erlang records, not JSON): > > {"id":1, "product_key": "cyber-pet-toy", "price": "10.00", "tax": > "1.00", "created_at": 1365931758}. > > Let's take the example where I need to report the total revenue for each > product in each month over the past 4 years (that's basically the entire > bucket, but that's just the requirement), how does one use Riak's MapReduce > to do that efficiently? Even just trying to use an identity map operation on > the data I get a timeout after ~30 seconds, which MySQL handles in > milliseconds. > > I'm doing this in Erlang (using the protocol buffers client), but any > language is fine for an explanation. > > The equivalent SQL (MySQL) would be: > > SELECT SUM(price) AS revenue, > FROM_UNIXTIME(created_at, '%Y-%m') AS month, > product_key > FROM sales > GROUP BY month, product_key > ORDER BY month ASC; > > Even with secondary indexes, there's still a MapReduce involved in this > query, once you have the list of keys to process. > > (Ordering not important right now). > > Cheers, > > Chris > > _______________________________________________ > riak-users mailing list > [email protected] > http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com > _______________________________________________ riak-users mailing list [email protected] http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
