So will Riak actually be good at this sort of aggregate query over 500K records? Or am I trying to flex what is effectively a key-value store to do something it's not really good at doing? ;) It feels like I'd need a lot of hardware to get Riak to do this in the same amount of time MySQL takes. Evan, thanks for the pointers: I've put a sketch of what I'm running just below, and a sketch of how I read your streaming/pre-reduce suggestion at the very bottom, after the quoted thread.
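To make this concrete, here is roughly what I've been running. It's a minimal, untested sketch: it assumes the stored values are proplists written with term_to_binary/1, that price may arrive as a decimal string as in my example below, and that this (hypothetical) sales_mr module is compiled and on the code path of every Riak node.

    -module(sales_mr).
    -export([map_month_product/3, reduce_sum/2]).

    %% Gregorian seconds at the Unix epoch (1970-01-01 00:00:00 UTC).
    -define(EPOCH, 62167219200).

    %% Map phase: called once per object on the vnode that holds it.
    %% Emits [{{Year, Month, ProductKey}, Price}].
    map_month_product(Object, _KeyData, _Arg) ->
        Sale = binary_to_term(riak_object:get_value(Object)),
        Secs = proplists:get_value(created_at, Sale),
        {{Y, M, _D}, _Time} =
            calendar:gregorian_seconds_to_datetime(Secs + ?EPOCH),
        Price = to_number(proplists:get_value(price, Sale)),
        [{{Y, M, proplists:get_value(product_key, Sale)}, Price}].

    %% Reduce phase: sum prices per {Year, Month, ProductKey}. Riak may
    %% call this repeatedly on partial results, so it must accept its own
    %% output as input; {Key, Sum} pairs fold the same way map output does.
    reduce_sum(Values, _Arg) ->
        Totals = lists:foldl(fun({K, V}, D) -> dict:update_counter(K, V, D) end,
                             dict:new(), Values),
        dict:to_list(Totals).

    %% Prices like "10.00" arrive as strings/binaries in my data.
    to_number(N) when is_number(N) -> N;
    to_number(B) when is_binary(B) -> to_number(binary_to_list(B));
    to_number(S) when is_list(S) ->
        case string:to_float(S) of
            {F, _Rest} when is_float(F) -> F;
            {error, no_float} -> list_to_integer(S)
        end.

And the whole-bucket invocation (the part that times out for me):

    {ok, Results} = riakc_pb_socket:mapred(
        Pid, <<"sales">>,
        [{map,    {modfun, sales_mr, map_month_product}, none, false},
         {reduce, {modfun, sales_mr, reduce_sum},        none, true}]).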
On 15 Apr 2013, at 12:41, Evan Vigil-McClanahan <[email protected]> wrote:

> The most common issue with large MapReduce jobs is one of the nodes
> starting to swap (typically the node the request was made on).
>
> Streaming results and pre-reduction[0] (where applicable) can often
> lower the memory overhead of running MapReduce jobs on large numbers
> of objects.
>
> That would be the first thing I would check. Also potentially
> applicable:
> http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/
>
> 0: http://docs.basho.com/riak/1.1.4/references/appendices/MapReduce-Implementation/
>    (pre-reduce is down at the bottom)
>
> On Sun, Apr 14, 2013 at 6:47 PM, Chris Corbyn <[email protected]> wrote:
>> All,
>>
>> Just copying this from my Stack Overflow post, as the riak tag doesn't get
>> much love over there :) It's fine to just outright say that Riak is never
>> going to work efficiently in this case because I'm inherently depending on
>> MapReduce.
>>
>> Everywhere I read, people say you shouldn't use Riak's MapReduce over an
>> entire bucket and that there are other ways of achieving your goals. I'm
>> not sure how, though. I'm also not clear on why running over an entire
>> bucket is slow if you only have one bucket in the whole system: either way
>> you have to visit all the entries, and passing in a list of bucket/key
>> pairs has the same effect. Maybe the rule should be "don't use MapReduce
>> with more than a handful of keys", which makes me wonder what real-world
>> use it has apart from link traversal.
>>
>> I have 500K+ documents that represent sales data, and I need to view this
>> data in different ways. For example: how much revenue was made in each
>> month the business was operating? How much revenue did each product raise?
>> How many units of each product were sold in a given month? I always
>> thought MapReduce was supposed to be good at solving exactly these kinds
>> of aggregate problems, but that's starting to sound like a myth, unless
>> we're only talking about Hadoop.
>>
>> My documents all live in a bucket named "sales". They are stored as native
>> Erlang terms, not JSON, but shown as JSON they have the following fields:
>>
>> {"id": 1, "product_key": "cyber-pet-toy", "price": "10.00",
>>  "tax": "1.00", "created_at": 1365931758}
>>
>> Take the example where I need to report the total revenue for each product
>> in each month over the past 4 years (that's basically the entire bucket,
>> but that's just the requirement): how does one use Riak's MapReduce to do
>> that efficiently? Even just running an identity map operation over the
>> data, I get a timeout after ~30 seconds, while MySQL handles the query in
>> milliseconds.
>>
>> I'm doing this in Erlang (using the protocol buffers client), but any
>> language is fine for an explanation.
>>
>> The equivalent SQL (MySQL) would be:
>>
>> SELECT SUM(price) AS revenue,
>>        FROM_UNIXTIME(created_at, '%Y-%m') AS month,
>>        product_key
>> FROM sales
>> GROUP BY month, product_key
>> ORDER BY month ASC;
>>
>> Even with secondary indexes there's still a MapReduce involved in this
>> query, once you have the list of keys to process.
>>
>> (Ordering is not important right now.)
>>
>> Cheers,
>>
>> Chris

_______________________________________________
riak-users mailing list
[email protected]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
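P.S. For the archives, here is how I read the streaming + pre-reduce suggestion from the Erlang PB client. Again an untested sketch: riakc_pb_socket:mapred_stream/4 is the client's streaming call, but the do_prereduce and reduce_phase_batch_size phase arguments are just my reading of the MapReduce implementation appendix linked above, so treat them as assumptions. It also reuses the hypothetical sales_mr module from my sketch at the top.

    -module(sales_stream).
    -export([stream_monthly_revenue/1]).

    %% Stream results back as they arrive instead of buffering the whole
    %% result set on the coordinating node, and ask the reduce phase to
    %% pre-reduce on each vnode to cut memory use.
    stream_monthly_revenue(Pid) ->
        Query = [{map, {modfun, sales_mr, map_month_product}, none, false},
                 %% do_prereduce / reduce_phase_batch_size: my reading of
                 %% the MapReduce-Implementation appendix; assumptions.
                 {reduce, {modfun, sales_mr, reduce_sum},
                  [do_prereduce, {reduce_phase_batch_size, 1000}], true}],
        {ok, ReqId} =
            riakc_pb_socket:mapred_stream(Pid, <<"sales">>, Query, self()),
        collect(ReqId, []).

    %% Accumulate streamed phase results until the server says 'done'
    %% (message shapes as I understand riakc_pb_socket's streaming API).
    collect(ReqId, Acc) ->
        receive
            {ReqId, {mapred, _Phase, Results}} -> collect(ReqId, Results ++ Acc);
            {ReqId, done}                      -> {ok, Acc}
        after 60000 ->
            {error, timeout}
        end.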
