Hello, I have to give a presentation this week on how the MapReduce (of Google and Hadoop fame) algorithm works.
I understand how map() works, and how reduce() works, and having read the google papers, I have an idea of their implementation (which I must say takes certain liberties with FP-derived terminology). As I understand it, we have a "map" function which is passed a set of key/value pairs, which, by applying a user function, produces an intermediate set of key-value pairs. I've come up with an example... return the url and how frequently occurs if the url contains a given word. I'm planning to pass a "map" function an indexed list of urls: >>> stuff [(1, 'http://www.beer.com'), (2, 'http://www.ban-beer.com'), (3, 'http://www.bbc.co.uk'), (4, 'http://www.beer.com'), (5, 'http://wwww.kernel.org')] And I have a map function with reverses the keys and values if the word is present: def myMap(stuff): return [(url, key) for key, url in stuff if "beer" in url] This does as expected: >>> myMap(stuff) [('http://www.beer.com', 1), ('http://www.ban-beer.com', 2), ('http://www.beer.com', 4)] What I want to do is now "group" these urls so that repeated urls have as their "partner" a lsit of indexes. To take a test example of the method I have in mind: def testGrouper(self): """Group occurences of a record together""" test_list = [('fred', 1), ('jim', 2), ('bill', 3), ('jim', 4)] grouped_list = [('fred', 1), ('jim', [2, 4]), ('bill' ,3)] self.assertEqual(myGroup(test_list), grouped_list) I have seen some code which purports to do this, but I thought it was ugly, and it seemed to return a generator object rather than a list. I refactored it for clarity thus: def myGroup(stuff): sorted_list = sorted(stuff) if not sorted_list: return reduce_key, reduce_list = sorted_list[0][0], sorted_list[0][1] for key, url in sorted_list[1:]: if key == reduce_key: reduce_list.append(value) else: yield (remote_key, remote_list) remote_key, remote_list = k, [v] yield(remote_key, remote_list) Does this make sense? I would like a clearer, more attractive way of making the test pass. If this can be done in functional style, even better. Your help is always appreciated! S. _______________________________________________ Tutor maillist - [email protected] http://mail.python.org/mailman/listinfo/tutor
