On 14 June 2011 13:05, Robert Collins <[email protected]> wrote:
> On Wed, Jun 15, 2011 at 6:34 AM, Martin Pool <[email protected]> wrote:
>> One idea that came up talking to Robert about the Services design was
>> exposing an external map/reduce interface, mentioned under
>> <https://dev.launchpad.net/ArchitectureGuide/ServicesRoadmap#A map/reduce facility>.
>> It is pretty blue sky at the moment but I think it is such an
>> interesting idea it would be worth writing down.
>
> Are you interested in implementing a map reduce API?  There are quite
> a few things we'd make a lot better by doing one IMO.
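For concreteness, the shape being discussed is roughly the following toy, in-process sketch; every name here is illustrative, and none of it is a real Launchpad API:

```python
# Illustrative sketch only: a minimal in-process map/reduce over
# bug-like records.  None of these names are real Launchpad APIs.

def mapreduce(objects, mapfn, reducefn, initial=None):
    """Apply mapfn to each object, then fold the results with reducefn."""
    result = initial
    for obj in objects:
        result = reducefn(result, mapfn(obj))
    return result

# Toy stand-ins for the bug records an object search might return.
bugs = [
    {"id": 1, "status": "Triaged", "messages": 4},
    {"id": 2, "status": "In Progress", "messages": 12},
    {"id": 3, "status": "Fix Released", "messages": 7},
]

# Count messages across all bugs that are not yet fixed.
open_message_count = mapreduce(
    (b for b in bugs if b["status"] != "Fix Released"),
    mapfn=lambda bug: bug["messages"],
    reducefn=lambda acc, n: (acc or 0) + n,
)
print(open_message_count)  # 16
```

The point of the real service would be that the iteration happens server-side, close to the data, with only the reduced result shipped back to the caller.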
I am interested, though I don't know if I'll have time to do it.
Having seen how much good stuff people get out of the API, but also
how much it is like sucking a camel through the eye of a needle, I
think there is a large potential win here.

> My mental sketch for this service would have it call forward: in
> python terms the signature is something like:
>   Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)

So the objectsearch parameter is something that can be compiled into a
Storm SQL expression, so that we don't have to map over everything in
the database?

> resultendpoint would get a signed POST from LP with the results, when
> they are ready.  However see below, we can be much more low key
> initially and just use email.

I think it is quite desirable not to require the client to have an
SMTP or HTTP listener.  For instance, I do not have one on my NATted
laptop, but I might like to experiment with the API.  But perhaps
email is the pragmatic thing to start with.  Another alternative would
be to stick the results in the librarian and let clients poll or
long-get.  (Hm, can we give a predictable URL in advance?)

> Having folk replicate LP in an adhoc fashion isn't just inelegant: any
> new bug analysis task requires someone new to pull all 800K bugs +
> 830K bugtasks + 9M messages out of the DB, store it locally, and then
> process it.  It makes running analysis a complex and time-consuming
> task.  It's great folk /can/ do it, but it's also hard to support -
> our top timeout today is due to folk analysing hardware DB records -
> at 2.7M rows into the collection it starts timing out.

Right, I agree: this is awful for all parties, and yet many people do
go to the trouble of doing it, which suggests a better way would be
worthwhile.

>> So things like the kanban that want to say "give me everything
>> assigned to mbp or jam or ...
>> and either inprogress or (fixreleased and fixed in the last 30
>> days)" could make a (say) javascript expression of such and get back
>> just the actually relevant bugs, rather than fetching a lot more
>> stuff and filtering client side.
>
> There are python sandboxes around we could use too, though javascript
> is perhaps easier to be confident in.

(Someone here at Velocity, I think BrowserMob, takes the fairly
creative approach of spinning up an EC2 instance holding the raw
results in a MySQL database.  The user can do anything they want and
can only hurt themselves.  It's not exactly a good fit for us but it
is quite clever.)

> I wouldn't try to run map reduce jobs in a webserver context
> initially; it's certainly possible if we wanted to aim at it, but
> we'd want oh 70%-80% use on the mapreduce cluster - we'd need an
> *awfully* large number of jobs coming through it to need a parallel
> cluster large enough to process every bug in ubuntu in < 5 seconds.

I think mapreduce as such would not make sense there.  Taking
something that can generate sensible database queries with
non-enormous results, and then doing some manipulation of the results,
all capped by the existing request timeout, could make sense.  The
queries that currently make up, say, a +bugs page all run in 2s (or
whatever), and an API call could reasonably be allowed to do a similar
amount of work.

Perhaps it is better to steer straight for mapreduce if it's actually
cheap.  I have some fear, though, of introducing the new
infrastructure and concepts needed to run it.

> I think a very simple map reduce can be done as follows:
>  - use our existing api object representations
>  - allow forwarding the output of a map reduce back into map reduce (chaining)
>  - use the existing Job framework to dispatch and manage map reduce runs
>  - start with a concurrency limit of 2
>  - pick whatever language is most easily sandboxed
>  - send results to the submitter's preferred email address.
>
> These points are picked to minimise development time: if we find that
> the result is awesome, we can look at doing more-complex but even
> greater return approaches such as getting http://discoproject.org to
> sandbox and using it instead; letting object searches return bug +
> tasks and so forth.

Well, that's pretty awesome to hear you think this could be simple.
Do you imagine this mapreduce would talk directly to the db like other
jobs?  I can imagine that could work...

So, actually, we could start by only supporting the objectsearch
parameter, not any map or reduce function.  I think many of these
calls really can just be expressed in SQL.  Very interesting...

Martin

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
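The "objectsearch only" starting point could be as simple as a small expression tree that the service compiles to SQL server-side.  A sketch of that idea, using the kanban query from earlier in the thread as the example; everything here is hypothetical (real Launchpad would build Storm expressions, not raw SQL strings, and the cut-off date is just a stand-in for "30 days ago"):

```python
# Hypothetical sketch: compile a nested (op, *args) tuple into a
# parameterized SQL WHERE fragment.  Not a real Launchpad interface.

def compile_search(expr):
    """Compile an expression tree into (sql_fragment, params)."""
    op = expr[0]
    if op in ("and", "or"):
        parts = [compile_search(e) for e in expr[1:]]
        sql = (" %s " % op.upper()).join("(%s)" % p[0] for p in parts)
        params = [v for p in parts for v in p[1]]
        return sql, params
    if op == "in":
        _, column, values = expr
        placeholders = ", ".join("?" for _ in values)
        return "%s IN (%s)" % (column, placeholders), list(values)
    if op in ("=", ">="):
        _, column, value = expr
        return "%s %s ?" % (column, op), [value]
    raise ValueError("unknown op: %r" % (op,))

# "assigned to mbp or jam, and either In Progress, or Fix Released
# within the last 30 days" (date is an illustrative placeholder).
query = ("and",
         ("in", "assignee", ["mbp", "jam"]),
         ("or",
          ("=", "status", "In Progress"),
          ("and",
           ("=", "status", "Fix Released"),
           (">=", "date_fixed", "2011-05-15"))))

sql, params = compile_search(query)
print(sql)
print(params)
```

Because only the expression tree crosses the wire, the server stays in control of what actually runs against the database, and the existing request timeout machinery applies unchanged.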

