On Fri, Jun 17, 2011 at 1:21 PM, Martin Pool <[email protected]> wrote:
>> My mental sketch for this service would have it call forward: in
>> python terms the signature is something like:
>> Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)
>
> So the objectsearch parameter is something that can be compiled into a
> storm sql expression, so that we don't have to map over everything in
> the database?

Something like that. I think we'd want to create -one- search language
and let that evolve as needed.

>> resultendpoint would get a signed POST from LP with the results, when
>> they are ready. However see below, we can be much more low key
>> initially and just use email.
>
> I think it is quite desirable not to require the client to have an smtp
> or http listener. For instance, I do not have one on my natted
> laptop, but I might like to experiment with the API. But perhaps it's
> the pragmatic thing to start with. Another alternative would be to
> stick the results in the librarian and let them poll or long-get. (Hm,
> can we give a predictable URL in advance?)

Email is approximately zero development to make happen. We can
obviously iterate towards any degree of polish. I think there's room to
aim at different sorts of completion long term, but POST - passing a
message forward - is pretty standard for this sort of thing; long poll
etc. can be built on top of it.

>> I wouldn't try to run map reduce jobs in a webserver context
>> initially; it's certainly possible if we wanted to aim at it, but we'd
>> want oh 70%-80% use on the mapreduce cluster - we'd need an *awfully*
>> large number of jobs coming through it to need a parallel cluster
>> large enough to process every bug in ubuntu in < 5 seconds.
>
> I think mapreduce as such would not make sense. Taking something that
> can generate sensible database queries with non-enormous results, and
> then doing some manipulation of the results, all capped by the
> existing request timeout, could make sense.
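For concreteness, here is a toy sketch of the map/reduce contract being discussed. Nothing here is a real Launchpad API: `run_mapreduce`, the bug dicts, and the mapfn/reducefn shapes are all illustrative stand-ins for the proposed `Launchpad.mapreduce(objectsearch, mapfn, reducefn, resultendpoint)` call.

```python
# Illustrative only: simulates the proposed mapreduce contract in-process.
# The real service would run mapfn/reducefn server-side over the result of
# an objectsearch, and deliver the output via POST or email.

from collections import defaultdict

def run_mapreduce(objects, mapfn, reducefn):
    """Apply mapfn to each object, group the emitted (key, value)
    pairs by key, then reduce each group to a single value."""
    groups = defaultdict(list)
    for obj in objects:
        for key, value in mapfn(obj):
            groups[key].append(value)
    return {key: reducefn(key, values) for key, values in groups.items()}

# Hypothetical user-supplied functions: count bugs per importance.
def mapfn(bug):
    yield bug["importance"], 1

def reducefn(key, values):
    return sum(values)

bugs = [
    {"id": 1, "importance": "High"},
    {"id": 2, "importance": "Low"},
    {"id": 3, "importance": "High"},
]
result = run_mapreduce(bugs, mapfn, reducefn)
print(result)  # {'High': 2, 'Low': 1}
```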
> The queries that
> currently make up say a +bugs page all run in 2s (or whatever), and an
> API call could reasonably be allowed to do a similar amount of work.
> Perhaps it is better to steer straight for mapreduce if it's actually
> cheap. I have some fear of the pipelines in introducing new
> infrastructure and concepts to run it.

Constraining this to run on at most 75 bugs or something would be
(IMO) useless. I don't think the problems are close enough to consider
implementing one solution, and I don't think a browser-scale system
would scale to working on the million-row datasets we have.

> Well, that's pretty awesome to hear you think this could be simple.
> Do you imagine this mapreduce would talk directly to the db like other
> jobs? I can imagine that could work...

I think the driver pulling stuff out might; no other part of it would,
and it would require managing its transactions to avoid
long-transaction issues.

-Rob

_______________________________________________
Mailing list: https://launchpad.net/~launchpad-dev
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~launchpad-dev
More help   : https://help.launchpad.net/ListHelp
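To illustrate the long-transaction concern raised at the end of the thread: a driver can stream rows out in keyed batches and end its transaction between batches, so no single transaction stays open for a whole million-row scan. This is a hypothetical sketch, not Launchpad or Storm code; sqlite3 stands in for the real database.

```python
# Hypothetical driver loop for pulling rows without a long-lived
# transaction: read in primary-key-ordered batches, committing after
# each batch. sqlite3 is used here only as a stand-in database.
import sqlite3

def iter_in_batches(conn, batch_size=2):
    """Yield bug rows in id order, one short read per batch."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, title FROM bug WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]
        conn.commit()  # end the transaction between batches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bug (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO bug VALUES (?, ?)",
    [(1, "crash"), (2, "typo"), (3, "hang"), (4, "leak"), (5, "race")],
)
conn.commit()

ids = [bug_id for bug_id, _ in iter_in_batches(conn)]
print(ids)  # [1, 2, 3, 4, 5]
```

Keying the batches on the primary key (rather than OFFSET) keeps each read cheap and makes the scan resumable if the driver restarts.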

