@Erik: Reading this thread makes me think it might be interesting to have a chat about using Hadoop for indexing (https://github.com/elastic/elasticsearch-hadoop). I have no idea how you currently index, but I'd love to learn :) Please let me know if you think it could be useful!

Joseph
On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <ebernhard...@wikimedia.org> wrote:

> Makes sense. We will indeed be doing a batch process once a week to build
> the completion indices, which ideally will run through all the wikis in a
> day. We are going to do some analysis into how up to date our page view
> data really needs to be for scoring purposes, though; if we can get good
> scoring results while only updating page view info when a page is edited, we
> might be able to spread the load out across time that way and just hit the
> page view API once for each edit. Otherwise I'm sure we can do as suggested
> earlier and pull the data from Hive directly and stuff it into a temporary
> structure we can query while building the completion indices.
>
> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <dandree...@wikimedia.org> wrote:
>
>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <mobro...@wikimedia.org> wrote:
>>
>>> On 15 September 2015 at 19:37, Dan Andreescu <dandree...@wikimedia.org> wrote:
>>>
>>>>> I worry a little bit about the performance without having a batch API,
>>>>> but we can certainly try it out and see what happens. Basically we will be
>>>>> requesting the page view information for every NS_MAIN article in every
>>>>> wiki once a week. A quick sum against our search cluster suggests this is
>>>>> ~96 million API requests.
>>>
>>> 96M equals approx 160 req/s, which is more than sustainable for RESTBase.
>>
>> True, if we distributed the load over the whole week, but I think Erik
>> needs the results to be available weekly, as in, probably within a day or
>> so of issuing the request. Of course, if we were to serve this kind of
>> request from the API, we would make a better batch-query endpoint for his
>> use case. But I think it might be hard to make that useful generally. I
>> think for now, let's just collect these one-off pageview querying use cases
>> and slowly build them into the API when we can generalize two or more of
>> them into one endpoint.

--
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
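For reference, the back-of-the-envelope rate in the thread checks out, and Dan's caveat about compressing the work into a day can be quantified the same way (the 96 million figure and the weekly/one-day windows are the ones stated above; everything else is simple arithmetic):

```python
# Sanity-check the request-rate arithmetic from the thread:
# ~96 million API requests, spread over a week vs. over a single day.
requests = 96_000_000

seconds_per_week = 7 * 24 * 60 * 60   # 604,800 s
seconds_per_day = 24 * 60 * 60        # 86,400 s

weekly_rate = requests / seconds_per_week  # ~159 req/s, i.e. the ~160 quoted
daily_rate = requests / seconds_per_day    # ~1,111 req/s if done within a day

print(f"over a week: {weekly_rate:.0f} req/s")
print(f"over a day:  {daily_rate:.0f} req/s")
```

So the ~160 req/s figure assumes the load is spread evenly across the full week; finishing within a day, as the batch job would want, multiplies the sustained rate by seven.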
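Erik's idea of hitting the page view API once per edit could look roughly like the sketch below. It assumes the public Wikimedia Pageview REST API's per-article endpoint; the `pageview_url` helper and the example project/article/date values are illustrative, not anything from the thread:

```python
# Hedged sketch: construct a per-article Pageview API request URL, the kind
# of call that could be made once per edit. The endpoint shape follows the
# public Wikimedia Pageview REST API; pageview_url() is a hypothetical helper.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageview_url(project: str, article: str, start: str, end: str) -> str:
    """URL for daily pageview counts of one article (all access, all agents).

    start/end are YYYYMMDD date strings; the article title is percent-encoded.
    """
    encoded = quote(article, safe="")
    return f"{BASE}/{project}/all-access/all-agents/{encoded}/daily/{start}/{end}"

# Example: one week of daily counts for one article on English Wikipedia.
url = pageview_url("en.wikipedia", "Main Page", "20150901", "20150907")
print(url)
```

A real per-edit hook would fetch this URL and fold the counts into the completion-index scoring data, which avoids the 96-million-request weekly sweep at the cost of staler numbers for rarely edited pages.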
_______________________________________________ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics