@Erik:
Reading this thread makes me think it might be interesting to have a chat
about using Hadoop for indexing
(https://github.com/elastic/elasticsearch-hadoop).
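
To give a rough idea, here is a minimal, untested sketch of what writing
documents from PySpark through elasticsearch-hadoop could look like. The ES
node, index name and document fields below are invented for illustration,
not how the search cluster is actually set up:

import json
from pyspark import SparkContext

sc = SparkContext(appName="es-hadoop-sketch")

# Pretend these documents come out of an upstream Hadoop/Spark job.
docs = sc.parallelize([
    {"title": "Main_Page", "weekly_views": 678},
]).map(lambda doc: (None, json.dumps(doc)))

es_conf = {
    "es.nodes": "elastic-host.example.org",    # hypothetical ES node
    "es.port": "9200",
    "es.resource": "enwiki_completion/page",   # hypothetical index/type
    "es.input.json": "yes",                    # values are JSON strings
}

docs.saveAsNewAPIHadoopFile(
    path="-",  # ignored by EsOutputFormat, but the API needs a path
    outputFormatClass="org.elasticsearch.hadoop.mr.EsOutputFormat",
    keyClass="org.apache.hadoop.io.NullWritable",
    valueClass="org.elasticsearch.hadoop.mr.LinkedMapWritable",
    conf=es_conf,
)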
I have no idea how you currently index, but I'd love to learn :)
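
And for the Hive route Erik mentions below, a rough sketch of snapshotting
weekly page view counts into a temporary structure that the completion-index
build could query. The filters and output path are only an example, not a
worked-out query:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="weekly-pageviews-sketch")
sqlContext = HiveContext(sc)

# Aggregate one week of page views from the pageview_hourly Hive table.
weekly_views = sqlContext.sql("""
    SELECT project, page_title, SUM(view_count) AS weekly_views
    FROM wmf.pageview_hourly
    WHERE year = 2015 AND month = 9 AND day BETWEEN 9 AND 15
      AND agent_type = 'user'
    GROUP BY project, page_title
""")

# Keep it around as a temp table (or write it out) for the index build.
weekly_views.registerTempTable("weekly_pageviews")
weekly_views.write.parquet("hdfs:///tmp/weekly_pageviews")  # hypothetical path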
Please let me know if you think it could be useful!
Joseph

On Wed, Sep 16, 2015 at 5:15 AM, Erik Bernhardson <
ebernhard...@wikimedia.org> wrote:

> Makes sense. We will indeed be doing a batch process once a week to build
> the completion indices, which ideally will run through all the wikis in a
> day. We are going to do some analysis into how up to date our page view
> data really needs to be for scoring purposes, though. If we can get good
> scoring results while only updating page view info when a page is edited,
> we might be able to spread the load out over time that way and just hit
> the page view API once for each edit. Otherwise I'm sure we can do as
> suggested earlier and pull the data from Hive directly and stuff it into a
> temporary structure we can query while building the completion indices.
>
> On Tue, Sep 15, 2015 at 7:16 PM, Dan Andreescu <dandree...@wikimedia.org>
> wrote:
>
>> On Tue, Sep 15, 2015 at 6:56 PM, Marko Obrovac <mobro...@wikimedia.org>
>> wrote:
>>
>>> On 15 September 2015 at 19:37, Dan Andreescu <dandree...@wikimedia.org>
>>> wrote:
>>>
>>>>> I worry a little bit about the performance without having a batch API,
>>>>> but we can certainly try it out and see what happens. Basically we will
>>>>> be requesting the page view information for every NS_MAIN article in
>>>>> every wiki once a week. A quick sum against our search cluster suggests
>>>>> this is ~96 million API requests.
>>>>>
>>>>
>>> 96M spread over a full week is about 96,000,000 / 604,800 s ≈ 160 req/s,
>>> which is more than sustainable for RESTBase.
>>>
>>
>> True, if we spread the load over the whole week, but I think Erik needs
>> the results to be available weekly, as in, probably within a day or so of
>> issuing the request. Of course, if we were to serve this kind of request
>> from the API, we would build a better batch-query endpoint for his use
>> case, but I think it might be hard to make that generally useful. For now,
>> let's just collect these one-off pageview-querying use cases and slowly
>> build them into the API when we can generalize two or more of them into
>> one endpoint.
>>


-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
