I see. My main point was that, regardless of collection method, we might
not need every single data point to calculate uniques.
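For instance (a rough sketch, not tied to any particular pipeline, with made-up names): if you sample by hashing the identifier instead of picking random rows, each distinct value is kept with a fixed probability, so the distinct count seen in the sample scales straight back up to an estimate of the true count.

    import hashlib

    def estimate_uniques(values, sample_rate=0.01):
        # Keep only values whose hash lands in the lowest `sample_rate`
        # fraction of the hash space. Every occurrence of a kept value is
        # sampled, so distinct-in-sample ~= sample_rate * true distinct count.
        threshold = int(sample_rate * 2**64)
        kept = set()
        for v in values:
            h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
            if h < threshold:
                kept.add(v)
        return int(len(kept) / sample_rate)

With a 1% rate only about 1% of the distinct values are ever stored, and the relative error shrinks roughly as 1/sqrt(rate * distinct values), so the approach is most accurate on exactly the large datasets where exact counting hurts.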

On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <tneg...@wikimedia.org> wrote:

> Yes -- we disabled it because there wasn't a use case. We have one now :)
>
> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> > I believe there is already an EL-Kafka pipeline and this would make it
>> easy to integrate page views with our regular processing.
>>
>> Note that the pipeline was disabled 6 months ago and thus my comment "in
>> the near term"
>>
>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>
>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <tneg...@wikimedia.org>
>> wrote:
>>
>>> I'd also like us to consider routing this dataset to Hadoop. I believe
>>> there is already an EL-Kafka pipeline and this would make it easy to
>>> integrate page views with our regular processing.
>>>
>>> Gilles -- are mobile page views included in your stream?
>>>
>>> -Toby
>>>
>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>>
>>>> >Great, then I guess it's a matter of only making the data go to files
>> and not to DB for the particular schema we'll create. Does that sound like
>>>> something feasible? How much work would be required to set it up?
>>>> I do not think this is feasible in the near term w/o changes on our
>>>> end. I am also not sure it is really needed. You are concerned about sending
>>>> stuff to the db due to "volume", correct? I do not understand why logging every
>>>> single data point would be needed. Maybe you can explain that in a bit
>>>> more detail for us to grasp the use case?
>>>>
>>>> If it is a matter of identifying distinct requests, that can be done by
>>>> sampling your dataset if it is large enough; we can help with that, and
>>>> Leila just put together some docs in this regard. While these are for
>>>> Hive queries, the principles can apply elsewhere:
>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
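As a quick illustration of the "if it is large enough" caveat (all numbers below are made up, not taken from any real dataset), a hash-based 1% sample typically recovers the distinct count within a few percent once there are a few hundred thousand distinct values:

    import hashlib
    import random

    def hash_frac(value):
        # Map a value to a pseudo-random fraction in [0, 1) by hashing it.
        digest = hashlib.sha1(str(value).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    # Fake traffic: one million requests spread over ~200,000 distinct tokens.
    events = ["token-%d" % random.randrange(200000) for _ in range(1000000)]

    rate = 0.01  # keep values whose hash lands in the lowest 1% of the hash space
    sampled = {e for e in events if hash_frac(e) < rate}
    print("true:", len(set(events)), "estimated:", int(len(sampled) / rate))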
>>>>
>>>>
>>>>
>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <gil...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> Right -- couldn't we just tag the URL?
>>>>>>
>>>>>
>>>>> The event of the user actually viewing the image is completely
>>>>> disconnected from the URL hit in Media Viewer, which is why we need EL and
>>>>> can't rely on existing server logs.
>>>>>
>>>>>
>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>>
>>>>>
>>>>> Great, then I guess it's a matter of only making the data go to files
>>>>> and not to DB for the particular schema we'll create. Does that sound like
>>>>> something feasible? How much work would be required to set it up?
>>>>>
>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <ao...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>
>>>>>> If you need something with higher throughput than eventlogging itself
>>>>>> supports…then let’s talk :D
>>>>>>
>>>>>> -Ao
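For offline processing against those archives, something along these lines could work. This is only a sketch: it assumes line-oriented files with one JSON event per line (the real archive format on stat1003 may differ), and the glob pattern and field names are illustrative.

    import glob
    import gzip
    import json
    from collections import Counter

    def iter_events(pattern):
        # Yield parsed events from (possibly gzipped) line-oriented log files,
        # skipping anything that does not contain a parseable JSON object.
        for path in sorted(glob.glob(pattern)):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt", errors="replace") as f:
                for line in f:
                    start = line.find("{")
                    if start == -1:
                        continue
                    try:
                        yield json.loads(line[start:])
                    except ValueError:
                        continue

    # Example: count events per schema across an archive directory.
    counts = Counter(e.get("schema", "?")
                     for e in iter_events("/srv/eventlogging/archive/*.log*"))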
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>>>>
>>>>>> You mean attach an X-Analytics parameter for extra images beyond the
>>>>>> one the user initially requested.
>>>>>>
>>>>>> But then we would undercount, basically missing all image views from
>>>>>> clicking the right arrow in the image viewer.
>>>>>> I'm not sure how much we would miss then.
>>>>>> IIRC Gilles said this browsing feature was used quite a lot, but I'm
>>>>>> not sure.
>>>>>>
>>>>>>
>>>>>> *From:* analytics-boun...@lists.wikimedia.org *On Behalf Of *Toby Negrin
>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>>>>> has an interest in Wikipedia and analytics.
>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>> instead of the DB
>>>>>>
>>>>>>
>>>>>>
>>>>>> Right -- couldn't we just tag the URL?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <ezac...@wikimedia.org>
>>>>>> wrote:
>>>>>>
>>>>>> Just to clarify, this is about prefetched images which have not been
>>>>>> shown to the public.
>>>>>>
>>>>>> They were sent to the browser ahead of a possible request to speed
>>>>>> things up, but in many cases were never actually requested.
>>>>>>
>>>>>>
>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* analytics-boun...@lists.wikimedia.org *On Behalf Of *Toby Negrin
>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>>>>> has an interest in Wikipedia and analytics.
>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>> instead of the DB
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi Gilles -- why won't the page view logs work by themselves for this
>>>>>> purpose? EL can be configured to write into Hadoop, which is probably the
>>>>>> best way to get the throughput you need, but it seems overcomplicated.
>>>>>>
>>>>>>
>>>>>>
>>>>>> -Toby
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <gil...@wikimedia.org>
>>>>>> wrote:
>>>>>>
>>>>>> This depends on [1], so we're not going to need that immediately, but
>>>>>> in order to help Erik Zachte with his RfC [2] to track unique media views
>>>>>> in Media Viewer, I'm going to need to use something almost exactly like
>>>>>> EventLogging. The main difference is that it should skip writing to the
>>>>>> database and write to a log file instead.
>>>>>>
>>>>>> That's because we'll be recording around 20-25M image views per day,
>>>>>> which would needlessly overload EventLogging: the data will be used for
>>>>>> offline stats generation and doesn't need to be made available in a
>>>>>> relational database. Of course, if storage space and EventLogging
>>>>>> capacity were no object, we could just use EL and keep the ever-growing
>>>>>> table forever, but I have the impression that we want to be reasonable
>>>>>> here and only write to a log, since that's what Erik needs.
>>>>>>
>>>>>> So here's the question: for a specific schema, can EventLogging work
>>>>>> the way it does but only record hits to a log file (maybe it already does
>>>>>> that before hitting the DB?) and not write to the DB? If not, how difficult
>>>>>> would it be to make EL capable of doing that?
>>>>>>
>>>>>>
>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>> [2]
>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts
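For scale, 20-25 million events per day is only about 230-290 events per second on average. Below is a minimal sketch of the kind of log-only sink being asked about, assuming events arrive as JSON lines on stdin; the names and paths are illustrative and this is not EventLogging's actual consumer API.

    import json
    import os
    import sys
    import time

    def log_only_sink(out_dir="/srv/log/mediaviewer"):
        # Append each incoming event to a per-day log file; never touch a DB.
        os.makedirs(out_dir, exist_ok=True)
        current_day, out = None, None
        for line in sys.stdin:
            try:
                event = json.loads(line)
            except ValueError:
                continue  # drop malformed lines
            day = time.strftime("%Y%m%d")
            if day != current_day:  # roll over to a new file at midnight
                if out:
                    out.close()
                out = open(os.path.join(out_dir, "events-%s.log" % day), "a")
                current_day = day
            out.write(json.dumps(event) + "\n")

    if __name__ == "__main__":
        log_only_sink()

A few hundred small appends per second is trivial for a flat log file, which is the point of skipping the relational store here.
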
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
