I see. My main point was that, regardless of collection method, we might not need every single data point to calculate uniques.

On Wed, Jan 7, 2015 at 10:38 AM, Toby Negrin <tneg...@wikimedia.org> wrote:

> Yes -- we disabled it because there wasn't a use case. We have one now :)
>
> On Wed, Jan 7, 2015 at 10:32 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> > I believe there is already an EL-Kafka pipeline and this would make it
>> > easy to integrate page views with our regular processing.
>>
>> Note that the pipeline was disabled 6 months ago, hence my comment "in
>> the near term":
>>
>> https://github.com/wikimedia/operations-puppet/commit/f85b1dbcd61bbb58684ff93704c1804e808a5d6e
>>
>> On Wed, Jan 7, 2015 at 9:39 AM, Toby Negrin <tneg...@wikimedia.org>
>> wrote:
>>
>>> I'd also like us to consider routing this dataset to Hadoop. I believe
>>> there is already an EL-Kafka pipeline and this would make it easy to
>>> integrate page views with our regular processing.
>>>
>>> Gilles -- are mobile page views included in your stream?
>>>
>>> -Toby
>>>
>>> On Wed, Jan 7, 2015 at 9:27 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>>
>>>> > Great, then I guess it's a matter of only making the data go to files
>>>> > and not to DB for the particular schema we'll create. Does that sound like
>>>> > something feasible? How much work would be required to set it up?
>>>>
>>>> I do not think this is feasible in the near term without changes on our
>>>> end. I am also not sure it is really needed. You are concerned about sending
>>>> stuff to the DB due to "volume", correct? I do not understand why logging every
>>>> single data point would be needed. Maybe you can explain that in a bit
>>>> more detail for us to grasp the use case?
>>>>
>>>> If it is a matter of identifying distinct requests, that can be done
>>>> by sampling your dataset, provided it is large enough. We can help with
>>>> that, and Leila just put together some docs in this regard. They are
>>>> written for Hive queries, but the principles can apply elsewhere:
>>>> https://wikitech.wikimedia.org/wiki/Analytics/Cluster/Hive/Counting_uniques
>>>>
>>>> On Wed, Jan 7, 2015 at 6:42 AM, Gilles Dubuc <gil...@wikimedia.org>
>>>> wrote:
>>>>
>>>>>> Right -- couldn't we just tag the URL?
>>>>>
>>>>> The event of the user actually viewing the image is completely
>>>>> disconnected from the URL hit in Media Viewer, which is why we need EL and
>>>>> can't rely on existing server logs.
>>>>>
>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>
>>>>> Great, then I guess it's a matter of only making the data go to files
>>>>> and not to DB for the particular schema we'll create. Does that sound like
>>>>> something feasible? How much work would be required to set it up?
>>>>>
>>>>> On Tue, Jan 6, 2015 at 7:45 PM, Andrew Otto <ao...@wikimedia.org>
>>>>> wrote:
>>>>>
>>>>>> Eventlogging data currently does go to files, as well as to the DB.
>>>>>> Check it out on stat1003 at /srv/eventlogging/archive.
>>>>>>
>>>>>> If you need something with higher throughput than eventlogging itself
>>>>>> supports…then let’s talk :D
>>>>>>
>>>>>> -Ao
>>>>>>
>>>>>> On Jan 6, 2015, at 13:28, Erik Zachte <ezac...@wikimedia.org> wrote:
>>>>>>
>>>>>> You mean attach an X-analytics parameter, for extra images beyond the
>>>>>> one the user initially requested.
>>>>>>
>>>>>> But then we would undercount, basically missing all image views from
>>>>>> clicking right arrow in image viewer.
>>>>>> I'm not sure how much we would miss then.
>>>>>> iirc Gilles said this browsing feature was used quite a lot, but I'm
>>>>>> not sure.
>>>>>>
>>>>>> *From:* analytics-boun...@lists.wikimedia.org
>>>>>> [mailto:analytics-boun...@lists.wikimedia.org] *On Behalf Of* Toby Negrin
>>>>>> *Sent:* Tuesday, January 06, 2015 19:16
>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>>>>> has an interest in Wikipedia and analytics.
>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>> instead of the DB
>>>>>>
>>>>>> Right -- couldn't we just tag the URL?
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 10:10 AM, Erik Zachte <ezac...@wikimedia.org>
>>>>>> wrote:
>>>>>>
>>>>>> Just to clarify, this is about prefetched images which have not been
>>>>>> shown to the public.
>>>>>>
>>>>>> They were sent to the browser ahead of a possible request to speed
>>>>>> things up but in many cases never actually requested.
>>>>>>
>>>>>> https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts#Prefetched_images
>>>>>>
>>>>>> - Erik
>>>>>>
>>>>>> *From:* analytics-boun...@lists.wikimedia.org
>>>>>> [mailto:analytics-boun...@lists.wikimedia.org] *On Behalf Of* Toby Negrin
>>>>>> *Sent:* Tuesday, January 06, 2015 18:49
>>>>>> *To:* A mailing list for the Analytics Team at WMF and everybody who
>>>>>> has an interest in Wikipedia and analytics.
>>>>>> *Subject:* Re: [Analytics] Making EventLogging output to a log file
>>>>>> instead of the DB
>>>>>>
>>>>>> Hi Gilles -- why won't the page view logs work by themselves for this
>>>>>> purpose? EL can be configured to write into Hadoop which is probably the
>>>>>> best way to get the throughput you need but it seems overcomplicated.
>>>>>>
>>>>>> -Toby
>>>>>>
>>>>>> On Tue, Jan 6, 2015 at 9:41 AM, Gilles Dubuc <gil...@wikimedia.org>
>>>>>> wrote:
>>>>>>
>>>>>> This depends on [1] so we're not going to need that immediately, but
>>>>>> in order to help Erik Zachte with his RfC [2] to track unique media views
>>>>>> in Media Viewer, I'm going to need to use something almost exactly like
>>>>>> EventLogging. The main difference being that it should skip writing to
>>>>>> the database and write to a log file instead.
>>>>>>
>>>>>> That's because we'll be recording around 20-25M image views per day,
>>>>>> which would needlessly overload EventLogging for little purpose since the
>>>>>> data will be used for offline stats generation and doesn't need to be
>>>>>> made available in a relational database. Of course if storage space and
>>>>>> EventLogging capacity were no object, we could just use EL and keep the
>>>>>> ever-growing table forever, but I have the impression that we want to be
>>>>>> reasonable here and only write to a log, since that's what Erik needs.
>>>>>>
>>>>>> So here's the question: for a specific schema, can EventLogging work
>>>>>> the way it does but only record hits to a log file (maybe it already does
>>>>>> that before hitting the DB?) and not write to the DB? If not, how
>>>>>> difficult would it be to make EL capable of doing that?
>>>>>>
>>>>>> [1] https://phabricator.wikimedia.org/T44815
>>>>>> [2] https://www.mediawiki.org/wiki/Requests_for_comment/Media_file_request_counts

_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics