Re: [Analytics] purging old data from eventlogging db

Nuria Ruiz Wed, 21 May 2014 02:27:52 -0700

> Not to hijack the thread, but: to do this in the schema itself confuses the
> structure of the data with the mechanics of its use. I think having a couple
> of helpers in JavaScript and PHP for simple random sampling is sufficient.
I very much agree with Ori here. We would be bloating the schema with
properties that have nothing to do with data definition.
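
To make the "couple of helpers" idea concrete, here is a rough sketch of
what simple random sampling outside the schema could look like. It is
written in Python only for brevity (the helpers discussed here would live
in JavaScript and PHP), and the names in_sample, log_event_sampled and the
log_event callable are made up for illustration; none of them is an
existing EventLogging function.

```python
import random


def in_sample(rate, rng=random):
    """Return True for roughly 1 in `rate` calls (simple random sampling)."""
    if rate < 1:
        raise ValueError("rate must be >= 1")
    # rng.random() is uniform on [0, 1), so rate=1 always samples.
    return rng.random() < 1.0 / rate


def log_event_sampled(schema, event, rate, log_event, rng=random):
    """Call `log_event(schema, event)` only when this call is in the sample.

    `log_event` is a hypothetical stand-in for whatever function actually
    sends the event. Returns True if the event was logged.
    """
    if in_sample(rate, rng):
        log_event(schema, event)
        return True
    return False
```

The sampling rate stays at the call site, so the schema keeps describing
only the data.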


> Note that – per our data retention guidelines [1] – not all EL data is
> expected to be automatically purged within 90 days (see the section on
> “Non-personal information associated with a user account”)
I certainly think we should keep performance data (like navigation
timing) for longer than 90 days, removing pageId and userId if needed.
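
As a sketch of what "removing pageId and userId" could mean in practice:
the snippet below keeps an event past 90 days but drops those two fields
once it is older than that. The flat dict layout and the "timestamp" key
are assumptions made for the example, not the actual EL table layout.

```python
from datetime import datetime, timedelta

# Fields named in this thread; the real list would need to be agreed on.
SENSITIVE_FIELDS = ("pageId", "userId")
RETENTION = timedelta(days=90)


def sanitize_if_old(event, now, sensitive=SENSITIVE_FIELDS, retention=RETENTION):
    """Return a copy of `event`, without sensitive fields if it is too old.

    `event` is assumed to be a flat dict with a datetime under "timestamp".
    """
    if now - event["timestamp"] <= retention:
        return dict(event)  # still inside the window: keep everything
    return {k: v for k, v in event.items() if k not in sensitive}
```

Timing fields survive untouched, so long-term performance trends remain
queryable.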



On Wed, May 21, 2014 at 9:03 AM, Ori Livneh <o...@wikimedia.org> wrote:
>
>
>
> On Tue, May 20, 2014 at 10:36 PM, Dario Taraborelli
> <dtarabore...@wikimedia.org> wrote:
>>
>> On May 20, 2014, at 10:09 PM, Sean Pringle <sprin...@wikimedia.org> wrote:
>>
>> Hi!
>>
>> I'd like to hear from stakeholders about purging old data from the
>> eventlogging database. Yes, no, why [not], etc.
>>
>> I understand from Ori that there is a 90 day retention policy, and that
>> purging has been discussed previously but not addressed for various reasons.
>> Certainly there are many timestamps older than 90 days still in the db, and
>> apparently largely untouched by queries?
>>
>> Perhaps we're in a better position now to do this properly what with data
>> now in multiple places: log files, database, hadoop...
>>
>> Can we please purge stuff? :-)
>>
>> BR
>> Sean
>>
>>
>> Hi Sean,
>>
>> I sent a similar proposal to the internal list for preliminary feedback
>> (see item 2 below)
>>
>> All, I wanted to hear your thoughts informally (before posting to the
>> lists) on two ideas that have been floating around recently:
>>
>> 1) add support for optional sampling in EventLogging via JSON schemas
>> (given the sheer number of teams who have asked for it). See
>> https://bugzilla.wikimedia.org/show_bug.cgi?id=65500
>
> Not to hijack the thread, but: to do this in the schema itself confuses the
> structure of the data with the mechanics of its use. I think having a couple
> of helpers in JavaScript and PHP for simple random sampling is sufficient.
>>
>>
>> 2) introduce 90-day pruning by default for all logs, (adding a dedicated
>> schema element to override the default).
>
> Same problem. To illustrate: suppose we're two months into a data collection
> job. The researcher carelessly forgot to modify the pruning policy, so it's
> set to the default 90 days, whereas the researcher needs it for 180. At this
> point our options are:
>
> 1) Decline to help, even though there's a full month before the pruning
> kicks in.
> 2) Somehow alter the schema revision without creating a new revision.
> EventLogging assumes that schema revisions are immutable and it exploits
> this property to provide guarantees about data validity and consistency, so
> this is a nonstarter.
> 3) Create a new schema revision that declares a 180 day expiration and then
> populate its table with a copy of each event logged under the previous
> schema.
>
> The motivation behind your proposal is (I think) a desire to have a unified
> configuration interface for data collection jobs. This makes total sense and
> it's worth pursuing. I just don't think we should stuff everything into the
> schema. The schema is just that: a schema. It's a data model.
>
>
>>
>> This would push to the customers the responsibility of ensuring the right
>> data is collected and retained.
>>
>> I understand 2) has already been partly implemented for the raw JSON logs
>> (not yet for EL data stored in SQL). Obviously, we would need to audit
>> existing logs to make sure that we don’t discard data that needs to be
>> retained in a sanitized or aggregate form past 90 days.
>>
>>
>> Note that – per our data retention guidelines [1] – not all EL data is
>> expected to be automatically purged within 90 days (see the section on
>> “Non-personal information associated with a user account”): many of these
>> logs have a status similar to MediaWiki data that is retained in the DB but
>> not fully exposed to labs.
>
>
>>
>> For this reason, I am proposing that we enable 90-day pruning by default
>> for new schemas, with the ability to override the default.
>
>
> Sounds good to me. I figure that the overrides would be specified as
> configuration values for the script that does the actual pruning. We could
> Puppetize that and document the process for adding exemptions.
>
>>
>> Existing schemas would need to be audited on a case by case basis.
>
>
> By whom? :) Surely not Sean! It'd be great to get this process going.
>
>
> _______________________________________________
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
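
For what it's worth, the "overrides as configuration values for the
pruning script" idea quoted above could be as small as the sketch below.
The schema names and the RETENTION_OVERRIDES table are invented for
illustration; the point is only that moving a job from 90 to 180 days
becomes a config change rather than a new schema revision.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION_DAYS = 90

# Hypothetical exemptions keyed by schema name. None means "never prune".
RETENTION_OVERRIDES = {
    "NavigationTiming": None,
    "SomeResearchSchema": 180,
}


def prune_cutoff(schema, now, overrides=RETENTION_OVERRIDES,
                 default_days=DEFAULT_RETENTION_DAYS):
    """Return the timestamp before which events of `schema` may be deleted.

    Returns None when the schema is exempt from pruning altogether.
    """
    days = overrides.get(schema, default_days)
    if days is None:
        return None
    return now - timedelta(days=days)
```

A pruning job would then delete rows older than the returned cutoff for
each table and skip any schema for which it gets None.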

