I am kicking off this thread after a good conversation with Nuria and Kaldari 
on pain points and opportunities we have around data QA for EventLogging.

Kaldari, Leila and I have gone through several rounds of data QA before and 
after the deployment of new features on Mobile, and we have not yet found a 
good solution to catch data quality issues early enough in the deployment cycle. 
Data quality issues with EventLogging typically fall under one of these 5 
scenarios:

1) events are logged and schema-compliant but don’t capture data correctly 
(e.g. a wrong value is logged, or event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g. a required field is 
missing)
3) events are missing due to issues with the instrumentation (e.g. a UI 
element is not instrumented)
4) events are missing due to client issues (e.g. a specific UI element is not 
correctly rendered on a given browser/platform and as a result the event is not 
fired)
5) events are missing due to EventLogging outages
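
To make scenario 2 concrete, here is a minimal sketch in Python of a 
schema-compliance check. The schema, the field names and the find_violations 
helper are all hypothetical and only for illustration; this is not 
EventLogging's actual validation code.

```python
# Hypothetical schema for illustration: field name -> (expected type, required?)
EXAMPLE_SCHEMA = {
    "action": (str, True),
    "userId": (int, True),
    "pageTitle": (str, False),
}


def find_violations(event, schema):
    """Return a list of human-readable schema violations for one event."""
    violations = []
    for field, (expected_type, required) in schema.items():
        if field not in event:
            if required:
                violations.append("missing required field: " + field)
            continue
        if not isinstance(event[field], expected_type):
            violations.append("wrong type for field: " + field)
    # Fields the schema does not know about are reported as well.
    for field in event:
        if field not in schema:
            violations.append("unexpected field: " + field)
    return violations
```

An event with an empty result is schema-compliant. Note that this says nothing 
about scenario 1: a compliant event can still carry a wrong value.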

In the early days, Ori and I floated the idea of unit tests for instrumentation 
to capture constraint violations that are not easily detected via manual 
testing or the existing client-side validation, but this never happened. When 
it comes to feature deployments, beta labs is a great starting point for 
running manual data QA in an environment that is as close as possible to prod. 
However, there are types of data quality issues that we only discover when 
collecting data at scale and in the wild (on browsers/platforms that we don’t 
necessarily test for internally).
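
As a rough illustration of what a unit test for instrumentation could look 
like, the sketch below checks one constraint of the "event counts that should 
match" kind. The event shape, the action names and the rule that clicks can 
never outnumber impressions are assumptions made up for the example, not 
something taken from an existing schema.

```python
import unittest
from collections import Counter


def count_actions(events):
    """Count captured events by their (hypothetical) 'action' field."""
    return Counter(event["action"] for event in events)


class FunnelCountsTest(unittest.TestCase):
    # Hypothetical events captured while exercising the feature in a test run.
    EVENTS = [
        {"action": "impression", "sessionId": "s1"},
        {"action": "click", "sessionId": "s1"},
        {"action": "impression", "sessionId": "s2"},
    ]

    def test_clicks_never_exceed_impressions(self):
        counts = count_actions(self.EVENTS)
        # Assumed constraint: a click can only happen after an impression.
        self.assertLessEqual(counts["click"], counts["impression"])
```

A test like this targets scenario 1, which schema validation alone cannot 
catch because every individual event is well-formed.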

Having a full-fledged set of unit tests for data would be terrific, but in the 
short term I’d like to find a better way to at least identify events that fail 
validation as early as possible.

- the SQL log database has real-time data but only for events that pass 
client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only 
rsync’ed from vanadium once a day

Is there a way to inspect invalid events in near real time without having 
access to vanadium? For example, could we create either a dedicated database 
that stores invalid events only, or a logfile for validation errors that is 
rsync’ed to stat1003 more frequently than once a day?
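
To sketch the second option: the function below routes raw JSON lines to 
either a "valid" or an "invalid" output, so that the invalid stream can be 
shipped on its own schedule. It is a minimal sketch under assumptions; 
split_events and the is_valid callback are hypothetical names, and I am not 
claiming this is how the EventLogging pipeline is structured.

```python
import json


def split_events(lines, is_valid, valid_out, invalid_out):
    """Route each raw JSON line to one of two outputs; return both counts.

    is_valid is any callable that takes the parsed event and returns a bool.
    """
    n_valid = n_invalid = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            ok = is_valid(json.loads(line))
        except ValueError:
            ok = False  # lines that are not valid JSON count as invalid too
        if ok:
            valid_out.write(line + "\n")
            n_valid += 1
        else:
            invalid_out.write(line + "\n")
            n_invalid += 1
    return n_valid, n_invalid
```

Both outputs only need a write() method, so they can be open files or 
in-memory buffers. Keeping the invalid events in a separate, much smaller file 
is what would make a more frequent rsync cheap.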

Thoughts?

Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
