I am kicking off this thread after a good conversation with Nuria and Kaldari
on pain points and opportunities we have around data QA for EventLogging.
Kaldari, Leila and I have gone through several rounds of data QA before and
after the deployment of new features on Mobile, and we haven’t yet found a good
solution to catch data quality issues early enough in the deployment cycle.
Data quality issues with EventLogging typically fall under one of these 5
scenarios:
1) events are logged and schema-compliant but don’t capture data correctly (for
example: a wrong value is logged; event counts that should match don’t)
2) events are logged but are not schema-compliant (e.g.: a required field is
missing)
3) events are missing due to issues with the instrumentation (e.g.: a UI
element is not instrumented)
4) events are missing due to client issues (a specific UI element is not
correctly rendered on a given browser/platform and as a result the event is not
fired)
5) events are missing due to EventLogging outages
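To make scenario 2 concrete, here is a minimal sketch of a schema-compliance
check in Python. The schema layout, the field names and the validation_errors
helper are all made up for illustration; this is not how EventLogging validates
events, just the kind of check I have in mind.

```python
# Hypothetical schema for illustration only:
# field name -> (expected Python type, is the field required?)
SCHEMA = {
    "action": (str, True),
    "pageId": (int, True),
    "userAgent": (str, False),
}


def validation_errors(event, schema=SCHEMA):
    """Return a list of problems found in `event`; an empty list means compliant."""
    errors = []
    for field, (expected_type, required) in schema.items():
        if field not in event:
            # Only required fields count as a violation when absent.
            if required:
                errors.append("missing required field: " + field)
            continue
        if not isinstance(event[field], expected_type):
            errors.append("wrong type for field: " + field)
    # Fields the schema does not know about are reported as well.
    for field in event:
        if field not in schema:
            errors.append("unexpected field: " + field)
    return errors
```

A check like this only covers scenario 2. Scenario 1 (a compliant event carrying
a wrong value) needs assertions on the values themselves, and scenarios 3 to 5
need some expectation of which events should have arrived.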
In the early days, Ori and I floated the idea of unit tests for instrumentation
to capture constraint violations that are not easily detected via manual
testing or the existing client-side validation, but this never happened. When
it comes to feature deployments, beta labs is a great starting point for
running manual data QA in an environment that is as close as possible to prod.
However, there are types of data quality issues that we only discover when
collecting data at scale and in the wild (on browsers/platforms that we don’t
necessarily test for internally).
Having a full-fledged set of unit tests for data would be terrific, but in the
short term I’d like to find a better way to at least identify events that fail
validation as early as possible.
- the SQL log database has real-time data but only for events that pass
client-side validation
- the JSON logfiles on stat1003 include invalid events, but the data is only
rsync’ed from vanadium once a day
Is there a way to inspect invalid events in near real time without having
access to vanadium? For example, could we create either a dedicated database to
write invalid events only or a logfile for validation errors rsync’ed to
stat1003 more frequently than once a day?
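As a rough sketch of what a validation-error logfile could be generated from,
the Python below scans log lines and reports the ones that fail a basic check.
It assumes one JSON object per line, which may not match the actual format of
the logfiles, and the find_invalid name and the required-field list are
hypothetical.

```python
import json


def find_invalid(lines, required_fields):
    """Yield (line_number, reason) for every line that fails the check.

    Assumes one JSON-encoded event per line; blank lines are skipped.
    """
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            # json.JSONDecodeError is a subclass of ValueError.
            yield number, "not valid JSON"
            continue
        if not isinstance(event, dict):
            yield number, "not a JSON object"
            continue
        missing = sorted(f for f in required_fields if f not in event)
        if missing:
            yield number, "missing: " + ", ".join(missing)
```

Because it accepts any iterable of lines, the same function could be pointed at
an open file or at a stream that is being tailed, which is closer to the near
real-time case.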
Thoughts?
Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics