As part of work we are doing with the Sync ping, it was brought to our
attention that the "main ping" sees a number of duplicate pings - e.g.,
bug 1348008, bug 1342111, etc.
We started to wonder if the Sync ping might have the same issue. That
would be very relevant to some of our analysis, which (attempts to)
track certain events across multiple Firefox devices (and therefore
across multiple pings), where duplicate (or missing) pings would skew
our results.
Some of the bugs above have .ipynb notebooks which perform some analysis
of dupes, but I admit much of their content sailed high over my head, so
I didn't feel I could reproduce them for the sync ping. So instead, I
knocked up a naive script to try and find duplicate meta/documentId or
id fields in all the sync pings. The script and its output are below,
and they indicate there isn't a single duplicated sync ping - but that
almost sounds too good to be true :)
So my questions are:
* Is my analysis below valid? i.e., is it really telling me there are
zero duplicated sync pings?
* If not, what is the likelihood of such duplicates, how can we get a
sense of their volume, and what other actions should we take?
Thanks,
Mark.
Analysis: in a notebook I ran:
"""
from moztelemetry import get_pings
s = get_pings(sc, doc_type='sync', fraction=1.0) \
.map(lambda p: (p["meta"]["documentId"], 0)) \
.reduceByKey(lambda a,b: a+b)
print s.count()
print s.filter(lambda a: a[1] > 0).count()
"""
which printed:
2296974760
0
I re-ran it, replacing |p["meta"]["documentId"]| with |p["id"]|, which
printed:
2299252270
0
(the larger number of total pings is probably accounted for by the time
between running the 2 scripts)
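A note on the snippet above: every key is tagged with 0, so the summed
value is 0 for every key and the "> 0" filter cannot match, whatever the
data contains. Also, s.count() after reduceByKey is the number of
distinct IDs rather than the total number of pings. A minimal local
sketch of the count-per-key pattern, in plain Python rather than Spark,
is below. It assumes each ID is tagged with 1 and that an ID counts as
duplicated when its total exceeds 1; the function name duplicate_summary
and the sample IDs are hypothetical and it has not been run against real
pings.

```python
from collections import Counter


def duplicate_summary(doc_ids):
    """Return (total, distinct, duplicated) for an iterable of ID strings.

    Counter plays the role of map(id -> (id, 1)) followed by
    reduceByKey(lambda a, b: a + b) in the Spark snippet.
    """
    counts = Counter(doc_ids)
    total = sum(counts.values())   # every ping seen, copies included
    distinct = len(counts)         # unique IDs
    # Keep only IDs that were seen more than once.
    duplicated = {doc_id: n for doc_id, n in counts.items() if n > 1}
    return total, distinct, duplicated


# Hypothetical sample IDs, not real ping data.
sample = ["a1", "b2", "a1", "c3", "b2", "a1"]
total, distinct, duplicated = duplicate_summary(sample)

# Pings beyond the first copy of each ID give a rough sense of volume.
extra = total - distinct
print(total, distinct, duplicated, extra)
```

In the Spark version the same idea would mean mapping to (id, 1),
filtering on a[1] > 1, and comparing the ping count before reduceByKey
with the key count after it.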
_______________________________________________
Sync-dev mailing list
https://mail.mozilla.org/listinfo/sync-dev