As part of work we are doing with the Sync ping, it was brought to our attention that the "main ping" sees a number of duplicate pings - e.g., bug 1348008, bug 1342111, etc.

We started to wonder whether the Sync ping might have the same issue. That would be very relevant to some of the analysis we do, which (attempts to) track certain events across multiple Firefox devices (and therefore across multiple pings), where duplicate (or missing) pings would skew our results.

Some of the bugs above have .ipynb notebooks which perform some analysis of dupes, but I admit much of their content sailed high over my head, so I didn't feel I could reproduce them for the sync ping. So instead, I knocked up a naive script to try to find duplicate meta/documentId or id fields across all the sync pings. The scripts are below, and they indicate that not a single sync ping is duplicated - but that almost sounds too good to be true :)

So my questions are:

* Is my analysis below valid? i.e., is it really telling me that zero sync pings are duplicated?

* If not, what is the likelihood of such duplicates, how can we get a sense of their volume, and what other actions should we take?

Thanks,

Mark.

Analysis: in a notebook I ran:

"""
from moztelemetry import get_pings
s = get_pings(sc, doc_type='sync', fraction=1.0) \
    .map(lambda p: (p["meta"]["documentId"], 0)) \
    .reduceByKey(lambda a,b: a+b)

print s.count()
print s.filter(lambda a: a[1] > 0).count()
"""

which printed:
2296974760
0

I re-ran it, replacing |p["meta"]["documentId"]| with |p["id"]|, which printed:
2299252270
0

(the larger number of total pings is probably accounted for by the time between running the 2 scripts)
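
One thing worth checking in the script above: each ID is paired with 0, so every reduced sum is 0 and the `> 0` filter can never match, whether or not duplicates exist. Also, as far as I can tell, count() after reduceByKey reports distinct IDs rather than total pings. Below is a minimal sketch of the counting variant in plain Python, so it can be tried without a cluster. The helper name find_duplicate_ids and the sample pings are made up for illustration; only the p["meta"]["documentId"] and p["id"] fields come from the script above.

```python
from collections import Counter


def find_duplicate_ids(pings, key=lambda p: p["meta"]["documentId"]):
    """Return {id: count} for every ID that appears more than once.

    Hypothetical helper: it counts 1 per ping (not 0) and keeps only
    the IDs whose total is greater than 1.
    """
    counts = Counter(key(p) for p in pings)
    return {ping_id: n for ping_id, n in counts.items() if n > 1}


# Made-up sample pings, shaped like the fields used in the script above.
sample_pings = [
    {"id": "a", "meta": {"documentId": "doc-1"}},
    {"id": "b", "meta": {"documentId": "doc-2"}},
    {"id": "a", "meta": {"documentId": "doc-1"}},  # deliberate duplicate
]

duplicates = find_duplicate_ids(sample_pings)
print(len(sample_pings), "pings,", len(duplicates), "duplicated IDs")
print(duplicates)
```

In Spark terms this would presumably correspond to mapping each ping to (id, 1) instead of (id, 0) and filtering on a count greater than 1. I have not re-run that against the real data, so the numbers above should not be read as confirming or ruling out duplicates.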