2017-06-06 2:10 GMT+02:00 Mark Hammond <[email protected]>:

> As part of work we are doing with the Sync ping, it was brought to our
> attention that the "main ping" sees a number of duplicate pings - eg, bug
> 1348008, bug 1342111, etc.
>
> We started to wonder if the Sync ping might have the same issue, which
> would become very relevant to some of the analysis we do, which (attempts
> to) track certain events across multiple Firefox devices (and therefore
> obviously across multiple pings), where duplicate (or missing) pings would
> skew our results.
>

Yes, every ping could have duplicates: this can be due to network connection problems, etc. However, the share of duplicate pings due to these conditions is generally low (e.g. ~1% for the main pings + pingsender, on Nightly
<https://github.com/mozilla/mozilla-reports/blob/master/projects/main_ping_delays_pingsender.kp/knowledge.md#deduplication>).

> Some of the bugs above have .ipynb scripts which performs some analysis of
> dupes, but I admit much of their content sailed high over my head, so I
> didn't feel I could reproduce them for the sync ping. So instead, I knocked
> up a naive script to try and find duplicate meta/documentID or id fields in
> all the sync pings - these scripts are below, and indicate there isn't a
> single sync ping duplicated - but that almost sounds too good to be true :)
>
> So my questions are:
>
> * Is my analysis below valid? ie, is it really telling me there are zero
> sync pings duplicated?
>
> * If not, what is the likelihood of such duplicates, how can we get a
> sense of their volume, and what other actions should we take?
>
> Thanks,
>
> Mark.
>
> Analysis: in a notebook I ran:
>
> """
> from moztelemetry import get_pings
> s = get_pings(sc, doc_type='sync', fraction=1.0) \
>     .map(lambda p: (p["meta"]["documentId"], 0)) \

This line will give you a tuple like (docId, 0) for each ping.

>     .reduceByKey(lambda a,b: a+b)

And this line sums the values of all the tuples that share the same docId. Please note that the second element is always set to 0. So, for example, given:

(docId_1, 0)
(docId_1, 0)

the reduction step sums the second elements, and you get 0 + 0, which is 0. That is why the filter for values > 0 never matches anything, no matter how many duplicates there are. Mapping to 1 instead of 0 and filtering for > 1 should give you what you're looking for.

>
> print s.count()
> print s.filter(lambda a: a[1] > 0).count()
> """
>
> which printed:
> 2296974760
> 0
>
> I re-ran it, replacing |p["meta"]["documentId"]| with |p["id"]|, which
> printed:
> 2299252270
> 0
>
> (the larger number of total pings is probably accounted for by the time
> between running the 2 scripts)
> _______________________________________________
> Fx-data-dev mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/fx-data-dev
>
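
To make the suggestion above concrete, here is a minimal sketch of the counting logic: map each document ID to 1, sum per ID, and keep only the IDs whose sum is greater than 1. It is written in plain Python 3 so it can be tried without a cluster. The helper name `duplicated_ids` and the sample pings are made up for illustration, and the Spark variant in the trailing comment is an untested adaptation of the quoted notebook code, not something I have run against the real data.

```python
from collections import Counter


def duplicated_ids(pings):
    """Return {documentId: count} for every ID seen more than once.

    Local stand-in for the Spark steps: map each ping to (docId, 1),
    sum the values per docId, then keep the keys whose sum is > 1.
    """
    counts = Counter(p["meta"]["documentId"] for p in pings)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical sample pings; real ones would come from get_pings(...).
sample = [
    {"meta": {"documentId": "docId_1"}},
    {"meta": {"documentId": "docId_1"}},
    {"meta": {"documentId": "docId_2"}},
]

dupes = duplicated_ids(sample)
print(len(dupes))  # number of distinct IDs that appear more than once
print(dupes)       # the duplicated IDs with their occurrence counts

# Roughly the same idea on the RDD from the quoted notebook (assumption,
# adapted from the original script with 0 replaced by 1):
#   s = get_pings(sc, doc_type='sync', fraction=1.0) \
#       .map(lambda p: (p["meta"]["documentId"], 1)) \
#       .reduceByKey(lambda a, b: a + b)
#   print(s.filter(lambda kv: kv[1] > 1).count())
```

Keeping the per-ID counts, rather than just a yes/no flag, also gives a sense of the volume: summing `n - 1` over the result tells you how many pings are surplus copies.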

