On Tue, Apr 9, 2013 at 9:58 AM, Justin Cappos <jcap...@poly.edu> wrote:
> FYI: For anyone who wants the executive summary, we think the TUF metadata
> will be under 1MB and even with very broad / rapid adoption of TUF in the
> next year or two will stay <3MB or so.

Is that after compression? Or did Trishank miscount the number of digits
for the initial email?

Cheers,
Nick.

>
> Note that this cost is only paid upon the initial run of the client tool.
> Everything after that just downloads diffs (or at least will once we fix an
> open ticket).
>
> Thanks,
> Justin
>
>
>
> On Mon, Apr 8, 2013 at 2:41 PM, Trishank Karthik Kuppusamy
> <t...@students.poly.edu> wrote:
>>
>> Hello everyone,
>>
>> I have been testing and refining the pypi.updateframework.com automation
>> over the past week, and looking at how much TUF metadata is generated for
>> PyPI.
>>
>> In this email, I am going to focus only on the PyPI data under /simple;
>> let us call that "simple data".
>>
>> Now, if we assume that every developer will have her own key to sign the
>> simple data for her package, then this is what the TUF metadata could look
>> like:
>>
>> metadata/targets.txt
>> ====================
>> Delegation from the targets role to the targets/simple role, with the
>> former role being responsible for no target data because it has none of
>> its own.
>>
>> metadata/targets/simple.txt
>> ===========================
>> Delegation from targets/simple to the targets/simple/packageI role, with
>> the former role being responsible for one target datum: simple/index.html.
>>
>> metadata/targets/simple/packageI.txt
>> ====================================
>> The targets/simple/packageI role is responsible only for the simple data
>> at simple/packageI/index.html.
>>
>> In this upper bound case, where every developer is responsible for signing
>> her own package, one can estimate the metadata size like so:
>>
>> - metadata/targets.txt is, at most, a few KB, and can be safely ignored.
>> - metadata/targets/simple/packageI.txt is about 1KB.
>> - metadata/targets/simple.txt is about the sum of all
>>   metadata/targets/simple/packageI.txt files. (This is a very rough
>>   estimate!)
>>
>> Therefore, if we have 30,000 developer packages on PyPI (roughly the
>> current number of packages), then we would have about 29 MB of
>> metadata/targets/simple/packageI.txt, and another 29 MB of
>> metadata/targets/simple.txt, for a rough total of 58MB. If PyPI has 45GB of
>> total data (roughly what I saw from my last mirror), then the simple
>> metadata is about 0.13% of total data size.
>>
>> This may seem like a lot of metadata, but let us remember a few important
>> things:
>>
>> - So far, the metadata is simply uncompressed JSON. We are considering
>>   metadata compression or difference schemes.
>> - This assumes the upper bound case, where every package developer is
>>   responsible for her own package, which means that we have to talk about
>>   a lot of keys (random data).
>> - This is a one-time initial download cost. An update to PyPI is unlikely
>>   to change all the simple data; therefore, updates to the simple metadata
>>   will be cheap, because a TUF client would only download updated
>>   metadata. We could amortize the initial simple metadata download cost by
>>   distributing it with PyPI installers (e.g. pip).
>>
>> Could we do better? Yes!
>>
>> As Nick Coghlan has suggested, PyPI could begin adopting TUF by signing
>> for all of the developer packages itself. This means that we could reuse a
>> key for multiple developer packages instead of dedicating a key per package.
>> The tradeoff here is that if one such "shared key" is compromised, then
>> multiple packages (but not all of them) could be compromised.
>>
>> In this case, where we use a shared key to sign up to, say, 1,000
>> developer packages, we would have the following simple metadata size.
>> First, let us define some terms:
>>
>> NP = # of developer packages
>> NPK = # of developer packages signed by a key
>> NR = # of roles (each responsible for NPK packages) = math.ceil(NP/NPK)
>> K = average key metadata size
>> D = average delegated role metadata size given one target path
>> P = average target path length
>> T = average simple target (index.html) metadata size
>>
>> metadata/targets/simple.txt
>> ===========================
>> Most of the metadata here deals with all of the keys, and the roles, used
>> to sign simple data. Therefore, the size of the keys and roles metadata will
>> dominate this file.
>>
>> key metadata size = NR*K
>> role metadata size = NR*(D+NPK*P)
>>
>> Takeaway: the lower the NPK (the number of developer packages signed by a
>> key), the higher the NR, and the larger the metadata. We would save
>> metadata by setting NPK to, say, 1,000, because then one key could describe
>> 1,000 packages.
>>
>> metadata/targets/simple/roleI.txt
>> =================================
>> When NPK=1, this file would be equivalent to
>> metadata/targets/simple/packageI.txt.
>>
>> It is a small metadata file if we assume that it only talks about the
>> simple data (index.html) for one package. Most of the metadata talks about
>> key signatures, and target metadata. If we increase NPK, then clearly the
>> target metadata would increase in size:
>>
>> target metadata size = NPK*T < NPK*1KB
>>
>> Takeaway: the target metadata would increase in size, but it certainly
>> will not increase as much as it would have if we had signed each developer
>> package with a separate key.
>>
>> Finally, the question is how the savings in metadata/targets/simple.txt
>> would compare to the "growth" of the metadata/targets/simple/roleI.txt
>> files. Ultimately, the higher the NPK (and thus the lower the NR), the
>> less we would be talking about keys (random data).
>> Everything else would remain the same, because there would still be the
>> same number of targets, and thus the same amount of target metadata. So,
>> we would have net savings.
>>
>> I hope this clears up some questions about metadata size. If there was
>> something confusing because I did not explain it well enough or I got
>> something wrong, please be sure to let me know. My machine is nearly done
>> generating all the simple metadata, so we can make better estimates then.
>>
>> -Trishank
>>
>
>
> _______________________________________________
> Distutils-SIG maillist - Distutils-SIG@python.org
> http://mail.python.org/mailman/listinfo/distutils-sig
>

--
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
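
For readers who want to rerun the numbers in Trishank's message, here is a minimal sketch in Python. It uses only the figures and formulas stated there: 30,000 packages, about 1KB per package file, 45GB of total data, NR = math.ceil(NP/NPK), key size = NR*K and role size = NR*(D+NPK*P). The function names and the byte values for K, D and P are hypothetical placeholders chosen for illustration, not measurements from the thread.

```python
import math

KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def upper_bound_bytes(num_packages, per_package_bytes=1 * KB):
    """Upper-bound case from the email: one ~1KB role file per package,
    plus simple.txt, roughly estimated as the sum of those files."""
    per_package_total = num_packages * per_package_bytes
    simple_txt = per_package_total  # "very rough estimate" in the email
    return per_package_total + simple_txt


def simple_txt_sizes(np_, npk, k, d, p):
    """Key and role metadata sizes in simple.txt, per the email's formulas."""
    nr = math.ceil(np_ / npk)        # NR = number of roles
    key_size = nr * k                # key metadata size = NR*K
    role_size = nr * (d + npk * p)   # role metadata size = NR*(D+NPK*P)
    return nr, key_size, role_size


total = upper_bound_bytes(30_000)
print(f"upper bound: {total / MB:.1f} MB")
print(f"share of 45GB: {100 * total / (45 * GB):.2f}%")

# K, D, P below are made-up placeholder byte counts, NOT measured values.
K, D, P = 400, 100, 30
for npk in (1, 1_000):
    nr, key_size, role_size = simple_txt_sizes(30_000, npk, K, D, P)
    print(f"NPK={npk}: NR={nr}, keys={key_size} B, roles={role_size} B")
```

With the email's figures the upper bound comes out near 58MB and about 0.13% of 45GB, matching the estimate in the message. The NPK comparison only shows the direction of the effect (fewer roles, fewer keys); the absolute sizes depend entirely on the placeholder values.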