On 06/20/2018 10:13 AM, Lukáš Doktor wrote:
> Dne 19.6.2018 v 04:50 Cleber Rosa napsal(a):
>>
>>
>> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
>>> Hello guys,
>>>
>>> As Cleber pointed out on the release meeting, we are struggling with our
>>> asset fetcher. It was designed with one goal: to cache arbitrary files by
>>> name. It fails when different assets use the same name, as it simply
>>> assumes they are the same thing.
>>>
>>> Let's have a look at this example:
>>>
>>> fetch("https://www.example.org/foo.zip")
>>> fetch("https://www.example.org/bar.zip")
>>> fetch("http://www.example.org/foo.zip")
>>>
>>> Currently the third fetch simply uses the "foo.zip" downloaded from "https"
>>> even though it could be a completely different file (or downloaded from a
>>> completely different url). This is good and bad. It's good when you're
>>> downloading **any** "ltp.tar.bz2", or **any** "netperf.zip", but if you are
>>> downloading "vmlinuz" which is always called "vmlinuz" but comes from a
>>> different subdirectory, it might lead to big problems.
>>>
>>
>> Right, all your observations so far are correct.
>>
>>> From this I can see two modes of assets, anonymous and specific. Instead of
>>> trying to detect this based on combinations of hashes and methods, I'd
>>> suggest being explicit and either adding an extra argument, or even creating
>>> new classes `AnonymousAsset` and `SpecificAsset`, where `AnonymousAsset`
>>> would be the current implementation and we still need to decide on
>>> `SpecificAsset` implementation. Let's discuss some approaches and use
>>> following assets in examples:
>>
>> Given that `fetch_asset()` is the one interface test writers are
>> encouraged to use, having different classes is really an implementation
>> detail at this point. For this discussion, I do see value in naming the
>> different modes though.
>>
>>>
>>>
>>> Current implementation
>>> ----------------------
>>>
>>> The current implementation is anonymous, and the third fetch simply
>>> returns the "foo.zip" fetched by the first one.
>>>
>>> Result:
>>>
>>> foo.zip
>>> bar.zip # This one is fetched from "https"
>>>
>>>
>>> + simplest
>>> - leads to clashes
>>>
>>>
>>
>> No need to say that this is where we're moving away from.
>>
>>> Hashed url dir
>>> --------------
>>>
>>> I can see multiple options. Cleber proposed in
>>> https://github.com/avocado-framework/avocado/pull/2652 to create, in such
>>> cases, a dir based on "hash(url)" and store all assets of a given url
>>> there. It seems to be fairly simple to develop and maintain, but the cache
>>> might become hard to upkeep, and there is a non-zero (though nearly zero)
>>> possibility of clashes.
>>>
>>
>> Definitely my goals were to keep it simple, and to fix the one issue at
>> hand. My motivations are based on the idea that Avocado should be able
>> to help test writers with common tasks in tests, but it's hardly going
>> to contain in itself full-blown, complete solutions to generic
>> problems such as downloading and caching files.
>>
>> Now, it'd be nice to describe the possible clashes (besides, of course,
>> possible hash collisions). I can think of one:
>>
>> * Anonymous with url "http://foo/bar.zip"
>> * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
>>
>> The clash exists because sha1("http://foo/bar.zip") is
>> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
>>
>
> Yep, this is one possible clash. Another is to have different urls that use
> the same hash. As I mentioned, it's not frequent, but it can happen.
>
Please excuse me, but I don't really follow the "other" possible clash
you mention. Do you mean a hash collision on different URLs? Like a
hypothetical sha1("http://foo/bar") == sha1("http://something/else") ?
> Note we can decrease the probability by prefixing something, e.g. "url_$sha".
>
In the example I give later, I think we can just as easily move from
"decrease the probability" to *eliminate* the possibility by putting
"anonymous" and "specific" assets in different directories.
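To make that concrete, here is a minimal sketch (the cache path and
directory names are invented for illustration, not the actual
implementation) of how separate top-level directories make the clash
structurally impossible rather than just unlikely:

```python
import hashlib
import os

CACHE = "/home/foo/avocado/cache"  # illustrative cache location

# "Anonymous" asset: stored under a directory named sha1(url)
anon_url = "http://foo/bar.zip"
anon_dir = hashlib.sha1(anon_url.encode()).hexdigest()

# "Specific" asset whose file name happens to equal that very hash
specific_name = anon_dir

# Flat layout: both kinds share one namespace, so the entries collide
flat_anon = os.path.join(CACHE, anon_dir)
flat_specific = os.path.join(CACHE, specific_name)
assert flat_anon == flat_specific  # the clash described above

# Separate top-level directories: the paths can never coincide
split_anon = os.path.join(CACHE, "by_origin", anon_dir)
split_specific = os.path.join(CACHE, "by_name", specific_name)
assert split_anon != split_specific  # clash eliminated, not just unlikely
```

Prefixing only shrinks the overlap between the two namespaces; distinct
directories remove it entirely.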
>>> Another problem would be concurrent access, as we might start downloading a
>>> file with the same name as a url dir, plus all kinds of different clashes,
>>> and we'll only find out all the issues when people start extensively using this.
>>
>> I'm not sure I follow. Right now the names of the *downloaded* files are
>> temporary. Then a lock for the destination name is acquired,
>> verification is done, and if successful, copied into the destination name.
>>
>>>
>>> Result:
>>>
>>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip # Fetched from https
>>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>>> 6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip # Fetched from http
>>>
>>> + simple to develop
>>> + simple to maintain
>>
>> Agreed.
>>
>>> - possible clashes
>>
>> Agreed, ideally there should be none.
>>
>>> - hard to browse manually
>>
>> True. As a note, I'd trade the ease to browse the cache for a simpler
>> overall implementation, because the "browsing" could then be done with
>> simple scripts reusing the same implementation (like a contrib script
>> importing from "avocado.utils.asset").
>>
>
> Not really. You can't reproduce the url from the hash, so all you'd know is
> that you have these specific-downloaded files from somewhere, how old they
> are, and when they were last used. Better than nothing, but far from perfect.
>
But still we could reuse the same logic. For instance, a command such as:
$ avocado asset --is-in-cache=http://foo/bar/baz.iso
IN_CACHE: /home/foo/avocado/cache/$HASH/baz.iso
Could reuse "avocado.utils.asset" and give users information about the
cache. But please note that this is just an example: my mindset is to
fix a current bug in how the asset fetcher is (and can be) used.
Everything else is subject to change.
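Sketching that idea a bit further (everything here, the layout and the
helper names, is hypothetical, not existing Avocado code), such a contrib
script could share the url-to-path logic with the fetcher:

```python
import hashlib
import os

def cache_path(cache_dir, url):
    # Map a url to its expected place in a hash(url)-keyed cache layout
    url_hash = hashlib.sha1(url.encode()).hexdigest()
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join(cache_dir, url_hash, basename)

def is_in_cache(cache_dir, url):
    # Return the cached path if the asset is present, otherwise None
    path = cache_path(cache_dir, url)
    return path if os.path.exists(path) else None

if __name__ == "__main__":
    hit = is_in_cache("/home/foo/avocado/cache", "http://foo/bar/baz.iso")
    if hit:
        print("IN_CACHE: %s" % hit)
    else:
        print("NOT IN CACHE")
```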
>>> - API changes might lead to unusable files (users would have to remove
>>> files manually)
>>
>> This too, but I really think this is minor. We may want to include
>> cache stability along the same lines as the API stability (on LTS
>> releases), but I don't think we've made promises about it yet.
>>
>
> Sure, I know users can simply remove everything once in a while...
>
>>>
>>>
>>> sqlite
>>> ------
>>>
>>> Another approach would be to create an sqlite database in every cache-dir. For
>>> anonymous assets nothing would change, but for specific assets we'd create
>>> a new tmpdir per given asset and store the mapping in the database.
>>>
>>> Result:
>>>
>>> .avocado_cache.sqlite
>>> foo-1X3s/foo.zip
>>> bar-3s2a/bar.zip
>>> foo-t3d2/foo.zip
>>>
>>> where ".avocado_cache.sqlite" would contain:
>>>
>>> https://www.example.org/foo.zip foo-1X3s/foo.zip
>>> https://www.example.org/bar.zip bar-3s2a/bar.zip
>>> http://www.example.org/foo.zip foo-t3d2/foo.zip
>>>
>>> Obviously, by creating a db we could improve many things. One example
>>> would be to store an expiration date and, based on last access times in
>>> the db, run cache-dir upkeep, removing outdated assets.
>>>
>>> Another improvement would be to store the downloaded asset's hash and
>>> re-download & update the hash when the file is modified, even when the
>>> user didn't provide a hash.
>>>
>>> + easy to browse manually
>>> + should be simple to expand the features (upkeep, security, ...)
>>> + should simplify locking, as we can atomically move the downloaded
>>> file & update the db; even crashes should lead to predictable behavior
>>> - slightly more complicated to develop
>>> - "db" file would have to be protected
>>>
>>>
>>
>> I'm a bit skeptical about the "improvements" here, not because they are
>> not valid features, but because we don't need them at this point. I
>> don't think future possibilities should shape too much of the immediate
>> architecture decisions.
>>
>
> Well, if you remember the first implementation, I said using a DB would be
> useful in the future. The future is here; we can either patch it and somehow
> support a slightly insecure version, or finally do it properly.
>
In my opinion, your intentions are correct, but you're missing the point
that this "proper" implementation would most probably be a small project
in itself. If you are volunteering to write that, and can commit to
delivering it in a "few days' time", then we could halt any other
approach.
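For scale, a minimal sketch of what the url-to-path mapping could look
like with sqlite (table and column names are made up for illustration; a
real implementation would also need locking and upkeep):

```python
import sqlite3
import time

def open_db(path):
    # One mapping table: url -> local path, plus metadata for upkeep
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS assets (
                      url TEXT PRIMARY KEY,
                      local_path TEXT NOT NULL,
                      sha1 TEXT,
                      last_access REAL)""")
    return db

def record(db, url, local_path, sha1=None):
    # Same url always maps to one entry; re-recording updates it
    db.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
               (url, local_path, sha1, time.time()))
    db.commit()

def lookup(db, url):
    row = db.execute("SELECT local_path FROM assets WHERE url = ?",
                     (url,)).fetchone()
    return row[0] if row else None
```

With this, the two "foo.zip" urls from the example map to distinct local
paths, and `last_access` gives cache upkeep something to expire on.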
>>> Other solutions
>>> ---------------
>>>
>>> There are many other solutions like using `$basename-$url_hash` as the name
>>> or using `astring.string_to_safe_path` instead of url_hash and so on. We
>>> are open to suggestions.
>>>
>>>
>>
>> I think the one big problem we have identified is the clash between
>> "anonymous" and "specific" assets. My proposal to fix that would be to
>> define the following structure:
>>
>> base_writable_cache_dir
>> ├── by_name <- "specific" verified files
>> ├── by_origin <- "anonymous" verified files
>> └── download <- temporary base dir for downloads
>>
>
> This was actually something I kept for future discussion as a possible
> improvement (even with a DB). The second improvement was to use
> `astring.string_to_safe_path` along with the hash, which might improve
> url reconstruction.
>
Right, I have not forgotten your valid suggestions of using
`astring.string_to_safe_path`.
>>> Questions
>>> =========
>>>
>>> There are basically two questions:
>>>
>>> 1. Do we want to explicitly set the mode (anonymous/specific), in which way
>>> and how to call them
>>
>> As an API designed, I don't think that difference is worth being
>> apparent to users. As a user, I'd expect to either pass the name of the
>> file, or one specific URL, with or without a verification hash.
>>
(BTW, I meant "As an API designer...")
>
> Well you can consider downloading `netperf-1.6.0.tar.bz2` from 2 different
> locations. Having explicitly said that it's an anonymous download, it'll use
> the other location no matter whether they match.
>
You can do that today with:
fetch_asset('netperf-1.6.0.tar.bz2',
            locations=['http://loc1', 'http://loc2'])
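A hedged sketch of the fallback logic behind such multi-location
fetching (simplified: the real `fetch_asset` also does hash verification
and locking; `download` here is a stand-in callable, not a real API):

```python
def fetch_first_available(name, locations, download):
    # Try each location in order; return the first successful download.
    # `download` is any callable(url) -> bytes that raises on failure.
    errors = []
    for url in locations:
        try:
            return download(url)
        except Exception as exc:  # keep trying the remaining mirrors
            errors.append((url, exc))
    raise RuntimeError("all locations failed for %s: %s" % (name, errors))
```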
> Actually, this brings plus points to the DB-based solution, because even
> specific downloads would be able to specify multiple locations (using a
> list of urls instead of a single url).
>
Yep, the sky is the limit with such a database-based solution, both in
terms of work and possibilities.
>>> 2. Which implementation we want to use (are there existing solutions we can
>>> simply use?)
>>>
>>
>> I've did some research, and I don't ready to use code we could
>> incorporate. At this point, I'd choose the simplest solution we can
>> write, that solves the issue at hand.
>>
>
Oh my... Did I really write that first sentence? I can hardly understand
it. I'll cut myself some slack, as it was pretty late.
> I'm fine with that, although I'd prefer at least creating the `specific`
> cache (we don't necessarily have to split all 3; simply creating a
> `specific_downloads` dir and putting all $URL_HASH/$ASSET in there should
> do, given we don't want a bullet-proof solution). One possible improvement
> might be to at least store the actual url and fetched file into
> `$URL_HASH/.specific_download_url`, which could then be used by a contrib
> script to reconstruct the original assets. Anyway, all of these are just
> suggestions and I could live without them. The point of this email was to
> find out which way we want to go and whether there is anyone with good
> arguments for/against each solution.
>
We do need to create at least two (say, "by_name" and "by_location") to
really avoid having *any* possible clash.
It looks like the possible name clash is the only deal breaker in the
previous version, so I'll be working on something based on it, but with
a separation of directories for "name-based" and "location-based" assets.
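In path terms, the separation could look roughly like this (a sketch
only; the helper names are invented, and the directory names follow the
"by_name"/"by_location" split above):

```python
import hashlib
import os

def name_based_path(cache_dir, name):
    # Assets requested purely by file name share one entry per name
    return os.path.join(cache_dir, "by_name", name)

def location_based_path(cache_dir, url):
    # Assets tied to an exact url get a hash(url) directory of their own
    url_hash = hashlib.sha1(url.encode()).hexdigest()
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join(cache_dir, "by_location", url_hash, basename)
```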
Thanks!
--
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[ 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3 ]