On 06/20/2018 10:13 AM, Lukáš Doktor wrote:
> Dne 19.6.2018 v 04:50 Cleber Rosa napsal(a):
>>
>>
>> On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
>>> Hello guys,
>>>
>>> As Cleber pointed out on the release meeting, we are struggling with our
>>> asset fetcher. It was designed with one goal: to cache arbitrary files by
>>> name. It fails when different assets use the same name, as it simply
>>> assumes they are the same thing.
>>>
>>> Let's have a look at this example:
>>>
>>> fetch("https://www.example.org/foo.zip")
>>> fetch("https://www.example.org/bar.zip")
>>> fetch("http://www.example.org/foo.zip")
>>>
>>> Currently the third fetch simply uses the "foo.zip" downloaded from "https"
>>> even though it could be a completely different file (or downloaded from a
>>> completely different url). This is good and bad. It's good when you're
>>> downloading **any** "ltp.tar.bz2", or **any** "netperf.zip", but if you are
>>> downloading "vmlinuz" which is always called "vmlinuz" but comes from a
>>> different subdirectory, it might lead to big problems.
>>>
>>
>> Right, all your observations so far are correct.
>>
>>> From this I can see two modes of assets, anonymous and specific. Instead of
>>> trying to detect this based on combinations of hashes and methods, I'd
>>> suggest being explicit and either adding an extra argument, or even creating
>>> new classes `AnonymousAsset` and `SpecificAsset`, where `AnonymousAsset`
>>> would be the current implementation and we still need to decide on
>>> `SpecificAsset` implementation. Let's discuss some approaches and use
>>> following assets in examples:
>>
>> Given that `fetch_asset()` is the one interface test writers are
>> encouraged to use, having different classes is really an implementation
>> detail at this point. For this discussion, I do see value in naming the
>> different modes though.
>>
>>>
>>>
>>> Current implementation
>>> ----------------------
>>>
>>> The current implementation is anonymous, and the third fetch simply
>>> returns the "foo.zip" fetched by the first one.
>>>
>>> Result:
>>>
>>> foo.zip
>>> bar.zip # This one is fetched from "https"
>>>
>>>
>>> + simplest
>>> - leads to clashes
>>>
>>>
>>
>> No need to say that this is where we're moving away from.
>>
>>> Hashed url dir
>>> --------------
>>>
>>> I can see multiple options. Cleber proposed in
>>> https://github.com/avocado-framework/avocado/pull/2652 to create, in such
>>> cases, a dir based on "hash(url)" and store all assets of a given url
>>> there. It seems to be fairly simple to develop and maintain, but the cache
>>> might become hard to upkeep, and there is a non-zero (though nearly zero)
>>> possibility of clashes.
>>>
>>
>> Definitely my goals were to keep it simple, and to fix the one issue at
>> hand. My motivations are based on the idea that Avocado should be able
>> to help test writers with common tasks in tests, but it's hardly going
>> to contain in itself full-blown, complete solutions to generic
>> problems such as downloading and caching files.
>>
>> Now, it'd be nice to describe the possible clashes (besides, of course,
>> possible hash collisions). I can think of one:
>>
>> * Anonymous with url "http://foo/bar.zip"
>> * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54"
>>
>> The clash exists because sha1("http://foo/bar.zip") is
>> 4b3d163d2565966cf554ccb6e81f31f8695ceb54.
>>
>
> Yep, this is one possible clash. Another is to have different urls that use
> the same hash. As I mentioned, it's not frequent, but it can happen.
>
Please excuse me, but I don't really follow the "other" possible clash
you mention. Do you mean a hash collision on different URLs? Like a
hypothetical sha1("http://foo/bar") == sha1("http://something/else") ?
> Note we can decrease the probability by prefixing something, e.g. "url_$sha".
>
In the example I give later, I think we can just as easily move from
"decrease the probability" to *eliminate* the possibility by putting
"anonymous" and "specific" assets in different directories.
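To make that concrete, here is a minimal sketch (the cache path and
directory names are invented for illustration, not the actual
implementation) of how separate top-level directories make the clash
structurally impossible rather than just unlikely:

```python
import hashlib
import os

CACHE = "/home/foo/avocado/cache"  # illustrative cache location

# "Anonymous" asset: stored under a directory named sha1(url)
anon_url = "http://foo/bar.zip"
anon_dir = hashlib.sha1(anon_url.encode()).hexdigest()

# "Specific" asset whose file name happens to equal that very hash
specific_name = anon_dir

# Flat layout: both kinds share one namespace, so the entries collide
flat_anon = os.path.join(CACHE, anon_dir)
flat_specific = os.path.join(CACHE, specific_name)
assert flat_anon == flat_specific  # the clash described above

# Separate top-level directories: the paths can never coincide
split_anon = os.path.join(CACHE, "by_origin", anon_dir)
split_specific = os.path.join(CACHE, "by_name", specific_name)
assert split_anon != split_specific  # clash eliminated, not just unlikely
```

Prefixing only shrinks the overlap between the two namespaces; distinct
directories remove it entirely.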
>>> Another problem would be concurrent access, as we might start downloading a
>>> file with the same name as a url dir, plus all kinds of different clashes,
>>> and we'll only find out all the issues when people start extensively using this.
>>
>> I'm not sure I follow. Right now the names of the *downloaded* files are
>> temporary. Then a lock for the destination name is acquired,
>> verification is done, and if successful, copied into the destination name.
>>
>>>
>>> Result:
>>>
>>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip # Fetched from https
>>> 2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>>> 6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip # Fetched from http
>>>
>>> + simple to develop
>>> + simple to maintain
>>
>> Agreed.
>>
>>> - possible clashes
>>
>> Agreed, ideally there should be none.
>>
>>> - hard to browse manually
>>
>> True. As a note, I'd trade the ease to browse the cache for a simpler
>> overall implementation, because the "browsing" could then be done with
>> simple scripts reusing the same implementation (like a contrib script
>> importing from "avocado.utils.asset").
>>
>
> Not really. You can't reproduce the url from the hash, so all you'd know is
> that you have these specific-downloaded files from somewhere, how old they
> are, and when they were last used. Better than nothing, but far from perfect.
>
But still we could reuse the same logic. For instance, a command such as:
$ avocado asset --is-in-cache=http://foo/bar/baz.iso
IN_CACHE: /home/foo/avocado/cache/$HASH/baz.iso
Could reuse "avocado.utils.asset" and give users information about the
cache. But please note that this is just an example: my mindset is to
fix a current bug in how the asset fetcher is (and can be) used.
Everything else is subject to change.
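Sketching that idea a bit further (everything here, the layout and the
helper names, is hypothetical, not existing Avocado code), such a contrib
script could share the url-to-path logic with the fetcher:

```python
import hashlib
import os

def cache_path(cache_dir, url):
    # Map a url to its expected place in a hash(url)-keyed cache layout
    url_hash = hashlib.sha1(url.encode()).hexdigest()
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join(cache_dir, url_hash, basename)

def is_in_cache(cache_dir, url):
    # Return the cached path if the asset is present, otherwise None
    path = cache_path(cache_dir, url)
    return path if os.path.exists(path) else None

if __name__ == "__main__":
    hit = is_in_cache("/home/foo/avocado/cache", "http://foo/bar/baz.iso")
    if hit:
        print("IN_CACHE: %s" % hit)
    else:
        print("NOT IN CACHE")
```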
>>> - API changes might lead to unusable files (users would have to remove
>>> files manually)
>>
>> This too, but I really think this is minor. We may want to include
>> cache stability along the same lines as the API stability (on LTS
>> releases), but I don't think we've made promises about it yet.
>>
>
> Sure, I know users can simply remove everything once in a while...
>
>>>
>>>
>>> sqlite
>>> ------
>>>
>>> Another approach would be to create an sqlite database in every cache-dir. For
>>> anonymous assets nothing would change, but for specific assets we'd create
>>> a new tmpdir per given asset and store the mapping in the database.
>>>
>>> Result:
>>>
>>> .avocado_cache.sqlite
>>> foo-1X3s/foo.zip
>>> bar-3s2a/bar.zip
>>> foo-t3d2/foo.zip
>>>
>>> where ".avocado_cache.sqlite" would contain:
>>>
>>> https://www.example.org/foo.zip foo-1X3s/foo.zip
>>> https://www.example.org/bar.zip bar-3s2a/bar.zip
>>> http://www.example.org/foo.zip foo-t3d2/foo.zip
>>>
>>> Obviously, by creating a db we could improve many things. One example
>>> would be to store an expiration date and, based on last access times in
>>> the db, run cache-dir upkeep, removing outdated assets.
>>>
>>> Another improvement would be to store the downloaded asset's hash and
>>> re-download & update the hash when the file is modified, even when the
>>> user didn't provide a hash.
>>>
>>> + easy to browse manually
>>> + should be simple to expand the features (upkeep, security, ...)
>>> + should simplify locking, as we can atomically move the downloaded
>>> file & update the db; even crashes should lead to predictable behavior
>>> - slightly more complicated to develop
>>> - "db" file would have to be protected
>>>
>>>
>>
>> I'm a bit skeptical about the "improvements" here, not because they are
>> not valid features, but because we don't need them at this point. I
>> don't think future possibilities should shape too much of the immediate
>> architecture decisions.
>>
>
> Well, if you remember the first implementation, I said using a DB would be
> useful in the future. The future is here; we can either patch it and somehow
> support a slightly insecure version, or finally do it properly.
>
In my opinion, your intentions are correct, but you're missing the point
that this "proper" implementation would most probably be a small project
in itself. If you are volunteering to write that, and can commit to
delivering it in a "few days' time", then we could halt any other
approach.
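For scale, a minimal sketch of what the url-to-path mapping could look
like with sqlite (table and column names are made up for illustration; a
real implementation would also need locking and upkeep):

```python
import sqlite3
import time

def open_db(path):
    # One mapping table: url -> local path, plus metadata for upkeep
    db = sqlite3.connect(path)
    db.execute("""CREATE TABLE IF NOT EXISTS assets (
                      url TEXT PRIMARY KEY,
                      local_path TEXT NOT NULL,
                      sha1 TEXT,
                      last_access REAL)""")
    return db

def record(db, url, local_path, sha1=None):
    # Same url always maps to one entry; re-recording updates it
    db.execute("INSERT OR REPLACE INTO assets VALUES (?, ?, ?, ?)",
               (url, local_path, sha1, time.time()))
    db.commit()

def lookup(db, url):
    row = db.execute("SELECT local_path FROM assets WHERE url = ?",
                     (url,)).fetchone()
    return row[0] if row else None
```

With this, the two "foo.zip" urls from the example map to distinct local
paths, and `last_access` gives cache upkeep something to expire on.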
>>> Other solutions
>>> ---------------
>>>
>>> There are many other solutions like using `$basename-$url_hash` as the name
>>> or using `astring.string_to_safe_path` instead of url_hash and so on. We
>>> are open to suggestions.
>>>
>>>
>>
>> I think the one big problem we have identified is the clash between
>> "anonymous" and "specific" assets. My proposal to fix that would be to
>> define the following structure:
>>
>> base_writable_cache_dir
>> ├── by_name <- "specific" verified files
>> ├── by_origin <- "anonymous" verified files
>> └── download <- temporary base dir for downloads
>>
>
> This was actually something I kept for future discussion as a possible
> improvement (even with a DB). The second improvement was to use
> `astring.string_to_safe_path` along with the hash, which might improve
> url reconstruction.
>
Right, I have not forgotten your valid suggestions of using
`astring.string_to_safe_path`.
>>> Questions
>>> =========
>>>
>>> There are basically two questions:
>>>
>>> 1. Do we want to explicitly set the mode (anonymous/specific), in which way
>>> and how to call them
>>
>> As an API designed, I don't think that difference is worth being
>> apparent to users. As a user, I'd expect to either pass the name of the
>> file, or one specific URL, with or without a verification hash.
>>
(BTW, I meant "As an API designer...")
>
> Well you can consider downloading `netperf-1.6.0.tar.bz2` from 2 different
> locations. Having explicitly said that it's an anonymous download, it'll use
> the other location no matter whether they match.
>
You can do that today with:
fetch_asset('netperf-1.6.0.tar.bz2',
            locations=['http://loc1', 'http://loc2'])
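A hedged sketch of the fallback logic behind such multi-location
fetching (simplified: the real `fetch_asset` also does hash verification
and locking; `download` here is a stand-in callable, not a real API):

```python
def fetch_first_available(name, locations, download):
    # Try each location in order; return the first successful download.
    # `download` is any callable(url) -> bytes that raises on failure.
    errors = []
    for url in locations:
        try:
            return download(url)
        except Exception as exc:  # keep trying the remaining mirrors
            errors.append((url, exc))
    raise RuntimeError("all locations failed for %s: %s" % (name, errors))
```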
> Actually, this brings plus points to the DB-based solution, because even
> specific downloads would be able to specify multiple locations (using a
> list of urls instead of a single url).
>
Yep, the sky is the limit with such a database-based solution, both in
terms of work and possibilities.
>>> 2. Which implementation we want to use (are there existing solutions we can
>>> simply use?)
>>>
>>
>> I've did some research, and I don't ready to use code we could
>> incorporate. At this point, I'd choose the simplest solution we can
>> write, that solves the issue at hand.
>>
>
Oh my... Did I really write that first sentence? I can hardly understand
it. I'll cut myself some slack, as it was pretty late.
> I'm fine with that, although I'd prefer at least creating the `specific`
> cache (we don't necessarily have to split all 3; simply creating a
> `specific_downloads` dir and putting all $URL_HASH/$ASSET in there should
> do, given we don't want a bullet-proof solution). One possible improvement
> might be to at least store the actual url and fetched file into
> `$URL_HASH/.specific_download_url`, which could then be used by a contrib
> script to reconstruct the original assets. Anyway, all of these are just
> suggestions and I could live without them. The point of this email was to
> find out which way we want to go and whether there is anyone with good
> arguments for/against each solution.
>
We do need to create at least two (say, "by_name" and "by_location") to
really avoid having *any* possible clash.
It looks like the possible name clash is the only deal breaker in the
previous version, so I'll be working on something based on it, but with
a separation of directories for "name-based" and "location-based" assets.
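In path terms, the separation could look roughly like this (a sketch
only; the helper names are invented, and the directory names follow the
"by_name"/"by_location" split above):

```python
import hashlib
import os

def name_based_path(cache_dir, name):
    # Assets requested purely by file name share one entry per name
    return os.path.join(cache_dir, "by_name", name)

def location_based_path(cache_dir, url):
    # Assets tied to an exact url get a hash(url) directory of their own
    url_hash = hashlib.sha1(url.encode()).hexdigest()
    basename = url.rstrip("/").rsplit("/", 1)[-1]
    return os.path.join(cache_dir, "by_location", url_hash, basename)
```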
Thanks!
--
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[ 7ABB 96EB 8B46 B94D 5E0F E9BB 657E 8D33 A5F2 09F3 ]