[openstack-dev] Feedback about Swift API - Especially about Large Objects

Pierre SOUCHAY Fri, 09 Oct 2015 13:47:13 -0700

Hi Swift Developers,

We have been using Swift as an IaaS provider for more than two years now, but 
this mail is about feedback on the API side. I think it would be great to 
include some of these ideas in future revisions of the API.


I’ve been developing a few Swift clients: in HTML with CORS (in the Cloudwatt 
Dashboard), in Java with a Swing GUI 
(https://github.com/pierresouchay/swiftbrowser) and in Go for syncing Swift to 
a filesystem (https://github.com/pierresouchay/swiftsync/), so I now have a few 
ideas about how to improve the API a bit.

The API is quite straightforward and intuitive to use, and writing a client is 
not that difficult, but unfortunately, Large Object support is not easy to deal 
with at all.

The biggest issue is that there is no way to know whether a file is a large 
object when performing listings using the JSON format, since, AFAIK, a large 
object is listed as an object with 0 bytes (so its size in bytes is 0), and 
with the hash of a zero-byte file.

For instance, the listing entry of such an object is:
 {"hash": "d41d8cd98f00b204e9800998ecf8427e", "last_modified": 
"2015-06-04T10:23:57.618760", "bytes": 0, "name": "5G", "content_type": 
"octet/stream"}

which is exactly the hash of a 0-byte file:
$ echo -n | md5
d41d8cd98f00b204e9800998ecf8427e

OK, now let’s try HEAD:
$ curl -vv -XHEAD -H X-Auth-Token:$TOKEN 
'https://storage.fr1.cloudwatt.com/v1/AUTH_61b8fe6dfd0a4ce69f6622ea74444e0f/large_files/5G'
…
< HTTP/1.1 200 OK
< Date: Fri, 09 Oct 2015 19:43:09 GMT
< Content-Length: 5000000000
< Accept-Ranges: bytes
< X-Object-Manifest: large_files/5G/.part-5000000000-
< Last-Modified: Thu, 04 Jun 2015 10:16:33 GMT
< Etag: "479517ec4767ca08ed0547dca003d116"
< X-Timestamp: 1433413437.61876
< Content-Type: octet/stream
< X-Trans-Id: txba36522b0b7743d683a5d-00561818cd

WTF? While regular files have the same value for the ETag header and the 
listing hash, this is not the case for large files…

Furthermore, the ETag is not the MD5 of the whole file, but the MD5 of the 
concatenated MD5 hashes of all the segments referenced by the manifest (as 
described somewhere deep in the documentation).
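
To make that concrete, here is a minimal sketch of the computation. It assumes 
the behaviour the Swift documentation describes for X-Object-Manifest objects: 
the ETag is the MD5 of the segments' hex ETags, concatenated in listing order. 
The function name is only illustrative.

```python
import hashlib


def manifest_etag(segment_etags):
    """MD5 of the concatenated hex ETags of the segments, in order.

    Assumption: this mirrors how Swift derives the ETag of an
    X-Object-Manifest object; it is NOT the MD5 of the object content.
    """
    joined = "".join(etag.strip('"') for etag in segment_etags)
    return hashlib.md5(joined.encode("ascii")).hexdigest()


# Two segments whose contents are b"hello " and b"world".
parts = [hashlib.md5(b"hello ").hexdigest(), hashlib.md5(b"world").hexdigest()]
content_md5 = hashlib.md5(b"hello world").hexdigest()

# The manifest ETag and the MD5 of the reassembled content differ.
print(manifest_etag(parts) == content_md5)
```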

Why is this a problem?
----------------------

Imagine a « naive » client using the API to perform some kind of sync.

The client downloads each file and, when it syncs, compares the local MD5 to 
the MD5 from the listing… of course, that hash is the hash of a zero-byte 
file… so it downloads the file again… and again… and again. Unfortunately for 
our naive client, this is exactly the kind of file we don’t want to download 
twice… since the file is probably huge (after all, it has been split for a 
reason, no?)

I think this is really a design flaw, since you need to know everything about 
the Swift API and its extensions to get proper behavior. The minimum would be 
for the listing to at least return the same value as the ETag header.
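
A minimal sketch of that naive comparison, using the listing entry shown 
earlier. The function name and the local MD5 value are placeholders for 
illustration only.

```python
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of zero bytes


def naive_needs_download(local_md5, entry):
    """Naive rule: download again whenever the listing hash differs."""
    return local_md5 != entry["hash"]


# Listing entry of the large object: 0 bytes, hash of an empty file.
entry = {"hash": EMPTY_MD5, "bytes": 0, "name": "5G"}

# MD5 of the real content as computed locally (placeholder value).
local_md5 = "0123456789abcdef0123456789abcdef"

# Always True for a non-empty local copy, so every sync downloads it again.
print(naive_needs_download(local_md5, entry))
```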

OK, let’s continue…

We are not so naive… our Swift sync client knows that 0-byte files need more 
work.

* First issue: we have to know whether the file is a « real » 0-byte file or 
not. You may think most people do not create 0-byte files after all… this is 
wrong. Actually, I have seen two Object Storage middlewares using many 0-byte 
files (for instance to store metadata or to set up some kind of 
directory-like structure). So, in this case, we need to perform a HEAD request 
on each 0-byte file. If you have 1000 files like this, you have to perform 
1000 HEAD requests to finally learn that there is not a single large file. Not 
very efficient. Your Swift sync client took 1 second to sync 20G of data with 
the naive approach; now, you need 5 minutes… reporting the hash of 0 bytes is 
not a good idea at all.
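
A small sketch of the workaround, assuming a parsed JSON listing: every entry 
that looks empty has to be checked with its own HEAD request. The sample 
entries and the function name are made up for illustration.

```python
EMPTY_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 of zero bytes


def head_candidates(listing):
    """Names of entries that may be empty objects OR large-object manifests.

    The listing alone cannot tell the two apart, so each name returned
    here costs one extra HEAD request.
    """
    return [
        entry["name"]
        for entry in listing
        if entry["bytes"] == 0 and entry["hash"] == EMPTY_MD5
    ]


listing = [
    {"name": "5G", "bytes": 0, "hash": EMPTY_MD5},    # a manifest
    {"name": "dir/", "bytes": 0, "hash": EMPTY_MD5},  # a real empty object
    {"name": "a.txt", "bytes": 12, "hash": "6f5902ac237024bdd0c176cb93063dc4"},
]

to_check = head_candidates(listing)
print(len(to_check), "HEAD requests needed:", to_check)
```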

* Second issue: since the ETag is the hash of the hashes of all parts (I have 
an idea about why this decision was made, probably for performance reasons), 
your client cannot work on existing local files, because the MD5 of a local 
file is not the ETag of the Swift aggregated file. So you have to either:
 - split each file the same way as the manifest, compute the MD5 of each part, 
then compute the MD5 of those hashes and compare it to the ETag on the server… 
(OK… doable, but I gave up on such a system)
 - keep a local database in your client (when you download, store the REAL hash 
of the file alongside the hash returned by the server, which is the one you 
have to compare against)
 - use some kind of crappy heuristic (size + the first bytes of each part, or 
something like that…)
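
The first option can be sketched as follows. This assumes that every segment 
except the last has the same known size and that the server ETag is the MD5 of 
the concatenated hex digests of the segments; both are assumptions to check 
against the real upload. The function name is hypothetical.

```python
import hashlib
import io


def local_manifest_etag(stream, segment_size):
    """Hash a local file the way the server hashes a segmented object.

    `stream` must be a buffered binary stream, e.g. open(path, "rb"),
    so that read() returns full segments until the end of the file.
    """
    outer = hashlib.md5()
    while True:
        chunk = stream.read(segment_size)
        if not chunk:
            break
        # Feed the hex digest of each segment into the outer hash.
        outer.update(hashlib.md5(chunk).hexdigest().encode("ascii"))
    return outer.hexdigest()


# 10 bytes split into segments of 4, 4 and 2 bytes.
data = b"a" * 10
etag = local_manifest_etag(io.BytesIO(data), segment_size=4)
print(etag)
```

The client then compares this value with the ETag header from the HEAD 
response instead of comparing the plain MD5 of the file.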

* Third issue: if you don’t want to store the parts of your large object 
locally, you have to wait for all your HEAD requests to finish, since that is 
the only way to find all the objects that are referenced by your manifest 
headers.
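
A sketch of that filtering step, assuming the HEAD responses have already been 
collected into a dict of object name to X-Object-Manifest value (which has the 
form "<container>/<prefix>", as in the HEAD output above). The segment names 
in the example are invented.

```python
def split_segments(container, names, manifests):
    """Separate segment objects from the other names of a container.

    `manifests` maps object name -> X-Object-Manifest header value and is
    only complete once every HEAD request has returned.
    """
    prefixes = []
    for value in manifests.values():
        seg_container, _, prefix = value.partition("/")
        if seg_container == container:
            prefixes.append(prefix)
    # str.startswith accepts a tuple; an empty tuple matches nothing.
    segments = [n for n in names if n.startswith(tuple(prefixes))]
    others = [n for n in names if n not in segments]
    return segments, others


names = [
    "5G",
    "5G/.part-5000000000-00000000",  # hypothetical segment names
    "5G/.part-5000000000-00000001",
    "notes.txt",
]
manifests = {"5G": "large_files/5G/.part-5000000000-"}

segments, others = split_segments("large_files", names, manifests)
print(segments)
print(others)
```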

To summarize, I think the current API really needs some refinement of the 
listings, since a competent developer may trust the bytes value and the hash 
value and create an algorithm that does not behave nicely. So, the API looks 
easy but is in fact much more complicated than expected.

A few ideas to improve it:

In listings, if an object is a large object:
 - either put the real MD5 of the file if that is technically doable… or remove 
the hash (so naive programs will work nicely)… same thing for bytes.
 - add an optional field in the JSON to tell that the object is in fact a large 
object. A nice value for that field would be the X-Object-Manifest header 
value. A client could then tell whether an entry is a large object or simply a 
zero-byte object, and also know which objects are in fact parts of a larger one 
(without waiting for thousands of HEAD requests to finish).
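
To illustrate the second idea, here is how a client could use such a field. 
The field name "object_manifest" is purely hypothetical: it is the proposal 
above, not something the current listing API returns.

```python
def classify(entry):
    """Classify a listing entry without any HEAD request.

    Relies on a HYPOTHETICAL optional "object_manifest" field carrying
    the X-Object-Manifest value of large objects.
    """
    if entry.get("object_manifest"):
        return "large"
    if entry["bytes"] == 0:
        return "empty"
    return "regular"


listing = [
    {"name": "5G", "bytes": 0,
     "object_manifest": "large_files/5G/.part-5000000000-"},
    {"name": "dir/", "bytes": 0},
    {"name": "a.txt", "bytes": 12},
]

kinds = {entry["name"]: classify(entry) for entry in listing}
print(kinds)
```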

Finally, to help people create interfaces quickly, add an option to enable CORS 
for all containers of an account. At our cloud provider, we added a REST call 
in another web service, itself CORS-enabled, that ensures a given container has 
CORS set up. So, browsing Swift with HTML5 interfaces is easy. Such an option 
would - I think - greatly increase Swift usage (by not needing any specific 
software to browse Swift).
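
Without such an option, CORS has to be set container by container. The sketch 
below only builds the requests a client would send, using the 
X-Container-Meta-Access-Control-Allow-Origin metadata header; the storage URL, 
container names and origin are placeholders, and this is not the web service 
mentioned above.

```python
from urllib.parse import quote

CORS_HEADER = "X-Container-Meta-Access-Control-Allow-Origin"


def cors_requests(storage_url, containers, origin):
    """Describe one POST per container that sets its allowed origin.

    Returns (method, url, headers) tuples; sending them (with an
    X-Auth-Token header added) is left to the HTTP client of your choice.
    """
    base = storage_url.rstrip("/")
    return [
        ("POST", f"{base}/{quote(name, safe='')}", {CORS_HEADER: origin})
        for name in containers
    ]


requests_to_send = cors_requests(
    "https://swift.example.com/v1/AUTH_test/",  # placeholder storage URL
    ["large_files", "my photos"],
    "https://dashboard.example.com",
)
for method, url, headers in requests_to_send:
    print(method, url, headers)
```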

Best Regards


-- 
Pierre Souchay
Software Architect @ CloudWatt

Address: ETIK 892, Rue Yves Kermen 92100 Boulogne-Billancourt
Switchboard: +33 1 84 01 04 04
Fax: +33 1 84 01 04 05

