That's a good point about the asset cache, Bob; it is going to suffer from duplicates. I think it would need an
arrangement much like the asset service, where the metadata/hashes are kept in a db, since it would be prohibitive to
scan and rehash all the assets in the cache on every startup.
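Very roughly, I'm picturing something like the following sketch (the table and file layout here are invented for
illustration, not what the cache actually does today):

    import hashlib
    import os
    import sqlite3

    conn = sqlite3.connect("cache_index.db")
    conn.execute("CREATE TABLE IF NOT EXISTS cache_index "
                 "(uuid TEXT PRIMARY KEY, hash TEXT NOT NULL)")

    def cache_store(uuid, data, cache_dir="cache"):
        # Hash once at write time and persist the uuid -> hash mapping,
        # so nothing needs to be rescanned or rehashed at startup.
        h = hashlib.sha256(data).hexdigest()
        os.makedirs(cache_dir, exist_ok=True)
        path = os.path.join(cache_dir, h)
        if not os.path.exists(path):
            # The blob may already be cached under another UUID; store it once.
            with open(path, "wb") as f:
                f.write(data)
        conn.execute("INSERT OR REPLACE INTO cache_index VALUES (?, ?)", (uuid, h))
        conn.commit()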
It would have been so much simpler if asset UUIDs were actually direct hashes of the data. However, that might cause
some issues with assets that are uploaded as both 'permanent' and 'temporary', since you couldn't have varying metadata
for the same data, and there are some special-case assets with fixed, well-known UUIDs. Maybe it's still worth thinking
about at some point, though.
And yeah, I think there's also room for optimizations (e.g. when an asset is uploaded, the simulator can hash it and
ask the remote service whether it already has a copy, rather than uploading the whole thing for the asset service to do
the hashing). But this requires extensions to the asset service protocol, which starts to get hairy; I would rather get
the basics right first.
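For the record, the shape I have in mind is roughly this (a sketch only; the 'exists' query and URL layout are a
hypothetical protocol extension, not anything that exists today):

    import hashlib
    import requests

    def upload_asset(base_url, metadata, data):
        # Hash locally and ask the asset service whether it already
        # holds this blob, before shipping the bytes over the wire.
        h = hashlib.sha256(data).hexdigest()
        r = requests.get(f"{base_url}/blobs/{h}/exists")
        if not (r.status_code == 200 and r.json().get("exists")):
            # Unknown blob: upload it once, keyed by its hash.
            requests.post(f"{base_url}/blobs/{h}", data=data)
        # Metadata is always sent; it references the blob by hash.
        requests.post(f"{base_url}/assets", json={**metadata, "hash": h})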
On 10/03/12 16:11, Bob Wellman wrote:
Justin
Great news that you now have deduplication of assets in the asset database working. This will save a lot of space on
the asset server disks and is a great step forward, IMHO.
Have you considered extending this work by also deduplicating the OpenSim region servers' cache in a similar way?
What I mean by that is this:
Currently, when the region server needs an asset (say a texture) it doesn't have, it asks the asset server for that
asset (metadata + blob). When it needs a second asset (say a texture identical to the first one), it also asks for
the second asset (metadata and blob). It caches both of these assets as received on the region server. So we have
had two large transmissions of data, and both have been cached: the same duplicated blob is sent and stored twice.
This seems wasteful of cache space and, more importantly, wasteful of precious bandwidth.
Would it be possible to change the process so that, when it asks for unknown assets, only the asset metadata and
hash pointer are received at first? It could then check whether the hash for that blob is already cached (due to
being a duplicate of a previous asset), and only if it doesn't already have that blob would it fetch and cache it,
as in the sketch below.
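(All function and field names here are invented, purely to illustrate the flow:)

    def fetch_asset(asset_uuid, asset_server, cache):
        # The first round trip returns only metadata plus the blob's hash.
        meta = asset_server.get_metadata(asset_uuid)
        blob = cache.get_blob(meta["hash"])
        if blob is None:
            # Blob not seen before: fetch it once and cache it keyed by
            # hash, so an identical blob is never transmitted twice.
            blob = asset_server.get_blob(meta["hash"])
            cache.put_blob(meta["hash"], blob)
        return meta, blob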
It means a two-tier caching system (asset and blob), replicating the two-tier asset database design you have
invented already. I'm not sure how hard this would be to program, but I think the benefits would be worthwhile in
performance terms.
I confess that my ulterior motive would be, if we get this improvement, to ask later for a further one, where the
passing of assets from simulator to viewer adopts a similar two-tier approach, as I believe that would gain us an
even greater performance benefit.
Cutting down unnecessary internet traffic has to be a good goal, I think. What is your opinion, Justin?
Bob Wellman (PMgrid admin)
> Date: Sat, 10 Mar 2012 02:41:11 +0000
> From: [email protected]
> To: [email protected]
> Subject: Re: [Opensim-dev] Proposal: Implement a de-duplicating core ROBUST asset service
>
> I already have a working de-duplicating ROBUST asset service using hashing. It was not at all hard to do (largely
> because SRAS had already demonstrated how it could be done), so complexity on this end is not an issue.
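> The heart of it is just splitting metadata from blob storage, along these lines (a Python sketch with invented
> names, not the actual ROBUST schema):
>
>     import hashlib
>     import sqlite3
>
>     db = sqlite3.connect("assets.db")
>     db.execute("CREATE TABLE IF NOT EXISTS assetsmeta "
>                "(id TEXT PRIMARY KEY, name TEXT, hash TEXT NOT NULL)")
>     db.execute("CREATE TABLE IF NOT EXISTS assetsdata "
>                "(hash TEXT PRIMARY KEY, data BLOB NOT NULL)")
>
>     def store_asset(asset_id, name, data):
>         # Identical blobs collapse into a single row keyed by hash,
>         # while per-asset metadata remains free to vary.
>         h = hashlib.sha256(data).hexdigest()
>         db.execute("INSERT OR IGNORE INTO assetsdata VALUES (?, ?)", (h, data))
>         db.execute("INSERT OR REPLACE INTO assetsmeta VALUES (?, ?, ?)", (asset_id, name, h))
>         db.commit()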
>
> I have read many articles on filesystem vs blob storage. There are pros and cons either way. From what I've read,
> the performance difference is actually quite small.
>
> As this service is for light to medium use, in my opinion the simplicity of managing just a database wins out over
> the advantages of a filesystem approach. Anybody wanting filesystem storage right now can use SRAS [1], which is a
> third-party project developed externally from opensim-core that provides this and other extra features, or you can
> roll your own, which would not be difficult for anybody moderately competent in PHP to do.
>
> If you're running a large grid, this is always going to entail extra work and co-ordination of components, just
> like running a large website.
>
> [1] https://github.com/coyled/sras
>
> On 09/03/12 04:06, Wade Schuette wrote:
> > Justin,
> >
> > I have to respectfully agree with Cory.
> >
> > Wouldn't something like the following address your valid concerns about complexity, reducing total load, and
> > perceived system response time for both filing and retrieving assets?
> >
> > First, if you use event-driven processes, there's no reason to rescan the entire database, and by separating the
> > processes into distinct streams, they are decoupled, which is actually a good thing and simplifies both sides.
> > There's no reason I can see that they need to be coupled, and separating them allows them to be optimized and
> > tested separately, which is a good thing.
> >
> > In fact, the entire deduplication process could run overnight at a low-load time, which is even better, or have
> > multiple "worker" processes assigned to it if it's taking too long. Seems very flexible.
> >
> > I'm assuming that a hash-code isn't unique, but just specifies the bucket into which this item can be
> > categorized.
> >
> > When a new asset arrives, if the hash-code already exists, put the unique ID in a pipe, finish filing it, and
> > move on. If the hash-code doesn't already exist, just file it and move on.
> >
> > At the other end of the pipe, this wakes up a process that can, as time allows, check in the background to see
> > whether not only the hash-code but the entire item is the same, and if so, change the handle to point to the
> > existing copy. (For all I know, this can be done in one step if CRC codes are sufficiently unique, but computing
> > such a code is CPU intensive unless you can do it in hardware.)
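> > In rough pseudocode, the two halves might look like this (every name here is invented, just to show the shape):
> >
> >     import queue
> >
> >     pending = queue.Queue()  # the "pipe" between the two processes
> >
> >     def on_new_asset(store, asset_id, data, h):
> >         # Fast path: file the asset immediately, and only flag
> >         # potential duplicates (same hash bucket) for later checking.
> >         duplicate_suspect = store.hash_exists(h)
> >         store.put(asset_id, data, h)
> >         if duplicate_suspect:
> >             pending.put(asset_id)
> >
> >     def dedup_worker(store):
> >         # Runs in the background (e.g. overnight): verify the whole
> >         # item matches, not just the hash, before repointing the handle.
> >         while True:
> >             asset_id = pending.get()
> >             original = store.find_original(asset_id)
> >             if original and store.bytes_equal(asset_id, original):
> >                 store.repoint(asset_id, original)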
> >
> > Of course, now the question arises of what happens when the original person DELETES the shared item. If you have
> > solid database integrity, you only need to know how many pointers to it exist, and if someone deletes "their
> > copy", you decrease the count by one; when the count gets to one, the next delete can actually delete the entry.
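> > For instance (sketch only; the assets(id, hash) and blobs(hash, data) tables are invented, and counting the
> > remaining rows directly stands in for keeping an explicit reference count):
> >
> >     def delete_asset(db, asset_id):
> >         # Remove the metadata row, then drop the shared blob only
> >         # when no other asset still references its hash.
> >         row = db.execute("SELECT hash FROM assets WHERE id = ?", (asset_id,)).fetchone()
> >         if row is None:
> >             return
> >         db.execute("DELETE FROM assets WHERE id = ?", (asset_id,))
> >         (refs,) = db.execute("SELECT COUNT(*) FROM assets WHERE hash = ?", (row[0],)).fetchone()
> >         if refs == 0:
> >             db.execute("DELETE FROM blobs WHERE hash = ?", (row[0],))
> >         db.commit()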
> >
> >
> >
> > Wade
> >
> >
> >
> >
> > On 3/8/12 7:41 PM, Justin Clark-Casey wrote:
> >> On 08/03/12 22:00, Rory Slegtenhorst wrote:
> >>> @Justin
> >>> Can't we do the data de-duplication at the database level? E.g. find the duplicates and just get rid of them
> >>> on a regular interval (cron)?
> >>
> >> This would be enormously intricate. Not only would you have to keep rescanning the entire asset db, but it adds
> >> another moving part to an already complex system.
> >>
> >
--
Justin Clark-Casey (justincc)
http://justincc.org/blog
http://twitter.com/justincc
_______________________________________________
Opensim-dev mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/opensim-dev