Hi Ryan,

As you probably know, each forest has a "large data directory" where files 
above the large file size limit (default 1MB) are placed.  By default it's a 
directory named "Large" next to the stands.  These files get hash names and 
are managed by MarkLogic, with separate reference fragments in the regular 
forest data.  You've found a nice optimization: duplicates created within the 
database don't require duplicated storage (just duplicated reference 
fragments).
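
To make the reference-fragment idea concrete, here is a toy model in Python. 
It is only a sketch of the behavior described in this thread, not MarkLogic's 
actual implementation: the class ToyForest, its method names, the "blob-N" 
ids, and the rule that stored data is dropped once nothing references it are 
all assumptions made up for illustration.

```python
import itertools


class ToyForest:
    """Toy model only: documents hold references to separately stored blobs."""

    def __init__(self):
        self.large = {}  # blob id -> bytes (stands in for the "Large" directory)
        self.refs = {}   # document URI -> blob id (stands in for reference fragments)
        self._ids = itertools.count(1)

    def _point(self, uri, blob_id):
        # Re-point a URI at a blob and drop the old blob if it became unreferenced.
        old = self.refs.get(uri)
        self.refs[uri] = blob_id
        if old is not None and old not in self.refs.values():
            del self.large[old]

    def load_from_filesystem(self, uri, data):
        # Always stores a fresh copy, even if identical bytes are already present.
        blob_id = f"blob-{next(self._ids)}"
        self.large[blob_id] = bytes(data)
        self._point(uri, blob_id)

    def copy_within_db(self, src_uri, dst_uri):
        # Only a new reference is written; the stored bytes are shared.
        self._point(dst_uri, self.refs[src_uri])

    def delete(self, uri):
        # Assumption: a blob goes away only when its last reference is removed.
        blob_id = self.refs.pop(uri)
        if blob_id not in self.refs.values():
            del self.large[blob_id]

    def large_bytes(self):
        return sum(len(b) for b in self.large.values())


forest = ToyForest()
payload = b"x" * (2 * 1024 * 1024)
forest.load_from_filesystem("/a.bin", payload)
forest.copy_within_db("/a.bin", "/b.bin")        # usage stays at one copy
print(forest.large_bytes())
forest.load_from_filesystem("/c.bin", payload)   # usage grows by a second copy
print(forest.large_bytes())
```

In this model, deleting "/a.bin" leaves "/b.bin" intact because the shared 
data still has a reference, which matches what you observed.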

Loading from a regular filesystem location copies the data into the "large data 
directory" and adds a reference fragment.  It sounds like you're finding that 
if you load the same file from the filesystem a second time, the duplication 
isn't detected?  That makes sense: it's hard to know two files are the same 
until the new one is fully loaded.  It would be possible to notice the 
duplication after the load finishes, but I don't know if that's implemented 
yet.  Looks like not.
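
For what it's worth, here is a rough Python sketch of what noticing the 
duplication after the load could look like. It is purely hypothetical and 
says nothing about how MarkLogic works internally: the function 
load_with_post_dedup, the use of SHA-256, and keying the store by content 
digest are all my assumptions. The point it illustrates is that the digest is 
only known once the last chunk has been read, so the check can only happen at 
the end.

```python
import hashlib
import io


def load_with_post_dedup(stream, store, chunk_size=64 * 1024):
    """Read a stream fully, then reuse an existing blob if the content matches.

    store maps a hex content digest to the stored bytes (hypothetical layout).
    Returns (digest, was_duplicate).
    """
    hasher = hashlib.sha256()
    buffer = io.BytesIO()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        hasher.update(chunk)   # the digest is incomplete until the final chunk
        buffer.write(chunk)

    digest = hasher.hexdigest()
    if digest in store:
        # Same content already stored: discard the new copy, keep one blob.
        return digest, True

    store[digest] = buffer.getvalue()
    return digest, False


store = {}
first = load_with_post_dedup(io.BytesIO(b"seed file contents" * 1000), store)
second = load_with_post_dedup(io.BytesIO(b"seed file contents" * 1000), store)
print(first[1], second[1], len(store))
```

A real implementation would probably also want to compare the bytes before 
trusting a digest match, and would have to pay for writing the new copy 
before it could throw it away.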

-jh-

On Nov 18, 2011, at 1:50 PM, [email protected] wrote:

> I have found that if I load a Large Binary out of the DB and 
> xdmp:document-insert() it to a different uri in the DB, the Large Data usage 
> for the DB doesn't change and the Large directory in the forest dir doesn't 
> change either. However, if I load the binary from off the file system then 
> the Large data usage grows with the size of the Large Binary files being 
> document-inserted. This leads me to believe that there is some optimization 
> going on that Large Binaries that originate from the DB just have 
> pointers to some master record of the binary data itself. So if the file is 
> in ten places in the DB, even with different filenames, MarkLogic has 
> pointers in all of them back to a single binary file. 
> 
> When I delete or change the name of one of these binaries it doesn't seem to 
> affect the others who have the same "parent." This seems to be very useful, 
> except for when I'm trying to generate a lot of Large Binary data for 
> testing, only to find out that it's all linked in under the covers and not a 
> very good test set. Hence, I have been loading the seed files off the 
> filesystem to prevent a linkage in order to generate a large test set from a 
> small set of binaries.
> 
> Is my understanding mostly correct?
> 
> -Rayn
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general
