David,

Kelly and Jason have suggested good techniques. I'll add that the original idea behind XQSync was to create "rsync for MarkLogic Server". The actual implementation falls short in some respects, but it might be useful anyway.

http://developer.marklogic.com/howto/tutorials/2006-08-xqsync.xqy

http://developer.marklogic.com/svn/xqsync/trunk/README.html

I don't think XQSync will do everything you're interested in today, but it might be a useful starting point. It can certainly handle the "update everything every day" technique, but it doesn't have any mechanism for detecting which files to update. You might be able to extend the SessionWriter class to meet your needs.

RecordLoader is another possibility: you might be able to plug in your own Loader and Content subclasses to implement the behavior you want. If you decide to extend either tool, I am interested in patches.

To elaborate on Jason's suggestion for detecting files that are no longer in the dataset, I would look into cts:uris(). Ensure that all your documents are always in the 'my-stuff' collection, and then call cts:uris() with a query like:

  cts:and-not-query(
    cts:collection-query('my-stuff'),
    cts:document-query($document-uris-today) )

Anything in 'my-stuff' and not in $document-uris-today should be deleted.
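To make that concrete, here's a rough, untested sketch. It assumes $document-uris-today already holds the sequence of URIs from today's drop:

```xquery
(: Hypothetical sketch: find documents in the 'my-stuff' collection
   that are not in today's dataset, then delete them.
   $document-uris-today is assumed to be bound elsewhere. :)
let $stale-uris := cts:uris(
  (), (),
  cts:and-not-query(
    cts:collection-query('my-stuff'),
    cts:document-query($document-uris-today)))
for $uri in $stale-uris
return xdmp:document-delete($uri)
```

Note that cts:uris() requires the URI lexicon to be enabled on the database.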

-- Mike

On 2010-03-05 11:48, Lee, David wrote:
I have a task coming up where I need to update a large set of XML and 
binary files daily from an outside source.
This is about 6,000 XML docs and 30,000 images, about 2GB total.

I get these from an outside source as one huge 1GB zip file.  I expect maybe 
only 1% of the files to have changed in any drop, maybe even less (0.1%?).

For any changed files I need to generate some additional data (outside of ML), 
then upload the files and update some properties.
I *could* just update ALL files every day, but I'd like to be more efficient 
than that, considering the likely change rate is so low.

I'm sure this is a common problem (not unlike, say, rsync) ...
What do people do in this case?
I was thinking of storing a checksum (MD5?) as a property of each file, then 
comparing with the new files by listing the directory tree from ML.
Another idea is to keep a filesystem cache of what's in ML and do the comparison 
there.

My guess is it would be just as (in)efficient to upload each file just to 
compare within ML as to simply update the document,
or vice versa: fetch each file from ML just to compare with the filesystem.  So 
I don't want to go that route.

Then there is also the deletion issue ... I need to detect files which are no 
longer in the dataset and delete them.



Any suggestions or ideas?  Has anyone done something like this before?
Are there built-in MarkLogic features that could help?


Thanks;

-David



----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
d...@epocrates.com<mailto:d...@epocrates.com>
812-482-5224



_______________________________________________
General mailing list
General@developer.marklogic.com
http://xqzone.com/mailman/listinfo/general
