David,
Kelly and Jason have suggested good techniques. I'll add that the
original idea behind XQSync was to create "rsync for MarkLogic Server".
The actual implementation falls short in some respects, but it might be
useful anyway.
http://developer.marklogic.com/howto/tutorials/2006-08-xqsync.xqy
http://developer.marklogic.com/svn/xqsync/trunk/README.html
I don't think XQSync will do everything you're interested in today, but
it might be a useful starting point. It can certainly handle the "update
everything every day" technique, but it doesn't have any mechanism for
detecting which files to update. You might be able to extend the
SessionWriter class to meet your needs.
RecordLoader is another possibility: you might be able to plug in your
own Loader and Content subclasses to implement the behavior you want. If
you decide to extend either tool, I am interested in patches.
To elaborate on Jason's suggestion for detecting files that are no
longer in the dataset, I would look into cts:uris(). Ensure that all
your documents are always in the 'my-stuff' collection, and then call
cts:uris() with a query like:
cts:and-not-query(
  cts:collection-query('my-stuff'),
  cts:document-query($document-uris-today))
Anything in 'my-stuff' and not in $document-uris-today should be deleted.
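If you want to sanity-check that logic outside the server, the same set
difference is trivial client-side. A minimal Python sketch (the function
name is mine, not a MarkLogic API; the two URI lists are assumed to come
from cts:uris() and from today's drop):

```python
def stale_uris(collection_uris, todays_uris):
    """URIs currently in the 'my-stuff' collection that are absent
    from today's drop -- i.e. the documents to delete.

    collection_uris: all URIs in the collection (e.g. from cts:uris()).
    todays_uris: the URIs present in today's zip file.
    """
    return sorted(set(collection_uris) - set(todays_uris))
```

Delete everything that returns and the database matches the drop.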
-- Mike
On 2010-03-05 11:48, Lee, David wrote:
I have a task coming up where I need to update a large set of XML and
binary files from an outside source, daily.
This is about 6,000 XML docs and 30,000 images, about 2 GB total.
I get these from an outside source as one huge 1GB zip file. I expect maybe
only 1% of the files to have changed in any drop, maybe even less (.1%?).
For any changed files I need to generate some additional data (outside of ML),
then upload the files and update some properties.
I *could* just update ALL files every day, but I'd like to be more efficient
than that, considering the likely change rate is so low.
I'm sure this is a common problem (not unlike say rsync) ...
What do people do for this case ?
I was thinking of storing a checksum (MD5?) as a property of each file, then
comparing with the new files by listing the directory tree from ML.
Another idea is to keep a filesystem cache of what's in ML and do the comparison
there.
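Either way, the core of the comparison is a manifest diff. A hypothetical
Python sketch, assuming a manifest mapping URI to MD5 (whether the old
manifest comes from ML properties or from a filesystem cache; function
names are mine):

```python
import hashlib

def file_md5(path):
    """MD5 hex digest of a file, read in chunks to handle large binaries."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def diff_manifests(old, new):
    """Compare two {uri: md5} manifests.

    Returns (changed_or_new, deleted): URIs whose content changed or is
    new in today's drop, and URIs that vanished from the dataset.
    """
    changed_or_new = sorted(u for u, digest in new.items()
                            if old.get(u) != digest)
    deleted = sorted(set(old) - set(new))
    return changed_or_new, deleted
```

With a ~1% change rate, this gives you both the small upload list and the
delete list in one pass over the manifests.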
My guess is that it would be just as (in)efficient to upload each file to ML
just to compare it there as it would be to simply update the document,
or vice versa: fetching each file from ML just to compare it with the
filesystem. So I don't want to go that route.
Then there is also the deletion issue ... I need to detect files which are no
longer in the dataset and delete them.
Any suggestions or ideas? Has anyone done something like this before?
Are there built-in MarkLogic features that could help?
Thanks;
-David
----------------------------------------
David A. Lee
Senior Principal Software Engineer
Epocrates, Inc.
d...@epocrates.com
812-482-5224
_______________________________________________
General mailing list
General@developer.marklogic.com
http://xqzone.com/mailman/listinfo/general