Hey Chris,

Thanks for your reply, much appreciated. You've cleared up a few issues in
my understanding.
I've gone through your reply and just added a few notes for completeness.

*Crawler data transfer* (i.e. not using the File Manager as a client)

> There are 2 ways to configure data transfer. If you are using a Crawler,
> the crawler is going to handle client side transfer to the FM server. You
> can configure Local, Remote, or InPlace transfer at the moment, or roll
> your own client side transfer and then pass it via the crawler command
> line or config.

1) Local data transfer

> Local means that the source and dest file paths need to be visible from
> the crawler's machine (or at least "appear" that way. A common pattern
> here is to use a Distributed File System like HDFS or GlusterFS to
> virtualize local disk, and mount it at a global virtual root. That way,
> even though the data itself is distributed, to the Crawler and thus to
> LocalDataTransfer, it looks like it's on the same path).

2) Remote data transfer

> Remote means that the dest path can live on a different host, and that
> the client will work with the file manager server to chunk and transfer
> (via XML-RPC) that data from the client to the server.

3) In place data transfer

> InPlace means that no data transfer will occur at all.

(Great explanations - thanks!)

*Versioner schemes*

> The Data Transferers have an acute coupling with the Versioner scheme,
> case in point: if you are doing InPlaceTransfer, you need a versioner
> that will handle file paths that don't change from src to dest.

The Versioner is used to describe how a target directory is created for a
file to archive, i.e. the directory structure where the data will be
placed. So if I have an archive root at /var/kat/archive/data/ and I use a
basic versioner, it will archive a file called 1234567890.h5 at
/var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe
the destination for a local data transfer.

I have the following versioner set in policy/product-types.xml:

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

Just out of curiosity... why is this called a versioner?

*Using the File Manager as the client*

> Configuring a data transfer in filemgr.properties, and then not using
> the crawler directly, but e.g. using the XmlRpcFileManagerClient
> directly, you can tell the server (on the ingest(...) method) to handle
> all the file transfers for you. In that case, the server needs a Data
> Transferer configured, and the above properties apply, with the caveat
> that the FM server is now the "client" that is transferring the data to
> itself :)

I set the following property in the etc/filemgr.properties file:

filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

I did a quick try of this today, doing an ingest on my localhost (to avoid
any sticky network issues), and I was able to perform an ingest. I see you
can also specify the data transfer factory to use on the command line, so
I assume the filemgr.datatransfer.factory setting is just the default used
when none is specified there. Is this true?
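For my own notes, this is roughly what I understand the programmatic
equivalent to be - an untested sketch, assuming I'm reading the
XmlRpcFileManagerClient / SerializableMetadata API right (the .met file
sitting next to the data file is just how I happen to lay things out):

import java.io.File;
import java.io.FileInputStream;
import java.net.URL;
import java.util.Collections;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
import org.apache.oodt.cas.metadata.SerializableMetadata;

public class ServerTransferIngest {
    public static void main(String[] args) throws Exception {
        XmlRpcFileManagerClient client =
            new XmlRpcFileManagerClient(new URL("http://localhost:9101"));

        File dataFile = new File("/Users/thomas/1331871808.h5");

        Product product = new Product();
        product.setProductName(dataFile.getName());
        product.setProductStructure(Product.STRUCTURE_FLAT);
        product.setProductType(client.getProductTypeByName("KatFile"));
        product.setProductReferences(Collections.singletonList(
            new Reference(dataFile.toURI().toString(), "",
                dataFile.length())));

        // Read the CAS metadata XML that sits next to the data file.
        SerializableMetadata met = new SerializableMetadata(
            new FileInputStream(dataFile.getAbsolutePath() + ".met"));

        // clientTransfer=false: the FM server performs the transfer
        // itself, using whatever filemgr.datatransfer.factory names in
        // etc/filemgr.properties.
        String id = client.ingestProduct(product, met, false);
        System.out.println("ingested product id: " + id);
    }
}

If I have it right, passing clientTransfer=false is what tells the server
to do the transfer itself with its configured factory, which would line up
with the behaviour I saw above.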
I ran a version of the command line client (my own version of
filemgr-client with abs paths to the configuration files):

cas-filemgr-client.sh --url http://localhost:9101 --operation
--ingestProduct --refs /Users/thomas/1331871808.h5 --productStructure Flat
--productTypeName KatFile --metadataFile /Users/thomas/1331871808.h5.met
--productName 1331871808.h5 --clientTransfer --dataTransfer
org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

With the data transfer factory also spec'ed in etc/filemgr.properties:

filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

And the versioner set in policy/product-types.xml:

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

And it ingested the file. +1 for OODT!

*Local and remote transfers to the same filemgr*

> One way to do this is to write a Facade java class, e.g. MultiTransferer,
> that can, e.g. on a per-product type basis, decide whether to call and
> delegate to LocalDataTransfer or RemoteDataTransfer. If written in a
> configurable way, that would be an awesome addition to the OODT code
> base. We could call it ProductTypeDelegatingDataTransfer.

I'm thinking I would prefer to have some crawlers specify how files should
be transferred. Is there any particular reason why this would not be a
good idea - as long as the client specifies the transfer method to use?
(I've put a rough sketch of the delegating transferer in the PS below.)

*Getting the product to a second archive*

> One way to do it is to simply stand up a file manager at the remote site
> and catalog, and then do remote data transfer (and met transfer) to take
> care of that. Then as long as your XML-RPC ports are open, both the data
> and metadata can be backed up by simply doing the same ingestion
> mechanisms. You could wire that up as a Workflow task to run
> periodically, or as part of your std ingest pipeline (e.g. a Crawler
> action that on postIngestSuccess backs up to the remote site by
> ingesting into the remote backup file manager).

Okay. Got it! I'll see if I can wire up both options! (A sketch of the
postIngestSuccess action is in the PS below too.)

> I'd be happy to help you down either path.

Thanks! Much appreciated.

>> I was thinking, perhaps using the functionality described in OODT-84
>> (Ability for File Manager to stage an ingested Product to one of its
>> clients) and then have a second crawler on the backup archive which
>> will then update its own catalogue.
>
> +1, that would work too!

Once again, thanks for the input and advice - always informative ;)

Cheers,
Tom
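PS - re the ProductTypeDelegatingDataTransfer idea, here's the rough
sketch I had in mind. Untested, and it assumes the DataTransfer interface
is just setFileManagerUrl(URL) + transferProduct(Product) (anything extra
in newer versions would need the same one-line delegation); the
type-to-transferer map is hard-coded where a real version would read it
from config:

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory;
import org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

    // Per-product-type delegates; hard-coded here, but a real version
    // would read this map from a properties file or the policy XML.
    private final Map<String, DataTransfer> delegates =
        new HashMap<String, DataTransfer>();
    private final DataTransfer fallback =
        new LocalDataTransferFactory().createDataTransfer();

    public ProductTypeDelegatingDataTransfer() {
        delegates.put("KatFile",
            new RemoteDataTransferFactory().createDataTransfer());
    }

    public void setFileManagerUrl(URL url) {
        fallback.setFileManagerUrl(url);
        for (DataTransfer dt : delegates.values()) {
            dt.setFileManagerUrl(url);
        }
    }

    public void transferProduct(Product product)
            throws DataTransferException, IOException {
        // Pick the transferer for this product type, else fall back to
        // plain local transfer.
        DataTransfer dt = delegates.get(product.getProductType().getName());
        (dt != null ? dt : fallback).transferProduct(product);
    }
}

To use it server side it would also need a trivial
ProductTypeDelegatingDataTransferFactory implementing DataTransferFactory,
so it can be named in filemgr.datatransfer.factory.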
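PPS - and the postIngestSuccess backup action I'm imagining. Again
untested, and it assumes CrawlerAction's performAction(File, Metadata)
signature and StdIngester.ingest(URL, File, Metadata); backup-host:9101 is
a made-up URL standing in for our second archive:

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class BackupIngestAction extends CrawlerAction {

    // Hypothetical backup file manager; ours would live on another host.
    private static final String BACKUP_FM_URL = "http://backup-host:9101";
    private static final String TRANSFER_FACTORY =
        "org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory";

    public boolean performAction(File product, Metadata metadata)
            throws CrawlerActionException {
        try {
            // Re-ingest the just-archived product (plus its metadata)
            // into the backup file manager over XML-RPC.
            new StdIngester(TRANSFER_FACTORY).ingest(
                new URL(BACKUP_FM_URL), product, metadata);
            return true;
        } catch (Exception e) {
            throw new CrawlerActionException(
                "Backup ingest failed: " + e.getMessage());
        }
    }
}

If I read the crawler docs right, this would get wired in as a Spring bean
in the crawler's action-beans.xml, with postIngestSuccess in its phases.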
