Hey Chris,

Thanks for your reply, much appreciated. You've cleared up a few issues in
my understanding.
I've gone through your reply and just added a few notes for completeness.

*Crawler data transfer* (i.e. not using the File Manager as a client)

> There are 2 ways to configure data transfer. If you are using a Crawler,
> the crawler is going to handle client side transfer to the FM server. You
> can configure Local, Remote, or InPlace transfer at the moment, or roll
> your own client side transfer and then pass it via the crawler command
> line or config.

1) Local data transfer

> Local means that the source and dest file paths need to be visible from
> the crawler's machine (or at least "appear" that way. A common pattern
> here is to use a Distributed File System like HDFS or GlusterFS to
> virtualize local disk, and mount it at a global virtual root. That way,
> even though the data itself is distributed, to the Crawler and thus to
> LocalDataTransfer, it looks like it's on the same path).

2) Remote data transfer

> Remote means that the dest path can live on a different host, and that
> the client will work with the file manager server to chunk and transfer
> (via XML-RPC) that data from the client to the server.

3) In place data transfer

> InPlace means that no data transfer will occur at all.

(Great explanations - thanks!)

*Versioner schemes*

> The Data Transferers have an acute coupling with the Versioner scheme,
> case in point: if you are doing InPlaceTransfer, you need a versioner
> that will handle file paths that don't change from src to dest.

The Versioner is used to describe how a target directory is created for a
file to archive, i.e. the directory structure where the data will be
placed. So if I have an archive root at /var/kat/archive/data/ and I use a
basic versioner, it will archive a file called 1234567890.h5 at
/var/kat/archive/data/1234567890.h5/1234567890.h5. So this would describe
the destination for a local data transfer.

I have the following versioner set in policy/product-types.xml:

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

Just out of curiosity... why is this called a versioner?

*Using the File Manager as the client*

> Configuring a data transfer in filemgr.properties, and then not using
> the crawler directly, but e.g. using the XmlRpcFileManagerClient
> directly, you can tell the server (on the ingest(...) method) to handle
> all the file transfers for you. In that case, the server needs a Data
> Transferer configured, and the above properties apply, with the caveat
> that the FM server is now the "client" that is transferring the data to
> itself :)

I set the following property in the etc/filemgr.properties file:

filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

I did a quick try of this today, doing an ingest on my localhost (to avoid
any sticky network issues), and I was able to perform an ingest. I see you
can also specify the data transfer factory to use on the command line, so
I assume the filemgr.datatransfer.factory setting is just the default used
when none is specified there. Is this true?
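For my own notes, this is roughly what I understand the programmatic
equivalent to be - an untested sketch, assuming I'm reading the
XmlRpcFileManagerClient / SerializableMetadata API right (the .met file
sitting next to the data file is just how I happen to lay things out):

import java.io.File;
import java.io.FileInputStream;
import java.net.URL;
import java.util.Collections;

import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.Reference;
import org.apache.oodt.cas.filemgr.system.XmlRpcFileManagerClient;
import org.apache.oodt.cas.metadata.SerializableMetadata;

public class ServerTransferIngest {
    public static void main(String[] args) throws Exception {
        XmlRpcFileManagerClient client =
            new XmlRpcFileManagerClient(new URL("http://localhost:9101"));

        File dataFile = new File("/Users/thomas/1331871808.h5");

        Product product = new Product();
        product.setProductName(dataFile.getName());
        product.setProductStructure(Product.STRUCTURE_FLAT);
        product.setProductType(client.getProductTypeByName("KatFile"));
        product.setProductReferences(Collections.singletonList(
            new Reference(dataFile.toURI().toString(), "",
                dataFile.length())));

        // Read the CAS metadata XML that sits next to the data file.
        SerializableMetadata met = new SerializableMetadata(
            new FileInputStream(dataFile.getAbsolutePath() + ".met"));

        // clientTransfer=false: the FM server performs the transfer
        // itself, using whatever filemgr.datatransfer.factory names in
        // etc/filemgr.properties.
        String id = client.ingestProduct(product, met, false);
        System.out.println("ingested product id: " + id);
    }
}

If I have it right, passing clientTransfer=false is what tells the server
to do the transfer itself with its configured factory, which would line up
with the behaviour I saw above.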
I ran a version of the command line client (my own version of
filemgr-client with abs paths to the configuration files):

cas-filemgr-client.sh --url http://localhost:9101 --operation
--ingestProduct --refs /Users/thomas/1331871808.h5 --productStructure Flat
--productTypeName KatFile --metadataFile /Users/thomas/1331871808.h5.met
--productName 1331871808.h5 --clientTransfer --dataTransfer
org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

With the data transfer factory also spec'ed in etc/filemgr.properties:

filemgr.datatransfer.factory=org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory

And the versioner set in policy/product-types.xml:

<versioner class="org.apache.oodt.cas.filemgr.versioning.BasicVersioner"/>

And it ingested the file. +1 for OODT!

*Local and remote transfers to the same filemgr*

> One way to do this is to write a Facade java class, e.g. MultiTransferer,
> that can, e.g. on a per-product type basis, decide whether to call and
> delegate to LocalDataTransfer or RemoteDataTransfer. If written in a
> configurable way, that would be an awesome addition to the OODT code
> base. We could call it ProductTypeDelegatingDataTransfer.

I'm thinking I would prefer to have some crawlers specify how files should
be transferred. Is there any particular reason why this would not be a
good idea - as long as the client specifies the transfer method to use?
(I've put a rough sketch of the delegating transferer in the PS below.)

*Getting the product to a second archive*

> One way to do it is to simply stand up a file manager at the remote site
> and catalog, and then do remote data transfer (and met transfer) to take
> care of that. Then as long as your XML-RPC ports are open, both the data
> and metadata can be backed up by simply doing the same ingestion
> mechanisms. You could wire that up as a Workflow task to run
> periodically, or as part of your std ingest pipeline (e.g. a Crawler
> action that on postIngestSuccess backs up to the remote site by
> ingesting into the remote backup file manager).

Okay. Got it! I'll see if I can wire up both options! (A sketch of the
postIngestSuccess action is in the PS below too.)

> I'd be happy to help you down either path.

Thanks! Much appreciated.

>> I was thinking, perhaps using the functionality described in OODT-84
>> (Ability for File Manager to stage an ingested Product to one of its
>> clients) and then have a second crawler on the backup archive which
>> will then update its own catalogue.
>
> +1, that would work too!

Once again, thanks for the input and advice - always informative ;)

Cheers,
Tom
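PS - re the ProductTypeDelegatingDataTransfer idea, here's the rough
sketch I had in mind. Untested, and it assumes the DataTransfer interface
is just setFileManagerUrl(URL) + transferProduct(Product) (anything extra
in newer versions would need the same one-line delegation); the
type-to-transferer map is hard-coded where a real version would read it
from config:

import java.io.IOException;
import java.net.URL;
import java.util.HashMap;
import java.util.Map;

import org.apache.oodt.cas.filemgr.datatransfer.DataTransfer;
import org.apache.oodt.cas.filemgr.datatransfer.LocalDataTransferFactory;
import org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory;
import org.apache.oodt.cas.filemgr.structs.Product;
import org.apache.oodt.cas.filemgr.structs.exceptions.DataTransferException;

public class ProductTypeDelegatingDataTransfer implements DataTransfer {

    // Per-product-type delegates; hard-coded here, but a real version
    // would read this map from a properties file or the policy XML.
    private final Map<String, DataTransfer> delegates =
        new HashMap<String, DataTransfer>();
    private final DataTransfer fallback =
        new LocalDataTransferFactory().createDataTransfer();

    public ProductTypeDelegatingDataTransfer() {
        delegates.put("KatFile",
            new RemoteDataTransferFactory().createDataTransfer());
    }

    public void setFileManagerUrl(URL url) {
        fallback.setFileManagerUrl(url);
        for (DataTransfer dt : delegates.values()) {
            dt.setFileManagerUrl(url);
        }
    }

    public void transferProduct(Product product)
            throws DataTransferException, IOException {
        // Pick the transferer for this product type, else fall back to
        // plain local transfer.
        DataTransfer dt = delegates.get(product.getProductType().getName());
        (dt != null ? dt : fallback).transferProduct(product);
    }
}

To use it server side it would also need a trivial
ProductTypeDelegatingDataTransferFactory implementing DataTransferFactory,
so it can be named in filemgr.datatransfer.factory.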
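PPS - and the postIngestSuccess backup action I'm imagining. Again
untested, and it assumes CrawlerAction's performAction(File, Metadata)
signature and StdIngester.ingest(URL, File, Metadata); backup-host:9101 is
a made-up URL standing in for our second archive:

import java.io.File;
import java.net.URL;

import org.apache.oodt.cas.crawl.action.CrawlerAction;
import org.apache.oodt.cas.crawl.structs.exceptions.CrawlerActionException;
import org.apache.oodt.cas.filemgr.ingest.StdIngester;
import org.apache.oodt.cas.metadata.Metadata;

public class BackupIngestAction extends CrawlerAction {

    // Hypothetical backup file manager; ours would live on another host.
    private static final String BACKUP_FM_URL = "http://backup-host:9101";
    private static final String TRANSFER_FACTORY =
        "org.apache.oodt.cas.filemgr.datatransfer.RemoteDataTransferFactory";

    public boolean performAction(File product, Metadata metadata)
            throws CrawlerActionException {
        try {
            // Re-ingest the just-archived product (plus its metadata)
            // into the backup file manager over XML-RPC.
            new StdIngester(TRANSFER_FACTORY).ingest(
                new URL(BACKUP_FM_URL), product, metadata);
            return true;
        } catch (Exception e) {
            throw new CrawlerActionException(
                "Backup ingest failed: " + e.getMessage());
        }
    }
}

If I read the crawler docs right, this would get wired in as a Spring bean
in the crawler's action-beans.xml, with postIngestSuccess in its phases.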
