Hi,

I'm basically going ahead with your suggestion and plan on using RLS/DRS to
achieve the kind of file caching that I want.  However, I'm struggling with
how I should use these tools to meet our needs based solely on the
documentation and the limited amount I've been able to play around with
them.

For example, DRS seems promising as a tool that can query an RLS
installation (and in saying RLS I'm subsuming the LRC/RLI pairing for
simplicity), actually perform file transfers via RFT, and record the new
locations of the files in the RLS.  However, I don't exactly understand the
format of the request file as shown:

testrun-1      gsiftp://myhost:9001/sandbox/files/testrun-1
testrun-2      gsiftp://myhost:9001/sandbox/files/testrun-2
testrun-3      gsiftp://myhost:9001/sandbox/files/testrun-3
testrun-4      gsiftp://myhost:9001/sandbox/files/testrun-4
testrun-5      gsiftp://myhost:9001/sandbox/files/testrun-5

As a prerequisite to replicating files with DRS, do they already need to
have LFN entries in an RLS somewhere?  In other words, once the RLS is
bootstrapped with at least one LFN, perhaps it is possible to use DRS to
replicate it.  But is it also possible to have DRS transfer files and
create entries in an RLS from scratch?

Otherwise, it seems like I would have to do all of the RFT transfers myself,
either as part of the GRAM job submission or separately, and do all of the
RLS updating myself too, which is probably why DRS was created.

I've read about how flexible the LRC/RLI configuration can be, and I'm aware
that there's probably no single "best" way to set it up.  However, what
would be the simplest possible configuration that would work for me?  I
already have a properly functioning LRC/RLI on a machine; let's call it the
"Grid server".  GRAM jobs are submitted from this machine to one of many
other Grid resources (other Globus installations).  It's on these remote
resources too that I want some kind of file cache, i.e., some location into
which job input files can be staged without unnecessary file duplication.
Therefore, I need a way of knowing which files are on that remote resource,
where they exist, and so on, which it seems RLS can provide.  So, one simple
question is: do I need an RLS installation on each remote resource, or can I
get away with the single RLS installation on the Grid server keeping track
of the locations of files on a modest (~10) number of these resources?

The main reason I'm interested in this is because we are running into the
situation where we are staging in files needed by jobs (or job batches) over
and over again to the same resource, and sometimes duplicate files get
staged unnecessarily, wasting bandwidth and disk space.  Here's a more
technical question, though.  What if two different jobs need the same
logical file, but need it named two different things?  If I were setting it
up by hand, I could make one file a symlink to the other to avoid wasting
disk space.  The analogue of this in RLS-speak would be one LFN, two PFNs
on the same filesystem -- and ideally where one PFN is a symlink to the
other (or better yet!  both PFNs are symlinks to a third PFN, the actual
instantiation, so to speak, of the LFN).  But I don't see any way to do this
with RFT/RLS/DRS.  So even though the system would know that on a remote
filesystem the file resource it needs already exists, it would have to stage
in the file again (or make a copy -- either way, it's another PFN) and waste
disk space in the process.  It's a situation that hopefully won't happen
very often, but I'd like to have a better solution for it.
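
To make the layout I have in mind concrete, here is a rough sketch of the
three cache cases from my earlier message (quoted below), written against a
plain local filesystem.  It is only an illustration under assumptions: the
function name cache_file and the "hit"/"linked"/"staged" labels are made up,
the local copy stands in for what would really be an RFT/GridFTP transfer,
and creating the symlink remotely would need something like the Fork
jobmanager job you suggested.  None of this is RLS or DRS API.

```python
import hashlib
import os
import shutil


def cache_file(cache_root, src_path, wanted_name):
    """Make src_path available in the cache as wanted_name.

    Hypothetical helper: returns (path_to_symlink, action), where action
    is "hit", "linked" or "staged".
    """
    # Identical content hashes to the same directory, so it is stored once.
    with open(src_path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()

    entry_dir = os.path.join(cache_root, digest)
    real_path = os.path.join(entry_dir, digest)       # the one real copy
    link_path = os.path.join(entry_dir, wanted_name)  # per-job name

    # Case 1: the file is cached under the same name, so do nothing.
    if os.path.lexists(link_path):
        return link_path, "hit"

    # Case 2: the file is cached under a different name, so add a symlink.
    if os.path.exists(real_path):
        os.symlink(digest, link_path)  # relative link to the real copy
        return link_path, "linked"

    # Case 3: not cached yet, so "upload" (here: local copy) and symlink.
    os.makedirs(entry_dir, exist_ok=True)
    shutil.copyfile(src_path, real_path)
    os.symlink(digest, link_path)
    return link_path, "staged"
```

Calling this twice with the same content but the names "foo" and "bar"
leaves one real file and two symlinks in the same directory, which is the
"one LFN, several PFNs that are all symlinks to a single copy" situation
described above.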

Thanks for your help!

Adam




On Jan 7, 2008 4:38 PM, Charles Bacon <[EMAIL PROTECTED]> wrote:

> Honestly, this sounds like a use case for RLS, the Replica Location
> Service.  You can have logical file names and a map from the
> logical names to where they are physically instantiated.  In that
> case you would query RLS to find out if a particular node already had
> a copy of your file or not.  If it didn't, you could stage it in.
>
> Regarding the creation of symlinks, I don't think RFT/GridFTP do
> that.  You could use the Fork jobmanager to submit a symlink job if
> your compute server allows fork submissions.
>
>
> Charles
>
> On Jan 7, 2008, at 3:26 PM, Adam Bazinet wrote:
>
> > Hi,
> >
> > I recently posted this message on rft-user, but seeing as that list
> > doesn't get much traffic, I hope no one minds if I try again here.
> >
> > I'm trying to implement a file caching scheme with GRAM/RFT/
> > GridFTP, such that when GRAM jobs are submitted to remote
> > resources, input files end up in a cache of sorts on the remote
> > resource.  This is all in an effort to cut down on unnecessary file
> > duplication between jobs that are submitted.
> >
> > The structure of the cache might look like this:
> >
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/foo -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/bar -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/baz -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file1/md5sum
> >
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/foo -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/bar -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/md5sum
> > ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/baz -> ${GLOBUS_SCRATCH_DIR}/cache/md5sum_file2/md5sum
> >
> > In this example, you can imagine that md5sum, md5sum_file1, and
> > md5sum_file2 are all actual md5sums.
> >
> > There are really only three cases that need to be handled:
> >
> > If file exists in the cache with the same name
> >    do nothing
> > If file exists in the cache with a different name
> >    create a symlink in the cache
> > Else
> >    upload file and create symlink
> >
> > Ideally I would like this to happen automatically when a job is
> > submitted -- I never used the old GASS cache, but from some
> > Googling perhaps what I'm proposing is similar.  Let's say before I
> > construct my RSL file, I need to figure out whether or not a file
> > exists on the remote resource.  Is there a way to do this with pure
> > RFT/GridFTP?
> >
> > Furthermore, is there a way to cause a symlink to be created on a
> > remote resource using RFT/GridFTP?
> >
> > I thought I'd start with this list before I submit a message to
> > either the GRAM or GridFTP lists.
> >
> > Thanks,
> > Adam
> >
>
>
