Hi everybody,

I'm jumping in as Jeff is away due to an unexpected annoyance involving Californian wildlife.

On 8/7/13 7:47 PM, Andrew Wang wrote:
Blocks are supposed to be an internal abstraction within HDFS, and aren't an
inherent part of FileSystem (the user-visible class used to access all Hadoop
filesystems).

Yes, but it's a really useful abstraction :) Do you really believe the blocks could be abandoned in the next couple of years? I mean, it's such a simple and effective solution ...

Is it possible to instead deal with files and offsets? On a read failure, you
could open a stream to the same file on the backup filesystem, seek to the old
file position, and retry the read. This feels like it's possible via wrapping.
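
Just so we're talking about the same thing, here is roughly how I understand the
wrapping you suggest (a minimal sketch; RetryingInput and the "backup"
FileSystem are hypothetical names of ours, not anything that exists in Hadoop):

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of the file/offset-level failover Andrew describes.
    public class RetryingInput {
      private final FileSystem backup;   // e.g. a FileSystem for a mirror site
      private final Path path;
      private FSDataInputStream in;
      private long pos = 0;

      public RetryingInput(FileSystem primary, FileSystem backup, Path path)
          throws IOException {
        this.backup = backup;
        this.path = path;
        this.in = primary.open(path);
      }

      public int read(byte[] buf, int off, int len) throws IOException {
        try {
          int n = in.read(buf, off, len);
          if (n > 0) pos += n;
          return n;
        } catch (IOException e) {
          // On a read failure, reopen the same file on the backup filesystem,
          // seek to the old position and retry the read.
          in.close();
          in = backup.open(path);
          in.seek(pos);
          int n = in.read(buf, off, len);
          if (n > 0) pos += n;
          return n;
        }
      }
    }

That works purely at the file/offset level, though, while everything we do
downstream (the caching proxy, possible re-injection) happens per block.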

As Jeff briefly mentioned, all USCMS sites export their data via XRootd (not all of them use HDFS!), and we developed a specialization of the XRootd caching proxy that can fetch only the requested blocks (the block size is passed from our input stream class to the XRootd client via JNI, and on to the proxy server) and keep them in a local cache. This allows us to do three things:

1. the first time we notice a block is missing, a whole block is fetched from elsewhere and further access requests from the same process get fulfilled with zero latency;

2. later requests from other processes asking for this block are fulfilled immediately (well, after the initial 3 retries);

3. we have a list of blocks that were fetched and we could (this is what we want to look into in the near future) re-inject them into HDFS if the data loss turns out to be permanent (bad disk vs. node that was offline/overloaded for a while).

Handling exceptions at the block level thus gives us just what we need. Since the input stream is the place where these errors first become known, it is, I think, also the easiest place to handle them.

I'll understand if you find opening up these interfaces in the central repository unacceptable. We can always apply the patch at the OSG level, where the RPMs for all our deployments get built.

Thanks & Best regards,
Matevz


On Wed, Aug 7, 2013 at 3:29 PM, Jeff Dost <jd...@ucsd.edu> wrote:

    Thank you for the suggestion, but we don't see how simply wrapping a
    FileSystem object would be sufficient in our use case.  The reason is that
    we need to catch and handle read exceptions at the block level.  There
    aren't any public methods available in the high-level FileSystem
    abstraction layer that would give us the fine-grained control we need over
    block-level read failures.

    Perhaps if I outline the steps more clearly it will help explain what we
    are trying to do.  Without our enhancements, suppose a user opens a file
    stream and starts reading the file from Hadoop.  After some time, at some
    position into the file, there happen to be no replicas available for a
    particular block, for whatever reason (datanodes have gone down due to
    disk issues, etc.).  At that point the stream throws an IOException
    (BlockMissingException or similar) and the read fails.
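
    To make that concrete, this is all the user-level code amounts to today
    (the host name, path and buffer size below are just placeholders):

        import java.net.URI;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataInputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PlainRead {
          public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(
                URI.create("hdfs://namenode:8020/"), new Configuration());
            FSDataInputStream in = fs.open(new Path("/some/large/file"));
            byte[] buf = new byte[128 * 1024];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
              // ... process n bytes ...
              // If no replica of the block at the current offset is reachable,
              // read() throws BlockMissingException (an IOException) and the
              // whole transfer dies right here.
            }
            in.close();
          }
        }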

    What we are doing is, rather than letting the stream fail, we have another
    stream queued up that knows how to fetch the blocks that couldn't be
    retrieved from somewhere outside our Hadoop cluster.  So we need to be
    able to catch the exception at this point; the externally fetched bytes
    then get read into the user-supplied read buffer, and Hadoop can proceed
    to read the next blocks of the file from the original stream.

    As you can see, this method of failover on demand allows an input stream
    to keep reading data without having to start all over again when a failure
    occurs (assuming the remote bytes were successfully fetched).
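
    In code, the idea is roughly the following.  This is a simplified sketch:
    FooFSInputStream and ExternalBlockFetcher are our own, hypothetical names,
    the constructor signature shown is approximate and version-dependent, and
    the super() call is only possible once DFSInputStream's access modifiers
    are relaxed, which is what our patch does:

        import java.io.IOException;
        import org.apache.hadoop.hdfs.BlockMissingException;
        import org.apache.hadoop.hdfs.DFSClient;
        import org.apache.hadoop.hdfs.DFSInputStream;

        // Hypothetical helper that knows how to get the bytes for (file,
        // offset) from another site; in our case it goes through the XRootd
        // client.
        interface ExternalBlockFetcher {
          int read(String file, long offset, byte[] buf, int off, int len)
              throws IOException;
        }

        class FooFSInputStream extends DFSInputStream {
          private final String path;
          private final ExternalBlockFetcher fetcher;

          FooFSInputStream(DFSClient client, String path, int bufferSize,
                           boolean verifyChecksum, ExternalBlockFetcher fetcher)
              throws IOException {
            super(client, path, bufferSize, verifyChecksum);
            this.path = path;
            this.fetcher = fetcher;
          }

          @Override
          public synchronized int read(byte[] buf, int off, int len)
              throws IOException {
            long pos = getPos();
            try {
              return super.read(buf, off, len);
            } catch (BlockMissingException e) {
              // No replica reachable: fill the caller's buffer from outside
              // the cluster, advance the stream past those bytes, keep going.
              int n = fetcher.read(path, pos, buf, off, len);
              seek(pos + n);
              return n;
            }
          }
        }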

    As a final note I would like to mention that we will be providing our
    failover module to the Open Science Grid.  Since we hope to provide this as
    a benefit to all OSG users running at participating T2 computing clusters,
    we will be committed to maintaining this software and any changes to Hadoop
    needed to make it work.  In other words we will be willing to maintain any
    implementation changes that may become necessary as Hadoop internals change
    in future releases.

    Thanks,
    Jeff


    On 8/7/13 11:30 AM, Andrew Wang wrote:

        I don't think exposing DFSClient and DistributedFileSystem members is
        necessary to achieve what you're trying to do. We've got wrapper
        FileSystems like FilterFileSystem and ViewFileSystem which you might be
        able to use for inspiration, and the HCFS wiki lists some third-party
        FileSystems that might also be helpful.


        On Wed, Aug 7, 2013 at 11:11 AM, Joe Bounour <jboun...@ddn.com> wrote:

            Hello Jeff

            Is it something that could go under HCFS project?
            http://wiki.apache.org/hadoop/HCFS
            (I might be wrong?)

            Joe


            On 8/7/13 10:59 AM, "Jeff Dost" <jd...@ucsd.edu> wrote:

                Hello,

                We work in a software development team at the UCSD CMS Tier2
                Center.  We would like to propose a mechanism to allow one to
                subclass the DFSInputStream in a clean way from an external
                package.  First I'd like to give some motivation on why, and
                then I will proceed with the details.

                We have a 3 Petabyte Hadoop cluster we maintain for the LHC
                experiment at CERN.  There are other T2 centers worldwide that
                contain mirrors of the same data we host.  We are working on an
                extension to Hadoop so that, when reading a file, if it is
                found that there are no available replicas of a block, we use
                an external interface to retrieve that block of the file from
                another data center.  The external interface is necessary
                because not all T2 centers involved in CMS are running a Hadoop
                cluster as their storage backend.

                In order to implement this functionality, we need to subclass
                the DFSInputStream and override the read method, so we can
                catch IOExceptions that occur on client reads at the block
                level.

                The basic steps required:
                1. Invent a new URI scheme for the customized "FileSystem" in
                core-site.xml:
                    <property>
                      <name>fs.foofs.impl</name>
                      <value>my.package.FooFileSystem</value>
                      <description>My Extended FileSystem for foofs: uris.</description>
                    </property>

                2. Write new classes included in the external package that
                subclass the
                following:
                FooFileSystem subclasses DistributedFileSystem
                FooFSClient subclasses DFSClient
                FooFSInputStream subclasses DFSInputStream

                Now any client commands that explicitly use the foofs:// scheme
                in paths to access the hadoop cluster can open files with a
                customized InputStream that extends functionality of the
                default hadoop client DFSInputStream.  In order to make this
                happen for our use case, we had to change some access modifiers
                in the DistributedFileSystem, DFSClient, and DFSInputStream
                classes provided by Hadoop.  In addition, we had to comment out
                the check in the namenode code that only allows for URI schemes
                of the form "hdfs://".
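
                To illustrate the intended use (the host name and path below
                are just placeholders), once the property from step 1 is in
                place, ordinary FileSystem calls against a foofs:// URI pick up
                the new classes:

                    import java.net.URI;
                    import org.apache.hadoop.conf.Configuration;
                    import org.apache.hadoop.fs.FileSystem;

                    public class FoofsExample {
                      public static void main(String[] args) throws Exception {
                        Configuration conf = new Configuration();  // picks up core-site.xml
                        // "foofs" is resolved through fs.foofs.impl, so this returns a
                        // FooFileSystem rather than a DistributedFileSystem ...
                        FileSystem fs =
                            FileSystem.get(URI.create("foofs://namenode:8020/"), conf);
                        System.out.println(fs.getClass().getName());
                        // ... and streams it opens are FooFSInputStreams, so a missing
                        // block triggers the external fetch instead of aborting the read.
                        fs.close();
                      }
                    }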

                Attached is a patch file we apply to hadoop.  Note that we
                derived this patch by modding the Cloudera release
                hadoop-2.0.0-cdh4.1.1 which can be found at:
                http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz

                We would greatly appreciate any advice on whether or not this
                approach sounds reasonable, and whether you would consider
                accepting these modifications into the official Hadoop code
                base.

                Thank you,
                Jeff, Alja & Matevz
                UCSD Physics




