Hello Jeff,

Is it something that could go under the HCFS project?
http://wiki.apache.org/hadoop/HCFS
(I might be wrong?)
Joe

On 8/7/13 10:59 AM, "Jeff Dost" <[email protected]> wrote:
>Hello,
>
>We work in a software development team at the UCSD CMS Tier2 Center. We
>would like to propose a mechanism that allows one to subclass
>DFSInputStream in a clean way from an external package. First I'd like
>to give some motivation on why, and then I will proceed with the
>details.
>
>We maintain a 3-petabyte Hadoop cluster for the LHC experiment at CERN.
>There are other T2 centers worldwide that contain mirrors of the same
>data we host. We are working on an extension to Hadoop so that, when a
>file is read and no replicas of a block are available, we use an
>external interface to retrieve that block of the file from another data
>center. The external interface is necessary because not all T2 centers
>involved in CMS run a Hadoop cluster as their storage backend.
>
>In order to implement this functionality, we need to subclass
>DFSInputStream and override the read method, so we can catch
>IOExceptions that occur on client reads at the block level.
>
>The basic steps required:
>1. Invent a new URI scheme for the customized "FileSystem" in
>core-site.xml:
>  <property>
>    <name>fs.foofs.impl</name>
>    <value>my.package.FooFileSystem</value>
>    <description>My Extended FileSystem for foofs: uris.</description>
>  </property>
>
>2. Write new classes in the external package that subclass the
>following:
>FooFileSystem subclasses DistributedFileSystem
>FooFSClient subclasses DFSClient
>FooFSInputStream subclasses DFSInputStream
>
>Now any client commands that explicitly use the foofs:// scheme in
>paths to access the Hadoop cluster can open files with a customized
>InputStream that extends the functionality of the default Hadoop client
>DFSInputStream. To make this happen for our use case, we had to change
>some access modifiers in the DistributedFileSystem, DFSClient, and
>DFSInputStream classes provided by Hadoop. In addition, we had to
>comment out the check in the namenode code that only allows URI schemes
>of the form "hdfs://".
>
>Attached is a patch file we apply to Hadoop. Note that we derived this
>patch by modifying the Cloudera release hadoop-2.0.0-cdh4.1.1, which
>can be found at:
>http://archive.cloudera.com/cdh4/cdh/4/hadoop-2.0.0-cdh4.1.1.tar.gz
>
>We would greatly appreciate any advice on whether or not this approach
>sounds reasonable, and whether you would consider accepting these
>modifications into the official Hadoop code base.
>
>Thank you,
>Jeff, Alja & Matevz
>UCSD Physics
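
For readers following the thread, a minimal sketch of the override Jeff
describes might look like the code below. It is an illustration of the
shape only, not a drop-in implementation: ExternalBlockFetcher is a
hypothetical stand-in for the external interface, and the DFSInputStream
constructor signature shown is an assumption, since in stock
hadoop-2.0.0 it is not visible to external packages (that is precisely
what the attached patch's access-modifier changes address).

    import java.io.IOException;

    import org.apache.hadoop.hdfs.DFSClient;
    import org.apache.hadoop.hdfs.DFSInputStream;

    // Hypothetical external interface (not part of Hadoop) used to
    // fetch a byte range of the file from another T2 center.
    interface ExternalBlockFetcher {
      int read(long position, byte[] buf, int off, int len)
          throws IOException;
    }

    // Sketch only: assumes the patch has relaxed access modifiers so
    // this constructor is reachable from an external package.
    public class FooFSInputStream extends DFSInputStream {

      private final ExternalBlockFetcher fetcher;

      FooFSInputStream(DFSClient dfsClient, String src, int bufferSize,
                       boolean verifyChecksum,
                       ExternalBlockFetcher fetcher) throws IOException {
        // Assumed signature; the real one depends on the Hadoop
        // version and on the visibility changes made by the patch.
        super(dfsClient, src, bufferSize, verifyChecksum);
        this.fetcher = fetcher;
      }

      @Override
      public synchronized int read(byte[] buf, int off, int len)
          throws IOException {
        try {
          return super.read(buf, off, len);
        } catch (IOException e) {
          // No replica of the current block was readable; fall back
          // to fetching the same byte range from a remote site.
          return fetcher.read(getPos(), buf, off, len);
        }
      }
    }

FooFileSystem and FooFSClient would follow the same pattern, with
FooFSClient constructing a FooFSInputStream in place of the default
DFSInputStream, so that clients opening foofs:// paths get the fallback
behavior transparently.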
