[ https://issues.apache.org/jira/browse/HADOOP-9629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Liddell reassigned HADOOP-9629:
------------------------------------

    Assignee: Mike Liddell  (was: Mostafa Elhemali)

> Support Windows Azure Storage - Blob as a file system in Hadoop
> ---------------------------------------------------------------
>
>                 Key: HADOOP-9629
>                 URL: https://issues.apache.org/jira/browse/HADOOP-9629
>             Project: Hadoop Common
>          Issue Type: Improvement
>            Reporter: Mostafa Elhemali
>            Assignee: Mike Liddell
>         Attachments: HADOOP-9629.2.patch, HADOOP-9629.3.patch,
> HADOOP-9629.patch, HADOOP-9629.trunk.1.patch, HADOOP-9629.trunk.2.patch
>
>
> h2. Description
> This JIRA adds a new file system implementation for accessing Windows Azure
> Storage - Blob from within Hadoop, for example using blobs as input to MR
> jobs or configuring MR jobs to write their output directly into blob
> storage.
> h2. High level design
> At a high level, the code here extends the FileSystem class to provide an
> implementation for accessing blob storage; the scheme wasb is used for
> accessing it over HTTP, and wasbs for accessing it over HTTPS. We use the URI
> scheme: {code}wasb[s]://<container>@<account>/path/to/file{code} to address
> individual blobs. We use the standard Azure Java SDK
> (com.microsoft.windowsazure) to do most of the work. In order to map a
> hierarchical file system over the flat name-value pair nature of blob
> storage, we create a specially tagged blob named path/to/dir whenever we
> create a directory called path/to/dir; files under that directory are then
> stored as normal blobs named path/to/dir/file. We have many metrics
> implemented for it using the Metrics2 interface. Tests are implemented mostly
> using a mock implementation of the Azure SDK functionality, with an option to
> test against real blob storage if configured (instructions provided in
> README.txt).
> h2. Credits and history
> This has been ongoing work for a while, and the early version of this work
> can be seen in HADOOP-8079. This JIRA is a significant revision of that, and
> we'll post the patch here for Hadoop trunk first, then post a patch for
> branch-1 as well for backporting the functionality if accepted. Credit for
> this work goes to the early team: [~minwei], [~davidlao], [~lengningliu] and
> [~stojanovic] as well as multiple people who have taken over this work since
> then (hope I don't forget anyone): [~dexterb], Johannes Klein, [~ivanmi],
> Michael Rys, [~mostafae], [~brian_swan], [~mikelid], [~xifang], and
> [~chuanliu].
> h2. Test
> Besides unit tests, we have used WASB as the default file system in our
> service product. (HDFS is also used, but not as the default file system.)
> Various customer and test workloads have been run against clusters with
> such configurations for quite some time. The current version reflects the
> version of the code tested and used in our production environment.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
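
The wasb[s]://<container>@<account>/path/to/file addressing described under "High level design" can be sketched in Java with java.net.URI. This is only a minimal illustration, not code from the attached patches: the class names WasbUriSketch and WasbLocation and the account host myaccount.example.net are hypothetical, and the real implementation may parse and validate URIs differently.

```java
import java.net.URI;

// Illustrative sketch only. WasbUriSketch and WasbLocation are hypothetical
// names and are not taken from the HADOOP-9629 patches.
final class WasbUriSketch {

    /** The parts of a wasb[s]://<container>@<account>/path/to/file URI. */
    static final class WasbLocation {
        final boolean secure;   // true for wasbs (HTTPS), false for wasb (HTTP)
        final String container; // the text before '@'
        final String account;   // the text after '@'
        final String blobName;  // the path without its leading '/'

        WasbLocation(boolean secure, String container, String account, String blobName) {
            this.secure = secure;
            this.container = container;
            this.account = account;
            this.blobName = blobName;
        }
    }

    /** Splits a wasb or wasbs URI into its parts; rejects anything else. */
    static WasbLocation parse(String text) {
        URI uri = URI.create(text); // throws IllegalArgumentException on bad syntax
        String scheme = uri.getScheme();
        if (!"wasb".equals(scheme) && !"wasbs".equals(scheme)) {
            throw new IllegalArgumentException("Not a wasb[s] URI: " + text);
        }
        // java.net.URI exposes <container> as the user-info component
        // and <account> as the host component.
        String container = uri.getUserInfo();
        String account = uri.getHost();
        if (container == null || account == null) {
            throw new IllegalArgumentException("Expected <container>@<account> in: " + text);
        }
        // Blob storage is flat, so the blob name is the path minus the leading '/'.
        String path = uri.getPath();
        String blobName = path.startsWith("/") ? path.substring(1) : path;
        return new WasbLocation("wasbs".equals(scheme), container, account, blobName);
    }

    public static void main(String[] args) {
        WasbLocation loc = parse("wasbs://mycontainer@myaccount.example.net/path/to/dir/file");
        System.out.println(loc.container + " @ " + loc.account + " -> " + loc.blobName);
    }
}
```

Stripping the leading slash gives the blob name path/to/dir/file, which matches the flat naming in the description; under that design a directory such as path/to/dir would be a separate, specially tagged blob with that name.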