[ https://issues.apache.org/jira/browse/HADOOP-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503626#comment-16503626 ]
Steve Loughran commented on HADOOP-15407:
-----------------------------------------

I'm starting off with some high-level "strategic issues", not looking at the low-level details. IntelliJ is flagging things up too, but that's secondary and can be dealt with package by package. I haven't yet done the testing.

The general things I review by are covered in my somewhat out-of-date [proposed styleguide update|https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md]. That doesn't cover Java 8, lambda expressions and streams, but it does show what I'm thinking about when I review stuff.

The key point: we want the code base to be broadly maintainable by anyone, tests to be fast and to fail meaningfully, and, where possible, as much of the same code style as elsewhere (though that evolves; moving things on is always good). So I'm looking at it not just from a "does it work" viewpoint, but "if I were left to maintain this, would I stand a chance?"

Key requirements to be fixed before the big patch goes in:
* Make {{AzureBlobFileSystemException}} a subclass of IOE.
* Fix all those checkstyle complaints, except in generated code.
* And findbugs, except when it is wrong and needs to be told so.
* Reinstate the WASB contract tests in the parallel runs.
* Use Azure-specific names for the service injection interfaces and impls.

Followup JIRAs to go in:
* Document dependency injection in the javadocs.
* Review exception handling/wrapping.
* Logging to use SLF4J directly for formatting.
* Testing enhancements as proposed below.
* Documentation.

Oh, and I'd like the first patch to be one which only does the POM dependency changes. This gives our new branch committers an initial patch to work with.

h3. Poms and dependencies

The patch to change the POM dependencies should go in on its own (presumably as the first patch), so it's more isolated. People cherry-picking or doing diffs on POMs will appreciate this.

Interesting that you had to exclude the guice transitive dependency on guava, but no other project did.
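For reference, an exclusion of that kind would look roughly like the following in the module POM (a sketch only; the actual guice coordinates and version management in the patch may differ):

```xml
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <!-- version assumed to be managed in the parent dependencyManagement -->
  <exclusions>
    <!-- keep Hadoop's own guava version on the classpath,
         rather than whatever guice pulls in transitively -->
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```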
Assumption: maven's "depth first" resolution logic has hidden it from others.

h3. Dependency Injection

We've tended to avoid this in the past; more a metric of the age of the codebase and the odd bad experience along the way. It is used in YARN a fair bit, though. I can see you've been busy here. That's fine, except you do get to document it in detail. I also worry that names like "ServiceProvider" may be a bit too generic; something like AzureStoreServiceProvider would be clearer.

Proposed: add a JIRA to cover names, plus some coverage in the package.html file of {{org.apache.hadoop.fs.azurebfs}}.

h3. Exceptions

My ideal: exceptions which include stacks, URLs, anything else to diagnose problems, ideally consistent with the Hadoop stack's "IOException everywhere" world.

* Please make {{AzureBlobFileSystemException}} a subclass of IOE. This matches the rest of our codebase and avoids having to translate things to IOEs in the FileSystem implementations.
* Make sure that stack traces are always retained... {{AzureBlobFileSystem.evaluateFileSystemPathOperation()}} is an example of where they are being stripped out in some cases (404, 409).
* Also: look at {{PathIOException}} for an exception class which takes a path; if {{AzureBlobFileSystemException}} could be built under that, it'd contain the path and operation, and be something other code already handles.

Proposed: move {{AzureBlobFileSystemException}} under IOE before the first patch goes in; tune the other details separately.

h3. Logging

Key question: why not just use the SLF4J API directly? Except for that specific method {{log(LogLevel logLevel, String message, Object... arguments)}}, I don't see anything there that's not in SLF4J.

And why a new message format? The SLF4J {{Log.info("{}", object)}} log message constructor is more efficient than the JDK formatter, handles null object values, and only calls toString() on its args when actually logging them.

* Followup 2: a failure in formatting must be swallowed, not raised.
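The efficiency argument for the parameterized {{"{}"}} style can be shown in plain Java. The following is a stdlib-only imitation for illustration (class and method names here are made up; this is not SLF4J internals or AbfsLoggingService code): argument objects are only rendered with toString() when the level is actually enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stdlib-only sketch of why SLF4J-style "{}" logging is cheap:
// arguments are rendered with toString() only when the level is enabled.
public class LazyLogDemo {
    static boolean debugEnabled = false;

    // Minimal imitation of parameterized "{}" formatting.
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIndex = 0, from = 0, at;
        while ((at = pattern.indexOf("{}", from)) >= 0) {
            sb.append(pattern, from, at);
            // String.valueOf handles null args gracefully ("null", no NPE)
            sb.append(argIndex < args.length ? String.valueOf(args[argIndex++]) : "{}");
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    static void debug(String pattern, Object... args) {
        if (debugEnabled) {            // guard first: no formatting work otherwise
            System.out.println(format(pattern, args));
        }
    }

    public static void main(String[] args) {
        AtomicInteger renders = new AtomicInteger();
        Object expensive = new Object() {
            @Override public String toString() {
                renders.incrementAndGet();   // count how often we are rendered
                return "expensive";
            }
        };

        debug("state = {}", expensive);      // debug off: toString() never runs
        assert renders.get() == 0;

        debugEnabled = true;
        debug("state = {}", expensive);      // debug on: rendered exactly once
        assert renders.get() == 1;
    }
}
```

The key point is the level check happening before any formatting: no string concatenation, no object rendering, no garbage created on the disabled path.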
Everyone hates a log failure, not least because debug-level logging doesn't normally get any test coverage.

Given how fundamental this log injection is, I guess it's hard to change, but I believe the formatting should be handed off to SLF4J rather than trying to do things with a JSON parser, which will inevitably (a) be slow and (b) be brittle.

Proposed: add a JIRA: SLF4J to do formatting for AbfsLoggingService. This is going to be a blocker, BTW; keeping the cost of logging down is something we care about, which means CPU overhead plus the cost of creating objects, calling toString(), etc.

h2. Testing

We shouldn't need to add a new auth-keys file; just share the existing azure one, adding the new relevant fs URI. That simplifies paths and allows .gitignore to be left alone.

Moving {{**/contract/ITest*.java}} from the parallel test runs to serial is a regression which will significantly hurt performance, especially for those of us over long-haul links, where HTTPS setup costs tend to dominate.

# Please keep the WASB tests in there.
# And justify not running the ABFS contract tests in parallel.

Related to this: except in the special cases of tests of instrumentation/metrics and stuff working on root paths, all tests need to be working on paths which can coexist with other test runs in parallel.

Proposed:
# Split the excludes to leave the WASB contract tests in parallel.
# Revert to azure-auth-keys.xml.
# Add a separate JIRA for parallelizing the ABFS tests.

h3. Checkstyle

I'm afraid you do get to fix these, though the Base64 "magic numbers" are it overreacting (see below about Base64). Things like .* imports and unused imports will need fixing, except for machine-(re)generated code. (Special case for .*: static imports.)

Also critical: adding a "." at the end of the first sentence of every single javadoc. It's a pain, but javadoc now requires it.
That also means removing the unused-imports suppressions from src/config/checkstyle-suppressions.xml. I would prefer this is done now, because otherwise it'll be neglected, and javadoc and patch merging will be in trouble from the outset.

h2. Misc

* Scope: I like the {{@Evolving}} marker, but for now we should mark all the new files as @Private. That includes the FS, which people should only get via FileSystem.get() or Path.getFileSystem().
* For Java 7-specific issues/workarounds, use "JDK7" in comments; this is our convention for future scans of the code (e.g. AbfsClient.flush()).
* log4j.properties: remove the DEBUG ref, which, transitively, avoids needing to mark {{org.apache.hadoop.fs.azurebfs.contracts.services.LoggingService}} as INFO. Just stash those debug-level changes locally and reapply as needed.

h3. Classes

Not looked at these in detail. Got some notes, but will wait until another iteration and do a package-by-package review.

* Can we get by without another Base64 impl on the classpath? There's one in the JVM and my IDE finds many more. If there is one in, say, httpclient, let's just pick it up. Less to test, less to maintain.

h3. Docs

Yes please!

> Support Windows Azure Storage - Blob file system in Hadoop
> ----------------------------------------------------------
>
>                 Key: HADOOP-15407
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15407
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/azure
>    Affects Versions: 3.2.0
>            Reporter: Esfandiar Manii
>            Assignee: Esfandiar Manii
>            Priority: Major
>         Attachments: HADOOP-15407-001.patch, HADOOP-15407-002.patch, HADOOP-15407-003.patch, HADOOP-15407-004.patch, HADOOP-15407-HADOOP-15407.006.patch, HADOOP-15407-HADOOP-15407.007.patch
>
> *Description*
> This JIRA adds a new file system implementation, ABFS, for running Big Data and Analytics workloads against Azure Storage.
> This is a complete rewrite of the previous WASB driver with a heavy focus on optimizing both performance and cost.
>
> *High level design*
> At a high level, the code here extends the FileSystem class to provide an implementation for accessing blobs in Azure Storage. The scheme abfs is used for accessing it over HTTP, and abfss for accessing over HTTPS. The following URI scheme is used to address individual paths:
>
> abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
>
> ABFS is intended as a replacement for WASB. WASB is not deprecated but is in pure maintenance mode, and customers should upgrade to ABFS once it hits General Availability later in CY18.
> Benefits of ABFS include:
> * Higher scale (capacity, throughput, and IOPS) for Big Data and Analytics workloads by allowing higher limits on storage accounts
> * Removing any ramp-up time with Storage backend partitioning; blocks are now automatically sharded across partitions in the Storage backend. This avoids the need for temporary/intermediate files, which increase cost (and framework complexity around committing jobs/tasks)
> * Enabling much higher read and write throughput on single files (tens of Gbps by default)
> * Still retaining all of the Azure Blob features customers are familiar with and expect, and gaining the benefits of future Blob features as well
>
> ABFS incorporates Hadoop Filesystem metrics to monitor the file system throughput and operations.
> Ambari metrics are not currently implemented for ABFS, but will be available soon.
>
> *Credits and history*
> Credit for this work goes to (hope I don't forget anyone): Shane Mainali, Thomas Marquardt, Zichen Sun, Georgi Chalakov, Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, Saurabh Pant, and James Baker.
>
> *Test*
> ABFS has gone through many test procedures, including Hadoop file system contract tests, unit testing, functional testing, and manual testing. All the JUnit tests provided with the driver are capable of running in either sequential or parallel fashion in order to reduce the testing time.
> Besides unit tests, we have used ABFS as the default file system in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage option. (HDFS is also used, but not as the default file system.) Various customer and test workloads have been run against clusters with such configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming and Spark SQL, and others have been run to do scenario, performance, and functional testing. Third parties and customers have also done various testing of ABFS.
> The current version reflects the version of the code tested and used in our production environment.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)