[ https://issues.apache.org/jira/browse/HADOOP-15407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503626#comment-16503626 ]
Steve Loughran commented on HADOOP-15407:
-----------------------------------------

I'm starting off with some high-level "strategic issues", not looking at the low-level details. IntelliJ is flagging things up too, but that's secondary and can be dealt with package by package. I haven't yet done the testing.

The general things I review by are covered in my somewhat out-of-date [proposed styleguide update|https://github.com/steveloughran/formality/blob/master/styleguide/styleguide.md]. That doesn't cover Java 8, lambda expressions and streams, but it does show what I'm thinking about when I review stuff.

The key point: we want the code base to be broadly maintainable by anyone, tests to be fast and to fail meaningfully, and, where possible, as much of the same code style as elsewhere (though that evolves; moving things on is always good). So I'm looking at it not just from a "does it work" viewpoint, but "if I were left to maintain this, would I stand a chance?"

Key requirements to be fixed before the big patch goes in:
* Make {{AzureBlobFileSystemException}} a subclass of IOE.
* Fix all those checkstyle complaints, except in generated code.
* And findbugs, except when it is wrong and needs to be told so.
* Reinstate the WASB contract tests in the parallel runs.
* Use Azure-specific names for the service injection interfaces and impls.

Followup JIRAs to go in:
* Document dependency injection in the javadocs.
* Review exception handling/wrapping.
* Logging to use SLF4J directly for formatting.
* Testing enhancements as proposed below.
* Documentation.

Oh, and I'd like the first patch to be one which only does the POM dependency changes. This gives our new branch committers an initial patch to work with.

h3. Poms and dependencies

The patch to change the POM dependencies should go in on its own (presumably as the first patch), so it's more isolated. People cherry-picking or doing diffs on POMs will appreciate this.

Interesting that you had to exclude the guice transitive dependency on guava, but no other project did.
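For reference, an exclusion of that kind would look roughly like the following in the module POM (a sketch only; the actual guice coordinates and version management in the patch may differ):

```xml
<dependency>
  <groupId>com.google.inject</groupId>
  <artifactId>guice</artifactId>
  <!-- version assumed to be managed in the parent dependencyManagement -->
  <exclusions>
    <!-- keep Hadoop's own guava version on the classpath,
         rather than whatever guice pulls in transitively -->
    <exclusion>
      <groupId>com.google.guava</groupId>
      <artifactId>guava</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```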
Assumption: maven's "depth first" resolution logic has hidden it from others.

h3. Dependency Injection

We've tended to avoid this in the past; more a metric of the age of the codebase and the odd bad experience along the way. It is used in YARN a fair bit, though. I can see you've been busy here. That's fine, except you do get to document it in detail. I also worry that names like "ServiceProvider" may be a bit too generic; something like AzureStoreServiceProvider would be clearer.

Proposed: add a JIRA to cover names, plus some coverage in the package.html file of {{org.apache.hadoop.fs.azurebfs}}.

h3. Exceptions

My ideal: exceptions which include stacks, URLs, anything else to diagnose problems, ideally consistent with the Hadoop stack's "IOException everywhere" world.

* Please make {{AzureBlobFileSystemException}} a subclass of IOE. This matches the rest of our codebase and avoids having to translate things to IOEs in the FileSystem implementations.
* Make sure that stack traces are always retained... {{AzureBlobFileSystem.evaluateFileSystemPathOperation()}} is an example of where they are being stripped out in some cases (404, 409).
* Also: look at {{PathIOException}} for an exception class which takes a path; if {{AzureBlobFileSystemException}} could be built under that, it'd contain the path and operation, and be something other code already handles.

Proposed: move {{AzureBlobFileSystemException}} under IOE before the first patch goes in; tune the other details separately.

h3. Logging

Key question: why not just use the SLF4J API directly? Except for that specific method {{log(LogLevel logLevel, String message, Object... arguments)}}, I don't see anything there that's not in SLF4J.

And why a new message format? The SLF4J {{Log.info("{}", object)}} log message constructor is more efficient than the JDK formatter, handles null object values, and only calls toString() on its args when actually logging them.

* Followup 2: a failure in formatting must be swallowed, not raised.
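The efficiency argument for the parameterized {{"{}"}} style can be shown in plain Java. The following is a stdlib-only imitation for illustration (class and method names here are made up; this is not SLF4J internals or AbfsLoggingService code): argument objects are only rendered with toString() when the level is actually enabled.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Stdlib-only sketch of why SLF4J-style "{}" logging is cheap:
// arguments are rendered with toString() only when the level is enabled.
public class LazyLogDemo {
    static boolean debugEnabled = false;

    // Minimal imitation of parameterized "{}" formatting.
    static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int argIndex = 0, from = 0, at;
        while ((at = pattern.indexOf("{}", from)) >= 0) {
            sb.append(pattern, from, at);
            // String.valueOf handles null args gracefully ("null", no NPE)
            sb.append(argIndex < args.length ? String.valueOf(args[argIndex++]) : "{}");
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    static void debug(String pattern, Object... args) {
        if (debugEnabled) {            // guard first: no formatting work otherwise
            System.out.println(format(pattern, args));
        }
    }

    public static void main(String[] args) {
        AtomicInteger renders = new AtomicInteger();
        Object expensive = new Object() {
            @Override public String toString() {
                renders.incrementAndGet();   // count how often we are rendered
                return "expensive";
            }
        };

        debug("state = {}", expensive);      // debug off: toString() never runs
        assert renders.get() == 0;

        debugEnabled = true;
        debug("state = {}", expensive);      // debug on: rendered exactly once
        assert renders.get() == 1;
    }
}
```

The key point is the level check happening before any formatting: no string concatenation, no object rendering, no garbage created on the disabled path.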
Everyone hates a log failure, not least because debug-level logging doesn't normally get any test coverage.

Given how fundamental this log injection is, I guess it's hard to change, but I believe the formatting should be handed off to SLF4J rather than trying to do things with a JSON parser, which will inevitably (a) be slow and (b) be brittle.

Proposed: add a JIRA: SLF4J to do formatting for AbfsLoggingService. This is going to be a blocker, BTW; keeping the cost of logging down is something we care about, which means CPU overhead plus the cost of creating objects, calling toString(), etc.

h2. Testing

We shouldn't need to add a new auth-keys file; just share the existing azure one, adding the new relevant fs URI. That simplifies paths and allows .gitignore to be left alone.

Moving {{**/contract/ITest*.java}} from the parallel test runs to serial is a regression which will significantly hurt performance, especially for those of us over long-haul links, where HTTPS setup costs tend to dominate.

# Please keep the WASB tests in there.
# And justify not running the ABFS contract tests in parallel.

Related to this: except in the special cases of tests of instrumentation/metrics and stuff working on root paths, all tests need to be working on paths which can coexist with other test runs in parallel.

Proposed:
# Split the excludes to leave the WASB contract tests in parallel.
# Revert to azure-auth-keys.xml.
# Add a separate JIRA for parallelizing the ABFS tests.

h3. Checkstyle

I'm afraid you do get to fix these, though the Base64 "magic numbers" are it overreacting (see below about Base64). Things like .* imports and unused imports will need fixing, except for machine-(re)generated code. (Special case for .*: static imports.)

Also critical: adding a "." at the end of the first sentence of every single javadoc. It's a pain, but javadoc now requires it.
That also means removing the unused-imports suppressions from src/config/checkstyle-suppressions.xml. I would prefer this is done now, because otherwise it'll be neglected, and javadoc and patch merging will be in trouble from the outset.

h2. Misc

* Scope: I like the {{@Evolving}} marker, but for now we should mark all the new files as @Private. That includes the FS, which people should only get via FileSystem.get() or Path.getFileSystem().
* For Java 7-specific issues/workarounds, use "JDK7" in comments; this is our convention for future scans of the code (e.g. AbfsClient.flush()).
* log4j.properties: remove the DEBUG ref, which, transitively, avoids needing to mark {{org.apache.hadoop.fs.azurebfs.contracts.services.LoggingService}} as INFO. Just stash those debug-level changes locally and reapply as needed.

h3. Classes

Not looked at these in detail. Got some notes, but will wait until another iteration and do a package-by-package review.

* Can we get by without another Base64 impl on the classpath? There's one in the JVM and my IDE finds many more. If there is one in, say, httpclient, let's just pick it up. Less to test, less to maintain.

h3. Docs

Yes please!

> Support Windows Azure Storage - Blob file system in Hadoop
> ----------------------------------------------------------
>
>                 Key: HADOOP-15407
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15407
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs/azure
>    Affects Versions: 3.2.0
>            Reporter: Esfandiar Manii
>            Assignee: Esfandiar Manii
>            Priority: Major
>         Attachments: HADOOP-15407-001.patch, HADOOP-15407-002.patch, HADOOP-15407-003.patch, HADOOP-15407-004.patch, HADOOP-15407-HADOOP-15407.006.patch, HADOOP-15407-HADOOP-15407.007.patch
>
> *Description*
> This JIRA adds a new file system implementation, ABFS, for running Big Data and Analytics workloads against Azure Storage.
> This is a complete rewrite of the previous WASB driver with a heavy focus on optimizing both performance and cost.
>
> *High level design*
> At a high level, the code here extends the FileSystem class to provide an implementation for accessing blobs in Azure Storage. The scheme abfs is used for accessing it over HTTP, and abfss for accessing over HTTPS. The following URI scheme is used to address individual paths:
>
> abfs[s]://<filesystem>@<account>.dfs.core.windows.net/<path>
>
> ABFS is intended as a replacement for WASB. WASB is not deprecated but is in pure maintenance mode, and customers should upgrade to ABFS once it hits General Availability later in CY18.
> Benefits of ABFS include:
> * Higher scale (capacity, throughput, and IOPS) for Big Data and Analytics workloads by allowing higher limits on storage accounts
> * Removing any ramp-up time with Storage backend partitioning; blocks are now automatically sharded across partitions in the Storage backend. This avoids the need for temporary/intermediate files, which increase cost (and framework complexity around committing jobs/tasks)
> * Enabling much higher read and write throughput on single files (tens of Gbps by default)
> * Still retaining all of the Azure Blob features customers are familiar with and expect, and gaining the benefits of future Blob features as well
>
> ABFS incorporates Hadoop Filesystem metrics to monitor the file system throughput and operations.
> Ambari metrics are not currently implemented for ABFS, but will be available soon.
>
> *Credits and history*
> Credit for this work goes to (hope I don't forget anyone): Shane Mainali, Thomas Marquardt, Zichen Sun, Georgi Chalakov, Esfandiar Manii, Amit Singh, Dana Kaban, Da Zhou, Junhua Gu, Saher Ahwal, Saurabh Pant, and James Baker.
>
> *Test*
> ABFS has gone through many test procedures, including Hadoop file system contract tests, unit testing, functional testing, and manual testing. All the JUnit tests provided with the driver are capable of running in either sequential or parallel fashion in order to reduce the testing time.
> Besides unit tests, we have used ABFS as the default file system in Azure HDInsight. Azure HDInsight will very soon offer ABFS as a storage option. (HDFS is also used, but not as the default file system.) Various customer and test workloads have been run against clusters with such configurations for quite some time. Benchmarks such as Tera*, TPC-DS, Spark Streaming and Spark SQL, and others have been run to do scenario, performance, and functional testing. Third parties and customers have also done various testing of ABFS.
> The current version reflects the version of the code tested and used in our production environment.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)