[ https://issues.apache.org/jira/browse/HDFS-17316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17807603#comment-17807603 ]
Han Liu commented on HDFS-17316:
--------------------------------

Thank you for the detailed comments, [~ste...@apache.org], and apologies for the late reply. I have recently revised my code so that it follows the Hadoop code format.

{quote}1. filesystem contract tests are designed to do this from junit; If your FS implementation doesn't subclass and run these, you need to start there.{quote}

Contract tests play an essential role in evaluating a storage service's abilities: they closely examine the core FS operations such as create, open, delete, etc. There is some overlap between contract tests and the benchmark discussed here. The main difference is that contract tests focus on the quality of the most important subset of FS APIs, with a series of cases designed for each API. The goal of the proposed benchmark is to provide a general way to check the basic compatibility of all public FS APIs; it treats the interfaces uniformly and covers all of them, including ACL, XAttr, StoragePolicy, Snapshot, Symlink, etc. For a new FS implementation, it should be possible to run the benchmark quickly as long as the implementation jar file is supplied. The benchmark would also introduce the concept of a 'suite', corresponding to a subset of APIs, to check compatibility for specific scenarios such as 'tpcds'.

{quote}2. filesystem API specification is intended to specify the API and document where problems surface. maintenance there always welcome -and as the contract tests are derived from it, enhancements in those tests to follow{quote}

Maintaining the API specification is indeed important. As the Hadoop ecosystem develops, the core API functions may evolve and require new contract cases; MultipartUploaderTest is one example. I am glad to keep an eye on it and contribute more cases when needed.

{quote}3. 
there's also terasort to validate commit protocols{quote}

I agree that TeraSort can be used as part of the compatibility benchmark. There could be an individual suite validating the MapReduce file output committer.

{quote}4. + distcp contract tests for its semantics{quote}

The validity of DistCp can also be an individual suite, where the test case is a DistCp job from a MiniDFSCluster to the target storage service.

{quote}5. dfsio does a lot, but needs maintenance -it only targets the clusterfs, when really you should be able to point at cloud storage from your own computer. extending that to take a specific target fs would be good.{quote}

I agree that DFSIO should be extended to support arbitrary target file systems. This should be done in Hadoop as a separate task, so that the benchmark tool can then use it. Good idea!

{quote}6. output must go into the class ant junit xml format so jenkins can present it.{quote}

Good suggestion. The benchmark is designed as a tool that quickly evaluates the compatibility score of a FS implementation, so it may be inappropriate to treat it as a unit test system. All cases must be simple, and after a quick run a report is automatically generated showing an overall score and a list of 'not compatible' APIs. The framework contains both Java cases and pjdfstest-style shell scripts. Thus, the benchmark framework is more flexible and does not need a JUnit report.

{quote}We can create a new hadoop git repo for this. Do you have existing code and any detailed specification/docs. this also allows you to add dependencies on other things, e.g. spark.{quote}

Yes, I already have some initial code and will submit a PR later for easier reference. I am also preparing a design doc with more details and will share the link here when it is ready. The benchmark we discussed does not need extra dependencies on Spark or Hive; on the contrary, the design may limit the dependencies to Hadoop itself.
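To make the 'suite' and report ideas above concrete, here is a minimal, self-contained Java sketch. All names here (CompatSuite, case names like "fs.setAcl") are hypothetical illustrations, not the actual benchmark code: a suite is a named subset of cases, each case is a boolean check against the target FS, and a run yields an overall score plus the list of 'not compatible' APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch of the proposed suite concept; not the actual benchmark code. */
public class CompatSuite {
  private final String name;
  // Maps an API case name (e.g. "fs.setXAttr") to its compatibility check.
  private final Map<String, BooleanSupplier> cases = new LinkedHashMap<>();

  public CompatSuite(String name) {
    this.name = name;
  }

  /** Register one case; the check returns true iff the API behaves as specified. */
  public CompatSuite add(String apiName, BooleanSupplier check) {
    cases.put(apiName, check);
    return this;
  }

  /** Run all cases; a thrown exception counts as 'not compatible'. */
  public Result run() {
    List<String> notCompatible = new ArrayList<>();
    for (Map.Entry<String, BooleanSupplier> e : cases.entrySet()) {
      boolean ok;
      try {
        ok = e.getValue().getAsBoolean();
      } catch (RuntimeException ex) {
        ok = false;  // e.g. UnsupportedOperationException from the target FS
      }
      if (!ok) {
        notCompatible.add(e.getKey());
      }
    }
    double score = cases.isEmpty()
        ? 1.0 : (double) (cases.size() - notCompatible.size()) / cases.size();
    return new Result(name, score, notCompatible);
  }

  /** Overall score plus the list of APIs found not compatible. */
  public static final class Result {
    public final String suite;
    public final double score;
    public final List<String> notCompatible;

    Result(String suite, double score, List<String> notCompatible) {
      this.suite = suite;
      this.score = score;
      this.notCompatible = notCompatible;
    }

    @Override
    public String toString() {
      return String.format("suite=%s score=%.0f%% not compatible=%s",
          suite, score * 100, notCompatible);
    }
  }
}
```

A real case would wrap a call such as FileSystem.setXAttr() against the implementation under test; the sketch only fixes the shape of the report. For example, a suite with one passing and one failing case would report a 50% score and list the failing API name.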
Thus, a small submodule of the Hadoop repo might be enough; perhaps a hadoop-compat-bench module under hadoop-tools. Further discussion and the next code review are welcome!

> Compatibility Benchmark over HCFS Implementations
> -------------------------------------------------
>
>                 Key: HDFS-17316
>                 URL: https://issues.apache.org/jira/browse/HDFS-17316
>             Project: Hadoop HDFS
>          Issue Type: New Feature
>            Reporter: Han Liu
>            Priority: Major
>
> {*}Background:{*} Hadoop-Compatible File System (HCFS) is a core concept in the
> big data storage ecosystem, providing unified interfaces and generally clear
> semantics. It has become the de facto standard for industry storage systems
> to follow and conform with. There have been a series of HCFS implementations
> in Hadoop, such as S3AFileSystem for Amazon's S3 Object Store, WASB for
> Microsoft's Azure Blob Storage and the OSS connector for Alibaba Cloud Object
> Storage, and more from storage service providers on their own.
> {*}Problems:{*} However, as indicated by introduction.md, there is no formal
> suite to assess the compatibility of a file system across all such HCFS
> implementations. Thus, whether the functionality is well accomplished and
> meets the core compatibility expectations mainly relies on the service
> provider's own report. Meanwhile, Hadoop is also developing, and new features
> are continuously being contributed to HCFS interfaces for existing
> implementations to follow and adopt; in this case, Hadoop also needs a tool to
> quickly assess whether these features are supported by a specific HCFS
> implementation. Besides, the hadoop command line tool (hdfs shell) is used to
> directly interact with an HCFS storage system, where most commands correspond
> to specific HCFS interfaces and work well. Still, there are cases that are
> complicated and may not work, like the expunge command. To check such commands
> for an HCFS, we also need an approach to figure them out. 
> {*}Proposal:{*} Accordingly, we propose to define a formal HCFS compatibility
> benchmark and provide a corresponding tool to perform the compatibility
> assessment of an HCFS storage system. The benchmark and tool should cover both
> HCFS interfaces and hdfs shell commands. Different scenarios require different
> kinds of compatibility; to account for this, we could define different suites
> in the benchmark.
> *Benefits:* We intend the benchmark and tool to be useful for both storage
> providers and storage users. For end users, it can be used to evaluate the
> compatibility level and determine whether the storage system in question is
> suitable for the required scenarios. For storage providers, it helps to
> quickly generate an objective and reliable report about the core functions of
> the storage service. For instance, if an HCFS scored 100% on a suite named
> 'tpcds', that demonstrates that all functions needed by a tpcds program are
> well supported. It is also a guide indicating how storage service abilities
> map to HCFS interfaces, such as storage class on S3.
> Any thoughts? Comments and feedback are most welcome. Thanks in advance.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)