[ https://issues.apache.org/jira/browse/HADOOP-14766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated HADOOP-14766:
------------------------------------
    Attachment: HADOOP-14766-001.patch

Patch 001

This is the initial PoC imported into Hadoop under hadoop-common; it eliminates the copy & paste of ContractTestUtils.NanoTime by moving the class and retaining the old one as a subclass of the moved one. I'm not 100% sure this is the right home, but we don't yet have an explicit cloud module.

Note: this also works with HDFS, even across the local FS...any FS which implements its own version of {{copyFromLocalFile}} will benefit from it.

Testing: only manually, against S3A and its copyFromLocalFile.

There's no check for changed files, i.e. against checksums, timestamps or similar, and none is planned. This is primarily a local-to-store upload program with speed comparable to that shipped with the AWS SDK, but able to work with any remote HCFS store; it is not an incremental backup mechanism. Though if someone were to issue getChecksum(path) across all the stores, it'd be good to log that, possibly even export a minimal avro file summary.

> Cloudup: an object store high performance dfs put command
> ---------------------------------------------------------
>
>                 Key: HADOOP-14766
>                 URL: https://issues.apache.org/jira/browse/HADOOP-14766
>             Project: Hadoop Common
>          Issue Type: New Feature
>          Components: fs, fs/azure, fs/s3
>    Affects Versions: 2.8.1
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Minor
>         Attachments: HADOOP-14766-001.patch
>
>
> {{hdfs put local s3a://path}} is suboptimal as it treewalks down the source tree then, sequentially, copies each file up by reading its contents (opened as a stream) into a buffer, writing that to the dest file, and repeating.
> For S3A that hurts because
> * it's doing the upload inefficiently: the file can be uploaded just by handing the pathname to the AWS transfer manager
> * it is doing it sequentially, when some parallelised upload would work.
> * as the ordering of the files to upload is a recursive treewalk, it doesn't spread the upload across multiple shards.
> Better:
> * build the list of files to upload
> * upload in parallel, picking entries from the list at random and spreading across a pool of uploaders
> * upload straight from the local file (copyFromLocalFile())
> * track IO load (files created/second) to estimate risk of throttling.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: common-issues-h...@hadoop.apache.org
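For illustration, the "Better" strategy above (build the full file list first, shuffle it to spread load across shards, then copy from a pool of workers) can be sketched as follows. This is a minimal Python sketch of the scheduling pattern only: shutil.copy stands in for FileSystem.copyFromLocalFile(), and the function and parameter names (parallel_put, workers) are illustrative, not from the patch.

```python
import os
import random
import shutil
from concurrent.futures import ThreadPoolExecutor

def parallel_put(src_dir, dest_dir, workers=4):
    """Sketch of the proposed upload strategy:
    1. build the complete list of files up front,
    2. shuffle it so uploads spread across store shards rather than
       following the recursive-treewalk ordering,
    3. copy entries in parallel from a pool of uploader threads."""
    # 1. build the list of files to upload (relative paths)
    files = []
    for root, _dirs, names in os.walk(src_dir):
        for name in names:
            files.append(os.path.relpath(os.path.join(root, name), src_dir))

    # 2. randomise ordering to avoid hammering a single shard
    random.shuffle(files)

    def upload(rel):
        dest = os.path.join(dest_dir, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # stand-in for copyFromLocalFile(local, remote)
        shutil.copy(os.path.join(src_dir, rel), dest)
        return rel

    # 3. upload in parallel from a worker pool
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload, files))
```

A real implementation against an HCFS store would also meter files created per second (step 4 above) to estimate throttling risk; that bookkeeping is omitted here.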