Hi all,

I have submitted a patch for this feature on GitHub (pull request 80). Please
have a look. There are a few points on which I would like your thoughts.
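
For reference, here is a rough, self-contained sketch of the multipart flow
described in the design notes quoted below. This is illustrative only, not the
patch code: request signing/authentication is assumed to be handled by the
'session' object (for example a requests.Session configured with AWS signing),
and the function name is made up.

    import xml.etree.ElementTree as ET

    CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB - the minimum part size S3 allows
    S3_XML_NS = '{http://s3.amazonaws.com/doc/2006-03-01/}'

    def multipart_upload_sketch(session, object_url, stream):
        # Step 1: initiate - POST ?uploads returns an XML body with an UploadId
        resp = session.post(object_url + '?uploads')
        resp.raise_for_status()
        upload_id = ET.fromstring(resp.content).findtext(S3_XML_NS + 'UploadId')

        etags = []
        try:
            # Step 2: upload the stream in 5 MB chunks as numbered parts
            part_number = 1
            while True:
                chunk = stream.read(CHUNK_SIZE)
                if not chunk:
                    break
                resp = session.put('%s?partNumber=%d&uploadId=%s'
                                   % (object_url, part_number, upload_id),
                                   data=chunk)
                resp.raise_for_status()
                etags.append((part_number, resp.headers['ETag']))
                part_number += 1

            # Step 3: commit - POST the list of part numbers and their ETags
            parts = ''.join('<Part><PartNumber>%d</PartNumber><ETag>%s</ETag></Part>'
                            % (num, etag) for num, etag in etags)
            body = '<CompleteMultipartUpload>%s</CompleteMultipartUpload>' % parts
            resp = session.post('%s?uploadId=%s' % (object_url, upload_id), data=body)
            resp.raise_for_status()
        except Exception:
            # Step 4: on any failure, abort so S3 discards the uploaded parts
            session.delete('%s?uploadId=%s' % (object_url, upload_id))
            raise

In the patch itself these steps are split across _put_object(),
_upload_multipart(), _commit_multipart() and _abort_multipart(), as described
in the design notes below.
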
Regards,
Mahendra

On Thu, Dec 20, 2012 at 3:19 PM, Mahendra M (JIRA) <[email protected]> wrote:

> Mahendra M created LIBCLOUD-269:
> -----------------------------------
>
>              Summary: Multipart upload for Amazon S3
>                  Key: LIBCLOUD-269
>                  URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
>              Project: Libcloud
>           Issue Type: Improvement
>           Components: Storage
>             Reporter: Mahendra M
>
>
> This patch adds support for streaming data upload using Amazon's multipart
> upload support, as described in
> http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html
>
> With the current behaviour, the upload_object_via_stream() API reads the
> entire object into memory and then uploads it to S3. This becomes
> problematic with large files (think HD videos of around 4 GB) and is a
> huge hit on the performance and memory usage of the Python application.
>
> With this patch, upload_object_via_stream() uses the S3 multipart upload
> feature to upload data in 5 MB chunks, greatly reducing the overall memory
> impact on the application.
>
> Design of this feature:
> * S3StorageDriver is not used just for Amazon S3. It is subclassed for use
> with other S3-compatible cloud storage providers such as Google Storage.
> * Amazon S3 multipart upload is not (or may not be) supported by those
> other storage providers, which may prefer the chunked upload mechanism.
>
> We can solve this problem in two ways:
> 1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver)
> that implements the new multipart upload mechanism, while other storage
> providers keep subclassing S3StorageDriver. This is the cleaner approach.
> 2) Introduce an attribute supports_s3_multipart_upload and, based on its
> value, control the callback function passed to the _put_object() API. This
> makes the code look a bit hacky, but it is better for supporting similar
> features in the future, since we do not have to keep adding subclasses for
> each feature.
>
> In the current patch I have implemented approach (2), though I prefer (1).
> After discussing it with the community and knowing its preferences, we can
> select a final approach.
>
> Design notes:
> * The implementation has three steps, plus cleanup on failure:
> 1) POST to /container/object_name?uploads. This returns an XML document
> with a unique uploadId. This is handled as part of _put_object(); doing it
> via _put_object() ensures that all S3-related parameters are set correctly.
> 2) Each chunk is uploaded via PUT to
> /container/object_name?partNumber=X&uploadId=***. This is done in the
> callback passed to _put_object(), named _upload_multipart().
> 3) An XML document listing the part numbers and the ETag header returned
> for each part is POSTed to /container/object_name?uploadId=***,
> implemented in _commit_multipart().
> 4) If anything fails in steps (2) or (3), the upload is deleted from S3
> through a DELETE request to /container/object_name?uploadId=****,
> implemented in _abort_multipart().
>
> * The chunk size for upload was set to 5 MB, the minimum allowed size as
> per the Amazon S3 docs.
>
> Other changes:
> * Did some PEP8 cleanup on s3.py.
>
> * s3.get_container() used to iterate through the list of containers to
> find the requested entry. This can be simplified by making a HEAD request.
> The only downside is that 'created_time' is not available for the
> container. Let me know if this approach is OK or if I should revert it.
>
> * Introduced the following APIs on S3StorageDriver to make some
> functionality easier:
> get_container_cdn_url()
> get_object_cdn_url()
>
> * In libcloud.common.base.Connection, the request() method is the basis
> for all HTTP requests made by libcloud. It has a limitation that became
> apparent while implementing S3 multipart upload: to initiate an upload,
> the API invoked is
> /container/object_name?uploads
> and the 'uploads' parameter has to be passed as-is, without a value. If it
> went through the "params" argument of request(), it would come out as
> 'uploads=***'. To prevent this, the 'action' was set to
> /container/object_name?uploads and slight modifications were made to how
> parameters are appended.
>
> This also forced a change in BaseMockHttpObject._get_method_name().
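
To make the valueless-parameter point above concrete: anything passed through
a params dict gets encoded as key=value pairs. An illustrative snippet (not
libcloud code; Python 3 import shown):

    from urllib.parse import urlencode

    urlencode({'uploads': ''})    # -> 'uploads='
    urlencode({'uploads': None})  # -> 'uploads=None'

    # S3 expects the initiation request to carry a bare parameter,
    #     POST /container/object_name?uploads
    # which is why the patch appends '?uploads' to the 'action' itself.
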
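Similarly, to make the HEAD-based get_container() lookup mentioned above
concrete, a minimal sketch (illustrative only; the helper name is made up and
a plain requests-style 'session' stands in for the driver's signed
connection):

    def head_container_sketch(session, endpoint_url, container_name):
        # HEAD /container: a 2xx response confirms the container exists and
        # is accessible. No body is returned, so 'created_time' cannot be
        # filled in.
        resp = session.head('%s/%s' % (endpoint_url, container_name))
        if resp.status_code == 404:
            raise LookupError('Container %s does not exist' % container_name)
        resp.raise_for_status()
        return {'name': container_name, 'created_time': None}
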
> Bug fixes in the test framework:
> * While working on the test cases, I noticed a small issue. I am not sure
> whether it is a bug or as per design.
> MockRawResponse._get_response_if_not_availale() would return two different
> values on subsequent invocations:
>
>     if not self._response:
>         ...
>         return self          <----- this was inconsistent
>     return self._response
>
> While adding test cases for the Amazon S3 functionality, I noticed that
> instead of getting back a MockResponse, I was getting a MockRawResponse
> instance (which does not have methods like read() or parse_body()). So I
> fixed this issue. Because of that, other test cases started failing and
> were subsequently fixed. I am not sure whether this had to be fixed or
> whether it was done on purpose. If someone can shed some light on it, I
> can work on it further. As of now, all test cases pass.
>
> * In test_s3.py, the driver was set everywhere to S3StorageDriver. The
> same test case is used for GoogleStorageDriver, where the driver then
> turns up as S3StorageDriver instead of GoogleStorageDriver. This was fixed
> by changing the code to driver=self.driver_type.
>
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators.
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>

--
Mahendra

http://twitter.com/mahendra
