[
https://issues.apache.org/jira/browse/LIBCLOUD-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tomaz Muraus resolved LIBCLOUD-269.
-----------------------------------
Resolution: Fixed
Assignee: Tomaz Muraus
> Multipart upload for amazon S3
> ------------------------------
>
> Key: LIBCLOUD-269
> URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
> Project: Libcloud
> Issue Type: Improvement
> Components: Storage
> Reporter: Mahendra M
> Assignee: Tomaz Muraus
> Attachments: libcloud-269.diff, libcloud-269.diff
>
>
> This patch adds support for streaming data upload using Amazon's multipart
> upload support, as described in
> (http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html)
> With the current behaviour, the upload_object_via_stream() API reads the
> entire object into memory and then uploads it to S3. This can become
> problematic with large files (think HD videos around 4 GB) and is a huge hit
> to the performance and memory usage of the Python application.
> With this patch, upload_object_via_stream() uses the S3 multipart upload
> feature to upload data in 5 MB chunks, thus reducing the overall memory
> footprint of the application.
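> For illustration, a minimal caller-side sketch (the credentials, container
> name and chunking helper below are hypothetical; only
> upload_object_via_stream() itself comes from the driver):
>
>     from libcloud.storage.types import Provider
>     from libcloud.storage.providers import get_driver
>
>     def read_in_chunks(fp, chunk_size=5 * 1024 * 1024):
>         # Yield the file in fixed-size chunks instead of loading it whole.
>         while True:
>             data = fp.read(chunk_size)
>             if not data:
>                 break
>             yield data
>
>     driver = get_driver(Provider.S3)('api key', 'api secret')
>     container = driver.get_container(container_name='backups')
>     with open('/tmp/video.mp4', 'rb') as fp:
>         driver.upload_object_via_stream(iterator=read_in_chunks(fp),
>                                         container=container,
>                                         object_name='video.mp4')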
> Design of this feature:
> * The S3StorageDriver() is not used just for Amazon S3; it is subclassed for
> use with other S3-compatible cloud storage providers such as Google Storage.
> * The Amazon S3 multipart upload is not (or may not be) supported by those
> other storage providers (which will prefer the chunked upload mechanism).
> We can solve this problem in two ways:
> 1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver)
> which implements this new multipart upload mechanism. Other storage providers
> will subclass S3StorageDriver. This is the cleaner approach.
> 2) Introduce an attribute supports_s3_multipart_upload and, based on its
> value, control the callback function passed to the _put_object() API. This
> makes the code look a bit hacky, but the approach is better for supporting
> such features in the future: we don't have to keep creating subclasses for
> each feature.
> In the current patch I have implemented approach (2), though I prefer (1).
> After discussions with the community and knowing their preferences, we can
> select a final approach.
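> A rough sketch of approach (2) (the _stream_data fallback and the exact
> _put_object() signature are simplifications, not the literal patch):
>
>     from libcloud.storage.base import StorageDriver
>
>     class S3StorageDriver(StorageDriver):
>         # Approach (2): a class-level flag instead of a dedicated subclass.
>         supports_s3_multipart_upload = True
>
>         def upload_object_via_stream(self, iterator, container,
>                                      object_name, extra=None):
>             if self.supports_s3_multipart_upload:
>                 # Stream the data in 5 MB parts via the multipart API.
>                 upload_func = self._upload_multipart
>             else:
>                 # Fall back to the plain chunked upload mechanism.
>                 upload_func = self._stream_data
>             return self._put_object(container=container,
>                                     object_name=object_name,
>                                     upload_func=upload_func,
>                                     iterator=iterator, extra=extra)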
> Design notes:
> * The implementation has three steps, plus failure handling; a condensed
> sketch of the calls follows after these notes:
> 1) POST request to /container/object_name?uploads. This returns an XML
> document with a unique uploadId. This is handled as part of _put_object();
> doing it via _put_object() ensures that all S3-related parameters are set
> correctly.
> 2) Upload each chunk via PUT to
> /container/object_name?partNumber=X&uploadId=*** - this is done via the
> callback passed to _put_object(), named _upload_multipart().
> 3) POST an XML document containing the part numbers and the ETag headers
> returned for each part to /container/object_name?uploadId=***, implemented
> via _commit_multipart().
> 4) In case of any failure in steps (2) or (3), the upload is deleted from
> S3 through a DELETE request to /container/object_name?uploadId=****,
> implemented via _abort_multipart().
> * The chunk size for upload was set to 5 MB - this is the minimum part size
> allowed as per the Amazon S3 docs.
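> A condensed sketch of steps (1)-(4), collapsed into one helper for
> readability (the real patch routes this through _put_object() and the
> callbacks named above; the helper name, XML handling and header access are
> assumptions):
>
>     from xml.etree import ElementTree as ET
>
>     S3_NS = '{http://s3.amazonaws.com/doc/2006-03-01/}'
>
>     def multipart_upload(conn, path, iterator):
>         # iterator is assumed to yield 5 MB chunks (S3's minimum part size).
>         # (1) Initiate: POST .../object?uploads returns XML with an UploadId.
>         resp = conn.request(path + '?uploads', method='POST')
>         upload_id = ET.XML(resp.body).findtext(S3_NS + 'UploadId')
>
>         try:
>             # (2) PUT each chunk, remembering the ETag header of every part.
>             etags = []
>             for number, chunk in enumerate(iterator, 1):
>                 r = conn.request('%s?partNumber=%d&uploadId=%s'
>                                  % (path, number, upload_id),
>                                  data=chunk, method='PUT')
>                 etags.append((number, r.headers['etag']))
>
>             # (3) Commit: POST an XML body listing part numbers and ETags.
>             root = ET.Element('CompleteMultipartUpload')
>             for number, etag in etags:
>                 part = ET.SubElement(root, 'Part')
>                 ET.SubElement(part, 'PartNumber').text = str(number)
>                 ET.SubElement(part, 'ETag').text = etag
>             conn.request('%s?uploadId=%s' % (path, upload_id),
>                          data=ET.tostring(root), method='POST')
>         except Exception:
>             # (4) Abort the upload so S3 does not keep the parts around.
>             conn.request('%s?uploadId=%s' % (path, upload_id),
>                          method='DELETE')
>             raise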
> Other changes:
> * Did some PEP8 cleanup on s3.py.
> * s3.get_container() would iterate through the list of containers to find
> the requested entry. This can be simplified by making a HEAD request. The
> only downside is that 'created_time' is not available for the container.
> Let me know if this approach is OK or if I must revert it.
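> A sketch of what the HEAD-based lookup could look like (error handling
> simplified to the not-found case):
>
>     from libcloud.utils.py3 import httplib
>     from libcloud.storage.base import Container
>     from libcloud.storage.types import ContainerDoesNotExistError
>
>     def get_container(self, container_name):
>         # HEAD the container directly instead of listing every container;
>         # the trade-off is that 'created_time' is no longer available.
>         response = self.connection.request('/%s' % (container_name,),
>                                            method='HEAD')
>         if response.status == httplib.NOT_FOUND:
>             raise ContainerDoesNotExistError(value=None, driver=self,
>                                              container_name=container_name)
>         return Container(name=container_name, extra=None, driver=self)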
> * Introduced the following APIs on S3StorageDriver() to make some
> functionality easier to use:
> get_container_cdn_url()
> get_object_cdn_url()
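> For example, continuing with the driver object from the first sketch above
> (the URLs shown in the comments are illustrative, not guaranteed output):
>
>     container = driver.get_container(container_name='backups')
>     obj = driver.get_object(container_name='backups',
>                             object_name='video.mp4')
>     driver.get_container_cdn_url(container=container)
>     # e.g. 'https://backups.s3.amazonaws.com/'
>     driver.get_object_cdn_url(obj=obj)
>     # e.g. 'https://backups.s3.amazonaws.com/video.mp4'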
> * In libcloud.common.base.Connection, the request() method is the basis for
> all HTTP requests made by Libcloud. This method had a limitation which
> became apparent in the S3 multipart upload implementation. For initializing
> an upload, the API invoked is
> /container/object_name?uploads
> The 'uploads' parameter has to be passed as-is, without any value. If we
> made use of the "params" argument of the request() method, it would have
> come up as 'uploads=***'. To prevent this, the 'action' was set to
> /container/object_name?uploads and slight modifications were made to how
> parameters are appended (illustrated below).
> This also forced a change in BaseMockHttpObject._get_method_name().
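> To illustrate the problem (standalone, outside Libcloud):
>
>     from urllib.parse import urlencode  # 'from urllib import urlencode' on Python 2
>
>     urlencode({'uploads': ''})     # -> 'uploads='
>     urlencode({'uploads': None})   # -> 'uploads=None'
>
>     # S3 expects the bare token, so the query string is folded into the
>     # action itself and request() appends any remaining params after it:
>     action = '/container/object_name?uploads'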
> Bug fixes in test framework
> * While working on the test cases, I noticed a small issue; I am not sure
> whether it was a bug or by design.
> MockRawResponse._get_response_if_not_availale() would return two different
> values on subsequent invocations:
>     if not self._response:
>         ...
>         return self          <----- this was inconsistent
>     return self._response
> While adding test cases for the Amazon S3 functionality, I noticed that
> instead of getting back a MockResponse I was getting a MockRawResponse
> instance (which did not have methods like read() or parse_body()), so I
> fixed this issue. Because of this, other test cases started failing and
> were subsequently fixed. I am not sure whether this has to be fixed or
> whether it was done on purpose; if someone can throw some light on it, I
> can work on it further. As of now, all test cases pass.
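> A schematic sketch of the fixed helper (the _build_response() placeholder is
> hypothetical; the real code populates self._response from the registered
> mock data):
>
>     def _get_response_if_not_availale(self):
>         # Always hand back the same object, however often it is called.
>         if not self._response:
>             self._response = self._build_response()  # hypothetical helper
>         return self._response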
> * In test_s3.py, the driver was being set everywhere to S3StorageDriver.
> The same test case is reused for GoogleStorageDriver, where the driver then
> turns up as S3StorageDriver instead of GoogleStorageDriver. This was fixed
> by changing the code to driver=self.driver_type.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira