[ 
https://issues.apache.org/jira/browse/LIBCLOUD-269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomaz Muraus updated LIBCLOUD-269:
----------------------------------

    Fix Version/s: 0.12.1
    
> Multipart upload for amazon S3
> ------------------------------
>
>                 Key: LIBCLOUD-269
>                 URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
>             Project: Libcloud
>          Issue Type: Improvement
>          Components: Storage
>            Reporter: Mahendra M
>            Assignee: Tomaz Muraus
>             Fix For: 0.12.1
>
>         Attachments: libcloud-269.diff, libcloud-269.diff
>
>
> This patch adds support for streaming data upload using Amazon's multipart 
> upload feature, as described at 
> http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html
> With the current behaviour, the upload_object_via_stream() API reads the 
> entire object into memory and then uploads it to S3. This becomes 
> problematic with large files (think HD videos around 4 GB) and is a big 
> hit on the performance and memory footprint of the Python application.
> With this patch, upload_object_via_stream() uses the S3 multipart upload 
> feature to upload data in 5 MB chunks, greatly reducing the overall memory 
> footprint of the application.
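> For context, a minimal usage sketch of the streaming upload path (the 
> credentials, container name and file path below are placeholders, and the 
> read_in_chunks() helper is only illustrative):
>     from libcloud.storage.types import Provider
>     from libcloud.storage.providers import get_driver
>
>     cls = get_driver(Provider.S3)
>     driver = cls('api key', 'api secret key')
>     container = driver.get_container(container_name='backups')
>
>     # Stream a large file from disk in 5 MB pieces instead of
>     # loading it into memory all at once.
>     def read_in_chunks(path, chunk_size=5 * 1024 * 1024):
>         with open(path, 'rb') as fp:
>             while True:
>                 data = fp.read(chunk_size)
>                 if not data:
>                     break
>                 yield data
>
>     obj = driver.upload_object_via_stream(
>         iterator=read_in_chunks('/tmp/video.mp4'),
>         container=container,
>         object_name='video.mp4')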
> Design of this feature:
> * The S3StorageDriver() is not used just for Amazon S3. It is subclassed 
> for use with other S3-compatible cloud storage providers, such as Google 
> Storage.
> * Amazon S3 multipart upload is not (or may not be) supported by other 
> storage providers, which may prefer the chunked upload mechanism instead.
> We can solve this problem in two ways:
> 1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver) 
> which implements the new multipart upload mechanism. Other storage 
> providers would keep subclassing S3StorageDriver. This is the cleaner 
> approach.
> 2) Introduce an attribute supports_s3_multipart_upload and, based on its 
> value, control the callback function passed to the _put_object() API. This 
> makes the code look a bit hacky, but it is better for supporting such 
> features in the future, since we do not have to keep adding subclasses for 
> each feature.
> In the current patch I have implemented approach (2), though I prefer (1). 
> After discussing with the community and learning their preferences, we can 
> settle on a final approach.
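> A rough sketch of what approach (2) boils down to inside the driver (the 
> _put_object() signature and the _stream_data fallback shown here are 
> assumptions for illustration, not the exact patch code):
>     from libcloud.storage.base import StorageDriver
>
>     class S3StorageDriver(StorageDriver):
>         # Subclasses for providers without multipart support
>         # (e.g. Google Storage) would set this to False.
>         supports_s3_multipart_upload = True
>
>         def upload_object_via_stream(self, iterator, container,
>                                      object_name, extra=None):
>             if self.supports_s3_multipart_upload:
>                 # _upload_multipart() PUTs 5 MB parts against the
>                 # uploadId obtained by _put_object().
>                 upload_func = self._upload_multipart
>             else:
>                 # Plain chunked/streamed upload for other providers.
>                 upload_func = self._stream_data
>             return self._put_object(container=container,
>                                     object_name=object_name,
>                                     upload_func=upload_func,
>                                     iterator=iterator, extra=extra)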
> Design notes:
> * The implementation has three main steps (a request-level sketch follows 
> these notes):
>   1) POST to /container/object_name?uploads. This returns an XML document 
> containing a unique uploadId. This is handled as part of _put_object(); 
> doing it via _put_object() ensures that all S3-related parameters are set 
> correctly.
>   2) Upload each chunk via a PUT to 
> /container/object_name?partNumber=X&uploadId=*** - this is done via the 
> callback passed to _put_object(), named _upload_multipart().
>   3) POST an XML document containing the part numbers and the ETag headers 
> returned for each part to /container/object_name?uploadId=***, implemented 
> via _commit_multipart().
>   4) In case of any failures in steps (2) or (3), the upload is deleted from 
> S3 through a DELETE request to /container/object_name?uploadId=****, 
> implemented via _abort_multipart()
> * The chunk size for upload was set to 5 MB - this is the minimum part 
> size allowed as per the Amazon S3 docs.
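> At the HTTP level, steps (1)-(4) look roughly like the following when 
> issued through the driver's connection (the parse_upload_id() and 
> build_complete_xml() helpers, the container/object names and the loop over 
> chunks are illustrative assumptions, not the patch code):
>     # (1) Initiate: POST ?uploads, parse the uploadId out of the XML reply
>     resp = driver.connection.request('/backups/video.mp4?uploads',
>                                      method='POST')
>     upload_id = parse_upload_id(resp.body)
>
>     # (2) Upload each 5 MB chunk as a numbered part, remembering its ETag
>     etags = []
>     for part_number, chunk in enumerate(chunks, 1):
>         resp = driver.connection.request(
>             '/backups/video.mp4',
>             params={'partNumber': part_number, 'uploadId': upload_id},
>             data=chunk, method='PUT')
>         etags.append((part_number, resp.headers['etag']))
>
>     # (3) Commit: POST an XML body listing each part number and ETag
>     driver.connection.request('/backups/video.mp4',
>                               params={'uploadId': upload_id},
>                               data=build_complete_xml(etags),
>                               method='POST')
>
>     # (4) On any failure in (2) or (3): abort, so S3 discards the parts
>     driver.connection.request('/backups/video.mp4',
>                               params={'uploadId': upload_id},
>                               method='DELETE')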
> Other changes:
> * Did some PEP8 cleanup on s3.py
> * s3.get_container() used to iterate through the full list of containers 
> to find the requested entry. This can be simplified by making a HEAD 
> request instead. The only downside is that 'created_time' is not available 
> for the container. Let me know if this approach is OK or if I should 
> revert it.
> * Introduced the following APIs on the S3StorageDriver(), to make some 
> functionality easier to use:
>   get_container_cdn_url()
>   get_object_cdn_url()
> * In libcloud.common.base.Connection, the request() method is the basis 
> for all HTTP requests made by libcloud. This method had a limitation which 
> became apparent in the S3 multipart upload implementation. To initiate an 
> upload, the API invoked is
>   /container/object_name?uploads
> The 'uploads' parameter has to be passed as-is, without any value. If we 
> used the 'params' argument of the request() method, it would come out as 
> 'uploads=***'. To prevent this, the 'action' was set to 
> /container/object_name?uploads and slight modifications were made to how 
> parameters are appended (illustrated in the sketch after this list).
> This also forced a change in BaseMockHttpObject._get_method_name().
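> The valueless-parameter problem is easy to see with the standard library's 
> query-string encoder (Python 3 module path shown; the object path below is 
> only an example):
>     from urllib.parse import urlencode
>
>     # Any params-based encoding produces 'key=value' pairs ...
>     urlencode({'uploads': ''})          # -> 'uploads='
>
>     # ... while the initiate request needs the bare sub-resource, so the
>     # patch appends it to the 'action' itself:
>     action = '/backups/video.mp4?uploads'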
> Bug fixes in test framework
> * While working on the test cases, I noticed a small issue. Not sure if it 
> is a bug or works as designed:
>   MockRawResponse._get_response_if_not_availale() would return two 
> different values on subsequent invocations.
>      if not self._response:
>          ...
>          return self            # <----- inconsistent: first call returns
>                                 #        the MockRawResponse instance itself
>      return self._response      # <----- later calls return the mock response
>   While adding test cases for the Amazon S3 functionality, I noticed that 
> instead of getting back a MockResponse, I was getting a MockRawResponse 
> instance (which does not have methods like read() or parse_body()). So I 
> fixed this issue; because of that, other test cases started failing and 
> were subsequently fixed as well (a sketch of the intended consistent 
> behaviour follows below). Not sure whether this had to be fixed or was 
> done on purpose - if someone can throw some light on it, I can work on it 
> further. As of now, all test cases pass.
> * In test_s3.py, the driver was being set everywhere to S3StorageDriver. 
> The same test case is reused for GoogleStorageDriver, where the driver 
> then turned up as S3StorageDriver instead of GoogleStorageDriver. This was 
> fixed by changing the code to driver=self.driver_type.
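> A minimal sketch of the consistent behaviour described above (how the mock 
> response is actually built is an assumption here, hence the hypothetical 
> _build_response() helper; the method name is kept exactly as written 
> above):
>     class MockRawResponse(object):
>         def _get_response_if_not_availale(self):
>             # Build the mock response lazily, but always hand back the
>             # same response object so callers can rely on read() and
>             # parse_body() being present.
>             if not self._response:
>                 self._response = self._build_response()
>             return self._response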

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
