Hi all,

Have submitted a patch for this feature on GitHub (pull request 80). Please
have a look. There are a few points on which I would like your thoughts.

Regards,
Mahendra



On Thu, Dec 20, 2012 at 3:19 PM, Mahendra M (JIRA) <[email protected]> wrote:

> Mahendra M created LIBCLOUD-269:
> -----------------------------------
>
>              Summary: Multipart upload for amazon S3
>                  Key: LIBCLOUD-269
>                  URL: https://issues.apache.org/jira/browse/LIBCLOUD-269
>              Project: Libcloud
>           Issue Type: Improvement
>           Components: Storage
>             Reporter: Mahendra M
>
>
> This patch adds support for streaming data upload using Amazon's multipart
> upload feature, as described at
> http://docs.amazonwebservices.com/AmazonS3/latest/dev/UsingRESTAPImpUpload.html
>
> As per the current behaviour, the upload_object_via_stream() API will buffer
> the entire object in memory and then upload it to S3. This becomes
> problematic with large files (think HD videos of around 4 GB) and is a big
> hit on the performance and memory usage of the Python application.
>
> With this patch, the upload_object_via_stream() API uses the S3 multipart
> upload feature to upload data in 5 MB chunks, reducing the overall memory
> footprint of the application.
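>
> To make the intended usage concrete, here is a rough sketch (the driver
> setup and file names are only illustrative; upload_object_via_stream() is
> the real API under discussion):
>
>   from libcloud.storage.types import Provider
>   from libcloud.storage.providers import get_driver
>
>   driver = get_driver(Provider.S3)('api key', 'api secret')
>   container = driver.get_container('my-videos')
>
>   # With this patch the iterator is consumed in 5 MB parts and each part is
>   # uploaded through the multipart API, instead of reading the whole file
>   # into memory first.
>   with open('movie.mp4', 'rb') as fp:
>       driver.upload_object_via_stream(iterator=fp, container=container,
>                                       object_name='movie.mp4')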
>
> Design of this feature:
> * The S3StorageDriver() is not used just for Amazon S3. It is subclassed
> for other S3-compatible cloud storage providers such as Google Storage.
> * Amazon's S3 multipart upload is not (or may not be) supported by these
> other storage providers, which may prefer the chunked upload mechanism
> instead.
>
> We can solve this problem in two ways:
> 1) Create a new subclass of S3StorageDriver (say AmazonS3StorageDriver)
> which implements the new multipart upload mechanism, while other storage
> providers keep subclassing S3StorageDriver. This is the cleaner approach.
> 2) Introduce an attribute supports_s3_multipart_upload and, based on its
> value, control the callback function passed to the _put_object() API. This
> makes the code look a bit hacky, but it is better for supporting such
> features in the future, since we do not have to keep adding subclasses for
> each feature.
>
> In the current patch I have implemented approach (2), though I prefer (1).
> Once the community has discussed it and stated its preferences, we can
> select a final approach.
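>
> Roughly, approach (2) would look like the following; apart from
> supports_s3_multipart_upload, _put_object() and _upload_multipart(), the
> names and signatures here are only illustrative:
>
>   from libcloud.storage.base import StorageDriver
>
>   class S3StorageDriver(StorageDriver):
>       # Subclasses for other S3-compatible providers can turn this off and
>       # keep the existing chunked upload behaviour.
>       supports_s3_multipart_upload = True
>
>       def upload_object_via_stream(self, iterator, container, object_name,
>                                    extra=None):
>           if self.supports_s3_multipart_upload:
>               # Feed 5 MB parts to S3 through the multipart callback.
>               upload_func = self._upload_multipart
>           else:
>               # Fall back to the plain streaming/chunked upload path.
>               upload_func = self._stream_data
>           return self._put_object(container=container,
>                                   object_name=object_name,
>                                   upload_func=upload_func,
>                                   iterator=iterator, extra=extra)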
>
> Design notes:
> * The implementation has three steps, plus an abort path on failure (a
> simplified sketch of the whole flow follows these notes):
>   1) POST request to /container/object_name?uploads. This returns an XML
> body containing a unique uploadId. This is handled as part of _put_object();
> doing it via _put_object() ensures that all S3-related parameters are set
> correctly.
>   2) Upload each chunk via PUT to
> /container/object_name?partNumber=X&uploadId=*** - this is done via the
> callback passed to _put_object(), named _upload_multipart().
>   3) POST an XML document containing the part numbers and the ETag header
> returned for each part to /container/object_name?uploadId=***, implemented
> via _commit_multipart().
>   4) On any failure in steps (2) or (3), the upload is aborted with a DELETE
> request to /container/object_name?uploadId=****, implemented via
> _abort_multipart().
>
> * The upload chunk size was set to 5 MB - the minimum part size allowed as
> per the Amazon S3 docs.
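>
> A simplified sketch of the flow described above (the real code lives in
> _put_object(), _upload_multipart(), _commit_multipart() and
> _abort_multipart(); parse_upload_id() and build_complete_xml() below are
> only illustrative helpers):
>
>   from libcloud.utils.files import read_in_chunks
>
>   CHUNK_SIZE = 5 * 1024 * 1024  # 5 MB, the minimum part size allowed by S3
>
>   def multipart_upload_sketch(connection, object_path, iterator):
>       # Step 1: initiate the upload; S3 returns an XML body with an uploadId.
>       resp = connection.request(object_path + '?uploads', method='POST')
>       upload_id = parse_upload_id(resp.body)
>
>       try:
>           etags = []
>           part_number = 1
>           # Step 2: upload each 5 MB part; S3 returns an ETag header per part.
>           # All parts except the last one must be at least 5 MB in size.
>           for chunk in read_in_chunks(iterator, CHUNK_SIZE, fill_size=True):
>               resp = connection.request(
>                   '%s?partNumber=%d&uploadId=%s' % (object_path, part_number,
>                                                     upload_id),
>                   method='PUT', data=chunk)
>               etags.append((part_number, resp.headers['etag']))
>               part_number += 1
>
>           # Step 3: commit by POSTing the part numbers and ETags as XML.
>           connection.request('%s?uploadId=%s' % (object_path, upload_id),
>                              method='POST', data=build_complete_xml(etags))
>       except Exception:
>           # Step 4: on any failure, abort so S3 discards the uploaded parts.
>           connection.request('%s?uploadId=%s' % (object_path, upload_id),
>                              method='DELETE')
>           raise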
>
> Other changes:
> * Did some PEP8 cleanup on s3.py
>
> * s3.get_container() used to iterate through the list of containers to find
> the requested entry. This can be simplified by making a HEAD request on the
> container. The only downside is that 'created_time' is not available for the
> container. Let me know if this approach is OK or if I should revert it.
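>
> For reference, the HEAD-based lookup looks roughly like this (error handling
> and the response checks are simplified):
>
>   import httplib  # Python 2
>
>   from libcloud.storage.base import Container
>   from libcloud.storage.types import ContainerDoesNotExistError
>
>   def get_container(self, container_name):
>       # One HEAD request instead of listing every container.
>       response = self.connection.request('/%s' % (container_name),
>                                          method='HEAD')
>       if response.status == httplib.NOT_FOUND:
>           raise ContainerDoesNotExistError(value=None, driver=self,
>                                            container_name=container_name)
>       # created_time cannot be populated from a HEAD response.
>       return Container(name=container_name, extra=None, driver=self)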
>
> * Introduced the following APIs for the S3StorageDriver(), to make some
> functionality easier.
>   get_container_cdn_url()
>   get_object_cdn_url()
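>
> For example (container and object names made up; the exact URL format
> depends on the driver), I expect these to be used as:
>
>   container = driver.get_container('my-videos')
>   obj = driver.get_object('my-videos', 'movie.mp4')
>
>   container_url = driver.get_container_cdn_url(container)
>   object_url = driver.get_object_cdn_url(obj)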
>
> * In libcloud.common.base.Connection, the request() method is the basis for
> all HTTP requests made by libcloud. This method had a limitation which
> became apparent in the S3 multipart upload implementation. For initializing
> an upload, the API invoked is
>   /container/object_name?uploads
> The 'uploads' parameter has to be passed as-is, without any value. If it
> were passed through the "params" argument of request(), it would be encoded
> as 'uploads=***'. To prevent this, the 'action' was set to
> /container/object_name?uploads and slight modifications were made to how
> further parameters are appended.
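>
> A small illustration of the problem and the workaround (Python 2 urllib, as
> used at the time):
>
>   from urllib import urlencode
>
>   # The query string built from 'params' always has 'key=value' pairs ...
>   urlencode({'uploads': ''})   # -> 'uploads=' (not the bare marker)
>
>   # ... so the patch puts the bare marker into 'action' itself, and
>   # request() appends any further parameters with '&' instead of '?'.
>   action = '/container/object_name?uploads'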
>
> This also forced a change in BaseMockHttpObject._get_method_name()
>
> Bug fixes in the test framework
> * While working on the test cases, I noticed a small issue; I am not sure
> whether it is a bug or by design.
>   MockRawResponse._get_response_if_not_availale() would return two
> different values on subsequent invocations:
>      if not self._response:
>          ...
>          return self  <----- this was inconsistent.
>      return self._response
>
>   While adding test cases for the Amazon S3 functionality, I noticed that
> instead of getting back a MockResponse, I was getting a MockRawResponse
> instance (which did not have methods like read() or parse_body()). So I
> fixed this issue. Because of that, other test cases started failing, and
> they were subsequently fixed. I am not sure whether this had to be fixed or
> whether it was done on purpose; if someone can shed some light on it, I can
> work on it further. As of now, all test cases pass.
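>
> The change, roughly (inside MockRawResponse, assuming the early 'return
> self' was unintentional; _build_mock_response() is only a stand-in for the
> elided code shown above):
>
>   def _get_response_if_not_availale(self):
>       if not self._response:
>           self._response = self._build_mock_response()
>       return self._response   # always the same object; never 'return self'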
>
> * In test_s3.py, the driver was hard-coded everywhere to S3StorageDriver.
> The same test case is reused for GoogleStorageDriver, where the driver then
> turned up as S3StorageDriver instead of GoogleStorageDriver. This was fixed
> by changing the code to driver=self.driver_type.
>
>



-- 
Mahendra

http://twitter.com/mahendra
