Hi,

Currently, in a Sling based application, when a user uploads a file
to the JCR the following sequence of steps is executed:

1. The user uploads the file via an HTTP request, mostly as a
multipart form-data based upload

2. Sling uses Commons FileUpload to parse the multipart request. It
uses a DiskFileItemFactory, which writes the binary content to a
temporary file (for file sizes > 256 KB) [1]

3. Later the servlet accesses the JCR Session and creates a Binary
value from the extracted InputStream

4. The file content is then spooled into the BlobStore (a minimal
sketch of this flow follows)
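
To make the current mode concrete, here is a minimal sketch of steps
3 and 4 (the servlet fragment and names are illustrative; the only
actual JCR API involved is ValueFactory.createBinary(InputStream)):

    import java.io.InputStream;

    import javax.jcr.Binary;
    import javax.jcr.Node;
    import javax.jcr.Session;
    import javax.jcr.ValueFactory;

    public class UploadSketch {
        // The uploaded content is only available as an InputStream,
        // so the whole file is streamed through the JVM heap into
        // the BlobStore
        void storeUpload(Session session, Node contentNode,
                InputStream uploadedStream) throws Exception {
            ValueFactory vf = session.getValueFactory();
            Binary binary = vf.createBinary(uploadedStream);
            contentNode.setProperty("jcr:data", binary);
            session.save();
            binary.dispose();
        }
    }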

Effect of different BlobStores
----------------------------------------

Now, depending on the type of BlobStore, one of the following code
flows would happen:

A - JR2 DataStores - the InputStream would be copied to a file
B - S3DataStore - the AWS SDK would create a temporary file and then
stream that file's content back to S3 [2]
C - Segment - content from the InputStream would be stored as part of
various segments
D - MongoBlobStore - content from the InputStream would be pushed to
the remote Mongo via multiple remote calls

Things to note in the above sequence:

1. The uploaded content is copied twice.
2. The whole content is spooled via an InputStream through the JVM heap.

Possible areas of Improvement
--------------------------------

1. If the BlobStore ultimately stores its content in a file (on the
same hard disk, not NFS) then it might be better to *move* the file
which was created during upload. This would help the local
FileDataStore and the S3DataStore (see the sketch after this list)

2. Avoid spooling via InputStream if possible. Spooling via an
InputStream is slow [3]. Though in most cases we use an efficient
buffered copy, which is only marginally slower than NIO based
variants, avoiding moving byte[] through the heap might reduce
pressure on the GC (probably!)
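
A minimal sketch of the two paths, assuming the upload temp file and
the DataStore directory live on the same local file system (all names
here are illustrative):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;

    public class IngestSketch {
        void ingest(Path uploadTempFile, Path dataStoreTarget)
                throws IOException {
            try {
                // Fast path: an atomic move is just a rename; no
                // content passes through the JVM heap
                Files.move(uploadTempFile, dataStoreTarget,
                        StandardCopyOption.ATOMIC_MOVE);
            } catch (IOException e) {
                // Fallback: the usual buffered copy via the heap
                try (InputStream in = Files.newInputStream(uploadTempFile);
                        OutputStream out = Files.newOutputStream(dataStoreTarget)) {
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = in.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                }
            }
        }
    }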

Changes required
------------------------

If we had a way to create JCR Binary implementations which enable the
DataStore/BlobStore to transfer content efficiently, that would help.

For example, for a file based DataStore the Binary created can keep a
reference to the source File object, and that Binary is used in the
JCR API. Eventually the FileDataStore can treat it in a different way
and move the file, along the lines of the sketch below.
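
A rough sketch of such a Binary (the class name and the getFile()
accessor are purely illustrative; the other methods are those of the
javax.jcr.Binary contract):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.RandomAccessFile;

    import javax.jcr.Binary;
    import javax.jcr.RepositoryException;

    public class FileBackedBinary implements Binary {
        private final File file;

        public FileBackedBinary(File file) {
            this.file = file;
        }

        // A FileDataStore could check for this type and move the
        // file instead of spooling its content
        public File getFile() {
            return file;
        }

        public InputStream getStream() throws RepositoryException {
            try {
                return new FileInputStream(file);
            } catch (IOException e) {
                throw new RepositoryException(e);
            }
        }

        public int read(byte[] b, long position) throws IOException {
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(position);
                return raf.read(b);
            }
        }

        public long getSize() {
            return file.length();
        }

        public void dispose() {
            // Nothing to release; after the transfer the DataStore
            // owns the file
        }
    }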

Another example is the S3DataStore - in some cases the file has
already been transferred to S3 by other means, and the user wants to
transfer the S3 file from its bucket to our bucket. So a Binary
implementation which just wraps the S3 URL would enable the
S3DataStore to transfer the content without streaming all the content
again [4], roughly as sketched below.
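
For illustration, such a server-side copy is a single call in the AWS
SDK (the bucket and key names are made up):

    import com.amazonaws.services.s3.AmazonS3;

    public class S3CopySketch {
        // Copies the object within S3; the content never leaves AWS
        // and is not streamed through the JVM
        void transfer(AmazonS3 s3) {
            s3.copyObject("user-bucket", "uploads/file.bin",
                    "datastore-bucket", "blobs/file.bin");
        }
    }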

Any thoughts on the best way to enable users of Oak to create
Binaries via other means (compared to the current mode, which only
allows creation via an InputStream) and to enable the DataStores to
make use of such binaries?

Chetan Mehrotra

[1] https://github.com/apache/sling/blob/trunk/bundles/engine/src/main/java/org/apache/sling/engine/impl/parameters/ParameterSupport.java#L190
[2] http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
[3] http://www.baptiste-wicht.com/2010/08/file-copy-in-java-benchmark/3/
[4] http://stackoverflow.com/questions/9664904/best-way-to-move-files-between-s3-buckets
