Hi,

Currently, in a Sling based application, when a user uploads a file to the JCR the following sequence of steps is executed:
1. The user uploads the file via an HTTP request, mostly as a multipart form data upload.
2. Sling uses Commons FileUpload to parse the multipart request. It uses a DiskFileItemFactory, which writes the binary content to a temporary file (for file sizes > 256 KB) [1].
3. The servlet then accesses the JCR Session and creates a Binary value by extracting the InputStream.
4. The file content is then spooled into the BlobStore.

Effect of different BlobStores
----------------------------------------

Depending on the type of BlobStore, one of the following code flows would happen:

A - JR2 DataStores - the InputStream is copied to a file
B - S3DataStore - the AWS SDK creates a temporary file and then streams that file's content to S3 [2]
C - Segment - content from the InputStream is stored as part of various segments
D - MongoBlobStore - content from the InputStream is pushed to the remote Mongo via multiple remote calls

Things to note in the above sequence:

1. The uploaded content is copied twice.
2. The whole content is spooled via an InputStream through the JVM heap.

Possible areas of improvement
--------------------------------

1. If the BlobStore ultimately uses a file (on the same hard disk, not NFS), it might be better to *move* the file that was created during upload. This would help the local FileDataStore and the S3DataStore.
2. Avoid spooling via InputStream where possible, as it is slow [3]. In most cases we use an efficient buffered copy, which is only marginally slower than NIO based variants; still, avoiding moving byte[] around might reduce pressure on the GC (probably!).

Changes required
------------------------

It would help to have a way to create JCR Binary implementations that allow the DataStore/BlobStore to transfer content efficiently.

For example, for a file based DataStore the created Binary could keep a reference to the source File object, and that Binary would be passed through the JCR API. The FileDataStore could then treat it differently and move the file instead of copying it (a rough sketch is appended after the references).

Another example is the S3DataStore. In some cases the file has already been transferred to S3 through other means, and the user wants to transfer the S3 file from their bucket to our bucket. A Binary implementation that simply wraps the S3 URL would enable the S3DataStore to transfer the content without streaming it all again [4] (see the second sketch after the references).

Any thoughts on the best way to enable users of Oak to create Binaries via means other than the current model (which only allows creation via an InputStream), and to enable the DataStores to make use of such Binaries?

Chetan Mehrotra

[1] https://github.com/apache/sling/blob/trunk/bundles/engine/src/main/java/org/apache/sling/engine/impl/parameters/ParameterSupport.java#L190
[2] http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/PutObjectRequest.html
[3] http://www.baptiste-wicht.com/2010/08/file-copy-in-java-benchmark/3/
[4] http://stackoverflow.com/questions/9664904/best-way-to-move-files-between-s3-buckets
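
To make the first idea concrete, below is a rough sketch of such a file backed Binary. Note that FileReferenceBinary and the direct file access it exposes do not exist in Oak today; the names are made up purely to illustrate how a cooperating DataStore could detect the backing file and move it instead of spooling the stream.

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import javax.jcr.Binary;
import javax.jcr.RepositoryException;

// Hypothetical Binary that remembers the temporary file created during
// upload. A file based DataStore could check for this type when adding
// a record and call File.renameTo()/Files.move() instead of copying
// the content via InputStream.
public class FileReferenceBinary implements Binary {
    private final File file;

    public FileReferenceBinary(File file) {
        this.file = file;
    }

    // Lets a cooperating DataStore access the backing file directly
    public File getFile() {
        return file;
    }

    @Override
    public InputStream getStream() throws RepositoryException {
        try {
            return new FileInputStream(file);
        } catch (FileNotFoundException e) {
            throw new RepositoryException(e);
        }
    }

    @Override
    public int read(byte[] b, long position) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            raf.seek(position);
            return raf.read(b);
        }
    }

    @Override
    public long getSize() {
        return file.length();
    }

    @Override
    public void dispose() {
        // nothing to release here; the DataStore would take ownership
        // of the file once the value is persisted
    }
}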
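
For the S3 case, a Binary wrapping the location of an object already in S3 would let the S3DataStore ask S3 to copy the object server side, so no bytes flow through the JVM heap at all. This is the bucket-to-bucket copy approach from [4]. Only the AWS SDK copyObject call below is real; the surrounding method and parameter names are assumptions:

import com.amazonaws.services.s3.AmazonS3;

// Sketch: given a hypothetical S3 reference Binary exposing the source
// bucket and key, the S3DataStore could transfer the content like this.
// Objects larger than 5 GB would need a multipart copy instead.
public class S3ServerSideCopy {
    public static void copyIntoDataStore(AmazonS3 s3,
            String srcBucket, String srcKey,
            String dsBucket, String dsKey) {
        // S3 performs the copy internally; the content is never
        // streamed through this JVM
        s3.copyObject(srcBucket, srcKey, dsBucket, dsKey);
    }
}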