GitHub user thelmstedt opened a pull request:
https://github.com/apache/poi/pull/54
Add Image Optimisations
I need to be able to generate spreadsheets with 2000 images fast enough for
a synchronous HTTP request. `3.16` takes ~25 seconds for this use case for me.
These changes take it down to ~1 second. I've added a test for my case, and I
don't get any more test failures than on `trunk`. I don't think I've broken any
invariants, but it's definitely worth a second look!
The slowdown was caused by the cost of creating and sorting
`PackagePartName`s. I assume this is required by the OOXML spec, so there's no
avoiding the overhead entirely. But `addPicture` happened to use them redundantly:
* adding a new relationship enumerated all current relationships, building
a `PackagePartName` for each
* `PackagePart`s were stored in a `TreeMap`
Instead we:
* cache relationship lookups by name (similar to what is already done for
ID and type)
* store `PackagePart`s in a `HashMap` for quick lookups, and explicitly sort
its `.values()`
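To make the two changes above concrete, here is a minimal standalone sketch. It is not the actual patch: `PartIndexSketch`, `Part`, `addPart` and `sortedParts` are hypothetical names, and plain string ordering stands in for whatever ordering `PackagePartName` really defines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the package's part/relationship bookkeeping.
// None of these names come from POI; they only illustrate the idea.
class PartIndexSketch {

    // Simplified part: ordered by name (assumption: plain string ordering).
    static final class Part implements Comparable<Part> {
        final String name;

        Part(String name) {
            this.name = name;
        }

        @Override
        public int compareTo(Part other) {
            return name.compareTo(other.name);
        }
    }

    // Relationship id cached by target part name, so checking for an
    // existing relationship is one hash lookup instead of a full scan.
    private final Map<String, String> relationIdByTarget = new HashMap<>();

    // Parts kept unordered for constant-time insert and lookup.
    private final Map<String, Part> partsByName = new HashMap<>();

    // Adds a part and returns its relationship id, reusing a cached one.
    String addPart(String partName) {
        String existingId = relationIdByTarget.get(partName);
        if (existingId != null) {
            return existingId;
        }
        String id = "rId" + (relationIdByTarget.size() + 1);
        relationIdByTarget.put(partName, id);
        partsByName.put(partName, new Part(partName));
        return id;
    }

    // Sorts a copy of the values only when an ordered view is requested.
    List<Part> sortedParts() {
        List<Part> parts = new ArrayList<>(partsByName.values());
        Collections.sort(parts);
        return parts;
    }

    public static void main(String[] args) {
        PartIndexSketch index = new PartIndexSketch();
        index.addPart("/xl/media/b.png");
        index.addPart("/xl/media/a.png");
        index.addPart("/xl/media/b.png"); // reuses the cached relationship
        for (Part part : index.sortedParts()) {
            System.out.println(part.name);
        }
    }
}
```

The trade-off in this sketch is that each add costs a couple of hash operations, while the sort is paid once per ordered read rather than on every insert.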
The first commit adds a benchmark using JMH
(http://openjdk.java.net/projects/code-tools/jmh/).
Prior to my changes, `addPicture` gets:
```
# Run complete. Total time: 00:00:31
Benchmark                                                          Mode  Cnt        Score         Error   Units
AddImageBench.benchCreatePicture                                   avgt   10     2831.586 ±      38.824   us/op
AddImageBench.benchCreatePicture:·gc.alloc.rate                    avgt   10      810.418 ±      22.303  MB/sec
AddImageBench.benchCreatePicture:·gc.alloc.rate.norm               avgt   10  2407955.352 ±   33327.581    B/op
AddImageBench.benchCreatePicture:·gc.churn.PS_Eden_Space           avgt   10      847.676 ±     361.511  MB/sec
AddImageBench.benchCreatePicture:·gc.churn.PS_Eden_Space.norm      avgt   10  2520570.616 ± 1084187.937    B/op
AddImageBench.benchCreatePicture:·gc.churn.PS_Survivor_Space       avgt   10        0.561 ±       0.645  MB/sec
AddImageBench.benchCreatePicture:·gc.churn.PS_Survivor_Space.norm  avgt   10     1667.673 ±    1912.256    B/op
AddImageBench.benchCreatePicture:·gc.count                         avgt   10       16.000                counts
AddImageBench.benchCreatePicture:·gc.time                          avgt   10       69.000                    ms
AddImageBench.benchCreatePicture:·stack                            avgt               NaN                   ---
```
Afterwards we get roughly a 12x improvement in execution time (2831.586 to 227.339 us/op) and roughly 86x less allocation per operation (2407955.352 to 28021.776 B/op):
```
# Run complete. Total time: 00:00:31
Benchmark                                                          Mode  Cnt        Score         Error   Units
AddImageBench.benchCreatePicture                                   avgt   10      227.339 ±      49.226   us/op
AddImageBench.benchCreatePicture:·gc.alloc.rate                    avgt   10      119.667 ±      25.859  MB/sec
AddImageBench.benchCreatePicture:·gc.alloc.rate.norm               avgt   10    28021.776 ±      54.539    B/op
AddImageBench.benchCreatePicture:·gc.churn.PS_Eden_Space           avgt   10       98.653 ±     314.433  MB/sec
AddImageBench.benchCreatePicture:·gc.churn.PS_Eden_Space.norm      avgt   10    19826.075 ±   63192.153    B/op
AddImageBench.benchCreatePicture:·gc.churn.PS_Survivor_Space       avgt   10        0.228 ±       1.090  MB/sec
AddImageBench.benchCreatePicture:·gc.churn.PS_Survivor_Space.norm  avgt   10       45.594 ±     217.979    B/op
AddImageBench.benchCreatePicture:·gc.count                         avgt   10        2.000                counts
AddImageBench.benchCreatePicture:·gc.time                          avgt   10       88.000                    ms
AddImageBench.benchCreatePicture:·stack                            avgt               NaN                   ---
```
Happy to back out the benchmark inclusion if you don't want to include
another test dependency.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thelmstedt/poi feature/redo
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/poi/pull/54.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #54
commit c26b958ac32c20226db4cb41fb7dda8bc3e9a34f
Author: Tim Helmstedt
Date: 2016-10-23T20:59:16Z

    Benchmark adding images

commit 1d7cf3574016e64e0631556bb50cb466a930c18f
Author: Tim Helmstedt
Date: 2016-10-22T11:06:53Z

    PackageRelationshipCollection caches lookup by targetPart

    Building partnames for all relationships is expensive. Here we avoid
    this in findExistingRelation,