Re: Performance of the trunkversion

Timo Boehme Wed, 15 Jul 2015 03:29:08 -0700

Hi Manfred,

there is another update of ScratchFile. It is now able to use a certain amount of main memory before using the scratch file. Could you give it a try? You will have to change the source a bit, since the constructor taking the allowed amount of memory is currently not supported by the PDDocument class. Simply change


    public ScratchFile(File scratchFileDirectory) throws IOException
    {
        this(scratchFileDirectory, 0);
    }

to

    public ScratchFile(File scratchFileDirectory) throws IOException
    {
        this(scratchFileDirectory, XXXXXX);
    }

where XXXXXX is the amount of main memory to be used for buffers in bytes.
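
To illustrate what "use a certain amount of main memory before using the scratch file" can mean, here is a minimal sketch. It is not the code attached to PDFBOX-2882; the class name, the method names and the 4096-byte page size are assumptions made only for this illustration. Pages are kept in memory until the byte budget is used up, and only later pages go to the file:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: pages stay in memory until a byte budget is used up, then go to a file. */
class MemoryFirstPageStoreSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096;          // assumed page size in bytes

    private final int maxPagesInMemory;         // budget converted to whole pages
    private final List<byte[]> memoryPages = new ArrayList<>();
    private final File file;
    private RandomAccessFile raf;               // opened lazily, only once memory is exhausted
    private int filePages = 0;

    MemoryFirstPageStoreSketch(File file, long maxMainMemoryBytes) {
        this.file = file;
        this.maxPagesInMemory = (int) Math.min(Integer.MAX_VALUE, maxMainMemoryBytes / PAGE_SIZE);
    }

    /** Stores one page and returns its index; the first pages stay in main memory. */
    int writePage(byte[] page) throws IOException {
        if (page.length != PAGE_SIZE) {
            throw new IllegalArgumentException("page must have " + PAGE_SIZE + " bytes");
        }
        if (memoryPages.size() < maxPagesInMemory) {
            memoryPages.add(page.clone());
            return memoryPages.size() - 1;
        }
        if (raf == null) {
            raf = new RandomAccessFile(file, "rw");
        }
        raf.seek((long) filePages * PAGE_SIZE);
        raf.write(page);
        return maxPagesInMemory + filePages++;
    }

    /** Returns a copy of the page, read from memory or from the file depending on its index. */
    byte[] readPage(int index) throws IOException {
        if (index < 0 || index >= memoryPages.size() + filePages) {
            throw new IndexOutOfBoundsException("page " + index);
        }
        if (index < memoryPages.size()) {
            return memoryPages.get(index).clone();
        }
        byte[] page = new byte[PAGE_SIZE];
        raf.seek((long) (index - maxPagesInMemory) * PAGE_SIZE);
        raf.readFully(page);
        return page;
    }

    /** True once at least one page had to be written to the file. */
    boolean usesFile() {
        return raf != null;
    }

    @Override
    public void close() throws IOException {
        if (raf != null) {
            raf.close();
        }
    }
}
```

With a budget of 0 every page goes to the file, which corresponds to the unchanged constructor above; with a budget larger than the decoded streams the file is never touched.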

If you use a larger value and the performance is still not the same as or better than the May version, then at least the buffer handling for streams is not the problem.



Best,
Timo


On 15.07.2015 at 12:20, Manfred Pock wrote:

Hi Timo,

I have tested it with different PDFs and the performance is nearly that of
the version from May. Just a little bit slower.

That will be OK, but it would be nice if it performed better ;-)

Thanks and regards,
Manfred

On 15.07.2015 at 10:24, Timo Boehme wrote:

Hi Manfred,

the issue should be fixed in the updated versions attached to
PDFBOX-2882. Please give them a try.


Timo


On 15.07.2015 at 09:51, Manfred Pock wrote:

Hi Timo,

I have tried it, but it doesn't work now and I get various exceptions
or errors.

It looks like there is a problem with every kind of image; the rest
is shown.

For example:

SCHWERWIEGEND: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
java.io.IOException: TIFFFaxDecoder: Invalid code encountered while decoding 2D group 4 compressed data.
     at org.apache.pdfbox.filter.ccitt.TIFFFaxDecoder.decodeT6(TIFFFaxDecoder.java:1125)
     at org.apache.pdfbox.filter.CCITTFaxFilter.decode(CCITTFaxFilter.java:94)
     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
     at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:278)
     at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:120)
     at org.apache.pdfbox.pdmodel.graphics.PDXObject.createXObject(PDXObject.java:67)
     at org.apache.pdfbox.pdmodel.PDResources.getXObject(PDResources.java:340)

SCHWERWIEGEND: FlateFilter: stop reading corrupt stream due to a DataFormatException
Jul 15, 2015 9:45:05 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNUNG: java.util.zip.DataFormatException: invalid block type

Jul 15, 2015 9:46:18 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNUNG: Not a JPEG file: starts with 0xe0 0x00

Jul 15, 2015 9:46:23 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
WARNUNG: Image stream was not read - filter: DCTDecode

SCHWERWIEGEND: java.util.zip.DataFormatException: invalid distance too far back
java.io.IOException: java.util.zip.DataFormatException: invalid distance too far back
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:83)
     at org.apache.pdfbox.cos.COSStream.attemptDecode(COSStream.java:422)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:398)
     at org.apache.pdfbox.cos.COSStream.doDecode(COSStream.java:335)
     at org.apache.pdfbox.cos.COSStream.checkUnfilteredBuffer(COSStream.java:265)
     at org.apache.pdfbox.cos.COSStream.getUnfilteredRandomAccess(COSStream.java:239)
     at org.apache.pdfbox.pdfparser.BaseParser.<init>(BaseParser.java:146)
     at org.apache.pdfbox.pdfparser.PDFStreamParser.<init>(PDFStreamParser.java:78)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStreamOperators(PDFStreamEngine.java:451)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processStream(PDFStreamEngine.java:438)
     at org.apache.pdfbox.contentstream.PDFStreamEngine.processPage(PDFStreamEngine.java:149)
     at org.apache.pdfbox.rendering.PageDrawer.drawPage(PageDrawer.java:180)
     at org.apache.pdfbox.rendering.PDFRenderer.renderPage(PDFRenderer.java:205)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImage(PDFRenderer.java:136)
     at org.apache.pdfbox.rendering.PDFRenderer.renderImageWithDPI(PDFRenderer.java:95)
.... Caused by: java.util.zip.DataFormatException: invalid distance too far back
     at java.util.zip.Inflater.inflateBytes(Native Method)
     at java.util.zip.Inflater.inflate(Inflater.java:259)
     at java.util.zip.Inflater.inflate(Inflater.java:280)
     at org.apache.pdfbox.filter.FlateFilter.decompress(FlateFilter.java:101)
     at org.apache.pdfbox.filter.FlateFilter.decode(FlateFilter.java:74)

On 15.07.2015 at 00:35, Timo Boehme wrote:

I've created PDFBOX-2882 with a drop-in replacement of the scratch
file implementation.
@Manfred: Could you please test if this helps in your scenario to
increase performance?

Best,
Timo


On 14.07.2015 at 13:47, Timo Boehme wrote:

Hi,

instead of having a linked page list in ScratchFileBuffer I would
propose having a list of pages with the page numbers (integer) kept in
memory (takes 1k for 1MB data). This would ease page handling, seeking
does not need I/O-operations and caching of pages would be a lot
easier.
I may find some time later to come up with such a replacement.
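
A rough sketch of such a page list, not the eventual replacement: every name below is hypothetical, and the 4096-byte page size and 4-byte page numbers are assumptions under which the "1k for 1MB" estimate works out (1 MB / 4096 bytes = 256 pages, 256 * 4 bytes = 1 KB). Seeking becomes plain arithmetic on the in-memory array:

```java
import java.util.Arrays;

/** Hypothetical sketch: a buffer that remembers which scratch-file pages it owns. */
class PageIndexSketch {
    static final int PAGE_SIZE = 4096;        // assumed page size in bytes

    private int[] pageNumbers = new int[16];  // scratch-file page number of each buffer page
    private int pageCount = 0;

    /** Registers a newly allocated scratch-file page as the next page of this buffer. */
    void addPage(int scratchFilePageNumber) {
        if (pageCount == pageNumbers.length) {
            pageNumbers = Arrays.copyOf(pageNumbers, pageCount * 2);
        }
        pageNumbers[pageCount++] = scratchFilePageNumber;
    }

    /** Maps a buffer position to an absolute file offset using only in-memory data. */
    long fileOffset(long position) {
        long index = position / PAGE_SIZE;
        if (position < 0 || index >= pageCount) {
            throw new IndexOutOfBoundsException("position " + position);
        }
        return (long) pageNumbers[(int) index] * PAGE_SIZE + position % PAGE_SIZE;
    }

    /** Bytes of index needed for dataBytes of data: one 4-byte int per page. */
    static long indexBytes(long dataBytes) {
        long pages = (dataBytes + PAGE_SIZE - 1) / PAGE_SIZE;
        return pages * Integer.BYTES;
    }
}
```

Compared with a linked page list, no page has to be read from the file just to find out where the next one is.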

Best,
Timo


On 14.07.2015 at 13:02, Timo Boehme wrote:

Hi,

as I see it (I had only a quick look at the implementation), the
ScratchFileBuffer implementation is not optimal for fast random access.
Single-byte writes are not buffered but written directly to the file
(a lot of I/O operations), and seek operations have to traverse the
linked page list, reading some bytes of each page (again a lot of seek
and read I/O operations).
To speed things up it is crucial to minimize the number of I/O
operations going directly to the random access file. Therefore it is
necessary to buffer writes, keep the last read page in memory for
sequential reads, and have an in-memory cache of page metadata (offset,
link to previous/next page).
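
As a minimal sketch of the first of these points only (buffering writes), assuming a 4096-byte page; the class and its counter are hypothetical and not part of PDFBox. Single-byte writes are collected in memory and reach the RandomAccessFile one page at a time:

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical sketch: collects single-byte writes in memory and flushes them page-wise. */
class BufferedPageWriterSketch implements AutoCloseable {
    static final int PAGE_SIZE = 4096;   // assumed page size in bytes

    private final RandomAccessFile raf;
    private final byte[] page = new byte[PAGE_SIZE];
    private int used = 0;                // bytes currently held in the in-memory page
    private int fileWrites = 0;          // number of write calls that reached the file

    BufferedPageWriterSketch(File file) throws IOException {
        raf = new RandomAccessFile(file, "rw");
    }

    /** Stores one byte in memory; touches the file only when the page is full. */
    void write(int b) throws IOException {
        page[used++] = (byte) b;
        if (used == PAGE_SIZE) {
            flush();
        }
    }

    /** Writes the buffered bytes to the file with a single I/O call. */
    void flush() throws IOException {
        if (used > 0) {
            raf.write(page, 0, used);
            fileWrites++;
            used = 0;
        }
    }

    /** Number of write calls that actually went to the file so far. */
    int fileWrites() {
        return fileWrites;
    }

    @Override
    public void close() throws IOException {
        flush();
        raf.close();
    }
}
```

Writing 10,000 single bytes this way results in three file writes instead of 10,000.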


Best,
Timo


On 14.07.2015 at 12:15, Manfred Pock wrote:

Yes, the input is an InputStream. I can try it directly from a file.

But in general we get the PDF from a document management system as a
stream.
Does it make sense to save the PDF to a file first?

Why is there such a big performance difference between the version from
May and the current version when we use useScratchFiles = true?

Regards, Manfred

On 14.07.2015 at 12:02, Andreas Lehmkühler wrote:

Hi,

On 14 July 2015 at 11:39, Manfred Pock <[email protected]> wrote:


OK, we load the PDFs with useScratchFiles = true. If we load them with
false, the performance is better, but still a little slower than the
old version.

What do you use as input, a stream or a real file? If the latter, you
should use the load method with the file parameter.

PDFBox needs random access to the PDF, and if a stream is provided,
PDFBox copies the data to a file (lower memory usage, slower
performance) or to memory (higher memory usage, better performance).
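
If the PDF only arrives as a stream, one option is to spool it to a temporary file first and then hand that file to the file-based load method. The helper below is a hypothetical sketch using only java.nio; whether this is faster than letting PDFBox copy the stream itself has not been measured here:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: saves an incoming stream to a temporary file. */
class StreamToFileSketch {
    /** Copies the stream to a new temporary file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("pdfbox-input-", ".pdf");
        // createTempFile already created the file, so the copy has to replace it
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }
}
```

The returned path can be converted with toFile() and passed to the load method that takes a file; the caller should delete the temporary file once the document has been closed.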

BR
Andreas

But now it needs more memory. I cannot load some PDFs with the current
version using the same Java memory configuration.

On 14.07.2015 at 11:26, Manfred Pock wrote:

Hi,

we use the PDFBox trunk version to render PDFs; currently we use the
version from 12 May 2015.

Today I updated to the current version and tested it.
It now seems to need much more time to render PDFs, depending on the
size of the PDF.

For example, you can try this one:

http://cloud.directupload.net/15bu

It needs five times longer than the version from May 2015.

Regards, Manfred

---------------------------------------------------------------------

To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]






--
Timo Boehme
OntoChem IT Solutions GmbH
Blücherstraße 24
06120 Halle (Saale)
Germany

phone: +49 345 478 047 4      | fax: +49 345 478 047 1
email: [email protected] | web: www.ontochem.com
HRB 21962 Amtsgericht Stendal | USt-IdNr.: DE815563824
managing director : Lutz Weber


