[
https://issues.apache.org/jira/browse/PDFBOX-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164470#comment-17164470
]
Andreas Lehmkühler commented on PDFBOX-4895:
--------------------------------------------
I've had a deeper look and like to sum my findings up.
There are different point of views we have to consider
_Java_
* a float value is able to handle bigger numbers than a long value
* precision gets lost if the value is too big for a single precision float
_PDF implementation limits_
* a reader should support single precision float values:
** largest values ± 3.403 × 10 38
** smallest values ± 1.175 × 10 -38, important note from the spec: Values
closer
than these are automatically converted to 0
* int values are limited to -2,147,483,647 < value < 2,147,483,647
* float values maybe written as int values if they don't
_PDFBox implementation_
After the latest changes PDFBox tries to convert a string into a number as
follows:
* detect one digit values (special case) -> COSInteger
* a single "-" or "." leads to COSInteger.ZERO
* try to detect float values by looking for "." and/or "e" -> COSFloat
* anything else should be an int (or a float value without fraction) ->
COSInteger
If the last step fails with a NumberFormatException PDFBox does the following
* before my partial revert: return a COSFloat with ± Float.MAX_VALUE
* after my partial revert: a COSFloat made using the string to be parsed
Both approaches are more or less wrong. The first one produces wrong values as
the origin value is
somewhere in between Long.MAX_VALUE and Float.MAX_VALUE and not
Float.MAX_VALUE. But the second one (mine)
is wrong as well as the value is altered due to the loss of precision.
The origin idea of using a COSFloat in such cases value was born in the context
of PDFBOX-3116. But at that
time the processing order was the other way around. PDFBox started with a
COSInteger and continues with a COSFLoat.
What are possible reasons for a NumberFormatException?
* The string isn't a valid number at all, neither a float nor a int -> throw
IOException
* The string is a valid int which is simply to big to fit into a long ->
that's the question to be answered here
These are the things I don't like to do:
* using a COSFloat isn't a good idea, as those big values are altered due to
the precision loss
* using ± Long.MAX_VALUE isn't a good idea as those values are simply wrong
* simply throwing an IOException would lead to more issues with broken pdfs
especially with the 2.0 parser, which parses all objects when loading the pdf
So why is it a bad idea to use altered int values but a good idea for float
values (at least very small and very big numbers are changed)?
Those int values often represent things like object numbers or the length of
streams and those must not be altered otherwised they won't work at all.
Here is my proposal:
In the case of a valid but too big int we should simply return "null" and let
the caller decide what to do with it. In the case of PDFBOX-3116 everything
is fine without any further changes as the invalid object number isn't used at
all. It just triggers the 2.0 parser at the beginning.
I'd like to give it a try and see if the mass test reveals any issue at all.
WDYT?
> Faster COSNumber
> ----------------
>
> Key: PDFBOX-4895
> URL: https://issues.apache.org/jira/browse/PDFBOX-4895
> Project: PDFBox
> Issue Type: Improvement
> Affects Versions: 2.0.20, 3.0.0 PDFBox
> Reporter: Alfred
> Assignee: Tilman Hausherr
> Priority: Trivial
> Labels: Optimization
> Fix For: 2.0.21, 3.0.0 PDFBox
>
> Attachments: PDFBOX-4895-b.patch, PDFBOX-4895.patch
>
>
> A small improvement can be made to COSNumber when checking if it's float.
> Current version uses indexOf twice, to check for '.' or 'e'.
> We can do that in one scan.
>
> Each call will scan through the entire string.
> We only have to scan until we find the chars, and stop if found.
>
> I found while profiling the code that the method gets called a lot, so the
> improvement makes a a bit of a difference.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]