[ 
https://issues.apache.org/jira/browse/PDFBOX-4895?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164470#comment-17164470
 ] 

Andreas Lehmkühler commented on PDFBOX-4895:
--------------------------------------------

I've had a deeper look and like to sum my findings up.

There are different point of views we have to consider

_Java_
 * a float value is able to handle bigger numbers than a long value
 * precision gets lost if the value is too big for a single precision float

_PDF implementation limits_
 * a reader should support single precision float values:
 ** largest values ± 3.403 × 10 38
 ** smallest values ± 1.175 × 10 -38, important note from the spec: Values 
closer
 than these are automatically converted to 0
 * int values are limited to -2,147,483,647 < value < 2,147,483,647
 * float values maybe written as int values if they don't

_PDFBox implementation_

After the latest changes PDFBox tries to convert a string into a number as 
follows:
 * detect one digit values (special case) -> COSInteger
 * a single "-" or "." leads to COSInteger.ZERO
 * try to detect float values by looking for "." and/or "e" -> COSFloat
 * anything else should be an int (or a float value without fraction) -> 
COSInteger

If the last step fails with a NumberFormatException PDFBox does the following
 * before my partial revert: return a COSFloat with ± Float.MAX_VALUE
 * after my partial revert: a COSFloat made using the string to be parsed

Both approaches are more or less wrong. The first one produces wrong values as 
the origin value is 
 somewhere in between Long.MAX_VALUE and Float.MAX_VALUE and not 
Float.MAX_VALUE. But the second one (mine)
 is wrong as well as the value is altered due to the loss of precision.

The origin idea of using a COSFloat in such cases value was born in the context 
of PDFBOX-3116. But at that
 time the processing order was the other way around. PDFBox started with a 
COSInteger and continues with a COSFLoat.

What are possible reasons for a NumberFormatException?
 * The string isn't a valid number at all, neither a float nor a int -> throw 
IOException
 * The string is a valid int which is simply to big to fit into a long -> 
that's the question to be answered here

These are the things I don't like to do:
 * using a COSFloat isn't a good idea, as those big values are altered due to 
the precision loss
 * using ± Long.MAX_VALUE isn't a good idea as those values are simply wrong
 * simply throwing an IOException would lead to more issues with broken pdfs 
especially with the 2.0 parser, which parses all objects when loading the pdf

So why is it a bad idea to use altered int values but a good idea for float 
values (at least very small and very big numbers are changed)?
 Those int values often represent things like object numbers or the length of 
streams and those must not be altered otherwised they won't work at all.

Here is my proposal:

In the case of a valid but too big int we should simply return "null" and let 
the caller decide what to do with it. In the case of PDFBOX-3116 everything
 is fine without any further changes as the invalid object number isn't used at 
all. It just triggers the 2.0 parser at the beginning.

I'd like to give it a try and see if the mass test reveals any issue at all.

WDYT?

> Faster COSNumber
> ----------------
>
>                 Key: PDFBOX-4895
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4895
>             Project: PDFBox
>          Issue Type: Improvement
>    Affects Versions: 2.0.20, 3.0.0 PDFBox
>            Reporter: Alfred
>            Assignee: Tilman Hausherr
>            Priority: Trivial
>              Labels: Optimization
>             Fix For: 2.0.21, 3.0.0 PDFBox
>
>         Attachments: PDFBOX-4895-b.patch, PDFBOX-4895.patch
>
>
> A small improvement can be made to COSNumber when checking if it's float.
> Current version uses indexOf twice, to check for '.' or 'e'.
>  We can do that in one scan.
>  
> Each call will scan through the entire string.
>  We only have to scan until we find the chars, and stop if found.
>  
> I found while profiling the code that the method gets called a lot, so the 
> improvement makes a a bit of a difference.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to