Hi,
@Mel good to see you are still around
Am 07.03.22 um 14:55 schrieb Martinez, Mel - 0441 - MITLL:
I’m mainly a lurker on PDFBox these days, as I have moved to focus on other
things (though ironically my latest tasking does indirectly make use of PDFBox
again!) and I am not familiar with the details of the code but I would offer
the following advice on this:
If two objects are:
a) the same type and
b) have the same value (i.e., evaluate: obj1.equals(obj2) == true ) and
c) are immutable (i.e., their value cannot be changed once constructed)
Then they should have the same hashcode.
Conceptually such objects represent the same exact, immutable, information.
This is why two String objects that both hold the same character sequence have
the same hashcode. These sort of immutable objects are considered
interchangeable when they have the same data value. Code execution is exactly
the same regardless of which object instance you use for a given sequence of
code. Numbers such as Integers, Longs, Floats and Doubles and Booleans also
all represent immutable information and the same rules apply. The number “5”
is informationally identical throughout the universe and indeed all references
to “5” are really references to the same immutable information.
A huge advantage of treating such equal-value, same-type objects
interchangeably (by giving them identical hashcodes) is that they can be used
with Object Pooling to reduce memory and improve performance.
If the object types are not immutable, however — i.e., if it is possible for
their values to be modified (such as with setters or other mutators) then
whether they should have the same hashcode depends on how they are used. Do
they have other data fields that are not being taken into consideration when
the hash is calculated? Do they have transient fields that are not maintained
across serialization?
Hashcodes usually (not always) should be persistent through the life cycle of
the object. If you put an object in a hashmap, the internal bucket it gets
dropped into will be based on the hash and you (normally) don’t want that
changing while the object reference is stored in the map.
I can not recall enough detail about the PDFBox codebase or the COSxxxx
wrappers in particular to be able to assert how these points apply so I am just
offering these concepts here for folks to keep in mind.
I'm afraid it is more complicated than that and I myself didn't think about all
pitfalls.
@Mel It is correct that immutable objects sharing the same value and type should
be equal and have the same hashcode at least from a programmers point of view.
Those simple COSBase-objects such as COSBoolean, COSInteger, COSFloat and
COSName seem to match those characteristics. But the base class COSBase has to
changeable attributes ("key" and "direct") so that instances of those classes
aren't immutable and don't qualify for your definition.
But after reading yours and Maruans answer I rethink my proposal and withdraw
it. The whole thing isn't that easy. I guess we have to overhaul to whole
direct/indirect stuff and some more. And maybe we will find a more elegant way
to represent COS-objects, so that we don't have think about equals and hashCode
anymore.
Looks like there are still a lot interesting challenges left ;-)
Andreas
I’m sure you guys will make the right design decision. And thanks a ton for
all the work you guys have done over the years.
Mel
Dr. Mel Martinez
m.marti...@ll.mit.edu<mailto:m.marti...@ll.mit.edu>
On Mar 5, 2022, at 10:30 AM, Andreas Lehmkuehler
<andr...@lehmi.de<mailto:andr...@lehmi.de>> wrote:
Hi,
I'm not sure if we dicussed that topic in the past or if I simply mixed it up with a discussion
about "equals" and "="
However, PDFBOX-5286 shows the we have an issue with objects which aren't the
same but are treated as the same because of the same hash. This is true for all
simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.
Think about the following two indirect /Length objects
100 0 obj
512
endobj
200 0 obj
512
endobj
* there two different COSObjects "100 0" and "200 0"
* both COSObjects have different hashes
* both COSObjects are referencing a COSInteger holding the same value "512"
* both COSIntegers are different objects
* both COSIntegers have the SAME hash, as the current implementation of
hashCode is based on the value of the COSInteger
Or some pseudo code
COSObject(100,0) != COSObject(200,0)
COSInteger(100,0) != COSInteger(200,0)
COSObject(100,0).hashCode != COSObject(200,0).hashCode
COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
COSInteger(100,0).equals(COSInteger(200,0) == true
IMHO we should change the implementation of hashCode so that different objects
will have different hashCodes.
I expect some side effects
* we are using a lot of hash-based collections and I'm afraid there may be some
cases where the fact of having the same hash for different objects is wanted
(knowingly or not)
* we have to remove the static instances for COSInteger values in a range from
-100 to 256 which will result in an increased number of COSInteger instances
* there are just two static instances of COSBoolean ("true" and "false") which
have to be replaced too
* COSName is caching a lot of values as static instances as well, which should
be removed as well
* looks like COSFloat shouldn't be a problem
WDYT? Should we simply start with COSFloat and COSInteger and see how it ends
up?
Andreas
---------------------------------------------------------------------
To unsubscribe, e-mail:
dev-unsubscr...@pdfbox.apache.org<mailto:dev-unsubscr...@pdfbox.apache.org>
For additional commands, e-mail:
dev-h...@pdfbox.apache.org<mailto:dev-h...@pdfbox.apache.org>
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org