Hi,

@Mel good to see you are still around

Am 07.03.22 um 14:55 schrieb Martinez, Mel - 0441 - MITLL:
I’m mainly a lurker on PDFBox these days, as I have moved to focus on other 
things (though ironically my latest tasking does indirectly make use of PDFBox 
again!) and I am not familiar with the details of the code but I would offer 
the following advice on this:

If two objects are:

a) the same type and
b) have the same value (i.e., evaluate:  obj1.equals(obj2) == true ) and
c) are immutable (i.e., their value cannot be changed once constructed)

Then they should have the same hashcode.

Conceptually such objects represent the same exact, immutable, information.  
This is why two String objects that both hold the same character sequence have 
the same hashcode.   These sort of immutable objects are considered 
interchangeable when they have the same data value.  Code execution is exactly 
the same regardless of which object instance you use for a given sequence of 
code.  Numbers such as Integers, Longs, Floats and Doubles and Booleans also 
all represent immutable information and the same rules apply.  The number “5” 
is informationally identical throughout the universe and indeed all references 
to “5” are really references to the same immutable information.

A huge advantage of treating such equal-value, same-type objects 
interchangeably (by giving them identical hashcodes) is that they can be used 
with Object Pooling to reduce memory and improve performance.

If the object types are not immutable, however — i.e., if it is possible for 
their values to be modified (such as with setters or other mutators) then 
whether they should have the same hashcode depends on how they are used.  Do 
they have other data fields that are not being taken into consideration when 
the hash is calculated?   Do they have transient fields that are not maintained 
across serialization?

Hashcodes usually (not always) should be persistent through the life cycle of 
the object.  If you put an object in a hashmap, the internal bucket it gets 
dropped into will be based on the hash and you (normally) don’t want that 
changing while the object reference is stored in the map.

I can not recall enough detail about the PDFBox codebase or the COSxxxx 
wrappers in particular to be able to assert how these points apply so I am just 
offering these concepts here for folks to keep in mind.
I'm afraid it is more complicated than that and I myself didn't think about all pitfalls.

@Mel It is correct that immutable objects sharing the same value and type should be equal and have the same hashcode at least from a programmers point of view.

Those simple COSBase-objects such as COSBoolean, COSInteger, COSFloat and COSName seem to match those characteristics. But the base class COSBase has to changeable attributes ("key" and "direct") so that instances of those classes aren't immutable and don't qualify for your definition.

But after reading yours and Maruans answer I rethink my proposal and withdraw it. The whole thing isn't that easy. I guess we have to overhaul to whole direct/indirect stuff and some more. And maybe we will find a more elegant way to represent COS-objects, so that we don't have think about equals and hashCode anymore.

Looks like there are still a lot interesting challenges left ;-)

Andreas


I’m sure you guys will make the right design decision.   And thanks a ton for 
all the work you guys have done over the years.

Mel

Dr. Mel Martinez
m.marti...@ll.mit.edu<mailto:m.marti...@ll.mit.edu>


On Mar 5, 2022, at 10:30 AM, Andreas Lehmkuehler 
<andr...@lehmi.de<mailto:andr...@lehmi.de>> wrote:

Hi,

I'm not sure if we dicussed that topic in the past or if I simply mixed it up with a discussion 
about "equals" and "="

However, PDFBOX-5286 shows the we have an issue with objects which aren't the 
same but are treated as the same because of the same hash. This is true for all 
simple objects such as COSInteger, COSFLoat, COSBoolean and COSName.

Think about the following two indirect /Length objects

100 0 obj
512
endobj


200 0 obj
512
endobj

* there two different COSObjects "100 0" and "200 0"
* both COSObjects have different hashes
* both COSObjects are referencing a COSInteger holding the same value "512"
* both COSIntegers are different objects
* both COSIntegers have the SAME hash, as the current implementation of 
hashCode is based on the value of the COSInteger

Or some pseudo code

COSObject(100,0) != COSObject(200,0)
COSInteger(100,0) != COSInteger(200,0)
COSObject(100,0).hashCode != COSObject(200,0).hashCode
COSInteger(100,0).hashCode == COSInteger(200,0).hashCode
COSInteger(100,0).equals(COSInteger(200,0) == true

IMHO we should change the implementation of hashCode so that different objects 
will have different hashCodes.

I expect some side effects
* we are using a lot of hash-based collections and I'm afraid there may be some 
cases where the fact of having the same hash for different objects is wanted 
(knowingly or not)
* we have to remove the static instances for COSInteger values in a range from 
-100 to 256 which will result in an increased number of COSInteger instances
* there are just two static instances of COSBoolean ("true" and "false") which 
have to be replaced too
* COSName is caching a lot of values as static instances as well, which should 
be removed as well
* looks like COSFloat shouldn't be a problem

WDYT? Should we simply start with COSFloat and COSInteger and see how it ends 
up?

Andreas










---------------------------------------------------------------------
To unsubscribe, e-mail: 
dev-unsubscr...@pdfbox.apache.org<mailto:dev-unsubscr...@pdfbox.apache.org>
For additional commands, e-mail: 
dev-h...@pdfbox.apache.org<mailto:dev-h...@pdfbox.apache.org>




---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org

Reply via email to