Re: Default hash code generation strategy for new binary objects

Dmitriy Setrakyan Thu, 29 Sep 2016 09:08:11 -0700

Alexey, in your opinion, what will be faster, the binary array comparison
or field-by-field comparison?


On Thu, Sep 29, 2016 at 8:39 AM, Alexey Goncharuk <
alexey.goncha...@gmail.com> wrote:

> Folks, let me point out a few obvious (or not) things
>
> A set of fields participating in hashCode and equals is impossible to
> change without cluster restart. Imagine a new client adding a field F to
> key structure (A, B) so that new key is (A, B, F). In this case key (1, 2,
> 0) will be treated as a different key w/respect to (1, 2, 3), while the new
> client expects them to be the same. I think the set of fields defining the
> key _must_ be used in both hash code and equals calculation and cannot
> change over time. Having said that, we can use binary array comparison as
> object equality test. Presence of new 'garbage' fields in key looks useless
> to me.
>
> So, we should support:
> 1) All the things Dmitriy pointed out, in other words - multi-field key
> with affinity key fields
> 2) Ability to instantiate the key using a text-only environment
> 3) Ability to define the structure of a key using XML or DDL
>
> I think all of these can be done via combination of HashCodeResolver
> interface + default resolver which can read type configuration.
>
> 2016-09-29 3:54 GMT+03:00 Dmitriy Setrakyan <dsetrak...@apache.org>:
>
> > Guys,
> >
> > We need to look at 3 cases:
> >
> > a) key is just one field
> > b) key is multiple fields
> > c) key is one or multiple fields, with possibility of an alternate
> affinity
> > key
> >
> > For (a) and (b), whenever a type is defined in XML, and further in DML, a
> > user will specify which fields are part of the key. In that case, we
> should
> > just grab those fields and calculate the hashcode automatically.
> >
> > For (c), a user should specify in XML, and further in DML, which fields
> > should be used for affinity, in addition to the key fields. In that case,
> > we again should grab those fields and calculate the hashcodes for the
> > primary key and the affinity key.
> >
> > I really am not sure if there are other ways of doing it. Am I missing
> > something?
> >
> > D.
> >
> >
> > On Wed, Sep 28, 2016 at 5:14 PM, Denis Magda <dma...@gridgain.com>
> wrote:
> >
> > > Let me show the picture I have in my mind:
> > >
> > > Primary key is a must for all INSERT and MERGE operations. If it’s not
> > set
> > > then an INSERT/MERGE fails.
> > > If a primary key is a boxed/unboxed primitive (int, Integer, String,
> > Date,
> > > etc.) then the key value is used for hashCode calculation. At the same
> > time
> > > the key will be an affinity key.
> > > If a primary key is a custom object then it’s value can be passed as a
> > > param directly from Java, .Net, C++ in a way like “INSERT (_key,
> field1,
> > > field2) VALUES (?, val1, val2)”. In this scenario we will call hashCode
> > > directly on the key's value. In addition, we will be able to get an
> > > affinity key since we know key’s type class descriptor
> > > (BinaryClassDescriptor).
> > > If a primary key is still a custom key and we want to insert its value
> > > from an SQL console, PHP, Tableu, etc. then we can’t pass the key’s
> value
> > > as is. Here we’re trying to apply a workaround by listing key's fields
> in
> > > INSERT/MERGE and the task is to properly re-construct the key on our
> side
> > > using only specific fields.
> > >
> > > Is my understanding correct for all the bullets above?
> > >
> > > If so then, yes, I would agree that we need to list these fields in a
> > > configuration and the default hash code resolver will use them as well.
> > > Moreover, we have to pin point an affinity field. So, the question is
> > what
> > > the configuration we should use.
> > >
> > > Community, any other thoughts/ideas?
> > >
> > > —
> > > Denis
> > >
> > > > On Sep 28, 2016, at 4:16 PM, Alexander Paschenko <
> > > alexander.a.pasche...@gmail.com> wrote:
> > > >
> > > > Also MERGE.
> > > >
> > > > 2016-09-29 2:10 GMT+03:00 Denis Magda <dma...@gridgain.com>:
> > > >> You need a hash code only for INSERT operation, right?
> > > >>
> > > >> —
> > > >> Denis
> > > >>
> > > >>> On Sep 28, 2016, at 3:47 PM, Alexander Paschenko <
> > > alexander.a.pasche...@gmail.com> wrote:
> > > >>>
> > > >>> But what if the user works from some kind of console and just types
> > > >>> the queries as text in full and does not bind params via JDBC or
> > > >>> something alike? What if there's no binary object? I don't see why
> we
> > > >>> should keep the user from usual cache gets in this case. I really
> > like
> > > >>> the idea of supplying the values of distinct fields, thus freeing
> the
> > > >>> user of the need to mess with objects and builders, AND then just
> > > >>> calculating hash code as suggested before - say, via explicitly
> > > >>> listing participating fields in XML or by marking them with
> transient
> > > >>> keyword or some annotation.
> > > >>> Actually, I believe that's the only case when we need to generate
> any
> > > >>> hash codes - when the class is present, we can just get hash code
> > from
> > > >>> its implementation of its method. When there's no class, we
> generate.
> > > >>> And all that is solely for SQL. For the rest - just throw an
> > exception
> > > >>> when there's no hash code manually set for binary object. I don't
> see
> > > >>> why we should try to generate anything when the user already is
> using
> > > >>> Ignite in full, not just via limited interface of SQL.
> > > >>>
> > > >>> 2016-09-29 0:31 GMT+03:00 Denis Magda <dma...@gridgain.com>:
> > > >>>> Hmm, this is a good question.
> > > >>>>
> > > >>>> If a user doesn’t provide a _key when an INSERT is executed for me
> > it
> > > means that he is not going to use the key later for cache.get/put,
> > DELETE,
> > > UPDATE and other possible operation simply because he doesn’t know how
> to
> > > reconstruct the key back in his code. If he wants to use the primary
> key
> > in
> > > the rest of operations then he must provide it at INSERT time.
> > > >>>>
> > > >>>> Do we need this key only for a case when an object is being
> inserted
> > > into a cache? If it’s so I would auto-generate a key using ‘long’ as a
> > key
> > > type. I do remember that we provided the auto-generation for Spark
> module
> > > in a some way that may be useful here.
> > > >>>>
> > > >>>> —
> > > >>>> Denis
> > > >>>>
> > > >>>>> On Sep 28, 2016, at 9:53 AM, Alexander Paschenko <
> > > alexander.a.pasche...@gmail.com> wrote:
> > > >>>>>
> > > >>>>> Denis,
> > > >>>>>
> > > >>>>> That's not what I was asking about.
> > > >>>>> Currently DML implementation allows for dymanic instantiation of
> > > keys,
> > > >>>>> in other words, user does not have to provide value for
> > object-typed
> > > >>>>> _key column - instead, he may supply just field values based on
> > which
> > > >>>>> _key will be dynamically instantiated/binary built. And that's
> the
> > > >>>>> whole point of this discussion as I see it: what to do when we've
> > > >>>>> binary built classless key that we build ourselves inside SQL
> > engine
> > > >>>>> and don't know how to compute hash code for it?
> > > >>>>>
> > > >>>>> - Alex
> > > >>>>>
> > > >>>>> 2016-09-28 19:48 GMT+03:00 Denis Magda <dma...@gridgain.com>:
> > > >>>>>> Alexander,
> > > >>>>>>
> > > >>>>>> As I guess if we have a key without a class then it will be
> > > constructed using a BinaryBuilder instance and it’s user responsibility
> > to
> > > set the has code at the end with BinaryBuilder.hasCode method. Sure,
> all
> > > this cases must be well-documented in both Java Doc API and Apache
> Ignite
> > > documentation.
> > > >>>>>>
> > > >>>>>> —
> > > >>>>>> Denis
> > > >>>>>>
> > > >>>>>>> On Sep 28, 2016, at 9:33 AM, Alexander Paschenko <
> > > alexander.a.pasche...@gmail.com> wrote:
> > > >>>>>>>
> > > >>>>>>> Dmitry, Denis,
> > > >>>>>>>
> > > >>>>>>> OK, but I think it's necessary to address also the cases when
> > > there's
> > > >>>>>>> no actual class for the key, and its fields are simply declared
> > in
> > > >>>>>>> XML. In this case, there are no fields to be marked transient.
> > > What do
> > > >>>>>>> we do then? List transient fields in XML separately?
> > > >>>>>>>
> > > >>>>>>> - Alex
> > > >>>>>>>
> > > >>>>>>> 2016-09-28 4:15 GMT+03:00 Dmitriy Setrakyan <
> > dsetrak...@apache.org
> > > >:
> > > >>>>>>>> Agree with Denis.
> > > >>>>>>>>
> > > >>>>>>>> - by default, all non-transient key fields should participate
> in
> > > the
> > > >>>>>>>> hashcode generation
> > > >>>>>>>> - when working on DDL, then the primary key fields should
> > > participate in
> > > >>>>>>>> the hashcode
> > > >>>>>>>> - we should add a resolver to override the default behavior
> > > (please
> > > >>>>>>>> propose the interface in Jira)
> > > >>>>>>>> - we should print out a warning, only once per type, the the
> > > hashcode
> > > >>>>>>>> has been automatically generated based on which fields and
> which
> > > formula
> > > >>>>>>>>
> > > >>>>>>>> D.
> > > >>>>>>>>
> > > >>>>>>>> On Tue, Sep 27, 2016 at 5:42 PM, Denis Magda <
> > dma...@gridgain.com>
> > > wrote:
> > > >>>>>>>>
> > > >>>>>>>>> Hi Alexander,
> > > >>>>>>>>>
> > > >>>>>>>>> Vladimir’s proposal sounds reasonable to me. However we must
> > > keep in mind
> > > >>>>>>>>> one important thing. Binary objects were designed to address
> > the
> > > following
> > > >>>>>>>>> disadvantages a regular serializer, like optimized
> marshaller,
> > > has:
> > > >>>>>>>>> necessity to deserialize an object on a server side every
> time
> > > it’s needed.
> > > >>>>>>>>> necessity to hold an object in both serialized and
> deserialized
> > > forms on
> > > >>>>>>>>> the server node.
> > > >>>>>>>>> necessity to restart the whole cluster each time an object
> > > version is
> > > >>>>>>>>> changed (new field is added or an old one is removed).
> > > >>>>>>>>> If it will be needed to perform step 3 for a default
> > > implementation of the
> > > >>>>>>>>> binary resolver just because the resolver has to consider new
> > > fields or
> > > >>>>>>>>> ignore old ones then such an implementation sucks. Overall,
> the
> > > default
> > > >>>>>>>>> implementation should use the reflection coming over all the
> > > fields a key
> > > >>>>>>>>> has ignoring the ones that are marked with “transient”
> keyword.
> > > If a user
> > > >>>>>>>>> wants to control the default resolver's logic then he can
> label
> > > all the
> > > >>>>>>>>> fields that mustn’t be of a final has code value with
> > > “transient”. This has
> > > >>>>>>>>> to be well-documented for sure.
> > > >>>>>>>>>
> > > >>>>>>>>> Makes sense?
> > > >>>>>>>>>
> > > >>>>>>>>> —
> > > >>>>>>>>> Denis
> > > >>>>>>>>>
> > > >>>>>>>>>> On Sep 26, 2016, at 12:40 PM, Alexander Paschenko <
> > > >>>>>>>>> alexander.a.pasche...@gmail.com> wrote:
> > > >>>>>>>>>>
> > > >>>>>>>>>> Hello Igniters,
> > > >>>>>>>>>>
> > > >>>>>>>>>> As DML support is near, it's critical that we agree on how
> we
> > > generate
> > > >>>>>>>>>> hash codes for new keys in presence of binary marshaller.
> > > Actually,
> > > >>>>>>>>>> this discussion isn't new - please see its beginning here:
> > > >>>>>>>>>>
> > > >>>>>>>>>> http://apache-ignite-developers.2346864.n4.nabble.
> > > >>>>>>>>> com/All-BinaryObjects-created-by-BinaryObjectBuilder-stored-
> > > >>>>>>>>> at-the-same-partition-by-default-td8042.html
> > > >>>>>>>>>>
> > > >>>>>>>>>> Still, I'm creating this new thread to make getting to the
> > final
> > > >>>>>>>>>> solution as simple and fast as possible.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I remind everyone that the approach that has got the least
> > > critics was
> > > >>>>>>>>>> the one proposed by Vladimir Ozerov:
> > > >>>>>>>>>>
> > > >>>>>>>>>> <quote>
> > > >>>>>>>>>> I think we can do the following:
> > > >>>>>>>>>> 1) Add "has hash code" flag as Denis suggested.
> > > >>>>>>>>>> 2) If object without a hash code is put to cache, throw an
> > > exception.
> > > >>>>>>>>>> 3) Add *BinaryEqualsHashCodeResolver *interface.
> > > >>>>>>>>>> 4) Add default implementation which will auto-generate hash
> > > code. *Print
> > > >>>>>>>>> a
> > > >>>>>>>>>> warning when auto-generation occurs*, so that user is aware
> > > that he is
> > > >>>>>>>>>> likely to have problems with normal GETs/PUTs.
> > > >>>>>>>>>> 5) Add another implementation which will use encoded string
> to
> > > calculate
> > > >>>>>>>>> a
> > > >>>>>>>>>> hash code. E.g. *new BinaryEqualsHashCodeResolver("{a} *
> 31 +
> > > {b}")*.
> > > >>>>>>>>>> Originally proposed by Yakov some time ago.
> > > >>>>>>>>>> </quote>
> > > >>>>>>>>>>
> > > >>>>>>>>>> After that, Sergi suggested that instead of a "formula" we
> > keep
> > > just a
> > > >>>>>>>>>> list of the "fields" that participate in hash code
> evaluation,
> > > and
> > > >>>>>>>>>> with that list, we simply calculate hash code just like IDE
> > > does -
> > > >>>>>>>>>> with all its bit shifts and additions.
> > > >>>>>>>>>>
> > > >>>>>>>>>> I'm planning on settling down with this combined Vlad-Sergi
> > > approach.
> > > >>>>>>>>>> Any objections?
> > > >>>>>>>>>>
> > > >>>>>>>>>> And an extra question I have: Vlad, you suggest that we both
> > > throw an
> > > >>>>>>>>>> exception on cache code absence and that we might generate
> it
> > > as the
> > > >>>>>>>>>> last resort. Do I understand you correctly that you suggest
> > > generating
> > > >>>>>>>>>> random code only in context of SQL, but throw exception for
> > keys
> > > >>>>>>>>>> without codes on ordinary put?
> > > >>>>>>>>>>
> > > >>>>>>>>>> And yes, built-in hash codes for JDK types are supported as
> > > well as
> > > >>>>>>>>>> items 1-2 from Vlad's list (there's already fixed issue of
> > > IGNITE-3633
> > > >>>>>>>>>> for the flag and its presence check).
> > > >>>>>>>>>>
> > > >>>>>>>>>> - Alex
> > > >>>>>>>>>
> > > >>>>>>>>>
> > > >>>>>>
> > > >>>>
> > > >>
> > >
> > >
> >
>

Re: Default hash code generation strategy for new binary objects

Reply via email to