If we are simply looking at ways to avoid the PDX type bloat, then some
quick wins would be:

- Presort JSON field names, or remove the ordering dependency in PDX type
  matching. I looked into removing or working around the ordering a while
  ago when dealing with GPDB integration.
- Stop this silliness of trying to conserve space by putting small numbers
  into smaller int fields, and force all JSON numbers to be serialized as
  BigDecimal. JSON does not define any other kind of number, so why are we
  trying to?
- Don't parse time in JSON. There is no standard type for time in JSON.
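To make the BigDecimal point concrete: parsing every JSON numeric token through BigDecimal keeps the exact textual value with no int/long/double narrowing heuristics, so field types never vary with the magnitude of the number. A minimal sketch using only the JDK; `readNumber` is a hypothetical stand-in for wherever the formatter extracts the numeric token:

```java
import java.math.BigDecimal;

public class JsonNumbers {
    // Hypothetical helper standing in for the formatter's number handling:
    // take the numeric token exactly as it appeared in the JSON text.
    static BigDecimal readNumber(String token) {
        // BigDecimal accepts any JSON number; no guessing between
        // byte/short/int/long/float/double, so no type churn from size.
        return new BigDecimal(token);
    }

    public static void main(String[] args) {
        // Small ints, values past Long.MAX_VALUE, and decimals all round-trip.
        System.out.println(readNumber("1"));
        System.out.println(readNumber("92233720368547758089"));
        System.out.println(readNumber("0.1"));
    }
}
```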

This won't solve all bloat, like that resulting from added or removed
fields, but having a superset of those fields defined in the PDX metadata
will only cause memory bloat in storage. If your PDX type defines 100
fields but your PdxInstance only populates 1, the serialized form still
records the null values for the other 99 fields. If your 1 field happens
to be the last field, then your performance goes to crap too, since a
getValue call has to walk the entire structure from the first
variable-length field to find the field you are looking for in the stream.
You are way better off with more types that define these smaller subsets
of the document than one superset. Optimizations should be made in the
lookup in the PDX registry.
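A toy illustration of that walk (this is not the real PDX wire format; assume each field is stored as a 4-byte length prefix followed by its bytes, in declaration order):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FieldWalk {
    // Toy wire format: each field is [4-byte length][payload bytes], in
    // declaration order. A null field still costs a 4-byte length prefix.
    static byte[] serialize(String[] values) {
        ByteBuffer buf = ByteBuffer.allocate(4096); // plenty for the toy
        for (String v : values) {
            byte[] b = v == null ? new byte[0] : v.getBytes(StandardCharsets.UTF_8);
            buf.putInt(b.length).put(b);
        }
        byte[] out = new byte[buf.position()];
        buf.flip();
        buf.get(out);
        return out;
    }

    // Reading field N means skipping every field before it: the cost grows
    // with the number of preceding variable-length fields, even null ones.
    static String getValue(byte[] data, int index) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        for (int i = 0; i < index; i++) {
            int skip = buf.getInt();              // read the length prefix...
            buf.position(buf.position() + skip);  // ...and skip the payload
        }
        int len = buf.getInt();
        if (len == 0) return null; // this toy conflates null and empty
        byte[] b = new byte[len];
        buf.get(b);
        return new String(b, StandardCharsets.UTF_8);
    }
}
```

With 100 fields and only the last one populated, the serialized form still spends 4 bytes per null field (404 bytes total here), and getValue for field 99 has to skip all 99 empty fields first.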

-Jake


On Wed, Jan 4, 2017 at 12:12 PM Udo Kohlmeyer <u...@apache.org> wrote:

> I like the self-describing idea, and if you are willing to take the
> memory hit, then storing the type definition as part of the data entry
> works. Although it WILL cause huge headaches:
>
>   * significant increase in memory
>   * performance degradation not only in the processing of the entry but
>     every network hop it would incur
>
> My proposal, even if incredibly veiled under the JSON banner, was a type
> catalogue that would replace the current PDXTypeRegistry and that could
> form the basis of a greater data type service. One that does not only
> help with serialization but also with converting from one type to
> another (JSON->Pdx) and formatting (import/export) of decimal and date
> fields. I even had the idea that it could store things like secure
> fields (masking and obscuring of non-authorized access). But I see that
> this falls squarely in the realm of security and should not be put in
> the catalogue.
>
> I believe that we could live in the "best-of-both-worlds", where you
> could define (or have it automatically define) a type definition. Then
> the current logic would continue working as is. IF one then decides that
> it is too much effort, or the structure cannot be concretely defined, or
> one just does not care, then the "self-describing" entry type can be
> used, with the added memory footprint, etc...
>
> Serialization has always been something that was supposed to be
> pluggable. The serialization framework would just take the data entry
> and (de)serialize it. The serialization framework would use the type
> catalogue to (de)serialize the data, like we currently do with Pdx. The
> order of fields for the data is specified and we know how to
> (de)serialize the data.
>
> Improving the current JSONFormatter was really a start, a way we can
> improve the definition and usage of other, non-POJO-like, structures. We
> are currently facing the following problems:
>
>   * Too many types created due to inconsistently structured JSON documents
>   * DateTime fields incorrectly processed due to missing the time
>     component on import. (Import/Export)
>   * Decimal field formatting
>
> @jake, I like the idea of having a FieldReadable interface (we can work
> on the naming though). Then we can start getting some conformity around
> how we access data regardless of what type of object is stored.
>
>
> On 1/4/17 07:14, William Markito Oliveira wrote:
> > I think bson already stores the field names within the serialized data
> > values, which is indeed more generic but would of course take more space.
> >
> > These conversations are very interesting, especially considering how many
> > popular serialization formats exist out there (Parquet, Avro, Protobuf,
> > etc...) but I'm not sure the serialization itself was the main point of
> > Udo's proposal; it was more the problem that today JSONFormatter +
> > PDXTypes is the only way to do it, and it could cause the "explosion of
> > types" on unstructured data.
> >
> > Seems to me that fixing the JSONFormatter to be smarter about it is a
> > quick path, but it would not address the whole picture of making
> > serialization options modular in Geode, which could be its own new
> > proposal as well. Just a thought.
> >
> > On Tue, Jan 3, 2017 at 7:21 PM, Jacob Barrett<jbarr...@pivotal.io>
> wrote:
> >
> >> I don't know that I would be concerned with optimization of unstructured
> >> data from the start. Given that the data is unstructured, it can be
> >> restructured at a later time. You could have a lazy task running on
> >> the server that restructures unstructured data to be more uniform and
> >> compact.
> >>
> >> I also don't think there are many good reasons to try to wedge this
> >> into PDX.
> >> The only reason I see for wedging this into PDX is to avoid progress on
> >> modularizing and extending Geode.
> >>
> >> If all the places where we access fields on a stored object (query,
> >> indexing, etc.) were made a bit more generic, then any object that
> >> supports a simple getValue(field)-like interface could be accessed
> >> without deserialization or specialization.
> >>
> >> Consider:
> >> public interface FieldReadable {
> >>   public Object getValue(String field);
> >> }
> >>
> >> You could have an implementation that can getValue on PDX, POJO, JSON,
> >> BSON, XML, etc. There is no concern at this level with the underlying
> >> storage type or the original unserialized form of the object (if any).
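As a sketch of the idea (the interface shape follows Jake's snippet above; the Map-backed implementation is illustrative only, standing in for a PDX or BSON implementation that would read straight from the serialized bytes):

```java
import java.util.Map;

// Field-level read access, independent of the underlying storage format.
interface FieldReadable {
    Object getValue(String field);
}

// One possible backing: an already-parsed document held as a Map. Query
// and indexing code written against FieldReadable never needs to know
// what the storage format actually is.
class MapFieldReadable implements FieldReadable {
    private final Map<String, Object> fields;

    MapFieldReadable(Map<String, Object> fields) {
        this.fields = fields;
    }

    @Override
    public Object getValue(String field) {
        return fields.get(field);
    }
}
```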
> >>
> >> -Jake
> >>
> >>
> >>
> >>
> >> On Tue, Jan 3, 2017 at 4:46 PM Dan Smith<dsm...@pivotal.io>  wrote:
> >>
> >>> Hi Hitesh,
> >>>
> >>> There are a few different ways to store self describing data. One way
> >>> might be to just store the json string, or convert it to bson, and
> >>> then enhance the query engine to handle those formats. Another way
> >>> might be to extend PDX to support self describing serialized values.
> >>> We could add a selfDescribing boolean flag to
> >>> RegionService.createPdxInstanceFactory. If that flag is set, we will
> >>> not register the PDX type in the type registry but instead store it as
> >>> part of the value. The JSONFormatter could set that flag to true or
> >>> expose it as an option.
> >>>
> >>> Storing self describing documents is a different approach than Udo's
> >>> original proposal. I do agree there is value in being able to store
> >>> consistently structured json documents the way we do now to save
> >>> memory. I think maybe I would be happier if the original proposal was
> >>> more of an external tool or wrapper focused on sanitizing json
> >>> documents without being concerned with type ids or a central registry
> >>> service. I could picture just having a single sanitize method that
> >>> takes a json string and a standard json schema
> >>> <http://json-schema.org/> and returns a cleaned up json document. That
> >>> seems like it would be a lot easier to implement and wouldn't require
> >>> the user to add typeIds to their json documents.
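Such a sanitize step could be sketched roughly as below. For brevity it operates on an already-parsed document (the JDK ships no JSON parser), and a plain ordered field list stands in for a real JSON Schema:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class JsonSanitizer {
    // Emit fields in the schema's order, filling missing ones with null
    // and dropping anything the schema does not define. Every sanitized
    // document then has the same field set and order, so it maps to a
    // single PDX type instead of one type per document shape.
    static Map<String, Object> sanitize(Map<String, Object> doc,
                                        List<String> schemaFields) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String field : schemaFields) {
            out.put(field, doc.get(field));
        }
        return out;
    }
}
```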
> >>>
> >>> I still feel like storing self describing values might serve more
> >>> users. It is probably more work than a simple sanitize method like
> >>> the above, though.
> >>>
> >>> -Dan
> >>>
> >>>
> >>> On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra
> >>> <hitesh...@yahoo.com.invalid
> >>>> wrote:
> >>>>>> If we give people the option to store and query self describing
> >>>>>> values, then users with inconsistent json could just use that
> >>>>>> option and pay the extra storage cost.
> >>>> Dan, are you saying we should expose some interface to (de)serialize
> >>>> data and query some field in the data - getFieldValue(fieldname)?
> >>>> Some sort of ExternalSerializer with getFieldValue() capability.
> >>>>
> >>>>
> >>>>        From: Dan Smith<dsm...@pivotal.io>
> >>>>   To:dev@geode.apache.org
> >>>>   Sent: Wednesday, December 21, 2016 6:20 PM
> >>>>   Subject: Re: New proposal for type definitons
> >>>>
> >>>> I'm assuming the type ids here are a different set than the type ids
> >>>> used with regular PDX serialization, so they won't conflict if the
> >>>> pdx registry assigns 1 to some class and a user puts @typeId: 1 in
> >>>> their json?
> >>>>
> >>>> I'm concerned that this won't really address the type explosion
> >>>> issue. Users that are able to go to the effort of adding these
> >>>> typeIds to all of their json are probably users that can produce
> >>>> consistently formatted json in the first place. Users that have
> >>>> inconsistently formatted json are probably not going to want or be
> >>>> able to add these type ids.
> >>>>
> >>>> It might be better for us to pursue a way to store arbitrary
> >>>> documents that are self describing. Our current approach for json
> >>>> documents assumes that the documents are all consistently formatted.
> >>>> We infer a schema for their documents, store the field names in the
> >>>> type registry, and the field values in the serialized data. If we
> >>>> give people the option to store and query self describing values,
> >>>> then users with inconsistent json could just use that option and pay
> >>>> the extra storage cost.
> >>>>
> >>>> -Dan
> >>>>
> >>>> On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer<ukohlme...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hey there,
> >>>>>
> >>>>> I've just completed a new proposal on the wiki for a new mechanism
> >> that
> >>>>> could be used to define a type definition for an object.
> >>>>> https://cwiki.apache.org/confluence/display/GEODE/Custom+
> >>>>> External+Type+Definition+Proposal+for+JSON
> >>>>>
> >>>>> Primarily, the new type definition proposal will hopefully help
> >>>>> with the "structuring" of JSON document definitions in a manner
> >>>>> that will allow users to submit JSON documents for data types
> >>>>> without the need to provide every field of the whole domain object
> >>>>> type.
> >>>>>
> >>>>> Please review and comment as required.
> >>>>>
> >>>>> --Udo
> >>>>>
> >>>>>
> >>>>
> >>>>
> >
>
>
