I like the idea of self-describing entries, and if you are willing to take the memory hit, then storing the type definition as part of the data entry works. Although it WILL cause huge headaches:

 * significant increase in memory
 * performance degradation, not only in the processing of the entry but
   on every network hop it would incur

My proposal, even if incredibly veiled under the JSON banner, was a type catalogue that would replace the current PDXTypeRegistry and could form the basis of a greater data type service: one that not only helps with serialization but also with converting from one type to another (JSON->Pdx) and with the formatting (import/export) of decimal and date fields. I even had the idea that it could store things like secure fields (masking and obscuring of non-authorized access), but I see that this falls squarely in the realm of security and should not be put in the catalogue.

I believe that we could live in the "best of both worlds", where you could define (or have it automatically define) a type definition and the current logic would continue working as is. If one then decides that it is too much effort, or the structure cannot be concretely defined, or one just does not care, then the "self-describing" entry type can be used, with the added memory footprint, etc.
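A minimal sketch of what that catalogue lookup-with-fallback could look like (all names here are illustrative; no such API exists in Geode today):

import java.util.Optional;

// Hypothetical type catalogue sketch; TypeCatalogue and TypeDefinition
// are illustrative names, not existing Geode classes.
public interface TypeCatalogue {

  // Register a concrete definition up front; entries of this type are
  // stored compactly and the current (Pdx-style) logic keeps working.
  void register(TypeDefinition definition);

  // Returns the definition if one was registered (or auto-defined).
  // An empty result means the entry falls back to the self-describing
  // format and pays the extra memory footprint.
  Optional<TypeDefinition> lookup(String typeName);

  interface TypeDefinition {
    String name();
  }
}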

Serialization has always been something that was supposed to be pluggable. The serialization framework would just take the data entry and (de)serialize it, using the type catalogue much as we currently do with Pdx. The order of fields for the data is specified, so we know how to (de)serialize it.

Improving the current JSONFormatter was really a start: a way to improve the definition and usage of other, non-POJO-like structures. We are currently facing the following problems:

 * Too many types created due to inconsistently structured JSON documents
 * DateTime fields incorrectly processed due to a missing time
   component on import (import/export)
 * Decimal field formatting

@jake, I like the idea of having a FieldReadable interface (we can work on the naming, though). Then we can start getting some conformity around how we access data, regardless of what type of object is stored.


On 1/4/17 07:14, William Markito Oliveira wrote:
I think BSON already stores the field names within the serialized data values, which is indeed more generic but would of course take more space.

These conversations are very interesting, especially considering how many popular serialization formats exist out there (Parquet, Avro, Protobuf, etc...), but I'm not sure the serialization itself was the main thing with Udo's proposal; it was more the problem that today JSONFormatter + PDXTypes is the only way to do it, and it can cause the "explosion of types" on unstructured data.

Seems to me that fixing the JSONFormatter to be smarter about it is a quick path, but it would not address the whole picture of making serialization options modular in Geode, which could be its own new proposal as well. Just a thought.

On Tue, Jan 3, 2017 at 7:21 PM, Jacob Barrett <jbarr...@pivotal.io> wrote:

I don't know that I would be concerned with optimization of unstructured data from the start. Given that the data is unstructured, it can be restructured at a later time. You could have a lazy task running on the server that restructures unstructured data to be more uniform and compact.

I also don't think there are many good reasons to try to wedge this into PDX. The only reason I see for wedging this into PDX is to avoid progress on modularizing and extending Geode.

If all the places where we access fields on a stored object (query, indexing, etc.) were made a bit more generic, then any object that supports a simple getValue(field)-like interface could be accessed without deserialization or specialization.

Consider:

public interface FieldReadable {
  public Object getValue(String field);
}

You could have an implementation that can getValue on PDX, POJO, JSON, BSON, XML, etc. There is no concern at this level with the underlying storage type or the original unserialized form of the object (if any).
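For illustration, a possible adapter over the existing PdxInstance API (PdxInstance.getField(String) is real; FieldReadable is only the proposed interface above):

import org.apache.geode.pdx.PdxInstance;

// Sketch of one possible implementation over the existing PdxInstance API.
// FieldReadable is the proposed interface above, not an existing Geode type.
public class PdxFieldReadable implements FieldReadable {

  private final PdxInstance pdx;

  public PdxFieldReadable(PdxInstance pdx) {
    this.pdx = pdx;
  }

  @Override
  public Object getValue(String field) {
    // Reads a single field without deserializing the whole object --
    // exactly the property that queries and indexes would rely on.
    return pdx.getField(field);
  }
}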

-Jake




On Tue, Jan 3, 2017 at 4:46 PM Dan Smith <dsm...@pivotal.io> wrote:

Hi Hitesh,

There are a few different ways to store self-describing data. One way might be to just store the json string, or convert it to bson, and then enhance the query engine to handle those formats. Another way might be to extend PDX to support self-describing serialized values. We could add a selfDescribing boolean flag to RegionService.createPdxInstanceFactory. If that flag is set, we will not register the PDX type in the type registry but instead store it as part of the value. The JSONFormatter could set that flag to true or expose it as an option.
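A sketch of that proposed flag (this overload does not exist today; the real method is RegionService.createPdxInstanceFactory(String className)):

import org.apache.geode.pdx.PdxInstanceFactory;

// Hypothetical sketch of the proposed flag; this overload does NOT exist
// in Geode today.
public interface SelfDescribingProposal {

  // When selfDescribing is true, the PdxType would be embedded in the
  // serialized value itself instead of being registered in the
  // cluster-wide type registry.
  PdxInstanceFactory createPdxInstanceFactory(String className, boolean selfDescribing);
}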

Storing self-describing documents is a different approach than Udo's original proposal. I do agree there is value in being able to store consistently structured json documents the way we do now to save memory. I think maybe I would be happier if the original proposal was more of an external tool or wrapper focused on sanitizing json documents, without being concerned with type ids or a central registry service. I could picture just having a single sanitize method that takes a json string and a standard JSON schema <http://json-schema.org/> and returns a cleaned-up json document. That seems like it would be a lot easier to implement and wouldn't require the user to add typeIds to their json documents.
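A rough sketch of that sanitize method (JsonSanitizer and its signature are illustrative, not an existing Geode API):

// Illustrative only; nothing like this exists in Geode today.
public interface JsonSanitizer {

  /**
   * @param json       an arbitrary, possibly inconsistently structured document
   * @param jsonSchema a standard JSON schema (http://json-schema.org/)
   * @return a cleaned-up document with stable field order and types, safe to
   *         feed to JSONFormatter without minting new PdxTypes
   */
  String sanitize(String json, String jsonSchema);
}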

I still feel like storing self-describing values might serve more users. It is probably more work than a simple sanitize method like the above, though.

-Dan


On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra <hitesh...@yahoo.com.invalid> wrote:
If we give people the option to store and query self-describing values, then users with inconsistent json could just use that option and pay the extra storage cost.

Dan, are you saying we should expose some interface to serialize/deserialize and to query some field in the data - getFieldValue(fieldname)? Some sort of ExternalSerializer with getFieldValue() capability.
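Something along these lines, perhaps (ExternalSerializer is a hypothetical name, not an existing Geode interface):

// Sketch of the idea above; all of this is hypothetical.
public interface ExternalSerializer {

  byte[] serialize(Object value);

  Object deserialize(byte[] bytes);

  // Read a single field straight out of the serialized form, without
  // materializing the whole object.
  Object getFieldValue(byte[] bytes, String fieldName);
}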


  From: Dan Smith <dsm...@pivotal.io>
  To: dev@geode.apache.org
  Sent: Wednesday, December 21, 2016 6:20 PM
  Subject: Re: New proposal for type definitions

I'm assuming the type ids here are a different set than the type ids used with regular PDX serialization, so they won't conflict if the pdx registry assigns 1 to some class and a user puts @typeId: 1 in their json?

I'm concerned that this won't really address the type explosion issue. Users that are able to go to the effort of adding these typeIds to all of their json are probably users that can produce consistently formatted json in the first place. Users that have inconsistently formatted json are probably not going to want, or be able, to add these type ids.

It might be better for us to pursue a way to store arbitrary documents that are self-describing. Our current approach for json documents assumes that the documents are all consistently formatted: we infer a schema for the documents, store the field names in the type registry, and store the field values in the serialized data. If we give people the option to store and query self-describing values, then users with inconsistent json could just use that option and pay the extra storage cost.
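For illustration, assuming a running cache with PDX configured (JSONFormatter.fromJSON is a real Geode API):

import org.apache.geode.pdx.JSONFormatter;
import org.apache.geode.pdx.PdxInstance;

// Documents with the same field names in the same order share one PdxType;
// any structural variation mints a new type in the registry -- the
// "explosion of types" problem described above.
public class TypeExplosionExample {
  public static void demo() {
    PdxInstance a = JSONFormatter.fromJSON("{\"name\": \"a\", \"age\": 1}");
    PdxInstance b = JSONFormatter.fromJSON("{\"name\": \"b\", \"age\": 2}"); // same PdxType as 'a'
    PdxInstance c = JSONFormatter.fromJSON("{\"age\": 3, \"name\": \"c\"}"); // new PdxType: different field order
  }
}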

-Dan

On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer <ukohlme...@gmail.com> wrote:

Hey there,

I've just completed a new proposal on the wiki for a new mechanism that could be used to define a type definition for an object:
https://cwiki.apache.org/confluence/display/GEODE/Custom+External+Type+Definition+Proposal+for+JSON

Primarily, the new type definition proposal will hopefully help with the "structuring" of JSON document definitions in a manner that will allow users to submit JSON documents for data types without the need to provide every field of the whole domain object type.

Please review and comment as required.

--Udo





