I like the idea of self-describing entries, and if you are willing to take the memory hit, then storing the type definition as part of the data entry works. Although it WILL cause huge headaches:

 * significant increase in memory
 * performance degradation, not only in the processing of the entry but
   on every network hop it would incur

My proposal, even if incredibly veiled under the JSON banner, was a type catalogue that would replace the current PDXTypeRegistry and could form the basis of a greater data type service: one that not only helps with serialization but also with converting from one type to another (JSON->Pdx) and with the formatting (import/export) of decimal and date fields. I even had the idea that it could store things like secure fields (masking and obscuring of non-authorized access), but I see that this falls squarely in the realm of security and should not be put in the catalogue.

I believe that we could live in the "best of both worlds", where you could define (or have it automatically define) a type definition and the current logic would continue working as is. If one then decides that it is too much effort, or the structure cannot be concretely defined, or one just does not care, then the "self-describing" entry type can be used, with the added memory footprint, etc.
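A minimal sketch of what that catalogue lookup-with-fallback could look like (all names here are illustrative; no such API exists in Geode today):

import java.util.Optional;

// Hypothetical type catalogue sketch; TypeCatalogue and TypeDefinition
// are illustrative names, not existing Geode classes.
public interface TypeCatalogue {

  // Register a concrete definition up front; entries of this type are
  // stored compactly and the current (Pdx-style) logic keeps working.
  void register(TypeDefinition definition);

  // Returns the definition if one was registered (or auto-defined).
  // An empty result means the entry falls back to the self-describing
  // format and pays the extra memory footprint.
  Optional<TypeDefinition> lookup(String typeName);

  interface TypeDefinition {
    String name();
  }
}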

Serialization has always been something that was supposed to be pluggable. The serialization framework would just take the data entry and (de)serialize it, using the type catalogue much as we currently do with Pdx. The order of fields for the data is specified, so we know how to (de)serialize it.

Improving the current JSONFormatter was really a start: a way to improve the definition and usage of other, non-POJO-like structures. We are currently facing the following problems:

 * Too many types created due to inconsistently structured JSON documents
 * DateTime fields incorrectly processed due to a missing time
   component on import (import/export)
 * Decimal field formatting

@jake, I like the idea of having a FieldReadable interface (we can work on the naming, though). Then we can start getting some conformity around how we access data, regardless of what type of object is stored.


On 1/4/17 07:14, William Markito Oliveira wrote:
I think BSON already stores the field names within the serialized data values, which is indeed more generic but would of course take more space.

These conversations are very interesting, especially considering how many popular serialization formats exist out there (Parquet, Avro, Protobuf, etc...), but I'm not sure the serialization itself was the main thing with Udo's proposal; it was more the problem that today JSONFormatter + PDXTypes is the only way to do it, and it can cause the "explosion of types" on unstructured data.

Seems to me that fixing the JSONFormatter to be smarter about it is a quick path, but it would not address the whole picture of making serialization options modular in Geode, which could be its own new proposal as well. Just a thought.

On Tue, Jan 3, 2017 at 7:21 PM, Jacob Barrett <jbarr...@pivotal.io> wrote:

I don't know that I would be concerned with optimization of unstructured data from the start. Given that the data is unstructured, it can be restructured at a later time. You could have a lazy task running on the server that restructures unstructured data to be more uniform and compact.

I also don't think there are many good reasons to try to wedge this into PDX. The only reason I see for wedging this into PDX is to avoid progress on modularizing and extending Geode.

If all the places where we access fields on a stored object (query, indexing, etc.) were made a bit more generic, then any object that supports a simple getValue(field)-like interface could be accessed without deserialization or specialization.

Consider:

public interface FieldReadable {
  public Object getValue(String field);
}

You could have an implementation that can getValue on PDX, POJO, JSON, BSON, XML, etc. There is no concern at this level with the underlying storage type or the original unserialized form of the object (if any).
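For illustration, a possible adapter over the existing PdxInstance API (PdxInstance.getField(String) is real; FieldReadable is only the proposed interface above):

import org.apache.geode.pdx.PdxInstance;

// Sketch of one possible implementation over the existing PdxInstance API.
// FieldReadable is the proposed interface above, not an existing Geode type.
public class PdxFieldReadable implements FieldReadable {

  private final PdxInstance pdx;

  public PdxFieldReadable(PdxInstance pdx) {
    this.pdx = pdx;
  }

  @Override
  public Object getValue(String field) {
    // Reads a single field without deserializing the whole object --
    // exactly the property that queries and indexes would rely on.
    return pdx.getField(field);
  }
}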

-Jake




On Tue, Jan 3, 2017 at 4:46 PM Dan Smith <dsm...@pivotal.io> wrote:

Hi Hitesh,

There are a few different ways to store self-describing data. One way might be to just store the json string, or convert it to bson, and then enhance the query engine to handle those formats. Another way might be to extend PDX to support self-describing serialized values. We could add a selfDescribing boolean flag to RegionService.createPdxInstanceFactory. If that flag is set, we will not register the PDX type in the type registry but instead store it as part of the value. The JSONFormatter could set that flag to true or expose it as an option.
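A sketch of that proposed flag (this overload does not exist today; the real method is RegionService.createPdxInstanceFactory(String className)):

import org.apache.geode.pdx.PdxInstanceFactory;

// Hypothetical sketch of the proposed flag; this overload does NOT exist
// in Geode today.
public interface SelfDescribingProposal {

  // When selfDescribing is true, the PdxType would be embedded in the
  // serialized value itself instead of being registered in the
  // cluster-wide type registry.
  PdxInstanceFactory createPdxInstanceFactory(String className, boolean selfDescribing);
}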

Storing self-describing documents is a different approach than Udo's original proposal. I do agree there is value in being able to store consistently structured json documents the way we do now to save memory. I think maybe I would be happier if the original proposal was more of an external tool or wrapper focused on sanitizing json documents, without being concerned with type ids or a central registry service. I could picture just having a single sanitize method that takes a json string and a standard JSON schema <http://json-schema.org/> and returns a cleaned-up json document. That seems like it would be a lot easier to implement and wouldn't require the user to add typeIds to their json documents.
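A rough sketch of that sanitize method (JsonSanitizer and its signature are illustrative, not an existing Geode API):

// Illustrative only; nothing like this exists in Geode today.
public interface JsonSanitizer {

  /**
   * @param json       an arbitrary, possibly inconsistently structured document
   * @param jsonSchema a standard JSON schema (http://json-schema.org/)
   * @return a cleaned-up document with stable field order and types, safe to
   *         feed to JSONFormatter without minting new PdxTypes
   */
  String sanitize(String json, String jsonSchema);
}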

I still feel like storing self-describing values might serve more users. It is probably more work than a simple sanitize method like the above, though.

-Dan


On Tue, Jan 3, 2017 at 4:07 PM, Hitesh Khamesra <hitesh...@yahoo.com.invalid> wrote:
If we give people the option to store and query self-describing values, then users with inconsistent json could just use that option and pay the extra storage cost.

Dan, are you saying we should expose some interface to serialize/deserialize and to query some field in the data - getFieldValue(fieldname)? Some sort of ExternalSerializer with getFieldValue() capability.
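Something along these lines, perhaps (ExternalSerializer is a hypothetical name, not an existing Geode interface):

// Sketch of the idea above; all of this is hypothetical.
public interface ExternalSerializer {

  byte[] serialize(Object value);

  Object deserialize(byte[] bytes);

  // Read a single field straight out of the serialized form, without
  // materializing the whole object.
  Object getFieldValue(byte[] bytes, String fieldName);
}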


  From: Dan Smith <dsm...@pivotal.io>
  To: dev@geode.apache.org
  Sent: Wednesday, December 21, 2016 6:20 PM
  Subject: Re: New proposal for type definitions

I'm assuming the type ids here are a different set than the type ids used with regular PDX serialization, so they won't conflict if the pdx registry assigns 1 to some class and a user puts @typeId: 1 in their json?

I'm concerned that this won't really address the type explosion issue. Users that are able to go to the effort of adding these typeIds to all of their json are probably users that can produce consistently formatted json in the first place. Users that have inconsistently formatted json are probably not going to want, or be able, to add these type ids.

It might be better for us to pursue a way to store arbitrary documents that are self-describing. Our current approach for json documents assumes that the documents are all consistently formatted: we infer a schema for the documents, store the field names in the type registry, and store the field values in the serialized data. If we give people the option to store and query self-describing values, then users with inconsistent json could just use that option and pay the extra storage cost.
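For illustration, assuming a running cache with PDX configured (JSONFormatter.fromJSON is a real Geode API):

import org.apache.geode.pdx.JSONFormatter;
import org.apache.geode.pdx.PdxInstance;

// Documents with the same field names in the same order share one PdxType;
// any structural variation mints a new type in the registry -- the
// "explosion of types" problem described above.
public class TypeExplosionExample {
  public static void demo() {
    PdxInstance a = JSONFormatter.fromJSON("{\"name\": \"a\", \"age\": 1}");
    PdxInstance b = JSONFormatter.fromJSON("{\"name\": \"b\", \"age\": 2}"); // same PdxType as 'a'
    PdxInstance c = JSONFormatter.fromJSON("{\"age\": 3, \"name\": \"c\"}"); // new PdxType: different field order
  }
}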

-Dan

On Tue, Dec 20, 2016 at 4:53 PM, Udo Kohlmeyer <ukohlme...@gmail.com> wrote:

Hey there,

I've just completed a new proposal on the wiki for a new mechanism that could be used to define a type definition for an object:
https://cwiki.apache.org/confluence/display/GEODE/Custom+External+Type+Definition+Proposal+for+JSON

Primarily, the new type definition proposal will hopefully help with the "structuring" of JSON document definitions in a manner that will allow users to submit JSON documents for data types without the need to provide every field of the whole domain object type.

Please review and comment as required.

--Udo





