Re: [GSoC] Proposal for discussion about Serialization requirements and requesting for Review

Madhusudan C.S Sun, 29 Mar 2009 09:51:57 -0700

Hi Russell,
   I am extremely thankful to you for spending your invaluable time on
a review (err... should I say post-mortem? ;-) ) of my complete
proposal. I had kept my fingers crossed for someone who knew the technical
aspects to do it, since most of my friends did only a language review (some
of them even gave up on seeing the length :( ). I am equally thankful to
Malcolm for his review.

After spending the whole of yesterday thinking, reviewing and studying how
other serializers, apart from Django's, work in different languages and
frameworks such as PHP, Python (pickle), Java, TurboGears (TurboJSON) and
Boost, I have come up with some ideas which mostly depart from what I
proposed earlier. At a high level I still propose to solve the same
problems I suggested in my initial proposal, while also considering the
bigger problems you pointed out. Again, this is a very rough draft of my
ideas and requires a lot of refining through discussion with you and the
rest of the community.

Thanks to the ideas on the Wiki; the reference to ModelAdmin there gave me
some ideas to think about further. Though this is not a copy, I have borrowed
some ideas from the other serializers I studied yesterday. I have also
ensured, as far as possible, that this doesn't break the existing Serializer
and fixtures in any way, but only adds to them. Please point out if I have
gone against this somewhere.

> The bigger issue is that we need to be able to easily
> reconfigure the output format of serializers to suit the
> specific requirements of other data consumers.

The idea I propose below is mostly meant to tackle this bigger issue, which
you pointed out throughout your review.

Let us consider the same two models as before:

from django.db import models

class Poll2(models.Model):
    question = models.CharField(max_length=200)
    pub_date = models.DateTimeField('date published')

class Choice2(models.Model):
    # Points at Poll2, the model defined above.
    poll = models.ForeignKey(Poll2)
    choice = models.CharField(max_length=200)
    votes = models.IntegerField()

The user will now be able to construct a class along the lines of ModelAdmin
to specify custom serialization formats. I propose an API based on the
following ideas.
The user will be given the option to define a Serializer class that inherits
from the framework's serializer classes: Base, XML, Python, YAML and JSON.
For the moment, to avoid confusion, let me call the new serializer framework
newserializer (this is only tentative; whether we must rename the framework
or just the classes can be decided later). From what I have understood,
Python data mainly consists of basic single-valued datatypes and data
structures like List, Tuple and Dictionary. Most other complex data
types/structures are derived from these and can thus be represented with
those notations.

So our base class defines a set of class attributes that define the notation
for these types, which by default are the same as the Python notations. For
example, list_separator will be a 3-tuple containing the enclosing notations
and the list item separator: ('[', ']', ','). Similarly, dict_separator is a
4-tuple: ('{', '}', ',', ':'). The last item is the key:value separator.
More specialized cases will be defined for the YAML and JSON classes. We can
use this approach for XML too. In that case we can pass tuples of strings in
this format:
list_separator = ('<list-name>', '</list-name>', '<>list-value</>')
dict_separator = ('<dict-name>', '</dict-name>', '<dict-key=dict-value></>')
It is important to note here that list-name, dict-name, list-value,
dict-value and dict-key are all indicative and are a part of the API (a
better naming convention will be developed); they are not placeholders for
some other value. That is, those are the names that must always be used
consistently, which will be evident from the examples below.

The user can now inherit from one of these classes in his app, depending on
his requirements, and over-ride these class attributes as per the format he
wants. The API roughly looks like this for serializing the Poll class in a
format similar to JSON notation.

class PollSerializer(newserializer.JSONSerializer):
    list_separator = ('{%', '%}', ':')
    dict_separator = ('{{', '}}', ':', '|')
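
As a rough illustration of how such separator attributes could drive the
output, here is a minimal sketch in plain Python. The names SketchSerializer,
PollNotation and render() are hypothetical and are not part of Django or of
the proposed newserializer API; only the two tuple layouts are taken from the
description above. String escaping is deliberately ignored.

```python
class SketchSerializer:
    # (opening, closing, item separator)
    list_separator = ('[', ']', ',')
    # (opening, closing, item separator, key/value separator)
    dict_separator = ('{', '}', ',', ':')

    def render(self, value):
        """Render nested dicts, lists and scalars using the class separators."""
        if isinstance(value, dict):
            opening, closing, item_sep, kv_sep = self.dict_separator
            items = [self.render(k) + kv_sep + self.render(v)
                     for k, v in value.items()]
            return opening + item_sep.join(items) + closing
        if isinstance(value, (list, tuple)):
            opening, closing, item_sep = self.list_separator
            return opening + item_sep.join(self.render(v) for v in value) + closing
        if isinstance(value, str):
            # Naive quoting: no escaping of embedded quotes.
            return '"' + value + '"'
        return str(value)


class PollNotation(SketchSerializer):
    # Same over-rides as the PollSerializer example above.
    list_separator = ('{%', '%}', ':')
    dict_separator = ('{{', '}}', ':', '|')


default_text = SketchSerializer().render({'a': [1, 2]})
custom_text = PollNotation().render({'a': [1, 2], 'b': 'x'})
```

With the defaults the result is ordinary JSON-like text, while the subclass
changes only the notation and leaves the traversal logic untouched.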

In addition to this the user can specify the fields to be selected, by
over-riding a class attribute, fields. This attribute is a tuple of strings
where each item is the name of the field to be serialized. The above class
can now be written as follows:

class PollSerializer(newserializer.JSONSerializer):
    list_separator = ('{%', '%}', ':')
    dict_separator = ('{{', '}}', ':', '|')
    fields = ('question', 'pub_date')

Additionally, a class attribute named exclude_fields, a tuple of strings, is
added, which is just the complement of the fields attribute (thanks to
DjangoFullSerializers for giving this idea).
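
One possible way the two attributes could interact is sketched below. The
helper name select_fields is hypothetical, and the rule that exclude_fields
is applied after fields is an assumption; the proposal does not say what
happens when both are given.

```python
def select_fields(all_fields, fields=None, exclude_fields=()):
    """Return the field names to serialize, in a stable order.

    all_fields     -- every field name available on the model
    fields         -- optional whitelist; None means "all fields"
    exclude_fields -- names to drop after the whitelist is applied
    """
    chosen = list(all_fields) if fields is None else list(fields)
    return [name for name in chosen
            if name in all_fields and name not in exclude_fields]


available = ['id', 'question', 'pub_date']
only_listed = select_fields(available, fields=('question', 'pub_date'))
without_id = select_fields(available, exclude_fields=('id',))
```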

To solve ticket #5711, I propose a method extra_fields() which returns a
dictionary. It must return a dictionary instead of a tuple because most of
the time the extra fields are computed/derived fields. Example below:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def extra_fields(self):
       pub_date_recent = pub_date > '2009-03-15'
       return {'is_recent': pub_date_recent}
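
The example above leaves open how extra_fields() reaches the object being
serialized. A minimal sketch, assuming the serializer keeps the current
instance in a hypothetical self.obj attribute and compares datetime values
rather than strings, could look like this (ExtraFieldsSketch and
serialize_fields() are illustrative names only):

```python
import datetime
from types import SimpleNamespace


class ExtraFieldsSketch:
    def __init__(self, obj):
        # Assumption: the serializer holds the instance being serialized.
        self.obj = obj

    def extra_fields(self):
        # Derived value computed from the instance, not stored on the model.
        cutoff = datetime.datetime(2009, 3, 15)
        return {'is_recent': self.obj.pub_date > cutoff}

    def serialize_fields(self):
        data = {
            'question': self.obj.question,
            'pub_date': self.obj.pub_date.isoformat(sep=' '),
        }
        # Merge the computed fields into the regular field data.
        data.update(self.extra_fields())
        return data


# Stand-in object with the same attributes as the Poll2 model.
old_poll = SimpleNamespace(question="What's Up?",
                           pub_date=datetime.datetime(2009, 3, 1, 6, 0))
poll_fields = ExtraFieldsSketch(old_poll).serialize_fields()
```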

One can also specify how a primary key is serialized with the method
pk_serialize(), which returns a dictionary. This should address ticket
#102. Example below:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def pk_serialize(self):
       return {'pk': pk_value, 'pk name': 'id'}

The dictionary can contain any number of items, but the requirement is that
*pk_value* is used at least once so that the PK value is serialized
somewhere. I am still unsure whether I should make this a method or an
attribute. Can someone kindly give suggestions?

The serialized output after over-riding the pk_serialize() method looks
something like below.
 {
        "pk": 1,
        "pk name": "id",
        "model": "testapp.poll2",
        "fields": {
            "pub_date": "2009-03-01 06:00:00",
            "question": "What's Up?"
        }
 }

An additional model_extras() method can be over-ridden; by default it
returns nothing in the parent classes. In the over-ridden method of the
derived class it can return a dictionary of values which are added to the
model's serialized data. An example of this is the version number of the
serialized format. API example:

class PollSerializer(newserializer.JSONSerializer):
   #...
   def model_extras(self):
       return {'version': '2.1'}
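
To make the role of pk_serialize() and model_extras() concrete, the sketch
below assembles one serialized record from them. EnvelopeSketch, PollEnvelope
and build() are hypothetical names, and the constructor arguments are an
assumption about what a serializer would know about the object; only the two
hook methods and their return values come from the examples above.

```python
class EnvelopeSketch:
    def __init__(self, model_label, pk_value, fields):
        self.model_label = model_label
        self.pk_value = pk_value
        self.fields = fields

    def pk_serialize(self):
        # Default: the primary key appears under the "pk" key only.
        return {'pk': self.pk_value}

    def model_extras(self):
        # Default: nothing extra is added at the model level.
        return {}

    def build(self):
        record = {}
        record.update(self.pk_serialize())
        record['model'] = self.model_label
        record['fields'] = dict(self.fields)
        record.update(self.model_extras())
        return record


class PollEnvelope(EnvelopeSketch):
    def pk_serialize(self):
        return {'pk': self.pk_value, 'pk name': 'id'}

    def model_extras(self):
        return {'version': '2.1'}


poll_record = PollEnvelope('testapp.poll2', 1,
                           {'question': "What's Up?"}).build()
```

Keeping both hooks as methods rather than attributes lets them read
per-object state such as the primary key value.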

Finally, coming to the big thing, ticket #4656, I propose 3 class attributes
for this. The first is select_related (as per your suggestion), which is a
dictionary. Each key of the dictionary is the name of a relation attribute
and each value is a dictionary. This inner dictionary can have the keys
'fields' or 'excludefields', whose values are tuples of strings naming the
fields in the related model to be selected or excluded. If the inner
dictionary is empty, the entire related model is serialized, using its own
Serializer class similar to this one if one is defined, or the existing
serializers otherwise.

Example:
class ChoiceSerializer(newserializer.JSONSerializer):
    #...
    select_related = {'poll': {'fields': ('question',)}}

NOTE: I am not very sure if I can implement this in the SoC timeline, but I
will include it in the API proposal; if I run out of time I will continue
with it after GSoC. If time permits, well and good, I will implement this
too. The value of the 'fields' key in the above dictionary is a tuple of
strings, which clearly means I cannot follow a relation on that model. So I
wish to also allow dictionaries in this tuple along with the strings. Each
such dictionary is again a select_related kind of nested dictionary which
can follow the relation within that relation, and so on.
For the Book, Author, City example you gave, it can look like this:
class BookSerializer(newserializer.JSONSerializer):
    #...
    select_related = {
        'author': {
            'fields': ('name', 'age', {
                'city':{
                    'fields': ('cityname', ...)
                }
            })
        }
    }
 *END NOTE*
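
A nested specification like this could be walked recursively. The sketch
below uses plain Python objects instead of Django models, handles only the
'fields' key (not 'excludefields' or the empty-dictionary case), and the
function names serialize_related and follow_relations are hypothetical.

```python
from types import SimpleNamespace


def serialize_related(obj, spec):
    """Serialize obj according to one {'fields': (...)} specification."""
    result = {}
    for entry in spec.get('fields', ()):
        if isinstance(entry, dict):
            # A dictionary inside the tuple follows a further relation.
            for relation, sub_spec in entry.items():
                result[relation] = serialize_related(getattr(obj, relation),
                                                     sub_spec)
        else:
            # A plain string names an ordinary field.
            result[entry] = getattr(obj, entry)
    return result


def follow_relations(obj, select_related):
    """Apply a whole select_related dictionary to obj."""
    return {relation: serialize_related(getattr(obj, relation), spec)
            for relation, spec in select_related.items()}


# Stand-in objects; the attribute values are made up for illustration.
city = SimpleNamespace(cityname='Example City')
author = SimpleNamespace(name='A. Author', age=40, city=city)
book = SimpleNamespace(title='Example Book', author=author)

book_related = follow_relations(book, {
    'author': {
        'fields': ('name', 'age', {
            'city': {'fields': ('cityname',)},
        }),
    },
})
```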

The rest of the following is within the SoC timeline.
The second of the 3 attributes is the inline_related attribute, which can be
set to True. In the parent class it is False. If it is set to True, the
Serializer will serialize the select_related relations inline.

The third attribute is reverse_related. It is again a dictionary, similar in
structure to the select_related dictionary, with the keys being the names of
the models that relate to this model. For example:

class PollSerializer(newserializer.JSONSerializer):
   #...
   reverse_related = {'choice': {
       'fields': ('choice', 'votes')
   }}
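
As an illustration of what reverse_related might produce, the sketch below
collects the objects that point back at a given object and keeps only the
listed fields. It works on plain Python objects; collect_reverse_related and
its arguments are hypothetical, and a real implementation would presumably
query the database for the related rows instead of filtering a list.

```python
from types import SimpleNamespace


def collect_reverse_related(obj, candidates, reverse_related, link_attr):
    """Return {relation name: [field dict, ...]} for objects linked to obj.

    candidates -- {relation name: iterable of objects to inspect}
    link_attr  -- attribute on each candidate that refers back to obj
    """
    result = {}
    for relation_name, spec in reverse_related.items():
        fields = spec.get('fields', ())
        result[relation_name] = [
            {name: getattr(related, name) for name in fields}
            for related in candidates.get(relation_name, ())
            if getattr(related, link_attr) is obj
        ]
    return result


# Stand-in objects mirroring Poll2 and Choice2; values are made up.
poll = SimpleNamespace(question='First poll')
other_poll = SimpleNamespace(question='Second poll')
choices = [
    SimpleNamespace(poll=poll, choice='Yes', votes=3),
    SimpleNamespace(poll=other_poll, choice='No', votes=1),
]

poll_reverse = collect_reverse_related(
    poll,
    {'choice': choices},
    {'choice': {'fields': ('choice', 'votes')}},
    'poll',
)
```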

Last but not the least, there is always one more thing ;-)

The user registers this PollSerializer class with our serializer framework,
similar to ModelAdmin, as:
serializer.register.model(Poll, PollSerializer)

Now a question arises: what if the user wants to change only the
serialization format, i.e. the notation, and nothing else in the entire app?
Should he do the donkey work of copy-pasting list_separator and
dict_separator? I feel he need not. For that I propose the following: define
a Serializer class, say AppnameSerializer, with whatever app-specific
customization he wants (as provided by the API) and then call
serializer.register.app(AppName, AppnameSerializer).

This can be extended to multiple apps too. If he wants to customize a
set of apps, he can say:
serializer.register.app(multiple_apps=(App1Name, App2Name, ...),
AppSetSerializer).
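
A registry of this kind could be as small as two dictionaries. The sketch
below is only an assumption about how registration and lookup might fit
together; SerializerRegistry and lookup() are hypothetical names. Since
Python does not allow a positional argument after a keyword argument, the
sketch takes the serializer class first and the app names after it, which
differs from the call shape written above.

```python
class SerializerRegistry:
    def __init__(self):
        self._by_model = {}
        self._by_app = {}

    def model(self, model_cls, serializer_cls):
        """Register a serializer for one model class."""
        self._by_model[model_cls] = serializer_cls

    def app(self, serializer_cls, *app_names):
        """Register one serializer for one or more apps."""
        for app_name in app_names:
            self._by_app[app_name] = serializer_cls

    def lookup(self, model_cls, app_name):
        """Prefer a model-level registration, then fall back to the app."""
        if model_cls in self._by_model:
            return self._by_model[model_cls]
        return self._by_app.get(app_name)


# Placeholder classes standing in for real models and serializers.
class Poll: pass
class Choice: pass
class PollSerializer: pass
class AppSetSerializer: pass

register = SerializerRegistry()
register.model(Poll, PollSerializer)
register.app(AppSetSerializer, 'polls', 'surveys')
```

Looking up the model first means an app-wide notation change can still be
refined for individual models.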

On Sat, Mar 28, 2009 at 12:17 PM, Russell Keith-Magee <
[email protected]> wrote:

>
> On Fri, Mar 27, 2009 at 1:48 AM, Madhusudan C.S <[email protected]>
> wrote:
> > Hi all,
> > *Note: *
> >   Django doesn't serialize inherited Model fields in the Child Model. I
> > asked
> > on IRC why this decision was taken but got no response. I searched the
> > devel list too, but did not get anything on it. I want to add it to my
> > proposal, but before doing it I wanted to know why this decision was
> > taken. Will it be a workable and necessary solution to add that to my
> > proposal?
>
> Malcolm has already addressed this, and his analysis is pretty much
> spot on. I would only add that the current behaviour can also be
> explained by looking at the heritage of the fixture system.
> Historically, Django's fixtures have been used as a way of serializing
> output for transfer between two Django installations (for example, as
> test fixtures). To this end, the serializers have concentrated on
> replicating a very database-like structure - that is, the structures
> that are serialized closely match the underlying database structures.
> In an inheritance situation, child tables don't contain all the data
> from the parent table; hence, neither do the serialized structures.
>
> Obviously, this focus on representing the database misses an obvious
> alternate use case - occasions where serialization is required to
> communicate to some other data consumer, such as an AJAX framework. In
> my 'big picture' of the ideal serialization SoC project, this is the
> problem that needs to be fixed. More on in later comments.

OK, got it. This can be taken care of by the *fields* class attribute in the
above API.

> > Same is the case for Ticket #10201. Can someone please tell me why
> > microsecond data was dropped?
>
> Again, Malcolm is on the money. If you can come up with a fix that
> enables non-millisecond deprived databases to maintain microseconds,
> I'm sure it would be a welcome inclusion. Thinking about it, this
> shouldn't actually be that hard to achieve.

I am still not very sure how to implement this. The only approach I can
think of at the moment is the hard-coded approach:
if database_type == mysql:  #during deserialization
    get rid of microseconds info.
But I don't feel it is an elegant solution. There may be a better one which
I am not able to think of right now, so I will exclude it for now. If I find
a solution, or someone suggests one, it wouldn't hurt to implement it
anyway?
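
One way to avoid checking the database name directly would be to ask the
backend for a capability and truncate only when it is missing. The sketch
below assumes a boolean supports_microseconds value is available from
somewhere; that flag and the function name normalize_datetime are
hypothetical and are not taken from Django.

```python
import datetime


def normalize_datetime(value, supports_microseconds):
    """Drop microseconds only for backends that cannot store them."""
    if supports_microseconds:
        return value
    # datetime.replace returns a new object; the original is unchanged.
    return value.replace(microsecond=0)


stamp = datetime.datetime(2009, 3, 1, 6, 0, 0, 123456)
kept = normalize_datetime(stamp, supports_microseconds=True)
truncated = normalize_datetime(stamp, supports_microseconds=False)
```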

> >   The project is planned to be completed in 9 phases.
> ...
> >   2. Finalizing Design and Coding Phase I (May 22th – May 31st )
> >   3. Testing Phase I (June 1st – June 5th )
>
> As a prior warning - I'm very skeptical of anyone that proposes a
> "test" phase that isn't integrated with the "build" phase. If you're
> not testing at the same time you are building, then you don't know you
> have the right result? If you test after you build, what happens when
> your test reveals a problem with your implementation?
>
> I know line items like this make accountant types happy, but it just
> doesn't wash with me. If your implementation, including tests, will
> take 3 weeks, then say three weeks. Don't say 2 weeks implementation
> followed by a 1 week test.

I have not provided the full schedule for my revised proposal, just the
APIs, since I feel this is an entirely new approach to serialization and
still requires some refining, after which I can prepare a good schedule.
He he, I understood what you meant (I think I am one of the accountant
types then ;-) since I love that kind of split-up). I am correcting it
anyway, now that I understand the problem you indicated.

This is a very rough schedule, nowhere close to complete.
From May 22:
1. Create the newserializer framework classes. Add the list_separator and
dict_separator attributes. Make sure everything is sane and, with all
defaults, works correctly as before without breaking existing serializers -
4 weeks.
2. Add support for the additional APIs, namely methods and attributes such
as fields, exclude_fields, extra_fields(), pk_serialize() and
model_extras(), and test them - 3 weeks.
3. Add support for following relations: the select_related, inline_related
and reverse_related class attributes - 4 weeks.
4. Write user and developer documentation, fix minor issues and bugs,
communicate and discuss with the community, and scrub the code - 2 weeks.

> Thanks for taking the time to put together such a comprehensive
> proposal. I hope my comments haven't left you too despondent. :-)
>

No way. I am very happy that you pointed out where I was failing to see the
big picture. In fact I took it positively, as I always do when someone
points out my mistakes. I understand that people point out mistakes only
for my good. I hope my work above reflects that :(

> However, all is not lost.

I am of the same opinion too. I want to be a Django contributor and I want
to be a Django GSoC student too (period)

> While it would be advantageous to have a
> complete API proposal before starting work, it isn't completely
> necessary. What would be necessary at a minimum is a set of use cases
> to provide some sort of scope for what you would like to achieve
> (i.e., develop a serialization API that would allow for the following
> serialization use cases). Once we have a set of use cases, we can
> establish the options that we have for an API, and develop that API
> during the 'getting to know you' phase, and even during the initial
> development phase of the GSoC project.
>
> Of course, if you already have any ideas on how to specify
> user-customizable serialization formats, feel free to knock our socks
> off :-)
>

I hope I have covered most of the things I have learnt and that can be done.

P.S. (I think it is not very easy to come up with a revolutionary idea in
one single day. So I don't claim it is revolutionary, but I claim it is
better than what exists now and what I proposed initially.)
-- 
Thanks and regards,
 Madhusudan.C.S

Blogs at: www.madhusudancs.info
Official Email ID: [email protected]

