> I started writing the draft for a full proposal, however, I don't have
> time to finish it today as I have to revise for tomorrow's exam. I
> will try to finish it in 12 hours at most since I know I'm already
> posting it a little bit too late to make it possible to review it
> thoroughly.

Heh, didn't manage the 12 hours, well, at least it is still
today... (hey, almost to the minute two days until deadline!)

I have yet to add some more references to the document.
The same URL as before still holds if anyone prefers it:
http://people.ksp.sk/~johnny64/GSoC-full-proposal


GSoC 2011 Proposal: Composite Fields
====================================

About me
--------

My name is Michal Petrucha. I'm an undergrad student of computer science
at the Faculty of Mathematics, Physics and Informatics, Comenius
University, Bratislava, Slovakia. As a high school student I participated
in programming contests such as the Olympiad in Informatics.

While developing an application for internal use by the organizers of
several Slovak high school programming contests, I got into a situation
where support for composite primary keys would have helped greatly, so I
decided to implement it with some extra added value.


Synopsis
--------

Django's ORM is a powerful tool which suits most use cases perfectly;
however, there are cases where having exactly one primary key column per
table introduces unnecessary redundancy.

One such case is the many-to-many intermediary model. Even though the pair
of ForeignKeys in this model uniquely identifies each relationship, an
additional field is required by the ORM to identify individual rows. While
this isn't a real problem when the underlying database schema is created
by Django, it becomes an obstacle as soon as one tries to develop a Django
application using a legacy database.

Since there is already a lot of code relying on the pk property of model
instances and the ability to use it in QuerySet filters, it is necessary
to implement a mechanism to allow filtering of several actual fields by
specifying a single filter.

The proposed solution is a virtual field type, CompositeField. This field
type will enclose several real fields within one single object. From the
public API perspective this field type will share several characteristics
of other field types, namely:

- CompositeField.unique
    This will create a unique index on the enclosed fields in the
    database, deprecating the 'unique_together' Meta attribute.

- CompositeField.db_index
    This option will create a non-unique index on the enclosed fields.

- CompositeField.primary_key
    This option will tell the ORM that the primary key for the model is
    composed of the enclosed fields.

- Retrieval and assignment
    Retrieval of the CompositeField value will return a namedtuple
    containing the actual values of underlying fields; assignment will
    assign given values to the underlying fields.

- QuerySet filtering
    Supplying an iterable the same way as with assignment to an
    'exact'-type filter will match only those instances where each
    underlying field value equals the corresponding supplied value.


Implementation
--------------

Specifying a CompositeField in a Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The constructor of a CompositeField will accept the supported options as
keyword parameters and the enclosed fields will be specified as positional
parameters. The order in which they are specified will determine their
order in the namedtuple representing the CompositeField value (i.e. when
retrieving and assigning the CompositeField's value; see example below).

unique and db_index
~~~~~~~~~~~~~~~~~~~

Implementing these will require some modifications in the backend code.
Both the table creation and the index creation routines will have to
handle virtual fields in addition to local fields.

When the code handling CompositeField.unique is finished, the
models.options.Options class will have to be modified to create a unique
CompositeField for each tuple in the Meta.unique_together attribute. The
code handling unique checks in models.Model will also have to be updated
to reflect the change.
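
As a rough illustration of the conversion step, the sketch below turns a
unique_together value into a tuple of field-name tuples, each of which
would then become one unique CompositeField. The helper name is
hypothetical and nothing here touches Django itself; it only shows the
shape of the data that Options would have to process.

```python
def normalize_unique_together(unique_together):
    """Return unique_together as a tuple of field-name tuples.

    Accepts either a single group ("a", "b") or a sequence of groups
    (("a", "b"), ("c", "d")). Hypothetical helper, for illustration only.
    """
    if not unique_together:
        return ()
    first = unique_together[0]
    if isinstance(first, str):
        # A single flat group was given; wrap it.
        return (tuple(unique_together),)
    return tuple(tuple(group) for group in unique_together)


# Each resulting group corresponds to one unique CompositeField.
groups = normalize_unique_together(("first_field", "second_field"))
```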

Retrieval and assignment
~~~~~~~~~~~~~~~~~~~~~~~~

Jacob has actually already provided a skeleton of the code that takes care
of this as seen in [1]. I'll only summarize the behaviour in a brief
example of my own.

    class SomeModel(models.Model):
        first_field = models.IntegerField()
        second_field = models.CharField(max_length=100)
        composite = models.CompositeField(first_field, second_field)

    >>> instance = SomeModel(first_field=47, second_field="some string")
    >>> instance.composite
    SomeModel_composite(first_field=47, second_field='some string')
    >>> instance.composite = (74, "other string")
    >>> instance.first_field, instance.second_field
    (74, 'other string')

One thing that bugs me is the name of the namedtuple -- it is a class,
which means InitialCaps would be the right style. However, its name is
partly composed of a field name, for which underscore-separated words fit
better. This is just a cosmetic detail though.
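
The retrieval and assignment behaviour can be imitated in plain Python
with a descriptor. The following is only a minimal sketch, not Jacob's
skeleton from [1] and not Django code: the CompositeValue class is a
hypothetical stand-in, and the enclosed fields are referred to by name
rather than passed as field objects.

```python
from collections import namedtuple


class CompositeValue:
    """Descriptor exposing several attributes as one namedtuple.

    Hypothetical plain-Python stand-in for the proposed CompositeField.
    """

    def __init__(self, *field_names):
        self.field_names = field_names
        self.tuple_class = None

    def __set_name__(self, owner, name):
        # Name the tuple class after the owning class and the attribute,
        # e.g. "SomeModel_composite".
        self.tuple_class = namedtuple(
            f"{owner.__name__}_{name}", self.field_names
        )

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        values = (getattr(instance, name) for name in self.field_names)
        return self.tuple_class(*values)

    def __set__(self, instance, value):
        values = tuple(value)
        if len(values) != len(self.field_names):
            raise ValueError("wrong number of values for composite field")
        # Assignment writes through to the underlying attributes.
        for name, item in zip(self.field_names, values):
            setattr(instance, name, item)


class SomeModel:
    composite = CompositeValue("first_field", "second_field")

    def __init__(self, first_field, second_field):
        self.first_field = first_field
        self.second_field = second_field
```

The composite attribute stores nothing itself; it only reads from and
writes to the underlying attributes, which is the property the real field
would need as well.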

QuerySet filtering
~~~~~~~~~~~~~~~~~~

This is where the real fun begins.

The fundamental problem here is that Q objects which are used all over the
code that handles filtering are designed to describe single field lookups.
On the other hand, CompositeFields will require a way to describe several
individual field lookups by a single expression.

Since the Q objects themselves have no idea about fields at all and the
actual field resolution from the filter conditions happens deeper down the
line, inside models.sql.query.Query, this is where we can handle the
filters properly.

There is already some basic machinery inside Query.add_filter and
Query.setup_joins that is in use by GenericRelations; unfortunately, this
is not enough. The optional extra_filters field method will be of great
use here, though it will have to be extended.

Currently the only parameters it gets are the list of joins the
filter traverses, the position in the list and a negate parameter
specifying whether the filter is negated. The GenericRelation instance can
determine the value of the content type (which is what the extra_filters
method is used for) easily based on the model it belongs to.

This is not the case for a CompositeField -- it doesn't have any idea
about the values used in the query. Therefore a new parameter has to be
added to the method so that the CompositeField can construct all the
actual filters from the iterable containing the values.

Afterwards the handling inside Query is pretty straightforward. For
CompositeFields (and virtual fields in general) there is no value to be
used in the where node; the extra_filters are responsible for all
filtering. However, since the filter should apply to a single object even
after join traversals, the aliases will be set up while handling the
"root" filter and then reused for each one of the extra_filters.
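
To make the expansion concrete, here is a minimal sketch of how a single
composite 'exact' lookup could be broken down into one lookup per enclosed
field. The function name, its parameters and the returned list of
(lookup, value) pairs are assumptions made for illustration; this does not
reproduce the real extra_filters signature.

```python
def expand_composite_filter(field_names, value, prefix=""):
    """Expand one composite 'exact' lookup into per-field lookups.

    field_names: names of the enclosed fields, in declaration order.
    value: iterable with one value per enclosed field.
    prefix: lookup path leading to the model, e.g. "related__".
    Hypothetical helper, for illustration only.
    """
    values = tuple(value)
    if len(values) != len(field_names):
        raise ValueError("expected one value per enclosed field")
    return [
        (f"{prefix}{name}__exact", item)
        for name, item in zip(field_names, values)
    ]


filters = expand_composite_filter(
    ("first_field", "second_field"), (47, "some string")
)
```

All of the generated lookups would have to be resolved against the same
table alias, which is why the aliases are set up once for the "root"
filter.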

CompositeField.primary_key
~~~~~~~~~~~~~~~~~~~~~~~~~~

As with db_index and unique, the backend table generating code will have
to be updated to set the PRIMARY KEY to a tuple. In this case, however,
the impact on the rest of the ORM and some other parts of Django is more
serious.

A (hopefully) complete list of things affected by this is:
- the admin: the possibility to pass the value of the primary key as a
  parameter inside the URL is a necessity to be able to work with a model
- contenttypes: since the admin uses GenericForeignKeys to log activity,
  there will have to be some support
- forms: more precisely, ModelForms and their ModelChoiceFields
- relationship fields: ForeignKey, ManyToManyField and OneToOneField will
  need a way to point to a model with a CompositeField as its primary key

Let's look at each one of them in more detail.

Admin
~~~~~

The solution that has been proposed so many times in the past is to extend
the quote function used in the admin to also quote the comma and then use
an unquoted comma as the separator. Even though this solution looks ugly
to some, I don't think there is much choice -- there needs to be a way to
separate the values and in theory, any character could be contained inside
a value so we can't really avoid choosing one and escaping it.
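
The escaping scheme can be sketched in plain Python. The "_XX" hexadecimal
escape used below is an assumption chosen for illustration and is not the
admin's actual quote function; all function names are hypothetical.

```python
import re

SEPARATOR = ","
ESCAPE = "_"


def quote_part(value):
    """Escape the separator and the escape character in one value."""
    return "".join(
        f"{ESCAPE}{ord(ch):02X}" if ch in (SEPARATOR, ESCAPE) else ch
        for ch in str(value)
    )


def unquote_part(text):
    """Reverse quote_part: turn every _XX sequence back into a character."""
    return re.sub(
        r"_([0-9A-F]{2})", lambda m: chr(int(m.group(1), 16)), text
    )


def join_key(values):
    """Build one string from the values of a composite key."""
    return SEPARATOR.join(quote_part(v) for v in values)


def split_key(text):
    """Split a joined key back into its (string) components."""
    return tuple(unquote_part(part) for part in text.split(SEPARATOR))


url_fragment = join_key((74, "other, string"))
```

Note that the components come back as strings, so each one would still
have to be converted by the corresponding enclosed field.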

GenericForeignKeys
~~~~~~~~~~~~~~~~~~

As I said, this is used in the admin, which means we can't have full admin
support without also making GenericForeignKeys work with CompositeFields.
The solution I'm proposing is the same as in admin URLs: escaping the
comma and using it as the separator. This will leave us with a string,
which means the object_id field will have to be capable of storing
strings.

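That is not an issue for the admin since it uses a TextField. In general,
though, it will be a limitation: a GenericForeignKey will only be able to
point to a model with a composite primary key if its object_id field can
store strings.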

ModelChoiceFields
~~~~~~~~~~~~~~~~~

Again, we need a way to specify the value as a parameter passed in the
form. The same escaping solution can be used even here.

Relationship fields
~~~~~~~~~~~~~~~~~~~

This turns out to be, not too surprisingly, the toughest problem. The fact
that related fields are spread across about fifteen different classes,
most of which are quite nontrivial, makes the whole bundle pretty fragile,
which means the changes have to be made carefully not to break anything.

What we need to achieve is that the ForeignKey, ManyToManyField and
OneToOneField detect when their target field is a CompositeField in
several situations and act accordingly since this will require different
handling than regular fields that map directly to database columns.

The first one to look at is ForeignKey since the other two rely on its
functionality, OneToOneField being its descendant and ManyToManyField
using ForeignKeys in the intermediary model. Once the ForeignKeys work,
OneToOneField should require minimal to no changes since it inherits
almost everything from ForeignKey.

The easiest part is that for composite related fields, the db_type will be
None since the data will be stored elsewhere.

ForeignKey and OneToOneField will also be able to create the underlying
fields automatically when added to the model. I'm proposing the following
default names: "fkname_targetname" where "fkname" is the name of the
ForeignKey field and "targetname" is the name of the remote field
corresponding to the local one. I'm open to other suggestions on this.

There will also be a way to override the default names using a new field
option "enclosed_fields". This option will expect a tuple of fields, each
of which corresponds to one individual field in the same order as
specified in the target CompositeField. This option will be ignored for
non-composite ForeignKeys.
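
The naming rule is simple enough to show as a short sketch. The function
below is hypothetical, and for simplicity it treats "enclosed_fields" as a
tuple of names rather than a tuple of field instances.

```python
def enclosed_field_names(fk_name, target_names, enclosed_fields=None):
    """Return the local field names backing a composite ForeignKey.

    fk_name: name of the ForeignKey field on the local model.
    target_names: names of the fields enclosed in the target
        CompositeField, in declaration order.
    enclosed_fields: optional explicit names overriding the defaults.
    Hypothetical helper, for illustration only.
    """
    if enclosed_fields is not None:
        names = tuple(enclosed_fields)
        if len(names) != len(target_names):
            raise ValueError("one local field is needed per target field")
        return names
    # Default scheme: "fkname_targetname".
    return tuple(f"{fk_name}_{target}" for target in target_names)


default_names = enclosed_field_names(
    "author", ("first_field", "second_field")
)
```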

The trickiest part, however, will be relation traversals in QuerySet
lookups. Currently the code in models.sql.query.Query that creates joins
only joins on single columns. To be able to span a composite relationship
the code that generates joins will have to recognize column tuples and add
a constraint for each pair of corresponding columns with the same aliases
in all conditions.
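
The join constraint over a column tuple could look roughly like the output
of the sketch below. It builds the condition as a plain string purely for
illustration; the function name and the naive quoting are assumptions and
do not reflect how the Query class actually assembles SQL.

```python
def join_condition(lhs_alias, lhs_columns, rhs_alias, rhs_columns):
    """Build an ON condition comparing corresponding column pairs.

    The same two aliases are used in every comparison, so all
    constraints refer to the same pair of joined rows.
    Hypothetical helper, for illustration only.
    """
    if len(lhs_columns) != len(rhs_columns):
        raise ValueError("column tuples must have the same length")
    return " AND ".join(
        f'{lhs_alias}."{left}" = {rhs_alias}."{right}"'
        for left, right in zip(lhs_columns, rhs_columns)
    )


condition = join_condition(
    "T1", ("author_first_field", "author_second_field"),
    "T2", ("first_field", "second_field"),
)
```

A single-column relationship is just the special case of a one-element
tuple, so the existing behaviour falls out of the same code path.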

For the sake of completeness, ForeignKey will also have an extra_filters
method, which will allow filtering by a related object or its primary key.

With all this infrastructure set up, ManyToMany relationships using
composite fields will be easy enough. Intermediary model creation will
work thanks to automatic underlying field creation for composite fields
and traversal in both directions will be supported by the query code.


Other considerations
--------------------

This infrastructure will allow reimplementing the GenericForeignKey as a
CompositeField at a later stage. Thanks to the modifications in the
joining code it should also be possible to implement bidirectional generic
relationship traversal in QuerySet filters. This is, however, out of scope
of this project.

CompositeFields will have the serialize option set to False to prevent
their serialization. Otherwise the enclosed fields would be serialized
twice, which would introduce not only redundancy but also ambiguity.

Also CompositeFields will be ignored in ModelForms by default, for two
reasons: 
- otherwise the same field would be inside the form twice
- there aren't really any form fields usable for tuples and a fieldset
  would require even more out-of-scope machinery

The CompositeField will not allow enclosing other CompositeFields. The
only exception might be the case of composite ForeignKeys, which could
also be implemented after the successful completion of this project. With
this feature the autogenerated intermediary M2M model could make the two
ForeignKeys its primary key, dropping the need for a redundant id
AutoField.

Estimates and timeline
----------------------

As I will have quite a few exams at school throughout June, I won't be
able to commit myself fully to the project for the first month and will
spend approximately 20 hours per week during this period. By the end of
the exam period, however, I intend to have sped up to about 30-35 hours
per week.

The proposed timeline is as follows:

week  1 (May 23. - May 29.):
- basic CompositeField implementation with assignment and retrieval
- documentation for the new field type API

week  2 (May 30. - Jun  5.):
- creation of indexes on the database
- unique conditions checking regression tests

week  3 (Jun  6. - Jun 12.):
- query code refactoring to make it possible to support the required
  extra_filters
- lookups by CompositeFields

week  4 (Jun 13. - Jun 19.):
- creation of a composite primary key
- more tests and taking care of any missing/forgotten documentation so far

week  5 (Jun 20. - Jun 26.):
- ModelForms and GFK support for composite primary keys

week  6 (Jun 27. - Jul  3.):
- full support in the admin

week  7 (Jul  4. - Jul 10.):
- fixing any documentation discrepancies and making sure everything is
  tested thoroughly
- exploring the related fields in detail and working up a detailed plan
  for the following changes

----> midterm
  By the time midterm evaluation arrives, everything except for
  relationship fields should be in production-ready state.

week  8 (Jul 11. - Jul 17.):
- implementing composite primary key support in all the
  RelatedObjectDescriptors

week  9 (Jul 18. - Jul 24.):
- query joins refactoring
- support for ForeignKey relationship traversals

week 10 (Jul 25. - Jul 31.):
- making sure OneToOne and ManyToMany work as well

weeks 11&12 (Aug  1. - Aug 14.):
- writing even more tests for the relationships
- finishing any missing documentation

----> pencils down

As can be seen from the proposed timeline, there is a separation between
the part that leads up to admin support for composite primary keys and the
relationship part. In my opinion the first part is more likely to be used
in practice than the second part so the main emphasis will be put on it in
case I discover unexpected difficulties. However, looking at the timeline
broken down into small parts I'm confident all proposed features should be
possible in the given time.

Contact
-------

This e-mail address, michal.petru...@ksp.sk, is probably the most reliable
way.
Jabber: johnn...@swissjabber.org
IRC: koniiiik @ #django and #django-dev


References
----------

[1] https://groups.google.com/d/msg/django-developers/Y0aAb792cTw/pGt8WFCmFhYJ
