> I started writing the draft for a full proposal, however, I don't have
> time to finish it today as I have to revise for tomorrow's exam. I
> will try to finish it in 12 hours at most since I know I'm already
> posting it a little bit too late to make it possible to review it
> thoroughly.

Heh, didn't manage the 12 hours, well, at least it is still today... (hey, almost to the minute two days until deadline!)

I have yet to add some more references to the document. The same URL as before still holds if anyone prefers it:
http://people.ksp.sk/~johnny64/GSoC-full-proposal


GSoC 2011 Proposal: Composite Fields
====================================

About me
--------

My name is Michal Petrucha. I'm an undergrad student of computer science at the Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia. As a high school student I participated in programming contests such as the Olympiad in Informatics.

While developing an application for internal use by the organizers of several Slovak high school programming contests I got into a situation where having support for composite primary keys would help greatly, therefore I decided to implement it with some extra added value.

Synopsis
--------

Django's ORM is a powerful tool which suits most use cases perfectly, however, there are cases where having exactly one primary key column per table induces unnecessary redundancy.

One such case is the many-to-many intermediary model. Even though the pair of ForeignKeys in this model uniquely identifies each relationship, an additional field is required by the ORM to identify individual rows. While this isn't a real problem when the underlying database schema is created by Django, it becomes an obstacle as soon as one tries to develop a Django application using a legacy database.

Since there is already a lot of code relying on the pk property of model instances and the ability to use it in QuerySet filters, it is necessary to implement a mechanism to allow filtering of several actual fields by specifying a single filter.

The proposed solution is a virtual field type, CompositeField. This field type will enclose several real fields within one single object.
From the public API perspective this field type will share several characteristics of other field types, namely:

- CompositeField.unique
    This will create a unique index on the enclosed fields in the database, deprecating the 'unique_together' Meta attribute.

- CompositeField.db_index
    This option will create a non-unique index on the enclosed fields.

- CompositeField.primary_key
    This option will tell the ORM that the primary key for the model is composed of the enclosed fields.

- Retrieval and assignment
    Retrieval of the CompositeField value will return a namedtuple containing the actual values of underlying fields; assignment will assign given values to the underlying fields.

- QuerySet filtering
    Supplying an iterable the same way as with assignment to an 'exact'-type filter will match only those instances where each underlying field value equals the corresponding supplied value.

Implementation
--------------

Specifying a CompositeField in a Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The constructor of a CompositeField will accept the supported options as keyword parameters and the enclosed fields will be specified as positional parameters. The order in which they are specified will determine their order in the namedtuple representing the CompositeField value (i.e. when retrieving and assigning the CompositeField's value; see example below).

unique and db_index
~~~~~~~~~~~~~~~~~~~

Implementing these will require some modifications in the backend code. The table creation code will have to handle virtual fields as well as local fields in the table creation and index creation routines respectively.

When the code handling CompositeField.unique is finished, the models.options.Options class will have to be modified to create a unique CompositeField for each tuple in the Meta.unique_together attribute. The code handling unique checks in models.Model will also have to be updated to reflect the change.

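As a rough illustration of that last step, the translation of Meta.unique_together into unique CompositeFields might look something like the sketch below. It is plain Python rather than actual Options code: CompositeSpec and composite_specs_from_unique_together are hypothetical names made up for this example. The only behaviour taken from Django is that unique_together may be given either as a single tuple of field names or as a tuple of such tuples.

```python
from collections import namedtuple

# Hypothetical stand-in for "a unique CompositeField that has to be created";
# in the real implementation this would be a CompositeField instance.
CompositeSpec = namedtuple("CompositeSpec", ["field_names", "unique"])


def composite_specs_from_unique_together(unique_together):
    """Turn a Meta.unique_together value into one unique spec per tuple."""
    if not unique_together:
        return []
    # The shorthand form is a single tuple of field names, so normalize
    # ("a", "b") to (("a", "b"),) before iterating.
    if isinstance(unique_together[0], str):
        unique_together = (unique_together,)
    return [
        CompositeSpec(field_names=tuple(names), unique=True)
        for names in unique_together
    ]


specs = composite_specs_from_unique_together(
    (("first", "second"), ("a", "b", "c"))
)
```

Each resulting spec lists the enclosed field names in the order given in unique_together, which is also the order the namedtuple value would use.
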
Retrieval and assignment
~~~~~~~~~~~~~~~~~~~~~~~~

Jacob has actually already provided a skeleton of the code that takes care of this as seen in [1]. I'll only summarize the behaviour in a brief example of my own.

    class SomeModel(models.Model):
        first_field = models.IntegerField()
        second_field = models.CharField(max_length=100)
        composite = models.CompositeField(first_field, second_field)

    >>> instance = SomeModel(first_field=47, second_field="some string")
    >>> instance.composite
    SomeModel_composite(first_field=47, second_field='some string')
    >>> instance.composite = (74, "other string")
    >>> instance.first_field, instance.second_field
    (74, 'other string')

One thing that bugs me is the name of the namedtuple -- it is a class, which means InitialCaps is the right way, however, its name is partly composed of a field name which underscore-separated words fit better. This is just a cosmetic detail though.

QuerySet filtering
~~~~~~~~~~~~~~~~~~

This is where the real fun begins.

The fundamental problem here is that Q objects which are used all over the code that handles filtering are designed to describe single field lookups. On the other hand, CompositeFields will require a way to describe several individual field lookups by a single expression.

Since the Q objects themselves have no idea about fields at all and the actual field resolution from the filter conditions happens deeper down the line, inside models.sql.query.Query, this is where we can handle the filters properly.

There is already some basic machinery inside Query.add_filter and Query.setup_joins that is in use by GenericRelations, but this is unfortunately not enough. The optional extra_filters field method will be of great use here, though it will have to be extended.

Currently the only parameters it gets are the list of joins the filter traverses, the position in the list and a negate parameter specifying whether the filter is negated.
The GenericRelation instance can determine the value of the content type (which is what the extra_filters method is used for) easily based on the model it belongs to. This is not the case for a CompositeField -- it doesn't have any idea about the values used in the query. Therefore a new parameter has to be added to the method so that the CompositeField can construct all the actual filters from the iterable containing the values.

Afterwards the handling inside Query is pretty straightforward. For CompositeFields (and virtual fields in general) there is no value to be used in the where node, the extra_filters are responsible for all filtering, but since the filter should apply to a single object even after join traversals, the aliases will be set up while handling the "root" filter and then reused for each one of the extra_filters.

CompositeField.primary_key
~~~~~~~~~~~~~~~~~~~~~~~~~~

As with db_index and unique, the backend table generating code will have to be updated to set the PRIMARY KEY to a tuple. In this case, however, the impact on the rest of the ORM and some other parts of Django is more serious.

A (hopefully) complete list of things affected by this is:

- the admin: the possibility to pass the value of the primary key as a parameter inside the URL is a necessity to be able to work with a model
- contenttypes: since the admin uses GenericForeignKeys to log activity, there will have to be some support
- forms: more precisely, ModelForms and their ModelChoiceFields
- relationship fields: ForeignKey, ManyToManyField and OneToOneField will need a way to point to a model with a CompositeField as its primary key

Let's look at each one of them in more detail.

Admin
~~~~~

The solution that has been proposed so many times in the past is to extend the quote function used in the admin to also quote the comma and then use an unquoted comma as the separator.

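A minimal sketch of that scheme, just to make the idea concrete. It is self-contained and does not call the admin's own quote function; the `_XX` hexadecimal escape format and the helper names (quote_part, unquote_part, join_pk, split_pk) are assumptions picked for this illustration.

```python
SEPARATOR = ","
ESCAPE = "_"


def quote_part(value):
    """Escape the separator and the escape character as _XX (hex code)."""
    return "".join(
        "_%02X" % ord(ch) if ch in (SEPARATOR, ESCAPE) else ch
        for ch in str(value)
    )


def unquote_part(text):
    """Reverse quote_part()."""
    result = []
    i = 0
    while i < len(text):
        if text[i] == ESCAPE:
            # The two characters after the escape are the hex code.
            result.append(chr(int(text[i + 1:i + 3], 16)))
            i += 3
        else:
            result.append(text[i])
            i += 1
    return "".join(result)


def join_pk(values):
    """Build the single string representing a composite primary key."""
    return SEPARATOR.join(quote_part(v) for v in values)


def split_pk(text):
    """Split the string back into the individual (string) values."""
    return [unquote_part(part) for part in text.split(SEPARATOR)]


encoded = join_pk((47, "some,string_with stuff"))
decoded = split_pk(encoded)
```

Note that the values come back as strings; in this sketch, converting them back to the proper Python types is left to the individual enclosed fields.
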
Even though this solution looks ugly to some, I don't think there is much choice -- there needs to be a way to separate the values and in theory, any character could be contained inside a value so we can't really avoid choosing one and escaping it.

GenericForeignKeys
~~~~~~~~~~~~~~~~~~

As I said, this is used in the admin, which means we can't have full admin support without also making GenericForeignKeys work with CompositeFields.

The solution I'm proposing is the same as in admin URLs: escaping the comma and using it as the separator. This will leave us with a string, which means the object_id field will have to be capable of storing strings. That is not an issue for the admin since it uses a TextField. It will be a limitation for this special case.

ModelChoiceFields
~~~~~~~~~~~~~~~~~

Again, we need a way to specify the value as a parameter passed in the form. The same escaping solution can be used even here.

Relationship fields
~~~~~~~~~~~~~~~~~~~

This turns out to be, not too surprisingly, the toughest problem. The fact that related fields are spread across about fifteen different classes, most of which are quite nontrivial, makes the whole bundle pretty fragile, which means the changes have to be made carefully not to break anything.

What we need to achieve is that the ForeignKey, ManyToManyField and OneToOneField detect when their target field is a CompositeField in several situations and act accordingly since this will require different handling than regular fields that map directly to database columns.

The first one to look at is ForeignKey since the other two rely on its functionality, OneToOneField being its descendant and ManyToManyField using ForeignKeys in the intermediary model. Once the ForeignKeys work, OneToOneField should require minimal to no changes since it inherits almost everything from ForeignKey.

The easiest part is that for composite related fields, the db_type will be None since the data will be stored elsewhere.

ForeignKey and OneToOneField will also be able to create the underlying fields automatically when added to the model. I'm proposing the following default names: "fkname_targetname" where "fkname" is the name of the ForeignKey field and "targetname" is the name of the remote field corresponding to the local one. I'm open to other suggestions on this.

There will also be a way to override the default names using a new field option "enclosed_fields". This option will expect a tuple of fields, each of which corresponds to one individual field in the same order as specified in the target CompositeField. This option will be ignored for non-composite ForeignKeys.

The trickiest part, however, will be relation traversals in QuerySet lookups. Currently the code in models.sql.query.Query that creates joins only joins on single columns. To be able to span a composite relationship the code that generates joins will have to recognize column tuples and add a constraint for each pair of corresponding columns with the same aliases in all conditions.

For the sake of completeness, ForeignKey will also have an extra_filters method allowing filtering by a related object or its primary key.

With all this infrastructure set up, ManyToMany relationships using composite fields will be easy enough. Intermediary model creation will work thanks to automatic underlying field creation for composite fields and traversal in both directions will be supported by the query code.

Other considerations
--------------------

This infrastructure will allow reimplementing the GenericForeignKey as a CompositeField at a later stage. Thanks to the modifications in the joining code it should also be possible to implement bidirectional generic relationship traversal in QuerySet filters. This is, however, out of scope of this project.

CompositeFields will have the serialize option set to False to prevent their serialization.
Otherwise the enclosed fields would be serialized twice, which would introduce not only redundancy but also ambiguity.

Also CompositeFields will be ignored in ModelForms by default, for two reasons:

- otherwise the same field would be inside the form twice
- there aren't really any form fields usable for tuples and a fieldset would require even more out-of-scope machinery

The CompositeField will not allow enclosing other CompositeFields. The only exception might be the case of composite ForeignKeys which could also be implemented after the successful completion of this project. With this feature the autogenerated intermediary M2M model could make the two ForeignKeys its primary key, dropping the need to have a redundant id AutoField.

Estimates and timeline
----------------------

As I will have quite a few exams at school throughout June, I won't be able to commit myself fully to the project for the first month and will spend approximately 20 hours per week during this period. By the end of the exam period, however, I intend to have sped up to about 30-35 hours per week.

The proposed timeline is as follows:

week 1 (May 23. - May 29.):
- basic CompositeField implementation with assignment and retrieval
- documentation for the new field type API

week 2 (May 30. - Jun 5.):
- creation of indexes on the database
- unique conditions checking regression tests

week 3 (Jun 6. - Jun 12.):
- query code refactoring to make it possible to support the required extra_filters
- lookups by CompositeFields

week 4 (Jun 13. - Jun 19.):
- creation of a composite primary key
- more tests and taking care of any missing/forgotten documentation so far

week 5 (Jun 20. - Jun 26.):
- ModelForms and GFK support for composite primary keys

week 6 (Jun 27. - Jul 3.):
- full support in the admin

week 7 (Jul 4. - Jul 10.):
- fixing any documentation discrepancies and making sure everything is tested thoroughly
- exploring the related fields in detail and working up a detailed plan for the following changes

----> midterm

By the time midterm evaluation arrives, everything except for relationship fields should be in production-ready state.

week 8 (Jul 11. - Jul 17.):
- implementing composite primary key support in all the RelatedObjectDescriptors

week 9 (Jul 18. - Jul 24.):
- query joins refactoring
- support for ForeignKey relationship traversals

week 10 (Jul 25. - Jul 31.):
- making sure OneToOne and ManyToMany work as well

weeks 11&12 (Aug 1. - Aug 14.):
- writing even more tests for the relationships
- finishing any missing documentation

----> pencils down

As can be seen from the proposed timeline, there is a separation between the part that leads up to admin support for composite primary keys and the relationship part. In my opinion the first part is more likely to be used in practice than the second part, so the main emphasis will be put on it in case I discover unexpected difficulties. However, looking at the timeline broken down into small parts I'm confident all proposed features should be possible in the given time.

Contact
-------

This e-mail address, michal.petru...@ksp.sk, is probably the most reliable way.
Jabber: johnn...@swissjabber.org
IRC: koniiiik @ #django and #django-dev

References
----------

[1] https://groups.google.com/d/msg/django-developers/Y0aAb792cTw/pGt8WFCmFhYJ