Re: Using subdocument _id fields for multi-environment support
TL;DR: It has been surprisingly difficult. As per what has already been done for the units and services collections, we will continue with the approach of using uuid:id style string ids while also adding separate env-UUID and collection-specific identifier fields.

Jesse and I have been making the changes to use subdocument ids for the units and services collections this week at the sprint and we've come up against some unexpected issues.

We found that one part of the mgo/txn package wasn't happy using struct ids and have been working with Gustavo to fix that. This isn't a show-stopper but has slowed us down.

We also found unexpected friction with the implementation of the watchers and entity life. These areas deeply assume that our document ids are strings, and fixing them requires wide-ranging and often ugly changes which will take significant time to get right. It's been brick wall after brick wall.

We discussed this with Tim, Will, John and Ian yesterday. Given that it's important that multi-environment support lands soon, and given that the watchers are going to completely change in the not too distant future[1], we have abandoned the approach of using subdocument ids for multi-environment support. The benefits of using subdocument ids are outweighed by the changes required.

- Menno

[1] - opening up the possibility of surrogate keys as document ids, where we need application domain fields to exist as fields outside of the _id.

On 1 October 2014 22:11, Menno Smits menno.sm...@canonical.com wrote:

On 2 October 2014 01:31, Kapil Thangavelu kapil.thangav...@canonical.com wrote:

it feels a little strange to use a mutable object for an immutable field. that said it does seem functional. although the immutability speaks to the first disadvantage noted for the separate fields namely becoming out of sync, which afaics isn't something that's possible with the current model, ie. a change of name needs to generate a new doc.
Names (previous _id) are unique in usage minus the extant bug that unit ids are reused. even without that the benefits to avoiding the duplicate doc data and manual parse on every _id seem like clear wins for subdoc _ids.

Just to be really sure, I added a test that exercises the case of one of the _id fields changing. See TestAttemptedIdUpdate in the (just updated) gist. MongoDB stops us from doing anything stupid (as expected).

--
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: https://lists.ubuntu.com/mailman/listinfo/juju-dev
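For readers following along, the "uuid:id" string scheme that this message settles on can be sketched roughly as below. This is only an illustration under stated assumptions: the helper names (docID, splitDocID, consistent, newServiceDoc) are hypothetical and not taken from the Juju code base, the bson tags are left out because the sketch never talks to MongoDB, and the UUID is made up. It also shows the "out of sync" risk raised later in the thread: the separate fields have to be kept in step with the _id by hand.

```go
package main

import (
	"fmt"
	"strings"
)

// serviceDoc mirrors the document shape described in the thread.
// Tags are omitted: this sketch does not store anything in MongoDB.
type serviceDoc struct {
	DocID   string // "<env uuid>:<name>", stored as _id
	Name    string // local identifier, e.g. "wordpress"
	EnvUUID string // environment UUID
}

// docID builds the prefixed identifier (hypothetical helper).
func docID(envUUID, name string) string {
	return envUUID + ":" + name
}

// splitDocID reverses docID. It splits on the first colon only, which is
// safe as long as the UUID itself never contains a colon.
func splitDocID(id string) (envUUID, name string, ok bool) {
	i := strings.Index(id, ":")
	if i < 0 {
		return "", "", false
	}
	return id[:i], id[i+1:], true
}

// newServiceDoc fills all three fields from the same inputs so that they
// start out consistent.
func newServiceDoc(envUUID, name string) serviceDoc {
	return serviceDoc{DocID: docID(envUUID, name), Name: name, EnvUUID: envUUID}
}

// consistent reports whether the separate fields still match the _id.
func (d serviceDoc) consistent() bool {
	return d.DocID == docID(d.EnvUUID, d.Name)
}

func main() {
	doc := newServiceDoc("deadbeef-0000-4000-8000-000000000000", "wordpress")
	fmt.Println(doc.DocID)
	fmt.Println(doc.consistent()) // true

	// Nothing in the type system stops the fields drifting apart.
	doc.Name = "mysql"
	fmt.Println(doc.consistent()) // false
}
```

Funnelling every construction through a single helper such as newServiceDoc is one way to keep the three fields aligned; whether the real code does this is not stated in the thread.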
Re: Using subdocument _id fields for multi-environment support
On 1 October 2014 11:25, Menno Smits menno.sm...@canonical.com wrote:

MongoDB allows the _id field to be a subdocument so Tim asked me to experiment with this to see if it might be a cleaner way to approach the multi-environment conversion before we update any more collections.

The code for these experiments can be found here: https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the output as a comment on the gist).

What I've found suggests that using a subdocument for the _id is a better way forward. This approach means that each field value is only stored once so there's no chance of the document key being out of sync with other fields and there's no unnecessary redundancy in the amount of data being stored. The fields in the _id subdocument are easy to access individually and can be queried separately if required. It is also possible to create indexes on specific fields in the _id subdocument if necessary for performance reasons.

Using a subdocument for the _id is taught and recommended in the MongoDB courseware. In particular, the index is more useful to the query planner. If the fields are separate, then mongodb will end up querying by unit name and then filtering the results by environment (but that won't matter much in this case).

--
Stuart Bishop stuart.bis...@canonical.com
Re: Using subdocument _id fields for multi-environment support
On 1 October 2014 19:31, Kapil Thangavelu kapil.thangav...@canonical.com wrote:

every _id seem like clear wins for subdoc _ids. Although i'm curious what effect this data struct has on mongo resource reqs at scale vs the compound string, as mongo tries to keep _id sets in mem, when it doesn't fit in mem, perf becomes unpredictable (aka bad) as there's two io per doc fetch (id, and doc) and extra io on insert to verify uniqueness.

I think it is the index that needs to be kept in RAM, rather than the actual _id, so it will be a win here. Instead of having 3 indexes to keep in RAM to stop performance sucking (_id, unit, environment), we now just have a single fatter one.

--
Stuart Bishop stuart.bis...@canonical.com
Re: Using subdocument _id fields for multi-environment support
On 1 October 2014 13:31, Kapil Thangavelu kapil.thangav...@canonical.com wrote:

it feels a little strange to use a mutable object for an immutable field.

The field is neither more nor less mutable than in the original approach. Strings are mutable too.

FWIW it would be entirely possible, if deemed desirable, to represent the id as a struct in Go but encode it as a string in mongo, by implementing the bson.Setter and bson.Getter interfaces on the type. It sounds like that's not necessary though.
Re: Using subdocument _id fields for multi-environment support
I'm very keen on this. Thanks Menno (and Tim); unless anyone comes up with substantial objections, let's go with this.

Cheers
William
Using subdocument _id fields for multi-environment support
Team Onyx has been busy preparing for multi-environment state server support. One piece of this is updating almost all of Juju's collections to include the environment UUID in document identifiers so that data for multiple environments can co-exist in the same collection even when they otherwise have the same identifier (machine id, service name, unit name etc).

Based on discussions on juju-dev a while back[1] we have started doing this by prepending the environment UUID to the _id field and adding extra fields which provide the environment UUID and old _id value separately for easier querying and handling. So far, services and units have been migrated.

Where previously a service document looked like this:

    type serviceDoc struct {
        Name   string `bson:"_id"`
        Series string
        ...

it now looks like this:

    type serviceDoc struct {
        DocID   string `bson:"_id"`      // env uuid:wordpress/0
        Name    string `bson:"name"`     // wordpress/0
        EnvUUID string `bson:"env-uuid"` // env uuid
        Series  string
        ...

Unit documents have undergone a similar transformation.

This approach works but has a few downsides:

- it's possible for the local id (Name in this case) and EnvUUID fields to become out of sync with the corresponding values that make up the _id. If that ever happens very bad things could occur.
- it somewhat unnecessarily increases the document size, requiring that we effectively store some values twice
- it requires slightly awkward transformations between UUID prefixed and unprefixed IDs throughout the code

MongoDB allows the _id field to be a subdocument so Tim asked me to experiment with this to see if it might be a cleaner way to approach the multi-environment conversion before we update any more collections.

The code for these experiments can be found here: https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the output as a comment on the gist).

What I've found suggests that using a subdocument for the _id is a better way forward.
This approach means that each field value is only stored once so there's no chance of the document key being out of sync with other fields and there's no unnecessary redundancy in the amount of data being stored. The fields in the _id subdocument are easy to access individually and can be queried separately if required. It is also possible to create indexes on specific fields in the _id subdocument if necessary for performance reasons.

Using this approach, a service document would end up looking something like this:

    type serviceDoc struct {
        ID     serviceId `bson:"_id"`
        Series string
        ...
    }

    type serviceId struct {
        EnvUUID string `bson:"env-uuid"`
        Name    string
    }

There was some concern in the original email thread about whether subdocument style _id fields would work with sharding. My research and experiments suggest that there is no issue here. There are a few types of indexes that can't be used with sharding, primarily multikey indexes, but I can't see us using these for _id values. A multikey index is used by MongoDB when a field used as part of an index is an array - it's highly unlikely that we're going to use arrays in _id fields.

Hashed indexes are a good basis for well-balanced shards according to the MongoDB docs so I wanted to be sure that it's OK to create a hashed index for subdocument style fields. It turns out there's no issue here (see TestHashedIndex in the gist).

Using subdocuments for _id fields is not going to prevent us from using MongoDB's sharding features in the future if we need to.

Apart from having to rework the changes already made to the services and units collections[2], I don't see any downsides to this approach. Can anyone think of something I might be overlooking?

- Menno

[1] - subject was "RFC: mongo _id fields in the multi-environment juju server world"
[2] - this work will have to be done before 1.21 has a stable release because the units and services changes have already landed.