Re: Using subdocument _id fields for multi-environment support

2014-10-10 Thread Menno Smits
TL;DR: Using subdocument ids has been surprisingly difficult. As per what
has already been done
for the units and services collections, we will continue with the approach
of using uuid:id style string ids while also adding separate env-UUID
and collection specific identifier fields.

Jesse and I have been making the changes to use subdocument ids for the
units and services collections this week at the sprint and we've come up
against some unexpected issues.

We found that one part of the mgo/txn package wasn't happy using struct ids
and have been working with Gustavo to fix that. This isn't a show-stopper
but has slowed us down.

We also found unexpected friction with the implementation of the watchers
and entity life. These areas deeply assume that our document ids are
strings and fixing them requires wide-ranging and often ugly changes which
will take significant time to get right. It's been brick wall after brick
wall. We discussed with Tim, Will, John and Ian yesterday and given that
it's important that multi-environment support lands soon and given that the
watchers are going to completely change in the not too distant future[1],
we have abandoned the approach of using subdocument ids for
multi-environment support. The benefits of using subdocument ids are
outweighed by the chan



- Menno

[1] opening up the possibility of surrogate keys as document ids, where we
need application domain fields to exist as fields outside of the _id.


On 1 October 2014 22:11, Menno Smits menno.sm...@canonical.com wrote:



 On 2 October 2014 01:31, Kapil Thangavelu kapil.thangav...@canonical.com
 wrote:

 it feels a little strange to use a mutable object for an immutable field.
 that said it does seem functional. although the immutability speaks to the
 first disadvantage noted for the separate fields namely becoming out of
 sync, which afaics isn't something that's possible with the current model,
 ie. a change of name needs to generate a new doc. Names (previous _id) are
 unique in usage minus the extant bug that unit ids are reused. even without
 that the benefits to avoiding the duplicate doc data and manual parse on
 every _id seem like clear wins for subdoc _ids.


 Just to be really sure, I added a test that exercises the case of one of
 the _id fields changing. See TestAttemptedIdUpdate in the (just updated)
 gist. MongoDB stops us from doing anything stupid (as expected).


-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: Using subdocument _id fields for multi-environment support

2014-10-01 Thread Stuart Bishop
On 1 October 2014 11:25, Menno Smits menno.sm...@canonical.com wrote:

 MongoDB allows the _id field to be a subdocument so Tim asked me to
 experiment with this to see if it might be a cleaner way to approach the
 multi-environment conversion before we update any more collections. The code
 for these experiments can be found here:
 https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the output
 as a comment on the gist).

 What I've found suggests that using a subdocument for the _id is a better
 way forward. This approach means that each field value is only stored once
 so there's no chance of the document key being out of sync with other fields
 and there's no unnecessary redundancy in the amount of data being stored.
 The fields in the _id subdocument are easy to access individually and can be
 queried separately if required. It is also possible to create indexes on
 specific fields in the _id subdocument if necessary for performance reasons.

Using a subdocument for the _id is taught and recommended in the
MongoDB courseware. In particular, the index is more useful to the
query planner. If the fields are separate, then mongodb will end up
querying by unit name and then filtering the results by environment
(but that won't matter much in this case).

-- 
Stuart Bishop stuart.bis...@canonical.com



Re: Using subdocument _id fields for multi-environment support

2014-10-01 Thread Stuart Bishop
On 1 October 2014 19:31, Kapil Thangavelu
kapil.thangav...@canonical.com wrote:

 every _id seem like clear wins for subdoc _ids. Although i'm curious what
 effect this data struct has on mongo resource reqs at scale vs the compound
 string, as mongo tries to keep _id sets in mem, when it doesn't fit in mem,
 perf becomes unpredictable (aka bad) as there's two io per doc fetch (id,
 and doc) and extra io on insert to verify uniqueness.

I think it is the index that needs to be kept in RAM, rather than the
actual _id, so it will be a win here. Instead of having 3 indexes to
keep in RAM to stop performance sucking (_id, unit, environment), we
now just have a single fatter one.

-- 
Stuart Bishop stuart.bis...@canonical.com



Re: Using subdocument _id fields for multi-environment support

2014-10-01 Thread roger peppe
On 1 October 2014 13:31, Kapil Thangavelu
kapil.thangav...@canonical.com wrote:
 it feels a little strange to use a mutable object for an immutable field.

The field is neither more nor less mutable than the original approach.
Strings are mutable too.

FWIW it would be entirely possible, if deemed desirable, to represent
the id as a struct in Go but encode it as a string in mongo,
by implementing bson.Setter and bson.Getter interfaces on the type.
It sounds like that's not necessary though.



Re: Using subdocument _id fields for multi-environment support

2014-10-01 Thread William Reade
I'm very keen on this. Thanks Menno (and Tim); unless anyone comes up
with substantial objections, let's go with this.

Cheers
William

On Wed, Oct 1, 2014 at 6:25 AM, Menno Smits menno.sm...@canonical.com wrote:
 Team Onyx has been busy preparing for multi-environment state server
 support. One piece of this is updating almost all of Juju's collections to
 include the environment UUID in document identifiers so that data for
 multiple environments can co-exist in the same collection even when they
 otherwise have the same identifier (machine id, service name, unit name etc).

 Based on discussions on juju-dev a while back[1] we have started doing
 this by prepending the environment UUID to the _id field and adding extra
 fields which provide the environment UUID and old _id value separately for
 easier querying and handling.

 So far, services and units have been migrated. Where previously a service
 document looked like this:

 type serviceDoc struct {
     Name   string `bson:"_id"`
     Series string
     ...
 }

 it now looks like this:

 type serviceDoc struct {
     DocID   string `bson:"_id"`      // env uuid:wordpress/0
     Name    string `bson:"name"`     // wordpress/0
     EnvUUID string `bson:"env-uuid"` // env uuid
     Series  string
     ...
 }

 Unit documents have undergone a similar transformation.

 This approach works but has a few downsides:

    - it's possible for the local id (Name in this case) and EnvUUID fields
    to become out of sync with the corresponding values that make up the
    _id. If that ever happens very bad things could occur.
    - it somewhat unnecessarily increases the document size, requiring that
    we effectively store some values twice
    - it requires slightly awkward transformations between UUID prefixed and
    unprefixed IDs throughout the code

 MongoDB allows the _id field to be a subdocument so Tim asked me to
 experiment with this to see if it might be a cleaner way to approach the
 multi-environment conversion before we update any more collections. The code
 for these experiments can be found here:
 https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the output
 as a comment on the gist).

 What I've found suggests that using a subdocument for the _id is a better
 way forward. This approach means that each field value is only stored once
 so there's no chance of the document key being out of sync with other fields
 and there's no unnecessary redundancy in the amount of data being stored.
 The fields in the _id subdocument are easy to access individually and can be
 queried separately if required. It is also possible to create indexes on
 specific fields in the _id subdocument if necessary for performance reasons.

 Using this approach, a service document would end up looking something like
 this:

 type serviceDoc struct {
     ID     serviceId `bson:"_id"`
     Series string
     ...
 }

 type serviceId struct {
     EnvUUID string `bson:"env-uuid"`
     Name    string
 }

 There was some concern in the original email thread about whether
 subdocument style _id fields would work with sharding. My research and
 experiments suggest that there is no issue here. There are a few types of
 indexes that can't be used with sharding, primarily multikey indexes, but
 I can't see us using these for _id values. A multikey index is used by
 MongoDB when a field used as part of an index is an array - it's highly
 unlikely that we're going to use arrays in _id fields.

 Hashed indexes are a good basis for well-balanced shards according to the
 MongoDB docs so I wanted to be sure that it's OK to create a hashed index
 for subdocument style fields. It turns out there's no issue here (see
 TestHashedIndex in the gist).

 Using subdocuments for _id fields is not going to prevent us from using
 MongoDB's sharding features in the future if we need to.

 Apart from having to rework the changes already made to the services and
 units collections[2], I don't see any downsides to this approach. Can anyone
 think of something I might be overlooking?

 - Menno


 [1] - subject was RFC: mongo _id fields in the multi-environment juju
 server world

 [2] - this work will have to be done before 1.21 has a stable release
 because the units and services changes have already landed.





Using subdocument _id fields for multi-environment support

2014-09-30 Thread Menno Smits
Team Onyx has been busy preparing for multi-environment state server
support. One piece of this is updating almost all of Juju's collections to
include the environment UUID in document identifiers so that data for
multiple environments can co-exist in the same collection even when they
otherwise have the same identifier (machine id, service name, unit name etc).

Based on discussions on juju-dev a while back[1] we have started doing
this by prepending the environment UUID to the _id field and adding extra
fields which provide the environment UUID and old _id value separately for
easier querying and handling.

So far, services and units have been migrated. Where previously a service
document looked like this:

type serviceDoc struct {
    Name   string `bson:"_id"`
    Series string
    ...
}

it now looks like this:

type serviceDoc struct {
    DocID   string `bson:"_id"`      // env uuid:wordpress/0
    Name    string `bson:"name"`     // wordpress/0
    EnvUUID string `bson:"env-uuid"` // env uuid
    Series  string
    ...
}

Unit documents have undergone a similar transformation.

This approach works but has a few downsides:

   - it's possible for the local id (Name in this case) and EnvUUID
   fields to become out of sync with the corresponding values that make up the
   _id. If that ever happens very bad things could occur.
   - it somewhat unnecessarily increases the document size, requiring that
   we effectively store some values twice
   - it requires slightly awkward transformations between UUID prefixed and
   unprefixed IDs throughout the code
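
To make the last point concrete, here is a minimal sketch of the kind of
helper pair the prefixed-string approach tends to need. The names docID and
splitDocID and the UUID value are hypothetical and not taken from the Juju
code base. The sketch assumes the environment UUID never contains a colon.

```go
package main

import (
	"fmt"
	"strings"
)

// docID builds a "<env uuid>:<local id>" style document id.
func docID(envUUID, localID string) string {
	return envUUID + ":" + localID
}

// splitDocID reverses docID. It splits on the first colon only, so a
// local id that itself contains a colon is returned intact.
func splitDocID(id string) (envUUID, localID string, ok bool) {
	i := strings.Index(id, ":")
	if i < 0 {
		return "", "", false
	}
	return id[:i], id[i+1:], true
}

func main() {
	id := docID("6983ac70-b0aa-45c5-80fe-9f207bbb18d9", "wordpress/0")
	fmt.Println(id)

	envUUID, localID, ok := splitDocID(id)
	fmt.Println(envUUID, localID, ok)
}
```

Every place that reads or writes such a document has to remember to call
one of these, which is where the awkwardness comes from.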

MongoDB allows the _id field to be a subdocument so Tim asked me to
experiment with this to see if it might be a cleaner way to approach the
multi-environment conversion before we update any more collections. The
code for these experiments can be found here:
https://gist.github.com/mjs/2959bb3e90a8d4e7db50 (I've included the output
as a comment on the gist).

What I've found suggests that using a subdocument for the _id is a better
way forward. This approach means that each field value is only stored once
so there's no chance of the document key being out of sync with other
fields and there's no unnecessary redundancy in the amount of data being
stored. The fields in the _id subdocument are easy to access individually
and can be queried separately if required. It is also possible to create
indexes on specific fields in the _id subdocument if necessary for
performance reasons.

Using this approach, a service document would end up looking something like
this:

type serviceDoc struct {
    ID     serviceId `bson:"_id"`
    Series string
    ...
}

type serviceId struct {
    EnvUUID string `bson:"env-uuid"`
    Name    string
}
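
As a rough illustration of why the composite id avoids both the duplication
and the parsing, here is a small sketch that uses a Go map as a hypothetical
in-memory stand-in for a collection with a unique _id. It does not talk to
MongoDB, and the collection type, the insert method and the "env-a" and
"env-b" values are assumptions made for the example.

```go
package main

import "fmt"

// serviceId mirrors the subdocument _id shown above.
type serviceId struct {
	EnvUUID string
	Name    string
}

type serviceDoc struct {
	ID     serviceId
	Series string
}

// collection is an in-memory stand-in for a collection keyed by _id.
type collection map[serviceId]serviceDoc

// insert adds doc, refusing duplicates the way a unique _id would.
func (c collection) insert(doc serviceDoc) error {
	if _, exists := c[doc.ID]; exists {
		return fmt.Errorf("duplicate id %+v", doc.ID)
	}
	c[doc.ID] = doc
	return nil
}

func main() {
	c := collection{}
	// The same service name in two environments does not collide.
	fmt.Println(c.insert(serviceDoc{serviceId{"env-a", "wordpress"}, "trusty"}))
	fmt.Println(c.insert(serviceDoc{serviceId{"env-b", "wordpress"}, "precise"}))
	// Re-inserting the same (env, name) pair is rejected.
	fmt.Println(c.insert(serviceDoc{serviceId{"env-a", "wordpress"}, "trusty"}) != nil)
	// Each part of the id is read directly, with no string parsing.
	for id := range c {
		fmt.Println(id.EnvUUID, id.Name)
	}
}
```

Each value is stored once, inside the id, so there is nothing to keep in
sync.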

There was some concern in the original email thread about whether
subdocument style _id fields would work with sharding. My research and
experiments suggest that there is no issue here. There are a few types of
indexes that can't be used with sharding, primarily multikey indexes, but
I can't see us using these for _id values. A multikey index is used by
MongoDB when a field used as part of an index is an array - it's highly
unlikely that we're going to use arrays in _id fields.

Hashed indexes are a good basis for well-balanced shards according to the
MongoDB docs so I wanted to be sure that it's OK to create a hashed index
for subdocument style fields. It turns out there's no issue here (see
TestHashedIndex in the gist).

Using subdocuments for _id fields is not going to prevent us from using
MongoDB's sharding features in the future if we need to.

Apart from having to rework the changes already made to the services and
units collections[2], I don't see any downsides to this approach. Can
anyone think of something I might be overlooking?

- Menno


[1] - subject was RFC: mongo _id fields in the multi-environment juju
server world

[2] - this work will have to be done before 1.21 has a stable release
because the units and services changes have already landed.