Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread roger peppe
On 4 July 2014 15:17, Gustavo Niemeyer gust...@niemeyer.net wrote:
 On Fri, Jul 4, 2014 at 10:32 AM, roger peppe roger.pe...@canonical.com 
 wrote:
 It won't be possible to shard the transaction log.

 Why not?

I had assumed that because every client needs to see every transaction
there would likely be no benefit to sharding the log, although
technically you could shard on transaction id. I'd be
delighted to be shown that my assumption is wrong though.
Perhaps the round-robin might really help.

 The thing I'm trying to get across is: until we know one way or
 another, I believe it would be better to choose the (much) simpler
 option and use the (potential weeks of) dev time for other things.

 We know it's a bad idea. Besides everything else I mentioned, there
 are _huge_ MongoDB databases out there that depend on sharding
 to scale; we're talking hundreds of machines. It seems very naive to
 go with a model that loses the benefits of all the lessons the MongoDB
 development team learned with those use cases, and the work they have
 done to support them well.

 We have been there in Canonical. Ask folks about the CouchDB story.

Thanks for pointing this out. If we manage to hugely scale juju using mongodb
I will be very happy. I still think we should do some measurements to
convince us that we actually have some hope of doing so though.
My own measurements left me less than convinced of the
possibility, although it's been a while since I did them.

  cheers,
rog.

-- 
Juju-dev mailing list
Juju-dev@lists.ubuntu.com
Modify settings or unsubscribe at: 
https://lists.ubuntu.com/mailman/listinfo/juju-dev


Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread Gustavo Niemeyer
On Mon, Jul 7, 2014 at 10:09 AM, roger peppe roger.pe...@canonical.com wrote:
 I had assumed that because every client needs to see every transaction
 there would likely be no benefit to sharding the log, although
 technically you could shard on transaction id. I'd be

Clients don't need to see every transaction. Only those that affect
the documents they are acting on.

 Thanks for pointing this out. If we manage to hugely scale juju using mongodb
 I will be very happy. I still think we should do some measurements to
 convince us that we actually have some hope of doing so though.
 My own measurements left me less than convinced of the
 possibility, although it's been a while since I did them.

When you measured a sharded setup, what was the outcome?


gustavo @ http://niemeyer.net



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread roger peppe
On 7 July 2014 14:27, Gustavo Niemeyer gust...@niemeyer.net wrote:
 On Mon, Jul 7, 2014 at 10:09 AM, roger peppe roger.pe...@canonical.com 
 wrote:
 I had assumed that because every client needs to see every transaction
 there would likely be no benefit to sharding the log, although
 technically you could shard on transaction id. I'd be

 Clients don't need to see every transaction. Only those that affect
 the documents they are acting on.

Is it actually possible to shard the transaction log based on the documents
the transactions are acting on? I couldn't see a straightforward
way to do it, with the existing transaction log structure at least.

 Thanks for pointing this out. If we manage to hugely scale juju using mongodb
 I will be very happy. I still think we should do some measurements to
 convince us that we actually have some hope of doing so though.
 My own measurements left me less than convinced of the
 possibility, although it's been a while since I did them.

 When you measured a sharded setup, what was the outcome?

I simply measured operation rate (of some actual juju operations)
on a non-sharded setup. I saw around 60 operations per second.
It may well have been that I was testing an inefficient setup, or
that my mongo settings were inadequate.



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread roger peppe
On 7 July 2014 16:59, Gustavo Niemeyer gust...@niemeyer.net wrote:
 On Mon, Jul 7, 2014 at 12:26 PM, roger peppe roger.pe...@canonical.com 
 wrote:
 On 7 July 2014 14:27, Gustavo Niemeyer gust...@niemeyer.net wrote:
 On Mon, Jul 7, 2014 at 10:09 AM, roger peppe roger.pe...@canonical.com 
 wrote:
 I had assumed that because every client needs to see every transaction
 there would likely be no benefit to sharding the log, although
 technically you could shard on transaction id. I'd be

 Clients don't need to see every transaction. Only those that affect
 the documents they are acting on.

 Is it actually possible to shard the transaction log based on the documents
 the transactions are acting on?

 That's unrelated to what you said above, or to my response.

 Either way, we can shard transaction documents, and we can add a shard
 key to them if necessary.

The latter might turn out to be quite awkward, though there's
probably a nice solution I don't see.

Suppose we've got three environments, A, B and C.

We have transactions that span {A, B}, {B, C} and {C, A}.

How can we choose a consistent shard key for all those
transactions?
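
To make the question above concrete, here is a minimal in-memory sketch (Python, no MongoDB involved). It assumes, hypothetically, that a transaction document's shard key would be an environment id and that each environment lives on exactly one shard; the shard names and helper functions are made up for illustration. It only illustrates the concern and says nothing about whether it matters for the txn package.

```python
from itertools import combinations, product

ENVS = ("A", "B", "C")
# The three cross-environment transactions from the message: {A,B}, {A,C}, {B,C}.
TRANSACTIONS = [frozenset(pair) for pair in combinations(ENVS, 2)]


def shards_touched(txn, placement):
    """Return the set of shards holding the environments a transaction spans."""
    return {placement[env] for env in txn}


def spanning_count(placement):
    """Count transactions whose environments live on more than one shard."""
    return sum(1 for txn in TRANSACTIONS if len(shards_touched(txn, placement)) > 1)


def all_placements(shards):
    """Yield every assignment of the three environments to the given shards."""
    for combo in product(shards, repeat=len(ENVS)):
        yield dict(zip(ENVS, combo))


if __name__ == "__main__":
    # Hypothetical two-shard cluster: list how many transactions span shards.
    for placement in all_placements(("s0", "s1")):
        print(placement, "->", spanning_count(placement), "spanning transactions")
```

Under these assumptions, unless all three environments share one shard, at least two of the three transactions touch more than one shard, so no single environment id describes where such a transaction belongs.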

 Thanks for pointing this out. If we manage to hugely scale juju using 
 mongodb
 I will be very happy. I still think we should do some measurements to
 convince us that we actually have some hope of doing so though.
 My own measurements left me less than convinced of the
 possibility, although it's been a while since I did them.

 When you measured a sharded setup, what was the outcome?

 I simply measured operation rate (of some actual juju operations)
 on a non-sharded setup.

 Okay, so the measurements that left you unconvinced that sharding
 might help to scale up were not using sharding.

If we struggle to meet the requirements for a single environment,
we're unlikely to meet them when we're running several environments
per shard, which is surely necessary if we're to scale up.

 I saw around 60 operations per second.
 It may well have been that I was testing an inefficient setup, or
 that my mongo settings were inadequate.

 I cannot really comment on that. What I can say is:

 1. The txn package can run transactions on the order of a few hundred
 per second in my measurements on MongoDB 2.2

 2. Sharding allows sending load to independent replica sets

 3. MongoDB performance is improving release over release, and there's
 more coming (http://goo.gl/qPE9LB)

 4. Nothing will work without effort.

I hope it can work for us.

I really do.

I just worry that without actually doing some measurement in advance,
we may spend a lot of time working on this stuff and find that it was all for
nought because we're fundamentally bottlenecked somewhere
we didn't anticipate.



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread Gustavo Niemeyer
On Mon, Jul 7, 2014 at 2:03 PM, roger peppe roger.pe...@canonical.com wrote:
 The latter might turn out to be quite awkward, though there's
 probably a nice solution I don't see.

 Suppose we've got three environments, A, B and C.

 We have transactions that span {A, B}, {B, C} and {C, A}.

 How can we choose a consistent shard key for all those
 transactions?

What is a consistent shard key and why does it matter?

 Okay, so the measurements that left you unconvinced that sharding
 might help to scale up were not using sharding.

 If we struggle to meet the requirements for a single environment,
 we're unlikely to meet them when we're running several environments
 per shard, which is surely necessary if we're to scale up.

That's unsound reasoning for the context. It implies that to be able
to meet a load demand with many serving machines we must be able to
meet the load demand with a single serving machine. Not true.

 I hope it can work for us.

 I really do.

I do as well.

 I just worry that without actually doing some measurement in advance,
 we may spend a lot of time working on this stuff and find that it was all for
 nought because we're fundamentally bottlenecked somewhere
 we didn't anticipate.

By all means, please do measure and collect as much data as necessary
to have a good design. We won't see any performance improvements
without a reasonable understanding of how the system works and
performs.


gustavo @ http://niemeyer.net



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-07 Thread Ian Booth


On 04/07/14 23:56, Gustavo Niemeyer wrote:
 On Thu, Jul 3, 2014 at 10:01 PM, Tim Penhey tim.pen...@canonical.com wrote:
 As far as I know (and I may be wrong), if you are adding a document to
 the mongo collection, and you do not specify an _id field, mongo will
 create a unique value for you.
 
 That's right in most cases, and a requirement for replication.
 
 1. change the _id field to be a composed field where it is the
 concatenation of the environment id and the existing id or name field.
 If we do take this approach, I strongly recommend having the fields that
 make up the key be available by themselves elsewhere in the document
 structure.
 
 I'd go with this, including your suggestion of splitting the data
 apart in proper fields. Sounds straightforward and comfortable to deal
 with.


I'm late to this discussion, but I am generally -1 on assigning any business
meaning to document ids - too many bad experiences in the past where this has
severely impacted the ability to change implementation detail for performance or
other reasons. Such ids should be opaque and considered a database
implementation detail, often referred to as surrogate keys. Natural keys should
be implemented as separate fields, backed by unique indices if required. If a
composite key is required as an additional field, that's fine, as is a field
added to facilitate sharding.

Unfortunately, I think the option of using a surrogate key may be infeasible
given how the mongo transaction assertions are implemented. The DocExists
assertion is done based on document id, which forces the use of natural keys for
id. I wonder if there's a way around this?

So given the above, I don't think there's a choice and option 1 seems to be the
best option as others have said. But I wish we had the choice to use surrogate 
ids.
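
The trade-off described above can be sketched with a toy in-memory collection (Python; this is not mgo or MongoDB code, and the class and method names are hypothetical). It assumes an opaque, generated _id plus a unique index over (env_uuid, name), and shows that an existence check keyed on _id then needs a lookup by natural key first.

```python
import uuid


class Collection:
    """Toy stand-in for a collection with an opaque _id and a unique index."""

    def __init__(self):
        self._docs = {}    # _id -> document
        self._by_key = {}  # (env_uuid, name) -> _id, emulating a unique index

    def insert(self, env_uuid, name, **fields):
        key = (env_uuid, name)
        if key in self._by_key:
            raise KeyError(f"duplicate natural key: {key!r}")
        doc_id = uuid.uuid4().hex  # surrogate key: carries no business meaning
        self._docs[doc_id] = {"_id": doc_id, "env_uuid": env_uuid, "name": name, **fields}
        self._by_key[key] = doc_id
        return doc_id

    def exists(self, doc_id):
        """Existence check by _id, the kind of check the message says assertions use."""
        return doc_id in self._docs

    def lookup_id(self, env_uuid, name):
        """Extra step needed before an _id-based check when ids are opaque."""
        return self._by_key.get((env_uuid, name))
```

With natural keys in _id the caller can build the id directly from what it already knows; with surrogate ids it has to go through something like lookup_id first, which is the difficulty the message points at.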





Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread roger peppe
On 4 July 2014 02:01, Tim Penhey tim.pen...@canonical.com wrote:
 Hi folks,

 Very shortly we are going to start on the work to be able to store
 multiple environments within a single mongo database.

 Most of our current entities are stored in the database with their name
 or id fields serialized to bson as the _id field.

 As far as I know (and I may be wrong), if you are adding a document to
 the mongo collection, and you do not specify an _id field, mongo will
 create a unique value for you.

 In our new world, things that used to be unique, like machines,
 services, units etc, are now only unique when paired with the
 environment id.

 It seems we have a number of options here.

 1. change the _id field to be a composed field where it is the
 concatenation of the environment id and the existing id or name field.
 If we do take this approach, I strongly recommend having the fields that
 make up the key be available by themselves elsewhere in the document
 structure.

 2. let mongo create the _id field, and we ensure uniqueness over the
 pair of values with a unique index. One thing I am unsure about with
 this approach is how we currently do our insertion checks, where we do a
 document does not exist check.  We wouldn't be able to do this as a
 transaction assertion as it can only check for _id values.  How fast are
 the indices updated?  Can having a unique index for a document work for
 us?  I'm hoping it can if this is the way to go.

 3. use a composite _id field such that the document may start like this:
   { _id: { env_uuid: blah, name: foo}, ...
 This gives the benefit of existence checks, and real names for the _id
 parts.

 Thoughts? Opinions? Recommendations?
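
For comparison, the three layouts can be written out as plain dictionaries. This is only a sketch in Python: the field names env_uuid and name come from option 3 above, while the ":" separator, the helper names and the sample values are assumptions made for illustration.

```python
def option1_doc(env_uuid, name, sep=":"):
    """Option 1: _id is the concatenation; the parts are also stored as fields."""
    return {"_id": f"{env_uuid}{sep}{name}", "env_uuid": env_uuid, "name": name}


def option2_doc(env_uuid, name, generated_id):
    """Option 2: _id is generated; uniqueness of (env_uuid, name) would have to
    come from a unique index, which this sketch does not implement."""
    return {"_id": generated_id, "env_uuid": env_uuid, "name": name}


def option3_doc(env_uuid, name):
    """Option 3: _id is itself a sub-document with named parts."""
    return {"_id": {"env_uuid": env_uuid, "name": name}}


if __name__ == "__main__":
    # Placeholder values, not real identifiers.
    for doc in (
        option1_doc("env-1234", "mysql/0"),
        option2_doc("env-1234", "mysql/0", "generated-id"),
        option3_doc("env-1234", "mysql/0"),
    ):
        print(doc)
```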

There is another possibility: we could just use a different collection
name prefix for each environment. There is no hard limit on the number
of collections in mongo (see http://docs.mongodb.org/manual/reference/limits/).

That is, instead of using the current hard-coded collection names
(machines, relations, etc) we'd prefix them with the environment id;
either the UUID or an id stored elsewhere.
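
A minimal sketch of that naming scheme, in Python for brevity. Only "machines" and "relations" are taken from the message; the "." separator and the function names are assumptions.

```python
BASE_COLLECTIONS = ("machines", "relations")  # the two names given in the message


def collection_name(env_id, base, sep="."):
    """Build a per-environment collection name, e.g. 'env-a.machines'."""
    if not env_id:
        raise ValueError("env_id must be non-empty")
    return f"{env_id}{sep}{base}"


def collections_for(env_id):
    """Map each hard-coded base name to its prefixed name for one environment."""
    return {base: collection_name(env_id, base) for base in BASE_COLLECTIONS}
```

Code that currently asks for "machines" would ask for collections_for(env_id)["machines"] instead, which is why the change to existing code would be small.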

This would entail very few changes to the existing code.

If we think that most operations on an environment will continue to
be specific to that environment, I think this has a few advantages.
Specifically, it minimises cross-talk between environments - one
large environment with heavy traffic will not unduly influence the others.

- for a small environment, table indexes remain small and lookups fast
even though the total number of entries might be huge.

- each environment could have a separate mongo txn log, so one busy
environment that's constantly adding transactions will not necessarily
slow down all the others. There is, in general, no need for sequential
consistency between environments.

- database isolation between environments is an advantage when things
go wrong - it's easier to fix or delete individual environments if their
tables are isolated from one another.

The disadvantage is that you can't perform transactions that span multiple
environments. I think that's something we probably would not want to
do much anyway, but YMMV.

I suggest that, at the least, taking this approach would be a quick
road to making the state work with multiple environments. It
would not preclude a move to changing to use composite keys
in the future.

  cheers,
rog.



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread William Reade
My expectation is that:

1) We certainly need the environment UUID as a separate field for the shard
key.
2) We *also* need the environment UUID as an _id prefix to keep our
watchers sane.
2a) If we had separate collections per environment, we wouldn't; but AIUI,
scaling mongo by adding collections tends to end badly (I don't have direct
experience here myself; but it does indeed seem that we'd start consuming
namespaces at a pretty terrifying rate, and I'm inclined to trust those who
have done this and failed.)
2b) I'd ordinarily dislike the duplication across the _id and uuid fields,
but there's a clear reason for doing so here, so I'm not going to complain.
I *will* continue to complain about documents that duplicate info across
fields in order to save a few runtime microseconds here and there ;).

If someone with direct experience can chip in reassuringly I *might* be
prepared to back off on the N-collections-per-environment thing, but I'm
certainly not willing to take it so far as to separate the txn logs and
thus discard consistency across environments: I think there will certainly
be references between individual hosted environments and the initial
environment.

So, in short, I think Tim's (1) is the way to go. But *please* don't
duplicate data that doesn't have to be -- the UUID is fine, the name is
not. If we really end up spending a lot of time extracting names from _id
fields we can cache them in the state documents -- but we don't need
redundant copies in the DB, and we *really* don't need to make our lives
harder by giving our data unnecessary opportunities for inconsistency.

Cheers
William



On Fri, Jul 4, 2014 at 6:42 AM, John Meinel j...@arbash-meinel.com wrote:

 According to the mongo docs:
 http://docs.mongodb.org/manual/core/document/#record-documents
 The field name _id is reserved for use as a primary key; its value must
 be unique in the collection, is immutable, and may be of any type other
 than an array.

 That makes it sound like we *could* use an object for the _id field and do
 _id = {env_uuid:, name:}

 Though I thought the purpose of doing something like that is to allow
 efficient sharding in a multi-environment world.

 Looking here: http://docs.mongodb.org/manual/core/sharding-shard-key/
 The shard key must be indexed (which is just fine for us w/ the primary
 _id field or with any other field on the documents), and The index on the
 shard key *cannot* be a *multikey index
 http://docs.mongodb.org/manual/core/index-multikey/#index-type-multikey.*
 I don't really know what that means in the case of wanting to shard based
 on an object instead of a simple string, but it does sound like it might be
 a problem.
 Anyway, for purposes of being *unique* we may need to put environ uuid in
 there, but for the purposes of sharding we could just put it on another
 field and index that field.

 John
 =:-



 On Fri, Jul 4, 2014 at 5:01 AM, Tim Penhey tim.pen...@canonical.com
 wrote:

 Hi folks,

 Very shortly we are going to start on the work to be able to store
 multiple environments within a single mongo database.

 Most of our current entities are stored in the database with their name
 or id fields serialized to bson as the _id field.

 As far as I know (and I may be wrong), if you are adding a document to
 the mongo collection, and you do not specify an _id field, mongo will
 create a unique value for you.

 In our new world, things that used to be unique, like machines,
 services, units etc, are now only unique when paired with the
 environment id.

 It seems we have a number of options here.

 1. change the _id field to be a composed field where it is the
 concatenation of the environment id and the existing id or name field.
 If we do take this approach, I strongly recommend having the fields that
 make up the key be available by themselves elsewhere in the document
 structure.

 2. let mongo create the _id field, and we ensure uniqueness over the
 pair of values with a unique index. One thing I am unsure about with
 this approach is how we currently do our insertion checks, where we do a
 document does not exist check.  We wouldn't be able to do this as a
 transaction assertion as it can only check for _id values.  How fast are
 the indices updated?  Can having a unique index for a document work for
 us?  I'm hoping it can if this is the way to go.

 3. use a composite _id field such that the document may start like this:
   { _id: { env_uuid: blah, name: foo}, ...
 This gives the benefit of existence checks, and real names for the _id
 parts.

 Thoughts? Opinions? Recommendations?

 BTW, I think that if we can make 3 work, then it is the best approach.

 Tim


Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread John Meinel
I would think that if we have to put environ-uuid into the _id field, then
we wouldn't need yet-another field to shard on (at least if we put it at
the beginning of the field).

John
=:-


On Fri, Jul 4, 2014 at 2:24 PM, William Reade william.re...@canonical.com
wrote:

 My expectation is that:

 1) We certainly need the environment UUID as a separate field for the
 shard key.
 2) We *also* need the environment UUID as an _id prefix to keep our
 watchers sane.
 2a) If we had separate collections per environment, we wouldn't; but AIUI,
 scaling mongo by adding collections tends to end badly (I don't have direct
 experience here myself; but it does indeed seem that we'd start consuming
 namespaces at a pretty terrifying rate, and I'm inclined to trust those who
 have done this and failed.)
 2b) I'd ordinarily dislike the duplication across the _id and uuid fields,
 but there's a clear reason for doing so here, so I'm not going to complain.
 I *will* continue to complain about documents that duplicate info across
 fields in order to save a few runtime microseconds here and there ;).

 If someone with direct experience can chip in reassuringly I *might* be
 prepared to back off on the N-collections-per-environment thing, but I'm
 certainly not willing to take it so far as to separate the txn logs and
 thus discard consistency across environments: I think there will certainly
 be references between individual hosted environments and the initial
 environment.

 So, in short, I think Tim's (1) is the way to go. But *please* don't
 duplicate data that doesn't have to be -- the UUID is fine, the name is
 not. If we really end up spending a lot of time extracting names from _id
 fields we can cache them in the state documents -- but we don't need
 redundant copies in the DB, and we *really* don't need to make our lives
 harder by giving our data unnecessary opportunities for inconsistency.

 Cheers
 William



 On Fri, Jul 4, 2014 at 6:42 AM, John Meinel j...@arbash-meinel.com
 wrote:

 According to the mongo docs:
 http://docs.mongodb.org/manual/core/document/#record-documents
 The field name _id is reserved for use as a primary key; its value must
 be unique in the collection, is immutable, and may be of any type other
 than an array.

 That makes it sound like we *could* use an object for the _id field and
 do _id = {env_uuid:, name:}

 Though I thought the purpose of doing something like that is to allow
 efficient sharding in a multi-environment world.

 Looking here: http://docs.mongodb.org/manual/core/sharding-shard-key/
 The shard key must be indexed (which is just fine for us w/ the primary
 _id field or with any other field on the documents), and The index on the
 shard key *cannot* be a *multikey index
 http://docs.mongodb.org/manual/core/index-multikey/#index-type-multikey.*
 I don't really know what that means in the case of wanting to shard based
 on an object instead of a simple string, but it does sound like it might be
 a problem.
 Anyway, for purposes of being *unique* we may need to put environ uuid in
 there, but for the purposes of sharding we could just put it on another
 field and index that field.

 John
 =:-



 On Fri, Jul 4, 2014 at 5:01 AM, Tim Penhey tim.pen...@canonical.com
 wrote:

 Hi folks,

 Very shortly we are going to start on the work to be able to store
 multiple environments within a single mongo database.

 Most of our current entities are stored in the database with their name
 or id fields serialized to bson as the _id field.

 As far as I know (and I may be wrong), if you are adding a document to
 the mongo collection, and you do not specify an _id field, mongo will
 create a unique value for you.

 In our new world, things that used to be unique, like machines,
 services, units etc, are now only unique when paired with the
 environment id.

 It seems we have a number of options here.

 1. change the _id field to be a composed field where it is the
 concatenation of the environment id and the existing id or name field.
 If we do take this approach, I strongly recommend having the fields that
 make up the key be available by themselves elsewhere in the document
 structure.

 2. let mongo create the _id field, and we ensure uniqueness over the
 pair of values with a unique index. One thing I am unsure about with
 this approach is how we currently do our insertion checks, where we do a
 document does not exist check.  We wouldn't be able to do this as a
 transaction assertion as it can only check for _id values.  How fast are
 the indices updated?  Can having a unique index for a document work for
 us?  I'm hoping it can if this is the way to go.

 3. use a composite _id field such that the document may start like this:
   { _id: { env_uuid: blah, name: foo}, ...
 This gives the benefit of existence checks, and real names for the _id
 parts.

 Thoughts? Opinions? Recommendations?

 BTW, I think that if we can make 3 work, then it is the best approach.

 Tim


Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread roger peppe
On 4 July 2014 11:24, William Reade william.re...@canonical.com wrote:
 My expectation is that:

 1) We certainly need the environment UUID as a separate field for the shard
 key.
 2) We *also* need the environment UUID as an _id prefix to keep our watchers
 sane.
 2a) If we had separate collections per environment, we wouldn't; but AIUI,
 scaling mongo by adding collections tends to end badly (I don't have direct
 experience here myself; but it does indeed seem that we'd start consuming
 namespaces at a pretty terrifying rate, and I'm inclined to trust those who
 have done this and failed.)
 2b) I'd ordinarily dislike the duplication across the _id and uuid fields,
 but there's a clear reason for doing so here, so I'm not going to complain.
 I *will* continue to complain about documents that duplicate info across
 fields in order to save a few runtime microseconds here and there ;).

 If someone with direct experience can chip in reassuringly I *might* be
 prepared to back off on the N-collections-per-environment thing, but I'm
 certainly not willing to take it so far as to separate the txn logs and thus
 discard consistency across environments: I think there will certainly be
 references between individual hosted environments and the initial
 environment.

It can be a great advantage when scaling to be able to partition the
transactions across different parts of the database. If we want this to
be able to scale, I think we *have* to make it work without requiring
transactions across environments. There is no way that we can scale
as far as we'd like to by using a single mongo replica set for all environments.

This talk is about mysql, not mongo, but I believe some of the lessons
are relevant to us. https://www.youtube.com/watch?v=qATTTSg6zXk

By my calculations, with a maximum-sized namespace file, a single
mongo should be able to support over 9 environments
using a separate collection-set for each environment.

From my recollection of juju performance, we will be lucky to scale
a single mongo up to 1000 environments, let alone 9, so I suspect we'd never
get remotely that far. Perhaps there are other disadvantages
from having many collections though.

It would be nice if we could make this crucial architectural decision in
the light of some actual measurements. We may all have some kind
of gut feeling for how this might perform, but without actually measuring,
we just don't know.

As usual, my first reaction is KISS.

  cheers,
rog.



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread John Meinel

 ...

 It can be a great advantage when scaling to be able to partition the
 transactions across different parts of the database. If we want this to
 be able to scale, I think we *have* to make it work without requiring
 transactions across environments. There is no way that we can scale
 as far as we'd like to by using a single mongo replica set for all
 environments.


You generally shard across replica sets, and if you shard by environ uuid
(say by putting it as a prefix on all the _ids) then each of those is a
different write master.
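
As a rough illustration of routing by environment prefix, here is a toy router in Python. It is not how MongoDB assigns chunks to shards; the replica set names, the ":" separator and the hash-modulo scheme are all assumptions chosen to keep the sketch small.

```python
import hashlib

SHARDS = ("rs0", "rs1", "rs2")  # hypothetical replica set names


def env_of(doc_id, sep=":"):
    """Extract the environment prefix from an _id such as 'envA:machine/0'."""
    env, found, _rest = doc_id.partition(sep)
    if not found or not env:
        raise ValueError(f"no environment prefix in {doc_id!r}")
    return env


def shard_for(doc_id):
    """Pick a shard from the environment prefix only, so every document of one
    environment is routed to the same replica set (and the same write master)."""
    digest = hashlib.sha256(env_of(doc_id).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]
```

Because only the prefix feeds the hash, 'envA:machine/0' and 'envA:unit/1' always end up together, while different environments can be spread across replica sets.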

It seems conceptually easier than trying to route to a different collection
set. Certainly sharding will be easier to rebalance (I think) than moving
the collections around.

John
=:-


 This talk is about mysql, not mongo, but I believe some of the lessons
 are relevant to us. https://www.youtube.com/watch?v=qATTTSg6zXk

 By my calculations, with a maximum-sized namespace file, a single
 mongo should be able to support over 9 environments
 using a separate collection-set for each environment.

 From my recollection of juju performance, we will be lucky to scale
 a single mongo up to 1000 environments, let alone 9, so I suspect we'd
 never
 get remotely that far. Perhaps there are other disadvantages
 from having many collections though.

 It would be nice if we could make this crucial architectural decision in
 the light of some actual measurements. We may all have some kind
 of gut feeling for how this might perform, but without actually measuring,
 we just don't know.

 As usual, my first reaction is KISS.

   cheers,
 rog.



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread John Weldon
On Fri, Jul 4, 2014 at 6:56 AM, Gustavo Niemeyer gust...@niemeyer.net
wrote:

  1. change the _id field to be a composed field where it is the
  concatenation of the environment id and the existing id or name field.
  If we do take this approach, I strongly recommend having the fields that
  make up the key be available by themselves elsewhere in the document
  structure.

 I'd go with this, including your suggestion of splitting the data
 apart in proper fields. Sounds straightforward and comfortable to deal
 with.


I'd be interested in trying this approach with Actions. We've gone back
and forth between encoding units *only* in the _id or *also* in the document.
Both have pros and cons, but it seems to me that a composite _id
would address most of the cons of each approach.

I'm also interested in figuring out how the watchers will work in this
approach. The Actions watcher is a StringsWatcher, and its .Changes() are
[]string.

I'm assuming that will have to become a more specialised watcher where
.Changes() returns a list of objects representing the composite key? Also
how the watcher detects relevant events might have to be adjusted somewhat.
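
One way to picture such a watcher change, sketched in Python rather than Go and not based on the actual juju watcher code: raw string ids are parsed into a small composite-key object and filtered by environment. The CompositeKey type, the ":" separator and the function names are hypothetical.

```python
from typing import List, NamedTuple


class CompositeKey(NamedTuple):
    env_uuid: str
    local_id: str


def parse_change(raw: str, sep: str = ":") -> CompositeKey:
    """Split a raw change id like 'env-a:mysql/0' into its two parts."""
    env_uuid, found, local_id = raw.partition(sep)
    if not found or not env_uuid or not local_id:
        raise ValueError(f"not a composite id: {raw!r}")
    return CompositeKey(env_uuid, local_id)


def changes_for_env(raw_changes: List[str], env_uuid: str) -> List[str]:
    """Keep only the changes of one environment and return their local ids."""
    keys = (parse_change(raw) for raw in raw_changes)
    return [key.local_id for key in keys if key.env_uuid == env_uuid]
```

A watcher built this way could keep handing plain strings to existing callers while doing the environment filtering internally.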



--
John Weldon


Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread Gustavo Niemeyer
On Fri, Jul 4, 2014 at 6:01 AM, roger peppe roger.pe...@canonical.com wrote:
 There is another possibility: we could just use a different collection
 name prefix for each environment. There is no hard limit on the number
 of collections in mongo (see 
 http://docs.mongodb.org/manual/reference/limits/).

For sharding and for good space management in general it's better to
have data in a collection that gets automatically managed by the
cluster. It's also much simpler to deal with in general, even if it
does require code changes to get started.

 - for a small environment, table indexes remain small and lookups fast
 even though the total number of entries might be huge.

Same as above: when it gets _huge_ you need sharding either way, and
it's easier and more efficient to manage a single collection than 10k.

 - each environment could have a separate mongo txn log, so one busy
 environment that's constantly adding transactions will not necessarily
 slow down all the others. There is, in general, no need for sequential
 consistency between environments.

With txn there's no sequential consistency even within the same
environment, if you're touching different documents.

 - database isolation between environments is an advantage when things
 go wrong - it's easier to fix or delete individual environments if their
 tables are isolated from one another.

Sure, it prevents bad mistakes caused by not taking the environment id
into consideration, but deleting foo:* is just as easy.
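
As a rough illustration of the "deleting foo:*" point, here is a
standard-library-only sketch that assumes _id values of the form
"<env>:<name>". The functions EnvPrefixPattern and RemoveEnv are
hypothetical, and a Go map stands in for the collection. Against a real
MongoDB the same anchored pattern could presumably be used in a $regex
selector on _id.

```go
package main

import (
	"fmt"
	"regexp"
)

// EnvPrefixPattern returns an anchored regular expression matching every
// _id of the assumed form "<env>:<name>" for one environment.
func EnvPrefixPattern(env string) string {
	return "^" + regexp.QuoteMeta(env+":")
}

// RemoveEnv deletes every document whose _id matches the environment
// prefix and returns how many were removed. The map is a stand-in for a
// collection keyed by _id.
func RemoveEnv(docs map[string]string, env string) int {
	re := regexp.MustCompile(EnvPrefixPattern(env))
	removed := 0
	for id := range docs {
		if re.MatchString(id) {
			delete(docs, id) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	docs := map[string]string{
		"foo:machine-0": "...",
		"foo:mysql/0":   "...",
		"bar:machine-0": "...",
	}
	fmt.Println(RemoveEnv(docs, "foo"), len(docs))
}
```

Including the separator in the pattern matters: without it, removing
environment "foo" would also match ids belonging to "foobar".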

 I suggest that, at the least, taking this approach would be a quick
 road to making the state work with multiple environments. It
 would not preclude a move to changing to use composite keys
 in the future.

We already know it's a bad idea today. Let's please not make that mistake.


gustavo @ http://niemeyer.net



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-04 Thread Gustavo Niemeyer
On Fri, Jul 4, 2014 at 10:32 AM, roger peppe roger.pe...@canonical.com wrote:
 It won't be possible to shard the transaction log.

Why not?

 The thing I'm trying to get across is: until we know one way or
 another, I believe it would be better to choose the (much) simpler
 option and use the (potential weeks of) dev time for other things.

We know it's a bad idea. Besides everything else I mentioned, there
are _huge_ MongoDB databases out there that depend on sharding
to scale... we're talking hundreds of machines. It seems very naive to
go with a model that loses the benefits of all the lessons the MongoDB
development team learned with those use cases, and the work they have
done to support them well.

We have been there in Canonical. Ask folks about the CouchDB story.


gustavo @ http://niemeyer.net



RFC: mongo _id fields in the multi-environment juju server world

2014-07-03 Thread Tim Penhey
Hi folks,

Very shortly we are going to start on the work to be able to store
multiple environments within a single mongo database.

Most of our current entities are stored in the database with their name
or id fields serialized to bson as the _id field.

As far as I know (and I may be wrong), if you are adding a document to
the mongo collection, and you do not specify an _id field, mongo will
create a unique value for you.

In our new world, things that used to be unique, like machines,
services, units etc, are now only unique when paired with the
environment id.

It seems we have a number of options here.

1. change the _id field to be a composed field where it is the
concatenation of the environment id and the existing id or name field.
If we do take this approach, I strongly recommend having the fields that
make up the key be available by themselves elsewhere in the document
structure.

2. let mongo create the _id field, and we ensure uniqueness over the
pair of values with a unique index. One thing I am unsure about with
this approach is how we currently do our insertion checks, where we do a
"document does not exist" check.  We wouldn't be able to do this as a
transaction assertion as it can only check for _id values.  How fast are
the indices updated?  Can having a unique index for a document work for
us?  I'm hoping it can if this is the way to go.

3. use a composite _id field such that the document may start like this:
  { _id: { env_uuid: blah, name: foo}, ...
This gives the benefit of existence checks, and real names for the _id
parts.
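
A minimal sketch of how options 1 and 3 might look as Go types. The type
and function names are hypothetical, the ":" separator in option 1 is an
assumption, and encoding/json is used here only to print the document
shapes (real code would carry bson tags instead).

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ConcatDoc illustrates option 1: _id is the concatenation of the
// environment id and the name, with both parts repeated as plain fields.
type ConcatDoc struct {
	ID      string `json:"_id"`
	EnvUUID string `json:"env_uuid"`
	Name    string `json:"name"`
}

// CompositeID illustrates the key used by option 3: a sub-document.
type CompositeID struct {
	EnvUUID string `json:"env_uuid"`
	Name    string `json:"name"`
}

// CompositeDoc illustrates option 3: _id is itself a sub-document.
type CompositeDoc struct {
	ID CompositeID `json:"_id"`
}

// NewConcatDoc builds an option 1 document; ":" is an assumed separator.
func NewConcatDoc(envUUID, name string) ConcatDoc {
	return ConcatDoc{ID: envUUID + ":" + name, EnvUUID: envUUID, Name: name}
}

// NewCompositeDoc builds an option 3 document.
func NewCompositeDoc(envUUID, name string) CompositeDoc {
	return CompositeDoc{ID: CompositeID{EnvUUID: envUUID, Name: name}}
}

// mustJSON renders a value as JSON and panics on failure.
func mustJSON(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(NewConcatDoc("blah", "foo")))
	fmt.Println(mustJSON(NewCompositeDoc("blah", "foo")))
}
```

One thing to keep in mind with a sub-document _id is that MongoDB compares
embedded documents field by field in order, so the key fields need to be
serialized in the same order every time. A struct gives that ordering; an
unordered map would not.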

Thoughts? Opinions? Recommendations?

BTW, I think that if we can make 3 work, then it is the best approach.

Tim



Re: RFC: mongo _id fields in the multi-environment juju server world

2014-07-03 Thread John Meinel
According to the mongo docs
(http://docs.mongodb.org/manual/core/document/#record-documents):
"The field name _id is reserved for use as a primary key; its value must be
unique in the collection, is immutable, and may be of any type other than
an array."

That makes it sound like we *could* use an object for the _id field and do
_id = {env_uuid:, name:}

Though I thought the purpose of doing something like that is to allow
efficient sharding in a multi-environment world.

Looking here: http://docs.mongodb.org/manual/core/sharding-shard-key/
The shard key must be indexed (which is just fine for us w/ the primary _id
field or with any other field on the documents), and "The index on the
shard key cannot be a multikey index"
(http://docs.mongodb.org/manual/core/index-multikey/#index-type-multikey).
I don't really know what that means in the case of wanting to shard based
on an object instead of a simple string, but it does sound like it might be
a problem.
Anyway, for purposes of being *unique* we may need to put environ uuid in
there, but for the purposes of sharding we could just put it on another
field and index that field.

John
=:->


