[SQL] idea for a geographically distributed database: how best to implement?
Hello, I've got a database for a website which is a variant of the 'show stuff near to me' sort of thing. Rather than host this database on a single server, I have a scheme in mind to break the database up geographically so that each one can run comfortably on a small server, but I'm not sure about the best way of implementing it. Here's the scheme: Imagine that the country is split into an array of square cells. Each cell contains a database that stores information about people who live in the area covered by the cell. There's one problem with this scheme. What happens if you live near the edge of a cell? My solution is that any inserted data which lies near to the edge of cell A is *also* inserted in the database of the relevant neighbouring cell - let's say cell B. Thus, if someone lives in cell B, but close to the border with cell A, they'll see the data that is geographically close to them, even if it lies in cell A. Is this a common pattern? I could, of course, simply find every insert, update and delete in the application and alter the code to explicitly update all the relevant databases, but is there a more elegant way of simply saying: "Do this transaction on both Database A and Database B" monotonically? I've had a look at some replication solutions, but they all seem to involve replicating an entire database. The advantage of my scheme is that if I can distribute my application over large numbers of small servers, I'll end up with more bangs for the buck, and it'll be much easier to manage growth by managing the number of servers, and number of cells hosted on each server. Thanks for any suggestions! Andy Ballingall ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [SQL] idea for a geographically distributed database: how best to implement?
Hi again, I don't think I've explained my idea well enough: >You might want to consider using latitude and longitude or zip codes or taking more of a traditional "GIS" approach rather than duplicating data across redundant databases. I do use coordinates to position data. The databases aren't redundant. Each database serves a particular region, the aim being to ensure that no one database gets too big for the machines in my farm to cope with. >Another issue is that you end up having to query every database to find proximity... Suppose you have 500 "cells" ? You now have to invoke some sort of RPC mechanism on 499 other RDBMS, manage 500 user names and passwords, retrieve potentially 500 recordsets, merge the data, etc... 1. It doesn't matter how many cells there are. A user's details are only found in the database covering the area he lives, and possibly in the neighbouring databases (at most 3, if he lives in the corner of the cell). 2. The neighbouring cells of cell A never change, therefore you don't need to do anything clever to identify which other databases to write to in these edge conditions, since when the user logs in, the relationship information is read just once (just like the user's name and picture) 3. No merging ever needs to take place during queries. Here's an example. Let's say, you, a resident of cell A, say 'show me who lives nearby!'. The database of cell A has *all* the information. Why? Well, when a resident in nearby cell B registered, his details were also added to cell A at that time. Database A doesn't know, or care, how the data it holds got there. 4. Which database do I use to serve all your requests? Well, yes, once, I have to work out which one it is, based on where you are, but that's almost no work at all (see next point...) 5. I didn't mention before that my scheme *does* also require a national database, but the size of that is well defined, as the only jobs it needs to do are: a) Know the names of all the places, b) Know the definition of the existing cells, and their positions - to make it easy to manage the cell structure, and to know where to send you when you login c) A set of global sequences used to generate unique ids for all the other databases. When your record goes into cell A and cell B, the id of this record is the *same*... >Your problems will continue to get more and more complex... You are better off with a single RDBMS and a single schema... I currently have a single database. It is clearly simpler. However, I also have a headache about what machinery to buy so that: a) it starts off cheap b) It is easy to expand, if and when the website takes off. Yes, if I knew how many people would use this service, and how quickly it would expand, I'd be able to raise the capital to buy a huge nationwide server. But I don't. So the answer 'how big does your DB server need to be' is simply 'I can only make a very poor guess'. This doesn't look good in a business plan. Far better if I can demonstrate an application which can start off on a single simple server, and can elegantly expand with the user base, with the farm of servers growing over time. Unlike many data sets, mine is almost totally partitioned geographically. There is only *one* little detail - that of visibility of data in neighbouring cells, and that is sorted out with my idea of duplicating information between neighbours. Hope that fills in some gaps... Thanks for your comments, Andy "Andy Ballingall" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED] > Hello, > > I've got a database for a website which is a variant of the 'show stuff > near > to me' sort of thing. > > Rather than host this database on a single server, I have a scheme in mind > to break the database up geographically so that each one can run > comfortably > on a small server, but I'm not sure about the best way of implementing it. > > Here's the scheme: > > > Imagine that the country is split into an array of square cells. > Each cell contains a database that stores information about people who > live > in the area covered by the cell. > > There's one problem with this scheme. What happens if you live near the > edge > of a cell? > > My solution is that any inserted data which lies near to the edge of cell > A > is *also* inserted in the database of the relevant neighbouring cell - > let's > say cell B. > > Thus, if someone lives in cell B, but close to the border with cell A, > they'll see the data that is geographically close to > them, even if it lies in cell A. > > > > Is this a common pattern? > > I could, of course, sim
Re: [SQL] idea for a geographically distributed database: how best
>Andy, i agree with what codeWarrior says. But if you are interested in replication, dbmirror is very elegant (altho not as trendy) simple, and highly customizable replication solution. I have heavily modified dbmirror to even support Asynchronous Conditional row grained Foreign key dependency Traversal oriented Lazy Replication!!! (which ofcourse nobody wants :) I'll go and check it out, so even if I don't want to use it, I'll at least know what it means! (I considered suggesting using an acronym for it, but 'ACRGFKDTOLR' made my lips bleed when I tried to say it). Thanks, Andy > > > > > "Andy Ballingall" <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] > > Hello, > > > > I've got a database for a website which is a variant of the 'show stuff > > near > > to me' sort of thing. > > > > Rather than host this database on a single server, I have a scheme in mind > > to break the database up geographically so that each one can run > > comfortably > > on a small server, but I'm not sure about the best way of implementing it. > > > > Here's the scheme: > > > > > > Imagine that the country is split into an array of square cells. > > Each cell contains a database that stores information about people who > > live > > in the area covered by the cell. > > > > There's one problem with this scheme. What happens if you live near the > > edge > > of a cell? > > > > My solution is that any inserted data which lies near to the edge of cell > > A > > is *also* inserted in the database of the relevant neighbouring cell - > > let's > > say cell B. > > > > Thus, if someone lives in cell B, but close to the border with cell A, > > they'll see the data that is geographically close to > > them, even if it lies in cell A. > > > > > > > > Is this a common pattern? > > > > I could, of course, simply find every insert, update and delete in the > > application and alter the code to explicitly update all the relevant > > databases, but is there a more elegant way of simply saying: "Do this > > transaction on both Database A and Database B" monotonically? > > > > I've had a look at some replication solutions, but they all seem to > > involve > > replicating an entire database. The advantage of my scheme is that if I > > can > > distribute my application over large numbers of small servers, I'll end up > > with more bangs for the buck, and it'll be much easier to manage growth by > > managing the number of servers, and number of cells hosted on each server. > > > > Thanks for any suggestions! > > Andy Ballingall > > > > > > ---(end of broadcast)--- > > TIP 3: Have you checked our extensive FAQ? > > > > http://www.postgresql.org/docs/faq > > > > > > ---(end of broadcast)--- > TIP 9: In versions below 8.0, the planner will ignore your desire to >choose an index scan if your joining column's datatypes do not >match > -- -Achilleus ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.362 / Virus Database: 267.13.3/173 - Release Date: 16/11/2005 ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [SQL] idea for a geographically distributed database: how best to implement?
Thanks Aidan, That's very interesting! I've not used slony yet, so I'll setup some tests and try exactly this mechanism. What I like about it is that the changes that have to occur to either add a cell or to split an existing cell into two to manage growing traffic have a localised effect on the web of database interdependencies, making it more straightforward to do maintenance without the website being taken offline for most people. They'd only notice downtime if they happened to live in or next to the cells being messed around with. Not having to touch the actual web application is nice. The other advantage over my manual hack is that the replicated data is just that - replicated, so when each database does a backup, it's only the master data that is getting backed up and restored. In my manual scheme, following restoration after a node failure, I'd have to do a postprocessing step to work out which restored data was actually data slaved from a neighbouring cell, and refresh it in case the neighbouring data has been updated since the node went down. (I hadn't considered backup and restore properly up till now, but I'm going to add that to the list of good reasons for taking advantage of my 'distantly disjoint' data set. If my national data is shared between 500 cells, then each database is going to be titchy, and backup will take 'no time at all', as in parallel, 500 streams of database data flow first to node local storage and thence to safer, remote storage) Cheers, Andy - -Original Message- From: Aidan Van Dyk [mailto:[EMAIL PROTECTED] Sent: 17 November 2005 19:54 To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: idea for a geographically distributed database: how best to implement? Andy, This is *very* interesting scenario. Definitely peaked my interest on this one. I haven't tried this scenario before, but I have used "inheritted" tables to do "splitups". Basically, just partition your data, and replicate certain partitions between various neighbours. http://www.postgresql.org/docs/8.1/interactive/ddl-partitioning.html With Slony, you can get "master slave" replication of particular tables "sets" (not necessarily whole databases). So I haven't developped this idea, or tried it, but you might be able to play with it and make it work. On *each* database, you have a head table, with no records: CREATE TABLE complete (); And a bunch of "cell" tables that inherit the "complete" table: CREATE TABLE cell_a () INHERITS "complete"; CREATE TABLE cell_b () INHERITS "complete"; Basically, you *always* work on the "complete" table, with the following rules: CREATE RULE cell_replicate_insert ON INSERT TO complete DO INSTEAD INSERT INTO cell_a (...); CREAT RULE cell_replicate_update ON UPDATE TO complete DO INSTEAD UPDATE cell_a ...; Use SlonyI to replecate cella to NODE B (master->slave), and on NODE B, do something similar (except work on b, and replicate b -> NODE A). This way, all your web/interface code everywhere is using the "complete" table for any insert/update/select. When insert/updateing, it will always go to the "local" table, which Slony "replicates" out to the neighbouring nodes. And when selecting on it (the "complete" table), you get data from your local table, and any "replicated" neighbouring node's tables which are in the local schema inheriting from "complete" and SlonyI is updating. Good luck, and if you get anywhere with this, let me know! Andy Ballingall wrote: > Hello, > > I've got a database for a website which is a variant of the 'show stuff > near to me' sort of thing. > > Rather than host this database on a single server, I have a scheme in mind > to break the database up geographically so that each one can run > comfortably on a small server, but I'm not sure about the best way of > implementing it. > > Here's the scheme: > > > Imagine that the country is split into an array of square cells. > Each cell contains a database that stores information about people who > live in the area covered by the cell. > > There's one problem with this scheme. What happens if you live near the > edge of a cell? > > My solution is that any inserted data which lies near to the edge of cell > A is *also* inserted in the database of the relevant neighbouring cell - > let's say cell B. > > Thus, if someone lives in cell B, but close to the border with cell A, > they'll see the data that is geographically close to > them, even if it lies in cell A. > > > > Is this a common pattern? > > I cou
[SQL] idea for a geographically distributed database: how best to implement?
Hello, I’ve got a database for a website which is a variant of the ‘show stuff near to me’ sort of thing. Rather than host this database on a single server, I have a scheme in mind to break the database up geographically so that each one can run comfortably on a small server, but I’m not sure about the best way of implementing it. Here’s the scheme: Imagine that the country is split into an array of square cells. Each cell contains a database that stores information about people who live in the area covered by the cell. There’s one problem with this scheme. What happens if you live near the edge of a cell? My solution is that any inserted data which lies near to the edge of cell A is *also* inserted in the database of the relevant neighbouring cell – let’s say cell B. Thus, if someone lives in cell B, but close to the border with cell A, they’ll see the data that is geographically close to them, even if it lies in cell A. Is this a common pattern? I could, of course, simply find every insert, update and delete in the application and alter the code to explicitly update all the relevant databases, but is there a more elegant way of simply saying: “Do this transaction on both Database A and Database B” monotonically? I’ve had a look at some replication solutions, but they all seem to involve replicating an entire database. The advantage of my scheme is that if I can distribute my application over large numbers of small servers, I’ll end up with more bangs for the buck, and it’ll be much easier to manage growth by managing the number of servers, and number of cells hosted on each server. Thanks for any suggestions! Andy Ballingall 'pgsql-sql@postgresql.org'
[SQL] Is it possible to redirect an update/insert/delete to a different table?
Hello, Is there a way that I can specify that a request to change to a particular table is redirected instead to a different table? I’ve looked through rules, and as far as I can make out, they are only useful for explicit actions. I’m looking for something that behaves as though it simply substitutes the table name for a different table name before executing the command, no matter what the command looks like. Thanks Andy Ballingall
Re: [SQL] Is it possible to redirect an update/insert/delete to a different table?
Hello Peter, I'm glad it's possible, but I can't see how from the documentation. Say if I have a table called 'apples' and a table called 'pears'. What would the rule look like that would remap all updates on apples so that they were applied to pears instead? Thanks, Andy ----- Andy Ballingall wrote: > I've looked through rules, and as far as I can make out, they are > only useful for explicit actions. I'm looking for something that > behaves as though it simply substitutes the table name for a > different table name before executing the command, no matter what the > command looks like. You can write a couple of rules for that. -- Peter Eisentraut http://developer.postgresql.org/~petere/ -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.362 / Virus Database: 267.13.4/176 - Release Date: 20/11/2005 ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [SQL] Is it possible to redirect an update/insert/delete to a different table?
Hello Jaime, I'm still not quite clear. Say I have a number of different updates on a table 'apples' in my code, including: UPDATE apples set pips=6 and color='yellow' where id=3; UPDATE apples set size=10 where id=6; What would a rule look like which, when *any* update is attempted on the apples table, will instead apply the update to a different table - 'pears'. I get this far: CREATE rule pears_instead_of_apples AS ON UPDATE TO apples DO INSTEAD UPDATE INTO pears .; What do I put where the . is, so that the rule will transparently update the pears table with whatever values happened to be defined by the original update command? Is there a special keyword that I've missed? Regards, Andy Ballingall -Original Message- From: Jaime Casanova [mailto:[EMAIL PROTECTED] Sent: 20 November 2005 14:23 To: [EMAIL PROTECTED] Cc: pgsql-sql@postgresql.org Subject: Re: Is it possible to redirect an update/insert/delete to a different table? > I've looked through rules, and as far as I can make out, they are only > useful for explicit actions. I'm looking for something that behaves as > though it simply substitutes the table name for a different table name > before executing the command, no matter what the command looks like. > Make 3 rules (INSERT/UPDATE/DELETE), 4 if you want SELECT as well... -- regards, Jaime Casanova (DBA: DataBase Aniquilator ;) -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.362 / Virus Database: 267.13.4/176 - Release Date: 20/11/2005 ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [SQL] Is it possible to redirect an update/insert/delete to a different table?
Hi Andreas, The rule you've given only works for an update which changes the name. If I do another update which changed the colour instead of the name, that rule wouldn't do the right thing. Instead, I'm looking for something which, with a single 'rule' (or whatever the mechanism ends up being), intercepts *any* update on apples, and applies the changes to the pears table instead, as if the only change that occurred was a change of table name. I can achieve this in the application which generates the sql commands, but someone else suggested it was possible with rules, but it may not be the case. Thanks! Andy -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Andreas Kretschmer Sent: 20 November 2005 16:17 To: pgsql-sql@postgresql.org Subject: Re: [SQL] Is it possible to redirect an update/insert/delete to a different table? Andy Ballingall <[EMAIL PROTECTED]> schrieb: > Hello Peter, > > I'm glad it's possible, but I can't see how from the documentation. > > Say if I have a table called 'apples' and a table called 'pears'. > > What would the rule look like that would remap all updates on apples so that > they were applied to pears instead? create rule apples_pears_update as on update to apples do instead update pears set name= NEW.name where id=NEW.id ; test=# select * from apples ; id | name +-- 1 | a (1 row) test=# select * from pears ; id | name +-- 1 | b (1 row) test=# update apples set name = 'c' where id = 1; UPDATE 1 test=# select * from pears ; id | name +-- 1 | c (1 row) http://www.postgresql.org/docs/8.1/interactive/rules-update.html HTH, Andreas -- Really, I'm not out to destroy Microsoft. That will just be a completely unintentional side effect. (Linus Torvalds) Kaufbach, Saxony, Germany, Europe. N 51.05082°, E 13.56889° ---(end of broadcast)--- TIP 6: explain analyze is your friend -- No virus found in this incoming message. Checked by AVG Free Edition. Version: 7.1.362 / Virus Database: 267.13.4/176 - Release Date: 20/11/2005 ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] Is it possible to redirect an update/insert/delete to a different table?
>Try it. [snipped example] Ah. Basically, you set up the rule to assign every column, and if the update doesn't redefine some columns, then it still works. I didn't understand that you could get the rule to work generically like this. I'll presume that the rule will need amending if the table column definition later changes. (E.g. if I add 'stalk_length' to my apples and pears tables)... Thanks very much for your help. Andy ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [SQL] idea for a geographically distributed database: how best to implement?
David Bath wrote: > There are a couple of philosophical perspectives I've come across in > previous > work with cadastral data that may be useful...[snipped] Thanks, David In this particular application, structures such as postcode sectors, administrative boundaries etc. are not really of much importance, as most stuff is a simple coordinate based searches. Even with the problem partitioned into disjoint regions, within each region, the search remains trivial, as all the data that the user is allowed to access will be stored with that region (this includes data replicated from neighbouring regions). In this context, the interesting task isn't so much the actual database searching, or the exact definition of the disjoint regions. The interesting task is to define a system which can dynamically remap the hosting of regions to specific servers, so that no one server gets too busy. As demand grows, I simply plug in more 4 blades and press the 'reconfigure' button (Sorry - I was dreaming for a moment...) The only limiters are the number of servers available and the activity within a single region (which must be servable by a single server), but given the highly localised nature of the application, the regions can be very small, and I don't expect to ever see a region with more than 1GB of data - the aim being for all the data to be resident in RAM. So far, I've already seen some issues. I've been looking at slony-1 to handle the replication between adjacent regions, and not only is it asynchronous (I was hoping otherwise...slony-2 seems a long way off), but changing the db schema has ramifications too. (I.e. changing the schema means redefining each replication). Still - no show stoppers yet. Thanks for your insights, Andy ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly