D'Arcy J.M. Cain wrote:
> On Sat, 13 Mar 2010 23:42:31 -0800
> Jonathan Gardner <jgard...@jonathangardner.net> wrote:
>> On Fri, Mar 12, 2010 at 11:23 AM, Paul Rubin <no.em...@nospam.invalid> wrote:
>>> "D'Arcy J.M. Cain" <da...@druid.net> writes:
>>>> Just curious, what database were you using that wouldn't keep up with
>>>> you? I use PostgreSQL and would never consider going back to flat
>>>> files.
>>> Try making a file with a billion or so names and addresses, then
>>> compare the speed of inserting that many rows into a postgres table
>>> against the speed of copying the file.
>
> That's a straw man argument. Copying an already built database to
> another copy of the database won't be significantly longer than copying
> an already built file. In fact, it's the same operation.
>
>> Also consider how much work it is to partition data from flat files
>> versus PostgreSQL tables.
>
> Another straw man. I'm sure you can come up with many contrived
> examples to show one particular operation faster than another.
> Benchmark writers (bad ones) do it all the time. I'm saying that in
> normal, real world situations where you are collecting billions of data
> points and need to actually use the data that a properly designed
> database running on a good database engine will generally be better than
> using flat files.
>
>>>> The only thing I can think of that might make flat files faster is
>>>> that flat files are buffered whereas PG guarantees that your
>>>> information is written to disk before returning
>>> Don't forget all the shadow page operations and the index operations,
>>> and that a lot of these operations require reading as well as writing
>>> remote parts of the disk, so buffering doesn't help avoid every disk
>>> seek.
>
> Not sure what a "shadow page operation" is but index operations are
> only needed if you have to have fast access to read back the data. If
> it doesn't matter how long it takes to read the data back then don't
> index it.
> I have a hard time believing that anyone would want to save
> billions of data points and not care how fast they can read selected
> parts back or organize the data though.
>
>> Plus the fact that your other DB operations slow down under the load.
>
> Not with the database engines that I use. Sure, speed and load are
> connected whether you use databases or flat files but a proper database
> will scale up quite well.
>
A common complaint about large database loads taking a long time comes
from trying to commit the whole change as a single transaction. Such an
approach can indeed put stress on the database system, but it isn't
usually necessary.
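To make the batching idea concrete, here is a rough sketch only. It uses
Python's built-in sqlite3 module as a stand-in for a server database, and
the table name "people", the helpers load_in_batches and make_rows, and
the batch size are all invented for the illustration rather than taken
from anyone's real setup.

```python
import sqlite3
from itertools import islice


def load_in_batches(conn, rows, batch_size=10_000):
    """Insert rows, committing once per batch instead of once overall."""
    it = iter(rows)
    total = 0
    while True:
        # Pull at most batch_size rows without materialising the whole input.
        batch = list(islice(it, batch_size))
        if not batch:
            break
        conn.executemany(
            "INSERT INTO people (name, address) VALUES (?, ?)", batch
        )
        conn.commit()  # close this transaction; the next insert opens a new one
        total += len(batch)
    return total


def make_rows(n):
    """Hypothetical data source: yields n made-up (name, address) pairs."""
    return ((f"name{i}", f"{i} Main St") for i in range(n))


# In-memory SQLite database, assumed here purely so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, address TEXT)")
loaded = load_in_batches(conn, make_rows(25_000), batch_size=10_000)
```

The right batch size depends on the engine and the hardware, so treat the
number above as a placeholder to be tuned by measurement.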
I don't know about PostgreSQL's capabilities in this area but I do know
that Oracle (which claims to be all about performance, though in fact I
believe PostgreSQL is its equal in many applications) allows you to
switch off the various time-consuming features such as transaction
logging in order to make bulk updates faster.

I also question how many databases would actually find a need to store
addresses for a sixth of the world's population, but this objection is
made mostly for comic relief: I understand that tables of such a size
are necessary sometimes.

There was a talk at OSCON two years ago by someone who was using
PostgreSQL to process 15 terabytes of medical data. I'm sure he'd have
been interested in suggestions that flat files were the answer to his
problem ...

Another talk a couple of years before that discussed how PostgreSQL was
superior to Oracle in handling a three-terabyte data warehouse (though
it conceded Oracle's superiority in handling the production OLTP system
on which the warehouse was based - but that's four years ago).

  http://images.omniti.net/omniti.com/~jesus/misc/BBPostgres.pdf

Of course if you only need sequential access to the data then the
relational approach may be overkill. I would never argue that relational
is the best approach for all data and all applications, but it's often
better than its less-informed critics realize.

regards
 Steve
-- 
Steve Holden           +1 571 484 6266   +1 800 494 3119
See PyCon Talks from Atlanta 2010  http://pycon.blip.tv/
Holden Web LLC                 http://www.holdenweb.com/
UPCOMING EVENTS:        http://holdenweb.eventbrite.com/