Re: [PERFORM] Large rows number, and large objects
On 6/19/11 4:37 AM, Samuel Gendler wrote:
> On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa
> <ildefonso.cama...@gmail.com> wrote:
>
>> Greetings,
>>
>> I have been thinking a lot about pgsql performance when it is dealing
>> with tables with lots of rows in one table (several millions, maybe
>> thousands of millions). Say, the Large Object use case: one table has
>> large objects (each row holds a pointer to one object). The large
>> object table stores each large object in 2000-byte chunks (iirc), so
>> if we have something like 1TB of data stored in large objects, the
>> large objects table would have something like 550M rows; if we get to
>> 8TB, we will have 4400M rows (or so).
>>
>> I have read in several places that huge tables should be partitioned
>> to improve performance. Now, my first question: does the large
>> objects system automatically partition itself? If not, will Large
>> Objects system performance degrade as we add more data? (I guess it
>> would.)

You should consider partitioning your data in a different way: separate
the relational/searchable data from the bulk data that is merely being
stored.

Relational databases are just that: relational. The thing they do well
is to store relationships between various objects, and they are very
good at finding objects using relational queries and logical operators.
But when it comes to storing bulk data, a relational database is no
better than a file system.

In our system, each object is represented by a big text object of a few
kilobytes. Searching that text is essentially useless -- the only reason
it's there is for visualization and to pass on to other applications. So
it's separated out into its own table, which has only the text record
and a primary key. We then use other tables to hold extracted fields and
computed data about the primary object, and the relationships between
the objects.

That means we've effectively partitioned our data into searchable
relational data and non-searchable bulk data. The result is that we have
around 50 GB of bulk data that's never searched, and about 1 GB of
relational, searchable data in a half-dozen other tables. With this
approach, there's no need for table partitioning, and full table scans
are quite reasonable.

Craig
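As a minimal sketch of the split Craig describes -- all table and column
names here are illustrative, not from his actual schema:

    -- Bulk, non-searchable data: just a primary key and the raw text.
    CREATE TABLE object_blob (
        id  bigint PRIMARY KEY,
        raw text NOT NULL       -- a few kilobytes per row, never searched
    );

    -- Searchable, relational data extracted from the bulk object.
    CREATE TABLE object_meta (
        id         bigint PRIMARY KEY REFERENCES object_blob (id),
        name       text NOT NULL,
        created_at timestamptz DEFAULT now()
    );

    CREATE INDEX object_meta_name_idx ON object_meta (name);

    -- Queries touch only the small relational table; the bulk text is
    -- fetched by primary key once the interesting ids are known.
    SELECT b.raw
    FROM object_meta m
    JOIN object_blob b ON b.id = m.id
    WHERE m.name = 'example';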
[PERFORM] hstore - Implementation and performance issues around its operators
Hi,

We did a benchmark comparing Key-Value pairs stored as an EAV db schema
versus hstore. The results are promising in favor of hstore, but some
questions remain:

1. Obviously '@>' has to be used in order to make use of the GiST index.
   Why is the '->' operator not supported by GiST ('->' is actually
   mentioned in all examples of the doc)?

2. Currently the hstore elements are stored in the order in which they
   come from the insert statement / constructor. Why are the elements
   not ordered, i.e., why is the hstore not cached in all hstore
   functions (like hstore_fetchval etc.)?

3. In the source code 'hstore_io.c' one finds the following enigmatic
   note: "... very large hstore values can't be output. this could be
   fixed, but many other data types probably have the same issue." What
   is the max. length of an hstore (i.e., the max. length of the sum of
   all elements in text representation)?

4. Last, I don't fully understand the following note in the hstore doc
   (http://www.postgresql.org/docs/current/interactive/hstore.html):
   "Notice that the old names are reversed from the convention formerly
   followed by the core geometric data types!" Why "names"? Why not
   rather "operators" or "functions"? What does this "reversed from the
   convention" mean concretely?

Yours, Stefan
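For reference, a small self-contained sketch of the indexing behavior
question 1 is about (table and key names are made up for illustration,
and the hstore contrib module is assumed to be already installed):

    CREATE TABLE kv (
        id    serial PRIMARY KEY,
        attrs hstore
    );

    CREATE INDEX kv_attrs_gist ON kv USING gist (attrs);

    -- '@>' (contains) is one of the GiST-indexable hstore operators,
    -- so this predicate can use kv_attrs_gist:
    SELECT id FROM kv WHERE attrs @> 'color=>red'::hstore;

    -- '->' merely fetches the value for a key and is not an indexable
    -- operator, so this equivalent predicate falls back to a
    -- sequential scan:
    SELECT id FROM kv WHERE attrs -> 'color' = 'red';

    -- If you must filter on a single key, a btree expression index is
    -- one workaround:
    CREATE INDEX kv_color_idx ON kv ((attrs -> 'color'));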
Re: [PERFORM] Degrading PostgreSQL 8.4 write performance
> Load testing of postgresql 8.4 for OLTP application suitability showed
> that throughput of the database significantly degraded over time, from
> thousands of write transactions per second to almost zero.

A typical postgres benchmarking gotcha is:

- you start with empty tables
- the benchmark fills them
- query plans which were prepared based on stats of empty (or very
  small) tables become totally obsolete when the table sizes grow
- therefore everything becomes very slow as the tables grow

So you should disconnect/reconnect or issue a DISCARD ALL periodically
on each connection, and of course periodically do some VACUUM ANALYZE
(or have autovacuum do that for you).
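Concretely, the suggested periodic maintenance is just this (the table
name is illustrative):

    -- Reset a long-lived connection: DISCARD ALL drops cached plans,
    -- temporary tables, session settings, etc., so that new plans are
    -- built against current table statistics.
    DISCARD ALL;

    -- Refresh planner statistics and reclaim dead rows on the tables
    -- the benchmark writes to (autovacuum can also be left to do this):
    VACUUM ANALYZE my_benchmark_table;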
Re: [PERFORM] Large rows number, and large objects
Hi!

Thanks (you both, Samuel and Craig) for your answers!

On Sun, Jun 19, 2011 at 11:19 AM, Craig James
<craig_ja...@emolecules.com> wrote:
> [...]
> In our system, each object is represented by a big text object of a
> few kilobytes. Searching that text is essentially useless -- the only
> reason it's there is for visualization and to pass on to other
> applications. So it's separated out into its own table, which has only
> the text record and a primary key.

Well, my original schema does exactly that (I mimic the LO schema):

files (searchable): id, name, size, hash, mime_type, number_chunks
files_chunks: id, file_id, hash, chunk_number, data (bytea)

So my bulk data is in the files_chunks table, but because the data is
restricted (by me) to 2000 bytes per chunk, the total number of rows in
files_chunks can get *huge*. The system would search the files table,
and then read files_chunks (to get each of the chunks and, maybe, send
them out to the web client).

So, with a prospect of ~4500M rows for that table, I really thought it
could be a good idea to partition files_chunks. Given that I'm thinking
of relatively small files (~100MB), table partitioning should do great
here, because I could arrange for all of the chunks of a file to be
contained in the same partition. Even if the system were to get larger
files (5GB), this approach should still work.

The original question was about Large Objects and partitioning... see,
according to the documentation
(http://www.postgresql.org/docs/9.0/static/lo-intro.html):

  All large objects are placed in a single system table called
  pg_largeobject.

So the question is: if I were to store 8TB worth of data in the large
objects system, would that actually make the pg_largeobject table slow,
given that it is not automatically partitioned?

Thanks for taking the time to discuss this matter with me!

Sincerely,

Ildefonso Camargo
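For the archives, a minimal sketch of the schema described above, with
one possible 9.0-era inheritance-based partitioning of files_chunks.
The range boundaries and index choices are illustrative assumptions,
not from the original post:

    CREATE TABLE files (
        id            bigserial PRIMARY KEY,
        name          text NOT NULL,
        size          bigint NOT NULL,
        hash          text,
        mime_type     text,
        number_chunks integer NOT NULL
    );

    CREATE TABLE files_chunks (
        id           bigserial PRIMARY KEY,
        file_id      bigint NOT NULL REFERENCES files (id),
        hash         text,
        chunk_number integer NOT NULL,
        data         bytea       -- capped at 2000 bytes per chunk
    );

    -- 9.0-era partitioning: child tables inherit from the parent, and
    -- non-overlapping CHECK constraints let constraint_exclusion skip
    -- irrelevant children at plan time. Partitioning on file_id keeps
    -- every chunk of a given file inside a single child table.
    CREATE TABLE files_chunks_p0 (
        CHECK (file_id >= 0       AND file_id < 1000000)
    ) INHERITS (files_chunks);

    CREATE TABLE files_chunks_p1 (
        CHECK (file_id >= 1000000 AND file_id < 2000000)
    ) INHERITS (files_chunks);

    -- Indexes (like constraints) are per child, not inherited.
    CREATE INDEX files_chunks_p0_idx
        ON files_chunks_p0 (file_id, chunk_number);
    CREATE INDEX files_chunks_p1_idx
        ON files_chunks_p1 (file_id, chunk_number);

    -- Routing inserts to the correct child still needs a trigger or
    -- application-side logic.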