Re: [PERFORM] Large rows number, and large objects

2011-06-19 Thread Craig James

On 6/19/11 4:37 AM, Samuel Gendler wrote:

On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa
ildefonso.cama...@gmail.com wrote:

Greetings,

I have been thinking a lot about pgsql performance when it is dealing
with tables with lots of rows on one table (several millions, maybe
thousands of millions).  Say, the Large Object use case:

one table has large objects (i.e., holds a pointer to each object).
The large object table stores each object in 2048-byte chunks
(LOBLKSIZE), so if we have something like 1TB of data stored in large
objects, the large objects table would have something like 550M rows;
if we get to 8TB, we will have 4400M rows (or so).

I have read in several places that huge tables should be partitioned
to improve performance.  Now, my first question: does the
large objects system automatically partition itself?  If not, will
Large Objects performance degrade as we add more data?  (I guess
it would.)


You should consider partitioning your data in a different way: Separate the 
relational/searchable data from the bulk data that is merely being stored.

Relational databases are just that: relational.  The thing they do well is to 
store relationships between various objects, and they are very good at finding 
objects using relational queries and logical operators.

But when it comes to storing bulk data, a relational database is no better than 
a file system.

In our system, each object is represented by a big text object of a few
kilobytes.  Searching that text is essentially useless -- the only reason it's
there is for visualization and to pass on to other applications.  So it's
separated out into its own table, which holds only the text record and a
primary key.

We then use other tables to hold extracted fields and computed data about the primary 
object, and the relationships between the objects.  That means we've effectively 
partitioned our data into searchable relational data and non-searchable bulk 
data.  The result is that we have around 50 GB of bulk data that's never searched, and
about 1 GB of relational, searchable data in a half-dozen other tables.
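
A minimal sketch of that layout (the table, column, and value names here
are made up, just to illustrate the split):

    -- Bulk data: never searched, fetched only by primary key.
    CREATE TABLE object_text (
        object_id  bigint PRIMARY KEY,
        body       text NOT NULL
    );

    -- Searchable relational data extracted from each object.
    CREATE TABLE object_meta (
        object_id  bigint PRIMARY KEY REFERENCES object_text (object_id),
        name       text,
        created_at timestamptz
    );

    -- Queries run against the small relational table...
    SELECT object_id FROM object_meta WHERE name = 'benzene';
    -- ...and the bulk row is fetched by key only when needed.
    SELECT body FROM object_text WHERE object_id = 42;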

With this approach, there's no need for table partitioning, and full table 
scans are quite reasonable.

Craig


[PERFORM] hstore - Implementation and performance issues around its operators

2011-06-19 Thread Stefan Keller
Hi,

We did a benchmark comparing key-value pairs stored in an EAV db
schema versus hstore.
The results are promising in favor of hstore, but some questions
remain.
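
Schematically, the two layouts being compared look roughly like this
(the table and column names are only illustrative):

    -- EAV: one row per key/value pair.
    CREATE TABLE eav (
        entity_id bigint NOT NULL,
        key       text   NOT NULL,
        value     text,
        PRIMARY KEY (entity_id, key)
    );

    -- hstore: all pairs for an entity in a single column.
    CREATE TABLE kv (
        entity_id bigint PRIMARY KEY,
        pairs     hstore
    );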

1. Obviously the '@>' operator has to be used in order to make use of
the GiST index.  Why is the '->' operator not supported by GiST ('->'
is actually mentioned in all the examples in the docs)?
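
For illustration, using the hypothetical kv table sketched above:

    CREATE INDEX kv_pairs_gist ON kv USING gist (pairs);

    -- Can use the GiST index: '@>' tests containment.
    SELECT entity_id FROM kv WHERE pairs @> 'color=>red'::hstore;

    -- Cannot use the GiST index: '->' merely fetches a value, so the
    -- comparison is an ordinary expression (a btree index on the
    -- expression (pairs -> 'color') would be needed instead).
    SELECT entity_id FROM kv WHERE pairs -> 'color' = 'red';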

2. Currently the hstore elements are stored in the order in which they
come from the insert statement / constructor.
Why are the elements not ordered, i.e., why is the hstore not cached in
all hstore functions (like hstore_fetchval etc.)?

3. In the source code 'hstore_io.c' one finds the following enigmatic
note: "... very large hstore values can't be output. this could be
fixed, but many other data types probably have the same issue."
What is the max. length of an hstore (i.e., the max. total length of
all elements in text representation)?

4. Lastly, I don't fully understand the following note in the hstore
docs (http://www.postgresql.org/docs/current/interactive/hstore.html):
 Notice that the old names are reversed from the convention
 formerly followed by the core geometric data types!

Why 'names'?  Why not rather 'operators' or 'functions'?
What does this 'reversed from the convention' mean concretely?

Yours, Stefan



Re: [PERFORM] Degrading PostgreSQL 8.4 write performance

2011-06-19 Thread Pierre C



Load testing of postgresql 8.4 for OLTP application
suitability showed that the throughput of the
database degraded significantly over time, from thousands of write
transactions per second to almost zero.


A typical postgres benchmarking gotcha is:

- you start with empty tables
- the benchmark fills them
- query plans which were prepared based on stats of empty (or very
  small) tables become totally obsolete when the table sizes grow
- therefore everything becomes very slow as the tables grow

So you should disconnect/reconnect or issue a DISCARD ALL periodically on  
each connection, and of course periodically do some VACUUM ANALYZE (or  
have autovacuum do that for you).
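
For example (the table name is just a placeholder):

    -- From each benchmark connection, every few thousand transactions:
    DISCARD ALL;   -- drops prepared statements and resets session state

    -- And periodically, from a maintenance connection
    -- (or let autovacuum take care of it):
    VACUUM ANALYZE orders;   -- refresh planner statistics as tables grow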




Re: [PERFORM] Large rows number, and large objects

2011-06-19 Thread Jose Ildefonso Camargo Tolosa
Hi!

Thanks (you both, Samuel and Craig) for your answers!

On Sun, Jun 19, 2011 at 11:19 AM, Craig James
craig_ja...@emolecules.com wrote:
 On 6/19/11 4:37 AM, Samuel Gendler wrote:

 On Sat, Jun 18, 2011 at 9:06 PM, Jose Ildefonso Camargo Tolosa
 ildefonso.cama...@gmail.com wrote:

 Greetings,

 I have been thinking a lot about pgsql performance when it is dealing
 with tables with lots of rows on one table (several millions, maybe
 thousands of millions).  Say, the Large Object use case:

 one table has large objects (i.e., holds a pointer to each object).
 The large object table stores each object in 2048-byte chunks
 (LOBLKSIZE), so if we have something like 1TB of data stored in large
 objects, the large objects table would have something like 550M rows;
 if we get to 8TB, we will have 4400M rows (or so).

 I have read in several places that huge tables should be partitioned
 to improve performance.  Now, my first question: does the
 large objects system automatically partition itself?  If not, will
 Large Objects performance degrade as we add more data?  (I guess
 it would.)

 You should consider partitioning your data in a different way: Separate
 the relational/searchable data from the bulk data that is merely being
 stored.

 Relational databases are just that: relational.  The thing they do well is
 to store relationships between various objects, and they are very good at
 finding objects using relational queries and logical operators.

 But when it comes to storing bulk data, a relational database is no better
 than a file system.

 In our system, each object is represented by a big text object of a few
 kilobytes.  Searching that text is essentially useless -- the only reason
 it's there is for visualization and to pass on to other applications.  So
 it's separated out into its own table, which holds only the text record and a
 primary key.

Well, my original schema does exactly that (I mimic the LO schema):

files (searchable): id, name, size, hash, mime_type, number_chunks
files_chunks (bulk): id, file_id, hash, chunk_number, data (bytea)

So, my bulk data is in the files_chunks table, but because I restrict
each chunk to 2000 bytes, the total number of rows in the
files_chunks table can get *huge*.

So, the system would search the files table and then the
files_chunks table (to get each of the chunks and, maybe, send them
out to the web client).
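
In rough SQL (the file name and id are just examples):

    -- Find the file by its searchable metadata...
    SELECT id FROM files WHERE name = 'report.pdf';

    -- ...then stream its chunks in order.
    SELECT data FROM files_chunks
     WHERE file_id = 42
     ORDER BY chunk_number;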

So, with a prospect of ~4500M rows for that table, I really thought it
could be a good idea to partition the files_chunks table.  Since I'm
expecting relatively small files (100MB), table partitioning should
work well here, because I could arrange for all of a file's chunks
to be contained in the same partition.  And even if the system were
to get larger files (5GB), this approach should still work.
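
For what it's worth, a minimal sketch of that idea with the
inheritance-based partitioning available in 8.4/9.0 (the ranges and
names are made up):

    -- Parent table; the child tables hold the actual rows.
    CREATE TABLE files_chunks (
        id           bigserial,
        file_id      bigint  NOT NULL,
        hash         bytea,
        chunk_number integer NOT NULL,
        data         bytea
    );

    -- One child per range of file ids; the CHECK constraint lets
    -- constraint_exclusion prune partitions at plan time.
    CREATE TABLE files_chunks_p0 (
        CHECK (file_id >= 0 AND file_id < 1000000)
    ) INHERITS (files_chunks);

    CREATE INDEX files_chunks_p0_idx
        ON files_chunks_p0 (file_id, chunk_number);

    -- Inserts must be routed to the right child, e.g. with a trigger
    -- on the parent or directly from the application.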

The original question was about Large Objects and partitioning...
See, according to the documentation
(http://www.postgresql.org/docs/9.0/static/lo-intro.html):

"All large objects are placed in a single system table called pg_largeobject."

So, the question is: if I were to store 8TB worth of data in the large
objects system, wouldn't that actually make the pg_largeobject table
slow, unless it is automatically partitioned?
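
For reference, the chunked layout is visible directly in the catalog
(reading pg_largeobject may require superuser privileges):

    -- Each large object is split into LOBLKSIZE (2048-byte) pages:
    SELECT loid, pageno, octet_length(data)
      FROM pg_largeobject
     LIMIT 3;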

Thanks for taking the time to discuss this matter with me!

Sincerely,

Ildefonso Camargo
