[GENERAL] "could not read block 0... " error followed by "database does not exist"
Hi. I have been running Postgres 8.2.7 on a Linux system for over a year with no problems. Today one of the database users reported the following error:

psql: FATAL: could not read block 0 of relation 1664/0/1262: read only 0 of 8192 bytes

I tried stopping and restarting the Postgres server for the database. From the logfile:

LOG: received smart shutdown request
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: database system is shut down
LOG: database system was shut down at 2010-02-12 17:15:37 PST
LOG: autovacuum launcher started
LOG: database system is ready to accept connections

But when I try to connect to the database, I get the error message:

FATAL: database "subptf" does not exist

I tried stopping/restarting the database several times. I also killed all user connections to the database. How do I fix this problem?

Thanks,
Janet

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
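[A note that may help with diagnosis: relation 1664/0/1262 is not the user's database but the shared catalog pg_database (1664 is the pg_global tablespace; 1262 is pg_database's default relfilenode), which may be why every database now appears not to exist. A first check, sketched here assuming a default layout with $PGDATA pointing at the data directory, is to look at the underlying file with the server stopped:

```shell
# pg_database is a shared catalog, stored as global/1262 under the data dir
ls -l "$PGDATA/global/1262"

# "read only 0 of 8192 bytes" suggests this file is truncated or empty;
# if so, the shared catalog itself is damaged, and restoring from backup
# is the remedy rather than another server restart
```
]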
[GENERAL] change location of postmaster.pid file?
Hi. We are looking into the possibility of running a Postgres server on an underutilized machine. This machine has very little local disk space, so we would have to create the data directory on a shared file system. The underutilized machine was set up so that it can *only read* from the shared file system, i.e., it cannot write to the shared file system.

Is it possible to create a database cluster on a machine that has write access to the shared file system, shut down the Postgres server on that machine, then start up the Postgres server on the machine that cannot write to the shared file system, and thereafter *only query* the database?

Since Postgres writes the postmaster.pid file to the data directory (which would be on the shared file system), the answer would appear to be no, since the underutilized machine cannot write any files to the shared file system. Would it be possible to write the postmaster.pid file to the local file system on the underutilized machine even though the data directory is on the shared file system? I realize that this seems like a bad idea - given that the purpose of the postmaster.pid file, as I understand it, is to prevent more than one postmaster running in a data directory - but I wanted to ask whether this is a possibility.

Even if it were possible to write the postmaster.pid to the local file system on the underutilized machine, does Postgres write other temporary files even if only SELECT statements are being executed against the database? And where does it write those files - in subdirectories of the data directory?

Thank you,
Janet
Re: [GENERAL] change location of postmaster.pid file?
Hi. Thanks for the quick and definitive answers to my questions. The information you provided will save me from wasting time and energy trying to see how far I could get otherwise. Thanks very much.

Janet

Tom Lane wrote:
> Janet Jacobsen writes:
>> Is it possible to create a database cluster on a machine that
>> has write access to the shared file system, shut down the
>> Postgres server on that machine, and then start up the
>> Postgres server on the machine that cannot write to the
>> shared file system, and thereafter, *only query* the database.
>
> No. The pid file is only the first and smallest problem you'd run into
> with a read-only database filesystem.
>
> regards, tom lane

> On Fri, Jul 10, 2009 at 11:17 PM, Janet Jacobsen wrote:
>> Is it possible to create a database cluster on a machine that
>> has write access to the shared file system, shut down the
>> Postgres server on that machine, and then start up the
>> Postgres server on the machine that cannot write to the
>> shared file system, and thereafter, *only query* the database.
>
> Postgres isn't really designed to work this way. It expects to have
> write access and will occasionally still write stuff to disk even for
> read-only queries.
>
> It won't work even a little bit before 8.3. For 8.3 or later you could
> maybe make it work using vacuum freeze but there's no facility to
> verify that it's really frozen everything and you'll still be taken by
> surprise by queries which try to use temporary space for large sorts
> or commands which start transactions that you didn't realize were
> necessary.
>
> --
> greg
> http://mit.edu/~gsstark/resume.pdf
[GENERAL] general question on two-partition table
Hi. We have a table with 30 M records that is growing by about 100 K records per day. The experimentalists, whose data are in the table, have decided that they will focus on the records for which the value of one field, rbscore, is greater than a cut-off. However, they want to continue to store all of the data - even the records for which rbscore is less than the cutoff - in the database. For the current table, about 400 K (out of 30 M) records meet the 'good' criteria.

Each record in the table has about 40 fields, and the experimentalists do in fact write queries that use many of those fields, some more than others. (They are building a model and have not pinned down exactly which fields are more important than others, which is why they store and query by so many fields.)

If they are going to spend 95% of their time querying the records that meet the 'good' criteria, what are good strategies for ensuring good performance for those queries?
(1) Should I partition the table into two partitions based on the value of rbscore?
(2) Should I create two separate tables?

Are (1) and (2) more or less equivalent in terms of performance? I think that partitioning the table is the more flexible option (i.e., if the cutoff value changes, there is no need to change the name of the table being queried, etc.), but would (2) give better performance, given that 95% of their queries are for rbscores greater than a threshold value?

Can you suggest other strategies?

Thank you,
Janet
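[One further option, raised later in the thread, is a partial index covering only the 'good' rows. A minimal sketch - the table name `candidates` and the cutoff 0.15 are hypothetical stand-ins, since the real names aren't given:

```sql
-- index only the ~400 K 'good' rows out of 30 M; the planner can use
-- this index for any query whose WHERE clause implies rbscore > 0.15
CREATE INDEX candidates_good_idx
    ON candidates (rbscore)
    WHERE rbscore > 0.15;
```

Because the index covers only ~1.3% of the table, it stays small and cheap to maintain even as the full table grows.]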
Re: [GENERAL] general question on two-partition table
Hi. Thanks for your reply. I looked at the documentation for partitions - it is the case, right, that I have to create the master table and the two partition tables (depending on the value of rbscore) and then copy the records from the existing table into the two partitions?

Stephen Frost wrote:
> * Janet Jacobsen (jsjacob...@lbl.gov) wrote:
>> If they are going to spend 95% of their time querying the
>> records that meet the 'good' criteria, what are the good
>> strategies for ensuring good performance for those queries?
>> (1) Should I partition the table into two partitions based on
>> the value of rbscore?
>> (2) Should I create two separate tables?
>>
>> Are (1) and (2) more or less equivalent in terms of
>> performance?
>
> It's not clear to me what you plan here.. How would you handle (2) for
> the users? Would you construct a view across them, or expect them to
> query the right table(s)? Options, as I see them, and in the order of
> 'best-to-worst' wrt user friendliness and performance, I believe, are:
>
> 1- Partitioning (with CHECK constraints and constraint_exclusion)
> 2- View across two tables (with appropriate WHERE clauses)
> 3- Functional index (as suggested by someone else)
> 4- Separate tables (users have to figure out how to use them)
> 5- Single table with everything
>
> My recommendation would be #1, followed by #2. Be sure to look up how
> to do partitioning by using inheritance in PG, and, if you need to,
> look at how to implement a trigger to handle inserts on the parent
> table. Make sure you create your CHECK() constraints correctly, and
> that you have constraint_exclusion enabled, and that it *works*.

Is there some reason why you emphasized "works"? Would I use EXPLAIN to verify that the constraints are being checked?
>> I think that partitioning the table is a more flexible option
>> (i.e., what if the cutoff value changes, no need to change
>> the name of the table being queried, etc.), but would (2)
>> give better performance given that 95% of their queries
>> are for rbscores greater than a threshold value?
>
> If you have your partitioning set up correctly, I don't believe having
> actual separate tables would be that much of a performance gain for
> queries. It would help some with inserts, though: if you know which
> table to insert into, you can insert directly into that child rather
> than into the parent via a trigger.
>
> Enjoy,
>
> Stephen

Since the insert would be based on the value of rbscore, and there is only one cut-off value, I could skip using a trigger.

At this point it seems like it would be fairly easy to try both the partial index on rbscore and partitioning to see whether there are performance differences. Does anyone have experience with performance differences between using a partial index and partitions? (Or is it not so usual to have a two-partition table?) Also, does anyone have experience with the performance of partitions as the size of the table increases, say from 10 million rows to 10 billion rows?

Janet
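[For reference, the inheritance-based partitioning Stephen describes looks roughly like the sketch below. The table name `candidates`, its columns, and the cutoff 0.15 are hypothetical stand-ins; the EXPLAIN at the end is the way to check that constraint exclusion really "works":

```sql
-- parent table holds no rows itself; queries go through it
CREATE TABLE candidates (
    id      integer,
    rbscore real
    -- ... the other ~40 columns ...
);

-- two children, with CHECK constraints the planner can reason about
CREATE TABLE candidates_good  (CHECK (rbscore >  0.15)) INHERITS (candidates);
CREATE TABLE candidates_other (CHECK (rbscore <= 0.15)) INHERITS (candidates);

-- must be enabled for the planner to skip non-matching children
SET constraint_exclusion = on;

-- verify: this plan should scan candidates_good only, never candidates_other
EXPLAIN SELECT * FROM candidates WHERE rbscore > 0.3;
```
]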
[GENERAL] questions on (parallel) COPY and when to REINDEX
Hi. We are running a data processing/analysis pipeline that writes about 100 K records to two tables on a daily basis. The pipeline runs from about 6:00 a.m. to 10:00 a.m. Our user base is small - about five people. Each accesses the database in a different way (generally using some script, either Perl or Python). Some people begin querying the database as soon as the new data/analysis results start being loaded. Others wait until the day's run is complete, so the number of concurrent users is small at this time.

The data/analysis results are loaded into two tables from two files of 200 to 1,000 rows each using the COPY command, which is executed from a Perl script that uses DBD::Pg.

Other details: Postgres 8.3.7 running on a Linux system with eight processors. Both of the big tables (now up to > 15 M rows each) have indexes on several of the columns. The indexes were created using CREATE INDEX CONCURRENTLY. Both tables have one or two foreign key constraints.

My questions are:
(1) At the point that the data are being loaded into the tables, are the new data indexed?
(2) Should I REINDEX these two tables daily after the pipeline completes? Is this what other people do in practice?
(3) Currently the pipeline executes in serial fashion. We'd like to cut the wall clock time down as much as possible. The data processing and data analysis can be done in parallel, but can the loading of the database be done in parallel, i.e., can I execute four parallel COPY commands from four copies of a script? Our initial attempt at doing this failed. I found one posting in the archives about parallel COPY, but it doesn't seem to be quite on point.
(4) Does COPY lock the table? Do I need to explicitly LOCK the table before the COPY command? Does LOCK even apply to using COPY? If I used table locking, would parallel COPY work?
(5) If I drop the indexes and foreign key constraints, then is it possible to COPY to a table from more than one script, i.e., do parallel COPY?
It seems like a really bad idea to drop those foreign key constraints. Should we re-think where our database loading fits into the overall pipeline, i.e., do the data processing and analysis in parallel, but keep the data loading sequential? The reason for not doing all of the data loading at the end is that some of the users *really* want to start querying the data and analysis results as soon as they enter the database.

Looking forward to your replies.
Janet
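[On questions (3) and (4): COPY FROM takes only an ordinary row-exclusive lock on the target table, not a full table lock, so several sessions can COPY into the same table concurrently; no explicit LOCK is needed. The key requirement is that each COPY runs on its own connection. A sketch, with hypothetical file and table names:

```sql
-- session 1 (its own connection):
COPY observations FROM '/data/batch_01.csv' WITH CSV;

-- session 2 (a separate connection), at the same time:
COPY observations FROM '/data/batch_02.csv' WITH CSV;
```

Indexes and foreign keys are maintained row by row as COPY inserts, which answers (1) as well: the new rows are indexed immediately.]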
Re: [GENERAL] questions on (parallel) COPY and when to REINDEX
Thanks for your reply. Responses below, and one follow-up question about when/how often to use analyze.

Janet

On 02/08/2009 05:09 a.m., Alban Hertroys wrote:
> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>> My questions are: (2) Should I REINDEX these two tables daily after
>> the pipeline completes? Is this what other people do in practice?
>
> No need, but as people are querying as soon as data is arriving, an
> analyse of the table you just copied to should help performance - the
> query plans will be derived from more accurate table statistics that
> way.

The files that are being copied into the table(s) are between 200 to 1,000 rows each, but there are hundreds of these small files every day. Would you recommend running analyze after every COPY?

>> (3) Currently the pipeline executes in serial fashion. We'd like to
>> cut the wall clock time down as much as possible. The data processing
>> and data analysis can be done in parallel, but can the loading of the
>> database be done in parallel, i.e., can I execute four parallel COPY
>> commands from four copies of a script? Our initial attempt at doing
>> this failed. I found one posting in the archives about parallel COPY,
>> but it doesn't seem to be quite on point.
>
> As long as you're not using the same database-session in parallel you
> should be fine. You can't do parallel database operations in the same
> session. Last time I did something similar I used separate database
> connections.

Yes, they are separate database connections because the connection is made in the data loading script.

>> (5) If I drop the indexes and foreign key constraints, then is it
>> possible to COPY to a table from more than one script, i.e., do
>> parallel COPY? It seems like a really bad idea to drop those foreign
>> key constraints.
>
> You can COPY in parallel, but having no FK's does help
> insert-performance. In that case whether you should or shouldn't
> remove your FK's depends on what's more important to you; insert
> performance or data correctness.
> As some of your users query the data while it's still coming in I
> guess that data correctness is in fact more important to you and you
> should keep the FK's.

Yes, I think we should keep the FK's.

> You wrote that your pipeline runs for a period of 4 hours and the
> table is about 15M rows now. What is taking up all that time? I
> understand why you'd want to parallelise that process, but where do
> you expect the gains?

We're processing images, and the data processing and analysis takes up most of the time, but the images can be processed/analyzed in parallel. We've been doing all of the data loading at the end - one COPY at a time. Originally that made sense because the researchers wanted to check the images before loading the data/analysis results into the db. But now we want to load the data/analysis results as soon as they are available, so if there are four copies of the processing/analysis script running, we want to have the data loading script initiated at the end of each processing/analysis script, which is equivalent to four COPYs writing four different files into the same table. In terms of time, we will not see a huge decrease in the wall clock time, but the data/analysis results will be available to the users much sooner, and this matters a lot to the researchers.

> From the above I'm guessing that part of the problem you want to
> solve by parallelising is insert-performance. In cases like these I've
> seen people with problems with the look-up speed of foreign keys
> because the database chose a bad query plan. Regular analyses during
> inserting data should prevent that.

Okay, so I can try this.

>> Should we re-think where our database loading fits into the overall
>> pipeline, i.e., do the data processing and analysis in parallel, but
>> keep the data loading sequential? The reason for not doing all of the
>> data loading at the end is that some of the users *really* want to
>> start querying the data and analysis results as soon as they enter
>> the database.
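[The "regular analyses during inserting" advice above amounts to one statement per table; a sketch with hypothetical table names:

```sql
-- ANALYZE only refreshes planner statistics, so it is cheap compared
-- to VACUUM; running it once per sizeable batch (not after every small
-- file) keeps the FK-lookup query plans accurate while data streams in
ANALYZE observations;
ANALYZE detections;
```
]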
> You seem to have a design where the database grows indefinitely? Your
> database doesn't look to be very big yet, but I expect that at some
> point in the future your data will grow so much that the ordering of
> it on disk starts to matter for how fast records can be fetched.
> That's a typical scenario where people here start to advise using
> table partitioning.
>
> The thing is though that partitioning only works well if the queries
> your users perform contain clauses of which the database knows they
> divide the data (the same clauses used to partition the table). The
> people you work with apparently are knowledgeable enough that they
> create their own queries. They will have to take partition constraints
> into account too if you choose to use table partitioning.

Yes, there are a couple of tables that it seems will grow indefinitely. We are not in production yet, so there is some questio
Re: [GENERAL] questions on (parallel) COPY and when to REINDEX
Hi, Andy. Thanks for the responses and information. Just to let you know... what we are storing in the db are the image attributes - about 40 of them - not the images. So the COPY is reading an ascii file of the image attributes. It turns out to be useful to have the image attributes handy - much better than reading the image headers. The images are available on spinning disk, and the image locations are in the db.

Thanks,
Janet

On 02/08/2009 05:59 p.m., Andy Colson wrote:
>>> On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:
>>>> My questions are: (2) Should I REINDEX these two tables daily
>>>> after the pipeline completes? Is this what other people do in
>>>> practice?
>>>
>>> No need, but as people are querying as soon as data is arriving, an
>>> analyse of the table you just copied to should help performance -
>>> the query plans will be derived from more accurate table statistics
>>> that way.
>>
>> The files that are being copied into the table(s) are between 200 to
>> 1,000 rows each, but there are hundreds of these small files every
>> day. Would you recommend running analyze after every COPY?
>
> Just once, at the end. Assuming you are only deleting 100k records
> and re-adding/updating another 100k in a batch, that's not so many
> records it'll throw the stats out of whack. (If you were
> dropping/updating 15M in a batch you might wanna analyze in the
> middle.)
>
>>> You wrote that your pipeline runs for a period of 4 hours and the
>>> table is about 15M rows now. What is taking up all that time? I
>>> understand why you'd want to parallelise that process, but where do
>>> you expect the gains?
>>
>> We're processing images, and the data processing and analysis takes
>> up most of the time, but the images can be processed/analyzed in
>> parallel. We've been doing all of the data loading at the end - one
>> COPY at a time. Originally that made sense because the researchers
>> wanted to check the images before loading the data/analysis results
>> into the db.
>
> Ah! Images! When you are doing the COPY are you escaping the data?
> You can't "just" copy a binary file.
-Andy
[GENERAL] plperl function
Hi. I'm trying to write a plperl function that returns a list of ids that I want to use in a subquery. The function call would look like:

select * from mlist( 168.4, 55.2, 0.1 );

and would return a list of integers. I've written this function, and it returns the right list of integers, but when I use it as a subquery, the query hangs (if I use a return type of setof integer) or gives an error message (if I use a return type of integer[]).

I want to use "select * from mlist( 168.4, 55.2, 0.1 )" in something like

select id from ctable where cmid in ( select * from mlist( 168.4, 55.2, 0.1 ) );

or

select id from ctable where cmid = ANY ( select * from mlist( 168.4, 55.2, 0.1 ) );

cmid is an integer.

If I do

explain select id from ctable where cmid in ( 102185, 102186, 102187 );

(where I've hard-coded the integers), I get

                               QUERY PLAN
-------------------------------------------------------------------------
 Bitmap Heap Scan on ctable  (cost=2293.67..271604.69 rows=77653 width=8)
   Recheck Cond: (cmid = ANY ('{102185,102186,102187}'::integer[]))
   ->  Bitmap Index Scan on ctable_cmid_index  (cost=0.00..2274.26 rows=77653 width=0)
         Index Cond: (cmid = ANY ('{102185,102186,102187}'::integer[]))
(4 rows)

First I tried using the return type setof integer, but when I execute

select id from ctable where cmid in ( select * from mlist( 168.4, 55.2, 0.1 ) );

the query just seems to hang (minutes go by), and eventually I hit Ctrl-c. The response time for

select id from ctable where cmid in ( 102185, 102186, 102187 );

(where I've hard-coded the integers) is very fast (< 1 s).

The explain above gave me the idea to try a return type of integer[], but then I get the error message:

ERROR: operator does not exist: integer = integer[]
HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts.

I also tried a return type of text and tried to cast it to integer[] like in the explain, but got a syntax error.

What return type should I be using?
Is there anything wrong with using a plperl function to generate a list of integers to use in a subquery? I'd appreciate any suggestions, help with syntax, sample plperl functions, etc.

Thanks,
Janet
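[For reference, SETOF integer is the appropriate return type for this use. A minimal sketch of such a function - the inner query, the table name `mtable`, and the box-matching logic are hypothetical stand-ins, since the original function isn't shown:

```sql
CREATE OR REPLACE FUNCTION mlist(float8, float8, float8)
RETURNS SETOF integer AS $$
    my ($ra, $dec, $radius) = @_;
    # hypothetical lookup; spi_exec_query is PL/Perl's query interface
    my $rv = spi_exec_query(sprintf(
        "SELECT id FROM mtable WHERE ra BETWEEN %f AND %f AND dec BETWEEN %f AND %f",
        $ra - $radius, $ra + $radius, $dec - $radius, $dec + $radius));
    foreach my $row (@{ $rv->{rows} }) {
        return_next($row->{id});    # stream one integer per row
    }
    return undef;
$$ LANGUAGE plperl STABLE;   -- STABLE (not the default VOLATILE) tells the
                             -- planner the result is fixed within a query
```

With that in place the intended subquery form works as written:

select id from ctable where cmid in ( select * from mlist( 168.4, 55.2, 0.1 ) );

Running EXPLAIN on this form is the way to see whether the planner is using ctable_cmid_index or falling back to a sequential scan, which would explain the hang.]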
[GENERAL] best way to check pgdump
Hi. What is the best way to check a pgdump without doing a restore? Thanks, Janet
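[One quick sanity check - a sketch assuming the dump was made with pg_dump's custom format (-Fc), and a hypothetical file name - is to list the archive's table of contents, which reads the archive index without restoring anything:

```shell
# prints the archive's table of contents; fails if the archive
# header or TOC is unreadable or damaged
pg_restore -l mydb.dump

# for a plain SQL dump, at least verify the file is readable text
head -n 20 mydb.sql
```

This confirms the dump file is structurally intact, though only an actual restore (e.g., into a scratch database) proves the data itself loads cleanly.]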
[GENERAL] calling vacuum from a script
Hi. Is it possible to vacuum a table (VACUUM FULL ANALYZE) from a script? Currently I run the Postgres client and then run vacuum, but I'd like to automate the vacuum by calling it from a (Perl) script.

Thanks,
Janet
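[Yes - any client that can send SQL can issue it; the one caveat is that VACUUM cannot run inside a transaction block, so the connection must be in autocommit mode. From a Perl script using DBD::Pg, that means $dbh->do('VACUUM FULL ANALYZE mytable') on a handle opened with AutoCommit => 1. A sketch of the simplest form, suitable for cron (database and table names hypothetical):

```shell
# psql -c runs the statement outside any transaction block,
# which VACUUM requires
psql -d mydb -c "VACUUM FULL ANALYZE mytable"
```
]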