[GENERAL] "could not read block 0... " error followed by "database does not exist"

2010-02-12 Thread Janet Jacobsen

Hi.  I have been running Postgres 8.2.7 on a Linux system for over
a year now with no problems.

Today one of the database users reported the following error:

   psql: FATAL:  could not read block 0 of relation 1664/0/1262: read
   only 0 of 8192 bytes

I tried stopping and restarting the Postgres server for the database.
From the logfile:

   LOG:  received smart shutdown request
   LOG:  autovacuum launcher shutting down
   LOG:  shutting down
   LOG:  database system is shut down
   LOG:  database system was shut down at 2010-02-12 17:15:37 PST
   LOG:  autovacuum launcher started
   LOG:  database system is ready to accept connections

But when I try to connect to the database, I get the error message:

   FATAL:  database "subptf" does not exist

I tried stopping/restarting the database several times.  I also
killed all user connections to the database.

How do I fix this problem?

Thanks,
Janet






[GENERAL] change location of postmaster.pid file?

2009-07-10 Thread Janet Jacobsen
Hi.  We are looking into the possibility of running a Postgres
server on an underutilized machine.  This machine has very
little local disk space, so we would have to create the data
directory on a shared file system.

The underutilized machine was set up so that it can *only
read* from the shared file system, i.e., cannot write to the
shared file system.

Is it possible to create a database cluster on a machine that
has write access to the shared file system, shut down the
Postgres server on that machine, and then start up the
Postgres server on the machine that cannot write to the
shared file system, and thereafter, *only query* the database?

Since Postgres writes the postmaster.pid file to the data
directory (which would be on the shared file system), the
answer would appear to be no, since the 'underutilized'
machine cannot write any files to the shared file system.

Would it be possible to write the postmaster.pid file to the
local file system on the 'underutilized' machine even though
the data directory is on the shared file system? 

I realize that this seems like a bad idea - given that the
purpose of the postmaster.pid file as I understand it is to
prevent more than one postmaster running in a data
directory - but I wanted to ask whether this is a possibility.

Even if it were possible to write the postmaster.pid to the
local file system on the 'underutilized' machine, does Postgres
write other temporary files even if only SELECT statements
are being executed against the database?  And where does it
write those files - in subdirectories of the data directory?

Thank you,
Janet



Re: [GENERAL] change location of postmaster.pid file?

2009-07-10 Thread Janet Jacobsen
Hi.  Thanks for the quick and definitive answers to my questions.
The information you provided will save me from wasting time and
energy trying to see how far I could get otherwise.  Thanks very much.

Janet


Tom Lane wrote:
> Janet Jacobsen  writes:
>   
>> Is it possible to create a database cluster on a machine that
>> has write access to the shared file system, shut down the
>> Postgres server on that machine, and then start up the
>> Postgres server on the machine that cannot write to the
>> shared file system, and thereafter, *only query* the database.
>> 
>
> No.  The pid file is only the first and smallest problem you'd run into
> with a read-only database filesystem.
>
>   regards, tom lane
>
>   
> On Fri, Jul 10, 2009 at 11:17 PM, Janet Jacobsen wrote:
>   
>> > Is it possible to create a database cluster on a machine that
>> > has write access to the shared file system, shut down the
>> > Postgres server on that machine, and then start up the
>> > Postgres server on the machine that cannot write to the
>> > shared file system, and thereafter, *only query* the database.
>> 
>
> Postgres isn't really designed to work this way. It expects to have
> write access and will occasionally still write stuff to disk even for
> read-only queries.
>
> It won't work even a little bit before 8.3. For 8.3 or later you could
> maybe make it work using vacuum freeze but there's no facility to
> verify that it's really frozen everything and you'll still be taken by
> surprise by queries which try to use temporary space for large sorts
> or commands which start transactions that you didn't realize were
> necessary.
>
> -- greg http://mit.edu/~gsstark/resume.pdf




[GENERAL] general question on two-partition table

2009-07-27 Thread Janet Jacobsen
Hi.  We have a table with 30 M records that is growing by
about 100 K records per day.

The experimentalists, whose data are in the table, have
decided that they will focus on the records for which the
value of one field, rbscore, is greater than a cut-off.
However, they want to continue to store all of the data
- even the records for which rbscore is less than the cutoff
- in the database.

For the current table, there are about 400 K (out of 30 M)
records that meet the 'good' criteria.

Each record in the table has about 40 fields, and the
experimentalists do in fact write queries that use many of
those fields, some more than others.  (They are building a
model and have not pinned down exactly which fields are
more important than others, so that's why they store and
query by so many fields.)

If they are going to spend 95% of their time querying the
records that meet the 'good' criteria, what are the good
strategies for ensuring good performance for those queries?
(1) Should I partition the table into two partitions based on
the value of rbscore?
(2) Should I create two separate tables?

Are (1) and (2) more or less equivalent in terms of
performance?

I think that partitioning the table is a more flexible option
(i.e., what if the cutoff value changes, no need to change
the name of the table being queried, etc.), but would (2)
give better performance given that 95% of their queries
are for rbscores greater than a threshold value?

Can you suggest other strategies?

Thank you,
Janet






Re: [GENERAL] general question on two-partition table

2009-07-28 Thread Janet Jacobsen
Hi.  Thanks for your reply. 

I looked at the documentation on partitioning - it is the case, right, that I
have to create the master table and the two partition tables (split on
the value of rbscore) and then copy the records from the existing
table into the two partitions?
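
Something like the following is what I have in mind - just a rough sketch
with made-up names (parent table "candidates", existing table
"candidates_old") and a made-up cutoff of 0.15:

   -- empty parent table with the same columns as the existing table
   CREATE TABLE candidates (LIKE candidates_old INCLUDING DEFAULTS);

   -- one child table per side of the cutoff
   CREATE TABLE candidates_good (CHECK (rbscore > 0.15))  INHERITS (candidates);
   CREATE TABLE candidates_rest (CHECK (rbscore <= 0.15)) INHERITS (candidates);

   -- copy the existing records into the matching partition
   INSERT INTO candidates_good SELECT * FROM candidates_old WHERE rbscore > 0.15;
   INSERT INTO candidates_rest SELECT * FROM candidates_old WHERE rbscore <= 0.15;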

Stephen Frost wrote:
> * Janet Jacobsen (jsjacob...@lbl.gov) wrote:
>   
>> If they are going to spend 95% of their time querying the
>> records that meet the 'good' criteria, what are the good
>> strategies for ensuring good performance for those queries?
>> (1) Should I partition the table into two partitions based on
>> the value of rbscore?
>> (2) Should I create two separate tables?
>>
>> Are (1) and (2) more or less equivalent in terms of
>> performance?
>> 
>
> It's not clear to me what you plan here..  How would you handle (2) for
> the users?  Would you construct a view across them, or expect them to
> query the right table(s)?  Options, as I see them, and in the order of
> 'best-to-worst' wrt user-friendliness and performance, I believe, are:
>
> 1- Partitioning (with CHECK constraints and constraint_exclusion)
> 2- View across two tables (with appropriate WHERE clauses)
> 3- Functional index (as suggested by someone else)
> 4- separate tables (users have to figure out how to use them)
> 5- single table with everything
>
> My recommendation would be #1, followed by #2.  Be sure to look up how
> to do partitioning by using inheritance in PG, and, if you need to,
> look at how to implement a trigger to handle inserts on the parent
> table.  Make sure you create your CHECK() constraints correctly, and
> that you have constraint_exclusion enabled, and that it *works*.
>   
Is there some reason why you emphasized "works"?  Would I use
EXPLAIN to verify that the constraints are being checked?
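
I'm guessing the check would look something like this (table name and
cutoff value are the same made-up ones as above):

   SET constraint_exclusion = on;
   EXPLAIN SELECT * FROM candidates WHERE rbscore > 0.15;
   -- if the CHECK constraints are right, the plan should scan only the
   -- "good" child table; if both children show up, exclusion isn't working
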
>> I think that partitioning the table is a more flexible option
>> (i.e., what if the cutoff value changes, no need to change
>> the name of the table being queried, etc.), but would (2)
>> give better performance given that 95% of their queries
>> are for rbscores greater than a threshold value?
>> 
>
> If you have your partitioning set up correctly, I don't believe having
> actual separate tables would be that much of a performance gain for
> queries.  It would help some with inserts, though: if you know which table
> to insert into, you could just insert into that child directly rather than
> into the parent via a trigger.
>
>   Enjoy,
>
>   Stephen
>   
Since the insert would be based on the value of rbscore, and there is
only one cut-off value, I could skip using a trigger. 

At this point it seems like it would be fairly easy to try both the partial
index on rbscore and partitioning to see whether there are performance
differences.
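
For the partial index I would be trying something along these lines (again
with the made-up table name and cutoff):

   CREATE INDEX CONCURRENTLY candidates_good_rbscore_idx
       ON candidates (rbscore)
       WHERE rbscore > 0.15;
   -- queries have to repeat the WHERE rbscore > 0.15 condition (or one the
   -- planner can prove implies it) for this index to be usable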

Does anyone have any experience with performance differences
between using a partial index and partitions?  (Or is it not so usual to
have a two-partition table?)

Also, does anyone have experience with the performance of partitions
as the size of the table increases, say from 10 million rows to 10 billion
rows?

Janet







[GENERAL] questions on (parallel) COPY and when to REINDEX

2009-08-01 Thread Janet Jacobsen

Hi.  We are running a data processing/analysis pipeline that
writes about 100K records to two tables on a daily basis.
The pipeline runs from about 6:00 a.m. to 10:00 a.m.

Our user base is small - about five people.  Each accesses
the database in a different way (generally using some script
- either Perl or Python).

Some people begin querying the database as soon as the new
data/analysis results start being loaded.  Others wait until the
day's run is complete, so the number of concurrent users is
small at this time.

The data/analysis results are loaded into two tables from two
files of 200 to 1,000 rows each using the COPY command,
which is executed from a Perl script that uses DBD-Pg.

Other details: Postgres 8.3.7 running on a Linux system
with eight processors.

Both of the big tables (now up to > 15 M rows each) have
indexes on several of the columns.  The indexes were
created using CREATE INDEX CONCURRENTLY...
Both tables have one or two foreign key constraints.

My questions are:
(1) At the point that the data are being loaded into the tables,
are the new data indexed?
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?
(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script?  Our initial attempt at doing this failed.  I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.
(4) Does COPY lock the table?  Do I need to explicitly
LOCK the table before the COPY command?  Does LOCK
even apply to using COPY?  If I used table locking, would
parallel COPY work?
(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.

Should we re-think where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential?  The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.
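
For reference, each loading script issues a COPY along these lines (the
table name here is made up), with the file contents fed down the
connection by the Perl script:

   COPY image_attrs FROM STDIN;
   -- as far as I can tell from the docs, COPY only takes a ROW EXCLUSIVE
   -- lock on the table, so several sessions should be able to COPY into
   -- the same table at once - but our first attempt at this failed,
   -- hence questions (3)-(5) above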

Looking forward to your replies.

Janet





Re: [GENERAL] questions on (parallel) COPY and when to REINDEX

2009-08-02 Thread Janet Jacobsen

Thanks for your reply.  Responses below, and one follow-up
question about when/how often to use analyze.

Janet


On 02/08/2009 05:09 a.m., Alban Hertroys wrote:

On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:

My questions are:
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?


No need, but as people are querying as soon as data is arriving, an 
analyse of the table you just copied to should help performance - the 
query plans will be derived from more accurate table statistics that way.


The files that are being copied into the table(s) are between 200 and 1,000
rows each, but there are hundreds of these small files every day.  Would you
recommend running analyze after every COPY?
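
That is, would a single statement like the following per batch of files
(hypothetical table name) be enough, or should it run after every COPY?

   ANALYZE image_attrs;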



(3) Currently the pipeline executes in serial fashion.  We'd
like to cut the wall clock time down as much as possible.
The data processing and data analysis can be done in parallel,
but can the loading of the database be done in parallel, i.e.,
can I execute four parallel COPY commands from four copies
of a script?  Our initial attempt at doing this failed.  I found one
posting in the archives about parallel COPY, but it doesn't seem
to be quite on point.


As long as you're not using the same database-session in parallel you 
should be fine. You can't do parallel database operations in the same 
session. Last time I did something similar I used separate database 
connections.


Yes, they are separate database connections because the connection
is made in the data loading script.



(5) If I drop the indexes and foreign key constraints, then is it
possible to COPY to a table from more than one script, i.e., do
parallel COPY?  It seems like a really bad idea to drop those
foreign key constraints.


You can COPY in parallel, but having no FK's does help 
insert-performance. In that case whether you should or shouldn't 
remove your FK's depends on what's more important to you; insert 
performance or data correctness.


As some of your users query the data while it's still coming in I 
guess that data correctness is in fact more important to you and you 
should keep the FK's.


Yes, I think we should keep the FK's.


You wrote that your pipeline runs for a period of 4 hours and the 
table is about 15M rows now. What is taking up all that time? I 
understand why you'd want to parallelise that process, but where do 
you expect the gains?



We're processing images, and the data processing and analysis takes
up most of the time, but the images can be processed/analyzed in parallel.
We've been doing all of the data loading at the end - one COPY at a
time.  Originally that made sense because the researchers wanted to
check the images before loading the data/analysis results into the db.

But now we want to load the data/analysis results as soon as they are
available, so if there are four copies of the processing/analysis script
running, we want to have the data loading script initiated at the end of
each processing/analysis script, which is equivalent to four COPYs
writing four different files into the same table.

In terms of time, we will not see a huge decrease in the wall clock time,
but the data/analysis results will be available to the users much sooner,
and this matters a lot to the researchers.


From the above I'm guessing that part of the problem you want to solve 
by parallelising is insert-performance. In cases like these I've seen 
people with problems with the look-up speed of foreign keys because 
the database chose a bad query plan. Regular analyses during inserting 
data should prevent that.


Okay, so I can try this.



Should we re-think where our database loading fits into the
overall pipeline, i.e., do the data processing and analysis in
parallel, but keep the data loading sequential?  The reason for
not doing all of the data loading at the end is that some of the
users *really* want to start querying the data and analysis
results as soon as they enter the database.


You seem to have a design where the database grows indefinitely? Your 
database doesn't look to be very big yet, but I expect that at some 
point in the future your data will grow so much that the ordering of 
it on disk starts to matter for how fast records can be fetched. 
That's a typical scenario where people here start to advise using 
table partitioning.


The thing is though that partitioning only works well if the queries 
your users perform contain clauses of which the database knows they 
divide the data (the same clauses used to partition the table).


The people you work with apparently are knowledgeable enough that they 
create their own queries. They will have to take partition constraints 
into account too if you choose to use table partitioning.




Yes, there are a couple of tables that it seems will grow indefinitely.
We are not in production yet, so there is some questio

Re: [GENERAL] questions on (parallel) COPY and when to REINDEX

2009-08-02 Thread Janet Jacobsen

Hi, Andy.  Thanks for the responses and information.

Just to let you know... what we are storing in the db are the image
attributes - about 40 of them - not the images.  So the COPY
is reading an ASCII file of the image attributes.  It turns out to be
useful to have the image attributes handy - much better than reading
the image headers. The images are available on spinning disk, and
the image locations are in the db.

Thanks,
Janet


On 02/08/2009 05:59 p.m., Andy Colson wrote:

On 1 Aug 2009, at 23:24, Janet Jacobsen wrote:

My questions are:
(2) Should I REINDEX these two tables daily after the pipeline
completes?  Is this what other people do in practice?


No need, but as people are querying as soon as data is arriving, an 
analyse of the table you just copied to should help performance - 
the query plans will be derived from more accurate table statistics 
that way.


The files that are being copied into the table(s) are between 200 and 1,000
rows each, but there are hundreds of these small files every day.  Would you
recommend running analyze after every COPY?


Just once, at the end.  Assuming you are only deleting 100k records 
and re-adding/updating another 100k in a batch.  That's not so many 
records it'll throw the stats out of whack.  (If you were 
dropping/updating 15M in a batch you might wanna analyze in the middle)




You wrote that your pipeline runs for a period of 4 hours and the 
table is about 15M rows now. What is taking up all that time? I 
understand why you'd want to parallelise that process, but where do 
you expect the gains?



We're processing images, and the data processing and analysis takes
up most of the time, but the images can be processed/analyzed in 
parallel.

We've been doing all of the data loading at the end - one COPY at a
time.  Originally that made sense because the researchers wanted to
check the images before loading the data/analysis results into the db.


Ah!  Images!  When you are doing the COPY, are you escaping the data?  
You can't "just" copy a binary file.


-Andy




[GENERAL] plperl function

2009-08-13 Thread Janet Jacobsen
Hi.  I'm trying to write a plperl function that returns a list of ids
that I want to use in a subquery. 

The function call would look like:

select * from mlist( 168.4, 55.2, 0.1);

and would return a list of integers.  I've written this function,
and it returns the right list of integers, but when I use it as a
subquery, the query hangs (if I use a return type of setof integer)
or gives an error message (if I use a return type of integer[]).
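
The function is roughly this shape - the names and the inner query here
are only placeholders, the real ones are longer:

   CREATE OR REPLACE FUNCTION mlist(float8, float8, float8)
   RETURNS SETOF integer AS $$
       my ($x, $y, $radius) = @_;
       # placeholder query - the real selection criteria go here
       my $rv = spi_exec_query("SELECT cmid FROM mtable "
                             . "WHERE x BETWEEN $x - $radius AND $x + $radius "
                             . "AND   y BETWEEN $y - $radius AND $y + $radius");
       foreach my $row (@{ $rv->{rows} }) {
           return_next($row->{cmid});
       }
       return undef;
   $$ LANGUAGE plperl;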

I want to use "select * from mlist( 168.4, 55.2, 0.1)" in something like

select id from ctable where cmid in ( select * from mlist( 168.4,
55.2, 0.1 ) );

or

select id from ctable where cmid = ANY ( select * from mlist( 168.4,
55.2, 0.1 ) );

cmid is an integer.

-

If I do

explain select id from ctable where cmid  in ( 102185, 102186,102187 );

(*where I've hard-coded the integers*), I get

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 Bitmap Heap Scan on ctable  (cost=2293.67..271604.69 rows=77653 width=8)
   Recheck Cond: (cmid = ANY ('{102185,102186,102187}'::integer[]))
   ->  Bitmap Index Scan on ctable_cmid_index  (cost=0.00..2274.26 rows=77653 width=0)
         Index Cond: (cmid = ANY ('{102185,102186,102187}'::integer[]))
(4 rows)

First I tried using the return type setof integer, but when I execute

select id from ctable where cmid in ( select * from mlist( 168.4,
55.2, 0.1 ) );

the query just seems to hang (minutes go by) and eventually I hit Ctrl-c.

The response time for

select id from ctable where cmid  in ( 102185, 102186,102187 );

(*where I've hard-coded the integers*), is very fast (< 1s).

The explain above gave me the idea to try a return type of integer[], but
then I get the error message,

ERROR:  operator does not exist: integer = integer[]
HINT:  No operator matches the given name and argument type(s).
You might need to add explicit type casts.

I also tried a return type of text and tried to cast it to integer[], as in
the EXPLAIN output above, but got a syntax error.
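
With the integer[] return type, I assume the call has to hand the array
straight to ANY rather than go through a subquery - something like this
(is that the right form?):

   select id from ctable where cmid = ANY ( mlist( 168.4, 55.2, 0.1 ) );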

-

What return type should I be using?  Is there anything wrong with using
a plperl function to generate a list of integers to use in a subquery?

I'd appreciate any suggestions, help with syntax, sample plperl
functions, etc.

Thanks,
Janet





[GENERAL] best way to check pgdump

2010-06-22 Thread Janet Jacobsen
Hi.  What is the best way to check a pgdump
without doing a restore?

Thanks,
Janet


[GENERAL] calling vacuum from a script

2010-06-22 Thread Janet Jacobsen
Hi.  Is it possible to vacuum a table (vacuum full analyze)
from a script?  Currently I run the Postgres client and then
run vacuum, but I'd like to automate the vacuum by calling
it from a (Perl) script.

Thanks,
Janet