Re: [PERFORM] Performance Optimization for Dummies

2006-09-28 Thread Matthew Nuzum
Carlo Stonebanks wrote:
> Lots of great info here, I will see what applies to my situation. However, I 
> don't see bulk inserts of the tables working, because all of the tables need 
> to be refreshed as values to deduplicate and match will change with every 
> row added. In order for this to work, I would have to write queries against 
> the hash tables. This is where something like MySQL's in-memory tables would 
> have come in handy...
> 
> What is GDB?
> 
> Carlo

Sorry, meant GDBM (a disk-based hash/lookup table).

With Postgres if your tables fit into RAM then they are in-memory as
long as they're actively being used.

Hashtables and GDBM, as far as I know, are only useful for key->value
lookups. However, for this they are *fast*. If you can figure out a way
to make them work I'll bet things speed up.
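
For anyone wondering what that looks like in practice, here is a minimal
sketch using today's Python 3 dbm module as the GDBM-style store (the file
name and values are made up for illustration):

  import dbm

  db = dbm.open("agents.db", "c")    # "c" = create the file if it doesn't exist
  db["Mozilla/4.0"] = "17"           # key -> value; stored on disk as bytes
  agent_id = int(db["Mozilla/4.0"])  # values come back as bytes; int() accepts that
  db.close()
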
-- 
Matthew Nuzum
newz2000 on freenode



Re: [PERFORM] Performance Optimization for Dummies

2006-09-28 Thread Matthew Nuzum
Carlo Stonebanks wrote:
>> are you using the 'copy' interface?
> 
> Straightforward inserts - the import data has to be transformed, normalised and 
> de-duped by the import program. I imagine the copy interface is for more 
> straightforward data importing. These are - by necessity - single row 
> inserts.
> 

I know this is an answer to a question you didn't ask, but here it is. I
was once doing stuff where I processed log files and had to do many
lookups to normalize the data before insertion.

I started out doing everything in SQL, using PostgreSQL tables, and it
took a little over 24 hours to process 24 hours' worth of data. Like you,
it was a single process doing many lookups.

I found a better way. I rewrote it (once in C#, again in Python) and
used hashtables/dictionaries instead of tables for the lookup data. For
example, I'd start by loading the data into hash tables (yes, this took
a *lot* of RAM) and then for each row I did something like:
 1. Is it in the hash table?
    a. If not, insert it into the db
    b. Then insert it into the hash table
 2. Get the lookup field out of the hash table
 3. Output normalized data

This allowed me to create text files containing the data in COPY format,
which could then be loaded into the database at dramatically increased
speeds.
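
To make that concrete, here is a minimal sketch of the loop in today's
Python 3 -- not my actual code; the table/column names, the psycopg2
connection string and the parsed_log_rows() helper are assumptions for
illustration only:

  import psycopg2

  conn = psycopg2.connect("dbname=weblogs")            # assumed DSN
  cur = conn.cursor()

  # Load the existing lookup table into an in-memory hash table.
  cur.execute("SELECT agent, id FROM user_agent")
  agents = dict(cur.fetchall())

  out = open("hits.copy", "w")
  for atime, agent, url in parsed_log_rows():          # hypothetical log parser
      if agent not in agents:                          # 1. not seen before?
          cur.execute("INSERT INTO user_agent (agent) VALUES (%s)", (agent,))
          cur.execute("SELECT id FROM user_agent WHERE agent = %s", (agent,))
          agents[agent] = cur.fetchone()[0]            #    remember the new id
      # 2./3. write the normalized row in COPY text format (tab separated);
      # real code must also escape tabs, newlines and backslashes in the fields.
      out.write("%s\t%d\t%s\n" % (atime, agents[agent], url))
  out.close()
  conn.commit()

  # The file can then be loaded in one shot, e.g. from psql:
  #   \copy hits (atime, agent_id, url) from 'hits.copy'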

My first version in C# (Mono) cut the time down to 6 hours for 24 hours'
worth of data. I tweaked the algorithms and rewrote it in Python and got
it down to 45 min. (Python can't take all the credit for the performance
boost; I used an improved technique that could have been done in C# as
well.) This time included the time needed to do the COPY and update the
indexes.

I created a version that also used GDBM databases instead of hash tables.
It increased the time from 45 min to a little over an hour, but decreased
the memory usage to something like 45MB (vs. dozens or hundreds of MB per
hashtable).
-- 
Matthew Nuzum
newz2000 on freenode



Re: [PERFORM] Sun Fire T2000 and PostgreSQL 8.1.3

2006-04-06 Thread Matthew Nuzum
On 4/6/06, Juan Casero (FL FLC) <[EMAIL PROTECTED]> wrote:
> Because I plan to develop a rather large (for us anyway) data warehouse
> with PostgreSQL.  I am looking for the right hardware that can handle
> queries on a database that might grow to over 100 gigabytes.

You need to look for a server that has fast I/O. 100 GB of data will
take a long time to scan through and won't fit in RAM.

> Right
> now our decision support system based on postgresql 8.1.3 stores retail
> sales information for about 4 four years back *but* only as weekly
> summaries.  I want to build the system so it can handle daily sales
> transactions also.  You can imagine how many more records this will
> involve so I am looking for hardware that can give me the performance I
> need to make this project useable.

Sounds like you need to be doing a few heavy queries when you do this,
not tons of small queries. That likely means you need fewer, faster
CPUs.

> In other words parsing and loading
> the daily transaction logs for our stores is likely to take huge amounts
> of effort.  I need a machine that can complete the task in a reasonable
> amount of time.

See my previous comment

> As people start to query the database to find sales
> related reports and information I need to make sure the queries will run
> reasonably fast for them.

Get more than one CPU core and make sure you have a lot of drive
spindles. You will definitely want to be able to ensure a long-running
query doesn't hog your I/O system. I have a server with a single disk,
and when we do a long query the server load will jump from about .2 to
10 until the long query finishes. More CPUs won't help this because
the bottleneck is the disk.

>  I have already hand optimized all of my
> queries on the current system.  But currently I only have weekly sales
> summaries.  Other divisions in our company have done a similar project
> using MS SQL Server on SMP hardware far outclassing the database server
> I currently use and they report heavy loads on the server with less than
> ideal query run times.  I am sure I can do my part to optimize the
> queries once I start this project but there is only so much you can do.
> At some point you just need more powerful hardware.  This is where I am
> at right now.

You say "this is where I am at right __now__" but where will you be in
9 months? Sounds like you will be i/o bound by the time you get above
10GB.

> Apart from that since I will only get this one chance to
> buy a new server for data processing I need to make sure that I buy
> something that can grow over time as our needs change.  I don't want to
> buy a server only to find out later that it cannot meet our needs with
> future database projects.  I have to balance a limited budget, room for
> future performance growth, and current system requirements.  Trust me it
> isn't easy.

Isn't it about time we had our annual "what kind of server can I get
for $8k" thread?

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] 1 TB of memory

2006-03-16 Thread Matthew Nuzum
On 3/16/06, Jim Nasby <[EMAIL PROTECTED]> wrote:
> PostgreSQL tuned to the max and still too slow? Database too big to
> fit into memory? Here's the solution! http://www.superssd.com/
> products/tera-ramsan/
>
> Anyone purchasing one will be expected to post benchmarks! :)

Pricing is tight-lipped, but searching shows about $1.85/MB, which is
close to $500,000 for 250GB. One report says a person paid $219,000 for
32GB, and 1TB costs "well over $1,000,000."

But they "guarantee the performance."

Too rich for me.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] help needed asap....

2006-03-12 Thread Matthew Nuzum
On 12 Mar 2006 11:46:25 -, Phadnis <[EMAIL PROTECTED]> wrote:
>  Hi.
>
>  I am new to Postgres and I need help from you. I hope I get a positive 
> response, though my questions might seem silly to you...
>
>  I am working on Postgres. I have around 100,000 (one lakh) records in 
> almost 12 tables.
>  1) When I try to query for a count or anything else, it takes a long time 
> to return the result. How can I avoid this?
>
>  2) I also want to know how to increase the performance (I do vacuum once 
> a day).
>

These two questions are applicable to this list... your other
questions may get quicker responses on the users list.

However, you haven't provided enough information for anyone here to
help. Here's what you should do:

Find queries that you think should be faster than they are. For
example, if your query is "Select count(*) from foo" you can get
important performance information about the query by running:
EXPLAIN ANALYZE select count(*) from foo

Send the details of the query, including the EXPLAIN ANALYZE output
(which looks pretty meaningless until you've learned what to look for),
to the list with a detailed question.

Also, for general performance hints, tell the list what your setup is
and which settings you've tweaked (and maybe why).

Generally, be as generous with your details as you can. Also, have you
googled around for hints? Here's a good website with information:
http://www.varlena.com/varlena/GeneralBits/Tidbits/index.php
Notice there's a section on performance tips.

Also, this list works because volunteers who have knowledge and free
time choose to help  when they can. If you really need answers ASAP,
there are a few organizations who provide paid support.

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Process Time X200

2006-03-10 Thread Matthew Nuzum
On 3/10/06, NbForYou <[EMAIL PROTECTED]> wrote:
> Hey Michael, you sure know your stuff!
>
> Versions:
>
> PostgreSQL 7.3.9-RH running on the webhost.
> PostgreSQL 8.0.3 running on my homeserver.
>
> So the only solution is to ask my webhost to upgrade its postgresql?
> The question is will he do that? After all a license fee is required for
> commercial use. And running a webhosting service is a commercial use.
>
> thanks for replying and going through the effort of creating the database
> and populating it.
>
> Nick
>

You can look at the explain analyze output of the query from pg 7.3,
figure out why the plan is bad and tweak your query to get optimum
performance.

Yes, I agree with the other statements that say, "upgrade to 7.4 or
8.x if you can" but if you can't, then you can still work on it.

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Postgres on VPS - how much is enough?

2006-03-07 Thread Matthew Nuzum
On 3/7/06, Dave Page  wrote:
> On 7/3/06 18:45, "Jim C. Nasby" <[EMAIL PROTECTED]> wrote:
> > On Mon, Mar 06, 2006 at 01:14:45PM -0400, Marc G. Fournier wrote:
> >> We host VPSs here (http://www.hub.org) and don't use the 'single file,
> >> virtual file system' to put them into ... it must depend on where you
> >> host?
> >
> > Yeah, but aren't you also using FreeBSD jails? AFAIK linux doesn't have
> > an equivalent to jail; all their VPS stuff actually brings up a
> > full-blown copy of linux, kernel and all.
>
> No, linux vserver is equivalent to a jail - and they work superbly imho.
> developer.pgadmin.org is just one such VM that I run.
>
> http://www.linux-vserver.org/
>
> Regards, Dave.

I can confirm this. I've been using linux-vserver for years. It is a
very up-to-date and active project that is extremely responsive and
helpful to users of all experience levels.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Postgres on VPS - how much is enough?

2006-03-06 Thread Matthew Nuzum
On 3/6/06, Marc G. Fournier <[EMAIL PROTECTED]> wrote:
> On Mon, 6 Mar 2006, Matthew Nuzum wrote:
> > My problem with running PG inside of a VPS was that the VPS used a
> > virtual filesystem... basically, a single file that had been formatted
> > and loop mounted so that it looked like a regular hard drive.
> > Unfortunately, it was very slow. The difference between my application
> > and yours is that mine well more than filled the 1GB of RAM that I had
> > allocated. If your data will fit comfortably into RAM then you may be
> > fine.
>
> We host VPSs here (http://www.hub.org) and don't use the 'single file,
> virtual file system' to put them into ... it must depend on where you
> host?

That's true... I hope I didn't imply that I am anti-VPS; I run my own
servers and one of them is dedicated to providing VPSs for different
applications. I think they're wonderful.


On 3/6/06, Nagita Karunaratne <[EMAIL PROTECTED]> wrote:
> From personal experience, would you run Postgres on a Linux machine
> (NOT a VPS) with 512MB of RAM?
>
> Assuming I can keep all my data in memory.

Nagita,

It all depends on performance... I have one Postgres database that
runs on a 350MHz Pentium with 128MB of RAM. It does 1 insert per
minute, 24 hours per day. Because the load is so low, I can get away
with minimal hardware.

If your application has a lot of inserts/updates then disk speed is
important, and it can vary greatly from one VPS to another.

If your application is not time-critical then this may be a moot point anyway.

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Postgres on VPS - how much is enough?

2006-03-06 Thread Matthew Nuzum
On 3/6/06, Nagita Karunaratne <[EMAIL PROTECTED]> wrote:
> How big a VPS would I need to run a Postgres DB.
>

> One application will add about 500 orders per day
> Another will  access this data to create and send about 500 emails per day
> A third will access this data to create an after-sales survey for at
> most 500 times per day.
>
> What type of VPS would I need to run a database with this type of load?
> Is 128 MB ram enough?
> What percentage of a 2.8 GHz CPU would be required?

My problem with running PG inside of a VPS was that the VPS used a
virtual filesystem... basically, a single file that had been formatted
and loop mounted so that it looked like a regular hard drive.
Unfortunately, it was very slow. The difference between my application
and yours is that mine well more than filled the 1GB of RAM that I had
allocated. If your data will fit comfortably into RAM then you may be
fine.

If you really want to know how it will work, try running it yourself.
Two projects that make this really easy (and free) are the coLinux
project[1], which allows you to run a Linux VPS in Windows, and the
linux-vserver project[2], which is free software that works on pretty
much any Linux OS.

Try it out and tinker with the values; that way you won't have to
guess when making your purchase decision.

[1] http://www.colinux.org/ Cooperative Linux
[2] http://linux-vserver.org/ Linux-vserver project

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Large Database Design Help

2006-02-09 Thread Matthew Nuzum
On 2/9/06, Orion Henry <[EMAIL PROTECTED]> wrote:
>
> Hello All,
>
> I've inherited a postgresql database that I would like to refactor.   It
> was originally designed for Postgres 7.0 on a PIII 500MHz and some 
> design decisions were made that don't make sense any more.  Here's the
> problem:
>
> 1) The database is very large, the largest table has 40 million tuples.
>
> 2) The database needs to import 10's of thousands of tuples each night
> quickly.  The current method is VERY slow.
>
> 3) I can't import new records with a COPY or drop my indexes b/c some of
> them are new records (INSERTS) and some are altered records (UPDATES)
> and the only way I can think of to identify these records is to perform
> a select for each record.

 [snip]
>
> 3) The current code that bulk loads data into the database is a loop
> that looks like this:
>
> $result = exe("INSERT INTO $table ($name_str) SELECT
> $val_str WHERE NOT EXISTS (SELECT 1 FROM $table WHERE $keys)");
> if ($result == 0)
> {
> $result = exe("UPDATE $table SET $non_keys WHERE
> $keys");
> }
>
> Is there a faster way to bulk load data when it's not known ahead of
> time if it's a new record or an updated record?

I experimented with something like this and was able to progressively
decrease the amount of time needed for the import. The final solution,
which took my import down from approximately 24 hours to about 30 min,
was to use a C#/Java-style hashtable or a Python dictionary. For example,
the unique data in one particular table was "User_Agent", so I made it
the key in my hashtable. I actually added a method to the hashtable so
that when I added a new record to the hashtable it would do the insert
into the db.
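
Roughly, that idea looks like this in today's Python 3 (a sketch only, not
my original code; the psycopg2-style cursor and the table/column names are
placeholders):

  class LookupCache(dict):
      """Keyed by the unique value (e.g. User_Agent); a miss inserts the
      value into the lookup table and caches the newly assigned id."""
      def __init__(self, cur, table, column):
          super().__init__()
          self.cur, self.table, self.column = cur, table, column
          # table/column are trusted constants here, not user input
          cur.execute("SELECT %s, id FROM %s" % (column, table))
          self.update(cur.fetchall())

      def __missing__(self, key):
          self.cur.execute("INSERT INTO %s (%s) VALUES (%%s)"
                           % (self.table, self.column), (key,))
          self.cur.execute("SELECT id FROM %s WHERE %s = %%s"
                           % (self.table, self.column), (key,))
          self[key] = new_id = self.cur.fetchone()[0]
          return new_id

  # usage:  agents = LookupCache(cur, "user_agent", "agent")
  #         agent_id = agents[ua_string]   # inserts on first sight, cached afterwards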

The downside to this is that it used *GOBS* of RAM.

Using Python, I was able to dramatically decrease the RAM usage by
switching to a GDBM-based dictionary instead of the standard
dictionary. It only increased the time by about 50%, so the total
processing time was about 45 min vs. the previous 30 min.

I only had about 35 million records and my technique was getting to
the point where it was unwieldy, so with your 40 million and counting
records you would probably want to start with the GDBM technique unless
you have a ton of available RAM.

You might interpret this as a knock against PostgreSQL since I
pulled the data out of the db, but it's not; you'd be hard pressed to
find anything as fast as the in-memory hashtable or the on-disk GDBM.
However, their usefulness is very limited, and for anything more complex
than just key=>value lookups, moving to PostgreSQL is likely a big win.

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-16 Thread Matthew Nuzum
On 11/16/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote:
> If you have a cool SAN, it alerts you and removes all data off a disk
> _before_ it starts giving hard failures :-)
>
> /* Steinar */
> --
> Homepage: http://www.sesse.net/

Good point. I have avoided data loss *twice* this year by using SMART
hard drive monitoring software.

I can't tell you how good it feels to replace a drive that is about to
die, as compared to restoring data because a drive died.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Hardware/OS recommendations for large databases (

2005-11-16 Thread Matthew Nuzum
On 11/16/05, David Boreham <[EMAIL PROTECTED]> wrote:
>  >Spend a fortune on dual core CPUs and then buy crappy disks...  I bet
>  >for most applications this system will be IO bound, and you will see a
>  >nice lot of drive failures in the first year of operation with
>  >consumer grade drives.
>
> I guess I've never bought into the vendor story that there are
> two reliability grades. Why would they bother making two
> different kinds of bearing, motor etc ? Seems like it's more
> likely an excuse to justify higher prices. In my experience the
> expensive SCSI drives I own break frequently while the cheapo
> desktop drives just keep chunking along (modulo certain products
> that have a specific known reliability problem).
>
> I'd expect that a larger number of hotter drives will give a less reliable
> system than a smaller number of cooler ones.

Of all the SCSI and IDE drives I've used, and I've used a lot, there
is a definite difference in quality. The SCSI drives primarily use
higher quality components that are intended to last longer under 24/7
work loads. I've taken several SCSI and IDE drives apart and you can
tell from the guts that the SCSI drives are built with sturdier
components.

I haven't gotten my hands on the Raptor line of ATA drives yet, but
I've heard they share this in common with the SCSI drives - they are
built with components made to run day and night, for years on end,
without a break.

That doesn't mean they will last longer than IDE drives; it just
means they've been designed to withstand more heat and sustained
activity. I've got some IDE drives that have lasted for years and
I've got some IDE drives that have lasted months. However, the SCSI
drives I've had over the years have all lasted longer than the servers
they were installed in.

I will say that in the last 10 years the MTBF of IDE/ATA drives has
improved dramatically, so I regularly use them in servers; however, I
have also shifted my thinking so that a server should be replaced
after three years, where before I aimed for five.

It seems to me that the least reliable components in servers these
days are the fans.

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Help tuning postgres

2005-10-13 Thread Matthew Nuzum
On 10/13/05, Csaba Nagy <[EMAIL PROTECTED]> wrote:
> On Thu, 2005-10-13 at 14:40, Andrew Sullivan wrote:
> > On Thu, Oct 13, 2005 at 10:15:03AM +0200, Csaba Nagy wrote:
> > > And how would the analyze help in finding this out ? I thought it would
> > > only show me additionally the actual timings, not more detail in what
> > > was done...
> >
> > Yes, it shows the actual timings, and the actual number of rows.  But
> > if the estimates that the planner makes are wildly different than the
> > actual results, then you know your statistics are wrong, and that the
> > planner is going about things the wrong way.  ANALYSE is a big help.
> > There's also a verbose option to it, but it's usually less useful in
> > production situations.

This is the point I was trying to make. I've seen instances
where people have posted an EXPLAIN ANALYZE for a select/update to
the list and suggestions have arisen allowing major performance
improvements.

If this task is where your database is performing its worst then it is
the best place to start with optimizing, short of the obvious stuff,
which it sounds like you've covered.

Sometimes, and I think this has often been true for databases that are
either very large or very small, statistics can be tweaked to get
better performance. One good example is when a sequential scan is
being chosen when an index scan may be better; something like this
would definitely peg your disk I/O.

Throwing more hardware at your problem will definitely help, but I'm a
performance freak and I like to optimize everything to the max.
*Sometimes* you can get drastic improvements without adding any
hardware. I have seen some truly miraculous turnarounds from tweaking
non-obvious settings based on suggestions made on this list.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Help tuning postgres

2005-10-12 Thread Matthew Nuzum
On 10/12/05, Csaba Nagy <[EMAIL PROTECTED]> wrote:

> We have adapted our application (originally written for oracle) to
> postgres, and switched part of our business to a postgres data base.

> The data base has in the main tables around 150 million rows, the whole
> data set takes ~ 30G after the initial migration. After ~ a month of
> usage that bloated to ~ 100G. We installed autovacuum after ~ 2 weeks.
>
> The main table is heavily updated during the active periods of usage,
> which is coming in bursts.
>
> Now Oracle on the same hardware has no problems handling it (the load),
> but postgres comes to a crawl. Examining the pg_stats_activity table I
> see the updates on the main table as being the biggest problem, they are
> very slow. The table has a few indexes on it, I wonder if they are
> updated too on an update ? The index fields are not changing. In any
> case, I can't explain why the updates are so much slower on postgres.

I'm not the most experienced person on this list, but I've got some big
tables I work with. Doing an update on these big tables often involves
a sequential scan, which can be quite slow.

I would suggest posting the explain analyze output for one of your
slow updates. I'll bet it is much more revealing and takes out a lot
of the guesswork.

--
Matthew Nuzum
www.bearfruit.org



[PERFORM] Logarithmic change (decrease) in performance

2005-09-28 Thread Matthew Nuzum
Something interesting is going on. I wish I could show you the graphs,
but I'm sure this will not be a surprise to the seasoned veterans.

A particular application server I have has been running for over a
year now. I've been logging CPU load since mid-April.

It took 8 months or more to fall from excellent performance to
"acceptable." Then, over the course of about 5 weeks it fell from
"acceptable" to "so-so." Then, in the last four weeks it's gone from
"so-so" to alarming.

I've been working on this performance drop since Friday but it wasn't
until I replied to Arnau's post earlier today that I remembered I'd
been logging the server load. I grabbed the data and charted it in
Excel and to my surprise, the graph of the server's load average looks
kind of like the graph of y=x^2.

I've got to make a recommendation for a solution to the PHB, and my
analysis is showing that as the dataset becomes larger, the amount of
time the disk spends seeking is increasing. This causes processes to
take longer to finish, which causes more processes to pile up, which
causes processes to take longer to finish, which causes more processes
to pile up, etc. It is this growing dataset that seems to be the source
of the sharp decrease in performance.

I knew this day would come, but I'm actually quite surprised that when
it came, there was little time between the warning and the grand
finale. I guess this message is being sent to the list to serve as a
warning to other data warehouse admins that when you reach your
capacity, the downward spiral happens rather quickly.

Crud... Outlook just froze while composing the PHB memo. I've been
working on that for an hour. What a bad day.
--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Monitoring Postgresql performance

2005-09-28 Thread Matthew Nuzum
On 9/28/05, Arnau <[EMAIL PROTECTED]> wrote:
> Hi all,
>
>I have been "googling" a bit searching info about a way to monitor
> postgresql (CPU & memory, number of processes, ...) and I haven't found
> anything relevant. I'm using munin to monitor other parameters of my
> servers and I'd like to include postgresql or have a similar tool. Are any
> of you using anything like that? All kinds of hints are welcome :-)
>
> Cheers!
> --
> Arnau

I have a cronjob that runs every 5 minutes and checks the number of
processes. When things get unruly I get a text message sent to my cell
phone. It also creates a detailed log entry. I'll paste in an example
of one of my scripts that does this below. This is on a dual-purpose
server and monitors both the CPU load average and Postgres. You can have
the text message sent to multiple email addresses; just put a
space-separated list of addresses between quotes in the CONTACTS= line.
It's simple, but it works, and it's always nice to know when there's a
problem *before the boss discovers it* ;-)

# Create some messages
HOSTNAME=`hostname`
WARNING_DB="Database connections on $HOSTNAME is rather high"
WARNING_CPU="CPU load on $HOSTNAME is rather high"
CONTACTS="[EMAIL PROTECTED] [EMAIL PROTECTED] [EMAIL PROTECTED]"
WARN=0

#calculate the db load
DB_LOAD=`ps -ax | grep postgres | wc -l`
if (($DB_LOAD > 150))
then
  WARN=1
  echo "$WARNING_DB ($DB_LOAD) " | mail -s "db_load is high
($DB_LOAD)" $CONTACTS
fi

#calculate the processor load
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 2 | cut --delimiter="." -f 1`
if (($CPU_LOAD > 8))
then
  WARN=1
  echo "$WARNING_CPU ($CPU_LOAD) " | mail -s "CPU_load is high
($CPU_LOAD)" $CONTACTS
fi

if (($WARN > 0))
then
  echo -=-=-=-=-=-=-=-=- W A R N I N G -=-=-=-=-=-=-=-=- >> /tmp/warn.txt
  NOW=`date`
  echo -=-=-=-=-=-$NOW-=-=-=-=-=- >> /tmp/warn.txt
  echo CPU LOAD: $CPU_LOAD DB LOAD: $DB_LOAD >> /tmp/warn.txt
  echo >> /tmp/warn.txt
  top -bn 1 >> /tmp/warn.txt
  echo >> /tmp/warn.txt
fi

NOW=`date`
CPU_LOAD=`cat /proc/loadavg | cut --delimiter=" " -f 1,2,3 --output-delimiter=\|`
echo -e $NOW\|$CPU_LOAD\|$DB_LOAD >> ~/LOAD_MONITOR.LOG

--
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Postgresql Hardware - Recommendations

2005-09-06 Thread Matthew Nuzum
On 9/5/05, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> ... The only info I can
> give so far is that the database size is about 60GB, and that it will be
> frequently accessed by multiple users (about 100 will be connected
> during business hours). The applications accessing the database are
> mostly reporting tools.

Optimizing hardware for mostly selects is different from optimizing
for lots of inserts. You will get good responses from this list if you
can give a few more details. Here are some questions:
How do you get your data into the db? Do you do bulk loads at
periodic intervals during the day? Do you do frequent updates/inserts?

You say reporting, do you use many stored procedures and calculations
on the server side? I've used some reporting apps that simply grab
tons of data from the server and then process it on the client side
(ODBC apps seem to do this), while other applications formulate the
queries and use stored procedures in order to transfer little data.

Of your 60GB, how much of that is active? Does your budget allow you
to buy enough RAM to get your active data into the disk cache? For
reporting, this *might* be your biggest win.

Here are some scenarios:
S1: Bulk uploads once or twice daily of about 250 MB of data. Few
inserts and updates during the day (1-2%). Reporting is largely done
on data from the last 5 business days. In this case you have < 2GB of
active data and your disk cache will hold all of your active data in
RAM (provided your db structure is disk-cache friendly). An example of
this I have experienced is a sales application that queries current
inventory. Telephone agents queried, queried, and queried the
in-stock inventory.

S2: Same as above, but reporting is largely done on data covering 200+
business days. It's doubtful that you will get 50GB of RAM in your
server, so you need to focus on disk speed. An example of this I have
experienced was an application that looked at sales trends and
performed commission calculations and projected sales forecasts.

S3: Lots of inserts/updates throughout the day (15 - 25%) - you need
to focus on disk speed. The content management system my employer
develops fits this model.

> 3) CPUs vs Memory
> Would you rather have 4x CPUs and 8GB of memory, or 2x CPUs with 16GB of
> memory?

Very hard to say without knowing your application. I have limited
experience, but what I've found is that applications that support
multiple db architectures do not fully utilize the database server, so
CPU utilization is low while disk and network I/O are high. I don't
know if your application supports multiple backends, but chances are
good your biggest wins will come from RAM, disk and network investments.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] RAID Configuration Sugestion

2005-08-30 Thread Matthew Nuzum
On 8/30/05, Ron <[EMAIL PROTECTED]> wrote:
> >If you still have the budget, I would suggest considering either
> >what Ron suggested or possibly using a 4 drive RAID 10 instead.
> 
> IME, with only 4 HDs, it's usually better to split them them into two
> RAID 1's (one for the db, one for everything else including the logs)
> than it is to put everything on one RAID 10.  YMMV.

This corresponds to what I have observed as well. Of course, we all
know that workloads vary.

Just a note for the OP who has only two drives: there are tools for a
variety of OSs that monitor the S.M.A.R.T. features of the drive and
give an early warning when they sense impending failure. I've caught
two drives before failure with these types of tools.

Also note that when reading discussions of this nature you must take
into consideration the value of your data. For some people, restoring
from a nightly backup is inconvenient, but not life-or-death. Some
people even do twice-daily backups so that in case of a failure they
can recover with little loss of data. This might be a good way to
mitigate the cost of expensive server hardware. If you cannot afford
to lose any data then you need to consider it imperative to use some
type of RAID setup (not RAID 0), and to achieve great performance
you'll want more than 2 drives.
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] RAID Configuration Sugestion

2005-08-30 Thread Matthew Nuzum
> > >On Tue, Aug 30, 2005 at 09:37:17 -0300,
> > > Alvaro Nunes Melo <[EMAIL PROTECTED]> wrote:
> > >>The server is a Dual Xeon 3.0 with 2 GB RAM and two SCSI disks. Our main
> > >>doubt is what is the best configuration for the disks. We are thinking
> > >>about use them in a RAID-0 array. Is this the best option? What do you
> > >>suggest on partitioning? Separate partitions for the OS, data and pg_xlog?
> > >
> > Our main goal is performance speedup. Disk space might not be a problem.
> > I've read a lot here about movig pg_xlog to different partitions, and
> > we'll surely make tests to see what configuration might be better.
> 

I've set up several servers with a config like this. It's not ideal,
but there's no reason you can't enjoy the benefits of a snappy
application.

The best results I've had involve dedicating one drive to the OS, swap,
logs, tmp and everything else, and dedicating one drive to Postgres. If
you use *nix you can mount the second drive as /var/lib/pgsql (or
wherever Postgres lives on your server) with noatime as a mount option.

In retrospect, you might have saved the money on the second CPU and
gotten two more hard drives, but if you're running a dual-task server
(e.g. LAMP) you may appreciate the second CPU.

The beauty of a server like this is that it puts more of the wizardry
of creating a fast application into the hands of the app developer,
which results in a better db schema, optimized queries and generally
*thinking* about the performance of the code. I personally feel that
to be a very rewarding aspect of my job. (As a hobby I program
microcontrollers that run at 4MHz and have only 256 bytes of RAM, so
that could just be me. ;-)

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Need for speed

2005-08-17 Thread Matthew Nuzum
On 8/17/05, Ron <[EMAIL PROTECTED]> wrote:
> At 05:15 AM 8/17/2005, Ulrich Wisser wrote:
> >Hello,
> >
> >thanks for all your suggestions.
> >
> >I can see that the Linux system is 90% waiting for disc io.
...
> 1= your primary usage is OLTP-like, but you are also expecting to do
> reports against the same schema that is supporting your OLTP-like
> usage.  Bad Idea.  Schemas that are optimized for reporting and other
> data mining like operation are pessimal for OLTP-like applications
> and vice versa.  You need two schemas: one optimized for lots of
> inserts and deletes (OLTP-like), and one optimized for reporting
> (data-mining like).

Ulrich,

If you meant that your disk/SCSI system is already the fastest
available *with your current budget*, then following Ron's advice I
quoted above will be a good step.

I have some systems very similar to yours. What I do is import in
batches and then immediately pre-process the batch data into tables
optimized for quick queries. For example, if your reports frequently
need to find the total number of views per hour for each customer,
create a table whose data contains just the totals for each customer
for each hour of the day. This will make it a tiny fraction of the
size, allowing it to fit largely in RAM for the query and making the
indexes more efficient.

This is a tricky job, but if you do it right, your company will be a
big success and buy you more hardware to work with. Of course, they'll
also ask you to create dozens of new reports, but that's par for the
course.

Even if you have the budget for more hardware, I feel that creating an
effective db structure is a much more elegant solution than throwing
more hardware at the problem. (I admit, sometimes it's cheaper to throw
hardware at it.)

If you have particular queries that are too slow, posting the explain
analyze for each on the list should garner some help.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] faster INSERT with possible pre-existing row?

2005-07-26 Thread Matthew Nuzum
On 7/26/05, Dan Harris <[EMAIL PROTECTED]> wrote:
> I am working on a process that will be inserting tens of million rows
> and need this to be as quick as possible.
> 
> The catch is that for each row I could potentially insert, I need to
> look and see if the relationship is already there  to prevent
> multiple entries.  Currently I am doing a SELECT before doing the
> INSERT, but I recognize the speed penalty in doing two operations.  I
> wonder if there is some way I can say "insert this record, only if it
> doesn't exist already".  To see if it exists, I would need to compare
> 3 fields instead of just enforcing a primary key.

I struggled with this for a while. At first I tried stored procedures
and triggers, but it took very long (over 24 hours for my dataset).
After several iterations of rewriting it, first into C# and then into
Python, I got the whole process down to under 30 min.

My scenario is this:
I want to normalize log data. For example, for the IP address in a log
entry, I need to look up the unique id of the IP address, or if the IP
address is new, insert it and then return the newly created entry.
Multiple processes use the data, but only one process, run daily,
actually changes it. Because this one process knows that the data is
static, it selects the tables into in-memory hash tables (C#) or
Dictionaries (Python) and then does the lookups there. It is *super*
fast, but it uses a *lot* of RAM. ;-)

To limit the RAM, I wrote a version of the Python code that uses gdbm
files instead of dictionaries. This requires a newer version of Python
(to allow a gdbm db to work just like a dictionary), but makes life
easier in case someone is using my software on a lower-end machine.
This doubled the time of the lookups from about 15 minutes to 30,
bringing the whole process to about 45 minutes.
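
Roughly, the gdbm swap looks like this in today's Python 3 (a sketch only;
the ip_addr table, the psycopg2-style cursor and the file name are
assumptions for illustration):

  import dbm   # dbm.open() picks whichever backend is available, e.g. gdbm

  def lookup_ip(ips, cur, ip_addr):
      """Return the id for ip_addr, inserting it into the db on first sight.
      `ips` is an open dbm file used like a dict (keys/values are bytes)."""
      key = ip_addr.encode()
      if key in ips:
          return int(ips[key])              # stored values come back as bytes
      cur.execute("INSERT INTO ip_addr (addr) VALUES (%s)", (ip_addr,))
      cur.execute("SELECT id FROM ip_addr WHERE addr = %s", (ip_addr,))
      new_id = cur.fetchone()[0]
      ips[key] = str(new_id)
      return new_id

  # usage:  ips = dbm.open("ip_lookup.db", "c")  ...  ips.close()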

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Projecting currentdb to more users

2005-07-12 Thread Matthew Nuzum
On 7/12/05, Yves Vindevogel <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> We have a couple of database that are identical (one for each customer).
> They are all relatively small, ranging from 100k records to 1m records.
> There's only one main table with some smaller tables, a lot of indexes
> and some functions.
> 
> I would like to make an estimation of the performance, the diskspace
> and other related things,
> when we have database of for instance 10 million records or 100 million
> records.
> 
> Is there any math to be done on that ?

It's pretty easy to make a database run fast with only a few thousand
records, or even a million records; however, things start to slow down
non-linearly when the database grows too big to fit in RAM.

I'm not a guru, but my attempts to do this have not been very accurate.

Maybe (just maybe) you could get an idea by disabling the OS cache on
the file system(s) holding the database and then somehow fragmenting
the drive severely (maybe by putting each table in its own disk
partition?!?) and measuring performance.

On the positive side, there are a lot of wise people on this list who
have plenty of experience optimizing slow queries on big databases. So
queries that now run in 20 ms but slow down to 7 seconds when your
tables grow will likely benefit from optimizing.
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] optimized counting of web statistics

2005-06-28 Thread Matthew Nuzum
On 6/29/05, Rudi Starcevic <[EMAIL PROTECTED]> wrote:
> Hi,
> 
> >I do my batch processing daily using a python script I've written. I
> >found that trying to do it with pl/pgsql took more than 24 hours to
> >process 24 hours worth of logs. I then used C# and in memory hash
> >tables to drop the time to 2 hours, but I couldn't get mono installed
> >on some of my older servers. Python proved the fastest and I can
> >process 24 hours worth of logs in about 15 minutes. Common reports run
> >in < 1 sec and custom reports run in < 15 seconds (usually).
> >
> >
> 
> When you say you do your batch processing in a Python script do you mean
> that you are using 'plpython' inside
> PostgreSQL or using Python to execute select statements and crunch the
> data 'outside' PostgreSQL?
> 
> Your reply is very interesting.

Sorry for not making that clear... I don't use plpython; I'm using an
external Python program that makes database connections, creates
dictionaries and does the normalization/batch processing in memory. It
then saves the changes to a text file which is loaded using psql.

I've tried many things, and while this is RAM intensive, it is by far
the fastest approach I've found. I've also modified the Python program
to optionally use disk-based dictionaries based on (I think) gdbm. This
significantly increases the time, to closer to 25 min ;-), but drops
the memory usage by an order of magnitude.

To be fair to C# and .NET, I think that Python and C# can do it
equally fast, but between the time of creating the C# version and the
Python version I learned some new optimization techniques. I feel that
both are powerful languages. (To be fair to Python, I can write the
dictionary lookup code in approximately 25% fewer lines than similar
hash table code in C#. I could go on but I think I'm starting to get
off topic.)
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] optimized counting of web statistics

2005-06-28 Thread Matthew Nuzum
On 6/28/05, Billy extyeightysix <[EMAIL PROTECTED]> wrote:
> Hola folks,
> 
> I have a web statistics Pg database (user agent, urls, referrer, etc)
> that is part of an online web survey system. All of the data derived
> from analyzing web server logs is stored in one large table with each
> record representing an analyzed webserver log entry.
> 
> Currently all reports are generated when the logs are being analyzed
> and before the data ever goes into the large table I mention above.
> Well, the time has come to build an interface that will allow a user
> to make ad-hoc queries against the stats and that is why I am emailing
> the performance list.

Load your data into a big table, then pre-process into additional
tables that have data better organized for running your reports.

For example, you may want a table that shows a sum of all hits for
each site, for each hour of the day. You may want an additional table
that shows the sum of all page views, or maybe sessions for each site
for each hour of the day.

So, if you manage a single site, each day you will add 24 new records
to the sum table.

You may want the following fields:
site (string)
atime (timestamptz)
hour_of_day (int)
day_of_week (int)
total_hits (int8)

A record may look like this:
site | atime | hour_of_day | day_of_week | total_hits
'www.yoursite.com'  '2005-06-28 16:00:00 -0400'  18  2  350

Index all of the fields except total_hits (unless you want a report
that shows all hours where hits were greater than x or less than x).

Doing:
select sum(total_hits) as total_hits from summary_table where atime
between (now() - '7 days'::interval) and now();
should be pretty fast.
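
As a rough illustration, the nightly roll-up into that summary table only
takes a few lines of Python (a sketch; the parse_log() helper, the file
names and the day-of-week convention -- Sunday = 0, so the Tuesday in the
sample record above is 2 -- are assumptions):

  from collections import defaultdict

  totals = defaultdict(int)
  for site, atime in parse_log("access.log"):     # hypothetical parser yielding (site, datetime)
      hour = atime.replace(minute=0, second=0, microsecond=0)
      totals[(site, hour)] += 1

  with open("summary.copy", "w") as out:
      for (site, hour), hits in sorted(totals.items()):
          out.write("%s\t%s\t%d\t%d\t%d\n" % (
              site, hour.isoformat(),
              hour.hour,                          # hour_of_day
              hour.isoweekday() % 7,              # day_of_week (Sunday = 0)
              hits))                              # total_hits

  # then load it in one shot, e.g. from psql:
  #   \copy summary_table (site, atime, hour_of_day, day_of_week, total_hits) from 'summary.copy'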

You can also normalize your data such as referrers, user agents, etc
and create similar tables to the above.

In case you haven't guessed, I've already done this very thing.

I do my batch processing daily using a python script I've written. I
found that trying to do it with pl/pgsql took more than 24 hours to
process 24 hours worth of logs. I then used C# and in memory hash
tables to drop the time to 2 hours, but I couldn't get mono installed
on some of my older servers. Python proved the fastest and I can
process 24 hours worth of logs in about 15 minutes. Common reports run
in < 1 sec and custom reports run in < 15 seconds (usually).
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Do Views execute underlying query everytime ??

2005-06-21 Thread Matthew Nuzum
On 6/21/05, PFC <[EMAIL PROTECTED]> wrote:
...
> In your case I don't think that is the solution, because you do big
> updates. With triggers this would mean issuing one update of your
> materialized view per row in your big update. This could be slow.
> 
> In this case you might want to update the cache table in one request
> rather than doing an awful lot of updates.
> 
> So you have two solutions :
> 
> 1- Junk it all and rebuild it from scratch (this can be faster than it
> seems)
> 2- Put the rows to be added in a temporary table, update the cache table
> considering the difference between this temporary table and your big
> table, then insert the rows in the big table.
> 
> This is the fastest solution but it requires a bit more coding (not THAT
> much though).
> 
Amit,

I understand your desire to not need any manual intervention...

I don't know what OS you use, but here are two practical techniques
you can use to achieve the above solution suggested by PFC:

a: If you are on a Unix-like OS such as Linux or FreeBSD you have the
beautiful cron program that will run commands nightly.

b: If you are on Windows you have to do something else. The simplest
solution I've found is called "pycron" (easily locatable via Google),
a service that emulates Unix cron on Windows (bypassing a lot
of the Windows scheduler hassle).

Now, using either of those solutions, let's say at 6:00 am you want to
do your batch query.

1. Put the queries you want into a text file EXACTLY as you would type
them using psql and save the text file. For example, the file may be
named "create_mat_view.txt".
2. Test them by doing this from a command prompt: psql dbname <
create_mat_view.txt
3. Create a cron entry to run the command once a day, it might look like this:
0 6 * * * /usr/bin/psql dbname < /home/admin/create_mat_view.txt
or maybe like this:
0 6 * * * "C:\Program Files\PostgreSQL\8.0\psql.exe" dbname <
"C:\create_mat_view.txt"

I hope this helps,
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Do Views execute underlying query everytime ??

2005-06-21 Thread Matthew Nuzum
On 6/21/05, Amit V Shah <[EMAIL PROTECTED]> wrote:
> Hi all,
...
> I am thinking of having a solution where I create views for each screen,
> which are just read only.
> 
> However, I do not know if the query that creates the view is executed
> every time I select something from the view. Because if that is the case,
> then I think my queries will again be slow. But if that is the way views
> work, then what would be the point in creating them?
> 
> Any suggestions, helps --

They do get executed every time. I have a similar issue, but my data
does not change very frequently, so instead of using a view, I create
lookup tables to hold the data. So once a day I do something like
this:
drop table lookup_table_1;
create table lookup_table_1 as SELECT ...;

In my case, rows are not deleted or updated, so I don't actually do a
"drop table..." I merely add new records to the existing table, but if
your data changes, the drop table technique can be faster than doing a
delete or update.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Help specifying new web server/database machine

2005-06-09 Thread Matthew Nuzum
On 6/9/05, Rory Campbell-Lange <[EMAIL PROTECTED]> wrote:
> Disks:
> 
> I'm somewhat confused here. I've followed the various notes about SATA
> vs SCSI and it seems that SCSI is the way to go. On a four-slot 1U
> server, would one do a single RAID10 over 4 disks 1rpm U320 disks?
> I would run the database in its own partition, separate from the rest of
> the OS, possible on LVM. An LSI-Megaraid-2 appears to be the card of
> choice.
> 

Can you tell us about your application? How much data will you have,
what is your ratio of reads to writes, and how tolerant of data loss are
you? (For example, some people load their data in batches and if they
lose their data it's no big deal; others would have heart failure if a
few transactions were lost.)

If your application is 95% writes then people will suggest drastically
different hardware than if your application is 95% selects.

Here is an example of one of my servers:
the application is 95+% selects, has 15GB of data (counting indexes), low
tolerance for data loss, and runs on a 1 GHz P3 Compaq server with
mirrored 35 GB IDE disks and 1.6GB of RAM. Application response time
is approximately .1 second to serve a request on a moderately loaded
server.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] Help specifying new web server/database machine

2005-06-08 Thread Matthew Nuzum
On 6/8/05, Rory Campbell-Lange <[EMAIL PROTECTED]> wrote:
> I'm tasked with specifying a new machine to run a web application
> prototype. The machine will be serving web pages with a Postgresql
> backend; we will be making extensive use of plpgsql functions. No
> database tables are likely to go over a million rows during the
> prototype period.
...
> 2GB (2x 1024MB) DDR-400 (PC3200) ECC Registered SDRAM (single rank)
> 4 Port AMCC/3Ware 9500-4LP PCI SATA RAID Controller
> 80GB SATA-150 7200RPM Hard Disk / 8MB Cache
> 80GB SATA-150 7200RPM Hard Disk / 8MB Cache
> 250GB SATA-150 7200RPM Hard Disk / 8MB Cache
> 250GB SATA-150 7200RPM Hard Disk / 8MB Cache

If your app is select-heavy, especially the types of things that do
sequential scans, you will enjoy having enough RAM to easily load all
of your tables and indexes into RAM. If your database will exceed 1GB
on disk, consider more RAM than 2GB.

If your database will be write heavy choosing good controllers and
disks is essential. Reading through the archives you will see that
there are some important disk configurations you can choose for
optimizing disk writes such as using the outer portions of the disks
exclusively. If data integrity is not an issue, choose a controller
that allows caching of writes (usually IDE and cheaper SATA systems
cache writes regardless of what you want).

If it were my application, and if I had room in the budget, I'd double
the RAM. I don't know anything about your application, though, so use
the guidelines above.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] speed up query with max() and odd estimates

2005-04-26 Thread Matthew Nuzum
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote:
> On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote:
> >  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
> 
> That's a gross misestimation -- four orders of magnitude off!
> 
> Have you considered doing this in two steps, first getting out whatever
> comes from the subquery and then doing the query? 

Well, I don't know if the estimates are correct now or not, but I
found that your suggestion of doing it in two steps helped a lot.

For the archives, here's what made a drastic improvement:

This batch program had an overhead of 25 min to build hash tables
using the sql queries. It is now down to about 47 seconds.

The biggest improvements (bringing it down to 9 min) were to get rid
of all instances of `select max(field) from ...` and replace them
with `select field from ... order by field desc limit 1`.

Then, to get it down to the final 47 seconds I changed this query:
SELECT client,max(atime) as atime from usage_access where atime >=
(select atime - '1 hour'::interval from usage_access order by atime
desc limit 1) group by client;

To these three queries:
SELECT atime - '1 hour'::interval from usage_access order by atime desc limit 1;
SELECT client, atime into temporary table recent_sessions from
usage_access where atime >= '%s';
SELECT client, max(atime) as atime from recent_sessions group by client;

Thanks for the help.
-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] speed up query with max() and odd estimates

2005-04-26 Thread Matthew Nuzum
On 4/26/05, Steinar H. Gunderson <[EMAIL PROTECTED]> wrote:
> On Tue, Apr 26, 2005 at 03:16:57PM -0500, Matthew Nuzum wrote:
> >  Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
> > width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
> 
> That's a gross misestimation -- four orders of magnitude off!
> 
> Have you considered doing this in two steps, first getting out whatever
> comes from the subquery and then doing the query? Have you ANALYZEd recently?
> Do you have an index on atime?
> 

Yes, there is an index on atime. I'll re-analyze but I'm pretty
certain that runs nightly.

Regarding two steps, are you suggesting:
begin;
select * into temp_table...;
select * from temp_table...;
drop temp_table;
rollback;

I have not tried that but will.

BTW, I created an index on clients just for the heck of it and there
was no improvement. (actually, a slight degradation)

-- 
Matthew Nuzum
www.bearfruit.org



[PERFORM] speed up query with max() and odd estimates

2005-04-26 Thread Matthew Nuzum
I have this query that takes a little over 8 min to run:
select client,max(atime) as atime from usage_access where atime >=
(select atime - '1 hour'::interval from usage_access order by atime
desc limit 1) group by client;

I think it can go a lot faster. Any suggestions on improving this? DB
is 7.3.4 I think. (There is no index on client because it is very big
and this data is used infrequently.)

explain ANALYZE select client,max(atime) as atime from usage_access
where atime >= (select atime - '1 hour'::interval from usage_access
order by atime desc limit 1) group by client;
  
  QUERY PLAN

 Aggregate  (cost=3525096.28..3620450.16 rows=1271385 width=20)
(actual time=482676.95..482693.69 rows=126 loops=1)
   InitPlan
 ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual
time=0.40..0.41 rows=1 loops=1)
   ->  Index Scan Backward using usage_access_atime on
usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual
time=0.39..0.40 rows=2 loops=1)
   ->  Group  (cost=3525096.28..3588665.53 rows=12713851 width=20)
(actual time=482676.81..482689.29 rows=3343 loops=1)
 ->  Sort  (cost=3525096.28..3556880.90 rows=12713851
width=20) (actual time=482676.79..482679.16 rows=3343 loops=1)
   Sort Key: client
   ->  Seq Scan on usage_access  (cost=0.00..1183396.40
rows=12713851 width=20) (actual time=482641.57..482659.18 rows=3343
loops=1)
 Filter: (atime >= $0)
 Total runtime: 482694.65 msec


I'm starting to understand this, which is quite frightening to me. I
thought that maybe if I shrink the number of rows down I could improve
things a bit, but my first attempt didn't work. I thought I'd replace
the "from usage_access" with this query instead:
select * from usage_access where atime >= (select atime - '1
hour'::interval from usage_access order by atime desc limit 1);
  
  QUERY PLAN

 Seq Scan on usage_access  (cost=0.00..1183396.40 rows=12713851
width=116) (actual time=481796.22..481839.43 rows=3343 loops=1)
   Filter: (atime >= $0)
   InitPlan
 ->  Limit  (cost=0.00..0.59 rows=1 width=8) (actual
time=0.41..0.42 rows=1 loops=1)
   ->  Index Scan Backward using usage_access_atime on
usage_access  (cost=0.00..22657796.18 rows=38141552 width=8) (actual
time=0.40..0.41 rows=2 loops=1)
 Total runtime: 481842.47 msec

It doesn't look like this will help at all.

This table is primarily append, however I just recently deleted a few
million rows from the table, if that helps anyone.

-- 
Matthew Nuzum
www.bearfruit.org



Re: [PERFORM] How to improve db performance with $7K?

2005-04-18 Thread Matthew Nuzum
On 4/14/05, Tom Lane <[EMAIL PROTECTED]> wrote:
> 
> That's basically what it comes down to: SCSI lets the disk drive itself
> do the low-level I/O scheduling whereas the ATA spec prevents the drive
> from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
> possible for the drive to rearrange reads as well as writes --- which
> AFAICS is just not possible in ATA.  (Maybe in the newest spec...)
> 
> The reason this is so much more of a win than it was when ATA was
> designed is that in modern drives the kernel has very little clue about
> the physical geometry of the disk.  Variable-size tracks, bad-block
> sparing, and stuff like that make for a very hard-to-predict mapping
> from linear sector addresses to actual disk locations.  Combine that
> with the fact that the drive controller can be much smarter than it was
> twenty years ago, and you can see that the case for doing I/O scheduling
> in the kernel and not in the drive is pretty weak.
>   
> 

So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K RPM, and
drive B has capacity 2 x C and spins at 10K RPM, and
all other features are the same, the price is the same, and C is enough
disk space, which would you choose?

I've noticed that on IDE drives, as the capacity increases the data
density increases and there is a perceived (I've not measured it)
performance increase.

Would the increased data density of the higher capacity drive be of
greater benefit than the faster spindle speed of drive A?

-- 
Matthew Nuzum
www.bearfruit.org

---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Spend 7K *WHERE*? WAS Intel SRCS16 SATA raid? and How

2005-04-15 Thread Matthew Nuzum
I think there are many people who feel that $7,000 is a good budget for a
database server, me being one.

 * I agree with the threads that more disks are better.
 * I also agree that SCSI is better, but can be hard to justify if your
budget is tight, and I have great certainty that 2x SATA drives on a good
controller are better than x SCSI drives for many workloads.
 * I also feel that good database design and proper maintenance can be one
of the single biggest performance enhancers available. This can be labor
intensive, however, and sometimes throwing more hardware at a problem is
cheaper than restructuring a db.

Either way, having a good hardware platform is an excellent place to start,
as much of your tuning will depend on certain aspects of your hardware.

So if you need a db server, and you have $7k to spend, I'd say spend it.
From this list, I've gathered that I/O and RAM are your two most important
investments.

Once you get that figured out, you can still do some performance tuning on
your new server using the excellent advice from this mailing list.

By the way, for all those who make this list work, I've rarely found such a
thorough, helpful and considerate group of people as these on the
performance list.

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] How to improve db performance with $7K?

2005-04-14 Thread Matthew Nuzum
On 4/14/05, Tom Lane <[EMAIL PROTECTED]> wrote:
>
> That's basically what it comes down to: SCSI lets the disk drive itself
> do the low-level I/O scheduling whereas the ATA spec prevents the drive
> from doing so (unless it cheats, ie, caches writes).  Also, in SCSI it's
> possible for the drive to rearrange reads as well as writes --- which
> AFAICS is just not possible in ATA.  (Maybe in the newest spec...)
>
> The reason this is so much more of a win than it was when ATA was
> designed is that in modern drives the kernel has very little clue about
> the physical geometry of the disk.  Variable-size tracks, bad-block
> sparing, and stuff like that make for a very hard-to-predict mapping
> from linear sector addresses to actual disk locations.  Combine that
> with the fact that the drive controller can be much smarter than it was
> twenty years ago, and you can see that the case for doing I/O scheduling
> in the kernel and not in the drive is pretty weak.
>
>

So if you all were going to choose between two hard drives where:
drive A has capacity C and spins at 15K RPM, and
drive B has capacity 2 x C and spins at 10K RPM, and
all other features are the same, the price is the same, and C is enough
disk space, which would you choose?

I've noticed that on IDE drives, as the capacity increases the data
density increases and there is a perceived (I've not measured it)
performance increase.

Would the increased data density of the higher capacity drive be of
greater benefit than the faster spindle speed of drive A?

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of “Elite Content Management System”
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] performance hit for replication

2005-04-12 Thread Matthew Nuzum
> >I'm eager to hear your thoughts and experiences,
> >
> >
> Well with replicator you are going to take a pretty big hit initially
> during the full
> sync but then you could use batch replication and only replicate every
> 2-3 hours.
> 
> Sincerely,
> 
> Joshua D. Drake
> 

Thanks, I'm looking at your product and will contact you off list for more
details soon.

Out of curiosity, does batch mode produce a lighter load? Live updating will
provide maximum data security, and I'm most interested in how it affects the
server.

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
Earn a commission of $100 - $750 by recommending Elite CMS. Visit
http://www.elitecms.com/Contact_Us.partner for details.


---(end of broadcast)---
TIP 8: explain analyze is your friend


[PERFORM] performance hit for replication

2005-04-12 Thread Matthew Nuzum
I'd like to create a fail-over server in case of a problem. Ideally, it
would be synchronized with our main database server, but I don't see any
major problem with having a delay of up to 4 hours between syncs.

My database is a little shy of 10 Gigs, with much of that data being in an
archived log table. Every day a batch job is run which adds 100,000 records
over the course of 3 hours (the batch job does a lot of pre/post
processing).

Doing a restore of the db backup in vmware takes about 3 hours. I suspect a
powerful server with a better disk setup could do it faster, but I don't
have servers like that at my disposal, so I need to assume worst-case of 3-4
hours is typical.

So, my question is this: My server currently works great, performance wise.
I need to add fail-over capability, but I'm afraid that introducing a
stressful task such as replication will hurt my server's performance. Is
there any foundation to my fears? I don't need to replicate the archived log
data because I can easily restore that in a separate step from the nightly
backup if disaster occurs. Also, my database load is largely selects. My
application works great with PostgreSQL 7.3 and 7.4, but I'm currently using
7.3. 

I'm eager to hear your thoughts and experiences,
-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
Earn a commission of $100 - $750 by recommending Elite CMS. Visit
http://www.elitecms.com/Contact_Us.partner for details.



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Preventing query from hogging server

2005-03-24 Thread Matthew Nuzum

> I would strongly suggest doing the min and max calculations together:
> 
>   select groupid, min(col), max(col) from ...
>
> because if you do them in two separate queries 90% of the effort will be
> duplicated.
>
>   regards, tom lane

Thanks. Other than avoiding using too much sort mem, is there anything else
I can do to ensure this query doesn't starve other processes for resources?
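
If I do end up giving the sort more memory, I assume the safe way is to
raise it only in the session running the report so the other backends keep
the default. A rough sketch, assuming 7.3's sort_mem setting (the value is
in kilobytes):

SET sort_mem = 32768;    -- 32MB of sort memory, only for this session
-- ... run the big GROUP BY / aggregate query here ...
RESET sort_mem;          -- back to the postgresql.conf default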

Doing the explain analyze only increases my server load by 1 and seems to
readily relinquish CPU time, but previously when I had been running a test
query my server load rose to unacceptable levels.

FWIW, the explain was run from psql running on the db server; the test query
the other day was run from one of the webservers. Should I run this on the
db server to minimize load?

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/


---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [PERFORM] Preventing query from hogging server

2005-03-24 Thread Matthew Nuzum
> How many rows in usage_access?

Oh, I just got my explain analyze:
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Subquery Scan "*SELECT*"  (cost=9499707.90..9856491.74 rows=3567838
width=28) (actual time=11443537.58..12470835.17 rows=1198141 loops=1)
   ->  Aggregate  (cost=9499707.90..9856491.74 rows=3567838 width=28)
(actual time=11443537.56..12466550.25 rows=1198141 loops=1)
 ->  Group  (cost=9499707.90..9767295.78 rows=35678384 width=28)
(actual time=11443537.10..12408372.26 rows=35678383 loops=1)
   ->  Sort  (cost=9499707.90..9588903.86 rows=35678384
width=28) (actual time=11443537.07..12035366.31 rows=35678383 loops=1)
 Sort Key: accountid, sessionid
 ->  Seq Scan on usage_access  (cost=0.00..1018901.84
rows=35678384 width=28) (actual time=8.13..416580.35 rows=35678383 loops=1)
 Total runtime: 12625498.84 msec
(7 rows)

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Preventing query from hogging server

2005-03-24 Thread Matthew Nuzum

> How many rows in usage_access?  How many groups do you expect?
> (Approximate answers are fine.)  What PG version is this, and
> what's your sort_mem setting?
> 
>   regards, tom lane

I believe there are about 40,000,000 rows, I expect there to be about
10,000,000 groups. PostgreSQL version is 7.3.2 and the sort_mem is at the
default setting.

(I know that's an old version. We've been testing with 7.4 now and are
nearly ready to upgrade.)

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
View samples of Elite CMS in action by visiting
http://www.followers.net/portfolio/


---(end of broadcast)---
TIP 8: explain analyze is your friend


[PERFORM] Preventing query from hogging server

2005-03-24 Thread Matthew Nuzum
I've got a report that is starting to take too long to run. I'm going to
create a lookup table that should speed up the results, but first I've got
to create the lookup table.

I honestly don't care how long the query takes to run; I just want to run it
without causing a major performance impact on other operations. The query
seems to take forever even when I limit the results to just 10, so I don't
know if I can get good results by splitting the query into groups of queries
(for example, for a year's worth of data do 12 queries, one for each month, or
maybe 365 queries, one for each day) or if there is a psql equivalent to
"nice."

I've tried `nice psql` in the past and I don't think that had much impact,
but I haven't tried it on this query.

Here is the query (BTW, there will be a corresponding "max" version of this
query as well):
INSERT INTO usage_sessions_min (accountid,atime,sessionid)
select accountid, min(atime) as atime, sessionid from usage_access 
group by accountid,sessionid;

atime is a timestamptz, accountid is varchar(12) and sessionid is int.

I've tried to do an explain analyze of this query, but it's been running for
hours already and I don't know when it will finish.
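
If I try the split-by-month idea, I imagine each batch would look roughly
like this (an untested sketch; the date bounds are only an example, and a
session that straddles a month boundary would get a row in each batch, so
I'd still need a final pass keeping the earliest atime per
accountid/sessionid pair):

INSERT INTO usage_sessions_min (accountid, atime, sessionid)
SELECT accountid, min(atime) AS atime, sessionid
  FROM usage_access
 WHERE atime >= '2005-01-01' AND atime < '2005-02-01'
 GROUP BY accountid, sessionid;
-- ...then repeat with the next month's bounds, pausing between batches...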

-- 
Matthew Nuzum <[EMAIL PROTECTED]>
www.followers.net - Makers of "Elite Content Management System"
View samples of Elite CMS in action by visiting
http://www.elitecms.com/



---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]


Re: [PERFORM] Restricting Postgres

2004-11-07 Thread Matthew Nuzum
Matt - Very interesting information about squid effectiveness, thanks.

Martin,
You mean your site had no images? No CSS files? No JavaScript files? Nearly
everything is dynamic?

I've found that our CMS spends more time sending a 23KB image to a dial up
user than it does generating and serving dynamic content.

This means that if you have a "light" squid process that caches and serves
your images and static content from its cache, then your apache processes
can truly focus on only the dynamic data.

Case in point: A first time visitor hits your home page.  A dynamic page is
generated (in about 1 second) and served (taking 2 more seconds) which
contains links to 20 additional files (images, styles and etc). Then
expensive apache processes are used to serve each of those 20 files, which
takes an additional 14 seconds.  Your precious application server processes
have now spent 14 seconds serving stuff that could have been served by an
upstream cache.

I am all for using upstream caches and SSL accelerators to take the load off
of application servers.  My apache children often take 16 or 20MB of RAM
each.  Why spend all of that on a 1.3KB image?

Just food for thought.  There are people who use proxying in apache to
redirect expensive tasks to other servers that are dedicated to just one
heavy challenge.  In that case you likely do have 99% dynamic content.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Martin Foster

Matt Clark wrote:

> In addition we (as _every_ high load site should) run Squid as an
> accelerator, which dramatically increases the number of client connections
> that can be handled.  Across 2 webservers at peak times we've had 50,000
> concurrently open http & https client connections to Squid, with 150
Apache
> children doing the work that squid can't (i.e. all the dynamic stuff), and
> PG (on a separate box of course) whipping through nearly 800 mixed
selects,
> inserts and updates per second - and then had to restart Apache on one of
> the servers for a config change...  Not a problem :-)
> 
> One little tip - if you run squid on the same machine as apache, and use a
> dual-proc box, then because squid is single-threaded it will _never_ take
> more than half the CPU - nicely self balancing in a way.
> 
> M
> 

I've heard of the merits of Squid in the use as a reverse proxy. 
However, well over 99% of my traffic is dynamic, hence why I may be 
experiencing behavior that people normally do not expect.

As I have said before in previous threads, the scripts are completely 
database driven and at the time the database averaged 65 queries per 
second under MySQL before a migration, while the webserver was averaging 
2 to 4.

Martin Foster
Creator/Designer Ethereal Realms
[EMAIL PROTECTED]





---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] How to time several queries?

2004-10-20 Thread Matthew Nuzum
When I'm using psql and I want to time queries, which is what I've been
doing for a little over a day now, I do the following:

Select now(); query 1; query 2; query 3; select now();

This works fine unless you're doing selects with a lot of rows which will
cause your first timestamp to scroll off the screen.
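
A variation on the same trick that survives a big result set scrolling the
first timestamp away is to stash the start time in a temp table and
subtract at the end. A rough sketch (it relies on psql's default autocommit
mode, so that now() advances between statements):

CREATE TEMP TABLE t_start AS SELECT now() AS t0;
-- query 1; query 2; query 3; ...
SELECT now() - t0 AS elapsed FROM t_start;
DROP TABLE t_start;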

-- 
Matthew Nuzum  + "Man was born free, and everywhere
www.bearfruit.org  :  he is in chains," Rousseau
+~~+ "Then you will know the truth, and 
the TRUTH will set you free," Jesus Christ (John 8:32 NIV)

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
[EMAIL PROTECTED]
Sent: Monday, October 18, 2004 2:28 PM
To: [EMAIL PROTECTED]
Subject: [PERFORM] How to time several queries?

Hello

I posted this on the general list but think it would be more appropriate
here. Sorry.

I know it is possible to time isolated queries through the settting of the
\timing option in psql. This makes PgSQL report the time it took to
perform one operation.

I would like to know how one can get a time summary of many operations, if
it is at all possible.

Thank you.

Tim



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] Caching of Queries

2004-09-28 Thread Matthew Nuzum
I could spend a week or two tweaking the performance of my database servers
and probably make some sizeable improvements, but I'm not going to.

Why? Because PostgreSQL screams as it is.

I would make sure that if the consensus is to add some sort of caching that
it be done only if there is no hit to current performance and stability.
That being said, I think that server side caching has major buzz and there's
nothing wrong with adding features that sell.

I will disagree with 3 points made on the argument against caching.
Specifically, the benefit of doing caching on the db server is that the
benefits may be reaped by multiple clients, whereas caching on the client
side must be done by each client and may not be as effective.

So what if the caching has a slight chance of returning stale results?  Just
make sure people know about it in advance.  There are some things where
stale results are no big deal and if I can easily benefit from an aggressive
caching system, I will (and I do now with the adodb caching library, but
like I said, caching has to be done for each client).  In fact, I'm all for
using a low-tech cache expiration algorithm to keep complexity down.

Finally, if the caching is not likely to help (or may even hurt) simple
queries but is likely to help complex queries then fine, make sure people
know about it and let them decide if they can benefit. 

Sorry if I'm beating a dead horse or playing the devil's advocate.  Just
felt compelled to chime in.

-- 
Matthew Nuzum  + "Man was born free, and everywhere
www.bearfruit.org  :  he is in chains," Rousseau
+~~+ "Then you will know the truth, and 
the TRUTH will set you free," Jesus Christ (John 8:32 NIV)

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane
Sent: Monday, September 27, 2004 1:19 AM
To: Neil Conway
Cc: Aaron Werman; Scott Kirkwood; [EMAIL PROTECTED]
Subject: Re: [PERFORM] Caching of Queries

Neil Conway <[EMAIL PROTECTED]> writes:
> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued.

Do they actually make a rigorous guarantee that the cached result is
still accurate when/if it is returned to the client?  (That's an honest
question --- I don't know how MySQL implements this.)

IIRC, in our past threads on this topic, it was suggested that if you
can tolerate not-necessarily-up-to-date results, you should be doing
this sort of caching on the client side and not in the DB server at all.
I wouldn't try that in a true "client" scenario, but when the DB client
is application-server middleware, it would make some sense to cache in
the application server.

regards, tom lane

---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Postgres over Linux NBD or NFS

2004-06-29 Thread Matthew Nuzum
I just got the BigAdmin newsletter from Sun today... interestingly enough it
had a link to an article described as:

> Database Performance with NAS: Optimizing Oracle on NFS
> This paper discusses the operation of relational databases with network 
> attached storage (NAS). Management implications and performance 
> expectations for databases using NAS are presented.
The link points to: http://www.sun.com/bigadmin/content/nas/

I read just enough to see if it is relevant.  Here is the first part of the
summary:

> IT departments are increasingly utilizing Network Attached Storage (NAS) 
> and the Network File System (NFS) to meet the storage needs of mission-
> critical relational databases. Reasons for this adoption include improved 
> storage virtualization, ease of storage deployment, decreased complexity,
> and decreased total cost of ownership. This paper directly examines the
> performance of databases with NAS. In laboratory tests comparing NFS with
> local storage, NFS is shown capable of sustaining the same workload level
> as local storage. Under similar workload conditions, NFS does consume an
> increased number of CPU cycles; however, the proven benefits of NFS and
> NAS outweigh this penalty in most production environments.

Matthew Nuzum   | ISPs: Make $200 - $5,000 per referral by
www.followers.net   | recommending Elite CMS to your customers!
[EMAIL PROTECTED]   | http://www.followers.net/isp




---(end of broadcast)---
TIP 6: Have you searched our list archives?

   http://archives.postgresql.org


Re: [PERFORM] postgresql and openmosix migration

2004-06-22 Thread Matthew Nuzum
Hi Bill, I am more often in the "needing help" category than the "giving
help" when it comes to advise about using postgresql.  I have found it to be
an extremely powerful tool and by far the best performance/price for my
work.

I think you will get some excellent answers and help to your performance
questions if you send the list details about specific queries that are
running too slow.  If you are willing to throw more/bigger hardware at the
problem, let people know that when you ask and they will tell you if your
bottleneck can be alleviated through more ram, disks, cpu or whatever.
Having been watching this list for some time now, I suspect most of the
performance problems can be improved using non-intuitive query or
configuration modifications (for example, replacing min()/max() as suggested
by Mr. Wolf).
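
For reference, the min()/max() trick referred to above is the usual rewrite
along these lines (the table and column names here are made up; in the 7.x
releases an aggregate like min() cannot use an index, while the ORDER
BY/LIMIT form can, assuming an index on the column):

-- forces a scan of the whole table in 7.x:
SELECT min(price) FROM commodities;

-- equivalent, but can walk an index on price:
SELECT price FROM commodities ORDER BY price ASC LIMIT 1;

-- and the max() counterpart (if the column can contain NULLs, add
-- WHERE price IS NOT NULL here, since NULLs sort as the largest values):
SELECT price FROM commodities ORDER BY price DESC LIMIT 1;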

The heavy hitters on the list will usually ask for an "explain analyze" of
your query.  If your query is "select * from foo", then change it to
"EXPLAIN ANALYZE select * from foo" and post the output.  It will look
something like this:
QUERY PLAN


---
 Seq Scan on foo  (cost=0.00..1.04 rows=4 width=44) (actual time=8.46..8.47
rows=4 loops=1)
 Total runtime: 19.63 msec
(2 rows)

I'm sure your data is confidential; mine is too.  The good news is that none
of your data is included in the query. Only technical details about what the
database is doing.

If your problem might involve the application that works with the data, give
some details about that. For example, if you're using a Java application,
let people know what driver version you use, what jvm and other related
info.  There are lurkers on this list using just about every programming
language imaginable on more platforms than you can shake a stick at (I don't
care how good you are at shaking sticks, either).

The more details you give the better help you're going to get and you'd be
amazed at the results I've seen people get with a judicious amount of
tweaking. The other day someone had a query that took hours decrease to less
than 10 minutes by using some techniques prescribed by members on the list.
Bringing 30 - 60 second queries down to 2-3 seconds is commonplace.

You seem to be ready to throw money at the problem by investing in new
hardware but I would suggest digging into the performance problems first.
Too many times we've seen people on the list say, "I've just spent $x0,000
on a new xyz and I'm still having problems with this query."  Often times
the true solution is rewriting queries, tweaking config parameters, adding
RAM and upgrading disks (in that order I believe).

As I found out even today on the SQL list, it's best to ask questions in
this form:
"I want to do this...  I've been trying this...  I'm getting this... which
is problematic because..."

The more clearly you state the abstract goal the more creative answers
you'll get with people often suggesting things you'd never considered.

I hope this helps and I hope that you achieve your goals of a well
performing application. 

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:pgsql-performance-
> [EMAIL PROTECTED] On Behalf Of Bill
> Sent: Tuesday, June 22, 2004 1:31 PM
> To: Josh Berkus
> Cc: [EMAIL PROTECTED]
> Subject: Re: [PERFORM] postgresql and openmosix migration
> 
> Ok, so maybe someone on this group will have a better idea.  We have a
> database of financial information, and this has literally millions of
> entries.  I have installed indicies, but for the rather computationally
> demanding processes we like to use, like a select query to find the
> commodity with the highest monthly or annual returns, the computer
> generally
> runs unacceptably slow.  So, other than clustring, how could I achieve a
> speed increase in these complex queries?  Is this better in mysql or
> postgresql?
> 
> Thanks.


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux

2004-06-02 Thread Matthew Nuzum
On Wed, 2004-06-02 at 17:39, Greg Stark wrote:
> "Matthew Nuzum" <[EMAIL PROTECTED]> writes:
> 
> > I have colinux running on a Fedora Core 1 image.  I have the rhdb 3 (or
> > PostgreSQL RedHat Edition 3) on it running.  Here are tests with fsync on
> > and off:
> >    FSYNC OFF          FSYNC ON       RUN
> >  136.9  142.0       124.5  149.1       1
> >  122.1  126.7       140.1  169.7       2
> >  125.7  148.7       147.4  180.4       3
> >  103.3  136.7       136.8  166.3       4
> >  126.5  146.1       152.3  187.9       5
> >  114.4  133.3       144.8  176.7       6
> >  124.0  146.5       143.3  175.0       7
> >  121.7  166.8       147.8  180.5       8
> >  127.3  151.8       146.7  180.0       9
> >  124.6  143.0       137.2  167.5      10
> > ----------------------------------------
> >  122.7  144.2       142.1  173.3     AVG
> > 
> > I hope those numbers' formatting come through all right.  
> 
> No, they didn't. You used tabs? Are they four space tabs or 8 space tabs?
> I assume 4 space tabs, but then what is the meaning of the four columns?
> You have two columns for each fsync setting? One's under Windows and one's
> under Vmware? Which is which?
> 
Sorry that wasn't clear.  The pgbench program puts out two numbers; I
can't remember exactly what they are, but I think one of them includes the
time to make the connection.  Therefore, the first two columns represent
the two values reported by pgbench with FSYNC off, and the second two
columns are those same two figures with FSYNC ON.  The 5th column is the
run.  I did 10 runs and included the output of all of them so that, in case
anything significant could be gleaned from the details, the data would
be there.

The executive summary is this:
Tom was curious if colinux might be deceiving the applications that
expect the fsync to occur.  He suspected that pgbench run with and
without fsync enabled might reveal something.  Therefore:
FSYNC ON:  142.1
FSYNC OFF: 122.7

Having FSYNC off seems to yield faster results.

I'd like some input on a more demanding test though, because these tests
run so quickly I can't help but be suspicious of their accuracy.  When
there are two OSs involved, it seems like the momentary activity of a
background process could skew these results.

> > It occurs to me that the fsync may be performed to the linux filesystem, but
> > this filesystem is merely a file on the windows drive.  Would Windows cache
> > this file?  It's 2GB in size, so if it did, it would only be able to cache
> > part of it.
> 
> Well VMWare certainly doesn't know that the linux process called fsync. For
> all it knows the Linux kernel just schedule the i/o because it felt it was
> time.
> 
> So the question is how does VMWare alway handle i/o normally. Does it always
> handle i/o from the Guest OS synchronously or does it buffer it via the
> Host OS's i/o system. 
We probably will never know what the internal workings of VMWare are
like because it is a closed source program.  I'm not slighting them; I
have purchased a license of VMWare and use it for my software testing. 
However, colinux is an open source project and we can easily find out
how they handle this.  I have little interest in this as I use this
merely as a tool to speed up my application development and do not run
any critical services whatsoever.

> 
> I'm actually not sure which it does, it could be doing something strange. But
> does seem most likely that it lets Windows buffer the writes, or does so
> itself. It might also depend on whether you're using raw disks or a virtual
> disk file. Undoable disks would throw another wrench in the works entirely.
In these tests I'm using a virtual disk file.  This is a 2GB file on the
hard drive that linux sees as a disk partition.  Colinux does not
support undoable disks in the way that vmware does.  Their wiki site
does not mention anything tricky being done to force disk writes to
actually be written; the implication therefore is that it leaves the i/o
completely at the discretion of XP.  Also note that XP Pro and 2000 Pro
both offer different caching options for the user to choose, so unless it
does something to actually force a write, the answer is probably "who
knows."

> 
> Note that "caching" isn't really the question. It doesn't have to cache the
> entire 2GB file or even very much of it. It just has to store the block that
> linux wants to write and report success to linux without waiting for the disk
> to report success. Linux will then think the file is sync'd to disk and allow
> postgres to continue with the next transaction without actually waiting for
> the physical disk to spin around to the right place and the head to seek and
&

Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux

2004-06-02 Thread Matthew Nuzum
I have colinux running on a Fedora Core 1 image.  I have the rhdb 3 (or
PostgreSQL RedHat Edition 3) on it running.  Here are tests with fsync on
and off:
   FSYNC OFF          FSYNC ON       RUN
 136.9  142.0       124.5  149.1       1
 122.1  126.7       140.1  169.7       2
 125.7  148.7       147.4  180.4       3
 103.3  136.7       136.8  166.3       4
 126.5  146.1       152.3  187.9       5
 114.4  133.3       144.8  176.7       6
 124.0  146.5       143.3  175.0       7
 121.7  166.8       147.8  180.5       8
 127.3  151.8       146.7  180.0       9
 124.6  143.0       137.2  167.5      10
----------------------------------------
 122.7  144.2       142.1  173.3     AVG

I hope those numbers' formatting comes through all right.

This computer is an AMD Athlon 900MHz with 448MB of RAM running XP Pro SP1.
This is using Colinux 0.60 (not the recently released 0.61) and 96MB of RAM
allocated to linux.

The computer was idle but it was running Putty, Excel and Task Manager
during the process.  (I prefer to use Putty to SSH into the virtual computer
rather than run the fltk console.)

It occurs to me that the fsync may be performed to the linux filesystem, but
this filesystem is merely a file on the windows drive.  Would Windows cache
this file?  It's 2GB in size, so if it did, it would only be able to cache
part of it.

I'd like to run a more difficult test personally.  It seems like this test
goes too fast to be very useful.

If someone would like me to try something more specific, e-mail me right
away and I'll do it.  I must leave my office at 4:15 EDT and will not return
until Friday, although I can do another test on my home computer Thursday.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:pgsql-performance-
> [EMAIL PROTECTED] On Behalf Of Tom Lane
> Sent: Wednesday, June 02, 2004 11:25 AM
> To: Greg Stark
> Cc: Vitaly Belman; [EMAIL PROTECTED]; Bryan Encina; Matthew
> Nuzum
> Subject: Re: [PERFORM] PostgreSQL on VMWare vs Windows vs CoLinux
> 
> Greg Stark <[EMAIL PROTECTED]> writes:
> > That said, I'm curious why the emulated servers performed better than
> the
> > Native Windows port. My first thought is that they probably aren't
> syncing
> > every write to disk so effectively they're defeating the fsyncs,
> allowing the
> > host OS to buffer disk writes.
> 
> It would be fairly easy to check this by repeating the comparisons with
> fsync = off in postgresql.conf.  A performance number that doesn't
> change much would be a smoking gun ;-).
> 
> The native port hasn't had any performance testing done on it yet, and
> I wouldn't be surprised to hear of a gotcha or two.  Perhaps with the
> recent schedule change there will be some time for performance tuning
> before we go beta.
> 
>   regards, tom lane
> 
> ---(end of broadcast)---
> TIP 3: if posting/reading through Usenet, please send an appropriate
>   subscribe-nomail command to [EMAIL PROTECTED] so that your
>   message can get through to the mailing list cleanly


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])


Re: [PERFORM] PostgreSQL caching

2004-05-27 Thread Matthew Nuzum
> 
> Hello Josh,
> 
> JB> Not that you can't improve the query, just that it might not fix
> JB> the problem.
> 
> Yes, I'm aware it might be slower than the Linux version, but then, as
> you said, I still can improve the query (as I did with your help now).
> 
> But true, if there's something awfully wrong with Win32 port
> performance, I might be doing some overwork...
> 
> JB> Therefore ... your detailed feedback is appreciated, especially if you
> can
> JB> compare stuff to the same database running on a Linux, Unix, or BSD
> machine.
> 
> I can't easily install Linux right now.. But I am considering using it
> through VMWare. Do you think it would suffice as a comprasion?
> 
> From what I saw (e.g
> http://usuarios.lycos.es/hernandp/articles/vpcvs.html) the performance
> are bad only when it's coming to graphics, otherwise it looks pretty
> good.
> 
> Regards,
>  Vitaly Belman
> 

An interesting alternative that I've been using lately is colinux
(http://colinux.sf.net).  It lets you run linux in windows and, compared to
vmware, I find it remarkably faster and, when it is idle, less resource
intensive.  I have vmware, but if I'm only going to use a console-based
program, colinux seems to outperform it.  

Note that it may simply be interactive processes that run better because it
has a simpler interface and does not try to emulate the display hardware.
(Therefore no X unless you use vmware)  It seems though that there is less
overhead and if that's the case, then everything should run faster.

Also note that getting it installed is a little more work than vmware.  If
you're running it on a workstation that you use for normal day-to-day tasks
though I think you'll like it because you can detach the terminal and let it
run in the background.  When I do that I often forget it is running because
it produces such a low load on the system.  If you are going to give it a
try, the one trick I used to get things going was to download the newest
beta of winpcap and then the networking came up easily.  Everything else was
a piece of cake.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/



---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints

2004-05-11 Thread Matthew Nuzum

> 
> Jack Orenstein <[EMAIL PROTECTED]> writes:
> > I'm looking at one case in which two successive transactions, each
> > updating a handful of records, take 26 and 18 *seconds* (not msec) to
> > complete. These transactions normally complete in under 30 msec.
...
> None of this is necessarily going to fix matters for an installation
> that has no spare I/O capacity, though.  And from the numbers you're
> quoting I fear you may be in that category.  "Buy faster disks" may
> be the only answer ...
> 

I had a computer once that had an out-of-the-box hard drive configuration
that provided horrible disk performance.  I found a tutorial at O'Reilly
that explained how to use hdparm to dramatically speed up disk performance
on Linux.  I've noticed on other computers I've set up recently that hdparm
seems to be used by default out of the box to give good performance.

Maybe your computer is using all of its I/O capacity because it's using PIO
mode or some other non-optimal method of accessing the disk.

Just a suggestion, I hope it helps,

Matthew Nuzum   | ISPs: Make $200 - $5,000 per referral by
www.followers.net   | recommending Elite CMS to your customers!
[EMAIL PROTECTED]   | http://www.followers.net/isp


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Extreme high load averages

2003-07-07 Thread Matthew Nuzum
> A common problem is a table like this:
> 
> create table test (info text, id int8 primary key);
> insert into test values ('ted',1);
> .. a few thousand more inserts;
> vacuum full;
> analyze;
> select * from test where id=1;
> 
> will result in a seq scan, always, because the 1 by itself is
> autoconverted to int4, which doesn't match int8 automatically.  This
> query:
> 
> select * from test where id=1::int8
> 
> will cast the 1 to an int8 so the index can be used.
> 
> 

Hey Scott, this is a little scary because I probably have a lot of this
going on...

Is there a way to log something so that after a day or so I can go back and
look for things like this that would be good candidates for optimization?

I've got fast enough servers that currently the impact of this problem might
not be too obvious, but I suspect that after the server gets loaded up the
impact will become more of a problem.

By the way, I must say that this thread has been very useful.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/


---(end of broadcast)---
TIP 9: the planner will ignore your desire to choose an index scan if your
  joining column's datatypes do not match


Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-07 Thread Matthew Nuzum

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:pgsql-performance-
> [EMAIL PROTECTED] On Behalf Of Andrew Sullivan
> Sent: Monday, July 07, 2003 5:23 AM
> To: [EMAIL PROTECTED]
> Subject: Re: [PERFORM] Moving postgresql.conf tunables into 2003...
> 
> On Sat, Jul 05, 2003 at 02:12:56PM -0700, Sean Chittenden wrote:
> > The SGML docs aren't in the DBA's face and are way out of the way for
> > DBAs rolling out a new system or who are tuning the system.  SGML ==
> > Developer, conf == DBA.
> 
> I could not disagree more.  I'd say more like, if the dba won't read
> the manual, get yourself a real dba.  Sorry, but so-called
> professionals who won't learn their tools have no home in my shop.
> 

I don't want to come off as confrontational, so please don't take this as an
attack.

Are you willing to say that the PostgreSQL database system should only be
used by DBAs?  I believe that Postgres is such a good and useful tool that
anyone should be able to start using it with little or no barrier to entry.

I don't believe I'm alone in this opinion either.  As a matter of fact, this
philosophy is being adopted by many in the software industry.  Note that
Linux and many other OSs that act as servers are being made more secure and
easier to use __out of the box__ so that a person can simply install from CD
and start using the tool without too much difficulty.

Maybe your definition of "dba" is broader than mine and what you mean is,
"someone who installs a postgres database".  Also, by manual, are you
referring to the 213 page Administration guide, or are you talking about the
340 page Reference Manual?  Let us rephrase your statement like this: "If
the [person who installs a postgres database] won't read the [340 page
reference] manual, then that person should go find a different database to
use."

I think that the postgres installation procedure, .conf files and
documentation can be modified in such a way that a newbie (we were all
newbies once) can have a good "out of box experience" with little effort.
That means they can __quickly__ get a __good performing__ database up and
running with __little effort__ and without needing to subscribe to a mailing
list or read a book.

I have seen software projects that have what I call an "elitist" attitude;
meaning they expect you to be an expert or dedicated to their software in
order to use it.  Invariably this mentality stifles the usefulness of the
product.  It seems that there is a relative minority of people on this list
who feel that you have to be "elite" in order to have a good working
postgres installation.  I don't feel that should be a requirement or even a
consideration.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faqs/FAQ.html


Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-06 Thread Matthew Nuzum
> > This sort of narrative belongs in the SGML docs, not in a CONF file.
> > In fact, one could argue that we should take *all* commentary out of
> > the CONF file in order to force people to read the docs.
> 
> The SGML docs aren't in the DBA's face and are way out of the way for
> DBAs rolling out a new system or who are tuning the system.  SGML ==
> Developer, conf == DBA.
> 
> > Database performance tuning will always be a "black art," as it
> > necessitates a broad knowledge of PostgreSQL, OS architecture, and
> > computer hardware.  So I doubt that we can post docs that would
> > allow any 10% time DBA to make PostgreSQL "fly", but hopefully over
> > the next year we can make enough knowledge public to allow anyone to
> > make PostgreSQL "sprint".
> 
> I'm highly resistant to/disappointed in this attitude and firmly
> believe that there are well understood algorithms that DBAs use to
> diagnose and solve performance problems.  It's only a black art
> because it hasn't been documented.  Performance tuning isn't voodoo,
> it's adjusting constraints to align with the execution of applications
> and we know what the applications do, therefore the database can mold
> to the applications' needs.  

I agree.

We often seem to forget simple lessons in human nature.  Expecting someone
to spend 20 extra seconds to do something is often too much.  In many cases,
the only "manual" that a person will see is the .conf files.

At the very least, if there is good documentation for these parameters,
maybe the conf file should provide a link to this info. 

About the documentation...  The few times I've tried reading these sections
of the docs it was like reading a dictionary.

Bruce's book has a much better writing style because it starts out with a
basic concept and then expands on it, sometimes several times until a
thorough (but not exhaustive) example has been given.

The exhaustive material in the docs is good when you know what you're
looking for, and therefore is a critical piece of reference work.  I don't
want to belittle the authors of that material in any way.  An illustration
of this would be to compare the O'Reilly "... Nutshell" book series to
something like the [fictitious] book "Learn PostgreSQL in 24 hours".

To close this message, I would just like to add that one of the most
successful open source projects of all time could be used as an example.
The Apache httpd project is one of the few open source projects in wide
spread use that holds more market share than all competing products
combined.

It uses a three-phase (if not more) documentation approach.  The .conf file
contains detailed instructions in an easy-to-read and not-too-jargon-ish
structure.  The docs provide detailed tutorials and papers that expand on
configuration params in an easy-to-read format.  Both of these refer to the
thorough reference manual that breaks each possible option down into its
nitty-gritty details so that a user can get more information if they so
desire.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


Re: [PERFORM] Moving postgresql.conf tunables into 2003...

2003-07-06 Thread Matthew Nuzum
> 
> Brian Suggests:
> > I'm curious how many of the configuration values can be determined
> > automatically, or with the help of some script.  It seem like there
> > could be some perl script in contrib that could help figure this out.
> > Possibly you are asked a bunch of questions and then the values are
> > computed based on that.   Something like:
> 
> This would be great!  Wanna be in charge of it?
> 

Is there a to-do list for this kind of stuff?  Maybe there could be a "help
wanted" sign on the website.  Seems like there are lot's of good ideas that
fly around here but never get followed up on.

Additionally, I have an increasingly large production database that I would
be willing to do some test-cases on.  I don't really know how to do it
though... If someone were able to give instructions, I could run tests on
three different platforms.

Matthew Nuzum   | Makers of "Elite Content Management System"
www.followers.net   | View samples of Elite CMS in action
[EMAIL PROTECTED]   | http://www.followers.net/portfolio/


---(end of broadcast)---
TIP 1: subscribe and unsubscribe commands go to [EMAIL PROTECTED]