Re: [GENERAL] Solid State Drives with PG

2010-04-07 Thread Gordan Bobic

John R Pierce wrote:

all the enterprise SAN guys I've talked with say the Intel x25 drives 
are consumer junk, about the only thing they will use is STEC Zeus, and 
even then they mirror them.


A couple of points there.

1) Mirroring flash drives is a bit ill-advised, since flash has a rather 
predictable long-term wear-out failure point - two identical drives under 
the same write load will tend to wear out at about the same time. It would 
make more sense to mirror with a mechanical disk and use the SSD for reads, 
with some clever firmware to buffer up the extra writes to the mechanical 
disk and return completed status as soon as the data has been committed to 
the faster flash disk.


2) How much of that dislike of Intel is actually justified by something 
other than the margins offered / procurement policy (a.k.a. buying from 
the vendor that sends you the best present rather than from the vendor 
that has the best product)? Intel X25-E drives have write endurance, 
performance and power consumption (150mW TDP!) that are at least as good 
as other enterprise-grade drives. Most enterprise-grade drives don't 
even have TRIM support yet. I wouldn't knock the Intel drives without 
having tried them. Also bear in mind that Intel X25-E drives have high 
street prices similar to similarly sized 15,000rpm mechanical drives you 
might buy from a SAN vendor (of course, the same drives without the 
re-badge can be had for a fraction of the price).


Then again, I never did have a very high opinion of big name SAN vendor 
hardware - I have always achieved better results at a fraction of the 
cost with appliances I've built myself.


Gordan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Does anyone use in ram postgres database?

2010-03-26 Thread Gordan Bobic

Alan McKay wrote:

On Thu, Mar 25, 2010 at 4:15 PM, Scott Marlowe  wrote:

These questions always get the first question back, what are you
trying to accomplish?  Different objectives will have different
answers.


We have a real-time application that processes data as it comes in.
Doing some simple math tells us that a disk-based DB cannot possibly
perform fast enough to allow us to process the data.


Now, if your pg_xlog directory is a problem, then you either need
bigger faster hard drives, or your data is more transient in nature
and you can recreate it and you put the whole db into RAM.


When we only saw a 3x improvement in speed with the RAM based DB, we
were still seeing a fair bit of disk activity but were not sure what
was going on.  Then we thought about pg_xlog and moved it to RAM as
well, but as I recall still not a great improvement.


Have you considered using one of these:
http://www.acard.com/english/fb01-product.jsp?idno_no=270&prod_no=ANS-9010&type1_title=Solid State Drive&type1_idno=13


Gordan

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


[GENERAL] Statement Triggers

2010-03-10 Thread Gordan Bobic
Hi,

Can anyone point me at a comprehensive example of statement (as opposed to
row) triggers? I've googled it and looked through the documentation, but
couldn't find a complete example relevant to what I'm trying to do.

Specifically, what features of the SQL statement that triggered the event
are available to the function invoked by the trigger? Say I wanted to write
all INSERT statements executed on a table into a log file. How would I
access the original statement in the triggered function? If I cannot access
the statement itself, what information is available for statement based
triggers?
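
From what I can tell so far, a statement-level trigger would look something
like this (the table and function names here are made up), and the function
does not seem to receive the original SQL text at all - only variables such
as TG_OP, TG_TABLE_NAME, TG_WHEN and TG_LEVEL are available, and NEW/OLD are
not defined at statement level:

CREATE TABLE insert_log (
    logged_at   timestamptz DEFAULT now(),
    op          text,
    tbl         text
);

-- The function body can only log what the TG_* variables expose.
CREATE OR REPLACE FUNCTION log_insert_statement() RETURNS trigger AS $$
BEGIN
    INSERT INTO insert_log (op, tbl) VALUES (TG_OP, TG_TABLE_NAME);
    RETURN NULL;  -- the return value is ignored for statement triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER mytable_insert_stmt
    AFTER INSERT ON mytable
    FOR EACH STATEMENT
    EXECUTE PROCEDURE log_insert_statement();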

Thanks.

Gordan

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Replication Using Triggers

2008-01-19 Thread Gordan Bobic

Scott Marlowe wrote:

On Jan 19, 2008 6:14 PM, Gordan Bobic <[EMAIL PROTECTED]> wrote:

Gregory Youngblood wrote:

On Sat, 2008-01-19 at 23:46 +, Gordan Bobic wrote:

David Fetter wrote:

In that case, use one of the existing solutions.  They're all way
easier than re-inventing the wheel.

Existing solutions can't handle multiple masters. MySQL can do it at
least in a ring arrangement.


What about pgcluster? It's supposed to be able to provide synchronous
multi-master replication for postgresql.

I looked at that, too, but it wasn't really a "cluster of equal peers"
sort of solution, which is what I am after. Still, thanks for pointing
it out.


Oh, and there's this too:

Cybertec sync-multi-master
http://www.postgresql.org/about/news.752


The design of that seems suspiciously similar to pgcluster with separate 
load balancer and replicator servers.


Gordan

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Replication Using Triggers

2008-01-19 Thread Gordan Bobic

Gregory Youngblood wrote:

On Sat, 2008-01-19 at 23:46 +, Gordan Bobic wrote:

David Fetter wrote:
> In that case, use one of the existing solutions.  They're all way
> easier than re-inventing the wheel.

Existing solutions can't handle multiple masters. MySQL can do it at 
least in a ring arrangement.


What about pgcluster? It's supposed to be able to provide synchronous 
multi-master replication for postgresql.


I looked at that, too, but it wasn't really a "cluster of equal peers" 
sort of solution, which is what I am after. Still, thanks for pointing 
it out.


Gordan

---(end of broadcast)---
TIP 5: don't forget to increase your free space map settings


Re: [GENERAL] Replication Using Triggers

2008-01-19 Thread Gordan Bobic

David Fetter wrote:


That's just it - I don't think any user-land libraries would
actually be required. One of the supposed big advantages of MySQL is
its straightforward replication support. It's quite painful to
see PostgreSQL suffer purely for lack of marketing in
this department. :-(

The "straigtforward" replication support in MySQL is seriously
broken.

I am not arguing that it isn't! :-) I am merely trying to implement
something at least as good (or rather, no more broken) for
PostgreSQL with a minimum of effort.


In that case, use one of the existing solutions.  They're all way
easier than re-inventing the wheel.


Existing solutions can't handle multiple masters. MySQL can do it at 
least in a ring arrangement.


Gordan

---(end of broadcast)---
TIP 4: Have you searched our list archives?

  http://archives.postgresql.org/


Re: [GENERAL] Replication Using Triggers

2008-01-18 Thread Gordan Bobic

Andreas 'ads' Scherbaum wrote:

Have a plperl function that creates connections to all servers in the 
cluster (replication partners), and issues the supplied write query to 
them, possibly with a tag of some sort to indicate it is a replicated 
query (to prevent circular replication).


Have a post-execute trigger that calls the above replication function if 
the query was issued directly (as opposed to replicated), and passes it 
the query it just executed if it was successful.
Not sure here if you mean literally the SQL query that was executed - in 
which case you have all sorts of problems with sequences and functions 
returning different values.
Indeed, but sequences at least can be worked around. Post-execute, the 
sequence number used should be available already, and the sequence offset 
and increment can be set so as to ensure they do not clash. That's what 
MySQL does (and I must apologize for making the comparison all the time).
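
As a rough illustration of that (assuming a hypothetical two-node setup and
a sequence called mytable_id_seq), the sequences could be staggered so that
node A generates 1, 3, 5, ... and node B generates 2, 4, 6, ...:

-- On node A:
ALTER SEQUENCE mytable_id_seq INCREMENT BY 2 RESTART WITH 1;
-- On node B:
ALTER SEQUENCE mytable_id_seq INCREMENT BY 2 RESTART WITH 2;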


Sequences are only one (small) problem. What about functions returning
different results (volatile) for each call? Just imagine random() or
now().


Yes, that's a problem. The bodge workaround for that is to save the 
master's state for such functions and re-pack it from a function into a 
literal in a pre-execution trigger, and then replicate the literals.



What about inserts or updates selecting parts of table data? You
can't be sure to get exactly the same results on the slave.


You can if you have an ordering consistency check mechanism, as I 
mentioned in the other mail. Recovery when "something goes wrong" (tm), 
however, could get interesting, especially under heavy distributed write 
load. If there's a counter and a hash, I guess you could lock 
everything, find the one with the biggest counter, and release the lock 
on everything else until it catches up, then re-lock, then replicate. It 
would add a fair bit of latency, though.


Gordan

---(end of broadcast)---
TIP 6: explain analyze is your friend


Re: [GENERAL] Replication Using Triggers

2008-01-18 Thread Gordan Bobic

Andrew Sullivan wrote:

On Fri, Jan 18, 2008 at 04:09:45PM +, [EMAIL PROTECTED] wrote:
That's just it - I don't think any user-land libraries would actually be 
required. One of the supposed big advantages of MySQL is its straightforward 
replication support. It's quite painful to see PostgreSQL suffer purely 
for lack of marketing in this department. :-(


The "straigtforward" replication support in MySQL is seriously broken.


I am not arguing that it isn't! :-)
I am merely trying to implement something at least as good (or rather, 
no more broken) for PostgreSQL with a minimum of effort.



We
(by which I really mean "Jan") spent a great deal of time on the design of
Slony (and its add-on nature is a feature, not a bug -- one thing it can do
is cross-version upgrades on PostgreSQL versions that were out before Slony
was finished being designed) to avoid several nasty corner cases that are
sort of waved aside in the MySQL documentation.  Designing a replication
system that works well 80% of the time is a waste of effort, because the
times when you really need it are all already in that 20% of cases that you
won't cover with the simple-minded solution.


Specifically,

1) That's what MySQL does (it either ignores errors or stops replication 
on encountering an error - which of the two it does is selectable, but 
that's about it).


That's got to be _the_ most brain-dead approach to replication I've ever
heard.  It chooses the two least good of all possible worlds, and when you
get into your particular version of hell at 0-dark:30, you have to spend
some time first figuring out which hell you happen to be in.


I couldn't agree more. But I don't see another multi-master replication 
solution on the horizon.



In any case,

fire-and-forget asynchronous replication a la MySQL. Having a choice 
between transactions and speed is good. :-)


if this is what you believe, then you don't need a database to store your
data anyway.  I can make your data system faster by storing all your data on
/dev/null.  Writes will be very fast indeed.


Fantastically put. :-)

But in the meantime, until a better multi-master replication solution 
becomes available, I think I'll stick with the current plan.


I suppose some kind of a write counter with a rolling write query hash 
could be implemented. The replicator function would issue locks and compare 
the counters/hashes to establish whether the state is consistent on all 
nodes before a write query is replicated. It's a kludge, and a horrible one 
at that, and it will slow down writes under load, but I think it would 
work for ensuring ordering consistency with non-commutative write 
operations.
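
Purely as a sketch of the idea (all names made up), the per-node state could
be as simple as:

CREATE TABLE repl_state (
    node_name     text PRIMARY KEY,
    write_counter bigint NOT NULL,  -- number of writes applied so far
    rolling_hash  text NOT NULL     -- hash over the ordered write history
);

-- Before replicating a write, check that every node reports the same state:
SELECT count(DISTINCT (write_counter, rolling_hash)) = 1 AS consistent
FROM repl_state;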


Gordan

---(end of broadcast)---
TIP 9: In versions below 8.0, the planner will ignore your desire to
  choose an index scan if your joining column's datatypes do not
  match


Re: [GENERAL] Drivers for Other Languages

2005-03-30 Thread Gordan Bobic
Mike Rylander wrote:
On Wed, 30 Mar 2005 12:07:06 +0100, Gordan Bobic <[EMAIL PROTECTED]> wrote:
Hi,
How difficult is it to write a driver for pgsql (via network or UNIX
domain sockets) for an as yet unsupported language?
Specifically, I'd like a driver for JavaScript, for use with Mozilla
JSLib/XPCShell.
I presume there isn't one already, so I guess I'll have to write one.
So, where can I find the specification for the protocol that I am going
to have to talk to the socket?

Actually, there is an SQL XPCOM extension for Mozilla.  A search for
"sql" at mozilla.org turned up http://www.mozilla.org/projects/sql/ .
Wow! Thanks! That's exactly what I was looking for. :-)
It turns out that this extension *only* supports postgres as of now,
with plans for Someone-elsesSQL ;) support later.
Why would I want to use that? ;-)
Gordan
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[GENERAL] Drivers for Other Languages

2005-03-30 Thread Gordan Bobic
Hi,
How difficult is it to write a driver for pgsql (via network or UNIX 
domain sockets) for an as yet unsupported language?

Specifically, I'd like a driver for JavaScript, for use with Mozilla 
JSLib/XPCShell.

I presume there isn't one already, so I guess I'll have to write one. 
So, where can I find the specification for the protocol that I am going 
to have to talk to the socket?

Many thanks.
Gordan
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [GENERAL] Triggers and User Defined Trigger Functions

2005-03-09 Thread Gordan Bobic
Csaba Nagy wrote:
DELETE FROM Temp1 WHERE Test = 'test3';
ERROR:  syntax error at or near "$2" at character 44
QUERY:  INSERT INTO Temp2 (ID, test) VALUES ( $1   $2 )
CONTEXT:  PL/pgSQL function "temp1_trigger_delete" line 2 at SQL statement
LINE 1: INSERT INTO Temp2 (ID, test) VALUES ( $1   $2 )
  ^^^
What did I miss?

A comma in the indicated position I guess...
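
In other words, the generated statement just needs a comma between the two
values; the trigger body line presumably wants to read something like this
(column names taken from the error message above):

INSERT INTO Temp2 (ID, test) VALUES (OLD.ID, OLD.test);
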
Thanks. I'm feeling really stupid now. You may all mock me. :-)
Thanks for your help, it's most appreciated. :-)
Gordan
---(end of broadcast)---
TIP 6: Have you searched our list archives?
  http://archives.postgresql.org


Re: [GENERAL] Triggers and User Defined Trigger Functions

2005-03-09 Thread Gordan Bobic
Richard Huxton wrote:
Gordan Bobic wrote:
Hi,
I'm trying to figure out how to do this from the documentation, but I 
can't figure it out. :-(

Here is what I'm trying to do:
CREATE TABLE MyTable
(
ID  bigserial unique,
MyData  char(255),
PRIMARY KEY (ID)
);
CREATE TABLE Archive_MyTable
(
ID  bigserial unique,
MyData  char(255),
PRIMARY KEY (ID)
);
CREATE FUNCTION MyTable_Trigger_DELETE()
RETURNS ???opaque/trigger/HeapTuple??? AS '

RETURNS TRIGGER
You can't use SQL as the target language, it has to be one of the 
procedural languages (e.g. plpgsql)

Something like:
CREATE FUNCTION my_trig_fn() RETURNS trigger AS '
BEGIN
  INSERT INTO archive_mytable (id,mydata) VALUES (OLD.id, OLD.mydata);
  RETURN OLD;
END;
' LANGUAGE plpgsql;
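
The trigger itself is then created exactly as in the original post, just
pointing at this function:

CREATE TRIGGER mytable_trigger_delete
    BEFORE DELETE ON mytable
    FOR EACH ROW
    EXECUTE PROCEDURE my_trig_fn();
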
Thanks. :-)
I did that, and I can now create the function and the trigger OK. But 
when the trigger fires (i.e. on DELETE), I get the following error:

DELETE FROM Temp1 WHERE Test = 'test3';
ERROR:  syntax error at or near "$2" at character 44
QUERY:  INSERT INTO Temp2 (ID, test) VALUES ( $1   $2 )
CONTEXT:  PL/pgSQL function "temp1_trigger_delete" line 2 at SQL statement
LINE 1: INSERT INTO Temp2 (ID, test) VALUES ( $1   $2 )
What did I miss?
Gordan
---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster


[GENERAL] Triggers and User Defined Trigger Functions

2005-03-09 Thread Gordan Bobic
Hi,
I'm trying to figure out how to do this from the documentation, but I 
can't figure it out. :-(

Here is what I'm trying to do:
CREATE TABLE MyTable
(
ID  bigserial unique,
MyData  char(255),
PRIMARY KEY (ID)
);
CREATE TABLE Archive_MyTable
(
ID  bigserial unique,
MyData  char(255),
PRIMARY KEY (ID)
);
CREATE FUNCTION MyTable_Trigger_DELETE()
RETURNS ???opaque/trigger/HeapTuple??? AS '
INSERT INTO Archive_MyTable
(
ID,
MyData
)
VALUES
(
OLD.ID,
OLD.MyData
);
RETURN OLD;
' LANGUAGE SQL;
This gives me one of the following errors:
ERROR:  SQL functions cannot return type opaque
ERROR:  SQL functions cannot return type "trigger"
ERROR:  type "heaptuple" does not exist
What type should my function be returning?
Then I'd like to do the following:
CREATE TRIGGER MyTable_Trigger_DELETE BEFORE DELETE ON MyTable
FOR EACH ROW
EXECUTE PROCEDURE MyTable_Trigger_DELETE();
Can I create a trigger function like this? If not, what are my options 
WRT alternatives?

Many thanks.
Gordan
---(end of broadcast)---
TIP 7: don't forget to increase your free space map settings


[GENERAL] FTI - unique look-up records, and no sub-words?

2001-10-18 Thread Gordan Bobic

Hi.

After looking again at the other email I've sent earlier, I realized that it 
goes on for far too long, so I'll try to summarize my question more briefly.

1) How can FTI be made to NOT break up words into sub-strings? Most of those 
are likely to be useless in my particular application. In fact, this feature 
is why my FTI queries run slower on certain strings than sequential ILIKE 
scans.

1.1) How do I make FTI only separate the text string into words on 
non-alphanumeric characters (i.e. [^a-zA-Z0-9], I think...). Is that 
what it already does?

2) How can I make inserts work when enforcing unique records in the FTI 
table? There is hardly any need for duplicates in the FTI table...

At the moment, the search for a particular string can take significantly 
longer with FTI than using the plain ILIKE. FTI also returns 30K matches in 
the FTI table for only 10K of records in the master table, which is clearly 
not very useful. Doing a DISTINCT on the FTI table with that string will 
yield 10K records, indicating that there are about 3 times more records in 
the FTI table which are just duplicates taking up space.

I am NOT prepared to capitulate and use something like MS SQL instead. How 
can I get this to work with PostgreSQL? Who maintains the FTI contrib?

Kind regards.

Gordan

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [GENERAL] FTI Queries and Explain (long)

2001-10-18 Thread Gordan Bobic

[Broken SQL instead of performance issue fixed]

It would appear that when I define the index on the FTI table (string and 
oid) to be unique (which makes sense, since there is little point in having 
duplicate rows in this case), a lot of inserts fail where they shouldn't. I 
am guessing that if the insert into the look-up FTI table fails, the insert 
into the master table fails as well.

I can understand that this might be useful for matches where the number of 
occurrences is important, but in this particular application, that is not the 
case. Before I go and look into modifying the fti function code for my 
specific purpose, it would be nice to have a confirmation of this behaviour - 
otherwise it may take me a while to find what I'm looking for. ;-)

Another question - there are (as often happens) multiple ways of doing what I 
want in SQL, but I am not sure what is the fastest and most efficient way of 
doing it (in theory at least).

I want to do a multi-criterion search on the same field (the FTI indexed 
one), and have separate AND, NOT and OR search terms.
AND = "terms that must occur in the text"
OR = "terms of which at least one has to occur in the text"
NOT = "terms which must not occur in the text"

Initially, before FTI, I used a big ILIKE query which worked reasonably well. 

I should point out that my test bed machine for this is a Pentium 100 MHz 
with 128 MB of RAM and an IDE disk. My database is expected to be around 
50K-100K records, and about 100-200 MB on disk in PostgreSQL files (that's 
what the disk consumption of the vacuumed database was before FTI).

Using the same example data set as before, the query was something like:

SELECT  *
FROM    Jobs
WHERE   (
        Description ILIKE '%AndTerm1%'  AND
        Description ILIKE '%AndTerm2%'  AND
        Description ILIKE '%AndTerm3%'
        ...
        )
        AND
        (
        Description ILIKE '%OrTerm1%'   OR
        Description ILIKE '%OrTerm2%'   OR
        Description ILIKE '%OrTerm3%'
        ...
        )
        AND
        (
        Description NOT ILIKE '%NotTerm1%'   AND
        Description NOT ILIKE '%NotTerm2%'   AND
        Description NOT ILIKE '%NotTerm3%'
        ...
        )

This usually returned the required data within 30 seconds or so, after, 
obviously, doing a sequential search through the database due to the 
non-anchored ILIKE match.

After implementing FTI, the insertion speed has gone through the floor (as 
expected), but the select speed doesn't seem to be that much greater, even 
when using the index (string, oid) on the FTI look-up table. On simple 
queries that only require one or two terms there is a big speed improvement, 
but on queries with three or more terms, the improvement is not that great.

The queries typically return 10 - 200 rows (give or take a bit, depending on 
the specific query).

The queries I am using at the moment to replace the above ILIKE solution are 
in the form

SELECT  Jobs.*
FROM    Jobs,
        Jobs_Description_FTI
WHERE   Jobs_Description_FTI.string = $And/Or/NotTerms[$i]  AND
        Jobs_Description_FTI.id = Jobs.oid

The AND queries are INTERSECTed together, the OR queries are UNIONed 
together, both are then UNIONed, and finally the NOT queries are EXCEPTed.
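
In other words, the combined query ends up shaped roughly like this (the
search terms are placeholders, and further terms of each kind chain on in
the same way):

(
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'andterm1'  AND
            Jobs_Description_FTI.id = Jobs.oid
    INTERSECT
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'andterm2'  AND
            Jobs_Description_FTI.id = Jobs.oid
)
UNION
(
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'orterm1'  AND
            Jobs_Description_FTI.id = Jobs.oid
    UNION
    SELECT  Jobs.*
    FROM    Jobs, Jobs_Description_FTI
    WHERE   Jobs_Description_FTI.string = 'orterm2'  AND
            Jobs_Description_FTI.id = Jobs.oid
)
EXCEPT
SELECT  Jobs.*
FROM    Jobs, Jobs_Description_FTI
WHERE   Jobs_Description_FTI.string = 'notterm1'  AND
        Jobs_Description_FTI.id = Jobs.oid;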

In some cases, this has yielded a significant improvement in performance, as 
Tom suggested (thanks for that, it was much appreciated). Sometimes, however, 
things go the other way.

To cut a long story short, I seem to have tracked the problem down to a 
certain situation.

If there are, say, 10K records in the master table, there are about 4M records 
in the lookup table. This in itself isn't an issue. Queries that return small 
numbers of records, e.g.

SELECT  count(*)
FROM    Jobs_Description_FTI
WHERE   string = 'linux'

(returns ~300 rows)

happen more or less instantaneously.

However, a very similar query such as:

SELECT  count(*)
FROM    Jobs_Description_FTI
WHERE   string = 'nt'

(returns ~30K rows)

takes around two-three minutes.

I tried doing a

SELECT  count(*)
FROM    Jobs
WHERE   Description ILIKE '%nt%'

(returns 11K records out of 12K)

and that only takes about 10 seconds or so.

SELECT  count(*)
FROM    Jobs
WHERE   Description ILIKE '% nt %'

returns ~800 records out of 12K, which is much more reasonable.

Ideally, that should be

SELECT  count(*)
FROM    Jobs
WHERE   Description ~* '.*[^a-z]nt[^a-z].*'

or something like that, which yields a similar number of records to the 
previous query, but is slower.

I am fully aware that this is fairly normal under the circumstances, but I 
need a way of defeating this performance issue. The only way of doing that 
that I can see at the moment is to:

1) Modify the FTI function to split the text field only at non-alphanumeric 
characters, and only return whole words, rather than substrings of words.

2) Allow the insert into master table to s

Re: [GENERAL] Any Good Way To Do Sync DB's?

2001-10-15 Thread Gordan Bobic

On Monday 15 Oct 2001 13:35, Joseph Koenig wrote:
> Your solution sounds very interesting (Not the throw away NT
> part...)

That is where a significant part of the performance improvement would come 
from, if performance was what you were after...

> ...does anyone else have any input on this? Would it work well?

It works great - I am using it on a number of servers. It also gives you a 
much more secure setup, because you can pipe everything through a single 
firewall with a Squid accelerating proxy. That means that a hacker would have 
to breach your firewall before they would have even a remote opportunity to 
penetrate deeper into your network and attack the web server.

> Any idea as to what amount of traffic it would be capable of handling?

Any amount of traffic; it just means that only two connections would be 
served at any one time. If your query return and CGI page construction times 
are very low (which they should be if you are using mod_perl on a decent 
server), then you are unlikely to notice any slowdown. The excess hits would 
just get queued and processed when back-ends become available. It would also 
be a good idea to look through your scripts and attempt to connect to the 
database via Perl's DBI as late as possible (not at the beginning of the 
script as a global variable, but just before the query is issued), and 
disconnect as early as possible (as soon as the query is executed and the 
data retrieved). This would not hog the connections to the PostgreSQL end 
as much.

Of course, this is all provided you use perl CGIs and mod_perl on Apache. 
Otherwise, persistent connections don't come into the picture at all.

> If apache is only running in two instances, would that really keep the
> number of ODBC connections to 2 at max?

No, that would only keep the number of your connections from the WEB SERVER 
to two max. ODBC connections from the rest of your network would be a 
completely separate issue. Basically, if Apache only ever has two instances 
running, then it can only ever serve up to 2 requests at any one time, and 
hence the database server will never receive more than 2 simultaneous requests 
from the web server - UNLESS your CGIs make multiple database connections in 
parallel. If they do, then you're fighting a losing battle, and you might as 
well give up.

If your CGIs only ever use one connection, then putting the web server behind 
an accelerating squid proxy would actually help further, by not using the 
Apache back ends to serve static documents, such as frames pages, gif/jpg 
background or button images, etc. This means that only the actual CGI 
requests would go to Apache. I have implemented this sort of a load-reduction 
solution with a reduced number of active Apache servers in my last 
consultancy contract, and it worked very well.

> By the way, I fully agree with throwing away NT --

Understandable...

> unfortunately this DB is part of their inventory and
> POS system that another company set up for them. They just finished
> investing about $200K in the system, so just telling them to get rid of
> it isn't quite an answer. Thanks for all of the input from everyone.

Hang on - if they have just invested $200K into this "solution", then why 
exactly is an additional $5,500 for another licence a problem all of a 
sudden???

Regards.

Gordan

---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [GENERAL] Any Good Way To Do Sync DB's?

2001-10-12 Thread Gordan Bobic

On 12 Oct 2001, Doug McNaught wrote:

> Joseph Koenig <[EMAIL PROTECTED]> writes:
>
> > I have a project where a client has products stored in a large Progress
> > DB on an NT server. The web server is a FreeBSD box though, and the
> > client wants to try to avoid the $5,500 license for the Unlimited
> > Connections via OpenLink software and would like to take advantage of
> > the 'free' non-expiring 2 connection (concurrent) license. This wouldn't
> > be a huge problem, but the DB can easily reach 1 million records. Is
> > there any good way to pull this data out of Progess and get it into
> > Postgres? This is way too large of a db to do a "SELECT * FROM table"
> > and do an insert for each row. Any brilliant ideas? Thanks,
>
> Probably the best thing to do is to export the data from Progress in a
> format that the PostgreSQL COPY command can read.  See the docs for
> details.

I'm going to have to rant now. The "dump" and "restore" which use the COPY
method are actually totally useless for large databases. The reason for
this is simple: copying a 4 GB table with 40M rows requires over 40GB of
temporary scratch space, due to the WAL temp files. That sounds
totally silly. Why doesn't pg_dump insert commits every 1000 rows or so???
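
One way around that (just a sketch, assuming the dump for the big table has
been split into server-readable chunk files) is to load it in separately
committed batches rather than in one huge COPY:

-- Hypothetical chunk files; each COPY commits on its own, so the amount of
-- scratch space needed at any one time stays bounded.
BEGIN;
COPY bigtable FROM '/tmp/bigtable_chunk_0001.copy';
COMMIT;
BEGIN;
COPY bigtable FROM '/tmp/bigtable_chunk_0002.copy';
COMMIT;
-- ...and so on for the remaining chunks.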

Cheers.

Gordan


---(end of broadcast)---
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to [EMAIL PROTECTED])



Re: [GENERAL] PostgreSQL Replication Server?

2001-06-07 Thread Gordan Bobic

Or try:

http://pgreplicator.sourceforge.net/

Haven't used it myself yet, but it looks pretty good...

> > Now, erserver seems to work, but it needs a bit of hacking around that I
> > hadn't done yet. Maybe when I get it working I'll see to writing
> > something. In the mean time, source code is the only thing that can help
> > you.
>
> I forgot to explain: erserver works well "out of the box" for one
> master / one slave setup. If you need more than one slave, you need some
> hacking around. If you need more than one master, I doubt it'll work in
> its current form.
>
> --
> Alvaro Herrera ()
>



---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



Re: [GENERAL] Text data type doesn't accept newlines?

2001-06-07 Thread Gordan Bobic

Not sure, but the syntax is as I described below. Try checking the
perl DBD::Pg documentation. I think that's where I read about it
originally, many moons ago.

> Just checked the Pg docs, don't see a quote function. What is it part of?
>
>
> > Are you using the "quote" function? You have to use it if you are
to
> > guarantee that the data will be acceptable as "input".
> >
> > $myVar = $myDB -> quote ($myVar)
> >
> >
> >> I'm using the Pg perl interface. But, think my problem was that I had
> >> unescaped single quotes in the string. Added the following to my code to
> >> escape them and it works now:
> >>
> >> $self->{errors} =~ s"'"\\'"g;# escape single quotes
> >
> >
>
> --
> Randy Perry
> sysTame
> Mac Consulting/Sales
>
> phn 561.589.6449
> mobile email[EMAIL PROTECTED]
>
>
>
>
>



---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [GENERAL] Text data type doesn't accept newlines?

2001-06-06 Thread Gordan Bobic

Are you using the "quote" function? You have to use it if you are to
guarantee that the data will be acceptable as "input".

$myVar = $myDB -> quote ($myVar)


> I'm using the Pg perl interface. But, think my problem was that I had
> unescaped single quotes in the string. Added the following to my code to
> escape them and it works now:
>
> $self->{errors} =~ s"'"\\'"g;# escape single quotes




---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [GENERAL] Re: Ideal hardware configuration for pgsql

2001-05-04 Thread Gordan Bobic

> > As for those OS wars - are there any serious Linux sysadmins who don't
> > have a copy of "Linux System Security" next to the server?
>
> I run Linux on everything, and I don't have a copy of that book next to
> any of my machines.  Then again, I don't run Redhat.

I don't have a copy of it either, and I do run RedHat. The trick is to
disable the services you don't use, and have portsentry firewall all
the ports. :-)

Regards.

Gordan





---(end of broadcast)---
TIP 4: Don't 'kill -9' the postmaster



Re: [GENERAL] Ideal hardware configuration for pgsql

2001-05-04 Thread Gordan Bobic

> > A note about SCSI vs IDE... I have recently tried both on a dual P3 with
> > 1gb of ram running Mandrake 7.2. I was amazed the idle CPU's were
> > running near 20-23% with nothing other than a bash shell running on 2
> > IBM IDE ATA 100 drives. I converted to 2 IBM SCSI U2 drives and the idle
> > CPU's went down to 0-2%. This was a tremendous difference. I feel SCSI
> > is a must.

I have a suspicion that you didn't try using hdparm first to tune up
your disks. Have you set them up with 32-bit transfers, UDMA, and
unmasked interrupts? I am running a single 40 GB Maxtor 5400 UDMA/66
disk, and even when I am stress testing the database, about 96% on
each CPU goes to user-space postmaster processes. The system CPU
consumption rarely goes above 4%. And FYI, I am running a dual 1 GHz
P3 on a VP6. The IDE disk is connected to the Promise controller. I
haven't tried it on the VIA, but I suspect the results would be the
same. The only way I can see your CPU consumption going to 20% is if
you are using PIO, and haven't tuned up your disks.

> > I would like to try one of those new Promise Supertrak ATA 100 RAID 5
> > cards or hear from someone who has tried one... Anyone? I read they have
> > the same onboard CPU as the Adaptec ultra160 RAID 5 card and can have up
> > to twice as much onboard cache (128mb) as the SCSI counterpart. It is
> > true hardware level raid 5 with a hot spare and each of the up to 6
> > drives is on its own interface. I would think this would alleviate the
> > onboard IDE / CPU burn and improve the IDE performance tremendously.
> >
> > The cost of the Promise Supertrak is over $100 more than the Adaptec
> > ultra 160 card, but the ATA 100 drives are very inexpensive compared to
> > SCSI. I found that you could get more than 4x the space for 25% less
> > money with IDE ATA 100 RAID 5.
> >
> > If this is true... IDE RAID 5 might be a good alternative to SCSI.

UDMA without RAID is already a good alternative to SCSI.

> > As for the Intel vs Sparc, well I use both. The Intels are great when
> > you need Windows, but when you need performance the Intel just can't keep
> > up with the stability, raw power or speed of Solaris 8 on a Sun machine.

I'd dispute that. Admittedly, I haven't used the latest and greatest
in Sun hardware due to cost issues, but my experience says that
Linux+Intel is pretty much as good as SPARC+Solaris as far as
stability goes. I suppose for heavy number crunching, the latest
UltraSPARC CPUs are much better, and Sun systems do generally feel
more responsive than standard PCs. It all depends on how much you want
to spend, and what sort of performance you are after.

> > As for the cost, we use HP dual 700 CPU x86 Netservers which cost about
> > $30,000ea and with the Sun Developer program we saved 40% on our Sun
> > machines which put the Sun E220 dual CPU in the same ballpark only it
> > will run circles around the HP and I think anyone who develops with Java
> > can become a Sun Developer.

$30,000 each??? And it is only a dual 700 P3? You are joking, right? I
have just built a dual 1 GHz P3 server with 1 GB of PC133 CAS2, and
everything else, for significantly less than $1,500! Where did the
remaining $28,500 go? 4 TB hardware RAID5 disk array with 1 GB of
cache and 10 hot spare disks? Because that would cost you roughly
$28K...

Regards.

Gordan



---(end of broadcast)---
TIP 5: Have you checked our extensive FAQ?

http://www.postgresql.org/users-lounge/docs/faq.html



Re: [GENERAL] Scalability

2001-03-14 Thread Gordan Bobic

(Frequent Access)
If you just have lots of queries in parallel, try replication, and
pick a random server for each connection.

(Complex Queries)
If you absolutely, positively need one query to be executed across all
nodes in the cluster because one machine would just take too long no
matter how big, then you may have to shell out for Oracle...

(Splitting the Data)
Alternatively, you could put separate tables on separate machines, or
wait until the "schema" gets advanced enough to let you do full
cross-database SQL queries... No idea how far off this is in
PostgreSQL...

Personally, I find that CPU is rarely the bottleneck. It is usually
either lack of memory or the speed of the disks...

- Original Message -
From: "Jan Ploski" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, March 14, 2001 1:54 PM
Subject: [GENERAL] Scalability


> Hello,
>
> When large databases, complex queries and frequent access are combined,
> the database backend may become a performance bottleneck. This is quite
> obvious.
>
> Do I assume correctly that CPU will be the earliest limiting factor
> for a database (if we have a reasonable amount of memory in the box)?
>
> If yes, one could deal with it by switching to a more powerful machine,
> OR one could distribute the database between several equivalent machines,
> as I imagine. The latter makes sense if data can be partitioned so that
> consistency and dependencies can be worked around, that is, I can move
> two subsets of data (say A and B) to two different boxes if rows in these
> subsets are very unlikely to be combined in a single query. Then I can have
> my application code access the correct database instance based on some
> simple rule, for example, users with IDs 0-100 have their messages
> stored on box A and all others on box B. Or: articles for newsgroups
> with hash code == 0 -> box A, others -> box B. Do we get linear scalability
> that way? I guess so. Does it sound cool? Yes. But does it cost less than
> upgrading a single server? I'm not sure.
>
> Is anyone out there using such a load balancing scheme with PostgreSQL?
> Are there any foreseeable problems which would make it impractical?
>
> Thanks -
> JPL
>



---(end of broadcast)---
TIP 6: Have you searched our list archives?

http://www.postgresql.org/search.mpl



Re: [GENERAL] Re: Slowdown problem when writing 1.7million records

2001-02-27 Thread Gordan Bobic

> > I am very new to PostgreSQL and have installed v7.03 on a Red Hat Linux
> > Server (v6.2), I am accessing the files using JDBC from a Windows 2000 PC.
> >
> > I have created a small file as follows:
> > CREATE TABLE expafh (
> > postcode CHAR(8) NOT NULL,
> > postcode_record_no INT,
> > street_name CHAR(30),
> > town CHAR(31),
> > PRIMARY KEY(postcode) )
> >
> > I am now writing 1.7million records to this file.
> >
> > The first 100,000 records took 15mins.
> > The next 100,000 records took 30mins
> > The last 100,000 records took 4hours.
> >
> > In total, it took 43 hours to write 1.7million records.
> >
> > Is this sort of degradation normal using a PostgreSQL database?
>
> AFAICT, no.

Funny you should say that, but I seem to be experiencing a similar
thing when doing pg_dump/restore. When restoring a table with around
40M records, everything starts nice and fast. However, by the time
it's half done, inserts take forever. It took about 16 hours to
restore all 40M records (3 GB or so + indices). Obviously, since it
was a "restore", there was no vacuuming during the operation, but it
still seemed to get unusually slow toward the end... Judging by the
increase rates in file sizes, anyway, which, I accept, isn't the most
reliable way to judge this matter by...

But, I guess this is nothing comparable to your problem. I was
inserting 40M records, and it took 16 hours.

You are inserting 1.7M records, and it's taking 43 hours.

Are you using autocommit? Are you using fsync? You should probably
periodically commit the data every few thousand inserts if you don't
have autocommit enabled.

Are you running out of memory? What amount of shared memory do you
have enabled? How much is Postgres allowed to use? I know this is
fairly obvious stuff, but it is, as such, quite easy to overlook...

> > I have never experienced this sort of degradation with any other database
> > and I have done exactly the same test (using the same hardware) on the
> > following databases:
> > DB2 v7 in total took 10hours 6mins
> > Oracle 8i in total took 3hours 20mins
> > Interbase v6 in total took 1hr 41min
> > MySQL v3.23 in total took 54mins
> >
> > Any Help or advise would be appreciated.

The only thing that comes to mind is that if you're doing a bulk
insert, you should probably drop all indices that aren't unique or for
primary keys, and re-create them once you've inserted all your data...
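
As a rough sketch (the index name here is made up, using the table from your
example):

-- Before the bulk load: drop secondary indices, keeping the primary key.
DROP INDEX expafh_street_name_idx;

-- ...run the bulk inserts, committing every few thousand rows...

-- After the load: recreate the index and refresh the planner statistics.
CREATE INDEX expafh_street_name_idx ON expafh (street_name);
VACUUM ANALYZE expafh;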

Regards.

Gordan





Re: [GENERAL] Can PostgreSQL be a mail backend?

2001-02-27 Thread Gordan Bobic

> It works quite well (designing a web-based system on it right now), but
> because of a DBD::Pg limit, I can only get 8k into a 'text' field. So if
> your app is web-based, you might want to not use perl...

Umm... I'm not sure what you're talking about here. I think you are
referring to the 8KB row length limit in pre-7.1 PostgreSQL. AFAIK,
Perl imposes no such limitation. In fact, I am using a database here
based on v7.1b4 that quite happily stores records with text fields
that exceed 50KB, and I am using a Perl program on the front end to
read and write those records.

So, what 8KB limit are you talking about? If there is one that I'm not
aware of, I'd sure like to find out about it...

Regards.

Gordan





Re: [GENERAL] VACUUM and 24/7 database operation

2001-01-23 Thread Gordan Bobic

> For one of our customers, we are running a PostgreSQL database on a
> dynamic PHP-driven site. This site has a minimum of 40 visitors at a
> time and must be responsive 24h a day.

And from the bandwidth and hit logs, they cannot determine a time of day
when there are hardly any hits? Possible, but it might be worth a
double-check.

> The problem is with VACUUMing this table. It takes 2 long minutes
> every day. Two minutes during which no request can be done because of the
> lock on the table... (nearly every request is involving this large
> table). Our customer really dislikes this fact and is planning to
> replace PostgreSQL with Oracle.

If they are sufficiently desperate over 2 minutes per day, which is, BTW,
less than 0.14% of the time, to want to replace it with Oracle at the full
cost of it (and last time I checked, Oracle full server setup costs were
in 5 figures, just for the software), then I'd say let them. It's their
money they are throwing away. I am assuming here that they have checked and
confirmed that Oracle would not suffer a similar problem?

> 2 minutes is seen by our customer as sufficient for his customer to
> get away from his site.

That would be the 0.14%, would it? What sort of service are they running on
it?

> - Is 2 minutes a standard time for vacuuming a 500.000 rows table ?

I'd say that is pretty good. I have a 30M row table with 30K-100K
inserts/updates per day (not including selects here), and I'm quite happy
if I can get it to vacuum in less than 15 minutes on a P2/400.

> - Can it be reduced ?

I guess. If you run your database on an ultra-fast RAID stripe, put more
memory and a faster processor in the server, you could probably reduce it.

Have you considered running two servers? If you set up two replicated
servers, then you could run everything off one server. At a convenient time
of day, when there's minimum load, you could vacuum the backup one,
wait for the queued replicated queries to be executed, and then fail
over to it. Then, you can vacuum the primary server, and make it wait as the
fail-over server until next time it's time to vacuum the database. The
fail-over should take a matter of seconds (fractions of seconds?),
depending on how you do it. A second server is likely to cost them less
than a full Oracle licence...

Incidentally, how do they handle backups? I am not sure how well a backup
of the database will work out with it still running, and records being
inserted during the backup. Running two servers will also get you around
this problem, in the same way, because you can shut down a database while
you back it up without losing any uptime or data.


> - In the far future, what are the problems we can run into not vacuuming
> that table ? We have already seen that after a month, some transactions
> involving where id >= some_value take forever, so we suppressed them.

Performance on inserts and updates will degrade, especially if there are
lots of deletes as well. Basically, the tables will get fragmented, because
deletes, AFAIK, do lazy deletion, so the tables aren't cleaned out until
you vacuum them. This also tends to mess up the index timings because there
are a lot of dead records in them.

Eventually, it will grind to a halt.

Depending on what your performance and requirements are, you could do a
vacuum once per week, perhaps?

Regards.

Gordon




Re: [GENERAL] Another optimizer question

2001-01-19 Thread Gordan Bobic

> > If I do a view that produces the data I want through joins, it takes hours,
> > even with all fields indexed, and after VACUUM ANALYZE. Doing SET
> > ENABLE_SEQSCAN = OFF doesn't seem to make any difference. The query plan
> > changes, but select times are still roughly the same... Doing the separate
> > subqueries on each table and joining data manually in the application code
> > takes literally seconds.
>
> Um, have you updated since I changed the subselect-restriction
> processing two days ago?

Not yet. I've done it in application code for now. I'll try again in
straight SQL when the next beta or release is available.

Thanks.

Gordan




Re: [GENERAL] Another optimizer question

2001-01-19 Thread Gordan Bobic

> > SELECT * FROM Table1 INNER JOIN Table2 ON (Table1.Field1 =
> > Table2.Field1)
> > WHERE Table1.Field1 = 'SomeValue';
> > [ is slow, but this is fast: ]
> > SELECT * FROM Table1 INNER JOIN Table2 ON (Table1.Field1 =
> > Table2.Field1)
> > WHERE Table1.Field1 = 'SomeValue' AND Table2.Field1 = 'SomeValue';
>
> You're correct, we do not deduce the latter from the former, and so the
> first example does not realize that it could use a constrained indexscan
> over Table2.
>
> Maybe we should try to do this, but I suspect we'd waste more cycles
> trying than we'd make back on the queries where it helps.

Hmm... As far as I can tell, it would help in any query joining several
tables on the same field in one table. I don't know how this compares to an
"average" case, but it is certainly common in the databases that I'm
working on... It would be nice if there was at least an option to enable
this sort of behaviour.

At the moment, the 6 table joins I am doing are so painfully slow, no
matter how I optimize the queries/views, that I am going to forget about the
whole idea of views and set up some code that will do the separate table
subqueries manually. Even with _ALL_ fields that joins are performed on
indexed in hash or btree, it still takes forever. Seeing as I have to do this
often, unless I can find a way to speed up my queries by an order of
magnitude (or several in some cases), I don't see any other way around
this.

If I do a view that produces the data I want through joins, it takes hours,
even with all fields indexed, and after VACUUM ANALYZE. Doing SET
ENABLE_SEQSCAN = OFF doesn't seem to make any difference. The query plan
changes, but select times are still roughly the same... Doing the separate
subqueries on each table and joining data manually in the application code
takes literally seconds. I am sure that cannot be right and I must be doing
something wrong, so if anyone has a good idea of how to solve this type of
problem, I'd like to hear it - I'm not sure I have a lot of options left...

Regards.

Gordan




[GENERAL] Another optimizer question

2001-01-18 Thread Gordan Bobic

I am not sure if this is a bug, an oversight or something else entirely,
but it would appear that if there are two tables, Table1 and Table2, which
are joined using INNER JOIN, a WHERE equality condition on one of the join
fields doesn't automatically get applied to the other field.

For example:

SELECT * FROM Table1 INNER JOIN Table2 ON (Table1.Field1 = Table2.Field1)
WHERE Table1.Field1 = 'SomeValue';

takes a very long time (several minutes), and explain says that sequential
scans are used on both tables.

However, changing the above to:

SELECT * FROM Table1 INNER JOIN Table2 ON (Table1.Field1 = Table2.Field1)
WHERE Table1.Field1 = 'SomeValue' AND Table2.Field1 = 'SomeValue';

yields the correct answer in a fraction of a second. Explain says that
indices are being used. However, here's a REALLY strange thing. If I do:

SET ENABLE_SEQSCAN = OFF;

and run the first query, explain says that indices are used, but it STILL
takes forever. The first, slow query executes a merge join, while the
second only executes two index scans in a nested loop.

Why? This seems like a fairly basic thing, but it seems to break something
in the way the query is executed...

Regards.

Gordan




[GENERAL] Tuning queries and distinct behaviour

2001-01-18 Thread Gordan Bobic

Is there a way to tune queries?

I'm doing queries that join around 5-6 tables. All join fields are indexed
either in hash (where tables are small enough and the join is done on "="), or
btree (big tables, not joined on "="). The tables have between several
hundred and several tens of millions of records. The problem is that this
tends to take a _VERY_ long time. I know that I'm asking for a bit much on
such a huge task, but if I break the queries down manually into 10 simpler
ones that I could run manually with temporary tables, each of those would
take a few seconds at most. The optimizer occasionally decides to do
sequential scans, and this is probably what is killing the performance. Is
there any way to give the optimizer hints? I suspect that it would help in
a vast majority of cases if it looked at the where clauses in views and
selects on views and tried cutting down the working set through that, and
then pruning down the rest as it went along. It just seems a bit strange
that doing subqueries with temporary tables should be so much faster.
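
For illustration, the kind of manual breakdown I mean looks roughly like this
(table and column names are made up):

-- Step 1: cut the working set down using the most selective condition.
SELECT  t1.id, t1.field1
INTO    TEMP step1
FROM    table1 t1
WHERE   t1.field1 = 'SomeValue';

-- Step 2: join the reduced set against the next table.
SELECT  s.id, t2.field2
INTO    TEMP step2
FROM    step1 s, table2 t2
WHERE   t2.id = s.id;

-- ...and so on, one table at a time, instead of joining all 5-6 at once.
SELECT  *
FROM    step2 s, table3 t3
WHERE   t3.id = s.id;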

Doing VACUUM ANALYZE often helps, but not always. Is there any way to give
the optimizer hints on how to speed up selects on views that do big joins,
both inner and outer?

Another question - I have to do a join on the "<" operation. Something
like:

SELECT * FROM Table1, Table2 WHERE Table1.Field1 < Table2.Field2 ORDER BY
Table1.Field1 DESC, Table2.Field2 DESC;

This will give me a very large set of records. However, I only really need
a few of those records. I only want the highest Field1, Field2 combination
records for some other field in Table1. Effectively, something like:

CREATE VIEW SomeView AS SELECT * FROM Table1, Table2 WHERE Table1.Field1 <
Table2.Field2 ORDER BY Table1.Field1 DESC, Table2.Field2 DESC;

and then doing:

SELECT DISTINCT ON (Table1.Field3) * FROM SomeView. I would hope that
DISTINCT would pick the first record returned for each of the different
occurrences in SomeView.

Unfortunately it doesn't. From a quick scan of the returned records,
it doesn't seem to pick them in any particular order. Is this the case? And
is there an SQL standard that says which records should DISTINCT return
(first, last, or arbitrary)? And is there a way to enforce this behaviour,
just in case the default behaviour changes in the future?

Alternatively, can anyone think of a solution to this problem?
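
One thing that might do it (a sketch - worth checking against the DISTINCT ON
documentation) is to put the ordering directly on the DISTINCT ON query, since
DISTINCT ON keeps the first row of each group in ORDER BY order, and the ORDER
BY has to start with the DISTINCT ON expressions:

SELECT  DISTINCT ON (Table1.Field3)
        Table1.*, Table2.*
FROM    Table1, Table2
WHERE   Table1.Field1 < Table2.Field2
ORDER BY Table1.Field3, Table1.Field1 DESC, Table2.Field2 DESC;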

Thanks.

Gordan




[GENERAL] Using max() MUCH slower in v7.1

2001-01-17 Thread Gordan Bobic

Hi.

I have just upgraded from v7.0.3 to v7.1b3, and one of the things I am
noticing is that doing a max() query search seems to take forever.

For example, if I have a view like:

CREATE VIEW LastDate AS
SELECT      Company,
            max(Date) AS Date
FROM        PastInvoices
GROUP BY    Company;

And then call it like
SELECT Date FROM LastDate WHERE Company = 'SomeCompany' ORDER BY Date;

It takes ABSOLUTELY forever. The table has about 25-30M records. This
worked with acceptable speed on v7.0.3. I used pg_dump to port the data,
and I have just done a VACUUM ANALYZE. EXPLAIN says that it will use the
index for the Company field, which is right, but it still takes a
ridiculously long time (hours). All indices are BTREE (HASH index creation
fails with the "out of overflow pages" error), and the table is index on
both Company and Date.

OTOH, if I just use no view and do

SELECT Date FROM PastInvoices WHERE Company = 'SomeCompany' ORDER BY Date
DESC LIMIT 1;

which does PRECISELY the same thing, finishes in a fraction of a
second. This was the same speed that the max() view query ran at on v7.0.x.
Why such a sudden change?

Regards.

Gordan




Re: [GENERAL] How passwords can be crypted in postgres?

2001-01-02 Thread Gordan Bobic

> >  [...]
> > Isn't this just as bad? If you store the encrypted password, that doesn't
> > help you in the slightest in this case, because if you can breach the list
> > of encrypted passwords, you still know what you need to send as the
> > "password" from the front end to let you into the database.
> >  [...]
> >  [...]
>
> If you encrypt the input from the frontend as well and compare the
> encrypted strings it will not help you to look into the list of
> encrypted passwords ... or am I wrong?

What problem are you trying to defeat? If you are worried about "sniffing"
passwords from the traveling packets, then regardless of whether the
password field carries a plain text password or scrambled garbage, if you
know where the password field is, you can sniff it. If you are simply using
this for authentication, then it doesn't matter whether the password is
encrypted or not. You are still, effectively, transmitting a "password
string" that is used for authentication.

The security of passwords, encrypted or otherwise, is purely reliant on the
security of your database server that stores the data.

Does that make sense?

Regards.

Gordan




Re: SV: [GENERAL] MySQL and PostgreSQL speed compare

2001-01-02 Thread Gordan Bobic

> Advanced tools do have advanced safety features, but are sold "ready
> for most use", not "safely disabled until you read all of the manuals
> so you can figure out how to make it work decently". I agree that
> reading the manuals is an important part of learning a new tool,
> but it shouldn't be *required* to make it work for basic use.

It isn't *required*. It works lovely the way it is shipped. But if you want
more speed, you should go and read the manual before complaining. It is not
crippled in any way - just tuned on the side of caution. It STILL works
well for MOST users who just want something to work, rather than ultimate
speed or reliability. It is up to the user to decide what is more important
for their particular application, and what is more appropriate given their
setup and budget.

> Users shouldn't have to know how to tune the fuel injection system
> for *optimum* performance in order to take a car for a test drive
> on a fast roadway.

No, they shouldn't. However, for THOSE users, the more appropriate way of
solving the problem would be to buy faster hardware - this is the analogy
you are following, right? If you want to drive faster than the car will let
you, buy a faster car, right?

> Computer software is, indeed, a tool which does not do everything
> for you. But is should come "from the factory" setup for the way
> a user would expect it to run, not partially disabled for maximum
> safety.

It is not "disabled" in any way. It works very well, for a vast majority of
uses. If you are setting up a web site, which you want people to see, then
you should consider yourself serious enough to read the documentation. If
you are intending to stake the future of your business on a server, then
exactly what are you thinking if you still refuse to RTFM?

> It's a power tool, and it can "hurt" if misused. If that's
> too much responsibility for a bad user, it won't matter how safely
> it's been tuned at the factory, the bad user will *still* modify it
> in unsafe ways, and often tune it or use it the wrong way, damaging
> the tool in the process.

There is a valid point in there somewhere. However, there is nothing wrong
with erring on the side of caution. All the functionality is there - but if
you need more speed, all it takes is reading through the archives for an
hour or so, and you will find all the answers you need.

> I don't expect my software to come optimized for my use. I expect
> it to come optimized for the most users and uses, not "dumbed down"
> for the worst case, or "safely disabled" for the worst users.

Why? What's your reasoning behind that? If all the functionality is there,
and the only penalty is speed, which is still adequate for most uses, what
is the problem? If you are happy with tuning things up for your particular
application, then the chances are that you will go through the tuning
process yourself regardless of how it is shipped. All a slightly slower
default will do is encourage you to read the docs that little bit sooner,
if your system becomes large enough for this to be an issue.

Regards.

Gordan




Re: [GENERAL] How passwords can be crypted in postgres?

2001-01-02 Thread Gordan Bobic

> I usually just run 'crypt()' on the clear text before storing it to the
> backend ...

Isn't this just as bad? If you store the encrypted password, that doesn't
help you in the slightest in this case, because if you can breach the list
of encrypted passwords, you still know what you need to send as the
"password" from the front end to let you into the database.

Unless I am missing something here, doing this doesn't make any
difference... Not for someone serious about breaching security, anyway...

Regards.

Gordan




Re: SV: [GENERAL] MySQL and PostgreSQL speed compare

2001-01-02 Thread Gordan Bobic

[tuning analogies snipped]
>
> Likewise with self-proclaimed computer tuners.

You have no idea how much I agree with you there.

> > I really don't understand why people expect computers to do everything
> > for them, the burden of using tools properly belongs to the user.
>
> I of course agree in principle to this statement (and the whole tone of
> your reply) -- but, my statement doesn't reflect my opinion -- it
> reflects reality.  Facts are stubborn things.
>
> Of course the defaults will never be perfect -- nor will all users RTM.
> But, therein lies the utility of 'defaults' benchmarking -- let's see
> what the trade offs really are so that we the 'experts' can
> intelligently recommend things -- as well as intelligently tweak the
> defaults.

Unfortunately, there are two ways you can set the defaults. You can strip
them down for maximum speed, or you can load them up for maximum
reliability. The latter will make users complain about speed on the
support lists in a generally annoyed fashion.

Stripping things down for speed, OTOH, will work great - until someone
gets a power cut, or some other kind of weird hardware failure that will
wipe out their data. Then they will come back again and complain.

And the answer is always to simply spend an hour or so reading the
documentation...

Some people, eh...

Regards.

Gordan




Re: [GENERAL] MySQL and PostgreSQL speed compare

2001-01-02 Thread Gordan Bobic

> >>>Actually, if he ran Postgresql with WAL enabled, fsync shouldn't
> >>>make much of a difference.
>
> WAL seems to be enabled by default. What WAL is good for I do not know. But
> if I start PostgreSQL without the -S I see a lot of info about WAL this and
> WAL that.

You seem to be too hung up on defaults. I am not into advocacy, and
whatever database works better for you is the right one to use. However,
using the defaults as the basis for benchmarking is intrinsically flawed.
It ultimately depends on what the person who set up the distribution felt
like at the time of creating the packages. There may be guidelines which
err on the side of caution, to the point of paranoia. All these are quite
common. If you are serious enough about using a database to run into
bottlenecks of whatever sort you are experiencing, then you should also be
serious enough to RTFM and find out about tuning the database for a
particular application (I consider a benchmark to be an application in this
case) before you do it. Posting results of a benchmark run on a default
installation proves absolutely nothing.

> ...
> > But isn't it recommended to run the server with fsync?  If so,
> > you shouldn't disable it on a benchmark then.
>
> I run both MySQL and PostgreSQL as they are (minimum switches, no tuning, as
> default as it can be). That is MySQL as the .rpm installed it
> (--datadir --pid-file --skip-locking) and PostgreSQL with -i -S -D. Thats
> the way most people would be running them anyway. And default should be good
> enought for this test (simple queries, few rows (max 1000) per table).

There you go with defaults again. And I'm afraid that your argument "Thats
the way most people would be running them anyway." is also flawed in the
same way. People serious enough about using a database in a sufficiently
heavy environment to run up against speed problems would be serious enough
about reading up on the software they are using to find out how to tune it
for their application.

Is this some kind of Windows-induced dementia? Use everything as it was
installed, and expect it to always work in the best possible way for your
particular application? Use everything the way it was installed because
"users are too thick to play with the settings"? What about sysops? Would
you really want your business's mission-critical server to be operated by
someone who cannot even be bothered to read the documentation for the
software he is installing in sufficient depth to find out about things like
tuning?

The problem here is not the lack of knowledge - it is the resistance to the
concept of learning about something before judging it. Can you see what is
wrong with that approach?

> ...
> > > > Well I expected MySQL to be the faster one, but this much.
> ...
> > > To me, all this is pointing toward the possibility that you haven't
> > > switched of fsync. This will make a MASSIVE difference to insert/update
>
> The idea was to run as recomended and as default as possible. But with the
> latest (alpha/beta/development) code.

Latest code doesn't matter in this case. If you are running a benchmark,
here are the things you should be considering if you are being serious
about measuring real-world performance AND usefulness.

1) Never benchmark pre-releases. Always use the latest RELEASE version,
with all the required stability/bugfix patches installed.
2) Always tune the software and hardware up for the particular benchmark.
This will allow you to assess the ability of software/hardware to adapt to a
specific application.
3) If you are testing pre-release versions, you should ALWAYS take the
results with a pinch of salt. Pre-releases are not necessarily stable
(although they often are), and they are often set up to allow for easier
bug tracking and reliability testing, rather than pure speed measuring.
4) ALWAYS contact the developers of the software before publishing the
results. They will give you useful hints on how to optimize things.
5) Default installations are usually completely meaningless for
benchmarking purposes.

> ...
> > > And in case you cannot be bothered, add the "-o -F" parameters (IIRC) to
> ...
> > > flushes the it's disk cache bufferes after every query. This should even
> > > things out quite a lot.
>
> Ill test that. Even thou it feels like tweaking PostgreSQL away from what
> its considered safe by PostgreSQL developers. If it would be safe it would
> be default.

OK, I am not a PostgreSQL developer (not quite yet, anyway), so they should
comment on this from their point of view.

However, if you are benchmarking speed, then tune the setup for speed. That
is what you are measuring, right? If you are testing something for
reliability and torture-proof features, then tune the setup for that. Not
tuning the system for the application is like using a sledge hammer to
unscrew a bolt. There is such a thing as the correct tool for the task!
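
For the record, the difference being argued about comes down to one startup
switch (flag names recalled from memory for the 7.0.x postmaster, so verify
them against the man page before relying on this):

    # Cautious default: every committed transaction is fsync()ed to disk
    postmaster -i -S -D /var/lib/pgsql/data

    # Benchmark/speed setup: -o passes options through to each backend,
    # and -F turns fsync off - much faster, but a power failure can lose data
    postmaster -i -S -D /var/lib/pgsql/data -o -F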

> >>> Sir, thanks for sharing this with us. However, unless you can explain

Re: [GENERAL] Performance Tuning, hardware-wise

2001-01-02 Thread Gordan Bobic

> As for the drive in that machine, doing inserts on it was SLOW. Slower
> even than on our beater development machine.  I suppose I could have fiddled
> with hdparm to increase the disk I/O, but that would have been a temporary
> fix at best. Our CGI applications were eating lots of CPU time, and we just
> needed more processors.

If you needed more CPU power, then that's fair. However, bear in mind that
hdparm is NOT necessarily just a TEMPORARY fix. Using DMA modes can HALVE
the CPU utilization required for heavy disk I/O. It is also not uncommon to
increase the disk bandwidth by as much as four times by tuning the disks
with hdparm. I have seen tens, if not hundreds of Linux machines. I have
yet to see one that didn't benefit greatly from using hdparm to tune up
disk transfers. On my database development server, the CPU consumption on
heavy SELECT/INSERT/UPDATE setup went from around 60% to around 40%, and
the hdparm -t -T reported the increase from 4 MB/s on both buffer-cache and
buffer-disk transfers to about 50 MB/s and 20 MB/s respectively. It takes a
bit to get it right, but if you know your disk and motherboard spec, and
don't feel like adventuring into overclocking and overtuning, you are
pretty much guaranteed success on the first try. Then you just have to add
the tune-up to your startup scripts, preferably before the swap is enabled,
as in my experience, in a very limited number of cases, retuning the disk
after the swap has been started can cause some minor, random stability
problems.
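
For anyone who has not tried it, a typical session looks something like this
(the device name and mode values are only examples - check your drive and
chipset documentation first, since wrong settings can lock up or corrupt a
disk):

    # Baseline timings (buffer-cache and buffered disk reads)
    hdparm -t -T /dev/hda

    # Enable DMA (-d1), 32-bit I/O (-c1) and 16-sector multicount (-m16)
    hdparm -d1 -c1 -m16 /dev/hda

    # Re-measure, and once happy, add the line to a startup script
    hdparm -t -T /dev/hda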

>  So, we moved to a large database machine to service a group of web
> servers.  The database machine has four Xeon 700's in it, with 512 megs of
> RAM.  For the disk, we did decide to go with a RAID array, for disk
> redundancy as much as efficiency.  The fast disk array makes inserts go
> tremedously fast, but for selects, well, it's still the same - you're
> limitted by CPU and memory.   Why not disk I/O?  Well, the entire dataset is
> in the disk cache anyway, so there's very little disk I/O involved with
> selects.

Glad to hear that this worked for you, but some of us have a setup where
you have to randomly query a 8+ GB database. Having it all in cache just
isn't going to happen in that case and lots of disk I/O bandwidth is the
only feasible answer (I don't consider an 8 GB RAM disk to be a feasible
option)

>  Our data directory is 340 megs, and the machine has 512 megs of RAM,
> and Postgres is just too efficient with RAM.   ; )  I start up the
> postmaster with 3072 buffer blocks, and yesterday increased the amount of
> memory each postmaster can use for sorting from the default of 4 megs to 32
> megs, which did give a small speed increase, but the machine *still* has at
> least 120 megs completely unused at all times.  Maybe I'll give each backend
> more RAM for sorting. : )

You're lucky that you have a comparatively small data set. My data set is
about 20 times as big, and we were starting to experience crashes (well,
things would abort rather than crash, as they are well behaved, but still)
because the machine would blow through the 256 MB of RAM and 256 MB of swap.
And it would grind to a halt long before that... 512 MB made a world of
difference... But as with all different applications, YMMV.
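
For reference, the two knobs quoted above map onto startup switches roughly
as follows (switch names taken from the 7.0-era documentation, so double-check
them against your version):

    # 3072 shared buffer blocks (-B), and 32 MB of per-backend sort memory,
    # passed through to each backend with -o (-S is given in kilobytes)
    postmaster -i -D /var/lib/pgsql/data -B 3072 -o "-S 32768"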

>   Under Linux, RAM disks aren't much faster/better than regulsr disks, from
> the benchmarks I've seen.

I can second this. The kernel's caching is efficient enough that a RAM disk
brings little extra benefit in a real-world application... I don't know about
benchmarking, though; in my opinion and experience, benchmark numbers are
often meaningless when used to assess a real-world situation.

> Besides, then a power-outtage can really hose
> you. : )

Well, that is entirely true, but technically, you ought to have a UPS if
you have a mission critical system. I have recently had a complete disk's
worth of data hosed due to power failure, as something went wrong and the
root inode got corrupted. Usefulness of backups is difficult to
overestimate...

HTH.

Gordan




[GENERAL] v7.1 RPMs

2000-12-12 Thread Gordan Bobic

Have the RPMs been published yet? I seem to remember somebody saying that
they should be on the web site by the last weekend, but I can't find them.
A link would be appreciated... I need some of the new features, but I'd
rather avoid working out all the strange file locations (i.e. not
/usr/local) for a RH6.2 system...

Cheers.

Gordan




Re: [GENERAL] Database cluster?

2000-12-01 Thread Gordan Bobic

> > Than you can connect to any of the postgres on your cluster, for
> >example: > round robin.
> >
> >Hmm... But is this really what we want to do? This is less than ideal for
> >several reasons (if I understand what you're saying correctly). Replication
> >is off-line for a start, and it only works well for a system that has few
> >inserts and lots of selects, probably from a lot of different users.
> >Probably a good things for applications like web search engines, but not
> >necessarily for much else.
>
> *** it isn't replication. It's that your cluster behaves like a
> single-computer. You modify the 'OS' (GFS + DIPC), not postgresql.

OK, that makes sense. Kind of like Mosix, then. But like Mosix, this would
require lots of network bandwidth - or not, depending on how good GFS is at
figuring out what goes where.

> > > Another issue are datafiles, GFS seems promising. > But postgresql uses
> >fcnl, and GFS (globalfilesystem.org) doesn't > support it yet. > A
> >distributed filesystem with locking etc. is required, Ideas ?
> >
> >Hmm... I am not sure that a distributed file system is what we want here. I
> >think it might be better to have separate postgres databases on separate
> >local file systems, and handle putting the data together on a higher level.
> >I think this would be better for both performance and scaleability. Having
>
> ***yes... but WHEN we can have these features ? No one have done it till
> now, i've requested and searched but almost no reply.

Well, if you come up with a detailed design, I'm quite happy to help with
coding individual functions...

> >one big file system is likely to incur heavy network traffic penalties, and
> >that is not necessary, as it can be avoided by just having the distribution
> >done on a database level, rather than file system level.
> >
> >But then again, the distributed file system can be seen as a "neater"
> >solution, and it might work rather well, if they get the caching right with
> >the correct "near-line" distribution of data across the network file system
> >to make sure that the data is where it is most useful. In other words, make
> >sure that the files (or even better, inodes) that are frequently accessed
> >by a computer are on that computer).
> >
> >Still there is the issue of replication and redundancy.
>
> ***GFS does it transparently.

But wouldn't this all be incredibly network-intensive? Could we implement
something that would make a process go to the data, instead of the other
way around? In a database, the data is typically bigger than the process
accessing it...

> >Indeed. As such, it should probably be the first thing to do toward
> >"clustering" a database. Still, it would be good to have a clear
> >development path, even though on that path we cludge things slightly at
> >various steps in order to have a useable system now, as opposed to a
> >"perfect" system later.
> >
>
> *** yes, i want clustering now...and i'm alone.

No, you're not. I NEED clustering now. Eventually the number of records and
tables comes and bites you, no matter how much you optimize your
application. And for most of us mere mortals, buying a Cray for running a
database is just not a viable option...

> I my opinion if GFS will do fcntl (and we can ask to GFS people, i think),
> the stuff in this email can be done rapidly.

Well, I think you've just volunteered to contact them. :-)

> >A shared all approach is not necessarily that bad. It is (as far as I can
> >tell), not better or worse than a "share nothing" approach. They both have
> >pros and cons. Ideally, we should work toward coming up with an idea for a
> >hybrid system that would pick the best of both worlds.
> >
> > > This system can give a sort of single-system-image, useful to distribute
> > > other software beyond postgresql.
> >
> >Indeed. This is always a good thing for scalability for most applications,
> >but databases have their specific requirements which may not be best
> >catered for by standard means of distributed processing. Still, what you
> >are suggesting would be a major improvement, from where I'm looking at it,
> >but I am probably biased by looking at it from the point of view of my
> >particular application.
> >
> > > Also Mariposa (http://s2k-ftp.cs.berkeley.edu:8000/mariposa/) seems
> > > interesting, but it's not maintained and it's for an old postgresql
> > > version.
> >
> >Hmm... Most interesting. There could be something recyclable in there. Must
> >look at the specs and some source later...
> >
>
> *** i've compiled it , but with no results.
> An idea is to get diff to corresponding pure postgresql version (6.4/5?),
> then study the patch, and grab the secrets to fuse in current version. The
> research papers seems very good. Perhaps some guy that have done
> Mariposa can help...

See above comment...

> My goal is to have a clustered open source database with the less effort
> possible, now.
>
> The project to do good stuff (ie code) in this field is very long...

Indeed. Th

Re: [GENERAL] Database cluster?

2000-11-30 Thread Gordan Bobic

> > > I am considering splitting the database into tables residing on separate
> > > machines, and connect them on one master node.
> > >
> > > The question I have is:
> > >
> > > 1) How can I do this using PostgreSQL?
> >
> > You can't.
>
>I'll jump in with a bit more info.  Splitting tables across multiple
> machines would do nothing more than make the entire system run at a snail's
> pace.  Yes, it would slow it down immensely, because you just couldn't move
> data between machines quickly enough.

I don't believe that is the case. In my case, queries typically return
comparatively small amounts of data. Around 100 records at most. The amount
of data that needs to be transferred is comparatively small, and even over
10 Mb ethernet, it would take at most about a second to transfer. This is a
much smaller delay than the query time itself, which can take 10 seconds or
more. Remember that I said there are tables with over 30M records? Doing
multi-table joins on things like this takes a long time...
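
To put rough numbers on that (the per-record size is an assumption, purely
for the sake of the arithmetic):

    100 records x ~1 KB each   =  ~100 KB shipped per query
    10 Mbit/s ethernet         =  ~1.25 MB/s raw, call it ~1 MB/s in practice
    100 KB / ~1 MB/s           =  ~0.1 s  (even a full megabyte is only ~1 s)

Either way, the transfer time is dwarfed by a query that takes 10 seconds or
more to execute.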

So, splitting the data in such a way that one table is queried, and then the
tables joined from it are queried in parallel, would cause a significant
speed-up.

For example, say we have tables T1, T2 and T3.

T1 has fields F1.1, F1.2 and F1.3. T2 has F2.1 and T3 has F3.1 (plus, most
likely, lots of other fields).

Say I want to do
SELECT *
FROM T1, T2, T3
WHERE F1.1 = F2.1 AND F1.2 = F3.1 AND F1.3 = 'somedata';

Then F1.3 could be searched for 'somedata' locally. Once the matching records
are found, the F1.1 = F2.1 and F1.2 = F3.1 joins could be evaluated remotely,
in parallel, on different machines.

This means that depending on the type, configuration and usage of the
database, a potentially massive improvement in performance could be
achieved, especially on multi-table joins which span lots of BIG tables.

Somebody mentioned the fact that postgres uses IPC for communicating
between processes. I think there are tools for clustering (I am not sure if
Mosix supports transparently allowing IPC across nodes) which can work
around that.

>   Why?  Well, whenever you join two tables that are on different machines,
> the tables have to go across whatever sort of connection you have between
> the machines.  Even if you use gigabit ethernet, you are still running at a
> mere fraction of the bandwidth of the computer's internal bus - and at
> orders of magnitude greater latency.  You'd have lots of CPU's sitting
> around, doing absolutely nothing, waiting for data to come across the wire.

Gigabit ethernet has around the same bandwidth as a 32-bit/33 MHz PCI bus
(roughly 125 MB/s against 133 MB/s). I suppose it all depends on what machine
you have running this. This would be true in the case that the database
server is a nice big Alpha with several CPUs.

>There are alternatives, such as IP-over-SCSI.  That reduces the latency
> of ethernet quite a bit, and gives you much more bandwidth (say, up to 160
> megabytes/second).  However, that's still a pittance compared to the main
> system bus inside your computer.

But SCSI is still 160 MB/s burst (not sustained, unless you're using a very
expensive array). And gigabit ethernet is roughly 125 MB/s, albeit with
greater latency.

> That's one of the greatest hurdles to distributed computing.  That's why
> the applications that are best adapted to distributed computing are those
> that don't require much data over the wire - which certainly doesn't apply
> to databases. : )

I think it depends on whether the amount of data is the problem, or fitting
it together.

Would somebody please explain to me further why I am wrong in all this?

Regards.

Gordan




Re: [GENERAL] Tuple size limits and upgrading

2000-11-28 Thread Gordan Bobic

Sorry for replying to my own email, but I've just stumbled upon an article
that seems to imply that v7.1 will support unlimited record lengths. Is
this the case? When is v7.1 due for release? Is a beta available?

Thanks.

Gordan

- Original Message -
From: "Gordan Bobic" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, November 28, 2000 10:36 AM
Subject: [GENERAL] Tuple size limits and upgrading


> Hi!
>
> I've got a bit of a problem with tuple size limits. They seem to be limited
> to 8K, which is not enough for me, in the current application. Ideally, I'd
> like to bump this up to at least 256K, although 512K would be nice. This is
> for the purpose of storing large text fields, although it's usefulness
> might be extended for storing binary data as well (e.g. large images).
>
> Is there a way to allow records (tuples) to have arbitrary size?
>
> A fair majority of the text files I am storing in the database fits into
> records despite the 8K limit. However, there is a substantial number of
> those that don't. The biggest ones can go up to 512KB and beyond, but this
> is somewhat uncommon.
>
> I have read in the archives that this can be modified by changing a few
> parameters and re-compiling. What is the limit to which the record size can
> be increased? Does this setting seriously hammer performance (even if there
> are only comparatively few records of this size)?
>
> The problem is that I would like to keep the way things are working,
> including directory structures, and I am not sure how to do this if I
> re-compile from source, as that will put everything into /usr/local.
>
> I am running RH Linux 6.2, upgraded to PGSQL 7.0.1, and files are in
> /var/lib/pgsql and /usr/lib/pgsql, and non-/usr/local locations.
>
> How can I upgrade to a customized version of 7.0.3 while keeping all the
> files in the same place? I know I can use the SRPM distribution, but just
> doing rpm --rebuild doesn't let me tweak the parameter in need to change.
>
> Can anybody suggest a way of doing this?
>
> Thanks.
>
> Gordan
>
>
>




[GENERAL] Tuple size limits and upgrading

2000-11-28 Thread Gordan Bobic

Hi!

I've got a bit of a problem with tuple size limits. They seem to be limited
to 8K, which is not enough for me, in the current application. Ideally, I'd
like to bump this up to at least 256K, although 512K would be nice. This is
for the purpose of storing large text fields, although its usefulness
might be extended for storing binary data as well (e.g. large images).

Is there a way to allow records (tuples) to have arbitrary size?

A fair majority of the text files I am storing in the database fits into
records despite the 8K limit. However, there is a substantial number of
those that don't. The biggest ones can go up to 512KB and beyond, but this
is somewhat uncommon.

I have read in the archives that this can be modified by changing a few
parameters and re-compiling. What is the limit to which the record size can
be increased? Does this setting seriously hammer performance (even if there
are only comparatively few records of this size)?
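
(For what it is worth, the parameter usually being referred to is BLCKSZ in
src/include/config.h - that name, and the 32K ceiling quoted below, are from
memory, so treat them as assumptions and check the source tree for your
version:

    /* src/include/config.h, after running configure */
    /* #define BLCKSZ 8192 */      /* stock 8K block size        */
    #define BLCKSZ 32768           /* reported 7.0.x maximum; a  */
                                   /* tuple must fit in a block  */

followed by a full rebuild and a fresh initdb, since the on-disk format
changes with the block size. If that ceiling is right, 256K records would
still be out of reach before 7.1.)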

The problem is that I would like to keep the way things are working,
including directory structures, and I am not sure how to do this if I
re-compile from source, as that will put everything into /usr/local.

I am running RH Linux 6.2, upgraded to PGSQL 7.0.1, and files are in
/var/lib/pgsql and /usr/lib/pgsql, and non-/usr/local locations.

How can I upgrade to a customized version of 7.0.3 while keeping all the
files in the same place? I know I can use the SRPM distribution, but just
doing rpm --rebuild doesn't let me tweak the parameter I need to change.

Can anybody suggest a way of doing this?

Thanks.

Gordan