This STDOUT issue gets even weirder. I have now set up our two new servers
(identical hw/sw), as I would have needed to do that anyway. Once PG was
running, I set up the same test scenario as on our problematic servers and
started the COPY-to-STDOUT experiment. And you know what? Both new servers
are performing well. No hanging, and the 3 GByte test dump was written in
around 3 minutes (as expected). To make things even more complicated ... I
went back to our production servers. Now the first one - which I froze up with
oprofile this morning and which needed a REBOOT - is performing well too! It
needed 3 minutes for the test case ... WTF? BUT the second production server,
which did not have a reboot, is still behaving badly.
Now I tried to dig deeper (without killing a production server again) ... and
ended up comparing the output of ps (first with the '-fax' parameters, then
with '-axl'). I have found something interesting:
- all fast servers show the COPY process as being in the state Rs ("runnable
(on run queue)")
- on the still slow server, this process is in 9 out of 10 samples in Ds
("uninterruptible sleep (usually IO)")
Now, this "Ds" state seems to be something unhealthy - especially if it is
there almost all the time - as far as my first reads on Google show (and
although it points to IO, there is seemingly only very little IO, and IO-wait
is minimal too). I have also run ps with "-axl", which gives the following
line for our process:
F   UID   PID  PPID PRI  NI     VSZ    RSS WCHAN  STAT TTY  TIME COMMAND
1  5551  2819  4201  20   0 5941068 201192 conges Ds   ?    2:05 postgres: postgres musicload_cache [local] COPY
Now, as far as I understood from my Google searches, the WCHAN column shows
where in the kernel my process is hanging. Here it says "conges". Can
somebody tell me what "conges" means? Or do I have other options to get
even more info out of the system (maybe without oprofile - as it already
burned my hand :-)?
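One low-risk way to repeat the ps comparison above without oprofile is to sample the process state and wait channel straight from /proc. The following is only a minimal sketch, assuming a Linux system where /proc/<pid>/stat and /proc/<pid>/wchan are readable by the invoking user; the function names are made up for illustration. Note that ps cuts the WCHAN column to fit its width, so "conges" is most likely a truncated kernel symbol name, and /proc/<pid>/wchan should show it in full.

```python
import collections
import sys
import time


def parse_state(stat_line):
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def read_wchan(pid):
    """Return the kernel symbol the process is sleeping in, or '' if unreadable."""
    try:
        with open(f"/proc/{pid}/wchan") as f:
            return f.read().strip()
    except OSError:
        return ""


def sample(pid, count=10, interval=1.0):
    """Take `count` samples of (state, wchan) and count how often each pair occurs."""
    seen = collections.Counter()
    for _ in range(count):
        with open(f"/proc/{pid}/stat") as f:
            state = parse_state(f.read())
        seen[(state, read_wchan(pid))] += 1
        time.sleep(interval)
    return seen


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python sample_state.py 2819
    for (state, wchan), n in sample(int(sys.argv[1])).most_common():
        print(f"{n:3d}x state={state} wchan={wchan or '-'}")
```

Run against the PID of the COPY backend on a fast and on the slow server, this would give the same "R versus D" picture as the ps samples, plus the untruncated wait channel.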
And yes, I now see a reboot as a possible "fix", but that would not assure me
that the problem will not resurface. So, for the time being, I will leave my
second production server as is ... so I can further narrow down the
potential reasons for this strange STDOUT slowdown (especially if someone has a
tip for me :-)
Andras Fabian
(in the meantime my "slow" server finished the COPY ... it took 46 minutes
instead of 3 minutes on the fast machines ... a slowdown of factor 15).
-----Original Message-----
From: Andras Fabian
Sent: Monday, 12 July 2010 10:45
To: 'Tom Lane'
Cc: [email protected]
Subject: RE: [GENERAL] PG_DUMP very slow because of STDOUT ??
Hi Tom (or others),
are there any recommended settings/ways to use oprofile in a situation like
this? I got it working and have seen a first profile report, but then managed
to completely freeze the server on a second try with different oprofile
settings (the next tests will run against the newly installed, identical new
servers).
Andras Fabian
-----Original Message-----
From: Tom Lane [mailto:[email protected]]
Sent: Friday, 9 July 2010 15:39
To: Andras Fabian
Cc: [email protected]
Subject: Re: [GENERAL] PG_DUMP very slow because of STDOUT ??
Andras Fabian <[email protected]> writes:
> Now I ask, what's going on here ???? Why is COPY via STDOUT so much slower on
> our new machine?
Something weird about the network stack on the new machine, maybe.
Have you compared the transfer speeds for Unix-socket and TCP connections?
On a Red Hat box I would try using oprofile to see where the bottleneck
is ... don't know if that's available for Ubuntu.
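The Unix-socket versus TCP comparison suggested above could be scripted roughly as follows. This is only a sketch under stated assumptions: psql is on the PATH, the socket directory is /var/run/postgresql (the usual Debian/Ubuntu location, not confirmed for this setup), and the table name is passed in by the caller. The helper names are hypothetical.

```python
import subprocess
import sys
import time


def copy_command(host, dbname, table):
    """Build a psql call that streams COPY output to stdout.

    A `host` starting with '/' makes libpq use the Unix socket in that
    directory; a host name such as 'localhost' makes it use TCP.
    """
    return ["psql", "-h", host, "-d", dbname, "-c", f"COPY {table} TO STDOUT"]


def time_command(cmd, chunk_size=1 << 20):
    """Run cmd, drain its stdout, and return (bytes_read, seconds_elapsed)."""
    start = time.perf_counter()
    total = 0
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        while True:
            chunk = proc.stdout.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start


if __name__ == "__main__" and len(sys.argv) > 1:
    table = sys.argv[1]  # table to dump, supplied by the caller
    # Socket directory is an assumption; database name is taken from the ps output.
    for host in ("/var/run/postgresql", "localhost"):
        nbytes, secs = time_command(copy_command(host, "musicload_cache", table))
        print(f"{host}: {nbytes} bytes in {secs:.1f} s")
```

If both connection types are equally slow on the affected machine, the network stack becomes a less likely culprit than something on the reading or writing side of the pipe.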
regards, tom lane
--
Sent via pgsql-general mailing list ([email protected])