Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-19 Thread MauMau

From: Jeff Janes jeff.ja...@gmail.com

I thought that this was the point I was making, not the point I was
missing.  You have the same hard drives you had before, but now due to a
software improvement you are cramming 5 times more stuff through them.
Yeah, you will get bigger latency spikes.  Why wouldn't you?  You are now
beating the snot out of your hard drives, whereas before you were not.

If you need 10,000 TPS, then you need to upgrade to 9.4.  If you need it
with low maximum latency as well, then you probably need to get better IO
hardware as well (maybe not--maybe more tuning could help).  With 9.3 you
didn't need better IO hardware, because you weren't capable of maxing out
what you already had.  With 9.4 you can max it out, and this is a good
thing.

If you need 10,000 TPS but only 2000 TPS are completing under 9.3, then
what is happening to the other 8000 TPS? Whatever is happening to them, it
must be worse than a latency spike.

On the other hand, if you don't need 10,000 TPS, then measuring max latency
at 10,000 TPS is the wrong thing to measure.


Thank you, I've probably got the point --- you mean the hard disk for WAL is 
the bottleneck.  But I still wonder a bit why the latency spike became so much 
bigger even with # of clients fewer than # of CPU cores.  I suppose the 
requests get processed more smoothly when the number of simultaneous 
requests is small.  Anyway, I want to believe the latency spike would become 
significantly smaller on an SSD.


Regards
MauMau






Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

On 2014-02-18 01:35:52 +0900, MauMau wrote:

For example, please see the max latencies of test set 2 (PG 9.3) and test
set 4 (xlog scaling with padding).  They are 207.359 and 1219.422
respectively.  The throughput is of course greatly improved, but I think the
response time should be sacrificed as little as possible.  There are some
users who are sensitive to max latency, such as stock exchanges and online
games.


You need to compare both at the same throughput to have any meaningful
comparison.


I'm sorry for my lack of understanding, but could you tell me why you think 
so?  When the user upgrades to 9.4 and runs the same workload, he would 
experience vastly increased max latency --- or in other words, greater 
variance in response times.  With my simple understanding, that sounds like 
a problem for response-sensitive users.



Regards
MauMau





Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Andres Freund
On 2014-02-18 20:49:06 +0900, MauMau wrote:
 From: Andres Freund and...@2ndquadrant.com
 On 2014-02-18 01:35:52 +0900, MauMau wrote:
 For example, please see the max latencies of test set 2 (PG 9.3) and test
 set 4 (xlog scaling with padding).  They are 207.359 and 1219.422
 respectively.  The throughput is of course greatly improved, but I think the
 response time should be sacrificed as little as possible.  There are some
 users who are sensitive to max latency, such as stock exchanges and online
 games.
 
 You need to compare both at the same throughput to have any meaningful
 comparison.
 
 I'm sorry for my lack of understanding, but could you tell me why you think
 so?  When the user upgrades to 9.4 and runs the same workload, he would
 experience vastly increased max latency --- or in other words, greater
 variance in response times.

No, the existing data indicates no such thing. When they upgrade they
will have the *same* throughput as before. The datapoints you cite
indicate that there's an increase in latency, but it's there while
processing several times as much data! The highest throughput of set 2 is
3223, while the highest for set 4 is 14145.
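(That is, roughly a factor of 4.4 in throughput against a factor of 5.9 in
max latency, comparing those figures with the 207.359 vs. 1219.422 max
latencies quoted above.)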
To get an interesting latency comparison you'd need to use pgbench --rate
and use a rate *both* versions can sustain.
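
For instance, something like this on both versions (a sketch; the client
count and the 2000 TPS cap are illustrative, picked so that even the 9.3
build can sustain the rate):

    pgbench -c 16 -j 16 -R 2000 -T 300 pgbench

With --rate, pgbench starts transactions along a Poisson-distributed
schedule, so the reported latency also counts how far each transaction
lagged behind its scheduled start time, which is what matters when the
workload is held fixed.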

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Jeff Janes
On Tue, Feb 18, 2014 at 3:49 AM, MauMau maumau...@gmail.com wrote:

 From: Andres Freund and...@2ndquadrant.com

 On 2014-02-18 01:35:52 +0900, MauMau wrote:

 For example, please see the max latencies of test set 2 (PG 9.3) and test
 set 4 (xlog scaling with padding).  They are 207.359 and 1219.422
 respectively.  The throughput is of course greatly improved, but I think the
 response time should be sacrificed as little as possible.  There are some
 users who are sensitive to max latency, such as stock exchanges and online
 games.


 You need to compare both at the same throughput to have any meaningful
 comparison.


 I'm sorry for my lack of understanding, but could you tell me why you
 think so?  When the user upgrades to 9.4 and runs the same workload, he
 would experience vastly increased max latency


The tests shown did not test that.  The test is not running the same
workload on 9.4, but rather a vastly higher workload.  If we were to
throttle the workload in 9.4 (using pgbench's new -R, for example) to the
same level it was at in 9.3, we probably would not see the max latency
increase.  But that was not tested, so we don't know for sure.



 --- or in other words, greater variance in response times.  With my simple
 understanding, that sounds like a problem for response-sensitive users.


If you need the throughput provided by 9.4, then using 9.3 gets lower
variance simply by refusing to do 80% of the assigned work.  If you don't
need the throughput provided by 9.4, then you probably have some natural
throttling in place.

If you want a real-world-like test, you might try to crank up the -c and -j
to the limit in 9.3 in a vain effort to match 9.4's performance, and see
what that does to max latency.  (After all, that is what a naive web app is
likely to do--continue to make more and more connections as requests come
in faster than they can finish.)
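
As a sketch (the connection counts are illustrative, not taken from these
tests; pgbench requires -c to be a multiple of -j):

    pgbench -c 256 -j 64 -T 300 pgbench

i.e., far more connections than cores, which is how a connection-per-request
app behaves once requests start backing up.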

Cheers,

Jeff


Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Heikki Linnakangas

On 02/18/2014 06:27 PM, Jeff Janes wrote:

On Tue, Feb 18, 2014 at 3:49 AM, MauMau maumau...@gmail.com wrote:


--- or in other words, greater variance in response times.  With my simple
understanding, that sounds like a problem for response-sensitive users.


If you need the throughput provided by 9.4, then using 9.3 gets lower
variance simply by refusing to do 80% of the assigned work.  If you don't
need the throughput provided by 9.4, then you probably have some natural
throttling in place.

If you want a real-world-like test, you might try to crank up the -c and -j
to the limit in 9.3 in a vain effort to match 9.4's performance, and see
what that does to max latency.  (After all, that is what a naive web app is
likely to do--continue to make more and more connections as requests come
in faster than they can finish.)


You're missing MauMau's point. In essence, he's comparing two systems 
with the same number of clients, issuing queries as fast as they can, 
and one can do 2000 TPS while the other one can do 10,000 TPS. You would 
expect the lower-throughput system to have a *higher* average latency. 
Each query takes longer; that's why the throughput is lower. If you look 
at the avg_latency columns in the graphs 
(http://hlinnaka.iki.fi/xloginsert-scaling/padding/), that's exactly 
what you see.


But what MauMau is pointing out is that the *max* latency is much higher 
in the system that can do 10,000 TPS. So some queries are taking much 
longer, even though on average the latency is lower. In an ideal, 
totally fair system, each query would take the same amount of time to 
execute, and after it's saturated, increasing the number of clients just 
makes that constant latency higher.


Yeah, I'm pretty sure that's because of the extra checkpoints. If you 
look at the individual test graphs, there are clear spikes in latency, 
but the latency is otherwise small. With a higher TPS, you reach 
checkpoint_segments quicker; I should've eliminated that effect in the 
tests I ran...
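
One way to check that hypothesis (a sketch; log_checkpoints and the
pg_stat_bgwriter view are the standard tools here) is to compare the
checkpoint counters before and after each run:

    SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;

and to set log_checkpoints = on, so the spikes in the graphs can be lined
up with checkpoint start and end times in the server log.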


- Heikki




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Andres Freund
On 2014-02-18 19:12:32 +0200, Heikki Linnakangas wrote:
 You're missing MauMau's point. In essence, he's comparing two systems with
 the same number of clients, issuing queries as fast as they can, and one can
 do 2000 TPS while the other one can do 10,000 TPS. You would expect the
 lower-throughput system to have a *higher* average latency. Each query takes
 longer; that's why the throughput is lower. If you look at the avg_latency
 columns in the graphs (http://hlinnaka.iki.fi/xloginsert-scaling/padding/),
 that's exactly what you see.
 
 But what MauMau is pointing out is that the *max* latency is much higher in
 the system that can do 10,000 TPS. So some queries are taking much longer,
 even though on average the latency is lower. In an ideal, totally fair
 system, each query would take the same amount of time to execute, and after
 it's saturated, increasing the number of clients just makes that constant
 latency higher.

Consider me enthusiastically unenthusiastic about that fact. The
change in throughput still makes this pretty uninteresting. There are so
many things influenced by a factor-of-5 increase in throughput that a
change in max latency is really not saying much.
There's also the point that with 5 times the throughput it's more likely
that a backend sleeps while holding critical locks and such.

 Yeah, I'm pretty sure that's because of the extra checkpoints. If you look
 at the individual test graphs, there are clear spikes in latency, but the
 latency is otherwise small. With a higher TPS, you reach checkpoint_segments
 quicker; I should've eliminated that effect in the tests I ran...

I don't think that'd be a good idea. The number of full page writes so
greatly influences the WAL characteristics that changing checkpoint
segments would make the tests much harder to compare.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Heikki Linnakangas

On 02/18/2014 10:51 PM, Andres Freund wrote:

On 2014-02-18 19:12:32 +0200, Heikki Linnakangas wrote:

Yeah, I'm pretty sure that's because of the extra checkpoints. If you look
at the individual test graphs, there are clear spikes in latency, but the
latency is otherwise small. With a higher TPS, you reach checkpoint_segments
quicker; I should've eliminated that effect in the tests I ran...


I don't think that'd be a good idea. The number of full page writes so
greatly influences the WAL characteristics that changing checkpoint
segments would make the tests much harder to compare.


I was just thinking of bumping up checkpoint_segments so high that there 
are no checkpoints during any of the tests.
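
Something like this in postgresql.conf (a sketch; the exact values just
need to push both checkpoint triggers past the length of a test run):

    checkpoint_segments = 1000
    checkpoint_timeout = 1h

so that neither the WAL-volume trigger nor the timer fires mid-test.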


- Heikki




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Andres Freund
On 2014-02-18 23:01:08 +0200, Heikki Linnakangas wrote:
 On 02/18/2014 10:51 PM, Andres Freund wrote:
 On 2014-02-18 19:12:32 +0200, Heikki Linnakangas wrote:
 Yeah, I'm pretty sure that's because of the extra checkpoints. If you look
 at the individual test graphs, there are clear spikes in latency, but the
 latency is otherwise small. With a higher TPS, you reach checkpoint_segments
 quicker; I should've eliminated that effect in the tests I ran...
 
 I don't think that'd be a good idea. The number of full page writes so
 greatly influences the WAL characteristics that changing checkpoint
 segments would make the tests much harder to compare.
 
 I was just thinking of bumping up checkpoint_segments so high that there are
 no checkpoints during any of the tests.

Hm. I actually think that full page writes are an interesting part of
this because they are sized so differently from normal records.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-18 Thread Jeff Janes
On Tue, Feb 18, 2014 at 9:12 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 02/18/2014 06:27 PM, Jeff Janes wrote:

 On Tue, Feb 18, 2014 at 3:49 AM, MauMau maumau...@gmail.com wrote:

 --- or in other words, greater variance in response times.  With my simple
 understanding, that sounds like a problem for response-sensitive users.


 If you need the throughput provided by 9.4, then using 9.3 gets lower
 variance simply by refusing to do 80% of the assigned work.  If you don't
 need the throughput provided by 9.4, then you probably have some natural
 throttling in place.

 If you want a real-world-like test, you might try to crank up the -c and -j
 to the limit in 9.3 in a vain effort to match 9.4's performance, and see
 what that does to max latency.  (After all, that is what a naive web app is
 likely to do--continue to make more and more connections as requests come
 in faster than they can finish.)


 You're missing MauMau's point. In essence, he's comparing two systems with
 the same number of clients, issuing queries as fast as they can, and one
 can do 2000 TPS while the other one can do 10,000 TPS. You would expect the
 lower-throughput system to have a *higher* average latency. Each query
 takes longer; that's why the throughput is lower. If you look at the
 avg_latency columns in the graphs
 (http://hlinnaka.iki.fi/xloginsert-scaling/padding/), that's exactly what
 you see.

 But what MauMau is pointing out is that the *max* latency is much higher
 in the system that can do 10,000 TPS. So some queries are taking much
 longer, even though on average the latency is lower. In an ideal, totally
 fair system, each query would take the same amount of time to execute, and
 after it's saturated, increasing the number of clients just makes that
 constant latency higher.


I thought that this was the point I was making, not the point I was
missing.  You have the same hard drives you had before, but now due to a
software improvement you are cramming 5 times more stuff through them.
 Yeah, you will get bigger latency spikes.  Why wouldn't you?  You are now
beating the snot out of your hard drives, whereas before you were not.

If you need 10,000 TPS, then you need to upgrade to 9.4.  If you need it
with low maximum latency as well, then you probably need to get better IO
hardware as well (maybe not--maybe more tuning could help).  With 9.3 you
didn't need better IO hardware, because you weren't capable of maxing out
what you already had.  With 9.4 you can max it out, and this is a good
thing.

If you need 10,000 TPS but only 2000 TPS are completing under 9.3, then
what is happening to the other 8000 TPS? Whatever is happening to them, it
must be worse than a latency spike.

On the other hand, if you don't need 10,000 TPS, then measuring max latency
at 10,000 TPS is the wrong thing to measure.

Cheers,

Jeff


[HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-17 Thread MauMau

Hello Heikki san,

I'm excited about your great work, xlog scaling.  I'm looking forward to the 
release of 9.4.


Please let me ask you about your performance data on the page:

http://hlinnaka.iki.fi/xloginsert-scaling/padding/

I'm worried about the big increase in max latency.  Do you know the cause? 
More frequent checkpoints caused by increased WAL volume thanks to enhanced 
performance?


Although I'm not sure this is related to what I'm asking, the following code 
fragment in WALInsertSlotAcquireOne() caught my eye.  Shouldn't the if 
condition be "slotno == -1" instead of "!="?  I thought this part wants to 
make inserters use another slot on the next insertion, when they fail to 
acquire the slot immediately.  Inserters pass slotno == -1.  I'm sorry if I 
misunderstood the code.


/*
 * If we couldn't get the slot immediately, try another slot next time.
 * On a system with more insertion slots than concurrent inserters, this
 * causes all the inserters to eventually migrate to a slot that no-one
 * else is using. On a system with more inserters than slots, it still
 * causes the inserters to be distributed quite evenly across the slots.
 */
if (slotno != -1 && retry)
 slotToTry = (slotToTry + 1) % num_xloginsert_slots;
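
That is, I think the condition was meant to read:

if (slotno == -1 && retry)
 slotToTry = (slotToTry + 1) % num_xloginsert_slots;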

Regards
MauMau





Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-17 Thread Andres Freund
Hi,

On 2014-02-18 00:43:54 +0900, MauMau wrote:
 Please let me ask you about your performance data on the page:
 
 http://hlinnaka.iki.fi/xloginsert-scaling/padding/
 
 I'm worried about the big increase in max latency.  Do you know the cause?
 More frequent checkpoints caused by increased WAL volume thanks to enhanced
 performance?

I don't see much evidence of increased latency there? You can't really
compare the latency when the throughput is significantly different.

 Although I'm not sure this is related to what I'm asking, the following code
 fragment in WALInsertSlotAcquireOne() caught my eye.  Shouldn't the if
 condition be "slotno == -1" instead of "!="?  I thought this part wants to
 make inserters use another slot on the next insertion, when they fail to
 acquire the slot immediately.  Inserters pass slotno == -1.  I'm sorry if I
 misunderstood the code.

I think you're right.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-17 Thread MauMau

From: Andres Freund and...@2ndquadrant.com

On 2014-02-18 00:43:54 +0900, MauMau wrote:
I'm worried about the big increase in max latency.  Do you know the cause?
More frequent checkpoints caused by increased WAL volume thanks to enhanced
performance?


I don't see much evidence of increased latency there? You can't really
compare the latency when the throughput is significantly different.


For example, please see the max latencies of test set 2 (PG 9.3) and test 
set 4 (xlog scaling with padding).  They are 207.359 and 1219.422 
respectively.  The throughput is of course greatly improved, but I think the 
response time should be sacrificed as little as possible.  There are some 
users who are sensitive to max latency, such as stock exchanges and online 
games.



Although I'm not sure this is related to what I'm asking, the following code
fragment in WALInsertSlotAcquireOne() caught my eye.  Shouldn't the if
condition be "slotno == -1" instead of "!="?  I thought this part wants to
make inserters use another slot on the next insertion, when they fail to
acquire the slot immediately.  Inserters pass slotno == -1.  I'm sorry if I
misunderstood the code.


I think you're right.


Thanks for your confirmation.  I'd be glad if the fix could bring any 
positive impact on max latency.


Regards
MauMau





Re: [HACKERS] Do you know the reason for increased max latency due to xlog scaling?

2014-02-17 Thread Andres Freund
On 2014-02-18 01:35:52 +0900, MauMau wrote:
 From: Andres Freund and...@2ndquadrant.com
 On 2014-02-18 00:43:54 +0900, MauMau wrote:
 I'm worried about the big increase in max latency.  Do you know the cause?
 More frequent checkpoints caused by increased WAL volume thanks to enhanced
 performance?
 
 I don't see much evidence of increased latency there? You can't really
 compare the latency when the throughput is significantly different.
 
 For example, please see the max latencies of test set 2 (PG 9.3) and test
 set 4 (xlog scaling with padding).  They are 207.359 and 1219.422
 respectively.  The throughput is of course greatly improved, but I think the
 response time should be sacrificed as little as possible.  There are some
 users who are sensitive to max latency, such as stock exchanges and online
 games.

You need to compare both at the same throughput to have any meaningful
comparison.

Greetings,

Andres Freund

-- 
 Andres Freund http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

