[SQL] [noob] How to optimize this double pivot query?

2012-10-01 Thread Robert Buck
I have two tables that contain key-value data that I want to combine in
pivoted form into a single result set. They are related to two separate
tables.

The tables are: test_results, test_variables, metric_def, metadata_key. The
latter two tables are enum-like tables, basic descriptors of data stored in
other tables. The former two tables are basically key-value tables (with
ids as well); these k-v tables are related to the latter two tables via
foreign keys.

The following SQL takes about 11 seconds to run on a high-end laptop. The
largest table is about 54k records, pretty puny.

Can someone provide a hint as to why this is so slow? Again, I am a noob to
SQL, so the SQL is probably poorly written.

Thanks in advance,

Bob

select

    t.id_name,
    max(t.begin_time) as begin_time,
    max(t.end_time) as end_time,

    max(case when (m.id_name = 'package-version') then v.value end) as package_version,
    max(case when (m.id_name = 'database-vendor') then v.value end) as database_vendor,
    max(case when (m.id_name = 'bean-name') then v.value end) as bean_name,
    max(case when (m.id_name = 'request-distribution') then v.value end) as request_distribution,
    max(case when (m.id_name = 'ycsb-workload') then v.value end) as ycsb_workload,
    max(case when (m.id_name = 'record-count') then v.value end) as record_count,
    max(case when (m.id_name = 'transaction-engine-count') then v.value end) as transaction_engine_count,
    max(case when (m.id_name = 'transaction-engine-maxmem') then v.value end) as transaction_engine_maxmem,
    max(case when (m.id_name = 'storage-manager-count') then v.value end) as storage_manager_count,
    max(case when (m.id_name = 'test-instance-count') then v.value end) as test_instance_count,
    max(case when (m.id_name = 'operation-count') then v.value end) as operation_count,
    max(case when (m.id_name = 'update-percent') then v.value end) as update_percent,
    max(case when (m.id_name = 'thread-count') then v.value end) as thread_count,

    max(case when (d.id_name = 'tps') then r.value end) as tps,
    max(case when (d.id_name = 'Memory') then r.value end) as memory,
    max(case when (d.id_name = 'DiskWritten') then r.value end) as disk_written,
    max(case when (d.id_name = 'PercentUserTime') then r.value end) as percent_user,
    max(case when (d.id_name = 'PercentCpuTime') then r.value end) as percent_cpu,
    max(case when (d.id_name = 'UserMilliseconds') then r.value end) as user_milliseconds,
    max(case when (d.id_name = 'YcsbUpdateLatencyMicrosecs') then r.value end) as update_latency,
    max(case when (d.id_name = 'YcsbReadLatencyMicrosecs') then r.value end) as read_latency,
    max(case when (d.id_name = 'Updates') then r.value end) as updates,
    max(case when (d.id_name = 'Deletes') then r.value end) as deletes,
    max(case when (d.id_name = 'Inserts') then r.value end) as inserts,
    max(case when (d.id_name = 'Commits') then r.value end) as commits,
    max(case when (d.id_name = 'Rollbacks') then r.value end) as rollbacks,
    max(case when (d.id_name = 'Objects') then r.value end) as objects,
    max(case when (d.id_name = 'ObjectsCreated') then r.value end) as objects_created,
    max(case when (d.id_name = 'FlowStalls') then r.value end) as flow_stalls,
    max(case when (d.id_name = 'NodeApplyPingTime') then r.value end) as node_apply_ping_time,
    max(case when (d.id_name = 'NodePingTime') then r.value end) as node_ping_time,
    max(case when (d.id_name = 'ClientCncts') then r.value end) as client_connections,
    max(case when (d.id_name = 'YcsbSuccessCount') then r.value end) as success_count,
    max(case when (d.id_name = 'YcsbWarnCount') then r.value end) as warn_count,
    max(case when (d.id_name = 'YcsbFailCount') then r.value end) as fail_count

from test as t

left join test_results as r on r.test_id = t.id
left join test_variables as v on v.test_id = t.id
left join metric_def as d on d.id = r.metric_def_id
left join metadata_key as m on m.id = v.metadata_key_id

group by t.id_name

;

"GroupAggregate  (cost=5.87..225516.43 rows=926 width=81)"
"  ->  Nested Loop Left Join  (cost=5.87..53781.24 rows=940964 width=81)"
"->  Nested Loop Left Join  (cost=1.65..1619.61 rows=17235
width=61)"
"  ->  Index Scan using test_uc on test t  (cost=0.00..90.06
rows=926 width=36)"
"  ->  Hash Right Join  (cost=1.65..3.11 rows=19 width=29)"
"Hash Cond: (m.id = v.metadata_key_id)"
"->  Seq Scan on metadata_key m  (cost=0.00..1.24
rows=24 width=21)"
"->  Hash  (cost=1.41..1.41 rows=19 width=16)"
"  ->  Index Scan using test_variables_test_id_idx
on test_variables v  (cost=0.00..1.41 rows=19 width=16)"
"Index Cond: (test_id = t.id)"
"->  Hash Right Join  (cost=4.22..6.69 rows=55 width=28)"
"  Hash Cond: (d.id = r.metric_def_id)"
"  ->  Seq Scan on metric

Re: [SQL] [noob] How to optimize this double pivot query?

2012-10-01 Thread Robert Buck
So as you can probably glean, the tables store performance metric data. The
reason I chose to use k-v is simply to avoid having to create an additional
column every time a new metric type comes along. Those were the two options
I thought of: straight k-v, or a column for every value type.

Are there other, better options worth considering that you could point me
towards for storing metrics with an unbounded number of metric types, as in
my case?

Bob

On Mon, Oct 1, 2012 at 9:07 PM, David Johnston  wrote:

> From: pgsql-sql-ow...@postgresql.org [mailto:pgsql-sql-ow...@postgresql.org] On Behalf Of Robert Buck
> Sent: Monday, October 01, 2012 8:47 PM
> To: pgsql-sql@postgresql.org
> Subject: [SQL] [noob] How to optimize this double pivot query?
>
>
> I have two tables that contain key-value data that I want to combine in
> pivoted form into a single result set. They are related to two separate
> tables.
>
> The tables are: test_results, test_variables, metric_def, metadata_key.
> The latter two tables are enum-like tables, basic descriptors of data
> stored in other tables. The former two tables are basically key-value
> tables (with ids as well); these k-v tables are related to the latter two
> tables via foreign keys.
>
> The following SQL takes about 11 seconds to run on a high-end laptop. The
> largest table is about 54k records, pretty puny.
>
> Can someone provide a hint as to why this is so slow? Again, I am a noob
> to SQL, so the SQL is probably poorly written.
>
> 
>
>
> Your query, while maybe not great, isn’t the cause of your problem.  It is
> the table schema, specifically the “key-value” aspect, that is killing you.
> 
>
>
> You may want to try:
>
>
> SELECT *
>
> FROM (SELECT id FROM …) id_master
>
> NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype =
> ‘’) f1
>
> NATURAL LEFT JOIN (SELECT id, field_value AS … FROM … WHERE fieldtype =
> ‘’) f2
>
> [repeat one left join for every field; though you will then need to decide
> if/how to deal with NULL – not that you are currently doing anything
> special anyway…]
>
>
> Mainly the above avoids the use of “max()” and instead uses direct joins
> between the relevant tables.  I have no clue whether that will improve
> things but if you are going to lie in this bed you should at least try
> different positions.
>
>
> The better option is to educate yourself on better ways of constructing
> the tables so that you do not have to write this kind of god-awful query.
> In some cases key-value has merit but usually only when done in
> moderation.  Not for the entire database.  You likely should simply have a
> table that looks like the result of the query below.
>
>
> As a second (not necessarily mutually exclusive) alternative: install and
> use the hstore extension.
>
>
> David J.
>
>
>
> Thanks in advance,
>
> Bob
>
> select
>
> t.id_name,
> max(t.begin_time) as begin_time,
> max(t.end_time) as end_time,
>
> max(case when (m.id_name = 'package-version') then v.value end) as
> package_version,
> max(case when (m.id_name = 'database-vendor') then v.value end) as
> database_vendor,
> max(case when (m.id_name = 'bean-name') then v.value end) as bean_name,
> max(case when (m.id_name = 'request-distribution') then v.value end)
> as request_distribution,
> max(case when (m.id_name = 'ycsb-workload') then v.value end) as
> ycsb_workload,
> max(case when (m.id_name = 'record-count') then v.value end) as
> record_count,
> max(case when (m.id_name = 'transaction-engine-count') then v.value
> end) as transaction_engine_count,
> max(case when (m.id_name = 'transaction-engine-maxmem') then v.value
> end) as transaction_engine_maxmem,
> max(case when (m.id_name = 'storage-manager-count') then v.value end)
> as storage_manager_count,
> max(case when (m.id_name = 'test-instance-count') then v.value end) as
> test_instance_count,
> max(case when (m.id_name = 'operation-count') then v.value end) as
> operation_count,
> max(case when (m.id_name = 'update-percent') then v.value end) as
> update_percent,
> max(case when (m.id_name = 'thread-count') then v.value end) as
> thread_count,
>
> max(case when (d.id_name = 'tps') then r.value end) as tps,
> max(case when (d.id_name = 'Memory') then r.value end) as memory,
> max(case when (d.id_n
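As a concrete, hedged sketch of the join-per-key pattern suggested above, with the placeholders filled in from the table and column names in the original query (test, test_variables, metadata_key, test_results, metric_def) and explicit join conditions instead of NATURAL joins. Only three keys are shown, and whether this actually beats the max(case ...) version would need to be checked with EXPLAIN ANALYZE:

    select
        t.id_name,
        t.begin_time,
        t.end_time,
        pv.package_version,
        dv.database_vendor,
        tps.tps
    from test as t
    left join (select v.test_id, v.value as package_version
               from test_variables v
               join metadata_key m on m.id = v.metadata_key_id
               where m.id_name = 'package-version') as pv on pv.test_id = t.id
    left join (select v.test_id, v.value as database_vendor
               from test_variables v
               join metadata_key m on m.id = v.metadata_key_id
               where m.id_name = 'database-vendor') as dv on dv.test_id = t.id
    left join (select r.test_id, r.value as tps
               from test_results r
               join metric_def d on d.id = r.metric_def_id
               where d.id_name = 'tps') as tps on tps.test_id = t.id
    -- ...repeat one left join per key or metric for each remaining column
    ;

For the hstore alternative also mentioned above, a minimal sketch; the test_metrics table and its contents are hypothetical, invented only to show key=>value storage and the -> lookup operator, and the casts assume the stored values are numeric:

    create extension if not exists hstore;

    -- hypothetical table: one row per test run, all metric values in one hstore column
    create table test_metrics (
        test_id integer primary key,
        metrics hstore
    );

    insert into test_metrics (test_id, metrics)
    values (1, 'tps=>1234.5, Memory=>2048'::hstore);

    -- pull individual keys back out as columns
    select test_id,
           (metrics -> 'tps')::numeric    as tps,
           (metrics -> 'Memory')::numeric as memory
    from test_metrics;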

Re: [SQL] [noob] How to optimize this double pivot query?

2012-10-02 Thread Robert Buck
Hi Samuel 

Thank you. This may be a bit of a stretch for you, but would it be possible for
me to peek at a sanitized version of your crosstab query, as a good example of
how to do this for this noob?

This will be pretty common in my case. The biggest tables will get much larger 
as they are raw metrics feeds, which at some point need to be fed through 
reporting engines to analyze and spot regressions.

Lastly, am I simply using the wrong tech for data feeds and analytics? The
first cut of this used flat files and R, and even though it scoured thousands
of files it was much faster than the SQL I wrote here. The big goal was to get
this off disk and into a database, but because it is highly variable, very
sparse metric data, that is why I chose k-v. SQL databases are internally more
politically acceptable, though I am personally agnostic on the matter. In the
end it would be nice to report directly off a database, but so long as I can
transform to CSV I can always perform reporting and analytics in R, and
optionally map and reduce natively in Ruby. Sane? Ideas? This is early on, and
I am willing to adjust course and find a better way if suggestions indicate
such. I've heard a couple of options so far.

Best regards,

Bob

On Oct 2, 2012, at 5:21 AM, Samuel Gendler  wrote:

> 
> 
> On Mon, Oct 1, 2012 at 11:46 PM, Thomas Kellerer  wrote:
> 
> That combined with the tablefunc module (which lets you do pivot queries) might
> make your queries substantially more readable (and maybe faster as well).
> 
> 
> I would think that using the crosstab functions in tablefunc would solve the
> problem without needing a complete change of structure. I've built crosstabs
> over a whole lot more than 54K rows, with more than 35 resulting columns, in
> far, far less time than the 11 seconds quoted here, without feeling the need
> to deal with hstore or similar.  In fact, wouldn't hstore actually make it
> more difficult to build a crosstab query than the schema that he has in place
> now?
> 
> --sam
> 


Re: [SQL] [noob] How to optimize this double pivot query?

2012-10-03 Thread Robert Buck
Thank you, Samuel.

I am trying some of this out right now...

This is great information.

Thanks so much. This is a huge help.

Bob

On Tue, Oct 2, 2012 at 4:09 PM, Samuel Gendler wrote:

> One last comment - an alternative solution to this is to use array_agg(),
> which will simply add each value that matches a group by clause to an
> array.  If you are careful to structure your query with left joins so that
> every grouping will have the same number of rows, then you can get a quick
> and dirty crosstab like this
>
> select row_name, array_agg(value) from test t left join value_table v on
> t.id = v.id group by 1;
>
> Obviously, you'll want to deal with your potential for duplicate rows via
> max() or avg() in a subquery, rather than joining directly to the table,
> but you should get the idea from that example.  You can also use coalesce
> to convert nulls to some other value, if required.
>
> Since the crosstab functions already require you to do all that work with
> regard to determining column names and building up the record structure,
> using array_agg can be every bit as effective, since it basically requires
> the same process.  First query for all possible names, then issue a query
> that will cause the values to be processed by array_agg in column order,
> then iterate over results, getting each array value and associating it with
> a particular name.
>
> Your result will look  like this:
>
> id_name, start_time, end_time, array_of_values
>
> That may or may not be convenient for you, depending upon how you are
> using the resultset you get back. You'll still need to play all the same
> games with regard to unioning multiple queries together to pivot data from
> multiple tables into the same row.
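A minimal sketch of the array_agg() approach Samuel describes above, against the poster's test/test_variables/metadata_key tables. The cross join to metadata_key, the max() subquery, and the coalesce default are assumptions added so that every test produces the same number of array elements, in the same key-name order, with duplicates collapsed (the empty-string default assumes value is a text column):

    select
        t.id_name,
        max(t.begin_time) as begin_time,
        max(t.end_time)   as end_time,
        array_agg(coalesce(kv.value, '') order by m.id_name) as variable_values
    from test t
    cross join metadata_key m
    left join (select v.test_id, v.metadata_key_id, max(v.value) as value
               from test_variables v
               group by v.test_id, v.metadata_key_id) as kv
           on kv.test_id = t.id
          and kv.metadata_key_id = m.id
    group by t.id_name;

A separate "select id_name from metadata_key order by id_name" then gives the matching ordered list of key names, so each array position can be paired with its key, as described above.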
>
>
>
> On Tue, Oct 2, 2012 at 12:57 PM, Samuel Gendler  > wrote:
>
>>
>>
>> On Tue, Oct 2, 2012 at 2:45 AM, Robert Buck wrote:
>>
>>> Hi Samuel
>>>
>>> Thank you. This may be a bit of a stretch for you, but would it be
>>> possible for me to peek at a sanitized version of your cross tab query, for
>>> a good example on how to do this for this noob?
>>>
>>> This will be pretty common in my case. The biggest tables will get much
>>> larger as they are raw metrics feeds, which at some point need to be fed
>>> through reporting engines to analyze and spot regressions.
>>>
>>> Lastly, am I simply using the wrong tech for data feeds and analytics?
>>> The first cut of this used flat files and R and though it scoured thousands
>>> of files was much faster than the SQL I wrote here. The big goal was to get
>>> this off disk and into a database, but as its highly variable, very sparse,
>>> metric data, this is why I chose k-v. SQL databases are internally more
>>> politically acceptable, though I am personally agnostic on the matter. In
>>> the end it would be nice to directly report off a database, but so long as
>>> I can transform to csv I can always perform reporting and analytics in R,
>>> and optionally map and reduce natively in Ruby. Sane? Ideas? This is early
>>> on, and willing to adjust course and find a better way if suggestions
>>> indicate such. I've heard a couple options so far.
>>>
>>
>> OK, you owe me.  I think I just spent more than an hour writing this up
>> ;-)
>>
>> Given the numbers of rows you are talking about, I can't think of any
>> good reason why the database shouldn't do what you need it to do pretty
>> effectively/quickly.
>>
>> Some questions before I provide crosstab samples - your example query has
>> a hardcoded set of keys that it is looking for, but maybe that was
>> code-generated.  There are multiple forms of the crosstab function, some of
>> which rely on a constant column count and others which can generate the set
>> of columns based on another query.  There are complications in that second
>> form relating to race conditions with regard to new keys showing up between
>> the query for the columns and the query for the data, so it is important to
>> understand that and either structure queries accordingly or make sure you
>> execute in a transaction with sufficient transaction isolation to prevent
>> the race condition from appearing.
>>
>> For crosstab queries, you generally want a query that returns results in
>> the following form:
>>
>> row_name | key | value
>>
>> where row_name would be a date or some other label that all values that
>> should share a row will have in common.  Key is the field that defines a
>> column in the final row.  And value is the v
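The archive truncates the rest of this message here. As a hedged sketch of the kind of crosstab() call being described, using the tablefunc crosstab(text, text) form so the key set is not hardcoded in the source query; the four keys listed, the text column types, and the use of max() to collapse duplicates are all assumptions layered on the poster's schema:

    create extension if not exists tablefunc;

    select *
    from crosstab(
           -- source query: row_name | category (key) | value, ordered by row_name
           $$select t.id_name, m.id_name, max(v.value)
             from test t
             left join test_variables v on v.test_id = t.id
             left join metadata_key m on m.id = v.metadata_key_id
             group by 1, 2
             order by 1, 2$$,
           -- category query: the stable, ordered list of keys, one per output column
           $$select id_name from metadata_key
             where id_name in ('bean-name', 'database-vendor',
                               'operation-count', 'package-version')
             order by id_name$$
         ) as ct(id_name         text,
                 bean_name       text,
                 database_vendor text,
                 operation_count text,
                 package_version text);

In practice the category query would list every key, with one matching column in the record definition per key, in category order; the test_results/metric_def pair can be pivoted the same way and then combined (Samuel mentions unioning) into the same output row.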

Re: [SQL] [noob] How to optimize this double pivot query?

2012-10-03 Thread Robert Buck
Samuel,

You asked:

Some questions before I provide crosstab samples - your example query has a
hardcoded set of keys that it is looking for, but maybe that was
code-generated.  There are multiple forms of the crosstab function, some of
which rely on a constant column count and others which can generate the set
of columns based on another query.  There are complications in that second
form relating to race conditions with regard to new keys showing up between
the query for the columns and the query for the data, so it is important to
understand that and either structure queries accordingly or make sure you
execute in a transaction with sufficient transaction isolation to prevent
the race condition from appearing.

The query was hand-written from examples I found on the web, and the list of
keys was hand-written too. This will run in Ruby, so I can template as much as
needed on that side. I can also run this in a defined transaction boundary
rather than using auto-commit. Right now I am thumbing through your email and
trying it out.

Bob
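For reference, a minimal sketch of the "defined transaction boundary" mentioned above, which closes the race Samuel describes between the key-list query and the data query. REPEATABLE READ is an assumption about what level suffices; the point is simply that both statements read from the same snapshot:

    begin isolation level repeatable read;

    -- 1. fetch the key list used to build the crosstab column definition
    select id_name from metadata_key order by id_name;

    -- 2. run the crosstab (or array_agg) query generated from that list;
    --    keys added by other sessions after the snapshot was taken are not visible here
    -- select * from crosstab($$...$$, $$...$$) as ct(...);

    commit;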
