[GENERAL] Array columns vs normalized table

2010-03-02 Thread Lee Hachadoorian
I work with state labor data which is reported to us in the form

industry, year, quarter1, quarter2, quarter3, quarter4

where each quarter represents an employment count. Obviously, this can
be normalized to

industry, year, quarter, employment

Can anyone comment on, or point to me to an article or discussion
regarding, why one would use an array column instead of normalizing
the data? That is, would there be any benefit to storing it as

industry int, year smallint, employment int[ ]

where the last column would be a four element array with data for the
four quarters.

Thanks,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth  Environmental Sciences
CUNY Graduate Center

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Array columns vs normalized table

2010-03-02 Thread Pavel Stehule
Hello

2010/3/2 Lee Hachadoorian lee.hachadoor...@gmail.com:
 I work with state labor data which is reported to us in the form

        industry, year, quarter1, quarter2, quarter3, quarter4

 where each quarter represents an employment count. Obviously, this can
 be normalized to

        industry, year, quarter, employment

 Can anyone comment on, or point to me to an article or discussion
 regarding, why one would use an array column instead of normalizing
 the data? That is, would there be any benefit to storing it as

for very large timeseries you can use arrays. The storage can be more effective.

sample

industry year q1 q2 q3 q4
1, 2001, 10,11,12,13
1, 2002, 14, 15,16,17

can be transformed to tuple (industry: 1, start_year: 2002, data:
10,11,12,13,14,15,16,17, . )

your model isn't more safer then normalised tables and I think is
better prefer normalized tables.

the queries to arrays are difficult - so usually data from arrays are
dynamically unpacked to tables. But  time series stored as array take
significantly less space on disc. Look on
http://www.postgresql.org/docs/8.3/static/storage-page-layout.html -
you safe repeated row headers.

Regards
Pavel Stehule



I hope so nobody uses this technique. It cannot substitute normalizated tables.


        industry int, year smallint, employment int[ ]

if


 where the last column would be a four element array with data for the
 four quarters.

 Thanks,
 --Lee

 --
 Lee Hachadoorian
 PhD Student, Geography
 Program in Earth  Environmental Sciences
 CUNY Graduate Center

 --
 Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
 To make changes to your subscription:
 http://www.postgresql.org/mailpref/pgsql-general


-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Array columns vs normalized table

2010-03-02 Thread Peter Hunsberger
On Tue, Mar 2, 2010 at 11:21 AM, Lee Hachadoorian
lee.hachadoor...@gmail.com wrote:
 I work with state labor data which is reported to us in the form

        industry, year, quarter1, quarter2, quarter3, quarter4

 where each quarter represents an employment count. Obviously, this can
 be normalized to

        industry, year, quarter, employment

 Can anyone comment on, or point to me to an article or discussion
 regarding, why one would use an array column instead of normalizing
 the data? That is, would there be any benefit to storing it as

        industry int, year smallint, employment int[ ]

 where the last column would be a four element array with data for the
 four quarters.

I think you might want to step back and ask yourself why you'd want to
normalize this data at all?  Unless you are trying to operate on all 4
columns at once (comparisons or aggregation, etc.) I see no reason to.
 It will take more space and give you no new capabilities.  If you
need do need to operate on all 4 quarters simultaneously then you may
very well want to normalize, but in that case there's certainly no
advantage in going to an array type.  Personally, the only reason I'd
see for ever using an array type is when you have many very closely
related values that would cause some huge number of rows if the data
is stored normalized

-- 
Peter Hunsberger

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Array columns vs normalized table

2010-03-02 Thread Scott Bailey

Lee Hachadoorian wrote:

I work with state labor data which is reported to us in the form

industry, year, quarter1, quarter2, quarter3, quarter4

where each quarter represents an employment count. Obviously, this can
be normalized to

industry, year, quarter, employment

Can anyone comment on, or point to me to an article or discussion
regarding, why one would use an array column instead of normalizing
the data? That is, would there be any benefit to storing it as

industry int, year smallint, employment int[ ]

where the last column would be a four element array with data for the
four quarters.

Thanks,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth  Environmental Sciences
CUNY Graduate Center


If you want to do that, I'd recommend:
industry int,
yearsmallint,
emp_q1  int,
emp_q2  int,
emp_q3  int,
emp_q4  int

That way it is more clear, easier to query, uses less space and you wont 
end up with employment data for the 5th quarter or something odd like that.


Arrays are great for working with your data during the query process. 
But you should generally avoid using them to store your data on disk.


Scott


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


Re: [GENERAL] Array columns vs normalized table

2010-03-02 Thread Lee Hachadoorian
Pavel, the idea of using arrays to store long time-series data sounds
good, but I take your point that normalized tables might be better and
are easier to query. I suppose the choice will be between normalizing
or using the denormalized

industry int,
year    smallint,
emp_q1  int,
emp_q2  int,
emp_q3  int,
emp_q4  int

as suggested by Peter and Scott. We're mostly actually interested in
annual numbers, but need to preserve the quarterly data for
verification and for unusual analyses. So perhaps storing denormalized
with an additional emp_annual int field, and a view that keeps the
quarterly data out of sight.

Thanks for your replies. Please feel free to comment if you think of
anything else.

Best,
--Lee

--
Lee Hachadoorian
PhD Student, Geography
Program in Earth  Environmental Sciences
CUNY Graduate Center

-- 
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general