Re: [GENERAL] JOIN column maximum

2012-01-08 Thread Alban Hertroys
On 6 Jan 2012, at 22:56, Lee Hachadoorian wrote:

 What I am working with is called the summary file, which presents the data in 
 aggregate. That means you have an income table with 16 income classes, plus a 
 total column.

That could be a table: income class and income.

It may make sense to add a record for a virtual class 'TOTAL' there that 
incorporates the totals column, although I expect those same numbers could be 
obtained by adding up the numbers in the other categories. If those numbers are 
rounded, though, your totals can be off a bit that way (due to lack of 
precision).
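
For illustration, a minimal sketch of that layout (all table and column names 
here are invented, not actual Census identifiers):

    CREATE TABLE income_class (
        income_class text PRIMARY KEY    -- e.g. 'under 10k', '10k to 15k', ...
    );

    CREATE TABLE household_income (
        tract_id     text,
        income_class text REFERENCES income_class,
        households   integer,
        PRIMARY KEY (tract_id, income_class)
    );

    -- The TOTAL figure can then be derived instead of stored:
    SELECT tract_id, SUM(households) AS total_households
    FROM household_income
    GROUP BY tract_id;

If the published totals don't exactly equal the sum of the classes (rounding), 
the explicit 'TOTAL' row suggested above sidesteps that.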

 Then you have 9 more tables which show the same income classes by 9 racial 
 and ethnic categories, for a total of 153 more columns.

That could be a table: income class (FK), ethnic category and income.

You could turn these into 9 separate views with one ethnic category 
pre-selected in each.
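
A rough sketch of that shape, reusing the invented names from the sketch above 
(the real Census category codes would differ):

    CREATE TABLE household_income_by_ethnicity (
        tract_id        text,
        income_class    text REFERENCES income_class,
        ethnic_category text,
        households      integer,
        PRIMARY KEY (tract_id, income_class, ethnic_category)
    );

    -- One of the nine pre-selected views:
    CREATE VIEW household_income_asian AS
    SELECT tract_id, income_class, households
    FROM household_income_by_ethnicity
    WHERE ethnic_category = 'Asian';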

 Then you also have a table which crosses 9 income classes by 5 mobility 
 statuses (living in same house, moved within county, moved within state, 
 etc.) for a total of 55 columns.

Another table: income class (FK), mobility status, income.

 Then you have income classes crossed with sex, income classes crossed with 
 mode of transportation to work, sex crossed with mode of transportation to 
 work, etc.

...etc.

 When all is said and done, you have 23,000 columns.

You can definitely normalise that data much more than you are doing. I also 
think you may be underestimating your researchers if you think they can't 
handle mapping these Census sheets (which I'm unfamiliar with) to normalised 
tables.

You may even find that they can learn more from those normalised tables, as 
they have the freedom to combine tables that the Census Bureau didn't combine 
for them: things like how much female Asians make compared across mobility 
categories.

Alban Hertroys

--
The scale of a problem often equals the size of an ego.





Re: [GENERAL] JOIN column maximum

2012-01-06 Thread Lee Hachadoorian
On Thu, Jan 5, 2012 at 10:19 PM, Darren Duncan dar...@darrenduncan.net wrote:


 This all being said, 23K values per row just sounds wrong, and I can't
 imagine any census forms having that many details.

 Do you, by chance, have multiple values of the same type that are in
 different fields, e.g. telephone_1, telephone_2 or child_1, child_2, etc.?  You
 should take any of those and collect them into array-typed fields, or
 separate tables with just telephone or child columns.  Or do you, say, have a
 set of coordinates in separate fields?  Or you may have other kinds of
 redundancy within single rows that are best normalized into separate rows.

 With 23K values, these probably have many mutual associations, and you
 could split that table into a bunch of other ones where columns that relate
 more closely together are collected.

My question is already answered, so this is mostly for anyone curious
about Census data, and if anyone wants to add to this, feel free.

You're right that no census form has that many details. The American
Community Survey replaced the old Census long form, so it does have more
details than the form sent out for the 2010 Census, but not 23,000
questions. It might ask about, e.g., income, sex, and how you travel to work. But
the forms are private, so that data (the so-called microdata) is not
released in its entirety. What I am working with is called the summary
file, which presents the data in aggregate. That means you have an income
table with 16 income classes, plus a total column. Then you have 9 more
tables which show the same income classes by 9 racial and ethnic
categories, for a total of 153 more columns. Then you also have a table
which crosses 9 income classes by 5 mobility statuses (living in same
house, moved within county, moved within state, etc.) for a total of 55
columns. Then you have income classes crossed with sex, income classes
crossed with mode of transportation to work, sex crossed with mode of
transportation to work, etc. When all is said and done, you have 23,000
columns.

Believe me, we would all love to be working directly with the microdata. All
of these different ways of slicing and dicing the categories are basically
how the Census Bureau provides as much detail as possible without providing
so much detail that privacy would be compromised (i.e., enough information
is available that specific individuals could be identified). That
inevitably leads to a great deal of redundancy in the data, since the same
individuals are just being counted in different groups in different tables.

Given all that, one could still take the data that came from Census and try
to normalize it and organize it, but my main goal given the size of the
dataset is to keep it as similar as possible to the source, so that a
researcher familiar with the Census data can work with our database using
the Census documentation without our having to produce a new set of
documentation. The way I had done that was to store the sequences (which
are merely a data dissemination convenience, and have no relationship to
the logic of the data) in Postgres tables, and make the subject tables
(which do represent logical groupings of data) into views. I'm thinking
about making the sequences into array columns. The subject tables would
still be views.
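
A rough sketch of what that could look like (the sequence and variable names 
below are placeholders, not actual ACS identifiers, and the array positions are 
only illustrative):

    -- One table per ACS sequence, with its 200+ estimates in a single array:
    CREATE TABLE seq0001 (
        geoid     text PRIMARY KEY,
        estimates numeric[]    -- element N holds the Nth variable of the sequence
    );

    -- A subject table exposed as a view, mapping documented names to positions:
    CREATE VIEW b19001 AS
    SELECT geoid,
           estimates[1] AS total_households,
           estimates[2] AS income_lt_10k,
           estimates[3] AS income_10k_to_15k
           -- ... one expression per documented variable
    FROM seq0001;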

--Lee

-- 
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/


[GENERAL] JOIN column maximum

2012-01-05 Thread Lee Hachadoorian
How is the number of columns in a join determined? When I combine somewhere
around 90 tables in a JOIN, the query returns:

ERROR: joins can have at most 32767 columns
SQL state: 54000

I'm sure most people will say "Why the hell are you joining 90 tables?"
I've asked this list before for advice on how to work with the
approximately 23,000 column American Community Survey dataset, and based on
previous responses I am trying to combine 117 sequences (basically
vertical partitions of the dataset) into one table using array columns. Of
course, I can build this up by joining a few tables at a time, so the
question is mostly curiosity, but I haven't been able to find this
documented anywhere. Moreover, the 32767 limit doesn't map to any
immediately intuitive transformation of 90, like squaring (which is much
too low) or factorial (which is much too high).

Any insight?

Regards,
--Lee

-- 
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/


Re: [GENERAL] JOIN column maximum

2012-01-05 Thread Tom Lane
Lee Hachadoorian lee.hachadoor...@gmail.com writes:
 How is the number of columns in a join determined? When I combine somewhere
 around 90 tables in a JOIN, the query returns:

 ERROR: joins can have at most 32767 columns

It's the sum of the number of columns in the base tables.
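
For illustration, a back-of-the-envelope check: 32767 is 2^15 - 1 (column 
numbers are carried in a signed 16-bit integer), so ninety-odd tables averaging 
roughly 365 columns apiece, 90 * 365 = 32850, is just enough to push a join 
over the limit.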

 I'm sure most people will say "Why the hell are you joining 90 tables?"

Not only that, but why are you working with over-300-column tables?
Seems like your schema design needs rethinking.

 I've asked this list before for advice on how to work with the
 approximately 23,000 column American Community Survey dataset,

Are there really 23000 populated values in each row?  I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction.  You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
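
For what it's worth, a minimal EAV-style sketch, assuming invented variable 
codes (one possible shape, not a prescription):

    CREATE TABLE acs_value (
        geoid    text,
        variable text,       -- e.g. 'B19001_002' (illustrative code)
        estimate numeric,
        PRIMARY KEY (geoid, variable)
    );

    -- Reassembling one subject table as a crosstab-style query:
    SELECT geoid,
           MAX(CASE WHEN variable = 'B19001_002' THEN estimate END) AS income_lt_10k,
           MAX(CASE WHEN variable = 'B19001_003' THEN estimate END) AS income_10k_to_15k
    FROM acs_value
    GROUP BY geoid;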

regards, tom lane



Re: [GENERAL] JOIN column maximum

2012-01-05 Thread Lee Hachadoorian

On 01/05/2012 06:18 PM, Tom Lane wrote:



ERROR: joins can have at most 32767 columns

It's the sum of the number of columns in the base tables.

That makes sense. I totally misunderstood the message to be referring to 
the number of joined columns rather than table columns.



I've asked this list before for advice on how to work with the
approximately 23,000 column American Community Survey dataset,

Are there really 23000 populated values in each row?  I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction.  You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
Many of the smaller geographies, e.g. census tracts, do in fact have 
data for the vast majority of the columns. I am trying to combine it all 
into one table to avoid the slowness of multiple JOINs (even though in 
practice I'm never joining all the tables at once). EAV sounds correct 
in terms of normalization, but isn't it usually better performance-wise 
to store write-once/read-many data in a denormalized (i.e. flattened) 
fashion? One of these days I'll have to try to benchmark some different 
approaches, but for now I'm planning on using array columns, with each 
sequence (in the Census sense, not the Postgres sense) of 200+ 
variables in its own array rather than its own table.


--Lee

--
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu




Re: [GENERAL] JOIN column maximum

2012-01-05 Thread Scott Marlowe
On Thu, Jan 5, 2012 at 6:10 PM, Lee Hachadoorian
lee.hachadoor...@gmail.com wrote:


 Many of the smaller geographies, e.g. census tracts, do in fact have data
 for the vast majority of the columns. I am trying to combine it all into one
 table to avoid the slowness of multiple JOINs (even though in practice I'm
 never joining all the tables at once). EAV sounds correct in terms of
 normalization, but isn't it usually better performance-wise to store
 write-once/read-many data in a denormalized (i.e. flattened) fashion? One of
 these days I'll have to try to benchmark some different approaches, but for
 now I'm planning on using array columns, with each sequence (in the Census
 sense, not the Postgres sense) of 200+ variables in its own array rather
 than its own table.

Are you using arrays or hstore?
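
For anyone weighing the two, a rough sketch of the same sequence row in either 
form (names and keys invented):

    -- Arrays: positional and compact; the documentation supplies each slot's meaning.
    CREATE TABLE seq0001_arr (
        geoid     text PRIMARY KEY,
        estimates numeric[]
    );

    -- hstore: self-describing key/value pairs; values are stored as text.
    CREATE EXTENSION IF NOT EXISTS hstore;
    CREATE TABLE seq0001_hs (
        geoid     text PRIMARY KEY,
        estimates hstore    -- e.g. '"B19001_002"=>"1234", "B19001_003"=>"567"'
    );

    -- Pulling a single value out of each:
    SELECT estimates[3] FROM seq0001_arr WHERE geoid = 'some-geoid';
    SELECT (estimates -> 'B19001_003')::numeric FROM seq0001_hs WHERE geoid = 'some-geoid';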



Re: [GENERAL] JOIN column maximum

2012-01-05 Thread Darren Duncan

Lee Hachadoorian wrote:

On 01/05/2012 06:18 PM, Tom Lane wrote:

Are there really 23000 populated values in each row?  I hesitate to
suggest an EAV approach, but it kinda seems like you need to go in that
direction.  You're never going to get decent performance out of a schema
that requires 100-way joins, even if you avoid bumping up against hard
limits.
Many of the smaller geographies, e.g. census tracts, do in fact have 
data for the vast majority of the columns. I am trying to combine it all 
into one table to avoid the slowness of multiple JOINs (even though in 
practice I'm never joining all the tables at once). EAV sounds correct 
in terms of normalization, but isn't it usually better performance-wise 
to store write-once/read-many data in a denormalized (i.e. flattened) 
fashion? One of these days I'll have to try to benchmark some different 
approaches, but for now I'm planning on using array columns, with each 
sequence (in the Census sense, not the Postgres sense) of 200+ 
variables in its own array rather than its own table.


EAV is not necessarily more correct than what you're doing.

The most correct solution is one where your database schema defines, and the 
DBMS enforces, all of the constraints or business rules on your data, so that 
you cannot put something in the database that violates the business rules.


Traditional EAV, if you're talking about the common binary table of 
unconstrained (field-name, field-value) pairs, is not an improvement.


A more correct solution is to use different columns for things with different 
business rules or data types.  If the DBMS can't handle this then that is 
grounds for improving the DBMS.


There's no reason that joins *have* to be slow, and in some DBMS designs you can 
join in linear time; it's all about how you implement it.


This all being said, 23K values per row just sounds wrong, and I can't imagine 
any census forms having that many details.


Do you, by chance, have multiple values of the same type that are in different 
fields, e.g. telephone_1, telephone_2 or child_1, child_2, etc.?  You should take 
any of those and collect them into array-typed fields, or separate tables with 
just telephone or child columns.  Or do you, say, have a set of coordinates in 
separate fields?  Or you may have other kinds of redundancy within single rows 
that are best normalized into separate rows.
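
For example, a quick sketch of the two options just mentioned, on a 
hypothetical table that has nothing to do with the Census schema:

    -- Option 1: collapse telephone_1, telephone_2, ... into an array-typed field
    CREATE TABLE person (
        person_id  serial PRIMARY KEY,
        name       text,
        telephones text[]
    );

    -- Option 2: a separate table with just a telephone column, one row per value
    CREATE TABLE person_telephone (
        person_id integer REFERENCES person,
        telephone text
    );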


With 23K values, these probably have many mutual associations, and you could 
split that table into a bunch of other ones where columns that relate more 
closely together are collected.


What I said in the last couple of paragraphs is probably the best thing to fix 
first, so that you both have a better design and it performs well on the DBMS 
you have.


-- Darren Duncan
