Re: [SQL] Table Design for Hierarchical Data

Steve Crawford Tue, 06 Apr 2010 17:54:34 -0700

Lee Hachadoorian wrote:

I am trying to come up with a structure to store employment data by NAICS (North American Industry Classification System). The data uses a hierarchical encoding scheme ranging between 2 and 5 digits. That is, each 2-digit code includes all industries beginning with the same two digits. 61 includes 611 which includes 6111, 6112, 6113, etc. A portion of the hierarchy is shown after the sig.

From the http://www.census.gov/eos/www/naics/ website:

"NAICS is a two- through six-digit hierarchical classification system, offering five levels of detail. Each digit in the code is part of a series of progressively narrower categories, and the more digits in the code signify greater classification detail. The first two digits designate the economic sector, the third digit designates the subsector, the fourth digit designates the industry group, the fifth digit designates the NAICS industry, and the sixth digit designates the national industry. The five-digit NAICS code is the level at which there is comparability in code and definitions for most of the NAICS sectors across the three countries participating in NAICS (the United States, Canada, and Mexico). The six-digit level allows for the United States, Canada, and Mexico each to have country-specific detail. A complete and valid NAICS code contains six digits."

I think I'd be inclined to store it as defined above, with tables for sector, subsector, industry-group and NAICS-industry. So the NAICS table might have a primary key of industry_code (11131, Orange Groves) and an industry_group column with a foreign-key constraint to the industry-group table (1113, Fruit and Tree Nut Farming). You might add a constraint to ensure that the industry-group is the appropriate substring of the NAICS code, and so on up the hierarchy. If you are dealing with importing a large amount of static source data for analysis, these tables will also be tailor-made places to do pre-aggregation.
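
As a rough sketch of the multi-table layout described above, the following uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server. Apart from industry_code and industry_group, every table name, column name, and the insert_industry helper are assumptions made for illustration. The CHECK constraints enforce the "appropriate substring" rule at each level.

```python
import sqlite3

# SQLite stands in for PostgreSQL here; names other than industry_code and
# industry_group are hypothetical and not taken from the original post.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
CREATE TABLE sector (
    sector_code TEXT PRIMARY KEY CHECK (length(sector_code) = 2),
    title       TEXT NOT NULL
);
CREATE TABLE subsector (
    subsector_code TEXT PRIMARY KEY CHECK (length(subsector_code) = 3),
    sector_code    TEXT NOT NULL REFERENCES sector(sector_code),
    title          TEXT NOT NULL,
    CHECK (substr(subsector_code, 1, 2) = sector_code)
);
CREATE TABLE industry_group (
    industry_group_code TEXT PRIMARY KEY CHECK (length(industry_group_code) = 4),
    subsector_code      TEXT NOT NULL REFERENCES subsector(subsector_code),
    title               TEXT NOT NULL,
    CHECK (substr(industry_group_code, 1, 3) = subsector_code)
);
CREATE TABLE naics_industry (
    industry_code  TEXT PRIMARY KEY CHECK (length(industry_code) = 5),
    industry_group TEXT NOT NULL REFERENCES industry_group(industry_group_code),
    title          TEXT NOT NULL,
    CHECK (substr(industry_code, 1, 4) = industry_group)
);
""")

# Parent rows for the Orange Groves example from the text.
with conn:
    conn.execute("INSERT INTO sector VALUES "
                 "('11', 'Agriculture, Forestry, Fishing and Hunting')")
    conn.execute("INSERT INTO subsector VALUES ('111', '11', 'Crop Production')")
    conn.execute("INSERT INTO industry_group VALUES "
                 "('1113', '111', 'Fruit and Tree Nut Farming')")


def insert_industry(code, group, title):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO naics_industry VALUES (?, ?, ?)",
                         (code, group, title))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this in place, insert_industry('11131', '1113', 'Orange Groves') is accepted, while a row whose code does not start with its industry group, or whose group does not exist, is rejected.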

Adjacency lists work well in certain cases where the depths of the trees are variable or indeterminate. For example, think of an employee->boss org-chart for a large company. The maintenance supervisor for an area might be a dozen levels below the CEO and several levels above the branch night janitor, while the CEO's personal assistant is just one level down but with no direct reports. The CTE/recursive-query features in 8.4 are great for this. But in the case you have described, the number of levels is well defined, as is the type of information associated with each level.
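
For comparison, here is a minimal sketch of the adjacency-list approach for the org-chart case, again using Python's sqlite3 module as a stand-in (the WITH RECURSIVE syntax shown is also what PostgreSQL 8.4 and later accept). The employee table, its columns, the sample rows, and the chain_of_command helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    id      INTEGER PRIMARY KEY,
    name    TEXT NOT NULL,
    boss_id INTEGER REFERENCES employee(id)  -- NULL at the top of the chart
);
-- Made-up org chart, much shallower than a real one.
INSERT INTO employee VALUES
    (1, 'CEO', NULL),
    (2, 'Personal assistant', 1),
    (3, 'VP operations', 1),
    (4, 'Maintenance supervisor', 3),
    (5, 'Night janitor', 4);
""")


def chain_of_command(employee_id):
    """Walk from one employee up to the top, whatever the depth."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, name, boss_id, depth) AS (
            SELECT id, name, boss_id, 0 FROM employee WHERE id = ?
            UNION ALL
            SELECT e.id, e.name, e.boss_id, c.depth + 1
            FROM employee e
            JOIN chain c ON e.id = c.boss_id
        )
        SELECT name FROM chain ORDER BY depth
        """,
        (employee_id,),
    ).fetchall()
    return [name for (name,) in rows]
```

The query does not need to know how many levels exist, which is the point of the adjacency list; with a fixed number of levels, as in NAICS, that flexibility buys little.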

But this all depends on the nature of your source data, how often it is updated, how big it is and the questions you want answered. It might be perfectly acceptable to just have the 5-digit code on all your individual data records and do something like select ... where substr(full_naics_code,1,2)='61' group by substr(full_naics_code,1,3). In this case you will still want to keep the NAICS definition table separate and link to it.
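
A small sketch of that substr-based roll-up, using Python's sqlite3 module as a stand-in for PostgreSQL (substr behaves the same way in both for this use). Only full_naics_code comes from the text; the employment table, the employees column, the rollup helper, and the figures are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employment (
    full_naics_code TEXT NOT NULL,
    employees       INTEGER NOT NULL
);
-- Made-up figures; the codes are real 5-digit NAICS industries.
INSERT INTO employment VALUES
    ('61111', 100), ('61121', 40), ('61131', 60), ('61141', 10),
    ('11131', 5);
""")


def rollup(sector, digits):
    """Total employees within a 2-digit sector, grouped by the first
    `digits` characters of the 5-digit code."""
    return conn.execute(
        """
        SELECT substr(full_naics_code, 1, :digits) AS prefix,
               sum(employees)                      AS total
        FROM employment
        WHERE substr(full_naics_code, 1, 2) = :sector
        GROUP BY prefix
        ORDER BY prefix
        """,
        {"sector": sector, "digits": digits},
    ).fetchall()
```

Calling rollup('61', 3) collapses everything under 611, while rollup('61', 4) breaks the same rows out by industry group. This only works if every record carries the full 5-digit code.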

One question that might impact this is the coding of your source data. Is it all full 5-digit coding or are some records coded at a high level of detail and others only to the top level?

One way to store this data would be to store at the most granular level (5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-, or 2-digit level. The problem is that because of nondisclosure rules, the data is sometimes censored at the more specific level. I might, for example, have data for 6114, but not 61141, 61142, 61143. For a different branch of the tree, I might have data at the 5-digit level while for yet another branch I might have data only to the 3-digit level (not 4 or 5). I think that means I have to store all data at multiple levels, even if some of the higher-level data could be reconstructed from other, lower-level data.

What do you mean by censored? Is the data supplied to you pre-aggregated to some level and censored to preserve confidentiality, or do you have the record-level source data and the responsibility to suppress data in your reports? Is the data suppression ad-hoc (i.e. someone comes to you and says don't display these five aggregates), based on simple rules (don't display any aggregate with fewer than 15 records) or on more complex rules (don't display any data that would allow calculation of a group of fewer than 15)? My guess is that the multi-table scenario will be better suited to flagging aggregates for suppression.
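
To make the simple-rule case concrete, here is one possible sketch of suppressing small aggregates at report time, assuming a pre-aggregated table per level that keeps a record count next to each total. It uses Python's sqlite3 module as a stand-in for PostgreSQL; the industry_group_totals table, its columns, the report helper, and the figures are all hypothetical. The threshold of 15 is the one used as an example above.

```python
import sqlite3

THRESHOLD = 15  # the "fewer than 15 records" example rule from the text

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE industry_group_totals (
    industry_group TEXT PRIMARY KEY,
    record_count   INTEGER NOT NULL,  -- source records behind the total
    employees      INTEGER NOT NULL
);
-- Made-up figures for three industry groups.
INSERT INTO industry_group_totals VALUES
    ('6111', 120, 5000),
    ('6114',   9,  300),
    ('6115',  15,  700);
""")


def report(threshold=THRESHOLD):
    """Return (industry_group, employees), with employees blanked to NULL
    wherever the aggregate rests on fewer than `threshold` records."""
    return conn.execute(
        """
        SELECT industry_group,
               CASE WHEN record_count < ? THEN NULL ELSE employees END
        FROM industry_group_totals
        ORDER BY industry_group
        """,
        (threshold,),
    ).fetchall()
```

This covers only the simple rule. The more complex rule, where a suppressed value must not be recoverable by subtracting displayed siblings from a displayed parent, needs additional logic that this sketch does not attempt.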


Cheers,
Steve

