Re: [R] Hierarchical factors

Ista Zahn Mon, 03 May 2010 10:24:53 -0700

Hi Marshall,
I'm not aware of any packages that implement these features as you
described them. But most of the tasks are already fairly easy in R --
see below.
On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <ma...@uri.edu> wrote:
>
> Thanks for getting back so quickly Ista,
>
> I was actually casting about for any examples of R software that deals with 
> this kind of structure. But your question is a good one. Here are a few 
> things I'd like to be able to do:
>
> Store data in R at the finest level of detail but easily refer to higher 
> levels of aggregation. If the data include such higher levels, this is 
> trivial, but otherwise I'd like to aggregate fairly easily. The following is 
> not functioning code, but it should give you the idea:
>
> Start with a data frame (call it d) whose row.names are the 6-digit NAICS 
> codes and whose columns hold various variables; assume one is named employment.
> d[,"employment"]             # Would print all employment data
> d["441222","employment"]     # Would print only Boat Dealer employment
> d["44","employment"]         # Would print total employment for Retail Trade



d[,"employment"] #prints all employment data
d[rownames(d) == "441222","employment"] #prints only boat dealer employment
sum(d[grep("^44", rownames(d)),"employment"]) #prints total employment for retail trade
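
To make the prefix idea concrete, here is a minimal sketch on a made-up data frame. The codes are only there for illustration and the employment figures are invented; it assumes, as in your description, that the row names hold the 6-digit codes.

```r
# Toy data: row names are 6-digit codes; the employment values are made up.
d <- data.frame(
  employment = c(120, 80, 300, 50),
  row.names  = c("441221", "441222", "448110", "423110")
)

boat <- d["441222", "employment"]       # one industry, looked up by row name
in44 <- grepl("^44", rownames(d))       # TRUE for every code that starts with 44
d[in44, "employment"]                   # the individual 44xxxx values
total44 <- sum(d[in44, "employment"])   # total for the whole 2-digit sector
```

The same pattern works for any prefix length: only the regular expression changes.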

>
> Recursive nesting. I'm not sure how to convey this except with examples. 
> Suppose the data frame also has a "wages" column with average weekly wages in 
> the industry, and the industry code is also a factor variable (industry). So 
> a simple analysis of variance might look like:
>
>                     w <- aov(wages ~ industry, d)
>
>         But now what I'd like to do is to break this down within 2-digit 
> sectors. Assuming the data frame has another variable, industry2, this would 
> look like:
>
>                     w <- aov(wages ~ industry2/industry)
>
>          But what if we either (a) don't want to bother creating separate 
> variables for each level of aggregation in industry or (b) want to extend 
> the model formula language to include various nesting strategies. This might 
> look like:
>
>                     w <- aov(wages ~ industry//*)    # Nest all meaningful levels:
>                     # industry/industry2/industry3/industry4/industry5/industry6. If the coding
>                     # system skips some levels, R is smart enough to omit the skipped levels.
>                     w <- aov(wages ~ industry//levels 2,4,6)    # I'm using "//" as a hypothetical
>                     # extension to the model language that is followed by a "levels" keyword and
>                     # then a list of levels within the hierarchy. This example would expand
>                     # to aov(wages ~ industry2/industry4/industry6)
>
>         One could extend this last example to include a notation allowing the 
> analysis to be repeated at varying levels of depth (e.g., industry||2,6 
> would repeat the ANOVA for industry2 and industry6).
>

I can see how that might be useful. But it is easy enough to split the
variables out, for example (assuming that each level consists of two
digits):

  d$ind1 <- substr(rownames(d), 1,2)
  d$ind2 <- substr(rownames(d), 3,4)
  d$ind3 <- substr(rownames(d), 5,6)
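
If you would rather have each level stored as the full prefix (so that the 4-digit column already identifies its 2-digit parent), something along these lines might do. This is only a sketch: add_levels() and the industryN column names are made up for illustration, and the wages are invented.

```r
# Hypothetical helper: add one factor column per requested prefix length.
add_levels <- function(d, digits = c(2, 4, 6), codes = rownames(d)) {
  for (n in digits) {
    d[[paste0("industry", n)]] <- factor(substr(codes, 1, n))
  }
  d
}

# Toy data: row names are 6-digit codes; the wages are made up.
d <- data.frame(
  wages = c(610, 650, 700, 540, 580, 900),
  row.names = c("441221", "441222", "441310", "448110", "448120", "423110")
)
d <- add_levels(d)

# Build a nested formula from the chosen levels, here wages ~ industry2/industry4.
nested <- reformulate(paste(paste0("industry", c(2, 4)), collapse = "/"),
                      response = "wages")
fit <- aov(nested, data = d)
```

Changing the vector c(2, 4) is then roughly the equivalent of your "levels 2,4,6" notation, without touching the formula language itself.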


> Since the factor hierarchy is completely nested (i.e., every 6-digit industry 
> is below a 5 digit industry), a single function can operate on the codes 
> recursively. Three variants come to mind. In the first, we'd use some kind of 
> apply function to drill down to a certain level and return a list of results, 
> one for each level:
>
>                   means <- drill(wages,industry,mean)    # Would return a list. The first
>                   # component would be a vector of mean wages for industries at the 2-digit
>                   # level, the second, a vector for the 3-digit level, etc.
>                   means <- drill(wages,industry,mean,maxlvl=3)    # Would stop at the 3rd level
>                   # of the hierarchy (4-digit code). One could also imagine a maxdigits option
>                   # as an alternative (maxdigits = y means stop at the y-digit level)
>

Again, I can see how this would be useful, but it's already pretty
easy (once we have split out the grouping variables) to do something
like

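# ind2 and ind3 only identify a group within their parent level,
# so each level is grouped by its full path
grp.means <- list(
  l1 = aggregate(d$wages, list(d$ind1), mean),
  l2 = aggregate(d$wages, list(d$ind1, d$ind2), mean),
  l3 = aggregate(d$wages, list(d$ind1, d$ind2, d$ind3), mean)
)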
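
If you want this wrapped up as a single call, a rough version of your drill() could look like the following. It is only a sketch under my own assumptions: the digits argument stands in for your maxlvl/maxdigits idea, the codes are passed as a character vector, and the wage figures are invented.

```r
# Hypothetical drill(): apply FUN to x within each prefix level of the codes.
# Returns a list with one named vector of results per prefix length.
drill <- function(x, codes, FUN, digits = c(2, 4, 6)) {
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), FUN))
  names(out) <- paste0("digits", digits)
  out
}

wages <- c(610, 650, 700, 540)                      # made-up values
codes <- c("441221", "441222", "448110", "423110")
means <- drill(wages, codes, mean, digits = c(2, 4))
```

Here means$digits2 holds one mean per 2-digit sector and means$digits4 one per 4-digit group; stopping at a given depth is just a matter of shortening digits.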

I know this wasn't what you were looking for (as I said, I'm not aware
of any package that implements the functionality you describe). But
the existing facilities in R are quite flexible, and handling this
kind of data in R is already fairly straightforward.

Best,
Ista

> Second, suppose we have a data frame like d, only this time it's a time 
> series (each row is a different date). Now we might want to generate vectors 
> of the rate of change in employment at each industry level. It might look 
> like:
>
>     rate <- function(x) { (x - lag(x))/lag(x) }
>     rates <- list()
>     i <- 1
>     for (j in levels(industry)) {    # The (hypothetical) levels function parses the hierarchical
>                                      # factor into the various levels of its coding system
>         rates[[i]] <- rate(employment[, level(industry) == j])    # The (hypothetical) level function
>                                      # sets a particular one of these levels
>         i <- i + 1
>     }
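
For the time-series case, one way this might be done with existing tools is sketched below. It assumes a layout you did not specify: employment stored as a matrix with one row per date and one column per 6-digit code. The numbers are invented and rates_by_level() is a made-up name.

```r
# Toy employment matrix: rows are consecutive dates, columns are 6-digit codes.
emp <- rbind(
  c(100, 200, 300),
  c(110, 190, 330),
  c(121, 209, 297)
)
colnames(emp) <- c("441221", "441222", "423110")

# Period-over-period rate of change for every column of a matrix.
rate <- function(m) diff(m) / m[-nrow(m), , drop = FALSE]

# Hypothetical helper: total the columns that share an n-digit prefix,
# then compute rates of change on those totals.
rates_by_level <- function(m, digits = c(2, 4, 6)) {
  out <- lapply(digits, function(n) {
    totals <- t(rowsum(t(m), group = substr(colnames(m), 1, n)))
    rate(totals)
  })
  names(out) <- paste0("digits", digits)
  out
}

rates <- rates_by_level(emp)
```

Each element of rates is a matrix with one column per group at that level and one row per change between consecutive dates.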
>
> A third variant would be a genuinely recursive function that keeps on calling 
> itself at each level of the factor until it has either reached a 
> pre-specified depth or exhausted all levels of the factor.
>
> I hope this gives you a good idea of the sorts of things one might do with 
> hierarchical factors.
>
>     Marsh Feldman
>
>
>
> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>
> Hi Marshall,
> What exactly do you mean by "handles this kind of data structure"?
> What do you want R to do?
>
> Best,
> Ista
>
> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <ma...@uri.edu> wrote:
>
>
> Hello,
>
> Hierarchical factors are a very common data structure. For instance, one
> might have municipalities within states within countries within
> continents. Other examples include occupational codes, biological
> species, software types (R within statistical software within analytical
> software), etc.
>
> Such data structures commonly use hierarchical coding systems. For
> example, the 2007 North American Industry Classification System (NAICS)
> <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has twenty
> two-digit codes (e.g., 42 = Wholesale trade), within each of these
> varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers,
> durable goods), then varying numbers of 4-digit codes (4231 = Motor
> Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then
> varying numbers of five-digit codes, varying numbers of six-digit codes,
> etc. At the lowest level (longest code) one can readily tell all the
> higher levels. For example, 441222 is "Boat Dealers" who are part of
> 44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is
> part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
> Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US
> Census Bureau has extended the 6-digit NAICS to an even more
> fine-grained 10-digit system.)
>
> I haven't seen any R packages or sample code that handles this kind of
> data, but I don't want to reinvent the wheel and would rather stand on
> the shoulders of you giants. Is there any package or other R-based
> software out there that handles this kind of data structure?
>
>     Thanks,
>     Marsh Feldman
>
>
>
>
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
>
>
> --
> Dr. Marshall Feldman, PhD
> Director of Research and Academic Affairs
> Center for Urban Studies and Research
> The University of Rhode Island
> email: marsh @ uri .edu (remove spaces)
>
> Contact Information:
>
> Kingston:
>
> 202 Hart House
> Charles T. Schmidt Labor Research Center
> The University of Rhode Island
> 36 Upper College Road
> Kingston, RI 02881-0815
> tel. (401) 874-5953:
> fax: (401) 874-5511
>
> Providence:
>
> 206E Shepard Building
> URI Feinstein Providence Campus
> 80 Washington Street
> Providence, RI 02903-1819
> tel. (401) 277-5218
> fax: (401) 277-5464



--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org

