[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428657#comment-15428657 ]
Wes McKinney commented on ARROW-81:
-----------------------------------

Here are a couple of examples from Python and R (I believe SAS / Stata / SPSS / Julia all have similar concepts).

Python (pandas)

{code}
In [1]: import pandas as pd

In [2]: s = pd.Series(pd.Categorical.from_array(['foo', 'bar', 'foo', 'bar']))

In [3]: s
Out[3]:
0    foo
1    bar
2    foo
3    bar
dtype: category
Categories (2, object): [bar, foo]

In [4]: s.dtype
Out[4]: category

In [5]: s.cat.codes
Out[5]:
0    1
1    0
2    1
3    0
dtype: int8

In [6]: s.cat.categories
Out[6]: Index(['bar', 'foo'], dtype='object')
{code}

(take note of the int8 storage class...)

In Python, the categories can be any data type that is valid in other pandas contexts:

{code}
In [9]: s = pd.Series(pd.Categorical.from_array(pd.date_range('2000-01-01', periods=5).repeat(2
   ...: )))

In [10]: s
Out[10]:
0   2000-01-01
1   2000-01-01
2   2000-01-02
3   2000-01-02
4   2000-01-03
5   2000-01-03
6   2000-01-04
7   2000-01-04
8   2000-01-05
9   2000-01-05
dtype: category
Categories (5, datetime64[ns]): [2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01-05]

In [11]: s = pd.Series(pd.Categorical.from_array([(100, 1000), (0, 100)] * 4))

In [12]: s
Out[12]:
0    (100, 1000)
1       (0, 100)
2    (100, 1000)
3       (0, 100)
4    (100, 1000)
5       (0, 100)
6    (100, 1000)
7       (0, 100)
dtype: category
Categories (2, object): [(0, 100), (100, 1000)]
{code}

In R, the category values are constrained to be strings:

{code}
> f1 <- factor(c("foo", "bar", "foo", "bar"))
> f1
[1] foo bar foo bar
Levels: bar foo
> as.integer(f1)
[1] 2 1 2 1
> levels(f1)
[1] "bar" "foo"
> f2 <- factor(c("foo", "bar", "foo", "bar"), levels=c("foo", "bar"), ordered=T)
> f2
[1] foo bar foo bar
Levels: foo < bar
> levels(f2)
[1] "foo" "bar"
{code}

If the categories are marked as ordered, the ordering can be used automatically in different modeling contexts (for example, in a multinomial logistic regression: https://en.wikipedia.org/wiki/Multinomial_logistic_regression).

It's hard to estimate exactly, but based on the data we have (maybe
Hadley / RStudio has a better estimate), it suggests that these two communities alone represent several million users worldwide.

To [~emkornfi...@gmail.com]'s comment, there are a couple of things that come immediately to mind:

* The category "levels" in general need to be able to accommodate any logical type. For example, we could use {{Struct<lower: Float64, upper: Float64>}} to represent numerical intervals (e.g. the result of a histogram operation).
* "High cardinality" categories frequently occur in the wild, so if the categories are part of the schema, then the schema could be arbitrarily large (up to the 2GB limit in leaf nodes). This will eventually cause problems in schema negotiation.
* Analytics involving categorical data may apply transformations to the categories (combining similar categories, reordering, etc.). It would seem more parsimonious to me to write a new dictionary and change the dictionary ID in the schema than to generate a new schema with the new dictionary.
* Dictionaries / categories may be shared by multiple columns.

A JSON representation of this type might look like

{code}
{"codes": [0, 0, 0, 0, 1, 1, 1, 1], "levels": ["foo", "bar"], "ordered": true}
{code}

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
> the array. Typically there is an "ordered" boolean flag indicating whether
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm.
> It is a basic requirement for Python and R, at least, as Arrow C++ consumers,
> to have this type. Separately, we should consider what is necessary to be
> able to transmit category data in IPCs -- possibly an expansion of the Arrow
> format.
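
To make the layout described in the issue concrete (integer codes, a child array of levels, and an "ordered" flag), here is a minimal pure-Python sketch. The names {{Category}}, {{encode}} and {{relabel}} are hypothetical and exist only for illustration; they are not Arrow or pandas APIs. The {{relabel}} helper illustrates the point about transformations: the dictionary is swapped while the codes are reused unchanged.

```python
from dataclasses import dataclass
from typing import Any, List, Sequence


@dataclass
class Category:
    """Hypothetical container for illustration; not an Arrow or pandas API."""
    codes: List[int]       # one integer per value, indexing into `levels`
    levels: List[Any]      # the dictionary ("categories" / "levels")
    ordered: bool = False  # whether the order of `levels` is meaningful

    def decode(self) -> List[Any]:
        # Look each code up in the dictionary to recover the original values.
        return [self.levels[code] for code in self.codes]


def encode(values: Sequence[Any], levels: Sequence[Any],
           ordered: bool = False) -> Category:
    # Map each level to its position, then translate values to codes.
    index = {level: i for i, level in enumerate(levels)}
    return Category([index[v] for v in values], list(levels), ordered)


def relabel(cat: Category, new_levels: Sequence[Any]) -> Category:
    # Swap in a new dictionary of the same length; the codes are copied as-is.
    if len(new_levels) != len(cat.levels):
        raise ValueError("new_levels must match the number of existing levels")
    return Category(list(cat.codes), list(new_levels), cat.ordered)


cat = encode(["foo", "foo", "bar", "bar"], levels=["foo", "bar"], ordered=True)
renamed = relabel(cat, ["FOO", "BAR"])
```

Here {{cat.codes}} holds only small integers, and {{renamed}} differs from {{cat}} only in its dictionary, which is the property that makes referencing a dictionary by ID (rather than embedding it in the schema) attractive.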