[ 
https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15428657#comment-15428657
 ] 

Wes McKinney commented on ARROW-81:
-----------------------------------

Here are a couple of examples from Python and R (I believe SAS, Stata, SPSS, 
and Julia all have similar concepts):

Python (pandas)

{code}
In [1]: import pandas as pd

In [2]: s = pd.Series(pd.Categorical.from_array(['foo', 'bar', 'foo', 'bar']))

In [3]: s
Out[3]: 
0    foo
1    bar
2    foo
3    bar
dtype: category
Categories (2, object): [bar, foo]

In [4]: s.dtype
Out[4]: category

In [5]: s.cat.codes
Out[5]: 
0    1
1    0
2    1
3    0
dtype: int8

In [6]: s.cat.categories
Out[6]: Index(['bar', 'foo'], dtype='object')
{code}

(take note of the int8 storage class...)
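
The int8 codes reflect that only two categories need to be addressed. As a rough sketch of that idea (not pandas' actual implementation; the function name {{smallest_code_dtype}} is hypothetical and the exact thresholds pandas uses may differ), the narrowest signed integer width for a given number of categories could be chosen like this, keeping negative values free for a missing-value sentinel such as -1:

```python
def smallest_code_dtype(n_categories):
    """Return the name of the narrowest signed integer type whose
    non-negative range can hold the codes 0 .. n_categories - 1.

    Negative values are left unused so that a sentinel such as -1
    can mark missing entries.
    """
    for bits in (8, 16, 32, 64):
        max_code = 2 ** (bits - 1) - 1  # largest non-negative value
        if n_categories - 1 <= max_code:
            return "int%d" % bits
    raise ValueError("too many categories for a 64-bit code")
```

With two categories this picks the 8-bit type, matching the storage class shown above.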

In Python, the categories can be any data type that is valid in other pandas 
contexts:

{code}
In [9]: s = pd.Series(pd.Categorical.from_array(pd.date_range('2000-01-01', periods=5).repeat(2)))

In [10]: s
Out[10]: 
0   2000-01-01
1   2000-01-01
2   2000-01-02
3   2000-01-02
4   2000-01-03
5   2000-01-03
6   2000-01-04
7   2000-01-04
8   2000-01-05
9   2000-01-05
dtype: category
Categories (5, datetime64[ns]): [2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01-05]

In [11]: s = pd.Series(pd.Categorical.from_array([(100, 1000), (0, 100)] * 4))

In [12]: s
Out[12]: 
0    (100, 1000)
1       (0, 100)
2    (100, 1000)
3       (0, 100)
4    (100, 1000)
5       (0, 100)
6    (100, 1000)
7       (0, 100)
dtype: category
Categories (2, object): [(0, 100), (100, 1000)]
{code}

In R, the category values are constrained to be strings:

{code}
> f1 <- factor(c("foo", "bar", "foo", "bar"))
> f1
[1] foo bar foo bar
Levels: bar foo
> as.integer(f1)
[1] 2 1 2 1
> levels(f1)
[1] "bar" "foo"
> f2 <- factor(c("foo", "bar", "foo", "bar"), levels=c("foo", "bar"), ordered=T)
> f2
[1] foo bar foo bar
Levels: foo < bar
> levels(f2)
[1] "foo" "bar"
{code}

If the categories are marked as ordered, that ordering can be used automatically 
in different modeling contexts (for example, in a multinomial logistic regression: 
https://en.wikipedia.org/wiki/Multinomial_logistic_regression).

It's hard to estimate exactly, but the data we have (maybe Hadley / RStudio 
has a better estimate) suggests that these two communities alone represent 
several million users worldwide. 

Regarding [~emkornfi...@gmail.com]'s comment, a few things come immediately 
to mind:

* The category "levels" in general need to be able to accommodate any logical 
type. For example, we could use {{Struct<lower: Float64, upper: Float64>}} to 
represent numerical intervals (e.g. the result of a histogram operation)

* "High cardinality" categories frequently occur in the wild, so if the 
categories are part of the schema, then the schema could be arbitrarily large 
(up to the 2GB limit in leaf nodes). This will eventually cause problems in 
schema negotiation. 

* Analytics involving categorical data may apply transformations to the 
categories (combining similar categories, reordering, etc.). It would seem more 
parsimonious to me to write a new dictionary and change the dictionary ID in 
the schema than to generate a new schema that contains the new dictionary.

* Dictionaries / categories may be shared by multiple columns. 
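
To make the last three points concrete, here is a minimal sketch of a schema that refers to dictionaries by ID rather than embedding them. All names here ({{Field}}, {{dictionaries}}, {{decode}}, {{recode}}) are hypothetical and only illustrate the idea; this is not a proposed Arrow API.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    dictionary_id: int  # reference into the registry, not the levels themselves
    ordered: bool = False


# Hypothetical registry mapping dictionary ID -> category levels.
dictionaries = {0: ["foo", "bar", "baz"]}

# Two columns share dictionary 0; the schema stays small however
# many levels the dictionary holds.
schema = [Field("a", dictionary_id=0), Field("b", dictionary_id=0)]


def decode(field, codes):
    """Look up each integer code in the dictionary the field points to."""
    levels = dictionaries[field.dictionary_id]
    return [levels[c] for c in codes]


def recode(field, codes, mapping, new_id):
    """Combine or rename categories by writing a new dictionary.

    `mapping` maps old levels to new levels (unmapped levels are kept).
    Returns a field pointing at the new dictionary plus the rewritten codes.
    """
    old_levels = dictionaries[field.dictionary_id]
    new_levels = []
    for level in old_levels:
        target = mapping.get(level, level)
        if target not in new_levels:
            new_levels.append(target)
    dictionaries[new_id] = new_levels
    new_codes = [new_levels.index(mapping.get(old_levels[c], old_levels[c]))
                 for c in codes]
    return Field(field.name, new_id, field.ordered), new_codes
```

In this sketch, merging "bar" and "baz" for column "a" only registers a new dictionary and changes that one field's ID; column "b" keeps pointing at dictionary 0 and is unaffected.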

A JSON representation of this type might look like

{code}
{"codes": [0, 0, 0, 0, 1, 1, 1, 1], "levels": ["foo", "bar"], "ordered": true}
{code}
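
As an illustration of how a consumer might read such a representation, the following sketch parses it (written here as strict JSON with quoted keys and double-quoted strings) and expands the codes back into values. The helper names {{decode_category}} and {{less_than}} are hypothetical, and the comparison assumes that the position of a level in "levels" defines its rank when "ordered" is true.

```python
import json

doc = json.loads(
    '{"codes": [0, 0, 0, 0, 1, 1, 1, 1], '
    '"levels": ["foo", "bar"], "ordered": true}'
)


def decode_category(doc):
    """Expand integer codes into their category values."""
    levels = doc["levels"]
    return [levels[code] for code in doc["codes"]]


def less_than(doc, a, b):
    """Compare two levels by their position in an ordered category."""
    if not doc["ordered"]:
        raise TypeError("category is not ordered")
    levels = doc["levels"]
    return levels.index(a) < levels.index(b)
```

Here the decoded values are four "foo" entries followed by four "bar" entries, and "foo" sorts before "bar" because it comes first in the levels, not because of string ordering.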

> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has 
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of 
> the array. Typically there is an "ordered" boolean flag indicating whether 
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses. 
> See, for example, 
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a 
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have 
> this type. Separately, we should consider what is necessary to be able to 
> transmit category data in IPCs -- possibly an expansion of the Arrow format. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
