[ https://issues.apache.org/jira/browse/ARROW-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15427245#comment-15427245 ]
Micah Kornfield commented on ARROW-81:
--------------------------------------

For the structure, would it pay to not have a specific metadata type? At an abstract level, a category type indicates a variable that can take on a fixed number of values. Sometimes these values have mnemonic/semantic meanings attached to them. I think it is equally valid to have a categorical type, represented by a string vector, and communicate them as fixed types in metadata.

This leads me to think that categorical information should be additional metadata on the Field table in Message.fbs. We might want to consider two ways of doing this:

1. Generic key/value metadata with a convention for categorical information (this would allow some level of extensibility for other use cases, though none come to mind at the moment).
2. A specific model, something like:

{code}
table IntCategoryList {
  // specify the universe of values as a list of integers (non dictionary encoded)
  value_universe: [int];
}

table StringCategoryList {
  // specify the universe of values via a list of strings (non dictionary encoded)
  value_universe: [string];
}

table DictionaryCategory {
  // the universe of values is provided via a dictionary
}

union CategoryUniverseDescription {
  IntCategoryList,
  StringCategoryList,
  DictionaryCategory
}

table CategoryInfo {
  ordered: bool;
  category_universe: CategoryUniverseDescription;
}
{code}

This could be simplified to always assume factors are dictionary encoded (the names are not well thought out either). In either case, we could add an optional category member of type CategoryInfo to the Field table.

Regarding the indexing, I would vote to stick with int32s for V1. Sizing the integer type and other more advanced features like bitweaving can be left for V2.
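To make the dictionary-encoded case concrete, here is a rough Python sketch of what a category column with integer codes and an "ordered" flag might look like in memory. It is only an illustration, not Arrow API: the names CategoryArray and encode are hypothetical, and the standard-library array typecode "i" (a C signed int, 32 bits on common platforms) stands in for the int32 index type suggested for V1.

```python
from array import array
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CategoryArray:
    """Hypothetical stand-in for a dictionary-encoded category column."""

    codes: array            # one signed integer code per row
    categories: List[str]   # the universe of values ("levels")
    ordered: bool = False   # whether the order of the categories is meaningful

    def decode(self) -> List[str]:
        # Map every code back to the category it refers to.
        return [self.categories[code] for code in self.codes]


def encode(values: Sequence[str], categories: Sequence[str],
           ordered: bool = False) -> CategoryArray:
    """Dictionary-encode values against a fixed universe of categories.

    Raises KeyError if a value is not part of the universe.
    """
    index = {category: i for i, category in enumerate(categories)}
    # Typecode "i" is a C signed int; assumed here to be 32 bits wide.
    codes = array("i", (index[value] for value in values))
    return CategoryArray(codes, list(categories), ordered)


col = encode(["low", "high", "low", "mid"], ["low", "mid", "high"], ordered=True)
```

In this sketch the universe of values travels with the column, which corresponds to the dictionary-encoded variant above; the IntCategoryList and StringCategoryList variants would instead carry the universe in the Field metadata.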
> C++: Add a Category nested type
> -------------------------------
>
>                 Key: ARROW-81
>                 URL: https://issues.apache.org/jira/browse/ARROW-81
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Wes McKinney
>            Assignee: Wes McKinney
>
> A Category (or "factor") is a dictionary-encoded array whose dictionary has
> semantic meaning. The data consists of
> - An array of integer "codes"
> - A child array of some other type, known as the "categories" or "levels" of
> the array. Typically there is an "ordered" boolean flag indicating whether
> the order of the categories is meaningful.
> Category/factor types are used in a number of common statistical analyses.
> See, for example,
> http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. It is a
> basic requirement for Python and R, at least, as Arrow C++ consumers, to have
> this type. Separately, we should consider what is necessary to be able to
> transmit category data in IPCs -- possibly an expansion of the Arrow format.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)