Re: [Question][Python] Columns with Limited Value Set

David Li Wed, 05 Jan 2022 06:54:02 -0800

Hi Sam,

For categoricals, you likely want an Arrow dictionary array. (See docs at [1].) 
For example:


>>> import pyarrow as pa
>>> ty = pa.dictionary(pa.int8(), pa.string())
>>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>> arr
<pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>

-- dictionary:
  [
    "a",
    "d"
  ]
-- indices:
  [
    0,
    0,
    null,
    1
  ]
>>> table = pa.table([arr], names=["col1"])
>>> table.to_pandas()
  col1
0    a
1    a
2  NaN
3    d
>>> table.to_pandas()["col1"]
0      a
1      a
2    NaN
3      d
Name: col1, dtype: category
Categories (2, object): ['a', 'd']

Is this sufficient?

[1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays

-David

On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
> Hi,
> 
> I'm looking at defining a schema for a table where one of the values is 
> inherently categorical/enumerable and we're ultimately ending up loading it 
> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
> 
> For example, the column may always be known to contain the values ["a", "b",
> "c", "d"]. Declaring it as a string column in the schema is a bad idea, as
> that permits all strings and requires more storage than necessary for longer
> strings. Declaring it as an integer column is also a bad idea, as you lose
> context and force the user to cast after loading. The dictionary type does
> not allow you to specify the values in the schema, so it similarly loses all
> meaning.
> 
> I have been playing with the API all morning and from what I can tell there 
> is no easy way of achieving this. Am I missing something obvious? 
> 
> ---
> 
> One possible route I thought of is to define an extension type and then
> implement the `to_pandas_dtype` method. Yes, this approach still permits
> every value of the storage type whilst in Arrow-land, but it at least
> documents the known type and, so I thought, any values not within the
> `to_pandas_dtype` return will be set to null on conversion anyway.
> 
> However, this seems to require unnecessarily special-casing a whole bunch of 
> code to handle extension types. e.g. just creating a scalar of this type 
> requires using a different API. It seems like `pa.scalar` should be able to 
> work this out? This example defines a wrapper for int32, and then tries to 
> create a scalar of this type showing that the user has to call a special 
> method rather than just the normal API:
> 
> ```
> import pyarrow as pa 
> 
> 
> class IntegerWrapper(pa.ExtensionType):
> 
>     def __init__(self):
>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
> 
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
> 
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return IntegerWrapper()
>    
> 
> iw_type = IntegerWrapper()
> 
> pa.register_extension_type(iw_type)
> 
> # throws `ArrowNotImplementedError`
> # pa.scalar(0, iw_type)
> 
> # user must do this, but code should be able to do this?
> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
> ```
> 
> and I can't seem to get the `to_pandas_dtype` to actually work for a wrapped 
> dictionary. e.g. 
> 
> ```
> import pyarrow as pa 
> 
> 
> class DictWrapper(pa.ExtensionType):
> 
>     def __init__(self):
>         pa.ExtensionType.__init__(
>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>         )
> 
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
> 
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return DictWrapper()
>    
>     def to_pandas_dtype(self):
>         from pandas.api.types import CategoricalDtype
>         return CategoricalDtype(categories=["a", "b"])
> 
> dw_type = DictWrapper()
> 
> pa.register_extension_type(dw_type)
> 
> arr = pa.ExtensionArray.from_storage( 
>     dw_type,
>     pa.array(["a", "b", "c"], dw_type.storage_type)
> )
> 
> arr
> 
> arr.to_pandas()
> 
> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
> ```
> 
> Best,
> 
> Sam
