Hi,
I'm looking at defining a schema for a table where one of the columns is
inherently categorical/enumerable, and the table will ultimately be loaded as
a pandas DataFrame. I cannot seem to find a decent way of achieving this.
For example, the column may always be known to contain the values ["a", "b",
"c", "d"]. Declaring it as a string column in the schema is a bad idea, as that
permits all strings and requires more storage than necessary for longer
strings. Declaring it as an integer column is a bad idea, as you lose context
and force the user to cast after loading. And the dictionary type does not
allow you to specify the values in the schema, so it similarly loses all meaning.
I have been playing with the API all morning and from what I can tell there is
no easy way of achieving this. Am I missing something obvious?
---
One possible route I thought of is to define an extension type and then
implement the `to_pandas_dtype` method. Yes, this approach still permits any
value of the storage type whilst in Arrow-land, but it at least documents the
known values and, so I thought, any values not among the categories returned by
`to_pandas_dtype` would be set to null on conversion anyway.
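The null-on-conversion expectation comes from how pandas itself treats a fixed set of categories. The sketch below is pandas-only and says nothing about what pyarrow's conversion does; it assumes a pandas version in which casting out-of-category values yields missing values (newer releases may additionally emit a warning).

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# A categorical dtype that only knows about "a" and "b".
known = CategoricalDtype(categories=["a", "b"])

# "c" is not one of the known categories, so the cast turns it into a
# missing value rather than raising.
series = pd.Series(["a", "b", "c"]).astype(known)
null_mask = series.isna().tolist()

print(series)
print(null_mask)
```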
However, this seems to require unnecessarily special-casing a whole bunch of
code to handle extension types. For example, just creating a scalar of this
type requires using a different API. It seems like `pa.scalar` should be able
to work this out? The example below defines a wrapper for int32 and then tries
to create a scalar of this type, showing that the user has to call a special
method rather than just the normal API:
```
import pyarrow as pa


class IntegerWrapper(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return IntegerWrapper()


iw_type = IntegerWrapper()
pa.register_extension_type(iw_type)

# throws `ArrowNotImplementedError`
# pa.scalar(0, iw_type)

# user must do this, but code should be able to do this?
pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
```
I also can't seem to get `to_pandas_dtype` to actually work for a wrapped
dictionary, e.g.:
```
import pyarrow as pa


class DictWrapper(pa.ExtensionType):
    def __init__(self):
        pa.ExtensionType.__init__(
            self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
        )

    def __arrow_ext_serialize__(self):
        # since we don't have a parameterized type, we don't need extra
        # metadata to be deserialized
        return b''

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        # return an instance of this subclass given the serialized
        # metadata.
        return DictWrapper()

    def to_pandas_dtype(self):
        from pandas.api.types import CategoricalDtype
        return CategoricalDtype(categories=["a", "b"])


dw_type = DictWrapper()
pa.register_extension_type(dw_type)

arr = pa.ExtensionArray.from_storage(
    dw_type,
    pa.array(["a", "b", "c"], dw_type.storage_type)
)
arr
arr.to_pandas()
arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
```
Best,
Sam