How big are your dictionaries typically? What are your upper and lower bounds?
On Wed, Jan 5, 2022 at 10:22 PM David Li <[email protected]> wrote:
>
> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make
> the dictionary part of the schema itself (and the format even allows for
> dictionaries to be updated over time). I wonder if the dictionary type could
> be extended to handle this; alternatively, passing around explicit
> dictionaries alongside the schema might get you most of the way there. (It
> looks like we might need some way to pass a dictionary to from_pandas, or
> otherwise provide some way to dictionary-encode an Arrow array according to
> an existing dictionary.)
>
> -David
>
> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>
> Hi Rok, David,
>
> I think the problem is that the DictionaryType loses the semantic information
> about the categories.
>
> Right now I define the schema for the tables and have logic to parse
> files/receive data and convert it into RecordBatches ready for writing. This
> is quite simple: for each row we generate a dictionary of {key: value, ...}
> as the data comes in, pass a set of these to `pd.DataFrame(...)`, and then
> convert using `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware
> newer versions have a `pa.record_batch` that can now be used).
>
> In this instance the schema specifies to the code and to the user what columns
> should be present and what the type, and values, of these should be.
>
> The use of DictionaryArray breaks this as there is no way of specifying the
> permitted set of values (`dictionary` in your example) in the schema itself?
> Pandas has CategoricalDtype whereby you can specify `categories` but this
> information needs to be stored somewhere other than the schema itself and
> special cased for categorical columns.
>
> This suggests that it may be a good idea to add the categorical type
> information?
>
> Right now it looks like I'll have to define my own schema/field classes that
> return PyArrow and Pandas types when requested.
>
> Sam
>
> ________________________________
> From: David Li <[email protected]>
> Sent: 05 January 2022 14:53
> To: [email protected] <[email protected]>
> Subject: Re: [Question][Python] Columns with Limited Value Set
>
> Hi Sam,
>
> For categoricals, you likely want an Arrow dictionary array. (See docs at
> [1].) For example:
>
> >>> import pyarrow as pa
> >>> ty = pa.dictionary(pa.int8(), pa.string())
> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
> >>> arr
> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>
> -- dictionary:
>   [
>     "a",
>     "d"
>   ]
> -- indices:
>   [
>     0,
>     0,
>     null,
>     1
>   ]
> >>> table = pa.table([arr], names=["col1"])
> >>> table.to_pandas()
>   col1
> 0    a
> 1    a
> 2  NaN
> 3    d
> >>> table.to_pandas()["col1"]
> 0      a
> 1      a
> 2    NaN
> 3      d
> Name: col1, dtype: category
> Categories (2, object): ['a', 'd']
>
> Is this sufficient?
>
> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>
> -David
>
>
> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>
> Hi,
>
> I'm looking at defining a schema for a table where one of the values is
> inherently categorical/enumerable and we're ultimately ending up loading it
> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
>
> For example, the column may always be known to contain the values ["a", "b",
> "c", "d"]. Stating this as a stringly-typed column in the schema is a bad
> idea as it permits all strings and requires more storage than necessary for
> longer strings; stating it as an integer column is a bad idea as you lose
> context and force the user to cast after loading; and the dictionary type
> does not allow you to specify the values in the schema so similarly loses all
> meaning.
>
> I have been playing with the API all morning and from what I can tell there
> is no easy way of achieving this.
> Am I missing something obvious?
>
> ---
>
> One possible route I thought of is to define an extension type and then
> implement the `to_pandas_dtype` method. Yes, this method permits all known
> values whilst in Arrow-land, but it at least documents the known type and, so
> I thought, any values not within the `to_pandas_dtype` return will be set to
> null on conversion anyway.
>
> However, this seems to require unnecessarily special-casing a whole bunch of
> code to handle extension types. e.g. just creating a scalar of this type
> requires using a different API. It seems like `pa.scalar` should be able to
> work this out? This example defines a wrapper for int32, and then tries to
> create a scalar of this type showing that the user has to call a special
> method rather than just the normal API:
>
> ```
> import pyarrow as pa
>
>
> class IntegerWrapper(pa.ExtensionType):
>
>     def __init__(self):
>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return IntegerWrapper()
>
>
> iw_type = IntegerWrapper()
>
> pa.register_extension_type(iw_type)
>
> # throws `ArrowNotImplementedError`
> # pa.scalar(0, iw_type)
>
> # user must do this, but code should be able to do this?
> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
> ```
>
> and I can't seem to get the `to_pandas_dtype` to actually work for a wrapped
> dictionary. e.g.
>
> ```
> import pyarrow as pa
>
>
> class DictWrapper(pa.ExtensionType):
>
>     def __init__(self):
>         pa.ExtensionType.__init__(
>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper")
>
>     def __arrow_ext_serialize__(self):
>         # since we don't have a parameterized type, we don't need extra
>         # metadata to be deserialized
>         return b''
>
>     @classmethod
>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>         # return an instance of this subclass given the serialized
>         # metadata.
>         return DictWrapper()
>
>     def to_pandas_dtype(self):
>         from pandas.api.types import CategoricalDtype
>         return CategoricalDtype(categories=["a", "b"])
>
>
> dw_type = DictWrapper()
>
> pa.register_extension_type(dw_type)
>
> arr = pa.ExtensionArray.from_storage(
>     dw_type,
>     pa.array(["a", "b", "c"], dw_type.storage_type)
> )
>
> arr
>
> arr.to_pandas()
>
> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
> ```
>
> Best,
>
> Sam
>
> IMPORTANT NOTICE: The information transmitted is intended only for the person
> or entity to which it is addressed and may contain confidential and/or
> privileged material. Any review, re-transmission, dissemination or other use
> of, or taking of any action in reliance upon, this information by persons or
> entities other than the intended recipient is prohibited. If you received
> this in error, please contact the sender and delete the material from any
> computer. Although we routinely screen for viruses, addressees should check
> this e-mail and any attachment for viruses. We make no warranty as to absence
> of viruses in this e-mail or any attachments.
