The extension APIs could be improved, yes. I don't think there's a real reason other than that they perhaps haven't seen much usage yet. If you have any other issues, feel free to chime in here or file a JIRA [1] - I'll file JIRAs for the issues already raised in this thread when I get a chance.

[1]: https://issues.apache.org/jira/secure/Dashboard.jspa

-David

On Thu, Jan 6, 2022, at 04:11, Sam Davis wrote:
> > We could use an extension type here: wrap the dictionary type in an
> > extension type whose metadata contains the expected keys. This way the keys
> > are stored in the schema.
>
> Yes, in theory this should work but I have found extension types very clumsy
> to work with. See original post for examples, but unless I'm using the wrong
> API it seems like you must special case most things you want to do with them
> (`pa.ExtensionScalar.from_storage` vs `pa.scalar`, etc) making them a less
> useful abstraction for this sort of task? Is there a reason for this?
>
>
> *From:* Jorge Cardoso Leitão <[email protected]>
> *Sent:* 06 January 2022 06:30
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>
> We could use an extension type here: wrap the dictionary type in an extension
> type whose metadata contains the expected keys. This way the keys are stored
> in the schema.
>
>
> On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson <[email protected]>
> wrote:
>> For what it's worth, I encountered a similar issue in working on the R
>> bindings: if you're querying a dataset or filtering a dictionary array and
>> you end up with a ChunkedArray with 0 chunks, you can't populate the factor
>> levels when converting to R because the type doesn't have the dictionary
>> values, only the corresponding arrays, of which there are none in this case.
>> In practice it hasn't been a huge problem (AFAIK) but it is a difference in
>> expectations.
>>
>> That said, there are good, practical reasons not to include the dictionary
>> values in the type/schema (updating/deltas, as David mentioned, being one of
>> them). It seems like an intentional design trade-off.
>>
>> Neal
>>
>> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make
>>> the dictionary part of the schema itself (and the format even allows for
>>> dictionaries to be updated over time). I wonder if the dictionary type
>>> could be extended to handle this; alternatively, passing around explicit
>>> dictionaries alongside the schema might get you most of the way there. (It
>>> looks like we might need some way to pass a dictionary to from_pandas, or
>>> otherwise provide some way to dictionary-encode an Arrow array according to
>>> an existing dictionary.)
>>>
>>> -David
>>>
>>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>>> Hi Rok, David,
>>>>
>>>> I think the problem is that the DictionaryType loses the semantic
>>>> information about the categories.
>>>>
>>>> Right now I define the schema for the tables and have logic to parse
>>>> files/receive data and convert it into RecordBatches ready for writing.
>>>> This is quite simple: for each row we generate a dictionary of {key:
>>>> value, ...} as the data comes in, pass a set of these to
>>>> `pd.DataFrame(...)`, and then convert using
>>>> `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions
>>>> have a `pa.record_batch` that can now be used).
>>>>
>>>> In this instance the schema specifies to the code and to the user what
>>>> columns should be present and what the type, and values, of these should
>>>> be.
>>>>
>>>> The use of DictionaryArray breaks this as there is no way of specifying
>>>> the permitted set of values (`dictionary` in your example) in the schema
>>>> itself? Pandas has CategoricalDtype whereby you can specify `categories`
>>>> but this information needs to be stored somewhere other than the schema
>>>> itself and special cased for categorical columns.
>>>>
>>>> This suggests that it may be a good idea to add the categorical type
>>>> information?
>>>>
>>>> Right now it looks like I'll have to define my own schema/field classes
>>>> that return PyArrow and Pandas types when requested
>>>>
>>>> Sam
>>>>
>>>>
>>>> *From:* David Li <[email protected]>
>>>> *Sent:* 05 January 2022 14:53
>>>> *To:* [email protected] <[email protected]>
>>>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>>>
>>>> Hi Sam,
>>>>
>>>> For categoricals, you likely want an Arrow dictionary array. (See docs at
>>>> [1].) For example:
>>>>
>>>> >>> import pyarrow as pa
>>>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>>>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>>> >>> arr
>>>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>>>
>>>> -- dictionary:
>>>>   [
>>>>     "a",
>>>>     "d"
>>>>   ]
>>>> -- indices:
>>>>   [
>>>>     0,
>>>>     0,
>>>>     null,
>>>>     1
>>>>   ]
>>>> >>> table = pa.table([arr], names=["col1"])
>>>> >>> table.to_pandas()
>>>>   col1
>>>> 0    a
>>>> 1    a
>>>> 2  NaN
>>>> 3    d
>>>> >>> table.to_pandas()["col1"]
>>>> 0      a
>>>> 1      a
>>>> 2    NaN
>>>> 3      d
>>>> Name: col1, dtype: category
>>>> Categories (2, object): ['a', 'd']
>>>>
>>>> Is this sufficient?
>>>>
>>>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>>>
>>>> -David
>>>>
>>>>
>>>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>>>> Hi,
>>>>>
>>>>> I'm looking at defining a schema for a table where one of the values is
>>>>> inherently categorical/enumerable and we're ultimately ending up loading
>>>>> it as a Pandas DataFrame. I cannot seem to find a decent way of achieving
>>>>> this.
>>>>>
>>>>> For example, the column may always be known to contain the values ["a",
>>>>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is
>>>>> a bad idea as it permits all strings and requires more storage than
>>>>> necessary for longer strings, stating it as an integer column is a bad
>>>>> idea as you lose context and force the user to cast after loading, and
>>>>> the dictionary type does not allow you to specify the values in the
>>>>> schema so similarly loses all meaning.
>>>>>
>>>>> I have been playing with the API all morning and from what I can tell
>>>>> there is no easy way of achieving this. Am I missing something obvious?
>>>>>
>>>>> ---
>>>>>
>>>>> One possible route I thought of is to define an extension type and then
>>>>> implement the `to_pandas_dtype` method. Yes this method permits all known
>>>>> values whilst in Arrow-land, but it at least documents the known type
>>>>> and, so I thought, any values not within the `to_pandas_dtype` return
>>>>> will be set to null on conversion anyway.
>>>>>
>>>>> However, this seems to require unnecessarily special-casing a whole bunch
>>>>> of code to handle extension types. e.g. just creating a scalar of this
>>>>> type requires using a different API. It seems like `pa.scalar` should be
>>>>> able to work this out? This example defines a wrapper for int32, and then
>>>>> tries to create a scalar of this type showing that the user has to call a
>>>>> special method rather than just the normal API:
>>>>>
>>>>> ```
>>>>> import pyarrow as pa
>>>>>
>>>>>
>>>>> class IntegerWrapper(pa.ExtensionType):
>>>>>
>>>>>     def __init__(self):
>>>>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>>>>
>>>>>     def __arrow_ext_serialize__(self):
>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>         # metadata to be deserialized
>>>>>         return b''
>>>>>
>>>>>     @classmethod
>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>         # return an instance of this subclass given the serialized
>>>>>         # metadata.
>>>>>         return IntegerWrapper()
>>>>>
>>>>>
>>>>> iw_type = IntegerWrapper()
>>>>>
>>>>> pa.register_extension_type(iw_type)
>>>>>
>>>>> # throws `ArrowNotImplementedError`
>>>>> # pa.scalar(0, iw_type)
>>>>>
>>>>> # user must do this, but code should be able to do this?
>>>>> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0, iw_type.storage_type))
>>>>> ```
>>>>>
>>>>> and I can't seem to get the `to_pandas_dtype` to actually work for a
>>>>> wrapped dictionary. e.g.
>>>>>
>>>>> ```
>>>>> import pyarrow as pa
>>>>>
>>>>>
>>>>> class DictWrapper(pa.ExtensionType):
>>>>>
>>>>>     def __init__(self):
>>>>>         pa.ExtensionType.__init__(
>>>>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper")
>>>>>
>>>>>     def __arrow_ext_serialize__(self):
>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>         # metadata to be deserialized
>>>>>         return b''
>>>>>
>>>>>     @classmethod
>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>         # return an instance of this subclass given the serialized
>>>>>         # metadata.
>>>>>         return DictWrapper()
>>>>>
>>>>>     def to_pandas_dtype(self):
>>>>>         from pandas.api.types import CategoricalDtype
>>>>>         return CategoricalDtype(categories=["a", "b"])
>>>>>
>>>>> dw_type = DictWrapper()
>>>>>
>>>>> pa.register_extension_type(dw_type)
>>>>>
>>>>> arr = pa.ExtensionArray.from_storage(
>>>>>     dw_type,
>>>>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>>>>> )
>>>>>
>>>>> arr
>>>>>
>>>>> arr.to_pandas()
>>>>>
>>>>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>>>>> ```
>>>>>
>>>>> Best,
>>>>>
>>>>> Sam
>>>>> IMPORTANT NOTICE: The information transmitted is intended only for the
>>>>> person or entity to which it is addressed and may contain confidential
>>>>> and/or privileged material. Any review, re-transmission, dissemination or
>>>>> other use of, or taking of any action in reliance upon, this information
>>>>> by persons or entities other than the intended recipient is prohibited.
>>>>> If you received this in error, please contact the sender and delete the
>>>>> material from any computer. Although we routinely screen for viruses,
>>>>> addressees should check this e-mail and any attachment for viruses. We
>>>>> make no warranty as to absence of viruses in this e-mail or any
>>>>> attachments.
