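We could use an extension type here: wrap the dictionary type in an extension type whose serialized metadata contains the expected keys (the categories). This way the keys are stored in the schema, independently of any dictionary values that happen to be present in the data.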

On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson <[email protected]> wrote:

> For what it's worth, I encountered a similar issue in working on the R
> bindings: if you're querying a dataset or filtering a dictionary array and
> you end up with a ChunkedArray with 0 chunks, you can't populate the factor
> levels when converting to R because the type doesn't have the dictionary
> values, only the corresponding arrays, of which there are none in this
> case. In practice it hasn't been a huge problem (AFAIK) but it is a
> difference in expectations.
>
> That said, there are good, practical reasons not to include the dictionary
> values in the type/schema (updating/deltas, as David mentioned, being one
> of them). It seems like an intentional design trade-off.
>
> Neal
>
> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>
>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't
>> make the dictionary part of the schema itself (and the format even allows
>> for dictionaries to be updated over time). I wonder if the dictionary type
>> could be extended to handle this; alternatively, passing around explicit
>> dictionaries alongside the schema might get you most of the way there. (It
>> looks like we might need some way to pass a dictionary to from_pandas, or
>> otherwise provide some way to dictionary-encode an Arrow array according to
>> an existing dictionary.)
>>
>> -David
>>
>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>
>> Hi Rok, David,
>>
>> I think the problem is that the DictionaryType loses the semantic
>> information about the categories.
>>
>> Right now I define the schema for the tables and have logic to parse
>> files/receive data and convert it into RecordBatchs ready for writing. This
>> is quite simple: for each row we generate a dictionary of {key: value, ...}
>> as the data comes in, pass a set of these to `pd.DataFrame(...)`, and then
>> convert using `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware
>> newer versions have a `pa.record_batch` that can now be used).
>>
>> In this instance the schema species to the code and to the user what
>> columns should be present and what the type, and values, of these should be.
>>
>> The use of DictionaryArray breaks this as there is no way of specifying
>> the permitted set of values (`dictionary` in your example) in the schema
>> itself? Pandas has CategoricalDtype whereby you can specify `categories`
>> but this information needs to be stored somewhere other than the schema
>> itself and special cased for categorical columns.
>>
>> This suggests that it may be a good idea to add the categorical type
>> information?
>>
>> Right now it looks like I'll have to define my own schema/field classes
>> that return PyArrow and Pandas types when requested
>>
>> Sam
>>
>> ------------------------------
>>
>> *From:* David Li <[email protected]>
>> *Sent:* 05 January 2022 14:53
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>
>> Hi Sam,
>>
>> For categoricals, you likely want an Arrow dictionary array. (See docs at
>> [1].) For example:
>>
>> >>> import pyarrow as pa
>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>> >>> arr
>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>
>> -- dictionary:
>>   [
>>     "a",
>>     "d"
>>   ]
>> -- indices:
>>   [
>>     0,
>>     0,
>>     null,
>>     1
>>   ]
>> >>> table = pa.table([arr], names=["col1"])
>> >>> table.to_pandas()
>>   col1
>> 0    a
>> 1    a
>> 2  NaN
>> 3    d
>> >>> table.to_pandas()["col1"]
>> 0      a
>> 1      a
>> 2    NaN
>> 3      d
>> Name: col1, dtype: category
>> Categories (2, object): ['a', 'd']
>>
>> Is this sufficient?
>>
>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>
>> -David
>>
>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>
>> Hi,
>>
>> I'm looking at defining a schema for a table where one of the values is
>> inherently categorical/enumerable and we're ultimately ending up loading it
>> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
>>
>> For example, the column may always be known to contain the values ["a",
>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is a
>> bad idea as it permits all strings and requires more storage than necessary
>> for longer strings, stating it as an integer column is a bad idea as you
>> lose context and force the user to cast after loading, and the dictionary
>> type does not allow you to specify the values in the schema so similarly
>> loses all meaning.
>>
>> I have been playing with the API all morning and from what I can tell
>> there is no easy way of achieving this. Am I missing something obvious?
>>
>> ---
>>
>> One possible route I thought of is to define an extension type and then
>> implement the `to_pandas_dtype` method. Yes this method permits all
>> known values whilst in Arrow-land, but it at least documents the known type
>> and, so I thought, any values not within the `to_pandas_dtype` return will
>> be set to null on conversion anyway.
>>
>> However, this seems to require unnecessarily special-casing a whole bunch
>> of code to handle extension types. e.g. just creating a scalar of this type
>> requires using a different API. It seems like `pa.scalar` should be able to
>> work this out? This example defines a wrapper for int32, and then tries to
>> create a scalar of this type showing that the user has to call a special
>> method rather than just the normal API:
>>
>> ```
>> import pyarrow as pa
>>
>>
>> class IntegerWrapper(pa.ExtensionType):
>>
>>     def __init__(self):
>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>>
>>     @classmethod
>>     def __arrow_ext_deserialize__(self, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return IntegerWrapper()
>>
>>
>> iw_type = IntegerWrapper()
>>
>> pa.register_extension_type(iw_type)
>>
>> # throws `ArrowNotImplementedError`
>> # pa.scalar(0, iw_type)
>>
>> # user must do this, but code should be able to do this?
>> pa.ExtensionScalar.from_storage(iw_type, pa.scalar(0,
>>     iw_type.storage_type))
>> ```
>>
>> and I can't seem to get the `to_pandas_dtype` to actually work for a
>> wrapped dictionary. e.g.
>>
>> ```
>> import pyarrow as pa
>>
>>
>> class DictWrapper(pa.ExtensionType):
>>
>>     def __init__(self):
>>         pa.ExtensionType.__init__(self, pa.dictionary(pa.int8(),
>>             pa.string()), "dict_wrapper")
>>
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>>
>>     @classmethod
>>     def __arrow_ext_deserialize__(self, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return DictWrapper()
>>
>>     def to_pandas_dtype(self):
>>         from pandas.api.types import CategoricalDtype
>>         return CategoricalDtype(categories=["a", "b"])
>>
>> dw_type = DictWrapper()
>>
>> pa.register_extension_type(dw_type)
>>
>> arr = pa.ExtensionArray.from_storage(
>>     dw_type,
>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>> )
>>
>> arr
>>
>> arr.to_pandas()
>>
>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>> ```
>>
>> Best,
>>
>> Sam
>> IMPORTANT NOTICE: The information transmitted is intended only for the
>> person or entity to which it is addressed and may contain confidential
>> and/or privileged material. Any review, re-transmission, dissemination or
>> other use of, or taking of any action in reliance upon, this information by
>> persons or entities other than the intended recipient is prohibited. If you
>> received this in error, please contact the sender and delete the material
>> from any computer. Although we routinely screen for viruses, addressees
>> should check this e-mail and any attachment for viruses. We make no
>> warranty as to absence of viruses in this e-mail or any attachments.
