Re: [Question][Python] Columns with Limited Value Set

Jorge Cardoso Leitão Wed, 05 Jan 2022 22:31:25 -0800

We could use an extension type here: wrap the dictionary type on an
extension type whose metadata contains the expected keys. This way the keys
are stored in the schema.



On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson <[email protected]>
wrote:

> For what it's worth, I encountered a similar issue in working on the R
> bindings: if you're querying a dataset or filtering a dictionary array and
> you end up with a ChunkedArray with 0 chunks, you can't populate the factor
> levels when converting to R because the type doesn't have the dictionary
> values, only the corresponding arrays, of which there are none in this
> case. In practice it hasn't been a huge problem (AFAIK) but it is a
> difference in expectations.
>
> That said, there are good, practical reasons not to include the dictionary
> values in the type/schema (updating/deltas, as David mentioned, being one
> of them). It seems like an intentional design trade-off.
>
> Neal
>
> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>
>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't
>> make the dictionary part of the schema itself (and the format even allows
>> for dictionaries to be updated over time). I wonder if the dictionary type
>> could be extended to handle this; alternatively, passing around explicit
>> dictionaries alongside the schema might get you most of the way there. (It
>> looks like we might need some way to pass a dictionary to from_pandas, or
>> otherwise provide some way to dictionary-encode an Arrow array according to
>> an existing dictionary.)
>>
>> -David
>>
>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>
>> Hi Rok, David,
>>
>> I think the problem is that the DictionaryType loses the semantic
>> information about the categories.
>>
>> Right now I define the schema for the tables and have logic to parse
>> files/receive data and convert it into RecordBatches ready for writing.
>> This is quite simple: for each row we generate a dictionary of
>> {key: value, ...} as the data comes in, pass a set of these to
>> `pd.DataFrame(...)`, and then convert using
>> `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions
>> have a `pa.record_batch` that can now be used).
>>
>> In this instance the schema specifies to the code and to the user what
>> columns should be present and what the type, and values, of these should be.
>>
>> The use of DictionaryArray breaks this as there is no way of specifying
>> the permitted set of values (`dictionary` in your example) in the schema
>> itself? Pandas has CategoricalDtype whereby you can specify `categories`
>> but this information needs to be stored somewhere other than the schema
>> itself and special cased for categorical columns.
>>
>> This suggests that it may be a good idea to add the categorical type
>> information?
>>
>> Right now it looks like I'll have to define my own schema/field classes
>> that return PyArrow and Pandas types when requested.
>>
>> Sam
>>
>>
>>
>>
>> ------------------------------
>>
>> *From:* David Li <[email protected]>
>> *Sent:* 05 January 2022 14:53
>> *To:* [email protected] <[email protected]>
>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>
>> Hi Sam,
>>
>> For categoricals, you likely want an Arrow dictionary array. (See docs at
>> [1].) For example:
>>
>> >>> import pyarrow as pa
>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>> >>> arr
>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>
>> -- dictionary:
>>   [
>>     "a",
>>     "d"
>>   ]
>> -- indices:
>>   [
>>     0,
>>     0,
>>     null,
>>     1
>>   ]
>> >>> table = pa.table([arr], names=["col1"])
>> >>> table.to_pandas()
>>   col1
>> 0    a
>> 1    a
>> 2  NaN
>> 3    d
>> >>> table.to_pandas()["col1"]
>> 0      a
>> 1      a
>> 2    NaN
>> 3      d
>> Name: col1, dtype: category
>> Categories (2, object): ['a', 'd']
>>
>> Is this sufficient?
>>
>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>
>> -David
>>
>>
>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>
>> Hi,
>>
>> I'm looking at defining a schema for a table where one of the values is
>> inherently categorical/enumerable and we're ultimately ending up loading it
>> as a Pandas DataFrame. I cannot seem to find a decent way of achieving this.
>>
>> For example, the column may always be known to contain the values ["a",
>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is a
>> bad idea as it permits all strings and requires more storage than necessary
>> for longer strings, stating it as an integer column is a bad idea as you
>> lose context and force the user to cast after loading, and the dictionary
>> type does not allow you to specify the values in the schema so similarly
>> loses all meaning.
>>
>> I have been playing with the API all morning and from what I can tell
>> there is no easy way of achieving this. Am I missing something obvious?
>>
>> ---
>>
>> One possible route I thought of is to define an extension type and then
>> implement the `to_pandas_dtype` method. Yes, this approach permits all
>> values whilst in Arrow-land, but it at least documents the known type
>> and, so I thought, any values not within the `to_pandas_dtype` return will
>> be set to null on conversion anyway.
>>
>> However, this seems to require unnecessarily special-casing a whole bunch
>> of code to handle extension types. e.g. just creating a scalar of this type
>> requires using a different API. It seems like `pa.scalar` should be able to
>> work this out? This example defines a wrapper for int32, and then tries to
>> create a scalar of this type showing that the user has to call a special
>> method rather than just the normal API:
>>
>> ```
>> import pyarrow as pa
>>
>>
>> class IntegerWrapper(pa.ExtensionType):
>>
>>     def __init__(self):
>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>>
>>     @classmethod
>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return IntegerWrapper()
>>
>>
>> iw_type = IntegerWrapper()
>>
>> pa.register_extension_type(iw_type)
>>
>> # throws `ArrowNotImplementedError`
>> # pa.scalar(0, iw_type)
>>
>> # user must do this, but code should be able to do this?
>> pa.ExtensionScalar.from_storage(
>>     iw_type, pa.scalar(0, iw_type.storage_type)
>> )
>> ```
>>
>> and I can't seem to get the `to_pandas_dtype` to actually work for a
>> wrapped dictionary. e.g.
>>
>> ```
>> import pyarrow as pa
>>
>>
>> class DictWrapper(pa.ExtensionType):
>>
>>     def __init__(self):
>>         pa.ExtensionType.__init__(
>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>>         )
>>
>>     def __arrow_ext_serialize__(self):
>>         # since we don't have a parameterized type, we don't need extra
>>         # metadata to be deserialized
>>         return b''
>>
>>     @classmethod
>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>         # return an instance of this subclass given the serialized
>>         # metadata.
>>         return DictWrapper()
>>
>>     def to_pandas_dtype(self):
>>         from pandas.api.types import CategoricalDtype
>>         return CategoricalDtype(categories=["a", "b"])
>>
>> dw_type = DictWrapper()
>>
>> pa.register_extension_type(dw_type)
>>
>> arr = pa.ExtensionArray.from_storage(
>>     dw_type,
>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>> )
>>
>> arr
>>
>> arr.to_pandas()
>>
>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>> ```
>>
>> Best,
>>
>> Sam
>> IMPORTANT NOTICE: The information transmitted is intended only for the
>> person or entity to which it is addressed and may contain confidential
>> and/or privileged material. Any review, re-transmission, dissemination or
>> other use of, or taking of any action in reliance upon, this information by
>> persons or entities other than the intended recipient is prohibited. If you
>> received this in error, please contact the sender and delete the material
>> from any computer. Although we routinely screen for viruses, addressees
>> should check this e-mail and any attachment for viruses. We make no
>> warranty as to absence of viruses in this e-mail or any attachments.
>>
>>
>>
>>
