Re: [Question][Python] Columns with Limited Value Set

David Li Thu, 06 Jan 2022 08:22:15 -0800

The extension APIs could be improved, yes. I don't think there's a real reason 
other than that they perhaps haven't seen much usage yet. If you have any other 
issues, feel free to chime in here or file a JIRA [1]. I'll file JIRAs for the 
issues already raised in this thread when I get a chance.


[1]: https://issues.apache.org/jira/secure/Dashboard.jspa

-David

On Thu, Jan 6, 2022, at 04:11, Sam Davis wrote:
> > We could use an extension type here: wrap the dictionary type on an 
> > extension type whose metadata contains the expected keys. This way the keys 
> > are stored in the schema.
> 
> Yes, in theory this should work but I have found extension types very clumsy 
> to work with. See original post for examples, but unless I'm using the wrong 
> API it seems like you must special case most things you want to do with them 
> (`pa.ExtensionScalar.from_storage` vs `pa.scalar`, etc) making them a less 
> useful abstraction for this sort of task? Is there a reason for this?
> 
> 
> *From:* Jorge Cardoso Leitão <[email protected]>
> *Sent:* 06 January 2022 06:30
> *To:* [email protected] <[email protected]>
> *Subject:* Re: [Question][Python] Columns with Limited Value Set 
>  
> We could use an extension type here: wrap the dictionary type on an extension 
> type whose metadata contains the expected keys. This way the keys are stored 
> in the schema.
> 
> 
> On Wed, Jan 5, 2022 at 11:32 PM Neal Richardson <[email protected]> 
> wrote:
>> For what it's worth, I encountered a similar issue in working on the R 
>> bindings: if you're querying a dataset or filtering a dictionary array and 
>> you end up with a ChunkedArray with 0 chunks, you can't populate the factor 
>> levels when converting to R because the type doesn't have the dictionary 
>> values, only the corresponding arrays, of which there are none in this case. 
>> In practice it hasn't been a huge problem (AFAIK) but it is a difference in 
>> expectations.
>> 
>> That said, there are good, practical reasons not to include the dictionary 
>> values in the type/schema (updating/deltas, as David mentioned, being one of 
>> them). It seems like an intentional design trade-off. 
>> 
>> Neal
>> 
>> On Wed, Jan 5, 2022 at 4:22 PM David Li <[email protected]> wrote:
>>> Ah, thank you for the clarification. Indeed, Arrow dictionaries don't make 
>>> the dictionary part of the schema itself (and the format even allows for 
>>> dictionaries to be updated over time). I wonder if the dictionary type 
>>> could be extended to handle this; alternatively, passing around explicit 
>>> dictionaries alongside the schema might get you most of the way there. (It 
>>> looks like we might need some way to pass a dictionary to from_pandas, or 
>>> otherwise provide some way to dictionary-encode an Arrow array according to 
>>> an existing dictionary.)
>>> 
>>> -David
>>> 
>>> On Wed, Jan 5, 2022, at 10:21, Sam Davis wrote:
>>>> Hi Rok, David,
>>>> 
>>>> I think the problem is that the DictionaryType loses the semantic 
>>>> information about the categories. 
>>>> 
>>>> Right now I define the schema for the tables and have logic to parse 
>>>> files/receive data and convert it into RecordBatches ready for writing. 
>>>> This is quite simple: for each row we generate a dictionary of {key: 
>>>> value, ...} as the data comes in, pass a set of these to 
>>>> `pd.DataFrame(...)`, and then convert using 
>>>> `pa.RecordBatch.from_pandas(df, schema=schema)` (I'm aware newer versions 
>>>> have a `pa.record_batch` that can now be used). 
>>>> 
>>>> In this instance the schema specifies to the code and to the user what 
>>>> columns should be present and what the type, and values, of these should 
>>>> be.
>>>> 
>>>> The use of DictionaryArray breaks this as there is no way of specifying 
>>>> the permitted set of values (`dictionary` in your example) in the schema 
>>>> itself? Pandas has CategoricalDtype whereby you can specify `categories` 
>>>> but this information needs to be stored somewhere other than the schema 
>>>> itself and special cased for categorical columns.
>>>> 
>>>> This suggests that it may be a good idea to add the categorical type 
>>>> information?
>>>> 
>>>> Right now it looks like I'll have to define my own schema/field classes 
>>>> that return PyArrow and Pandas types when requested.
>>>> 
>>>> Sam
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> *From:* David Li <[email protected]>
>>>> *Sent:* 05 January 2022 14:53
>>>> *To:* [email protected] <[email protected]>
>>>> *Subject:* Re: [Question][Python] Columns with Limited Value Set
>>>>  
>>>> Hi Sam,
>>>> 
>>>> For categoricals, you likely want an Arrow dictionary array. (See docs at 
>>>> [1].) For example:
>>>> 
>>>> >>> import pyarrow as pa
>>>> >>> ty = pa.dictionary(pa.int8(), pa.string())
>>>> >>> arr = pa.array(["a", "a", None, "d"], type=ty)
>>>> >>> arr
>>>> <pyarrow.lib.DictionaryArray object at 0x7fe2fff70890>
>>>> 
>>>> -- dictionary:
>>>>   [
>>>>     "a",
>>>>     "d"
>>>>   ]
>>>> -- indices:
>>>>   [
>>>>     0,
>>>>     0,
>>>>     null,
>>>>     1
>>>>   ]
>>>> >>> table = pa.table([arr], names=["col1"])
>>>> >>> table.to_pandas()
>>>>   col1
>>>> 0    a
>>>> 1    a
>>>> 2  NaN
>>>> 3    d
>>>> >>> table.to_pandas()["col1"]
>>>> 0      a
>>>> 1      a
>>>> 2    NaN
>>>> 3      d
>>>> Name: col1, dtype: category
>>>> Categories (2, object): ['a', 'd']
>>>> 
>>>> Is this sufficient?
>>>> 
>>>> [1]: https://arrow.apache.org/docs/python/data.html#dictionary-arrays
>>>> 
>>>> -David
>>>> 
>>>> 
>>>> On Wed, Jan 5, 2022, at 09:34, Sam Davis wrote:
>>>>> Hi,
>>>>> 
>>>>> I'm looking at defining a schema for a table where one of the values is 
>>>>> inherently categorical/enumerable and we're ultimately ending up loading 
>>>>> it as a Pandas DataFrame. I cannot seem to find a decent way of achieving 
>>>>> this.
>>>>> 
>>>>> For example, the column may always be known to contain the values ["a", 
>>>>> "b", "c", "d"]. Stating this as a stringly-typed column in the schema is 
>>>>> a bad idea as it permits all strings and requires more storage than 
>>>>> necessary for longer strings, stating it as an integer column is a bad 
>>>>> idea as you lose context and force the user to cast after loading, and 
>>>>> the dictionary type does not allow you to specify the values in the 
>>>>> schema so similarly loses all meaning.
>>>>> 
>>>>> I have been playing with the API all morning and from what I can tell 
>>>>> there is no easy way of achieving this. Am I missing something obvious? 
>>>>> 
>>>>> ---
>>>>> 
>>>>> One possible route I thought of is to define an extension type and then 
>>>>> implement the `to_pandas_dtype` method. Yes this method permits all known 
>>>>> values whilst in Arrow-land, but it at least documents the known type 
>>>>> and, so I thought, any values not within the `to_pandas_dtype` return 
>>>>> will be set to null on conversion anyway.
>>>>> 
>>>>> However, this seems to require unnecessarily special-casing a whole bunch 
>>>>> of code to handle extension types. e.g. just creating a scalar of this 
>>>>> type requires using a different API. It seems like `pa.scalar` should be 
>>>>> able to work this out? This example defines a wrapper for int32, and then 
>>>>> tries to create a scalar of this type showing that the user has to call a 
>>>>> special method rather than just the normal API:
>>>>> 
>>>>> ```
>>>>> import pyarrow as pa 
>>>>> 
>>>>> 
>>>>> class IntegerWrapper(pa.ExtensionType):
>>>>> 
>>>>>     def __init__(self):
>>>>>         pa.ExtensionType.__init__(self, pa.int32(), "integer_wrapper")
>>>>> 
>>>>>     def __arrow_ext_serialize__(self):
>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>         # metadata to be deserialized
>>>>>         return b''
>>>>> 
>>>>>     @classmethod
>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>         # return an instance of this subclass given the serialized
>>>>>         # metadata.
>>>>>         return IntegerWrapper()
>>>>>    
>>>>> 
>>>>> iw_type = IntegerWrapper()
>>>>> 
>>>>> pa.register_extension_type(iw_type)
>>>>> 
>>>>> # throws `ArrowNotImplementedError`
>>>>> # pa.scalar(0, iw_type)
>>>>> 
>>>>> # user must do this, but code should be able to do this?
>>>>> pa.ExtensionScalar.from_storage(
>>>>>     iw_type, pa.scalar(0, iw_type.storage_type)
>>>>> )
>>>>> ```
>>>>> 
>>>>> and I can't seem to get the `to_pandas_dtype` to actually work for a 
>>>>> wrapped dictionary. e.g. 
>>>>> 
>>>>> ```
>>>>> import pyarrow as pa 
>>>>> 
>>>>> 
>>>>> class DictWrapper(pa.ExtensionType):
>>>>> 
>>>>>     def __init__(self):
>>>>>         pa.ExtensionType.__init__(
>>>>>             self, pa.dictionary(pa.int8(), pa.string()), "dict_wrapper"
>>>>>         )
>>>>> 
>>>>>     def __arrow_ext_serialize__(self):
>>>>>         # since we don't have a parameterized type, we don't need extra
>>>>>         # metadata to be deserialized
>>>>>         return b''
>>>>> 
>>>>>     @classmethod
>>>>>     def __arrow_ext_deserialize__(cls, storage_type, serialized):
>>>>>         # return an instance of this subclass given the serialized
>>>>>         # metadata.
>>>>>         return DictWrapper()
>>>>>    
>>>>>     def to_pandas_dtype(self):
>>>>>         from pandas.api.types import CategoricalDtype
>>>>>         return CategoricalDtype(categories=["a", "b"])
>>>>> 
>>>>> dw_type = DictWrapper()
>>>>> 
>>>>> pa.register_extension_type(dw_type)
>>>>> 
>>>>> arr = pa.ExtensionArray.from_storage( 
>>>>>     dw_type,
>>>>>     pa.array(["a", "b", "c"], dw_type.storage_type)
>>>>> )
>>>>> 
>>>>> arr
>>>>> 
>>>>> arr.to_pandas()
>>>>> 
>>>>> arr.to_pandas(categories=dw_type.to_pandas_dtype().categories.values)
>>>>> ```
>>>>> 
>>>>> Best,
>>>>> 
>>>>> Sam
>>>>> IMPORTANT NOTICE: The information transmitted is intended only for the 
>>>>> person or entity to which it is addressed and may contain confidential 
>>>>> and/or privileged material. Any review, re-transmission, dissemination or 
>>>>> other use of, or taking of any action in reliance upon, this information 
>>>>> by persons or entities other than the intended recipient is prohibited. 
>>>>> If you received this in error, please contact the sender and delete the 
>>>>> material from any computer. Although we routinely screen for viruses, 
>>>>> addressees should check this e-mail and any attachment for viruses. We 
>>>>> make no warranty as to absence of viruses in this e-mail or any 
>>>>> attachments.
>>>> 
>>> 
