[ 
https://issues.apache.org/jira/browse/ARROW-17834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17609948#comment-17609948
 ] 

Joris Van den Bossche commented on ARROW-17834:
-----------------------------------------------

> One additional tricky thing here is what if the storage array also needs 
> additional arguments. 

Hmm, such a case wouldn't be solved by this simple solution. 

One possible solution I considered is to encode the dictionary in the 
extension type itself (e.g. by requiring or allowing it as an argument to the 
type constructor, like {{LabelType(dictionary=...)}}), so that the cast could 
take care of it. However, in Arrow the dictionary is part of the data, not 
the type, so casting to an extension type (which casts to the storage type 
under the hood) won't actually check the dictionary values. 

For such a use case, you would still have to create the storage array yourself 
first (in this case, by constructing the DictionaryArray from explicit indices 
and an explicit dictionary, to ensure that a specific dictionary is used) 
before converting it to an extension array. 

The only way I can think of to give the extension array author control over 
this would be to add a method like {{\_\_arrow_construct_storage_array\_\_}} 
to the extension type, which we would call instead of 
{{pa.array(data, type=ext_type.storage_type)}}. But I am not fully sure this is 
useful enough in general to warrant adding such a method (a more general 
mechanism that might be interesting to add is a way to register custom cast 
methods).


> [Python] Allow creating ExtensionArray through pa.array(..) constructor
> -----------------------------------------------------------------------
>
>                 Key: ARROW-17834
>                 URL: https://issues.apache.org/jira/browse/ARROW-17834
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Currently, creating an ExtensionArray from a python sequence (or numpy array, 
> ..) requires the following:
> {code:python}
> import pyarrow as pa
> from pyarrow.tests.test_extension_type import IntegerType
> storage_array = pa.array([1, 2, 3])
> ext_arr = pa.ExtensionArray.from_storage(IntegerType(), storage_array)
> {code}
> While doing this directly in {{pa.array(..)}} doesn't work:
> {code:python}
> >>> pa.array([1, 2, 3], type=IntegerType())
> ArrowNotImplementedError: extension
> {code}
> I think it should be possible to basically do the ExtensionArray.from_storage 
> under the hood in {{pa.array(..)}} when the specified type is an extension 
> type?
> I think this should also enable converting from a pandas DataFrame (with a 
> column with matching storage values) to a Table with a specified schema that 
> includes an extension type. Like:
> {code:python}
> import pandas as pd
> df = pd.DataFrame({'a': [1, 2, 3]})
> pa.table(df, schema=pa.schema([('a', IntegerType())]))
> {code}


