Hi!

Yes, each chunk can have a different mapping. And yes, you can use
unify_dictionaries/combine_chunks if you need a single shared mapping.
Alternatively, you can keep a separate dictionary for each chunk.

There is one thing I would like to have: indices of arbitrary bit width
(e.g. 4-bit or 12-bit) and, further, something similar for decimals.

I am working with limited memory, and this would reduce the size of my arrays.

BR

J


On Sun, 28 Apr 2024 at 20:51, Laurent Gautier <[email protected]> wrote:

> Thanks!
>
> I read somewhere that the string->int mapping is not guaranteed to be the
> same across chunks. Is this correct?
> If so, is it necessary to call unify_dictionaries() first?
>
> Also, if the operations only work on chunks, is it up to the user to
> iterate through all chunks to create the resulting array of integers?
>
> Best,
>
> Laurent
>
>
> On Sun, 28 Apr 2024 at 14:28, Jacek Pliszka <[email protected]> wrote:
>
>> Hi!
>>
>> table.column('a').chunk(0).dictionary returns dictionary values as an
>> array that you can map...
>>
>> Then you can construct new Dictionary Type columns from the mapped values
>> and table.column('a').chunk(0).indices
>> using pa.DictionaryArray.from_arrays
>>
>> BR
>>
>> J
>>
>>
>>
>> On Sun, 28 Apr 2024 at 20:19, Laurent Gautier <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to cast an Array of data type DictionaryType (for
>>> example, I have DictionaryType(dictionary<values=large_string,
>>> indices=uint32, ordered=0>)) into integers (the indices) and retrieve the
>>> mapping (string -> integer)?
>>>
>>> I cannot find anything about this in the documentation. For the first
>>> request (cast to integers), a plain cast does not work:
>>>
>>> >>> pyarrow.compute.cast(foo, pyarrow.int32())
>>> ArrowInvalid: Failed to parse string: 'Some String' as a scalar of type
>>> int32
>>>
>>>
>>> Best,
>>>
>>>
>>> Laurent
>>>
>>>
