Hi! Yes, each chunk can have a different mapping. And yes, you can call unify_dictionaries() or combine_chunks() if you need a single shared mapping. Alternatively, you can keep a separate dictionary for each chunk.
There is one thing I would like to have: indices with an arbitrary number of bits (e.g. 4-bit or 12-bit) and, further, something similar for decimals. I am working with limited memory, and this would reduce the size of my arrays.

BR

J

On Sun, 28 Apr 2024 at 20:51, Laurent Gautier <[email protected]> wrote:

> Thanks!
>
> I read somewhere that the string->int mapping are not guaranteed to be the
> same across chunks. Is this correct?
> If so, is calling first unify_dictionaries() necessary?
>
> Also, if the operations only work on chunks is it up to the user to
> iterate through all chunks to create the resulting array of integers?
>
> Best,
>
> Laurent
>
>
> On Sun, 28 Apr 2024 at 14:28, Jacek Pliszka <[email protected]> wrote:
>
>> Hi!
>>
>> table.column('a').chunk(0).dictionary returns dictionary values as an
>> array that you can map...
>>
>> Then you can construct new Dictionary Type columns from the mapped values
>> and table.column('a').chunk(0).indices
>> using pa.DictionaryArray.from_arrays
>>
>> BR
>>
>> J
>>
>>
>> On Sun, 28 Apr 2024 at 20:19, Laurent Gautier <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Is there a way to cast an Array of data type DictionaryType ( for
>>> example, I have DictionaryType(dictionary<values=large_string,
>>> indices=uint32, ordered=0>)) into integers (the indices) and retrieve the
>>> mapping (string -> integer)?
>>>
>>> I cannot find anything about this in the documentation. For the first
>>> ask (cast to integers), trying to cast does not work:
>>>
>>>
>>> pyarrow.compute.cast(foo, pyarrow.int32())
>>> ArrowInvalid: Failed to parse string: 'Some String' as a scalar of type
>>> int32
>>>
>>>
>>> Best,
>>>
>>>
>>> Laurent
