Re: [C++][Parquet] Best practice to write duplicated strings / enums into parquet

2023-05-22 Thread Haocheng Liu
StringDictionaryBuilder sounds like a perfect candidate for my use case. Thanks Weston!

On Mon, May 22, 2023 at 3:01 PM Weston Pace wrote:
> Arrow can also represent dictionary encoding. If you like StringBuilder
> then there is also a StringDictionaryBuilder which should be more or less
>

Re: [C++][Parquet] Best practice to write duplicated strings / enums into parquet

2023-05-22 Thread Weston Pace
Arrow can also represent dictionary encoding. If you like StringBuilder then there is also a StringDictionaryBuilder which should be more or less compatible:

TEST(TestStringDictionaryBuilder, Basic) {
  // Build the dictionary Array
  StringDictionaryBuilder builder;
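For readers unfamiliar with the idea, here is a minimal standard-library sketch of what dictionary encoding does with repeated strings: each distinct value is stored once and every row keeps only a small integer index. The `DictionaryColumn` type and its members are hypothetical names made up for illustration; this is not Arrow's API or implementation, only the general technique.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type; not part of Arrow or Parquet.
struct DictionaryColumn {
  std::vector<std::string> dictionary;              // each distinct value, stored once
  std::vector<int32_t> indices;                     // one small integer per row
  std::unordered_map<std::string, int32_t> lookup;  // value -> dictionary slot

  // Append one row; the value joins the dictionary only the first time it is seen.
  void Append(const std::string& value) {
    auto it = lookup.find(value);
    if (it == lookup.end()) {
      it = lookup.emplace(value, static_cast<int32_t>(dictionary.size())).first;
      dictionary.push_back(value);
    }
    indices.push_back(it->second);
  }

  // Decode a row back into the string a reader expects to see.
  const std::string& ValueAt(std::size_t row) const {
    assert(row < indices.size());
    return dictionary[indices[row]];
  }
};
```

Appending "RED", "GREEN", "RED", "BLUE", "RED" to such a column leaves three dictionary entries and five integer indices, and `ValueAt` still returns the original strings on read.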

[C++][Parquet] Best practice to write duplicated strings / enums into parquet

2023-05-22 Thread Haocheng Liu
Hi, I have a use case that can be simplified as follows: there is a mapping {0 -> "RED", 1 -> "GREEN", 2 -> "BLUE", etc.} and I need to write its values hundreds of millions of times. Each row may contain tens of such int -> string maps. When users read the data, they want to see "RED", "GREEN" and "BLUE" rather than some