After converting the string arrays to dictionary-encoded arrays, I didn't see any difference. There are too many filters to compute, roughly 50-60K. I'm using Acero's Filter node. I'll try your suggestion of using a list of indices.
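
For reference, here is a minimal sketch of the list-of-indices idea as I understand it, written in plain C++ rather than Arrow or Acero code. `Table`, `ColumnEquals`, `Refine` and `ApplyFilter` are hypothetical names made up for illustration, and equality on string columns is only an assumed predicate shape. Each filter starts from all row indices and narrows them in place, so a single scratch buffer is reused and nothing is allocated between filter passes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a table with two string columns (not an Arrow type).
struct Table {
  std::vector<std::string> city;
  std::vector<std::string> status;
};

// One run-time predicate: column[row] == value (assumed predicate shape).
struct ColumnEquals {
  const std::vector<std::string>* column;
  std::string value;
};

// Narrow the first `count` entries of `indices` in place to the rows that
// satisfy `pred`. Returns the number of surviving rows. No allocation.
inline std::size_t Refine(const ColumnEquals& pred,
                          std::vector<int64_t>& indices,
                          std::size_t count) {
  std::size_t kept = 0;
  for (std::size_t i = 0; i < count; ++i) {
    const int64_t row = indices[i];
    if ((*pred.column)[static_cast<std::size_t>(row)] == pred.value) {
      indices[kept++] = row;  // compact survivors toward the front
    }
  }
  return kept;
}

// Apply one filter (a conjunction of predicates) over `num_rows` rows.
// `scratch` is caller-owned and reused across filters; on return its first
// N entries hold the matching row indices, where N is the return value.
inline std::size_t ApplyFilter(const std::vector<ColumnEquals>& filter,
                               std::size_t num_rows,
                               std::vector<int64_t>& scratch) {
  assert(scratch.size() >= num_rows);
  for (std::size_t i = 0; i < num_rows; ++i) {
    scratch[i] = static_cast<int64_t>(i);  // start with every row selected
  }
  std::size_t count = num_rows;
  for (const ColumnEquals& pred : filter) {
    count = Refine(pred, scratch, count);
    if (count == 0) break;  // nothing left to test
  }
  return count;
}
```

The caller would allocate `scratch` once with one slot per row and pass it to every `ApplyFilter` call. Later predicates only touch the rows that survived earlier ones, which is why this should help most when the filters are highly selective.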

Thanks,
Surya

On Fri, Apr 14, 2023 at 10:17 AM Weston Pace <[email protected]> wrote:

> It's unlikely there will be much benefit in using dictionary encoding.
> How are you doing the filtering? It sounds like you might have many
> filters. `arrow::compute::Filter` will not get great performance if you
> want to apply many filters. No one has added support for selection vectors
> but that would probably be the fastest way to apply many filters against
> the same array. Ideally you could avoid any kind of allocation between
> each filter pass. Although, if the filters were highly selective you might
> want to use a list of indices instead of a selection vector. However, this
> has also not been implemented.
>
> On Wed, Apr 12, 2023 at 7:57 PM Surya Kiran Gullapalli <
> [email protected]> wrote:
>
>> Hi,
>> I’m trying to run filter based on multiple columns (decided at run time)
>> in a table. There are more than a million of filters I have to run and
>> even though I’m getting results it was taking humongous amount of time to
>> complete the operation. I’m using Acero filter node to get the filtered
>> table
>>
>> I’d like to know if there’ll be a performance improvement if I use
>> dictionary encoded arrays instead of string arrays?
>>
>> Also, any other pointers to speed up the operation
>>
>> Thanks,
>> Surya
