leprechaunt33 commented on issue #33049: URL: https://github.com/apache/arrow/issues/33049#issuecomment-1459102439
@Ben-Epstein yup, it's really only a workaround applicable to cases where you need the data in memory after filtering, and since vaex is meant to be lazy-execution, memory-mapping as required, a lot of folks will hit the problem in other contexts.

It's certainly possible it's a take explosion. The context in this case is that I'm using a df.take on the indices of rows identified as having matching report ids, because the lack of indexing and a many-to-one record relationship between two tables prevents a join from being used. The columns that fail are free-text reports relating to leiomyosarcoma cases, so there are fewer than 100 matches scattered throughout this table of >2M reports, filtered via a regex query on a MedDRA term. It's possible the take is being replicated across the different tables/arrays from the different hdf5 files in the dataset, multiplied again by the separate chunks of those files, creating polynomial complexity, but I'm not familiar enough yet with the vaex internals to confirm that.

As you figured out, the DataFrame take vs. the arrow take, plus the code complexity, makes this a little challenging to debug. I'll be able to look more under the hood at what's going on in a couple of days.