leprechaunt33 commented on issue #33049:
URL: https://github.com/apache/arrow/issues/33049#issuecomment-1459102439

   @Ben-Epstein Yup, it's really only a workaround applicable to cases 
where you need the data in memory after filtering; since vaex is meant to 
execute lazily and memory-map as required, a lot of folks will be encountering 
the problem in other contexts. It's certainly possible it's a take explosion. 
The context in this case is that I'm working with a df.take on the indices of 
rows identified as having matching report ids, because the lack of indexing and 
a many-to-one record relationship between two tables prevents a join from being 
used. The columns failing are free-text reports relating to leiomyosarcoma 
cases, so there are fewer than 100 scattered throughout this table of >2M 
reports, and they get filtered via a regex query on a MedDRA term. It's 
possible the take is being multiplied across the different tables/arrays from 
the different hdf5 files in the dataset, multiplied again by the separate 
chunks of those files, creating polynomial complexity, but I'm not familiar 
enough yet with the vaex internals to confirm that. As you figured out, the 
DataFrame take vs. Arrow take and the code complexity make it a little 
challenging to debug. I'll be able to look more under the hood at what's going 
on in a couple of days.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org