On Mon, 6 Jun 2022 14:28:41 -0800, Israel Brewster <ijbrews...@alaska.edu> declaimed the following:
>I have some large (>100GB) datasets loaded into memory in a two-dimensional
>(X and Y) NumPy array backed

	Unless you have some massive number-cruncher machine with TB of RAM, you
are running with a lot of page swapping -- and not just cached pages in unused
RAM, but actual disk I/O. Pretty much anything that has to scan the data is
going to be slow!

>Currently I am doing this by creating a boolean array (data['latitude']>50,
>for example), and then applying that boolean array to the dataset using
>.where(), with drop=True. This appears to work, but has two issues:

	FYI: your first paragraph said "longitude", not "latitude".

>1) It's slow. On my large datasets, applying where can take several minutes
>(vs. just seconds to use a boolean array to index a similarly sized numpy
>array)
>2) It uses large amounts of memory (which is REALLY a problem when the array
>is already using 100GB+)

	Personally, given the size of the data, and that it is going to involve
lots of page swapping... I'd try to convert the datasets into some RDBMS --
maybe with indices defined on the latitude/longitude columns, allowing queries
to scan the index to find the matching records and return them (perhaps
processing one at a time with "for rec in cursor:" rather than doing a
.fetchall()). Some RDBMSs even have extensions for spatial data handling. A
rough sketch of what I mean is tacked on below my sig.

--
	Wulfraed                 Dennis Lee Bieber         AF6VN
	wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/
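
	A minimal sketch of that pattern, using sqlite3 from the standard
library as a stand-in for whatever RDBMS you settle on. The table and column
names (samples, latitude, longitude, value) are made up for illustration --
substitute your real schema:

import sqlite3

conn = sqlite3.connect("samples.db")

# One-time setup: load the data into a table and index the column(s) you
# filter on, so the query below can use an index scan instead of reading
# every row.
conn.execute("""
    CREATE TABLE IF NOT EXISTS samples (
        latitude  REAL,
        longitude REAL,
        value     REAL
    )
""")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lat ON samples (latitude)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_lon ON samples (longitude)")

# Select the subset of interest and walk the cursor one record at a time
# ("for rec in cursor:") instead of pulling the whole result set into
# memory with .fetchall().
cursor = conn.execute(
    "SELECT latitude, longitude, value FROM samples WHERE latitude > ?",
    (50.0,),
)
for rec in cursor:
    lat, lon, value = rec
    # ... process one record here ...

conn.close()

	The point being that the WHERE clause is answered from the index, and
iterating the cursor keeps only one row at a time in Python's memory rather
than building yet another 100GB+ copy.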