On Mon, 6 Jun 2022 14:28:41 -0800, Israel Brewster <ijbrews...@alaska.edu>
declaimed the following:

>I have some large (>100GB) datasets loaded into memory in a two-dimensional (X 
>and Y) NumPy array backed

        Unless you have a massive number-crunching machine with terabytes of
RAM, you are running with a lot of page swapping -- and not just cached pages
sitting in otherwise unused RAM, but actual disk I/O.

        Pretty much anything that has to scan the data is going to be slow!

>
>Currently I am doing this by creating a boolean array (data[‘latitude’]>50, 
>for example), and then applying that boolean array to the dataset using 
>.where(), with drop=True. This appears to work, but has two issues:
>

        FYI: your first paragraph said "longitude", not "latitude".
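        For concreteness, here is a minimal sketch of the kind of
mask-plus-.where(drop=True) filtering described above, assuming an xarray
Dataset with 2-D latitude/longitude coordinates (the variable and coordinate
names are illustrative, not from the original post):

import numpy as np
import xarray as xr

# Toy stand-in for the real (100 GB+) dataset.
ds = xr.Dataset(
    {"temperature": (("y", "x"), np.random.rand(1000, 1000))},
    coords={
        "latitude": (("y", "x"), np.random.uniform(40, 70, (1000, 1000))),
        "longitude": (("y", "x"), np.random.uniform(-170, -130, (1000, 1000))),
    },
)

# Build a boolean mask, then drop everything outside it.  With drop=True,
# xarray has to realign and copy the data, which is where the time and
# memory go on a dataset this large.
mask = ds["latitude"] > 50
subset = ds.where(mask, drop=True)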

>1) It’s slow. On my large datasets, applying where can take several minutes 
>(vs. just seconds to use a boolean array to index a similarly sized numpy 
>array)
>2) It uses large amounts of memory (which is REALLY a problem when the array 
>is already using 100GB+)
>

        Personally, given the size of the data, and that it is going to involve
lots of page swapping... I'd try to convert the datasets into some RDBMS --
maybe with indices defined on the latitude/longitude columns, allowing queries
to scan the index to find matching records and return those (perhaps
processing them one at a time with "for rec in cursor:" rather than doing a
.fetchall()). A sketch follows below.

        Some RDBMSs even have extensions for spatial data handling.
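
        Here is a minimal sketch of that approach using SQLite from the Python
standard library, purely for illustration; the table and column names are
assumptions, not something from the original post:

import sqlite3

conn = sqlite3.connect("samples.db")  # file name is illustrative
conn.execute(
    "CREATE TABLE IF NOT EXISTS samples "
    "(latitude REAL, longitude REAL, value REAL)"
)
# An index on latitude lets the query below do an index scan instead of
# reading every row in the table.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_samples_lat ON samples (latitude)"
)
conn.commit()

# Stream the matching records one at a time rather than pulling the whole
# result set into memory with .fetchall().
cursor = conn.execute(
    "SELECT latitude, longitude, value FROM samples WHERE latitude > ?",
    (50.0,),
)
for rec in cursor:
    lat, lon, value = rec  # process each record here, one at a time
conn.close()

        With SQLite specifically, the SpatiaLite extension is one example of
the spatial data handling mentioned above.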


-- 
        Wulfraed                 Dennis Lee Bieber         AF6VN
        wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/