Re: Filtering XArray Datasets?

2022-06-07 Thread Peter Otten

On 07/06/2022 00:28, Israel Brewster wrote:

I have some large (>100GB) datasets loaded into memory in a two-dimensional (X
and Y) NumPy-array-backed XArray dataset. At one point I want to filter the
data using a boolean array created by performing a boolean operation on the
dataset; that is, I want to filter the dataset for all points with a longitude
value greater than, say, 50 and less than 60, just to give an example
(hopefully that all makes sense?).

Currently I am doing this by creating a boolean array (data['latitude'] > 50,
for example), and then applying that boolean array to the dataset using
.where(), with drop=True. This appears to work, but has two issues:

1) It's slow. On my large datasets, applying .where() can take several minutes
(vs. just seconds to index a similarly sized plain NumPy array with a boolean
mask).
2) It uses large amounts of memory (which is REALLY a problem when the array
is already using 100GB+).

It looks like the values corresponding to True in the boolean array are copied
to a new XArray object, potentially doubling memory usage until the copy
completes, at which point the original object can be dropped, freeing the
memory.

Is there any solution for these issues? Some way to do the filtering in place?


Can XArrays be sorted and resized in place? If so, you could sort on the key
longitude <= 50 (so that all rows with longitude > 50 come first), find the
index of the first row with longitude <= 50, and then resize the array to cut
off the tail.

(If the order of rows matters, the sort algorithm has to be stable.)
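
On a bare NumPy array, a close variant of this idea (compacting the kept rows
to the front rather than fully sorting, then resizing) looks roughly like the
sketch below; the data is made up, and whether XArray allows the same in-place
resize is exactly the open question:

    import numpy as np

    rng = np.random.default_rng(0)
    longitude = rng.uniform(0, 90, size=1_000_000)  # stand-in for one column

    keep = (longitude > 50) & (longitude < 60)      # rows to keep
    n = np.count_nonzero(keep)

    # Boolean indexing is stable: kept rows retain their original order.
    # It allocates a temporary of just the kept rows, never a full copy.
    longitude[:n] = longitude[keep]      # compact kept rows to the front
    longitude.resize(n, refcheck=False)  # shrink the owned buffer in place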


Re: Filtering XArray Datasets?

2022-06-07 Thread Martin Di Paola

Hi, I'm not an expert on this, so this is an educated guess:

You are calling .where() with drop=True, so I presume you want to delete the
rows of your dataset that don't match the condition.

That's a problem.

If the underlying original data is stored in a dense contiguous array,
deleting chunks of it will leave it with "holes". Unless the backend supports
sparse implementations, it is likely to go for the easiest solution: copy the
non-deleted rows into a new array.

I don't know the details of your particular problem, but most of the time the
trick is to avoid loading the whole dataset in the first place.

See if, instead of loading the full dataset and then performing the
filtering/selection, you can do the filtering during the loading.
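
With XArray specifically, one way to do that (a sketch, assuming the data
lives in a NetCDF file and that dask is installed; the path, chunk sizes and
variable names are made up) is to open the dataset lazily with dask-backed
chunks, filter, and only then load the result:

    import xarray as xr

    # Nothing is read into RAM yet; chunks are loaded on demand.
    ds = xr.open_dataset("data.nc", chunks={"x": 4096, "y": 4096})

    subset = ds.where((ds["longitude"] > 50) & (ds["longitude"] < 60),
                      drop=True)

    subset.load()  # only the filtered result is materialized in memory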

An alternative is to do the filtering "before" doing the real work. For
example, if you have a CSV of >100GB, you could write a program X that copies
the dataset into a new CSV, applying the filter as it goes. Then you load the
filtered dataset and do the real work in a program Y.

I explicitly named X and Y because, in principle, they are two different
programs, possibly even written in two different technologies.
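
A sketch of such a program X in Python (file and column names are made up),
streaming the CSV row by row so the full 100GB file never sits in memory:

    import csv

    with open("full.csv", newline="") as src, \
         open("filtered.csv", "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only the rows that pass the filter.
            if 50 < float(row["longitude"]) < 60:
                writer.writerow(row)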

I hope this email gives you some hints on how to fix it. In my last project I
had a similar problem and I ended up doing the filtering in Python and the
"real work" in Julia.

Thanks!
Martin.


On Mon, Jun 06, 2022 at 02:28:41PM -0800, Israel Brewster wrote:

I have some large (>100GB) datasets loaded into memory in a two-dimensional (X
and Y) NumPy-array-backed XArray dataset. At one point I want to filter the
data using a boolean array created by performing a boolean operation on the
dataset; that is, I want to filter the dataset for all points with a longitude
value greater than, say, 50 and less than 60, just to give an example
(hopefully that all makes sense?).

Currently I am doing this by creating a boolean array (data['latitude'] > 50,
for example), and then applying that boolean array to the dataset using
.where(), with drop=True. This appears to work, but has two issues:

1) It's slow. On my large datasets, applying .where() can take several minutes
(vs. just seconds to index a similarly sized plain NumPy array with a boolean
mask).
2) It uses large amounts of memory (which is REALLY a problem when the array
is already using 100GB+).

It looks like the values corresponding to True in the boolean array are copied
to a new XArray object, potentially doubling memory usage until the copy
completes, at which point the original object can be dropped, freeing the
memory.

Is there any solution for these issues? Some way to do the filtering in place?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory
Geophysical Institute - UAF
2156 Koyukuk Drive
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145



Re: Filtering XArray Datasets?

2022-06-06 Thread Dennis Lee Bieber
On Mon, 6 Jun 2022 14:28:41 -0800, Israel Brewster declaimed the following:

>I have some large (>100GB) datasets loaded into memory in a two-dimensional (X 
>and Y) NumPy-array-backed

Unless you have some massive number-cruncher machine with terabytes of RAM,
you are running with a lot of page swap -- and not just cached pages in unused
RAM; actual disk I/O.

Pretty much anything that has to scan the data is going to be slow!

>
>Currently I am doing this by creating a boolean array (data['latitude'] > 50, 
>for example), and then applying that boolean array to the dataset using 
>.where(), with drop=True. This appears to work, but has two issues:
>

FYI: your first paragraph said "longitude", not "latitude".

>1) It's slow. On my large datasets, applying .where() can take several minutes 
>(vs. just seconds to index a similarly sized plain NumPy array with a boolean 
>mask).
>2) It uses large amounts of memory (which is REALLY a problem when the array 
>is already using 100GB+).
>

Personally, given the size of the data, and that it is going to involve lots
of page swapping, I'd try to convert the datasets into some RDBMS, maybe with
indices defined on the latitude/longitude columns, allowing queries to scan
the index to find matching records and return those (perhaps processing them
one at a time with "for rec in cursor:" rather than doing a .fetchall()).

Some RDBMSs even have extensions for spatial data handling.
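
For example, with the sqlite3 module from the standard library (table and
column names here are hypothetical):

    import sqlite3

    con = sqlite3.connect("observations.db")
    # An index on longitude lets the query use an index range scan
    # instead of reading every row.
    con.execute("CREATE INDEX IF NOT EXISTS idx_lon ON points (longitude)")

    cur = con.execute(
        "SELECT * FROM points WHERE longitude > ? AND longitude < ?",
        (50, 60),
    )
    for rec in cur:   # stream matching records one at a time
        pass          # process rec here, without a giant .fetchall()

    con.close()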


-- 
Wulfraed Dennis Lee Bieber AF6VN
wlfr...@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/


Filtering XArray Datasets?

2022-06-06 Thread Israel Brewster
I have some large (>100GB) datasets loaded into memory in a two-dimensional (X
and Y) NumPy-array-backed XArray dataset. At one point I want to filter the
data using a boolean array created by performing a boolean operation on the
dataset; that is, I want to filter the dataset for all points with a longitude
value greater than, say, 50 and less than 60, just to give an example
(hopefully that all makes sense?).

Currently I am doing this by creating a boolean array (data['latitude'] > 50,
for example), and then applying that boolean array to the dataset using
.where(), with drop=True. This appears to work, but has two issues:

1) It's slow. On my large datasets, applying .where() can take several minutes
(vs. just seconds to index a similarly sized plain NumPy array with a boolean
mask).
2) It uses large amounts of memory (which is REALLY a problem when the array
is already using 100GB+).

It looks like the values corresponding to True in the boolean array are copied
to a new XArray object, potentially doubling memory usage until the copy
completes, at which point the original object can be dropped, freeing the
memory.
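
Roughly, the current code looks like this (a toy-sized sketch; the real
dataset and variable names differ):

    import numpy as np
    import xarray as xr

    # Toy stand-in for the real >100GB dataset
    data = xr.Dataset(
        {"temp": (("x", "y"), np.random.rand(1000, 1000))},
        coords={"latitude": (("x", "y"),
                             np.random.uniform(0, 90, (1000, 1000)))},
    )

    mask = data["latitude"] > 50
    filtered = data.where(mask, drop=True)  # slow, and copies to a new object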

Is there any solution for these issues? Some way to do the filtering in place?
---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145
