RKuttruff opened a new pull request, #61: URL: https://github.com/apache/incubator-sdap-ingester/pull/61
This PR makes several improvements to the granule ingester (GI):

1. Made the writes to the data and metadata stores more fault tolerant. The existing implementation writes to the underlying stores very frequently. This seems to overburden them, and eventually a write may fail. Such a failure was treated as an unrecoverable loss of connection: the GI failed and the granule had to be restarted from the beginning. With this PR, failed writes are retried a few times with some backoff time between attempts.
2. Consolidated writes. For small tile sizes (and thus a large number of tiles per granule), a lot of time seems to be spent writing the data to the underlying stores. This PR consolidates those writes into one large push after all the tiles have been generated, significantly reducing the number of network I/O operations.
3. Optimized data subset conversion to `np.ndarray`. The conversion was initially done through a call to `np.ma.filled` with a `list` of `xr.DataArray` objects. This is unnecessary because a) `np.ma.filled` is supposed to be used with masked arrays and, given a `list`, just calls `np.array`; and b) there seems to be a lot of inefficiency in calling `np.array` on `xr.DataArray` objects, especially when they are backed by `np.ndarray` objects that can be easily fetched from them. This PR fetches those backing arrays directly.

With the changes listed above, a benchmark ingest of 1 VIIRS granule into 100x100 tiles went from 4580 seconds to just 163 seconds.

Further testing and benchmarking coming soon.
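For readers who want a concrete picture of items 1 and 2 above, here is a minimal sketch of a single consolidated write wrapped in retry-with-backoff logic. It is not the code from this PR: `retry_with_backoff`, `FlakyStore`, `save_batch`, and the attempt count and delays are all hypothetical names and values chosen for illustration, and the real stores' client APIs are not shown.

```python
import time


def retry_with_backoff(write, attempts=4, base_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call write() until it succeeds or the attempts run out.

    Hypothetical helper: the name, defaults, and the broad `except`
    are illustrative assumptions, not taken from the PR.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return write()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the original error
            sleep(delay)  # wait before the next try
            delay *= factor  # grow the wait after each failure


class FlakyStore:
    """Stand-in for a data/metadata store; fails the first `failures` writes."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0
        self.saved = []

    def save_batch(self, tiles):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated write failure")
        self.saved.extend(tiles)
        return len(tiles)


# Consolidated write: generate every tile first, then push once, with retries.
tiles = [f"tile-{i}" for i in range(5)]
store = FlakyStore(failures=2)
waits = []  # record the backoff delays instead of actually sleeping
written = retry_with_backoff(
    lambda: store.save_batch(tiles), base_delay=0.5, sleep=waits.append
)
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; a production version would likely catch only the store client's connection errors rather than every `Exception`.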