RKuttruff opened a new pull request, #61:
URL: https://github.com/apache/incubator-sdap-ingester/pull/61

   This PR makes several improvements to the granule ingester:
   
   1. Made the writes to the data and metadata stores more fault tolerant. The 
existing implementation writes to the underlying stores very frequently. 
This seems to overburden them, and eventually a write may fail. Such a failure 
was treated as an unrecoverable loss of connection: the granule ingester (GI) 
failed and the granule had to be restarted from the beginning. With this PR, 
failed writes are retried a few times with some backoff time between attempts.
   2. Consolidated writes. For small tile sizes (and thus a large number of 
tiles per granule), a lot of time seems to be spent writing the data to the 
underlying stores. This PR consolidates the writes into a large push after all 
the tiles have been generated, significantly reducing the number of network 
I/O operations.
   3. Optimized data subset conversion to `np.ndarray`. This was initially done 
through a call to `np.ma.filled` with a `list` of `xr.DataArray` objects. That 
is unnecessary because a) `np.ma.filled` is meant for masked arrays and, given 
a `list`, just calls `np.array`; and b) calling `np.array` on `xr.DataArray` 
objects seems to be very inefficient, especially when those objects are backed 
by `np.ndarray` objects that can easily be fetched from them directly. This PR 
fetches the backing arrays directly.
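
   As a rough illustration of items 1 and 2 (not the code in this PR), the 
sketch below buffers all tiles and then pushes them in one write that is 
retried with exponential backoff. Every name here (`StoreWriteError`, 
`write_with_retry`, `FlakyStore`, `ingest`) is hypothetical, and the attempt 
count and delays are assumed values; the PR description does not state them.

```python
import random
import time


class StoreWriteError(Exception):
    """Hypothetical error raised when a store rejects a write."""


def write_with_retry(write_fn, payload, max_attempts=5, base_delay=1.0,
                     sleep=time.sleep):
    """Call write_fn(payload), retrying with exponential backoff on failure.

    max_attempts and base_delay are assumed values, not taken from the PR.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn(payload)
        except StoreWriteError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            # The delay doubles on each attempt; a little jitter keeps
            # concurrent writers from retrying in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


class FlakyStore:
    """In-memory stand-in for a data or metadata store.

    Fails the first `failures_before_success` writes, then succeeds.
    """

    def __init__(self, failures_before_success):
        self.failures_left = failures_before_success
        self.calls = 0
        self.saved = []

    def save_batch(self, tiles):
        self.calls += 1
        if self.failures_left > 0:
            self.failures_left -= 1
            raise StoreWriteError("simulated write failure")
        self.saved.extend(tiles)
        return len(tiles)


def ingest(tile_source, store, sleep=time.sleep):
    """Buffer every generated tile, then push them all in one retried write."""
    buffered = list(tile_source)
    return write_with_retry(store.save_batch, buffered, sleep=sleep)


if __name__ == "__main__":
    store = FlakyStore(failures_before_success=2)
    tiles = (f"tile-{i}" for i in range(1000))
    # Skip real sleeping in the demo so it finishes immediately.
    written = ingest(tiles, store, sleep=lambda seconds: None)
    print(written, store.calls)
```

   In this sketch the store sees three calls for 1000 tiles (two simulated 
failures and one success) instead of one call per tile, and a transient 
failure no longer forces the granule to restart from the beginning.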
   
   With the changes listed above, a benchmark ingest of 1 VIIRS granule into 
100x100 tiles went from 4580 seconds to just 163 seconds (roughly a 28x 
speedup).
   
   Further testing + benchmarking coming soon.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@sdap.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
