RKuttruff opened a new pull request, #294:
URL: https://github.com/apache/incubator-sdap-nexus/pull/294
# SDAP-472
Major overhaul of the `data-access` component of SDAP to support multiple
data store backends simultaneously, with one new backend implemented to support
gridded Zarr data stored either locally or in S3. Datasets are defined in the
`nexusdatasets` Solr collection. SDAP will poll that collection (currently
hourly, on startup, and on execution of a dataset management query), attempt
to add/open any new datasets, and drop any datasets that are no longer
present. Datasets can still be defined in the manner they currently are;
`nexusproto` is the default backend and requires no additional data.
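As a rough sketch of the polling behavior described above (the Solr host, document field names, and reconciliation logic here are assumptions for illustration only, not this PR's exact implementation):
```python
# Minimal sketch of polling the 'nexusdatasets' Solr collection.
# Assumptions: Solr reachable at localhost:8983 and dataset documents
# carrying an 'id' field -- the real schema may differ.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/nexusdatasets')

def poll_datasets(currently_open: set) -> tuple:
    """Return (datasets to add/open, datasets to drop)."""
    defined = {doc['id'] for doc in solr.search('*:*', rows=1000)}
    to_add = defined - currently_open   # new datasets to attempt to open
    to_drop = currently_open - defined  # datasets no longer present
    return to_add, to_drop
```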
## Adding Datasets
There are two ways to add new Zarr datasets: the 'hardcoded' approach through
the Collection Manager, or the dynamic approach through the dataset management
endpoints.
### Collection Manager
A Zarr collection can be specified in the collection config YAML file as
follows:
```yaml
collections:
  - id: dataset_name
    path: file:///path/to/zarr/root/
    projection: Grid
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variable: <data var>
    storeType: zarr
  - id: dataset_s3
    path: s3://bucket/key/
    projection: GridMulti
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variables:
        - <data var>
        - <data var>
        - <data var>
    storeType: zarr
    config:
      aws:
        accessKeyID: <AWS access key ID>
        secretAccessKey: <AWS secret access key>
        public: false
```
These datasets are strictly hardcoded and can currently only be removed
by manually deleting the associated document from Solr; they cannot be deleted
or altered through the dataset management endpoints.
There is an [accompanying ingester
PR](https://github.com/apache/incubator-sdap-ingester/pull/86) to facilitate
this.
### Dataset Management Endpoints
Included is a set of endpoints to add, update, and remove Zarr datasets on
the fly.
#### Add Dataset
- Path: `/datasets/add`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to add
  - `path`: Path to the root of the Zarr group to add
- Body (content types: `application/json`, `application/yaml`):
```yaml
variable: <var>
coords:
  latitude: <lat name>
  longitude: <lon name>
  time: <time name>
aws: # required if in S3
  public: false
  accessKeyID: <AWS access key ID>
  secretAccessKey: <AWS secret access key>
  region: <AWS region>
```
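As a hypothetical usage example (the host, port, dataset name, path, and credentials below are placeholders):
```python
# Hypothetical example of adding a Zarr dataset in S3 via /datasets/add.
import requests

body = {
    'variable': '<var>',
    'coords': {'latitude': '<lat name>', 'longitude': '<lon name>', 'time': '<time name>'},
    'aws': {
        'public': False,
        'accessKeyID': '<AWS access key ID>',
        'secretAccessKey': '<AWS secret access key>',
        'region': '<AWS region>',
    },
}

response = requests.post(
    'http://localhost:8083/datasets/add',          # placeholder SDAP host/port
    params={'name': 'my_zarr_dataset', 'path': 's3://bucket/key/'},
    json=body,
)
print(response.status_code, response.text)
```
Per the content types above, an equivalent YAML body could be sent with the `application/yaml` content type instead.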
#### Update Dataset
- Path: `/datasets/update`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to update
- Body (content types: `application/json`, `application/yaml`): same format as `/datasets/add`
#### Delete Dataset
- Path: `/datasets/remove`
- Type: `GET`
- Params:
  - `name`: Name of the dataset to delete
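A corresponding hypothetical removal call (same placeholder host and dataset name as above):
```python
# Hypothetical example of removing a dynamically added dataset.
import requests

response = requests.get(
    'http://localhost:8083/datasets/remove',       # placeholder SDAP host/port
    params={'name': 'my_zarr_dataset'},
)
print(response.status_code, response.text)
```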
## Testing
This PR will require extensive testing to ensure that (a) the added Zarr
backend fully, or at least mostly, supports all existing (previously
functioning) SDAP algorithms, and (b) nothing is broken when using the
`nexusproto` backend (i.e., all existing functionality is preserved). Ideally,
this would require little to no adaptation of the individual algorithm
implementations; however, it appears a number of them will require small
changes that _should_ not have any impact on `nexusproto` functionality.
### `nexusproto` Testing
Currently, the interface between the algorithms and the data backends routes
data requests to the `nexusproto` backend by default (i.e., if no target
dataset was given or could be determined). This may not end up being desirable
and may be removed depending on the discussion for this PR. With that in mind,
the net result of this defaulting is that this PR _should_ not break any
existing functionality. As a quick check, I ran a test suite for the endpoints
used by the CDMS project and found no endpoints failing or returning
inconsistent data.

Further tests will be conducted to verify that queries return the same results
when run against the same dataset in `nexusproto` and Zarr.
### `zarr` Testing
The following table lists the algorithms/endpoints that have been tested with
Zarr support. The 'Working' column indicates that the endpoint successfully
returns data; the 'Validated' column indicates that the returned data is
identical to that of the same query on the same dataset ingested to
`nexusproto`; the 'Alterations' column lists the alterations needed to get the
algorithm working (detailed below).
| Endpoint | Working | Validated | Alterations |
|---------------------------------|:---------:|:-----------:|-------------|
| `/datainbounds` | X | X | |
| `/cdmssubset` | X | X | e |
| `/timeSeriesSpark` | X | | c |
| `/latitudeTimeHofMoellerSpark` | X | | b,c |
| `/longitudeTimeHofMoellerSpark` | X | | b,c |
| `/timeAvgMapSpark` | X | | a |
| `/match_spark` | X | X | c |
| `/corrMapSpark` | X | | b,d |
| `/dailydifferenceaverage_spark` | X | | c |
| `/maxMinMapSpark` | X | | |
| `/climMapSpark` | X | | b |
| `/varianceSpark` | X | | b |
a. Dependent on @kevinmarlis's #259 -- now merged; no longer a concern
b. Dependent on changes similar to (a), outlined in #272
c. Modifications to some `NexusTileService` (NTS) calls (specifying the source dataset, etc.)
d. Bug fix unrelated to the backend changes
e. Dependent on #268
<hr>
The hope for this implementation was that it would integrate seamlessly with
the existing algorithms; however, it appears some algorithms will need to add
kwargs to certain `NexusTileService` calls. In particular, it is now imperative
that the target dataset name is given via the `dataset` or `ds` kwarg (if the
function definition specifies neither, as in `find_tile_by_id`, either kwarg
can be used). A minimal sketch of such calls follows.
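For illustration only (assuming an existing `NexusTileService` instance; the method `find_tiles_in_polygon`, its arguments, and the dataset name shown are illustrative rather than prescriptive):
```python
# Sketch of passing the target dataset to NexusTileService calls.
def fetch_tiles(tile_service, bounding_polygon, start, end):
    # The target dataset must now be named via the 'dataset' or 'ds' kwarg
    # so the request is routed to the correct backend.
    return tile_service.find_tiles_in_polygon(
        bounding_polygon,
        ds='my_zarr_dataset',
        start_time=start,
        end_time=end,
    )

def fetch_tile(tile_service, tile_id):
    # For functions whose definitions name neither kwarg explicitly
    # (as in find_tile_by_id), either spelling may be used.
    return tile_service.find_tile_by_id(tile_id, ds='my_zarr_dataset')
```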
<hr>
This was originally #265, which was closed automatically on merge & branch
delete, so I had to reopen it.