RKuttruff opened a new pull request, #294:
URL: https://github.com/apache/incubator-sdap-nexus/pull/294
# SDAP-472
Major overhaul of the `data-access` component of SDAP to support multiple
data store backends simultaneously, with one new backend implemented to support
gridded Zarr data stored either locally or in S3. Datasets are defined in the
`nexusdatasets` Solr collection. SDAP will poll that collection (currently
hourly, on startup, and on execution of a dataset management query), attempt
to add/open any new datasets, and drop any datasets that are no longer
present. Datasets can still be defined in the manner they currently are;
`nexusproto` is the default backend and requires no additional data.
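As a rough sketch of the polling behavior described above (the Solr host, document field names, and reconciliation logic here are assumptions for illustration only, not this PR's exact implementation):
```python
# Minimal sketch of polling the 'nexusdatasets' Solr collection.
# Assumptions: Solr reachable at localhost:8983 and dataset documents
# carrying an 'id' field -- the real schema may differ.
import pysolr

solr = pysolr.Solr('http://localhost:8983/solr/nexusdatasets')

def poll_datasets(currently_open: set) -> tuple:
    """Return (datasets to add/open, datasets to drop)."""
    defined = {doc['id'] for doc in solr.search('*:*', rows=1000)}
    to_add = defined - currently_open   # new datasets to attempt to open
    to_drop = currently_open - defined  # datasets no longer present
    return to_add, to_drop
```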
## Adding Datasets
There are two ways to add new Zarr datasets: the 'hardcoded' approach through
the Collection Manager, or the dynamic approach through the dataset management
endpoints.
### Collection Manager
A Zarr collection can be specified in the collection config YAML file as
follows:
```yaml
collections:
  - id: dataset_name
    path: file:///path/to/zarr/root/
    projection: Grid
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variable: <data var>
    storeType: zarr
  - id: dataset_s3
    path: s3://bucket/key/
    projection: GridMulti
    priority: <number>
    dimensionNames:
      latitude: <latitude name>
      longitude: <longitude name>
      time: <time name>
      variables:
        - <data var>
        - <data var>
        - <data var>
    storeType: zarr
    config:
      aws:
        accessKeyID: <AWS access key ID>
        secretAccessKey: <AWS secret access key>
        public: false
```
These datasets are strictly hardcoded and can currently only be removed
by manually deleting the associated document from Solr; they cannot be deleted
or altered through the dataset management endpoints.
There is an [accompanying ingester
PR](https://github.com/apache/incubator-sdap-ingester/pull/86) to facilitate
this.
### Dataset Management Endpoints
Included is a set of endpoints to add, update, and remove Zarr datasets on
the fly.
#### Add Dataset
- Path: `/datasets/add`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to add
  - `path`: Path to the root of the Zarr group to add
- Body (content types: `application/json`, `application/yaml`):
```yaml
variable: <var>
coords:
  latitude: <lat name>
  longitude: <lon name>
  time: <time name>
aws: # required if in S3
  public: false
  accessKeyID: <AWS access key ID>
  secretAccessKey: <AWS secret access key>
  region: <AWS region>
```
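As a hypothetical usage example (the host, port, dataset name, path, and credentials below are placeholders):
```python
# Hypothetical example of adding a Zarr dataset in S3 via /datasets/add.
import requests

body = {
    'variable': '<var>',
    'coords': {'latitude': '<lat name>', 'longitude': '<lon name>', 'time': '<time name>'},
    'aws': {
        'public': False,
        'accessKeyID': '<AWS access key ID>',
        'secretAccessKey': '<AWS secret access key>',
        'region': '<AWS region>',
    },
}

response = requests.post(
    'http://localhost:8083/datasets/add',          # placeholder SDAP host/port
    params={'name': 'my_zarr_dataset', 'path': 's3://bucket/key/'},
    json=body,
)
print(response.status_code, response.text)
```
Per the content types above, an equivalent YAML body could be sent with the `application/yaml` content type instead.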
#### Update Dataset
- Path: `/datasets/update`
- Type: `POST`
- Params:
  - `name`: Name of the dataset to update
- Body (content types: `application/json`, `application/yaml`): same format as `/datasets/add`
#### Delete Dataset
- Path: `/datasets/remove`
- Type: `GET`
- Params:
  - `name`: Name of the dataset to delete
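A corresponding hypothetical removal call (same placeholder host and dataset name as above):
```python
# Hypothetical example of removing a dynamically added dataset.
import requests

response = requests.get(
    'http://localhost:8083/datasets/remove',       # placeholder SDAP host/port
    params={'name': 'my_zarr_dataset'},
)
print(response.status_code, response.text)
```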
## Testing
This PR will require extensive testing to ensure that (a) the added Zarr
backend fully, or at least mostly, supports all existing (previously
functioning) SDAP algorithms, and (b) nothing is broken when using the
`nexusproto` backend (i.e., all existing functionality is preserved). Ideally,
this would require little to no adaptation of the individual algorithm
implementations; however, it appears a number of them will require small
changes that _should_ not have any impact on `nexusproto` functionality.
### `nexusproto` Testing
Currently, the interface between the algorithms and the data backends routes
data requests to the `nexusproto` backend by default (i.e., if no target
dataset was given or could be determined). This may not end up being desirable
and may be removed depending on the discussion for this PR. With that in mind,
the net result of this defaulting is that this PR _should_ not break any
existing functionality. As a quick check, I ran a test suite for the endpoints
used by the CDMS project and found no endpoints failing or returning
inconsistent data.

Further tests will be conducted to verify that queries return the same results
when run against the same dataset in `nexusproto` and Zarr.
### `zarr` Testing
The following table lists the algorithms/endpoints that have been tested with
Zarr support. The 'Working' column indicates that the endpoint successfully
returns data; the 'Validated' column indicates that the returned data is
identical to that of the same query on the same dataset ingested to
`nexusproto`; the 'Alterations' column lists the alterations needed to get the
algorithm working (detailed below).
| Endpoint | Working | Validated | Alterations |
|---------------------------------|:---------:|:-----------:|-------------|
| `/datainbounds` | X | X | |
| `/cdmssubset` | X | X | e |
| `/timeSeriesSpark` | X | | c |
| `/latitudeTimeHofMoellerSpark` | X | | b,c |
| `/longitudeTimeHofMoellerSpark` | X | | b,c |
| `/timeAvgMapSpark` | X | | a |
| `/match_spark` | X | X | c |
| `/corrMapSpark` | X | | b,d |
| `/dailydifferenceaverage_spark` | X | | c |
| `/maxMinMapSpark` | X | | |
| `/climMapSpark` | X | | b |
| `/varianceSpark` | X | | b |
a. Dependent on @kevinmarlis's #259 -- now merged; no longer a concern
b. Dependent on changes similar to (a), outlined in #272
c. Modifications to some `NexusTileService` (NTS) calls (specifying the source dataset, etc.)
d. Bug fix unrelated to the backend changes
e. Dependent on #268
<hr>
The hope for this implementation was that it would integrate seamlessly with
the existing algorithms; however, it appears some algorithms will need to add
kwargs to certain `NexusTileService` calls. In particular, it is now imperative
that the target dataset name is given via the `dataset` or `ds` kwarg (if the
function definition specifies neither, as in `find_tile_by_id`, either kwarg
can be used). A minimal sketch of such calls follows.
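For illustration only (assuming an existing `NexusTileService` instance; the method `find_tiles_in_polygon`, its arguments, and the dataset name shown are illustrative rather than prescriptive):
```python
# Sketch of passing the target dataset to NexusTileService calls.
def fetch_tiles(tile_service, bounding_polygon, start, end):
    # The target dataset must now be named via the 'dataset' or 'ds' kwarg
    # so the request is routed to the correct backend.
    return tile_service.find_tiles_in_polygon(
        bounding_polygon,
        ds='my_zarr_dataset',
        start_time=start,
        end_time=end,
    )

def fetch_tile(tile_service, tile_id):
    # For functions whose definitions name neither kwarg explicitly
    # (as in find_tile_by_id), either spelling may be used.
    return tile_service.find_tile_by_id(tile_id, ds='my_zarr_dataset')
```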
<hr>
This was originally #265, which was closed automatically on merge & branch
delete, so I had to reopen it.