[jira] [Updated] (SDAP-472) General Zarr support for gridded datasets

Riley Kuttruff (Jira) Tue, 27 Jun 2023 12:57:06 -0700


     [ https://issues.apache.org/jira/browse/SDAP-472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Riley Kuttruff updated SDAP-472:
--------------------------------
    Description: 
The end goal is for SDAP to be able to onboard existing Zarr datasets with 
minimal to no interaction with the data (i.e., no scanning the data to 
generate metadata). Gridded formats allow for this, requiring only that some 
(additional) dataset-level metadata be recorded. Swath data will require a 
different and much more labor-intensive approach, so we should just focus on 
gridded data, as it will likely be more commonly used by our users. 

 

Collections should be specifiable in the collection config YAML. For now, we 
should support Zarr stores in an S3 bucket and on the local filesystem; 
however, we should leave the door open for other storage options (explicitly 
set in the collection config or determined by URL), essentially Zarr plugins 
we can add in the future: 

 
{code:yaml}
collections:   
  - id: zarr_example_ds_s3   # Zarr array in S3; need to give creds
    store-type: zarr
    path: s3://sdap-zarr-bucket/zarr_example_ds
    priority: 5
    forward-processing-priority: 5
    projection: Grid
    dimensionNames:       
      latitude: lat
      longitude: lon
      time: time
      variable: analysed_sst
    slices:       
      lat: 100
      lon: 100
      time: 1
    aws:       
      accessKeyID: <id>
      secretAccessKey: <secret>
      public: false
  - id: zarr_example_ds_local # Zarr array in local fs
    store-type: zarr 
    path: file:///data/zarr_example_ds_local
    priority: 5
    forward-processing-priority: 5
    projection: Grid
    dimensionNames: 
      latitude: lat
      longitude: lon
      time: time
      variable: analysed_sst
    slices:
      lat: 100
      lon: 100
      time: 1
  - id: AVHRR_OI_L4_GHRSST_NCEI # Standard ingest to tiles in Cassandra
    store-type: nexusproto 
    path: /data/granules/*.nc
    priority: 10
    forward-processing-priority: 10
    projection: Grid
    dimensionNames: 
      latitude: lat
      longitude: lon
      time: time
      variable: analysed_sst
    slices:      
      lat: 100
      lon: 100
      time: 1{code}
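
To illustrate the "determined by URL" idea, here is a minimal sketch (not the SDAP implementation) of how a collection entry could be routed to a storage backend by the scheme of its path. Everything named here is hypothetical and only for illustration: the STORE_BACKENDS registry, register_backend, resolve_store, and the shape of the returned dict. The collection entries are assumed to already be parsed from the YAML into plain dicts.

```python
from urllib.parse import urlparse

# Hypothetical plugin registry: URL scheme -> handler function.
# New storage options would be added by registering another scheme.
STORE_BACKENDS = {}


def register_backend(scheme):
    """Decorator that registers a handler for one URL scheme."""
    def decorator(func):
        STORE_BACKENDS[scheme] = func
        return func
    return decorator


@register_backend("s3")
def describe_s3(collection, parsed):
    # Assumption: 'public: true' means the bucket can be read anonymously,
    # otherwise the keys under 'aws' would be used.
    aws = collection.get("aws", {})
    return {
        "backend": "s3",
        "bucket": parsed.netloc,
        "key": parsed.path.lstrip("/"),
        "anonymous": bool(aws.get("public", False)),
    }


@register_backend("file")
def describe_local(collection, parsed):
    return {"backend": "local", "root": parsed.path}


def resolve_store(collection):
    """Pick a backend for a Zarr collection based on its path's URL scheme."""
    if collection.get("store-type") != "zarr":
        raise ValueError("not a zarr collection: %r" % collection.get("id"))
    parsed = urlparse(collection["path"])
    handler = STORE_BACKENDS.get(parsed.scheme)
    if handler is None:
        raise ValueError("no Zarr backend registered for scheme %r" % parsed.scheme)
    return handler(collection, parsed)


if __name__ == "__main__":
    example = {
        "id": "zarr_example_ds_s3",
        "store-type": "zarr",
        "path": "s3://sdap-zarr-bucket/zarr_example_ds",
        "aws": {"public": False},
    }
    print(resolve_store(example))
```

With this shape, supporting another storage option would only mean registering one more handler; an explicit setting in the collection config could override the scheme lookup if we want that as well.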
 

  was:
The end goal is for SDAP to be able to onboard existing Zarr datasets with 
minimal to no interaction with the data (i.e., no scanning the data to 
generate metadata). Gridded formats allow for this, requiring only that some 
(additional) dataset-level metadata be recorded. Swath data will require a 
different and much more labor-intensive approach, so we should just focus on 
gridded data, as it will likely be more commonly used by our users. 

 

Collections should be specifiable in the collection config YAML. For now, we 
should support Zarr stores in an S3 bucket and on the local filesystem; 
however, we should leave the door open for other storage options (explicitly 
set in the collection config or determined by URL), essentially Zarr plugins 
we can add in the future: 

 
{code:java}
collections:
  - id: zarr_example_ds_s3   # Zarr array in S3; need to give creds
    store-type: zarr
    path: s3://sdap-zarr-bucket/zarr_example_ds
    priority: 5
    forward-processing-priority: 5
    projection: Grid
    dimensionNames:
      latitude: lat
      longitude: lon
      time: time
      variable: analysed_sst
    slices:
      lat: 100
      lon: 100
      time: 1
    aws:
      accessKeyID: <id>
      secretAccessKey: <id>
      public: falsecollections:
 - id: zarr_example_ds_local # Zarr array in local fs
   store-type: zarr 
   path: file:///data/zarr_example_ds_local
   priority: 5
   forward-processing-priority: 5
   projection: Grid
   dimensionNames:
     latitude: lat
     longitude: lon
     time: time
     variable: analysed_sst
   slices:
     lat: 100
     lon: 100
     time: 1
  - id: AVHRR_OI_L4_GHRSST_NCEI # Standard ingest to tiles in Cassandra
    store-type: nexusproto 
    path: /data/granules/*.nc
    priority: 10
    forward-processing-priority: 10
    projection: Grid
    dimensionNames:
      latitude: lat
      longitude: lon
      time: time
      variable: analysed_sst
    slices:
      lat: 100
      lon: 100
      time: 1{code}
 


> General Zarr support for gridded datasets
> -----------------------------------------
>
>                 Key: SDAP-472
>                 URL: https://issues.apache.org/jira/browse/SDAP-472
>             Project: Apache Science Data Analytics Platform
>          Issue Type: New Feature
>          Components: analysis, collection-ingester
>            Reporter: Riley Kuttruff
>            Assignee: Riley Kuttruff
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
