Re: Data Catalog Search API

Christie, Marcus Aaron Thu, 19 Jan 2023 09:41:55 -0800

Hi Amila,

Thanks for taking a look. No apologies necessary, I think I didn't cover enough 
background.

The term "data product" is borrowed from Airavata's replica catalog. [1]  In 
Airavata, a data product is typically a user provided input file or an 
application generated output file. There are two types of data products, "FILE" 
and "COLLECTION", and data products can contain other data products, roughly 
mapping to a POSIX directory structure, but not limited to such. For 
Cybershuttle, we also need input and output file registration, but there are 
other sources of data that need to be captured, such as instrument generated 
data.

A "data product" is distinct from its "replica locations", again borrowed from 
Airavata's replica catalog.  The data catalog only knows about the metadata of 
a dataset, not where and how it could be accessed.

> In relation to the "Data product", when we do "createMetadataSchema(..., name 
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?

No, "createDataProduct(DataProduct dataProduct)" is used to create a data 
product.  createMetadataSchema() creates a named reference to a schema for 
metadata. So in this example, "smilesdb" is a named metadata schema. A named 
metadata schema doesn't mean much until fields are added to it. Once fields are 
added to it, a data product can then be added to the metadata schema via 
addDataProductToMetadataSchema(). This indicates to the data catalog that a 
subset of this data product's metadata adheres to the "smilesdb" metadata 
schema. That is, for every field that is defined for the "smilesdb" metadata 
schema, this data product should have that field in its metadata.

An example may be helpful. Let's say we create a metadata schema named "zenodo" 
[2] and we add some fields to it:

- one called "zenodo_doi" with a JSON path of "$.zenodo.doi".
- one called "zenodo_published_date" with a JSON path of 
"$.zenodo.published-date".
- etc.

Then let's say we have some data products. Some of the data products have a 
zenodo field in their JSON metadata:

{
  ...,
  "zenodo": {
    "doi": "...",
    "published-date": "..."
  },
  ...
}

For such data products, we can call addDataProductToMetadataSchema() to add 
them to the "zenodo" metadata schema. Now, for the data products added to the 
"zenodo" metadata schema, the data catalog can support queries on the 
"zenodo_doi" and "zenodo_published_date" fields. We can also, for example, add 
an index on "zenodo_published_date" to support range queries.

A data product's metadata might adhere to more than one metadata schema. Or 
none at all.

> Also, I am curious to know what motivated you to keep the metadata as a json 
> (other than PG's json indexing) ?

Different scientific domains have their own domain specific metadata that they 
want to be able to store and query on. So we need a schemaless mechanism for 
storing such metadata.  For a relational database there are a couple of 
approaches to storing such metadata. One is a table holding key-value pairs; 
another is to store a JSON document with the values. The JSON approach has the 
advantage that the metadata need not be flat but can be hierarchical. We also 
want to support search over the data catalog's metadata, and there are many 
good JSON document-oriented search solutions, such as MongoDB, Solr, and 
Elasticsearch. And of course PostgreSQL's JSON support makes it possible to 
efficiently search over a JSON column there as well.
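
As a rough illustration of the difference between the two storage approaches, 
the Python sketch below flattens a hierarchical metadata document into the 
rows a key-value table would hold. The flatten() helper, the dotted key 
convention, and the sample values are assumptions made up for this example, 
not part of the design.

```python
from typing import Any, Dict, Iterator, Tuple


def flatten(doc: Dict[str, Any], prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (key, value) rows for a nested document, joining keys with dots."""
    for key, value in doc.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # A key-value table has no nesting, so recurse and encode it in the key.
            yield from flatten(value, path)
        else:
            yield path, value


# Placeholder metadata, made up for this example.
metadata = {
    "zenodo": {"doi": "10.0000/example.1", "published-date": "2022-06-01"},
    "instrument": "microscope",
}

rows = list(flatten(metadata))
for key, value in rows:
    print(key, "=", value)
```

With key-value rows the hierarchy survives only as a naming convention in the 
keys, whereas a JSON column keeps the document's structure as it was written.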

> Will be great if you could convert the design document into a google doc -- 
> it is easy to provide feedback in the google doc format.

That's funny because the attached PDF is an export from the google doc where I 
wrote the design document, so yeah it will be very easy to provide a google 
doc. I'll provide a link in a followup email. I only went with the approach of 
exporting it to PDF and discussing on the mailing list because that seems more 
in line with the Apache way of discussing things in the open on public mailing 
lists. If no one has a problem with it, we can move the feedback to the google 
doc.

> When i try to access 
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.

Sorry about that. Here's the URL: 
https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json

Thanks again for taking time to critique this design, Amila. I appreciate your 
feedback.

Thanks,

Marcus

[1] 
https://github.com/apache/airavata/blob/master/thrift-interface-descriptions/data-models/replica-catalog-models/replica_catalog_models.thrift#L58
[2] https://about.zenodo.org/

> On Jan 18, 2023, at 12:26 PM, Thejaka Amila J Kanewala 
> <[email protected]> wrote:
> 
> Hi Marcus,
> 
> Sorry for my lack of knowledge on this.
> 
> Just for my understanding, could you please define what a "data product" is ? 
> -- I can see that from the schema diagram it has an id and has a parent-child 
> relationship, but I would like to understand functionally what a "data 
> product" is.
> 
> In relation to the "Data product", when we do "createMetadataSchema(..., name 
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?
> Also, I am curious to know what motivated you to keep the metadata as a json 
> (other than PG's json indexing) ?
> 
> Cosmetic:
> Will be great if you could convert the design document into a google doc -- 
> it is easy to provide feedback in the google doc format.
> When i try to access 
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.
> 
> Thanks.
> Best Regards,
> Thejaka Amila Kanewala, PhD
> https://github.com/thejkane/agm
> http://valagamba.net/
> 
> 
> On Tue, Jan 17, 2023 at 9:42 AM Christie, Marcus Aaron <[email protected]> 
> wrote:
> Hi All,
> 
> I've attached a design document for the search API of the redesigned Data 
> Catalog and I'm looking for some feedback on it.
> 
> Some context: for the Cybershuttle project, we're creating a redesigned data 
> catalog to store metadata about directories and files that may come from 
> several sources: user provided, instrument generated, generated as output 
> from a computation. Metadata about these data products may also come from 
> several sources.
> 
> The high-level requirements are:
> 
> - support searching and filtering for data products using schemaless metadata
> - supports bursts of writes, for example when scanning and registering all of 
> the files in a directory
> - basic CRUD operations on data products, including bulk operations
> - capture parent/child relationships (i.e., directory, sub-directory, file 
> relationships) and allow querying based on these relationships
> - basic CRUD operations on data product's metadata
> 
> Most of the basic CRUD operations are omitted from the design document. The 
> design document focuses on the search and querying API.
> 
> This redesign builds on Airavata's Replica Catalog and the DRMS Resource 
> Service [1] in airavata-data-lake.
> 
> The main difference in this design is that it uses the built-in JSON querying 
> and indexing capabilities of PostgreSQL. The goal is to support whatever 
> metadata is available as long as it is in JSON format and make it efficiently 
> searchable and filterable. Also, to make the API more developer friendly, the 
> API supports querying via SQL (which will be transformed to the actual 
> backend query using Apache Calcite).
> 
> Your feedback is most welcome.
> 
> Sincerely,
> 
> Marcus
> 
> 
> 
> [1] 
> https://github.com/apache/airavata-data-lake/blob/master/data-resource-management-service/drms-stubs/src/main/proto/resource/DRMSResourceService.proto
>
