Hi Amila,

Thanks for taking a look. No apologies necessary; I don't think I covered enough background.
The term "data product" is borrowed from Airavata's replica catalog. [1] In
Airavata, a data product is typically a user provided input file or an
application generated output file. There are two types of data products, "FILE"
and "COLLECTION", and data products can contain other data products, roughly
mapping to a POSIX directory structure, but not limited to such. For
Cybershuttle, we also need input and output file registration, but there are
other sources of data that need to be captured, such as instrument generated
data.
A "data product" is distinct from its "replica locations", again borrowed from
Airavata's replica catalog. The data catalog only knows about the metadata of
a dataset, not where and how it could be accessed.
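To make that separation concrete, here is a rough Python sketch. It is not the actual Thrift model: the class names, the field names (product_id, parent_id, uri) and the example URIs are all made up for illustration. The point is only that the data product record carries metadata and parent/child structure, while replica locations are separate records that point back to it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class DataProductType(Enum):
    FILE = "FILE"
    COLLECTION = "COLLECTION"


@dataclass
class DataProduct:
    # Hypothetical fields: the catalog record holds metadata only, no access info.
    product_id: str
    name: str
    product_type: DataProductType
    parent_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


@dataclass
class ReplicaLocation:
    # Hypothetical record: where one copy of a data product can be accessed.
    product_id: str
    uri: str


def children_of(parent, products):
    """Return the data products whose parent is the given data product."""
    return [p for p in products if p.parent_id == parent.product_id]


collection = DataProduct("dp-1", "run-42", DataProductType.COLLECTION)
output = DataProduct("dp-2", "output.json", DataProductType.FILE,
                     parent_id=collection.product_id)

# Two replicas of the same data product (made-up URIs).
replicas = [
    ReplicaLocation(output.product_id, "file:///scratch/run-42/output.json"),
    ReplicaLocation(output.product_id, "s3://example-bucket/run-42/output.json"),
]
```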
> In relation to the "Data product", when we do "createMetadataSchema(..., name
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?
No, "createDataProduct(DataProduct dataProduct)" is used to create a data
product. createMetadataSchema() creates a named reference to a schema for
metadata. So in this example, "smilesdb" is a named metadata schema. A named
metadata schema doesn't mean much until fields are added to it. Once fields are
added to it, a data product can then be added to the metadata schema via
addDataProductToMetadataSchema(). This indicates to the data catalog that a
subset of this data product's metadata adheres to the "smilesdb" metadata
schema. That is, for every field that is defined for the "smilesdb" metadata
schema, this data product should have that field in its metadata.
An example may be helpful. Let's say we create a metadata schema named "zenodo"
[2] and add some fields to it:
- one called "zenodo_doi" with a JSON path of "$.zenodo.doi"
- one called "zenodo_published_date" with a JSON path of
  "$.zenodo.published-date"
- etc.
Then let's say we have some data products. Some of the data products have a
zenodo field in their JSON metadata:
{
  ...,
  "zenodo": {
    "doi": "...",
    "published-date": "..."
  },
  ...
}
For such data products, we can call addDataProductToMetadataSchema() to add
them to the "zenodo" metadata schema. Now, for the data products added to the
"zenodo" metadata schema, the data catalog can support queries on the
"zenodo_doi" and "zenodo_published_date" fields. We can also, for example, add
an index on "zenodo_published_date" to support range queries.
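The semantics described above can be sketched as a small in-memory toy in Python. This is an illustration under assumptions, not the proposed API: the ToyCatalog class, its method signatures, the add_field and query helpers, and the sample DOIs and dates are all hypothetical. The resolve() helper only handles plain dotted paths like "$.zenodo.doi", not full JSONPath, and rejecting a data product that lacks a schema field is my assumption about how "should have that field" might be enforced.

```python
from typing import Any, Optional


def resolve(path: str, doc: dict) -> Optional[Any]:
    """Resolve a simple dotted path such as "$.zenodo.doi" (no wildcards or filters)."""
    current: Any = doc
    for key in path.split(".")[1:]:  # skip the leading "$"
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


class ToyCatalog:
    """Hypothetical in-memory stand-in for the data catalog."""

    def __init__(self):
        self.products = {}  # product id -> JSON metadata (dict)
        self.schemas = {}   # schema name -> {field name: JSON path}
        self.members = {}   # schema name -> set of product ids

    def create_metadata_schema(self, name):
        self.schemas[name] = {}
        self.members[name] = set()

    def add_field(self, schema, field_name, json_path):
        self.schemas[schema][field_name] = json_path

    def create_data_product(self, product_id, metadata):
        self.products[product_id] = metadata

    def add_data_product_to_metadata_schema(self, product_id, schema):
        metadata = self.products[product_id]
        missing = [name for name, path in self.schemas[schema].items()
                   if resolve(path, metadata) is None]
        if missing:  # assumption: reject products that lack a schema field
            raise ValueError(f"{product_id} is missing fields: {missing}")
        self.members[schema].add(product_id)

    def query(self, schema, field_name, predicate):
        """Return sorted ids of schema members whose field value satisfies predicate."""
        path = self.schemas[schema][field_name]
        return sorted(pid for pid in self.members[schema]
                      if predicate(resolve(path, self.products[pid])))


catalog = ToyCatalog()
catalog.create_metadata_schema("zenodo")
catalog.add_field("zenodo", "zenodo_doi", "$.zenodo.doi")
catalog.add_field("zenodo", "zenodo_published_date", "$.zenodo.published-date")

# Made-up sample metadata; dp-3 has no "zenodo" field at all.
catalog.create_data_product(
    "dp-1", {"zenodo": {"doi": "10.0000/example-1", "published-date": "2022-06-01"}})
catalog.create_data_product(
    "dp-2", {"zenodo": {"doi": "10.0000/example-2", "published-date": "2023-01-10"}})
catalog.create_data_product("dp-3", {"instrument": "microscope"})

catalog.add_data_product_to_metadata_schema("dp-1", "zenodo")
catalog.add_data_product_to_metadata_schema("dp-2", "zenodo")

# Range query on the named field; ISO dates compare correctly as strings.
recent = catalog.query("zenodo", "zenodo_published_date", lambda d: d >= "2023-01-01")
```

In this toy, only data products that were added to the "zenodo" schema are visible to queries on its fields, which mirrors the idea that schema membership is what makes those fields queryable.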
A data product's metadata might adhere to more than one metadata schema. Or
none at all.
> Also, I am curious to know what motivated you to keep the metadata as a json
> (other than PG's json indexing) ?
Different scientific domains have their own domain specific metadata that they
want to be able to store and query on. So we need a schemaless mechanism for
storing such metadata. For a relational database there are a couple of
approaches to storing such metadata. One is a table that holds key-value
pairs; another is to store a JSON document with the values.
The JSON approach has the advantage that the metadata need not be flat but can
be hierarchical. We also want to support search over the data catalog's
metadata, and there are many good JSON document-oriented search solutions, such
as MongoDB, Solr, and Elasticsearch. PostgreSQL's JSON support also makes it
possible to search efficiently over a JSON column.
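As a rough illustration of the trade-off, here is a short Python sketch of what the key-value approach would have to do with hierarchical metadata. The flatten() helper, the "atoms" field, and the sample values are made up for this example; the dotted key encoding is just one possible convention, not something from the design document.

```python
def flatten(metadata, prefix=""):
    """Flatten nested metadata into (key, value) rows, as a key-value table would need."""
    rows = []
    for key, value in metadata.items():
        full_key = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            rows.extend(flatten(value, full_key))
        else:
            # A generic key-value table typically stores every value as text.
            rows.append((full_key, str(value)))
    return rows


# Made-up hierarchical metadata for one data product.
metadata = {
    "zenodo": {"doi": "10.0000/example", "published-date": "2023-01-10"},
    "atoms": 12,
}

rows = flatten(metadata)
# rows == [("zenodo.doi", "10.0000/example"),
#          ("zenodo.published-date", "2023-01-10"),
#          ("atoms", "12")]
```

With key-value rows, the hierarchy has to be encoded into the key names and the value types are lost, whereas the JSON document keeps both as they are.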
> Will be great if you could convert the design document into a google doc --
> it is easy to provide feedback in the google doc format.
That's funny because the attached PDF is an export from the google doc where I
wrote the design document, so yeah it will be very easy to provide a google
doc. I'll provide a link in a followup email. I only went with the approach of
exporting it to PDF and discussing on the mailing list because that seems more
in line with the Apache way of discussing things in the open on public mailing
lists. If no one has a problem with it, we can move the feedback to the google
doc.
> When i try to access
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.
Sorry about that. Here's the URL:
https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json
Thanks again for taking time to critique this design, Amila. I appreciate your
feedback.
Thanks,
Marcus
[1]
https://github.com/apache/airavata/blob/master/thrift-interface-descriptions/data-models/replica-catalog-models/replica_catalog_models.thrift#L58
[2] https://about.zenodo.org/
> On Jan 18, 2023, at 12:26 PM, Thejaka Amila J Kanewala
> <[email protected]> wrote:
>
> Hi Marcus,
>
> Sorry for my lack of knowledge on this.
>
> Just for my understanding, could you please define what a "data product" is ?
> -- I can see that from the schema diagram it has an id and has a parent-child
> relationship, but I would like to understand functionally what a "data
> product" is.
>
> In relation to the "Data product", when we do "createMetadataSchema(..., name
> = "smilesdb")" -- does this create a "data product" named "smilesdb" ?
> Also, I am curious to know what motivated you to keep the metadata as a json
> (other than PG's json indexing) ?
>
> Cosmetic:
> Will be great if you could convert the design document into a google doc --
> it is easy to provide feedback in the google doc format.
> When i try to access
> "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migratio
> ns/data/molecule.json", i get a 404 error.
>
> Thanks.
> Best Regards,
> Thejaka Amila Kanewala, PhD
> https://github.com/thejkane/agm
> http://valagamba.net/
>
>
> On Tue, Jan 17, 2023 at 9:42 AM Christie, Marcus Aaron <[email protected]>
> wrote:
> Hi All,
>
> I've attached a design document for the search API of the redesigned Data
> Catalog and I'm looking for some feedback on it.
>
> Some context: for the Cybershuttle project, we're creating a redesigned data
> catalog to store metadata about directories and files that may come from
> several sources: user provided, instrument generated, generated as output
> from a computation. Metadata about these data products may also come from
> several sources.
>
> The high-level requirements are:
>
> - support searching and filtering for data products using schemaless metadata
> - supports bursts of writes, for example when scanning and registering all of
> the files in a directory
> - basic CRUD operations on data products, including bulk operations
> - capture parent/child relationships (i.e., directory, sub-directory, file
> relationships) and allow querying based on these relationships
> - basic CRUD operations on data product's metadata
>
> Most of the basic CRUD operations are omitted from the design document. The
> design document focuses on the search and querying API.
>
> This redesign builds on Airavata's Replica Catalog and the DRMS Resource
> Service [1] in airavata-data-lake.
>
> The main difference in this design is that it uses the built-in JSON querying
> and indexing capabilities of PostgreSQL. The goal is to support whatever
> metadata is available as long as it is in JSON format and make it efficiently
> searchable and filterable. Also, to make the API more developer friendly, the
> API supports querying via SQL (which will be transformed to the actual
> backend query using Apache Calcite).
>
> Your feedback is most welcome.
>
> Sincerely,
>
> Marcus
>
>
>
> [1]
> https://github.com/apache/airavata-data-lake/blob/master/data-resource-management-service/drms-stubs/src/main/proto/resource/DRMSResourceService.proto
>
