Re: Data Catalog Search API

Thejaka Amila J Kanewala Wed, 18 Jan 2023 09:27:33 -0800

Hi Marcus,

Sorry for my lack of knowledge on this.

Just for my understanding, could you please define what a "data product"
is? I can see from the schema diagram that it has an id and a parent-child
relationship, but I would like to understand what a "data product" is
functionally.

Regarding the "data product": when we call "createMetadataSchema(...,
name = "smilesdb")", does this create a "data product" named "smilesdb"?
Also, I am curious to know what motivated you to keep the metadata as
JSON (other than PG's JSON indexing)?

Cosmetic:
It would be great if you could convert the design document into a Google
Doc; it is easy to provide feedback in the Google Doc format.
When I try to access
"https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json",
I get a 404 error.

Thanks.
Best Regards,
Thejaka Amila Kanewala, PhD
https://github.com/thejkane/agm
http://valagamba.net/

On Tue, Jan 17, 2023 at 9:42 AM Christie, Marcus Aaron <[email protected]>
wrote:

> Hi All,
>
> I've attached a design document for the search API of the redesigned Data
> Catalog and I'm looking for some feedback on it.
>
> Some context: for the Cybershuttle project, we're creating a redesigned
> data catalog to store metadata about directories and files that may come
> from several sources: user provided, instrument generated, generated as
> output from a computation. Metadata about these data products may also come
> from several sources.
>
> The high-level requirements are:
>
> - support searching and filtering for data products using schemaless
> metadata
> - support bursts of writes, for example when scanning and registering all
> of the files in a directory
> - basic CRUD operations on data products, including bulk operations
> - capture parent/child relationships (i.e., directory, sub-directory, file
> relationships) and allow querying based on these relationships
> - basic CRUD operations on data products' metadata
>
> Most of the basic CRUD operations are omitted from the design document.
> The design document focuses on the search and querying API.
>
> This redesign builds on Airavata's Replica Catalog and the DRMS Resource
> Service [1] in airavata-data-lake.
>
> The main difference in this design is that it uses the built-in JSON
> querying and indexing capabilities of PostgreSQL. The goal is to support
> whatever metadata is available as long as it is in JSON format and make it
> efficiently searchable and filterable. Also, to make the API more developer
> friendly, the API supports querying via SQL (which will be transformed to
> the actual backend query using Apache Calcite).
>
> Your feedback is most welcome.
>
> Sincerely,
>
> Marcus
>
>
>
> [1]
> https://github.com/apache/airavata-data-lake/blob/master/data-resource-management-service/drms-stubs/src/main/proto/resource/DRMSResourceService.proto
>
>
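
For readers following the thread, here is a minimal in-memory sketch of the two
query styles the requirements mention: filtering data products by schemaless
JSON metadata, and querying by parent/child relationship. It is not the design
document's API. The DataProduct shape and the contains, search, and descendants
helpers are hypothetical names made up for illustration, and the containment
check is only loosely modelled on PostgreSQL's jsonb @> operator.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DataProduct:
    # Hypothetical record shape: an id, an optional parent, free-form JSON metadata.
    id: str
    name: str
    parent_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def contains(doc: Any, pattern: Any) -> bool:
    """Simplified JSON containment (assumption: approximates jsonb @>)."""
    if isinstance(pattern, dict):
        # Every key in the pattern must exist in the document and match recursively.
        return isinstance(doc, dict) and all(
            key in doc and contains(doc[key], value) for key, value in pattern.items()
        )
    if isinstance(pattern, list):
        # Every wanted element must be contained in some element of the document list.
        return isinstance(doc, list) and all(
            any(contains(item, wanted) for item in doc) for wanted in pattern
        )
    return doc == pattern


def search(products, pattern):
    """Return the products whose metadata contains the given JSON pattern."""
    return [p for p in products if contains(p.metadata, pattern)]


def descendants(products, root_id):
    """Return all products below root_id in the parent/child hierarchy."""
    children = {}
    for p in products:
        children.setdefault(p.parent_id, []).append(p)
    found, stack = [], [root_id]
    while stack:
        for child in children.get(stack.pop(), []):
            found.append(child)
            stack.append(child.id)
    return found


# Illustrative data only: a directory, a sub-directory, and two files.
catalog = [
    DataProduct("d1", "run-01"),
    DataProduct("d2", "raw", parent_id="d1"),
    DataProduct("f1", "a.json", parent_id="d2",
                metadata={"format": "json", "tags": ["smiles"]}),
    DataProduct("f2", "b.csv", parent_id="d1", metadata={"format": "csv"}),
]

json_files = search(catalog, {"format": "json"})
under_run = descendants(catalog, "d1")
```

In a PostgreSQL-backed version, the containment filter would presumably be a
jsonb query backed by an index rather than a Python loop, and the hierarchy
walk would be a recursive query; both are assumptions, not statements about
the actual design.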
