Hi Marcus,

Sorry for my lack of knowledge on this.

Just for my understanding, could you please define what a "data product" is? I can see from the schema diagram that it has an id and a parent-child relationship, but I would like to understand functionally what a "data product" is.

In relation to the "data product": when we do createMetadataSchema(..., name = "smilesdb"), does this create a "data product" named "smilesdb"?

Also, I am curious to know what motivated you to keep the metadata as JSON (other than PG's JSON indexing)?

Cosmetic: It would be great if you could convert the design document into a Google Doc; it is easier to provide feedback in that format.

When I try to access "https://raw.githubusercontent.com/apache/airavata-sandbox/master/gsoc2022/smilesdb/Migrations/data/molecule.json", I get a 404 error.

Thanks.

Best Regards,
Thejaka Amila Kanewala, PhD
https://github.com/thejkane/agm
http://valagamba.net/

On Tue, Jan 17, 2023 at 9:42 AM Christie, Marcus Aaron <machr...@iu.edu> wrote:

> Hi All,
>
> I've attached a design document for the search API of the redesigned Data
> Catalog and I'm looking for some feedback on it.
>
> Some context: for the Cybershuttle project, we're creating a redesigned
> data catalog to store metadata about directories and files that may come
> from several sources: user provided, instrument generated, generated as
> output from a computation. Metadata about these data products may also come
> from several sources.
>
> The high-level requirements are:
>
> - support searching and filtering for data products using schemaless
>   metadata
> - supports bursts of writes, for example when scanning and registering all
>   of the files in a directory
> - basic CRUD operations on data products, including bulk operations
> - capture parent/child relationships (i.e., directory, sub-directory, file
>   relationships) and allow querying based on these relationships
> - basic CRUD operations on data product's metadata
>
> Most of the basic CRUD operations are omitted from the design document.
> The design document focuses on the search and querying API.
>
> This redesign builds on Airavata's Replica Catalog and the DRMS Resource
> Service [1] in airavata-data-lake.
>
> The main difference in this design is that it uses the built-in JSON
> querying and indexing capabilities of PostgreSQL. The goal is to support
> whatever metadata is available as long as it is in JSON format and make it
> efficiently searchable and filterable. Also, to make the API more developer
> friendly, the API supports querying via SQL (which will be transformed to
> the actual backend query using Apache Calcite).
>
> Your feedback is most welcome.
>
> Sincerely,
>
> Marcus
>
> [1]
> https://github.com/apache/airavata-data-lake/blob/master/data-resource-management-service/drms-stubs/src/main/proto/resource/DRMSResourceService.proto
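
For illustration only, the quoted requirements (schemaless metadata filtering plus parent/child queries) can be read as the small in-memory Python sketch below. Only the id and the parent/child link come from the thread; the DataProduct and InMemoryCatalog names, the name and metadata fields, and the children_of and search methods are all hypothetical. It is a stand-in for discussion, not the proposed PostgreSQL/Calcite design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class DataProduct:
    # Hypothetical shape: only "id" and the parent link are mentioned in the thread.
    id: str
    name: str
    parent_id: Optional[str] = None  # None for a top-level directory or file
    metadata: Dict[str, Any] = field(default_factory=dict)  # free-form, JSON-like


class InMemoryCatalog:
    """Toy catalog (assumption): a dict stands in for the real backend."""

    def __init__(self) -> None:
        self._products: Dict[str, DataProduct] = {}

    def add(self, product: DataProduct) -> None:
        self._products[product.id] = product

    def children_of(self, parent_id: str) -> List[DataProduct]:
        # Parent/child query: direct children only, in insertion order.
        return [p for p in self._products.values() if p.parent_id == parent_id]

    def search(self, **criteria: Any) -> List[DataProduct]:
        # Schemaless filter: keep products whose metadata contains every
        # requested key with an equal value; no schema is declared up front.
        return [
            p
            for p in self._products.values()
            if all(k in p.metadata and p.metadata[k] == v for k, v in criteria.items())
        ]


# Example data (made up): one directory with two files.
catalog = InMemoryCatalog()
catalog.add(DataProduct("d1", "run-42"))
catalog.add(DataProduct("f1", "molecule.json", parent_id="d1",
                        metadata={"format": "json", "source": "instrument"}))
catalog.add(DataProduct("f2", "notes.txt", parent_id="d1",
                        metadata={"source": "user"}))

user_files = catalog.search(source="user")   # matches f2 only
run_contents = catalog.children_of("d1")     # f1 and f2
```

In this reading, a directory and a file are both data products, distinguished only by their position in the parent/child tree and by whatever metadata happens to be attached to them.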