Hi,

Quick update on the dataset indexing work. I pushed a second commit that
adds the mdCATH adapter and restructures the project to be generic rather
than MD-specific, as we discussed in previous meetings.
What's new:
* mdCATH adapter that reads from HDF5 (or pre-extracted JSON), parses the
CATH domain hierarchy, and aggregates per-temperature/replica stats into
the same POSIX tree format as ATLAS (see the first sketch after this list)
* Refactored the lib layer so it doesn't know about specific datasets.
Adapters control what fields go where through callbacks instead of
hardcoded parameters like sequence_field; the first sketch after this
list shows the shape this takes
* Added a .meta/ directory convention (source.json for provenance,
stats.json for index metadata, schema.json for field definitions); the
second sketch below shows an example layout
* Base crawler class with static and incremental strategies (third sketch
below), ready for when we add datasets that update over time
* CI pipeline with pytest, ruff, and schema validation
* Go FUSE mount stub under mount/. Not functional yet, but it sets up the
module and documents how the .search/ virtual directory will work
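
To make the adapter/callback split concrete, here is a rough sketch of
the shape it now takes. All the names here (Indexer, path_for,
fields_for, mdcath_records) are invented for this email, and the mdCATH
HDF5 walk is a placeholder, so treat this as an illustration rather
than the actual repo code:

    import json
    from pathlib import Path
    from typing import Callable

    import h5py

    class Indexer:
        """Generic lib layer: knows nothing about any specific dataset."""

        def __init__(self, out_root: Path,
                     path_for: Callable[[dict], Path],
                     fields_for: Callable[[dict], dict]):
            self.out_root = out_root
            self.path_for = path_for      # adapter decides the POSIX layout
            self.fields_for = fields_for  # adapter decides which fields to emit

        def write(self, record: dict) -> None:
            entry_dir = self.out_root / self.path_for(record)
            entry_dir.mkdir(parents=True, exist_ok=True)
            for name, value in self.fields_for(record).items():
                (entry_dir / name).write_text(json.dumps(value))

    def mdcath_records(h5_path: Path):
        """Adapter side: walk an mdCATH HDF5 file and yield plain dicts."""
        with h5py.File(h5_path, "r") as f:
            for domain_id in f:                # top-level groups = domains
                yield {"domain": domain_id,
                       "cath": domain_id[:4],  # placeholder hierarchy parse
                       "n_replicas": len(f[domain_id])}

    # Example wiring: the adapter supplies layout and fields, the lib writes.
    idx = Indexer(Path("index/mdcath"),
                  path_for=lambda r: Path(r["cath"]) / r["domain"],
                  fields_for=lambda r: {"replicas": r["n_replicas"]})
    for rec in mdcath_records(Path("mdcath.h5")):
        idx.write(rec)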
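
On disk, the .meta/ convention could look something like what the
snippet below writes out. The file names match the list above; the
helper name and all the values are made-up placeholders:

    import json
    from pathlib import Path

    def write_meta(entry_dir: Path, source: dict, stats: dict, schema: dict):
        meta = entry_dir / ".meta"
        meta.mkdir(parents=True, exist_ok=True)
        (meta / "source.json").write_text(json.dumps(source, indent=2))  # provenance
        (meta / "stats.json").write_text(json.dumps(stats, indent=2))    # index metadata
        (meta / "schema.json").write_text(json.dumps(schema, indent=2))  # field definitions

    write_meta(Path("index/mdcath"),
               source={"dataset": "mdCATH", "url": "https://example.org"},  # placeholder
               stats={"entries": 1234},                                     # placeholder
               schema={"sequence": "string", "temperature_K": "int"})       # placeholder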
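
And a minimal sketch of the crawler split, again with invented class
names:

    from abc import ABC, abstractmethod

    class BaseCrawler(ABC):
        @abstractmethod
        def discover(self) -> list[str]:
            """Return the entry IDs that need (re)indexing."""

    class StaticCrawler(BaseCrawler):
        """One shot: index everything. Fits frozen releases like mdCATH."""
        def __init__(self, all_ids):
            self.all_ids = all_ids

        def discover(self) -> list[str]:
            return list(self.all_ids)

    class IncrementalCrawler(BaseCrawler):
        """Diff against what is already indexed, for datasets that grow."""
        def __init__(self, all_ids, indexed_ids):
            self.all_ids = set(all_ids)
            self.indexed_ids = set(indexed_ids)

        def discover(self) -> list[str]:
            return sorted(self.all_ids - self.indexed_ids)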
The goal is that any dataset, not just MD ones, can go through the same
pipeline and come out as a standard POSIX tree. That way, once we wire up
the FUSE mount and CyberShuttle, every dataset behaves identically
downstream.
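
As a toy example of what that buys us, reusing the hypothetical Indexer
from the first sketch above, a brand-new non-MD dataset only has to
supply its own record stream and two callbacks; nothing in the lib layer
changes:

    from pathlib import Path

    def toy_records():
        # Stand-in for any future adapter's record stream
        yield {"category": "demo", "id": "entry1", "seq": "MKV"}

    idx = Indexer(Path("index/toy"),
                  path_for=lambda r: Path(r["category"]) / r["id"],
                  fields_for=lambda r: {"sequence": r["seq"],
                                        "length": len(r["seq"])})
    for rec in toy_records():
        idx.write(rec)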
68 tests passing, linter is clean. Repo:
https://github.com/jayvenn21/gsoc-dataset-indexing
Let me know if the direction looks good or if you'd want anything changed
before I move on to GPCRmd/MemProtMD or the FUSE side.

Thanks,
Jayanth
