[ https://issues.apache.org/jira/browse/AIRAVATA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380830#comment-14380830 ]
Suresh Marru commented on AIRAVATA-1646:
----------------------------------------

Hi Doug,

Please see the responses embedded below:

Do we have access to the Apache Thrift data model currently in use by Airavata? If so, can we modify this model?
-- I consider this project exploratory, so yes, we could branch master and have you modify the Thrift data models. You can look at them here: https://github.com/apache/airavata/tree/master/airavata-api/thrift-interface-descriptions

What other object store technologies are you interested in (Cassandra and MongoDB)?
-- It would be premature to state a preference. The key thing here is to understand the problem well enough to recommend whether relational databases are a good fit, or whether key-value, column, document, or graph databases can better address Airavata's metadata needs.

How will the metadata be used? Depending on metadata usage, it can affect which technologies and which features of that specific technology we should enable.
-- This is a very open-ended question. I hope you can propose a project keeping in mind that you will need to explore this answer through interactions with the Airavata community.

What are some examples of metadata being stored? Is the data structured or unstructured?
-- Currently all the metadata is very structured. An example would be to look at the experiment model: a user requests an experiment, which gets executed on remote resources and transforms data in the process. The metadata captured also includes the states of simulation or data analysis tasks. Once you run sample experiments, this will become clearer. (A rough sketch of how such a record could look in a document store follows at the end of this comment.)

What kind of provenance data is being stored?
-- Currently very minimal to none. Basic information like user-provided metadata, resources used to compute, and job dimensions. A big missing piece is to collate the provenance of input data and augment the provenance of generated data with application details and simulation/analysis configurations.

What kind of queries would you expect to be run on the provenance data?
-- This will be very subjective to the data domain. An example could be: query for all radar assimilation data which have a quality score of 5. We could find more concrete pointers.

Do we need to look into Apache Storm for querying streaming data?
-- Not right away, but I could foresee some usage. For instance, if we have to run metadata extraction over all the archived data, I could see Storm helping to run such a topology. We could also employ a Storm cluster to shred deep data from all input requests. Again, we need to adapt to the use cases a bit here. (A rough topology sketch also follows at the end of this comment.)

Will we receive accounts on NSF XSEDE clusters for this project?
-- Yes, we could get you access to various clusters, including XSEDE, if absolutely needed by the project.
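To make the document-database direction a bit more concrete, here is a rough, non-authoritative sketch of how the metadata for one generated data product could be stored and queried with the MongoDB Java driver. The database name, collection name, and every field name below are made up for illustration and are not taken from the current Airavata registry schema; the query reuses the radar-assimilation / quality-score example from the provenance question above.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Arrays;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;

    /** Hypothetical sketch: keep one data product's metadata in a single document and query it. */
    public class ExperimentMetadataStoreSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> outputs =
                        client.getDatabase("airavata_catalog").getCollection("data_products");

                // One record per generated data product; all names here are invented for illustration.
                Document product = new Document("experimentId", "EXP_42")
                        .append("application", "WRF")
                        .append("type", "radar-assimilation")
                        .append("qualityScore", 5)
                        .append("computeResource", "stampede.tacc.xsede.org")
                        .append("inputs", Arrays.asList("namelist.input", "radar_obs.nc"));
                outputs.insertOne(product);

                // Provenance-style query: all radar assimilation data with a quality score of 5.
                for (Document d : outputs.find(and(eq("type", "radar-assimilation"), eq("qualityScore", 5)))) {
                    System.out.println(d.toJson());
                }
            }
        }
    }

The only point of the sketch is that a whole experiment or data-product record can live in one document instead of being shredded across relational tables, and that ad hoc provenance-style queries then fall out naturally. Whether that trade-off actually fits Airavata is exactly what the project should investigate.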
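Similarly, here is a rough sketch of the kind of Storm topology I had in mind for fanning metadata extraction across archived data. Everything in it is hypothetical (the spout, the bolt, the stream and component names), and it is written against the org.apache.storm 2.x API rather than anything Airavata currently uses; a real spout would read data-product URIs from the registry or a queue, and a real bolt would write extracted metadata into whatever catalogue we settle on.

    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    /** Hypothetical sketch: run metadata extraction over archived data products as a Storm topology. */
    public class MetadataExtractionTopology {

        /** Emits one tuple per archived data-product URI (hard-coded here for illustration). */
        static class ArchivedDataSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values("airavata://archive/experiment-123/output.nc"));
                Utils.sleep(1000); // throttle the toy spout
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("dataUri"));
            }
        }

        /** Stands in for real extraction; here it only logs the URI it would process. */
        static class MetadataExtractionBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple tuple) {
                String uri = tuple.getStringByField("dataUri");
                System.out.println("Extracting metadata from " + uri);
                collector.ack(tuple);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt; nothing is emitted downstream
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("archive-spout", new ArchivedDataSpout(), 1);
            builder.setBolt("extract-bolt", new MetadataExtractionBolt(), 4)
                   .shuffleGrouping("archive-spout");

            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("metadata-extraction", new Config(), builder.createTopology());
                Utils.sleep(10000); // let the local cluster run briefly, then shut down
            }
        }
    }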
> [GSoC] Brainstorm Airavata Data Management Needs
> ------------------------------------------------
>
>                 Key: AIRAVATA-1646
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-1646
>             Project: Airavata
>          Issue Type: Brainstorming
>            Reporter: Suresh Marru
>              Labels: gsoc, gsoc2015, mentor
>
> Currently Airavata focuses on Execution Management, and the Registry Sub-System (with app, resource and experiment catalogs) captures metadata about applications and executions. There have been a few efforts (primarily from student projects) to explore this void. It will be good to concretely propose data management solutions for input data registration, input and generated data retrieval, data transfers and replication management.
>
> Metadata Catalog: In addition, current metadata management is based on shredding Thrift data models into a MySQL/Derby schema. This is described in [1]. We have discussed extensively using object store databases, with the conclusion that the requirements need to be understood more systematically. A good standalone task would be to understand current metadata management and propose alternative solutions with proof-of-concept implementations. Once the community is convinced, we can then plan on implementing them in production.
> Provenance: Airavata could be enhanced to capture provenance to organize data for reuse, discovery, comparison and sharing. This is a well-explored field, and there might be compelling third-party solutions. In particular, it will be good to explore the big data space and identify leverages (either concepts, or even better, implementations).
> Auditing and Traceability: As Airavata mediates executions on behalf of gateways, it has to strike a balance between abstracting the compute resource interactions and providing a transparent execution trace. This will bloat the amount of data to be catalogued. A good effort would be to understand the current extent of Airavata audits and provide suggestions.
> BigData Leverage: Airavata needs to leverage the influx of tools in this space. Any suggestions on relevant tools which will enhance the Airavata experience will be a good fit.
> [1] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Data+Models+0.12
> [2] - http://markmail.org/thread/4lguliiktjohjmsd

-- This message was sent by Atlassian JIRA (v6.3.4#6332)