[ https://issues.apache.org/jira/browse/AIRAVATA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380830#comment-14380830 ]
Suresh Marru commented on AIRAVATA-1646:
----------------------------------------

Hi Doug,

Please see the responses embedded below:

Do we have access to the Apache Thrift data model currently in use by Airavata? If so, can we modify this model?
-- I consider this project exploratory, so yes, we could branch master and have you modify the Thrift data models. You can look at them here: https://github.com/apache/airavata/tree/master/airavata-api/thrift-interface-descriptions

What other object store technologies are you interested in (Cassandra and MongoDB)?
-- It would be premature to state a preference. The key thing here is to understand the problem well enough to recommend whether relational databases are a good fit, or whether key-value, column, document, or graph databases can better address Airavata's metadata needs.

How will the metadata be used? Depending on metadata usage, it can affect which technologies and which features of that specific technology we should enable.
-- This is a very open-ended question. I hope you can propose a project keeping in mind that you will need to explore this answer through interactions with the Airavata community.

What are some examples of metadata being stored? Is the data structured or unstructured?
-- Currently all the metadata is very structured. An example would be to look at the experiment model: a user requests an experiment, which gets executed on remote resources and transforms data in the process. The metadata captured also includes the states of simulation or data analysis tasks. Once you run sample experiments, this will become clearer. (A rough sketch of how such a record could look in a document store follows at the end of this comment.)

What kind of provenance data is being stored?
-- Currently very minimal to none. Basic information like user-provided metadata, resources used to compute, and job dimensions. A big missing piece is to collate the provenance of input data and augment the provenance of generated data with application details and simulation/analysis configurations.

What kind of queries would you expect to be run on the provenance data?
-- This will be very subjective to the data domain. An example could be: query for all radar assimilation data which have a quality score of 5. We could find more concrete pointers.

Do we need to look into Apache Storm for querying streaming data?
-- Not right away, but I could foresee some usage. For instance, if we have to run metadata extraction over all the archived data, I could see Storm helping to run such a topology. We could also employ a Storm cluster to shred deep data from all input requests. Again, we need to adapt to the use cases a bit here. (A rough topology sketch also follows at the end of this comment.)

Will we receive accounts on NSF XSEDE clusters for this project?
-- Yes, we could get you access to various clusters, including XSEDE, if absolutely needed by the project.
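To make the document-database direction a bit more concrete, here is a rough, non-authoritative sketch of how the metadata for one generated data product could be stored and queried with the MongoDB Java driver. The database name, collection name, and every field name below are made up for illustration and are not taken from the current Airavata registry schema; the query reuses the radar-assimilation / quality-score example from the provenance question above.

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import org.bson.Document;

    import java.util.Arrays;

    import static com.mongodb.client.model.Filters.and;
    import static com.mongodb.client.model.Filters.eq;

    /** Hypothetical sketch: keep one data product's metadata in a single document and query it. */
    public class ExperimentMetadataStoreSketch {
        public static void main(String[] args) {
            try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
                MongoCollection<Document> outputs =
                        client.getDatabase("airavata_catalog").getCollection("data_products");

                // One record per generated data product; all names here are invented for illustration.
                Document product = new Document("experimentId", "EXP_42")
                        .append("application", "WRF")
                        .append("type", "radar-assimilation")
                        .append("qualityScore", 5)
                        .append("computeResource", "stampede.tacc.xsede.org")
                        .append("inputs", Arrays.asList("namelist.input", "radar_obs.nc"));
                outputs.insertOne(product);

                // Provenance-style query: all radar assimilation data with a quality score of 5.
                for (Document d : outputs.find(and(eq("type", "radar-assimilation"), eq("qualityScore", 5)))) {
                    System.out.println(d.toJson());
                }
            }
        }
    }

The only point of the sketch is that a whole experiment or data-product record can live in one document instead of being shredded across relational tables, and that ad hoc provenance-style queries then fall out naturally. Whether that trade-off actually fits Airavata is exactly what the project should investigate.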
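Similarly, here is a rough sketch of the kind of Storm topology I had in mind for fanning metadata extraction across archived data. Everything in it is hypothetical (the spout, the bolt, the stream and component names), and it is written against the org.apache.storm 2.x API rather than anything Airavata currently uses; a real spout would read data-product URIs from the registry or a queue, and a real bolt would write extracted metadata into whatever catalogue we settle on.

    import java.util.Map;

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.spout.SpoutOutputCollector;
    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.topology.base.BaseRichSpout;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Tuple;
    import org.apache.storm.tuple.Values;
    import org.apache.storm.utils.Utils;

    /** Hypothetical sketch: run metadata extraction over archived data products as a Storm topology. */
    public class MetadataExtractionTopology {

        /** Emits one tuple per archived data-product URI (hard-coded here for illustration). */
        static class ArchivedDataSpout extends BaseRichSpout {
            private SpoutOutputCollector collector;

            @Override
            public void open(Map<String, Object> conf, TopologyContext context, SpoutOutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void nextTuple() {
                collector.emit(new Values("airavata://archive/experiment-123/output.nc"));
                Utils.sleep(1000); // throttle the toy spout
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                declarer.declare(new Fields("dataUri"));
            }
        }

        /** Stands in for real extraction; here it only logs the URI it would process. */
        static class MetadataExtractionBolt extends BaseRichBolt {
            private OutputCollector collector;

            @Override
            public void prepare(Map<String, Object> conf, TopologyContext context, OutputCollector collector) {
                this.collector = collector;
            }

            @Override
            public void execute(Tuple tuple) {
                String uri = tuple.getStringByField("dataUri");
                System.out.println("Extracting metadata from " + uri);
                collector.ack(tuple);
            }

            @Override
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // terminal bolt; nothing is emitted downstream
            }
        }

        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("archive-spout", new ArchivedDataSpout(), 1);
            builder.setBolt("extract-bolt", new MetadataExtractionBolt(), 4)
                   .shuffleGrouping("archive-spout");

            try (LocalCluster cluster = new LocalCluster()) {
                cluster.submitTopology("metadata-extraction", new Config(), builder.createTopology());
                Utils.sleep(10000); // let the local cluster run briefly, then shut down
            }
        }
    }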
> [GSoC] Brainstorm Airavata Data Management Needs
> ------------------------------------------------
>
>                 Key: AIRAVATA-1646
>                 URL: https://issues.apache.org/jira/browse/AIRAVATA-1646
>             Project: Airavata
>          Issue Type: Brainstorming
>            Reporter: Suresh Marru
>              Labels: gsoc, gsoc2015, mentor
>
> Currently Airavata focuses on Execution Management, and the Registry Sub-System (with app, resource and experiment catalogs) captures metadata about applications and executions. There have been a few efforts (primarily from student projects) to explore this void. It will be good to concretely propose data management solutions for input data registration, input and generated data retrieval, data transfers and replication management.
>
> Metadata Catalog: In addition, current metadata management is based on shredding Thrift data models into a MySQL/Derby schema. This is described in [1]. We have discussed extensively using object store databases, with the conclusion that the requirements need to be understood more systematically. A good standalone task would be to understand current metadata management and propose alternative solutions with proof-of-concept implementations. Once the community is convinced, we can then plan on implementing them in production.
> Provenance: Airavata could be enhanced to capture provenance to organize data for reuse, discovery, comparison and sharing. This is a well-explored field, and there might be compelling third-party solutions. In particular, it will be good to explore the big data space and identify leverages (either concepts, or even better, implementations).
> Auditing and Traceability: As Airavata mediates executions on behalf of gateways, it has to strike a balance between abstracting the compute resource interactions and providing a transparent execution trace. This will bloat the amount of data to be catalogued. A good effort would be to understand the current extent of Airavata audits and provide suggestions.
> BigData Leverage: Airavata needs to leverage the influx of tools in this space. Any suggestions on relevant tools which will enhance the Airavata experience will be a good fit.
> [1] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Data+Models+0.12
> [2] - http://markmail.org/thread/4lguliiktjohjmsd

-- This message was sent by Atlassian JIRA (v6.3.4#6332)