[ https://issues.apache.org/jira/browse/AIRAVATA-1646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14380770#comment-14380770 ]
pankaj saha commented on AIRAVATA-1646:
---------------------------------------

I am really enthusiastic about the option to explore big data technologies (concepts and implementations) for Airavata Data Management needs. I have some questions:

Is information such as Application Characteristics available? For example, are applications read/write, I/O, CPU or memory intensive? Also, is one of the outcomes of this project a recommendation for a software pipeline that shows which big data technologies should be combined? Do we need to look into Apache Storm for querying streaming data? What is the data size, and how many nodes can we run on? Finally, where will these experiments be run? Locally on my own machine?

Pankaj

> [GSoC] Brainstorm Airavata Data Management Needs
> ------------------------------------------------
>
> Key: AIRAVATA-1646
> URL: https://issues.apache.org/jira/browse/AIRAVATA-1646
> Project: Airavata
> Issue Type: Brainstorming
> Reporter: Suresh Marru
> Labels: gsoc, gsoc2015, mentor
>
> Currently, Airavata focuses on Execution Management, and the Registry
> Sub-System (with app, resource and experiment catalogs) captures metadata
> about applications and executions. There were few efforts (primarily from
> student projects) to explore this void. It will be good to concretely propose
> data management solutions for input data registration, input and generated
> data retrieval, data transfers and replication management.
> Metadata Catalog: In addition, current metadata management is based on
> shredding Thrift data models into MySQL/Derby schemas. This is described in
> [1]. We have discussed using object store databases extensively, with the
> conclusion that the requirements need to be understood more systematically.
> A good standalone task would be to understand current metadata management
> and propose alternative solutions with proof-of-concept implementations.
> Once the community is convinced, we can then plan on implementing them in
> production.
> Provenance: Airavata could be enhanced to capture provenance to organize the
> data for reuse, discovery, comparison and sharing. This is a well-explored
> field. There might be good, compelling third-party solutions. In particular,
> it will be good to explore the big data space and identify what can be
> leveraged (either concepts or, even better, implementations).
> Auditing and Traceability: As Airavata mediates executions on behalf of
> gateways, it has to strike a balance between abstracting the compute resource
> interactions and, at the same time, providing a transparent execution trace.
> This will bloat the amount of data to be catalogued. A good effort will be to
> understand the current extent of Airavata audits and provide suggestions.
> BigData Leverage: Airavata needs to leverage the influx of tools in this
> space. Any suggestions on relevant tools which will enhance the Airavata
> experience will be a good fit.
> [1] - https://cwiki.apache.org/confluence/display/AIRAVATA/Airavata+Data+Models+0.12
> [2] - http://markmail.org/thread/4lguliiktjohjmsd

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)