Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Kotabagi, Karan Tue, 25 Sep 2018 06:35:45 -0700

Hi Dev,


We are working on this and exploring ways to pre-stage the data.


We need to know whether there is a way to call the Airavata API to register
the product URI after the file has been uploaded from the client end.


Regards

Karan

________________________________
From: Kotabagi, Karan <[email protected]>
Sent: Sunday, September 9, 2018 5:51 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data 
transfer between the compute resource and airavata server


Hi Supun,


Thank you for the detailed suggestions and insights. They gave us an in-depth
understanding of the future aspects of the project.


We will have more questions as we move along.


Regards

Karan

________________________________
From: Supun Nakandala <[email protected]>
Sent: Saturday, September 8, 2018 7:39 PM
To: dev
Subject: Re: Achieve the Pre-Data Staging and explore ways to reduce the data 
transfer between the compute resource and airavata server

Hi Karan,

This will be a very useful addition to Airavata. I was thinking about this
some time back, and below are some ideas that I think are worth sharing with
you.

1. I think the value added by this project goes beyond the current SEAGrid use 
cases. As of now (if I am correct), Airavata follows an "application first" 
approach for creating experiments. This makes sense, as most of the existing 
experiments are simulation-type experiments. But there will be growing 
interest in "data first" experiments, such as machine learning and 
bioinformatics experiments, where you keep reusing the same inputs/dataset 
for a variety of ML models or applications.

2. In order to support a fully capable data-first approach, we need to 
maintain a catalog containing information about the data items (similar to the 
Application catalog that we currently have). This catalog should have at least 
the basic metadata describing the data format, origin, supported applications, 
etc. The best place to add this information would be the data catalog. But if 
I am correct, that will require adding these capabilities to the data catalog, 
as it currently only supports cataloging output data from some selected 
applications.

3. To reduce the overhead of data movement, you will need a replica catalog. 
The same data files can be distributed to multiple places, and a replica 
catalog will help keep track of them. In an ideal scenario, based on the 
available locations of a data item and a network cost model, the system should 
be able to decide which replica of a data item to use for a particular 
experiment so as to minimize data movement cost. Also, in some cases the 
availability of replicas will be subject to an expiration time, as on some 
compute resources the scratch space is subject to purging. In an ideal 
scenario, the replica catalog should capture this information too.
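
To make point 3 concrete, here is a minimal Python sketch of replica selection
that skips purged copies and picks the cheapest remaining one. It is only an
illustration: the Replica record, the pick_replica and transfer_cost functions,
the resource names, and the cost table are all hypothetical and are not part
of Airavata's existing catalogs or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class Replica:
    """One physical copy of a data item (hypothetical record layout)."""
    data_item_id: str
    location: str                          # storage or scratch space holding the copy
    expires_at: Optional[datetime] = None  # None means the copy is never purged


# Hypothetical network cost model: (replica location, compute resource) -> cost.
CostTable = Dict[Tuple[str, str], float]


def transfer_cost(cost: CostTable, source: str, target: str) -> float:
    """Cost of moving data from source to target; unknown routes cost infinity."""
    if source == target:
        return 0.0
    return cost.get((source, target), float("inf"))


def pick_replica(replicas: List[Replica], target: str,
                 cost: CostTable, now: datetime) -> Optional[Replica]:
    """Return the cheapest replica that has not expired, or None if none is left."""
    live = [r for r in replicas if r.expires_at is None or r.expires_at > now]
    if not live:
        return None
    return min(live, key=lambda r: transfer_cost(cost, r.location, target))


if __name__ == "__main__":
    now = datetime(2018, 9, 8, 12, 0)
    replicas = [
        Replica("dataset-1", "gateway-storage"),
        # Cheapest route, but the scratch space was purged an hour ago.
        Replica("dataset-1", "cluster-a-scratch", now - timedelta(hours=1)),
        Replica("dataset-1", "cluster-b-scratch", now + timedelta(days=1)),
    ]
    cost = {
        ("gateway-storage", "cluster-a"): 10.0,
        ("cluster-a-scratch", "cluster-a"): 0.1,
        ("cluster-b-scratch", "cluster-a"): 2.0,
    }
    chosen = pick_replica(replicas, "cluster-a", cost, now)
    print(chosen.location if chosen else "no usable replica")
```

Keeping the expiration time on the replica record itself means the selection
step can filter out purged copies before the cost model is consulted.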

As you can see, this project spans almost all aspects of the Airavata data 
infrastructure, and there are some interesting distributed systems problems. 
But as you have shown in the Wiki, you can start with SEAGrid as a concrete 
use case. I hope the big picture will give you more interesting ideas for 
extending your project further.

Best
-Supun

On Sat, Sep 8, 2018 at 3:29 PM Kotabagi, Karan <[email protected]> wrote:

**************Re-sending the previous email*****************

Hi Dev,


We have discussed a few changes with Sudhakar and updated the Wiki with the 
new Napkin Drawing and User Story. Please review them and let us know if you 
have any suggestions.


Wiki Link:

https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation


Regards

Karan

________________________________
From: Kotabagi, Karan
Sent: Thursday, September 6, 2018 12:09 AM
To: [email protected]
Subject: Achieve the Pre-Data Staging and explore ways to reduce the data 
transfer between the compute resource and airavata server


Hi Dev,


As part of the Science Gateway Architecture course, we have received a project 
proposal from Sudhakar to achieve pre-data staging using Nextcloud.


Please find below the project proposal and the wiki link for the project 
ideation phase. Please review them and advise if there are any points that 
would be useful for getting started with the project.


Project Proposal:

Achieve pre-data staging of files using Nextcloud file storage and explore 
ways to reduce data transfers between the compute resources and the local 
Airavata server.


Wiki Link:

https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation


Regards

Karan
