Re: Achieve the Pre-Data Staging and explore ways to reduce the data transfer between the compute resource and airavata server

Supun Nakandala Sat, 08 Sep 2018 16:39:55 -0700

Hi Karan,

This will be a very useful addition to Airavata. I thought about this
some time ago, and below are some ideas that I think are worth sharing
with you.

1. I think the value added by this project goes beyond the current SEAGrid
use cases. As of now (if I am correct), Airavata follows an
"application first" approach to creating experiments. This makes sense, as
most of the existing experiments are simulation-type experiments. But there
will be growing interest in "data first" experiments, such as machine
learning and bioinformatics experiments, where you keep reusing the
same inputs/datasets for a variety of ML models or applications.

2. In order to support a fully capable data-first approach, we need to
maintain a catalog containing information about the data items (similar to
the Application catalog that we currently have). This catalog should have
at least the basic metadata describing the data format, origin, supported
applications, etc. The best place to add this information is the data
catalog. But if I am correct, that will require adding these capabilities
to the data catalog, as it currently only supports cataloging output data
from some selected applications.

3. To reduce the overhead of data movement you will need a
replica catalog. The same data files can be distributed to multiple places,
and a replica catalog will help keep track of them. In an ideal scenario,
based on the available locations of a data item and a network cost model,
the system should be able to decide which replica to use for a
particular experiment to minimize data movement cost. Also, in some cases
the availability of replicas will be subject to an expiration time, as
on some compute resources the scratch space is subject to purging.
In an ideal scenario, the replica catalog should capture this information
too.
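
To make points 2 and 3 a bit more concrete, here is a rough Python sketch
of what a catalog entry and a replica selection step could look like. This
is only an illustration under my own assumptions: none of these class,
field, or function names exist in Airavata, and the cost function is a toy
stand-in for a real network cost model.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, List, Optional


@dataclass
class Replica:
    """One physical copy of a data item (hypothetical structure)."""
    resource: str                          # resource that holds this copy
    path: str                              # location of the copy on that resource
    expires_at: Optional[datetime] = None  # None means the copy is never purged

    def is_live(self, now: datetime) -> bool:
        # A replica is usable until its (optional) purge deadline passes.
        return self.expires_at is None or now < self.expires_at


@dataclass
class DataItem:
    """Catalog entry with basic metadata plus known replicas (hypothetical)."""
    data_id: str
    data_format: str
    origin: str
    supported_applications: List[str] = field(default_factory=list)
    replicas: List[Replica] = field(default_factory=list)


def toy_cost(source: str, target: str) -> float:
    """Toy cost model (assumption): free if already on the target, else 1.0."""
    return 0.0 if source == target else 1.0


def choose_replica(
    item: DataItem,
    target: str,
    cost: Callable[[str, str], float],
    now: datetime,
) -> Optional[Replica]:
    """Return the unexpired replica that is cheapest to move to `target`."""
    live = [r for r in item.replicas if r.is_live(now)]
    if not live:
        return None  # nothing usable; the data would have to be staged again
    return min(live, key=lambda r: cost(r.resource, target))
```

Given a data item with replicas on the gateway storage and on a cluster's
scratch space, choose_replica(item, "cluster-a", toy_cost, datetime.now())
would prefer the copy already sitting on cluster-a as long as it has not
expired, and fall back to another location otherwise.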

As you can see, this project spans almost all aspects of Airavata's
data infrastructure, and there are some interesting distributed systems
problems. But as you have shown in the Wiki, you can start with SEAGrid as a
concrete use case. I hope the big picture will give you more interesting
ideas for extending your project further.

Best
-Supun

On Sat, Sep 8, 2018 at 3:29 PM Kotabagi, Karan <[email protected]> wrote:

> **************Re-sending the previous email*****************
>
> Hi Dev,
>
>
> We have discussed a few changes with Sudhakar and updated the Wiki with the
> new Napkin Drawing and User Story. Please review them and let us know
> if you have any suggestions.
>
>
> *Wiki Link:*
>
> https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation
>
>
> Regards
>
> Karan
> ------------------------------
> *From:* Kotabagi, Karan
> *Sent:* Thursday, September 6, 2018 12:09 AM
> *To:* [email protected]
> *Subject:* Achieve the Pre-Data Staging and explore ways to reduce the
> data transfer between the compute resource and airavata server
>
>
> Hi Dev,
>
>
> As part of the Science Gateway Architecture course, we have received a project
> proposal from Sudhakar to achieve pre-data staging using Nextcloud.
>
>
> Please find below the *project proposal* and *wiki link* for the
> project ideation phase. Please review them and advise if there are any
> points that would be useful for starting the project.
>
>
> *Project Proposal:*
>
> Achieve pre-data staging of files using Nextcloud file storage and
> explore ways to reduce data transfer between the compute
> resources and the local Airavata server.
>
>
> *Wiki Link:*
>
> https://github.com/airavata-courses/airavata-nextcoud/wiki/Project-Ideation
>
>
> Regards
>
> Karan
>
>
>
