Yes, +1 on the detailed email summaries.

 

Marlon

 

 

From: Suresh Marru <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, June 22, 2018 at 8:46 AM
To: Airavata Dev <[email protected]>
Cc: Supun Nakandala <[email protected]>
Subject: Re: [GSoC] Re-architect Output Data Parsing into Airavata core

 

Hi Lahiru, 

 

Thank you for sharing the detailed summary. I do not have comments on your 
questions; maybe Supun can weigh in. I have a couple of meta requests though:

 

Can you consider adding a few molecular dynamics parsers, in this order: LAMMPS, 
Amber, and CHARMM? The cclib library you used for the others does not cover 
these, but InterMol [1] provides a Python library to parse them. We have to be 
careful here: InterMol itself is MIT licensed and we can depend on it, but it 
depends on ParmEd [2], which is LGPL licensed. It is a TODO for me to figure out 
how to deal with this, but please see if you can include adding these parsers in 
your timeline. 

 

Can you evaluate whether we can provide export to the Quantum Chemistry JSON 
Schema [3]? If this is trivial, we can pursue it. 

 

Lastly, can you see if Apache Tika will help with any of your efforts? 

 

Kudos again for your mailing list communications,

Suresh 

 

[1] - https://github.com/shirtsgroup/InterMol

[2] - https://github.com/ParmEd/ParmEd 

[3] - https://github.com/MolSSI/QC_JSON_Schema 

 



On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <[email protected]> 
wrote:

 

Hi Everyone, 

 

In the last couple of days, I've been working on the data parsing tasks. As an 
update, I have already converted the code-base of the Gaussian, Molpro, NWChem, 
and GAMESS parsers to Python [1]. Compared to the seagrid-data code-base, this 
project will not contain any experiment-specific code (for example, no JSON 
mappings). The main reason for this is to decouple experiments from the data 
parsing tasks. 

 

While converting the Gaussian, Molpro, NWChem, and GAMESS code I found that some 
JSON key-value pairs in the data-catalog Docker container are not used by 
seagrid-data to generate the final output file. I have commented out the unused 
key-value pairs in the code itself [2], [3], [4], [5]. I would like to know 
whether there is any specific reason for this; I hope @Supun Nakandala can 
answer it. 

 

The next update is about the data parsing architecture.

The new requirement is to come up with a framework that is capable of parsing 
any kind of document into a known type when the metadata is given. With this new 
design, data parsing will not be restricted to experiments (Gaussian, Molpro, 
etc.). 

 

The following architecture is designed according to the requirements specified 
by @dimuthu in the last GSoC meeting.

 

The following diagram depicts the top level architecture.

 

<suggested architecture.png>


Following are the key components.

 

Abstract Parser 

This is a basic template for a parser, specifying the parameters required for a 
parsing task: for example, the input file type, output file type, experiment 
type (if the parsing is related to an experiment), etc.
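
As a rough illustration, a minimal sketch of what such a template could look 
like is below; the class and field names are assumptions on my part, not 
existing Airavata code.

// Minimal sketch of the Abstract Parser template described above.
// Class and field names are illustrative assumptions, not existing Airavata code.
public abstract class AbstractParser {

    protected final String inputFileType;    // e.g. "text/plain", "application/gaussian-out"
    protected final String outputFileType;   // e.g. "application/json"
    protected final String experimentType;   // e.g. "GAUSSIAN", or null if not experiment-specific

    protected AbstractParser(String inputFileType, String outputFileType, String experimentType) {
        this.inputFileType = inputFileType;
        this.outputFileType = outputFileType;
        this.experimentType = experimentType;
    }

    // Concrete parsers (in-code or Docker-backed) implement the actual conversion.
    public abstract void parse(java.nio.file.Path input, java.nio.file.Path output) throws Exception;

    public String getInputFileType()  { return inputFileType; }
    public String getOutputFileType() { return outputFileType; }
    public String getExperimentType() { return experimentType; }
}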

 

Parser Manager

Constructs the set of parsers considering the input file type, output file 
type, and the experiment type.

The Parser Manager constructs a graph from the catalog and finds the shortest 
path between the input file type and the output file type, then returns the 
corresponding set of parsers.
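
As a rough sketch of this shortest-path lookup: assuming the catalog is exposed 
as a list of edges (CatalogEntry below, defined in the catalog sketch further 
down), a breadth-first search gives the shortest chain of parsers. All names 
here are illustrative, not actual project code.

import java.util.*;

// Sketch of the Parser Manager's shortest-path search over catalog edges.
public class ParserManager {

    /** Returns the shortest chain of catalog entries from 'from' to 'to', or an empty list if none. */
    public List<CatalogEntry> findParserChain(String from, String to, List<CatalogEntry> catalog) {
        // Index the catalog as an adjacency list: input type -> outgoing conversions.
        Map<String, List<CatalogEntry>> edges = new HashMap<>();
        for (CatalogEntry e : catalog) {
            edges.computeIfAbsent(e.getInputType(), k -> new ArrayList<>()).add(e);
        }
        // Standard BFS: the first time we reach 'to', the discovered path is a shortest one.
        Map<String, CatalogEntry> cameFrom = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        Set<String> visited = new HashSet<>(Collections.singleton(from));
        queue.add(from);
        while (!queue.isEmpty()) {
            String type = queue.poll();
            if (type.equals(to)) {
                // Walk back from the target to rebuild the ordered parser chain.
                LinkedList<CatalogEntry> chain = new LinkedList<>();
                for (String t = to; cameFrom.containsKey(t); t = cameFrom.get(t).getInputType()) {
                    chain.addFirst(cameFrom.get(t));
                }
                return chain;
            }
            for (CatalogEntry e : edges.getOrDefault(type, Collections.emptyList())) {
                if (visited.add(e.getOutputType())) {
                    cameFrom.put(e.getOutputType(), e);
                    queue.add(e.getOutputType());
                }
            }
        }
        return Collections.emptyList();
    }
}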

 

<graph.png>

Catalog 

A mapping whose records identify the Docker container that can be used to parse 
one file type into another. For example, if the requirement is to parse a 
Gaussian .out file to JSON, then the "app/gaussian .out to JSON" Docker 
container will be fetched.
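
A minimal sketch of what one catalog record could carry (these field names are 
assumptions on my part):

// Sketch of a catalog entry: one edge of the graph, mapping an (input type, output type)
// pair to the Docker image (or in-code parser) that performs the conversion.
public class CatalogEntry {

    private final String inputType;       // e.g. "application/gaussian-out"
    private final String outputType;      // e.g. "application/json"
    private final String applicationType; // e.g. "GAUSSIAN", or null for generic parsers
    private final String dockerImage;     // e.g. "app/gaussian-out-to-json", or null if coded in-project

    public CatalogEntry(String inputType, String outputType, String applicationType, String dockerImage) {
        this.inputType = inputType;
        this.outputType = outputType;
        this.applicationType = applicationType;
        this.dockerImage = dockerImage;
    }

    public String getInputType()       { return inputType; }
    public String getOutputType()      { return outputType; }
    public String getApplicationType() { return applicationType; }
    public String getDockerImage()     { return dockerImage; }
}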

 

Parsers

There are two types of parsers (according to the suggested approach). 

 

The first type is parsers that are coded directly into the project code-base. 
For example, parsing a text file to JSON is straightforward, so it is not 
necessary to maintain a separate Docker container for that conversion; using a 
library and adding an entry to the catalog is enough to get the work done.
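
For instance, a first-type parser could be a tiny class in the code-base; the 
sketch below (a trivial text-to-JSON conversion, with assumed names and type 
strings) builds on the Abstract Parser sketch above.

import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of an in-code ("first type") parser: a trivial text -> JSON conversion
// that needs no Docker container. The output format (a JSON array of lines) is
// only an example.
public class TextToJsonParser extends AbstractParser {

    public TextToJsonParser() {
        super("text/plain", "application/json", null);
    }

    @Override
    public void parse(Path input, Path output) throws Exception {
        List<String> lines = Files.readAllLines(input);
        String json = lines.stream()
                .map(l -> "\"" + l.replace("\\", "\\\\").replace("\"", "\\\"") + "\"")
                .collect(Collectors.joining(",", "{\"lines\":[", "]}"));
        Files.write(output, json.getBytes());
    }
}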

 

The second type is parsers that run in a separate Docker container, for example 
the Gaussian .out file to JSON Docker container.

 

To get an idea of the overall flow, consider the following examples.

 

Example 1

Suppose a PDF should be parsed to XML.

The Parser Manager will look up the catalog and find the shortest path from PDF 
to XML. The available parsers (both the parsers coded in the project and the 
Dockerized parsers) are:

• PDF to Text

• Text to JSON

• JSON to XML

• application/gaussian .out to JSON (this is a very specific parsing mechanism, 
not the same as parsing a plain .out file to JSON)

plus the rest, which I have included in the diagram.

 

Then the Parser Manager will construct the graph and find the shortest path, 

PDF -> Text -> JSON -> XML, from the available parsers. 

 

<graph 2.png>

The Parser Manager will then return three parsers, from which a DAG will be 
constructed as follows:

 

<parser dag.png>


The reason for the architectural decision to have three parsers rather than a 
single one is that if one of the parsers fails, it is easy to identify which 
one. 
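
Putting the earlier sketches together, a hypothetical call for this example 
would look like the following (the type strings and the in-scope 'catalog' list 
are assumptions):

// Hypothetical usage for Example 1.
ParserManager parserManager = new ParserManager();
List<CatalogEntry> chain =
        parserManager.findParserChain("application/pdf", "application/xml", catalog);
// Expected chain, given the parsers listed above:
//   application/pdf -> text/plain -> application/json -> application/xml
// Each of the three entries then becomes one parser in the DAG shown above.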

 

Example 2

Consider a separate example: parsing a Gaussian .out file to JSON. This is 
pretty straightforward; as in the example above, the Parser Manager will 
construct a parser linked to the Dockerized app/gaussian .out to JSON container. 

 

Example 3

The problem arises when a Gaussian .out file needs to be parsed to XML. There 
are two options.

 

1st option - If application-specific parsing is required, there must be 
application-typed parsers to do the work; otherwise the conversion is not 
allowed.

In the list of parsers there is no application-specific parser that converts a 
.out file to XML. In this case, even though the Parser Manager could construct a 
path like 

.out/gaussian -> JSON/gaussian -> XML, this conversion is not allowed.

 

2nd option - Once the application-specific content has been parsed, the rest is 
the same as converting a normal JSON to XML, so we could allow the path 

.out/gaussian -> JSON/gaussian -> XML. 

Which should actually be done, the 1st option or the 2nd? This is one point 
where I need a suggestion.

 

I would really appreciate any suggestions to improve this.

 

[1] https://github.com/Lahiru-J/airavata-data-parser

[2] 
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288

[3] 
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175
 

[4] 
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

[5] 
https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

 

Cheers,

 

On 28 May 2018 at 18:05, Lahiru Jayathilake <[email protected]> wrote:

Note: this is the high-level architecture diagram (it was not visible in the 
previous email). 

 

<Screen Shot 2018-05-28 at 9.30.43 AM.png>

Thanks,

Lahiru

 

On 28 May 2018 at 18:02, Lahiru Jayathilake <[email protected]> wrote:

Hi Everyone, 

 

During the past few days, I've been implementing the tasks related to data 
parsing. As a heads-up, the following image depicts the top-level architecture 
of the implementation.

 


The following main task components have been identified:

 

1. DataParsing Task

This task will fetch the stored output, find the matching parser (Gaussian, 
LAMMPS, QChem, etc.), and send the output through the selected parser to obtain 
well-structured JSON.

 

2. Validating Task

This validates whether the desired JSON output has been achieved, i.e. the JSON 
output should match the respective schema (Gaussian schema, LAMMPS schema, QChem 
schema, etc.).
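
As a sketch of what this validation step could look like, one option (an 
assumption on my part, not a decided dependency) is the everit json-schema 
library:

import java.io.InputStream;
import org.everit.json.schema.Schema;
import org.everit.json.schema.ValidationException;
import org.everit.json.schema.loader.SchemaLoader;
import org.json.JSONObject;
import org.json.JSONTokener;

// Sketch only: validate the parsed JSON against the experiment's schema
// (e.g. a Gaussian schema) before handing it to the Persisting task.
public class JsonSchemaValidator {

    public boolean isValid(InputStream schemaStream, String parsedJson) {
        Schema schema = SchemaLoader.load(new JSONObject(new JSONTokener(schemaStream)));
        try {
            schema.validate(new JSONObject(parsedJson)); // throws if the document does not match
            return true;
        } catch (ValidationException e) {
            return false;
        }
    }
}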

 

3. Persisting Task

This task will persist the validated JSON outputs.

 

The successfully stored outputs will be exposed to the outside world. 

 

 

According to the diagram, the generated JSON should be shared between the tasks 
(DataParsing, Validating, and Persisting). Neither the DataParsing task nor the 
Validating task persists the JSON, so the Helix task framework has to share the 
content between the tasks.

 

This Helix tutorial [1] describes how to share content between Helix tasks. The 
problem is that the given method [2] is only capable of sharing String-typed 
key-value data. 

However, I could come up with an implementation that shares all the values of 
the JSON output by calling this method [2] many times. I believe that is not 
very efficient, because the Helix task framework then has to call this method 
[3] many times (bearing in mind that the generated JSON output can be large).
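
One possible workaround is to serialize the whole JSON document into a single 
string and store it with one call. A minimal sketch, assuming the task extends 
Helix's UserContentStore (all names other than Helix's own are mine):

import org.apache.helix.task.Task;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.UserContentStore;

// Sketch only: store the whole parsed JSON as a single workflow-scoped string
// instead of one putUserContent call per key-value pair.
public class DataParsingTask extends UserContentStore implements Task {

    @Override
    public TaskResult run() {
        String parsedJson = runParserAndGetJson(); // hypothetical helper
        putUserContent("PARSED_JSON", parsedJson, Scope.WORKFLOW);
        return new TaskResult(TaskResult.Status.COMPLETED, "parsing done");
    }

    @Override
    public void cancel() {
        // nothing to clean up in this sketch
    }

    private String runParserAndGetJson() {
        return "{}"; // placeholder for the real parsing logic
    }
}

// The downstream Validating task would then read it back with
// getUserContent("PARSED_JSON", Scope.WORKFLOW).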

 

I have already sent an email to the Helix mailing list to ask whether there is 
another way, and whether it would be efficient to call this method [2] multiple 
times to get the work done.

 

Am I on the right track? Your suggestions would be very helpful; please add 
anything that is missing.

 

 

[1] 
http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs

[2] 
https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75

[3] 
https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361

 

Thanks,

Lahiru

 

On 26 March 2018 at 19:44, Lahiru Jayathilake <[email protected]> wrote:

Hi Dimuthu, Suresh, 

 

Thanks a lot for the feedback. I will update the proposal accordingly.

 

Regards,

Lahiru

 

On 26 March 2018 at 08:48, Suresh Marru <[email protected]> wrote:

Hi Lahiru, 

 

I echo Dimuthu's comment. You have a good starting point; it will be nice if you 
can cover how users can interact with the parsed data, essentially adding API 
access to the parsed metadata database and building proof-of-concept UIs. This 
task could be challenging, as the queries are very data-specific, and 
generalizing API access and building custom UIs can be the exploratory (less 
defined) portions of your proposal. 

 

Cheers,

Suresh 

 



On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <[email protected]> wrote:

 

Hi Lahiru, 

 

Nice document, and I like how you illustrate the system through diagrams. 
However, try to address how you are going to expose parsed data to the outside 
world through Thrift APIs, and how to design those data APIs in an 
application-specific manner. In the persisting task, you also have to make sure 
data integrity is preserved; for example, for a parsed Gaussian output you might 
have to validate it against a schema before persisting it in the database. 

 

Thanks

Dimuthu

 

On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <[email protected]> 
wrote:

Hi Everyone, 

 

I have shared a draft proposal [1] for the GSoC project AIRAVATA-2718 [2]. Any 
comments to improve it would be very helpful.

 

[1] 
https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
 

[2] https://issues.apache.org/jira/browse/AIRAVATA-2718

 

Thanks & Regards,

-- 

Lahiru Jayathilake 

Department of Computer Science and Engineering,

Faculty of Engineering,

University of Moratuwa

 

 

 



 

