Hi Lahiru,

Thank you for sharing the detailed summary. I do not have comments on your 
questions; maybe Supun can weigh in. I have a couple of meta requests though:

Can you consider adding a few molecular dynamics parsers, in this order: LAMMPS, 
Amber, and CHARMM? The cclib library you used for the others does not cover 
these, but InterMol [1] provides a Python library to parse them. We have to be 
careful here: InterMol itself is MIT licensed and we can depend on it, but it in 
turn depends on ParmEd [2], which is LGPL licensed. It is a TODO for me to figure 
out how to deal with this, but please see if you can include adding these parsers 
in your timeline. 

Can you evaluate whether we can provide export to the Quantum Chemistry JSON 
Schema [3]? If this is trivial, we can pursue it. 

Lastly, can you see if Apache Tika will help with any of your efforts. 

Kudos again for your mailing list communications,
Suresh 

[1] - https://github.com/shirtsgroup/InterMol
[2] - https://github.com/ParmEd/ParmEd
[3] - https://github.com/MolSSI/QC_JSON_Schema


> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <lahiruj...@cse.mrt.ac.lk> 
> wrote:
> 
> Hi Everyone,
> 
> In the last couple of days, I've been working on the data parsing tasks. To 
> give an update, I have already converted the code-base of the Gaussian, 
> Molpro, NWChem, and GAMESS parsers to Python [1]. Compared to the code-base 
> of seagrid-data, there won't be any code related to experiments in the 
> project (for example, no JSON mappings). The main reason for doing this is to 
> decouple experiments from the data parsing tasks. 
> 
> While converting the Gaussian, Molpro, NWChem, and GAMESS code I found that 
> some JSON key-value pairs in the data-catalog docker container have not been 
> used by seagrid-data to generate the final output file. I have commented out 
> the unused key-value pairs in the code itself [2], [3], [4], [5]. I would 
> like to know whether there is any specific reason for this; I hope @Supun 
> Nakandala can answer it. 
> 
> The next update is about the data parsing architecture.
> The new requirement is to come up with a framework that is capable of parsing 
> any kind of document to a known type when the metadata is given. With this 
> new design, data parsing will not be restricted to experiments (Gaussian, 
> Molpro, etc.). 
> 
> The following architecture is designed according to the requirements 
> specified by @dimuthu in the last GSoC meeting.
> 
> The following diagram depicts the top level architecture.
> 
> <suggested architecture.png>
> Following are the key components.
> 
> Abstract Parser 
> This is a basic template for a Parser which specifies the parameters 
> required for a parsing task: for example, input file type, output file type, 
> experiment type (if the task is related to an experiment), etc.
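> 
> As a rough illustration (this is not the actual project code; the class and 
> field names are hypothetical), the Abstract Parser could be a small Python 
> base class that every concrete parser fills in:
> 
>     from abc import ABC, abstractmethod
> 
>     class AbstractParser(ABC):
>         """Template describing the parameters a parsing task needs."""
> 
>         input_file_type = None   # e.g. "application/gaussian .out"
>         output_file_type = None  # e.g. "JSON"
>         experiment_type = None   # e.g. "GAUSSIAN"; None if not experiment-related
> 
>         @abstractmethod
>         def parse(self, input_path, output_path):
>             """Read the input file and write the converted output file."""
> 
>     class GaussianOutToJsonParser(AbstractParser):
>         input_file_type = "application/gaussian .out"
>         output_file_type = "JSON"
>         experiment_type = "GAUSSIAN"
> 
>         def parse(self, input_path, output_path):
>             # Hypothetical: delegate to the dockerized Gaussian parser here.
>             raise NotImplementedError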
> 
> Parser Manager
> Constructs the set of parsers considering the input file type, output file 
> type, and the experiment type.
> The Parser Manager will construct a graph and find the shortest path between 
> the input file type and the output file type, then return the corresponding 
> set of Parsers.
> 
> <graph.png>
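> 
> To make the shortest-path idea concrete, here is a minimal sketch (not the 
> actual implementation; the available parsers and function names are made up 
> for illustration) using a breadth-first search over (input type, output 
> type) edges:
> 
>     from collections import deque
> 
>     # Hypothetical set of available conversions: (input type, output type, parser name).
>     AVAILABLE_PARSERS = [
>         ("PDF", "Text", "pdf-to-text"),
>         ("Text", "JSON", "text-to-json"),
>         ("JSON", "XML", "json-to-xml"),
>         ("application/gaussian .out", "JSON", "app/gaussian .out to JSON"),
>     ]
> 
>     def find_parser_chain(source_type, target_type, parsers=AVAILABLE_PARSERS):
>         """Breadth-first search for the shortest chain of parsers."""
>         queue = deque([(source_type, [])])
>         visited = {source_type}
>         while queue:
>             current, chain = queue.popleft()
>             if current == target_type:
>                 return chain
>             for src, dst, name in parsers:
>                 if src == current and dst not in visited:
>                     visited.add(dst)
>                     queue.append((dst, chain + [name]))
>         return None  # no conversion path exists
> 
>     # As in Example 1 below, PDF -> XML resolves to three parsers:
>     print(find_parser_chain("PDF", "XML"))
>     # ['pdf-to-text', 'text-to-json', 'json-to-xml']
> 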
> Catalog 
> A mapping whose records identify the Docker container that can be used to 
> parse from one file type to another. For example, if the requirement is to 
> parse a Gaussian .out file to JSON, then the "app/gaussian .out to JSON" 
> docker container will be fetched.
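> 
> Purely for illustration (the real catalog format is not specified here; the 
> keys and fields below are assumptions), a catalog record could look like 
> this, with the second entry corresponding to the directly-coded kind of 
> parser described in the next section:
> 
>     # Hypothetical catalog records: (input type, output type) -> how to run the parser.
>     CATALOG = {
>         ("application/gaussian .out", "JSON"): {
>             "kind": "docker",
>             "image": "app/gaussian-out-to-json",     # dockerized parser
>         },
>         ("Text", "JSON"): {
>             "kind": "in-code",
>             "entry_point": "parsers.text_to_json",   # parser coded in the project
>         },
>     }
> 
>     def lookup_parser(input_type, output_type):
>         """Return the catalog record for a conversion, or None if unregistered."""
>         return CATALOG.get((input_type, output_type))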
> 
> Parsers
> There are two types of parsers (according to the suggested design).
> 
> The first type is parsers that are coded directly into the project code-base. 
> For example, parsing a text file to JSON is straightforward, so it is not 
> necessary to maintain a separate docker container for it; using a library and 
> adding an entry to the catalog is enough to get the work done.
> 
> The second type is parsers that have a separate docker container, for example 
> the Gaussian .out file to JSON docker container.
> 
> For the overall scenario, consider the following examples.
> 
> Example 1
> Suppose a PDF should be parsed to XML.
> The Parser Manager will look up the catalog and find the shortest path from 
> PDF to the XML output. The available parsers (both the parsers coded in the 
> project and the dockerized parsers) are:
> • PDF to Text
> • Text to JSON
> • JSON to XML
> • application/gaussian .out to JSON (this is a very specific parsing 
> mechanism, not the same as parsing a simple .out file to JSON)
> and the rest which I have included in the diagram.
> 
> Then the Parser Manager will construct the graph and find the shortest path, 
> PDF -> Text -> JSON -> XML, from the available parsers. 
> 
> <graph 2.png>
> Then the Parser Manager will return three Parsers. From those three parsers 
> a DAG will be constructed as follows,
> 
> <parser dag.png>
> The reason for the architectural decision to have three parsers rather than 
> doing everything in a single parser is that, if one of the parsers fails, it 
> is easy to identify which parser it was. 
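> 
> As a rough sketch of that idea (the stage objects and method names here are 
> hypothetical, and the stages could equally be separate Helix tasks), each 
> step of the DAG runs in isolation so a failure is attributed to exactly one 
> parser:
> 
>     class ParserStageFailure(Exception):
>         """Raised with the name of the parser stage that failed."""
> 
>     def run_chain(stages, input_path):
>         """Run each parser stage in order; report exactly which stage failed."""
>         current = input_path
>         for stage in stages:
>             try:
>                 current = stage.parse_file(current)  # hypothetical per-stage API
>             except Exception as exc:
>                 raise ParserStageFailure(f"stage '{stage.name}' failed") from exc
>         return current
> 
>     # For Example 1, the stages would be PDF->Text, Text->JSON, JSON->XML.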
> 
> Example 2
> Consider a separate example: parsing a Gaussian .out file to JSON. This is 
> pretty straightforward. As in the aforementioned example, the Parser Manager 
> will construct a Parser that links to the dockerized app/gaussian .out to 
> JSON container. 
> 
> Example 3
> The problem arises when a Gaussian .out file needs to be parsed to XML. There 
> are two options.
> 
> 1st option - If application-related parsing has to happen, there must be 
> application-typed parsers to do the work; otherwise it is not allowed. 
> In the list of parsers there is no application-related parser to convert a 
> .out file to XML, so even though the Parser Manager could construct a path 
> like .out/gaussian -> JSON/gaussian -> XML, this process is not allowed.
> 
> 2nd option - Once the application-specific content has been parsed, the rest 
> is the same as converting a normal JSON to XML, assuming we allow the path 
> .out/gaussian -> JSON/gaussian -> XML. 
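> 
> Purely to make the two options concrete (illustrative only, not a 
> recommendation; the node representation is made up), the difference is 
> whether the graph contains an edge that drops the application tag:
> 
>     # Nodes carry an optional application tag: ("JSON", "gaussian") vs ("JSON", None).
>     EDGES_OPTION_1 = [
>         ((".out", "gaussian"), ("JSON", "gaussian")),
>         (("JSON", None), ("XML", None)),
>         # No edge from ("JSON", "gaussian") to ("XML", None): the path is not allowed.
>     ]
> 
>     EDGES_OPTION_2 = EDGES_OPTION_1 + [
>         # Treat application-specific JSON as plain JSON once it has been parsed,
>         # which allows .out/gaussian -> JSON/gaussian -> XML.
>         (("JSON", "gaussian"), ("JSON", None)),
>     ]
> 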
> What actually should be done, the 1st option or the 2nd option? This is one 
> point where I need a suggestion.
>  
> I would really appreciate any suggestions to improve this.
> 
> [1] https://github.com/Lahiru-J/airavata-data-parser
> [2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288
> [3] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175
> [4] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
> [5] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
> 
> Cheers,
> 
> On 28 May 2018 at 18:05, Lahiru Jayathilake <lahiruj...@cse.mrt.ac.lk> wrote:
> Note: this is the high-level architecture diagram (since it was not visible 
> in the previous email).
> 
> <Screen Shot 2018-05-28 at 9.30.43 AM.png>
> Thanks,
> Lahiru
> 
> On 28 May 2018 at 18:02, Lahiru Jayathilake <lahiruj...@cse.mrt.ac.lk> wrote:
> Hi Everyone,
> 
> During the past few days, I've been implementing the tasks related to Data 
> Parsing. To give a heads-up, the following image depicts the top-level 
> architecture of the implementation.
> 
> 
> Following are the main task components that have been identified:
> 
> 1. DataParsing Task
> This task will get the stored output, find the matching Parser (Gaussian, 
> LAMMPS, QChem, etc.), and send the output through the selected parser to get 
> a well-structured JSON.
> 
> 2. Validating Task
> This validates whether the desired JSON output has been achieved, i.e. the 
> JSON output should match the respective schema (Gaussian schema, LAMMPS 
> schema, QChem schema, etc.); a minimal validation sketch is given a few lines 
> below.
> 
> 3. Persisting Task
> This task will persist the validated JSON outputs.
> 
> The successfully stored outputs will be exposed to the outer world.
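> 
> For the Validating Task, a minimal sketch of schema-based validation could 
> look like the following (this assumes the Python jsonschema library and a 
> made-up fragment of a Gaussian schema; it is not the actual project code):
> 
>     import json
>     from jsonschema import ValidationError, validate
> 
>     # Hypothetical fragment of a Gaussian output schema.
>     GAUSSIAN_SCHEMA = {
>         "type": "object",
>         "properties": {
>             "InChI": {"type": "string"},
>             "Energy": {"type": "number"},
>         },
>         "required": ["InChI", "Energy"],
>     }
> 
>     def validate_output(json_path, schema=GAUSSIAN_SCHEMA):
>         """Return True if the parsed JSON matches the schema, False otherwise."""
>         with open(json_path) as f:
>             document = json.load(f)
>         try:
>             validate(instance=document, schema=schema)
>             return True
>         except ValidationError:
>             return False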
> 
> 
> According to the diagram, the generated JSON should be shared between the 
> tasks (DataParsing, Validating, and Persisting). Neither the DataParsing task 
> nor the Validating task persists the JSON; therefore, the Helix task 
> framework should make sure the content is shared between the tasks.
> 
> This Helix tutorial [1] describes how to share content between Helix tasks. 
> The problem is that the given method [2] is only capable of sharing 
> String-typed key-value data. 
> However, I can come up with an implementation that shares all the values 
> related to the JSON output; that involves calling this method [2] many times. 
> I believe that is not very efficient, because the Helix task framework then 
> has to call this method [3] many times (taking into account that the 
> generated JSON output can be large).
> 
> I have already sent an email to the Helix mailing list to clarify whether 
> there is another way, and whether it would be efficient to call this method 
> [2] multiple times to get the work done.
> 
> Am I on the right track? Your suggestions would be very helpful; please add 
> anything that is missing.
> 
> 
> [1] http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs
> [2] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75
> [3] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361
> 
> Thanks,
> Lahiru
> 
> On 26 March 2018 at 19:44, Lahiru Jayathilake <lahiruj...@cse.mrt.ac.lk> wrote:
> Hi Dimuthu, Suresh,
> 
> Thanks a lot for the feedback. I will update the proposal accordingly.
> 
> Regards,
> Lahiru
> 
> On 26 March 2018 at 08:48, Suresh Marru <sma...@apache.org> wrote:
> Hi Lahiru,
> 
> I echo Dimuthu's comment. You have a good starting point; it will be nice if 
> you can cover how users can interact with the parsed data, essentially adding 
> API access to the parsed metadata database and having proof-of-concept UIs. 
> This task could be challenging, as the queries are very data-specific, and 
> generalizing API access and building custom UIs can be the exploratory (less 
> defined) portions of your proposal. 
> 
> Cheers,
> Suresh
> 
> 
>> On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <dimuthu.upeks...@gmail.com> 
>> wrote:
>> 
>> Hi Lahiru,
>> 
>> Nice document, and I like how you illustrate the systems through diagrams. 
>> However, try to address how you are going to expose the parsed data to the 
>> outside through Thrift APIs and how to design those data APIs in an 
>> application-specific manner. In the persisting task, you also have to make 
>> sure data integrity is preserved; for example, for a Gaussian parsed output 
>> you might have to validate it against a schema before persisting it in the 
>> database. 
>> 
>> Thanks
>> Dimuthu
>> 
>> On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake 
>> <lahiruj...@cse.mrt.ac.lk> wrote:
>> Hi Everyone,
>> 
>> I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718 [2]. 
>> Any comments would be very helpful to improve it.
>> 
>> [1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
>> [2] https://issues.apache.org/jira/browse/AIRAVATA-2718
>> 
>> Thanks & Regards,
>> -- 
>> Lahiru Jayathilake
>> Department of Computer Science and Engineering,
>> Faculty of Engineering,
>> University of Moratuwa
>> 
>>  <https://lk.linkedin.com/in/lahirujayathilake>
> 
> 
> 
> 
> -- 
> Lahiru Jayathilake
> Department of Computer Science and Engineering,
> Faculty of Engineering,
> University of Moratuwa
> 
>  <https://lk.linkedin.com/in/lahirujayathilake>
