Yes, +1 on the detailed email summaries.
Marlon

From: Suresh Marru <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, June 22, 2018 at 8:46 AM
To: Airavata Dev <[email protected]>
Cc: Supun Nakandala <[email protected]>
Subject: Re: [GSoC] Re-architect Output Data Parsing into Airavata core

Hi Lahiru,

Thank you for sharing the detailed summary. I do not have comments on your questions; maybe Supun can weigh in. I do have a couple of meta requests, though:

Can you consider adding a few molecular dynamics parsers, in this order: LAMMPS, Amber, and CHARMM? The cclib library you used for the others does not cover these, but InterMol [1] provides a Python library to parse them. We have to be careful here: InterMol itself is MIT licensed and we can take it as a dependency, but it depends on ParmEd [2], which is LGPL licensed. It is a TODO for me to work out how to deal with this, but please see if you can include these parsers in your timeline.

Can you evaluate whether we can provide export to the Quantum Chemistry JSON Schema [3]? If this is trivial, we can pursue it.

Lastly, can you see if Apache Tika will help with any of your efforts?

Again, kudos for your mailing list communications,
Suresh

[1] https://github.com/shirtsgroup/InterMol
[2] https://github.com/ParmEd/ParmEd
[3] https://github.com/MolSSI/QC_JSON_Schema

On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <[email protected]> wrote:

Hi Everyone,

In the last couple of days, I've been working on the data parsing tasks. To give an update, I have already converted the code base of the Gaussian, Molpro, NWChem, and GAMESS parsers to Python [1]. Compared to the seagrid-data code base, there is no experiment-related code in this project (for example, no JSON mappings). The main reason for this is to decouple experiments from the data parsing tasks.

While converting the Gaussian, Molpro, NWChem, and GAMESS code, I found that some JSON key-value pairs in the data-catalog Docker container have not been used by seagrid-data to generate the final output file. I have commented out the unused key-value pairs in the code itself [2], [3], [4], [5]. I would like to know whether there is any specific reason for this; I hope @Supun Nakandala can answer it.

The next update is about the data parsing architecture. The new requirement is to come up with a framework that can parse any kind of document into a known type when the metadata is given. With this new design, data parsing will not be restricted to experiments (Gaussian, Molpro, etc.). The following architecture is designed according to the requirements specified by @dimuthu in the last GSoC meeting. The diagram below depicts the top-level architecture.

<suggested architecture.png>

Following are the key components.

Abstract Parser
This is a basic template for a Parser which specifies the parameters required for a parsing task, for example the input file type, output file type, and experiment type (if the task is related to an experiment).

Parser Manager
Constructs the set of parsers considering the input file type, output file type, and experiment type. The Parser Manager builds a graph and finds the shortest path between the input file type and the output file type, then returns the constructed set of Parsers (a rough sketch of this lookup follows the component list).

<graph.png>

Catalog
A mapping whose records point to the Docker container that can be used to parse one file type into another. For example, if the requirement is to parse a Gaussian .out file to JSON, then the "app/gaussian .out to JSON" Docker container will be fetched.
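[Editor's note: the snippet below is a minimal, illustrative sketch of the catalog lookup described above, modelled as a breadth-first search over catalog entries. The type names, catalog entries, and parser identifiers are hypothetical and not the project's actual data model.]

# Illustrative sketch only: catalog entries and parser identifiers are made up.
from collections import deque

# catalog: (source type, target type) -> parser identifier
# (an in-code parser or a Docker image such as "app/gaussian-out-to-json")
CATALOG = {
    ("pdf", "text"): "parser/pdf-to-text",
    ("text", "json"): "parser/text-to-json",
    ("json", "xml"): "parser/json-to-xml",
    ("gaussian.out", "gaussian.json"): "app/gaussian-out-to-json",
}

def find_parser_chain(source, target):
    """Return the shortest list of parsers converting `source` into `target`."""
    # Build an adjacency list from the catalog.
    edges = {}
    for (src, dst), parser in CATALOG.items():
        edges.setdefault(src, []).append((dst, parser))

    queue = deque([(source, [])])
    visited = {source}
    while queue:
        current, chain = queue.popleft()
        if current == target:
            return chain
        for nxt, parser in edges.get(current, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append((nxt, chain + [parser]))
    return None  # no conversion path exists in the catalog

print(find_parser_chain("pdf", "xml"))
# ['parser/pdf-to-text', 'parser/text-to-json', 'parser/json-to-xml']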
Parsers
There are two types of parsers (according to the suggested approach).

The first type is parsers coded directly into the project code base. For example, parsing a text file to JSON is straightforward, so it is not necessary to maintain a separate Docker container for that conversion; using a library and adding an entry to the catalog is enough to get the work done.

The second type is parsers backed by a separate Docker container, for example the Gaussian .out file to JSON container.

For the overall scenario, consider the following examples.

Example 1
Suppose a PDF should be parsed to XML. The Parser Manager will look up the catalog and find the shortest path from PDF to XML. The available parsers (both the parsers coded into the project and the dockerized parsers) are:
• PDF to Text
• Text to JSON
• JSON to XML
• application/gaussian .out to JSON (a very specific parsing mechanism, not comparable to parsing a simple .out file to JSON)
and the rest which I have included in the diagram.

The Parser Manager will then construct the graph and find the shortest path, PDF -> Text -> JSON -> XML, from the available parsers.

<graph 2.png>

The Parser Manager will then return 3 parsers, and from those three parsers a DAG will be constructed as follows.

<parser dag.png>

The reason for the architectural decision to use three parsers rather than a single one is that if a parser fails, it is easy to identify which one failed.

Example 2
Consider a separate example: parsing a Gaussian .out file to JSON. This is straightforward. As in the previous example, the Parser Manager will construct a Parser that links to the dockerized app/gaussian .out to JSON container (a rough sketch of such a container invocation follows this update).

Example 3
The problem arises when a Gaussian .out file needs to be parsed to XML. There are two options.

1st option - If application-related parsing is required, there must be application-typed parsers available to do the work; otherwise the conversion is not allowed. In the list of parsers, there is no application-related parser that converts a .out file to XML, so even though the Parser Manager could construct a path like .out/gaussian -> JSON/gaussian -> XML, this path is not allowed.

2nd option - Once the application-specific content has been parsed, the rest is the same as converting a normal JSON file to XML, so we could allow the path .out/gaussian -> JSON/gaussian -> XML.

What should actually be done, the 1st option or the 2nd option? This is one point where I need a suggestion.

I would really appreciate any suggestions to improve this.

[1] https://github.com/Lahiru-J/airavata-data-parser
[2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288
[3] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175
[4] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
[5] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175

Cheers,
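[Editor's note: a rough sketch of how a dockerized parser such as the one in Example 2 might be invoked. The image name ("app/gaussian-out-to-json"), its command-line contract, and the /data mount point are assumptions for illustration only, not the project's actual convention.]

# Hypothetical invocation of a dockerized parser step.
import os
import subprocess

def run_docker_parser(image, input_file, output_file, workdir=None):
    """Run a containerized parser that reads input_file and writes output_file."""
    workdir = workdir or os.getcwd()
    cmd = [
        "docker", "run", "--rm",
        "-v", f"{workdir}:/data",  # share the working directory with the container
        image,
        f"/data/{os.path.basename(input_file)}",
        f"/data/{os.path.basename(output_file)}",
    ]
    # check=True raises if the container exits non-zero, so a failing step
    # in the parser chain is easy to pinpoint.
    subprocess.run(cmd, check=True)

run_docker_parser("app/gaussian-out-to-json", "experiment.out", "experiment.json")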
On 28 May 2018 at 18:05, Lahiru Jayathilake <[email protected]> wrote:

Note: this is the high-level architecture diagram (since it was not visible in the previous email).

<Screen Shot 2018-05-28 at 9.30.43 AM.png>

Thanks,
Lahiru

On 28 May 2018 at 18:02, Lahiru Jayathilake <[email protected]> wrote:

Hi Everyone,

During the past few days, I've been implementing the tasks related to data parsing. As a heads-up, the following image depicts the top-level architecture of the implementation. The following main task components have been identified.

1. DataParsing Task
This task gets the stored output, finds the matching parser (Gaussian, LAMMPS, QChem, etc.), and sends the output through the selected parser to produce a well-structured JSON document.

2. Validating Task
This task validates whether the desired JSON output has been achieved, i.e. the JSON output should match the respective schema (Gaussian schema, LAMMPS schema, QChem schema, etc.).

3. Persisting Task
This task persists the validated JSON outputs. The successfully stored outputs will be exposed to the outside world.

According to the diagram, the generated JSON should be shared between the tasks (DataParsing, Validating, and Persisting). Neither the DataParsing task nor the Validating task persists the JSON, so the Helix task framework has to share the content between the tasks. The Helix tutorial [1] explains how to share content between Helix tasks. The problem is that the given method [2] can only share String-typed key-value data. I can come up with an implementation that shares all the values of the JSON output, but that involves calling this method [2] many times. I believe that is not very efficient, because the Helix task framework would have to call this method [3] many times (taking into consideration that the generated JSON output can be large). I have already sent an email to the Helix mailing list to clarify whether there is another way, and whether calling this method [2] multiple times would be efficient enough to get the work done.

Am I on the right track? Your suggestions would be very helpful; please add anything that is missing.

[1] http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs
[2] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75
[3] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361

Thanks,
Lahiru

On 26 March 2018 at 19:44, Lahiru Jayathilake <[email protected]> wrote:

Hi Dimuthu, Suresh,

Thanks a lot for the feedback. I will update the proposal accordingly.

Regards,
Lahiru

On 26 March 2018 at 08:48, Suresh Marru <[email protected]> wrote:

Hi Lahiru,

I echo Dimuthu's comment. You have a good starting point; it will be nice if you can cover how users can interact with the parsed data, essentially adding API access to the parsed-metadata database and having proof-of-concept UIs. This task could be challenging, since the queries are very data specific, and generalizing API access and building custom UIs may be the exploratory (less defined) portions of your proposal.

Cheers,
Suresh

On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <[email protected]> wrote:

Hi Lahiru,

Nice document, and I like how you illustrate the system through diagrams. However, try to address how you are going to expose the parsed data externally through Thrift APIs, and how to design those data APIs in an application-specific manner. In the persisting task, you also have to make sure data integrity is preserved. For example, for a Gaussian parsed output, you might have to validate the parsed output against a schema before persisting it in the database.

Thanks
Dimuthu
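[Editor's note: a minimal sketch of the kind of schema check the Validating Task could perform before persisting, as suggested above. The schema and file names are hypothetical; only the jsonschema usage is standard.]

# Hypothetical validation step: reject parsed output that does not match the schema.
import json
from jsonschema import validate, ValidationError

with open("gaussian_schema.json") as f:
    gaussian_schema = json.load(f)

with open("gaussian_parsed_output.json") as f:
    parsed_output = json.load(f)

try:
    validate(instance=parsed_output, schema=gaussian_schema)
except ValidationError as err:
    # Fail the task instead of persisting an inconsistent record.
    raise SystemExit(f"Parsed output failed schema validation: {err.message}")

print("Parsed output conforms to the Gaussian schema; safe to persist.")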
On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <[email protected]> wrote:

Hi Everyone,

I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718 [2]. Any comments would be very helpful to improve it.

[1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/AIRAVATA-2718

Thanks & Regards,

--
Lahiru Jayathilake
Department of Computer Science and Engineering,
Faculty of Engineering,
University of Moratuwa
