Hi Devs, Sorry for joining in late.
Regarding the challenges that Lahiru mentioned, I think it is a question of whether to use configurations or conventions. Personally, I prefer a convention-based approach in this context, as adding more and more configurations will make the system more cumbersome for users, though I agree it has its downsides too. As Lahiru mentioned, a UI-based approach would be better here: it shields users from the complexity of the configurations and provides an intuitive interface. Overall, I feel the task of output data parsing aligns perfectly with the new Airavata architecture for distributed task execution. Maybe we should brainstorm what it would take to incorporate these parsers into the application catalog (extend it, or abstract out a generic catalog). If we can do that without making it overly complicated, I feel it is a good direction to follow up on.

On Sat, Aug 11, 2018 at 12:50 PM Lahiru Jayathilake <[email protected]> wrote:

> Hi Everyone,
>
> First of all, Suresh, Marlon, and Dimuthu, thanks for the suggestions and comments.
>
> Yes Suresh, we can include QC_JSON_Schema[1] in airavata-data-parser[2]. However, there is a challenge in using InterMol[3]: as you mentioned, its dependency ParmEd[4] is LGPL-licensed, and according to the Apache Legal page, LGPL dependencies are not to be used.
>
> About the Data Parsing Project, yes, I did look into whether I can use Apache Tika, and there are some challenges in building a generic framework with it. I will discuss them in detail later in this same thread.
>
> *This is an update on what I have accomplished.*
>
> I have created a separate parser project [2] for Gaussian, Gamess, Molpro, and NwChem. One advantage of separating the Gaussian, Molpro, etc. code from the core is that it keeps the parsers highly cohesive and loosely coupled, which makes them far easier to maintain.
>
> Next, regarding the Data Parsing Framework.
>
> As I mentioned in my previous email, I have implemented the Data Parsing Framework with some additional features. The method of achieving the goal had to be changed slightly in order to accommodate some of them. I will start from the bottom.
>
> Here is the scenario: a user has to define the Catalog Entries. A Catalog Entry is nothing more than the basic key-value properties of a Dockerized parser. The following image shows an example of how I have defined it.
>
> The above image shows the entry corresponding to the Dockerized Gaussian Parser. There are both mandatory and optional properties. For example, *dockerImageName*, *inputFileExtension*, and *dockerWorkingDirPath* must be stated, whereas properties like *securityOpt* and *envVariables* are optional. Some of the properties are needed to run the Docker container.
>
> There are two special properties called *applicationType* and *operation*. *applicationType* states whether the Docker container is for parsing Gaussian, Molpro, NwChem, or Gamess files. *operation* indicates that the parser performs some operation on the file, for example converting all the text characters to lower/upper case, removing the last *n* lines, appending some text, and so on. A Dockerized Parser cannot have both an application and an operation; it is either an application, an operation, or neither. (This is a design decision.)
>
> For the time being, the catalog is a JSON file which the Data Parsing Framework picks up from a user-given file path.
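A minimal sketch of what such a catalog entry could look like (the property names follow the description above; the image name, paths, and values are hypothetical):

```json
{
  "dockerImageName": "example/gaussian-parser:latest",
  "inputFileExtension": ".out",
  "outputFileExtension": ".json",
  "dockerWorkingDirPath": "/parser/data",
  "applicationType": "gaussian",
  "securityOpt": "seccomp=unconfined",
  "envVariables": {
    "LOG_LEVEL": "info"
  }
}
```

Per the design decision above, this entry carries an *applicationType* and therefore no *operation*; a purely operational parser (e.g. a lower-casing one) would have *operation* set and *applicationType* omitted.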
> For further explanation, consider the following set of parsers. Note that I have only mentioned the most essential properties, just enough to explain the example scenarios.
>
> Once the user has defined the catalog entries, the Data Parsing Framework expects a Parser Request to parse a given file. Consider that the user has given the following set of Parser Requests.
>
> Once the above steps are complete, the baton is in the hands of the Data Parsing Framework.
>
> At runtime the Data Parsing Framework picks up the Catalog File and builds a directed graph G(V,E) from the indicated parsers. I have already given a detailed summary of how the path is identified, but in this implementation I changed it a little to support application parsing (Gaussian, Molpro, etc.) as well as multiple operations on a single file. Every vertex of the graph is a file extension type and every edge of the graph represents a Catalog Entry. The Data Parsing Framework then generates the directed graph as follows.
>
> The graph is based on the aforementioned Catalog Parsers, and only the required properties are shown on the graph edges for simplicity.
>
> This is how it connects the file extensions. In the previous method we had nodes like *.out/gaussian*; instead of that, multiple edges between extensions are allowed here.
>
> When a parser request comes in, the Data Parsing Framework finds the shortest possible path that fulfills all the requirements for parsing the particular file. The following DAGs will be created for the aforementioned parser requests.
>
> *Parser Requests*
>
> *Parser Request 1*
>
> This is straightforward: the *P6* parser is selected to parse *.txt* to *.xml*.
>
> *Parser Request 2*
>
> The file should go through a Gaussian parser. *P1* is selected as the Gaussian parser, but that parser's output file extension is *.json*. Since the request expects the output file extension to be *.xml*, the *P7* parser is appended at the end of the DAG.
>
> *Parser Request 3*
>
> Similar to Parser Request 2, but an extra operation needs to be incorporated: the file's text should be converted to lower case. Only the *P9* parser exhibits the desired property, hence the *P9* parser is also used when creating the DAG.
>
> *Parser Request 4*
>
> Similar to Parser Request 3; however, in this case two more operations should be considered, *operation1* and *operation2*. The *P11* and *P12* parsers provide those operations respectively, hence they are used when creating the DAG.
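To make the selection logic above concrete, here is a rough, simplified sketch (not the framework's actual code) of how such a constrained shortest path could be found: vertices are file extensions, edges are catalog entries, and a breadth-first search over partial parser chains returns the shortest chain that still covers the requested application type and operations. All class, field, and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical catalog entry reduced to the properties needed for path finding.
class CatalogEntry {
    final String name;                 // e.g. "P1"
    final String inputFileExtension;   // e.g. ".out"
    final String outputFileExtension;  // e.g. ".json"
    final String applicationType;      // e.g. "gaussian", or null
    final String operation;            // e.g. "lowercase", or null

    CatalogEntry(String name, String in, String out, String app, String op) {
        this.name = name;
        this.inputFileExtension = in;
        this.outputFileExtension = out;
        this.applicationType = app;
        this.operation = op;
    }
}

class ParserPathFinder {

    /** Breadth-first search over partial parser chains; the first complete chain found is a
     *  shortest one. Fine for a handful of catalog entries, not tuned for large catalogs. */
    static List<CatalogEntry> findPath(List<CatalogEntry> catalog, String inputExt, String outputExt,
                                       String applicationType, Set<String> requiredOperations) {
        Deque<List<CatalogEntry>> queue = new ArrayDeque<>();
        queue.add(new ArrayList<>());                            // empty chain, positioned at inputExt
        while (!queue.isEmpty()) {
            List<CatalogEntry> path = queue.poll();
            String currentExt = path.isEmpty() ? inputExt
                    : path.get(path.size() - 1).outputFileExtension;
            if (currentExt.equals(outputExt) && satisfies(path, applicationType, requiredOperations)) {
                return path;
            }
            if (path.size() >= catalog.size()) {
                continue;                                        // crude guard against cycles
            }
            for (CatalogEntry e : catalog) {
                if (!e.inputFileExtension.equals(currentExt)) {
                    continue;
                }
                // an application parser (e.g. the Gaussian one) may only appear first in the chain
                if (e.applicationType != null
                        && (!path.isEmpty() || !e.applicationType.equals(applicationType))) {
                    continue;
                }
                List<CatalogEntry> extended = new ArrayList<>(path);
                extended.add(e);
                queue.add(extended);
            }
        }
        return null;                                             // no chain satisfies the request
    }

    private static boolean satisfies(List<CatalogEntry> path, String applicationType,
                                     Set<String> requiredOperations) {
        boolean appCovered = (applicationType == null);
        Set<String> coveredOps = new HashSet<>();
        for (CatalogEntry e : path) {
            if (applicationType != null && applicationType.equals(e.applicationType)) {
                appCovered = true;
            }
            if (e.operation != null) {
                coveredOps.add(e.operation);
            }
        }
        return appCovered && coveredOps.containsAll(requiredOperations);
    }
}
```

With catalog entries corresponding to P1...P12 above, a call such as findPath(catalog, ".out", ".xml", "gaussian", Collections.emptySet()) would select a chain like P1 followed by P7, matching Parser Request 2.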
> I completed these parts a couple of weeks back. In a discussion, Dimuthu suggested that I should try to make the framework generic. The goal was to avoid declaring properties such as "*inputFileExtension*", "*outputFileExtension*", etc. at the code level. The user is the one defining that vocabulary in the Catalog Entries and in the Parser Request, and the Data Parsing Framework should be capable of taking any kind of metadata keys and creating the parser DAG.
>
> For example, one user can specify the input file extension as "*inputFileExtension*", another as "*inputEx*", another as "*input*", and another as "*x*". The way it is defined is totally up to the user.
>
> While I was researching a solution for this, I faced some challenges.
>
> *Challenge 1*
>
> Without knowing the exact names of the keys, it is not possible to identify the dependencies between such keys. For example, *inputFileExtension* and *outputFileExtension* exhibit a dependency: the 1st parser's *outputFileExtension* should be equal to the 2nd parser's *inputFileExtension*.
>
> One solution to overcome this problem is for the user to state those kinds of dependency relationships between keys. For example, suppose a user has defined the keys for the input file extension and the output file extension as "*x*" and "*y*" respectively. Then they have to indicate that relationship using some kind of notation (e.g. *key(y) ≊ key(x)*) inside a separate file.
>
> *Challenge 2*
>
> When a file has to be parsed through an application, the parser with the application should always come first. For example, suppose I need to parse a file through Gaussian plus some kind of an operation. Then I cannot first parse the file through the parser which holds the operation and afterwards pass it through the Gaussian parser; I always have to parse it through the Gaussian parser first and then pass the resulting content through the other one.
>
> As I mentioned earlier, this could also be solved by maintaining a file which records which parsers should be given priority.
>
> Overall, we can never know in advance what kinds of keys there will be, what kinds of relationships should be maintained between those keys, and so on. The solution to the above challenges was to introduce yet another file which maintains all the relationships, dependencies, priorities, etc. held by keys and parsers. Our main goal in making this Data Parsing Framework generic is to make the user's life easier, but with this approach that goal becomes quite hard to achieve.
>
> There was another solution suggested by Dimuthu, which is to come up with a UI-based framework where users can point to their Catalog Entries and drag and drop Parsers to build their own DAG. This is the next milestone we are going to work towards. However, I am still looking for a solution that gets the work done by extending/modifying the work I have already completed.
>
> A detailed description of this Data Parsing Framework, with contributions, can be found here[5].
>
> Cheers,
> Lahiru
>
> [1] https://github.com/MolSSI/QC_JSON_Schema
> [2] https://github.com/Lahiru-J/airavata-data-parser/tree/master/datacat
> [3] https://github.com/shirtsgroup/InterMol
> [4] https://github.com/ParmEd/ParmEd
> [5] https://medium.com/@lahiru_j/gsoc-2018-re-architect-output-data-parsing-into-airavata-core-81da4b37057e
>
> On 26 June 2018 at 21:27, DImuthu Upeksha <[email protected]> wrote:
>
>> Hi Lahiru,
>>
>> Thanks for sharing this with the dev list. I would like to suggest a few changes to your data parsing framework. Please have a look at the following diagram.
>>
>> I would like to walk through a sample use case so that you can understand the data flow.
>>
>> I have an application output file called gaussian.out and I need to parse it to a JSON file. However, you have a parser that can parse Gaussian files into XML format, and you have another parser that can parse XML files into JSON. You have a parser catalog that contains all the details about the parsers you currently have, and you can filter out the necessary parsers based on metadata like application type, output type, input type, etc.
>>
>> The challenge is how we are going to combine these two parsers in the correct order, and how the data passing between these parsers is going to be handled. That's where we need a workflow manager.
>> The workflow manager gets your requirement, then talks to the catalog to fetch the necessary parser information and builds the correct parser DAG. Once the DAG is finalized, it can be passed to Helix to execute. There could be multiple DAGs that can achieve the same requirement, but the workflow manager should select the most constrained path.
>>
>> What do you think?
>>
>> Thanks
>> Dimuthu
>>
>> On Fri, Jun 22, 2018 at 8:49 AM, Pierce, Marlon <[email protected]> wrote:
>>
>>> Yes, +1 on the detailed email summaries.
>>>
>>> Marlon
>>>
>>> *From:* Suresh Marru <[email protected]>
>>> *Reply-To:* "[email protected]" <[email protected]>
>>> *Date:* Friday, June 22, 2018 at 8:46 AM
>>> *To:* Airavata Dev <[email protected]>
>>> *Cc:* Supun Nakandala <[email protected]>
>>> *Subject:* Re: [GSoC] Re-architect Output Data Parsing into Airavata core
>>>
>>> Hi Lahiru,
>>>
>>> Thank you for sharing the detailed summary. I do not have comments on your questions; maybe Supun can weigh in. I have a couple of meta requests though:
>>>
>>> Can you consider adding a few molecular dynamics parsers, in this order: LAMMPS, Amber, and CHARMM. The cclib library you used for the others does not cover these, but InterMol [1] provides a Python library to parse them. We have to be careful here: InterMol itself is MIT licensed and we can have it as a dependency, but it depends upon ParmEd [2], which is LGPL licensed. It is a TODO for me to figure out how to deal with this, but please see if you can include adding these parsers in your timeline.
>>>
>>> Can you evaluate whether we can provide export to the Quantum Chemistry JSON Schema [3]? If this is trivial, we can pursue it.
>>>
>>> Lastly, can you see if Apache Tika will help with any of your efforts.
>>>
>>> I will say my kudos again for your mailing list communications,
>>> Suresh
>>>
>>> [1] - https://github.com/shirtsgroup/InterMol
>>> [2] - https://github.com/ParmEd/ParmEd
>>> [3] - https://github.com/MolSSI/QC_JSON_Schema
>>>
>>> On Jun 22, 2018, at 12:37 AM, Lahiru Jayathilake <[email protected]> wrote:
>>>
>>> Hi Everyone,
>>>
>>> In the last couple of days, I've been working on the data parsing tasks. To give an update, I have already converted the code-base of the Gaussian, Molpro, NwChem, and Gamess parsers to Python[1]. Compared to the seagrid-data code-base, there will not be any experiment-related code in the project (for example, no JSON mappings). The main reason for doing this is to de-couple experiments from the data parsing tasks.
>>>
>>> While I was converting the Gaussian, Molpro, NwChem, and Gamess code, I found that some JSON key-value pairs in the data-catalog docker container are not used by seagrid-data to generate the final output file. I have commented out the unused key-value pairs in the code itself [2], [3], [4], [5]. I would like to know whether there is any specific reason for this; hopefully @Supun Nakandala can answer it.
>>>
>>> The next update is about the data parsing architecture.
>>>
>>> The new requirement is to come up with a framework which is capable of parsing any kind of document to a known type when the metadata is given. With this new design, data parsing will not be restricted only to experiments (Gaussian, Molpro, etc.).
>>> The following architecture is designed according to the requirements specified by @dimuthu in the last GSoC meeting.
>>>
>>> The following diagram depicts the top-level architecture.
>>>
>>> <suggested architecture.png>
>>>
>>> These are the key components.
>>>
>>> *Abstract Parser*
>>>
>>> This is the basic template for a Parser; it specifies the parameters required for a parsing task, for example input file type, output file type, experiment type (if the task is related to an experiment), etc.
>>>
>>> *Parser Manager*
>>>
>>> Constructs the set of parsers considering the input file type, output file type, and the experiment type. The Parser Manager will construct a graph to find the shortest path between the input file type and the output file type, and then return the constructed set of Parsers.
>>>
>>> <graph.png>
>>>
>>> *Catalog*
>>>
>>> A mapping whose records identify the Docker container that can be used to parse from one file type to another. For example, if the requirement is to parse a *Gaussian .out file to JSON*, then the *"app/gaussian .out to JSON"* docker container will be fetched.
>>>
>>> *Parsers*
>>>
>>> There are two types of parsers (in the suggested design).
>>>
>>> The first type is parsers that are coded directly into the project code-base. For example, parsing a text file to JSON is straightforward, so it is not necessary to maintain a separate docker container for it; using a library and putting an entry into the catalog is enough to get the work done.
>>>
>>> The second type is parsers which have a separate docker container, for example the Gaussian .out file to JSON docker container.
>>>
>>> For the overall scenario, consider the following examples to get an idea.
>>>
>>> *Example 1*
>>>
>>> Suppose a PDF should be parsed to XML.
>>>
>>> The Parser Manager will look up the catalog and find the shortest path to get XML output from the PDF. The available parsers (both the parsers coded in the project and the dockerized parsers) are:
>>>
>>> • PDF to Text
>>> • Text to JSON
>>> • JSON to XML
>>> • application/gaussian .out to JSON (this is a very specific parsing mechanism, not the same as parsing a simple .out file to JSON)
>>>
>>> plus the rest, which I have included in the diagram.
>>>
>>> The Parser Manager will then construct the graph and find the shortest path, *PDF -> Text -> JSON -> XML*, from the available parsers.
>>>
>>> <graph 2.png>
>>>
>>> The Parser Manager will then return 3 Parsers. From the three parsers a DAG will be constructed as follows.
>>>
>>> <parser dag.png>
>>>
>>> The reason for the architectural decision to have three parsers rather than doing everything in a single parser is that if one of the parsers fails, it is easy to identify which one it was.
>>>
>>> *Example 2*
>>>
>>> Consider a separate example of parsing a Gaussian *.out* file to *JSON*; this is pretty straightforward. As in the previous example, a Parser will be constructed linking the dockerized *app/gaussian .out to JSON* container.
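A rough sketch of how a single task in the parser DAG might launch one such dockerized parser, driven by catalog properties such as *dockerImageName*, *dockerWorkingDirPath*, and *securityOpt* from the catalog entry description earlier in the thread. The class and method names are hypothetical; only the docker CLI flags are standard.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class DockerParserRunner {

    static void runParser(String dockerImageName, String dockerWorkingDirPath,
                          String securityOpt, Path hostWorkDir)
            throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("docker");
        cmd.add("run");
        cmd.add("--rm");
        // mount the directory holding the input file (and receiving the output file)
        cmd.add("-v");
        cmd.add(hostWorkDir.toAbsolutePath() + ":" + dockerWorkingDirPath);
        if (securityOpt != null) {
            cmd.add("--security-opt");
            cmd.add(securityOpt);
        }
        cmd.add(dockerImageName);

        Process process = new ProcessBuilder(cmd).inheritIO().start();
        if (process.waitFor() != 0) {
            throw new IOException("Parser container " + dockerImageName + " exited with an error");
        }
    }
}
```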
>>> *Example 3*
>>>
>>> The problem arises when a Gaussian *.out* file needs to be parsed to *XML*. There are two options.
>>>
>>> *1st option* - If application-related parsing has to happen, there must be application-typed parsers to get the work done; otherwise it is not allowed. In the list of parsers there is no application-related parser that converts a *.out* file to *XML*, so even though the Parser Manager could construct a path like *.out/gaussian -> JSON/gaussian -> XML*, that path is not allowed.
>>>
>>> *2nd option* - Once the application-specific content has been parsed, the rest is the same as converting a normal JSON to XML, so we could allow the path *.out/gaussian -> JSON/gaussian -> XML*.
>>>
>>> Which should it be, the 1st option or the 2nd option? This is one point where I need a suggestion.
>>>
>>> I would really appreciate any suggestions to improve this.
>>>
>>> [1] https://github.com/Lahiru-J/airavata-data-parser
>>> [2] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gaussian/gaussian.py#L191-L288
>>> [3] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/gamess/gamess.py#L76-L175
>>> [4] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
>>> [5] https://github.com/Lahiru-J/airavata-data-parser/blob/master/datacat/molpro/molpro.py#L76-L175
>>>
>>> Cheers,
>>>
>>> On 28 May 2018 at 18:05, Lahiru Jayathilake <[email protected]> wrote:
>>>
>>> Note this is the high-level architecture diagram. (Since it was not visible in the previous email.)
>>>
>>> <Screen Shot 2018-05-28 at 9.30.43 AM.png>
>>>
>>> Thanks,
>>> Lahiru
>>>
>>> On 28 May 2018 at 18:02, Lahiru Jayathilake <[email protected]> wrote:
>>>
>>> Hi Everyone,
>>>
>>> During the past few days, I’ve been implementing the tasks related to Data Parsing. To give a heads-up, the following image depicts the top-level architecture of the implementation.
>>>
>>> [image: Image removed by sender.]
>>>
>>> The following main task components have been identified:
>>>
>>> *1. DataParsing Task*
>>>
>>> This task will take the stored output, find the matching Parser (Gaussian, Lammps, QChem, etc.), and send the output through the selected parser to get a well-structured JSON.
>>>
>>> *2. Validating Task*
>>>
>>> This task validates whether the desired JSON output has been achieved, i.e. the JSON output should match the respective schema (Gaussian Schema, Lammps Schema, QChem Schema, etc.).
>>>
>>> *3. Persisting Task*
>>>
>>> This task will persist the validated JSON outputs.
>>>
>>> The successfully stored outputs will be exposed to the outside world.
>>>
>>> According to the diagram, the generated JSON should be shared between the tasks (DataParsing, Validating, and Persisting tasks). Neither the DataParsing task nor the Validating task persists the JSON; therefore, the Helix task framework should take care of sharing the content between the tasks.
>>>
>>> The Helix tutorial [1] describes how to share content between Helix tasks. The problem is that the given method [2] is only capable of sharing String-typed key-value data.
>>>
>>> However, I can come up with an implementation that shares all the values related to the JSON output. That involves calling this method [2] many times. I believe that is not a very efficient approach, because the Helix task framework has to call this method [3] many times (taking into consideration that the generated JSON output can be large).
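A rough sketch of the approach described above, assuming the UserContentStore API linked in [2] (one putUserContent call per value, at workflow scope so downstream tasks can read it back). Everything other than the Helix Task/UserContentStore interfaces is hypothetical, including the class name and the sample values.

```java
import java.util.Map;

import org.apache.helix.task.Task;
import org.apache.helix.task.TaskResult;
import org.apache.helix.task.UserContentStore;

// Sketch of a parsing task that flattens the generated JSON into String key-value pairs and
// stores each one with a separate putUserContent() call, so the Validating/Persisting tasks
// can fetch them later. The per-value calls are the part that becomes expensive for large outputs.
public class DataParsingTask extends UserContentStore implements Task {

    @Override
    public TaskResult run() {
        Map<String, String> parsedValues = parseOutputFile();   // e.g. {"scfEnergy": "-76.4", ...}
        for (Map.Entry<String, String> entry : parsedValues.entrySet()) {
            putUserContent(entry.getKey(), entry.getValue(), Scope.WORKFLOW);
        }
        return new TaskResult(TaskResult.Status.COMPLETED, "parsing done");
    }

    @Override
    public void cancel() {
        // nothing to clean up in this sketch
    }

    private Map<String, String> parseOutputFile() {
        // placeholder for the real parser invocation
        return Map.of("scfEnergy", "-76.4", "numberOfAtoms", "3");
    }
}
```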
>>> I have already sent an email to the Helix mailing list to clarify whether there is another way, and also whether it will be efficient if this method [2] is called multiple times to get the work done.
>>>
>>> Am I on the right track? Your suggestions would be very helpful; please add anything that is missing.
>>>
>>> [1] http://helix.apache.org/0.8.0-docs/tutorial_task_framework.html#Share_Content_Across_Tasks_and_Jobs
>>> [2] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/UserContentStore.java#L75
>>> [3] https://github.com/apache/helix/blob/helix-0.6.x/helix-core/src/main/java/org/apache/helix/task/TaskUtil.java#L361
>>>
>>> Thanks,
>>> Lahiru
>>>
>>> On 26 March 2018 at 19:44, Lahiru Jayathilake <[email protected]> wrote:
>>>
>>> Hi Dimuthu, Suresh,
>>>
>>> Thanks a lot for the feedback. I will update the proposal accordingly.
>>>
>>> Regards,
>>> Lahiru
>>>
>>> On 26 March 2018 at 08:48, Suresh Marru <[email protected]> wrote:
>>>
>>> Hi Lahiru,
>>>
>>> I echo Dimuthu’s comment. You have a good starting point; it would be nice if you can also cover how users will interact with the parsed data, essentially adding API access to the parsed metadata database and having proof-of-concept UIs. This part could be challenging, as the queries are very data specific, so generalizing API access and building custom UIs may be the exploratory (less defined) portions of your proposal.
>>>
>>> Cheers,
>>> Suresh
>>>
>>> On Mar 25, 2018, at 8:12 PM, DImuthu Upeksha <[email protected]> wrote:
>>>
>>> Hi Lahiru,
>>>
>>> Nice document, and I like how you illustrate the systems through diagrams. However, try to address how you are going to expose the parsed data to the outside through Thrift APIs and how to design those data APIs in an application-specific manner. Also, in the persisting task you have to make sure data integrity is preserved. For example, for a Gaussian parsed output you might have to validate the parsed output against a schema before persisting it in the database.
>>>
>>> Thanks
>>> Dimuthu
>>>
>>> On Sun, Mar 25, 2018 at 5:05 PM, Lahiru Jayathilake <[email protected]> wrote:
>>>
>>> Hi Everyone,
>>>
>>> I have shared a draft proposal [1] for the GSoC project, AIRAVATA-2718 [2]. Any comments would be very helpful to improve it.
>>>
>>> [1] https://docs.google.com/document/d/1xhgL1w9Yn_c1d5PpabxJJNNLTbkgggasMBM-GsBjVHM/edit?usp=sharing
>>> [2] https://issues.apache.org/jira/browse/AIRAVATA-2718
>>>
>>> Thanks & Regards,
>>> --
>>> Lahiru Jayathilake
>>> Department of Computer Science and Engineering,
>>> Faculty of Engineering,
>>> University of Moratuwa
>>> <https://lk.linkedin.com/in/lahirujayathilake>
>
> --
> Lahiru Jayathilake
> Department of Computer Science and Engineering,
> Faculty of Engineering,
> University of Moratuwa
>
> <https://lk.linkedin.com/in/lahirujayathilake>
