Re: [O] Parsing Org-mode in Python
Hi Daniel,

Daniel Clemente writes:

> Are there already Python parsers for it?

Parsing generic JSON is fairly trivial in Python.

    import json
    # json.load parses the file into Python lists and dicts
    with open('file.json') as f:
        data = json.load(f)

The resulting "data" is then a bunch of Python lists and/or dicts matching whatever structure was output from org and is in the .json file. The schema in these three contexts is (will be) identical. At this point, Pythonistas can do what they want with "data".

Although, as I mentioned, I'd like to put another layer on this "raw" data structure which expresses/enforces the org schema as understood by the org-exporter. If I can figure out how to dump a representation of this schema from org I'll express it as a set of generated collections.namedtuple instances. We'll see.

> Should ox-json's output be as raw as possible (e.g. what your code
> produces now) or transformed to simpler JSON?
> (I think both formats should coexist).

I suppose there may be some use in "winnowing down" the structure. One thing I'm thinking about here is the narrowing done to support the "blog from anywhere" feature of Karl's lazyblorg mentioned in this thread. That can be done either on the emacs side or the Python side (or both, in principle).

However, my intention is to do as little modification of the org document structure as possible on the emacs side, in order to preserve details that may be interesting on the Python side in the future. Also, I'm still learning LISP but know Python fairly well, so I would rather do as much processing as possible on the Python side. :)

So far the only thing I see that needs to be stripped is the :parent property (and the :structure, which really should be resolved as a copy instead of being stripped). These cause the emacs-side data structure to become a Circular Object and thus break the emacs JSON dumper. I just noticed that Python's JSON dumper can do this kind of stripping implicitly and in general. It might be nice if someone were to add such a feature to the emacs JSON dumper but I don't plan to try this.

-Brett.
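As an illustration of doing "what they want" with the loaded data, here is a minimal sketch of a generic walker over nested lists and dicts. It assumes nothing about the org schema; the `iter_leaves` name and the sample JSON string are hypothetical and do not reflect the actual layout written by json-encode.

```python
import json


def iter_leaves(data, path=()):
    """Yield (path, value) for every non-container value in nested lists/dicts."""
    if isinstance(data, dict):
        for key, value in data.items():
            yield from iter_leaves(value, path + (key,))
    elif isinstance(data, list):
        for index, value in enumerate(data):
            yield from iter_leaves(value, path + (index,))
    else:
        yield path, data


# Hypothetical stand-in for the contents of the exported .json file.
sample = json.loads('["headline", {"level": 1, "title": ["Hello"]}]')
leaves = list(iter_leaves(sample))
```

Each path is a tuple of dict keys and list indexes, so a later layer could use it to locate a value in the raw structure.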
Re: [O] Parsing Org-mode in Python
On Wed, 08 Jan 2014 10:42:17 -0500, Brett Viren wrote:

>
> http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html
>
> In any case, here is the salient chunk:
>
> #+BEGIN_SRC elisp
> (require 'json)
> (let* ((tree (org-element-parse-buffer 'object nil)))
>   (org-element-map tree (append org-element-all-elements
>                                 org-element-all-objects '(plain-text))
>     (lambda (x)
>       (if (org-element-property :parent x)
>           (org-element-put-property x :parent "none"))
>       (if (org-element-property :structure x)
>           (org-element-put-property x :structure "none"))
>       ))
>   (write-region
>    (json-encode tree)
>    nil "foo.dat"))
> #+END_SRC
>

I like this very much. This output is much easier to parse than the source .org file, and it's still using the original Elisp parser (so you don't need a Python parser).

I hope ox-json.el gets into org-mode some day. Are there already Python parsers for it?

Should ox-json's output be as raw as possible (e.g. what your code produces now) or transformed to simpler JSON? (I think both formats should coexist).
Re: [O] Parsing Org-mode in Python
2014/1/8 Brett Viren

> Huh, maybe a transient failure? It's there for me right now. Here is
> the same message from GNU's archive:
>
> http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html

Got it, thanks! :-)

--
François Pinard http://pinard.progiciels-bpi.ca
Re: [O] Parsing Org-mode in Python
François Pinard writes:

> Brett Viren writes:
>
>> http://permalink.gmane.org/gmane.emacs.orgmode/79838
>
> This yields:
>
> ,
> | Not Found
> |
> | The requested URL /gmane.emacs.orgmode/79838 was not found on this server.
> `

Huh, maybe a transient failure? It's there for me right now. Here is the same message from GNU's archive:

http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html

In any case, here is the salient chunk:

#+BEGIN_SRC elisp
(require 'json)
;; Parse the current buffer down to object granularity.
(let* ((tree (org-element-parse-buffer 'object nil)))
  ;; Visit every element and object, replacing the back-references
  ;; that make the tree circular with a placeholder string.
  (org-element-map tree (append org-element-all-elements
                                org-element-all-objects '(plain-text))
    (lambda (x)
      (if (org-element-property :parent x)
          (org-element-put-property x :parent "none"))
      (if (org-element-property :structure x)
          (org-element-put-property x :structure "none"))
      ))
  ;; Encode the now acyclic tree and write it out.
  (write-region
   (json-encode tree)
   nil "foo.dat"))
#+END_SRC

This test is meant to run from inside an org-mode buffer which itself provides the fodder for the test. But, it shows the steps that I'll need to integrate into some new org export mechanism. The important part is nulling out the :parent and :structure (and maybe others?) properties in order to break their circular references. The heavy lifting is all in org-element-parse-buffer and json-encode.

>> At the end of the day one will have a DOM-style data structure
>> representing the initial org document.
>
> Keep me (us!) posted! :-)

Definitely!

-Brett.
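The circular-reference problem has a direct analogue in Python, where json.dumps raises ValueError on a structure that refers back to itself. The following is a minimal sketch, assuming dict-based nodes with hypothetical "parent" and "children" keys (not the actual layout produced by json-encode), and a hypothetical helper `break_cycles` that mirrors the "none" placeholder used in the elisp chunk.

```python
import json

# Build a tiny tree whose child points back at its parent.
node = {"type": "headline", "children": []}
child = {"type": "paragraph", "parent": node}
node["children"].append(child)

# The default circularity check refuses to encode the cycle.
try:
    json.dumps(node)
    circular_error = None
except ValueError as err:
    circular_error = err


def break_cycles(obj, keys=("parent", "structure")):
    """Replace the named back-reference keys with a placeholder, in place."""
    if isinstance(obj, dict):
        for key in keys:
            if key in obj:
                obj[key] = "none"
        for value in obj.values():
            break_cycles(value, keys)
    elif isinstance(obj, list):
        for item in obj:
            break_cycles(item, keys)


break_cycles(node)
encoded = json.dumps(node)  # succeeds once the cycle is gone
```

The placeholder is written before the walker descends into a node's values, so the walker never follows a back-reference and cannot loop.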
Re: [O] Parsing Org-mode in Python
Brett Viren writes:

> I'm also (slowly) working toward some Python-based org processing. My
> strategy is to produce an intermediate file in JSON format which is
> designed to capture the full org document structure. I am calling
> this a "shunt" export as it is meant to do as little interpretation of
> the document as possible.

Might be interesting, indeed!

> http://permalink.gmane.org/gmane.emacs.orgmode/79838

This yields:

,
| Not Found
|
| The requested URL /gmane.emacs.orgmode/79838 was not found on this server.
`

> At the end of the day one will have a DOM-style data structure
> representing the initial org document.

Keep me (us!) posted! :-)

François
Re: [O] Parsing Org-mode in Python
Hi Karl,

Karl Voit writes:

> Hi!
>
> * Daniel Clemente wrote:
>>>
>>> I dream of having a general Python parser for Org mode files, knowing
>>> every bit about the current syntax for Org files, surrounded by enough
>>> Python machinery to make it useful.
>
> Oh, this would be great since there are way more Python-coders out
> there than ELISP coders.

I agree.

I'm also (slowly) working toward some Python-based org processing. My strategy is to produce an intermediate file in JSON format which is designed to capture the full org document structure. I am calling this a "shunt" export as it is meant to do as little interpretation of the document as possible.

If this is interesting to you and you haven't already seen it, please check the thread from December where I got a lot of help to output this JSON via the new org export mechanism (I'm a LISP newbie). Here is the concluding post with a working example:

http://permalink.gmane.org/gmane.emacs.orgmode/79838

Besides any eventual Python-side development, one remaining gap in my plan is how to produce some kind of schema description using the org exporter machinery. I want to have this description generated automatically so that any future changes to the org format can be accommodated with some level of automation.

So, my current thinking is to find a way to exploit org export machinery to generate this schema (call it a "meta-shunt" export?). If I can find that I'll output it as another JSON file.

Then, on the Python side, I will read this schema file in and generate instances of collections.namedtuple. Finally, a reader of the JSON org document will be developed to produce objects of these namedtuple classes.

At the end of the day one will have a DOM-style data structure representing the initial org document.

-Brett.
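The schema-to-namedtuple step could be sketched as follows. The schema format used here (element type mapped to a list of property names) is purely an assumption for illustration, since no such schema export exists yet; `field_name`, `make_classes` and `build` are hypothetical helpers.

```python
import json
from collections import namedtuple

# Hypothetical schema file contents: element type -> property names.
SCHEMA_JSON = '{"headline": ["level", "raw-value", "tags"], "paragraph": ["text"]}'


def field_name(prop):
    """Turn an org-style property name into a valid Python identifier."""
    return prop.lstrip(":").replace("-", "_")


def make_classes(schema):
    """Generate one namedtuple class per element type; fields default to None."""
    classes = {}
    for element_type, props in schema.items():
        fields = [field_name(p) for p in props]
        classes[element_type] = namedtuple(
            field_name(element_type), fields, defaults=[None] * len(fields)
        )
    return classes


CLASSES = make_classes(json.loads(SCHEMA_JSON))


def build(element_type, props):
    """Instantiate the class for element_type from a dict of raw properties."""
    cls = CLASSES[element_type]
    return cls(**{field_name(k): v for k, v in props.items()})


headline = build("headline", {"level": 1, "raw-value": "Intro"})
```

A property that is not in the schema makes `build` raise TypeError, which is one way the generated classes could "enforce" the schema; property names that collide with Python keywords are not handled in this sketch.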
Re: [O] Parsing Org-mode in Python
Karl Voit writes:

> I did not get the impression that [ply] is a parsing engine that is
> done the Python way.

PLY has pros and cons. SPARK[1] always attracted me as being more elegant. While it accepts a wider set of grammars than PLY, SPARK can become quite slow on grammars which are less "natural" (admittedly a very fuzzy, subjective term).

For simpler grammars, recursive descent does the job at good enough speed, and often, grammars can be rearranged a bit so the lexer can cleverly help the parser. Of course, writing a recursive descent parser looks like more work, yet many times in my experience, the programmer is amply repaid with simplicity and clarity.

>> You don't need a Lisp interpreter written in Python, only Python
>> code that understands org syntax without getting confused.

> if you are going to use a ELISP interpreter to parse Org-mode syntax
> for Python, this should completely re-use the original Org-parser and
> nothing else. I have no idea if this is possible or not. If you have
> to implement a parser on your own, you probably should stick to
> Python-only.

Hey hey, it's fun! :-) You misunderstood me, but this is constructive actually, as you raise good points.

In my dreams, a pure Python parser parses Org mode files. However, here and there in the parsed files, as data, we can see bits of Emacs Lisp code, or even Calc syntax in some places. That Emacs Lisp code could be mere constants or identifiers, but is sometimes more complex, evaluable S-expressions. A parser is probably of limited use if it does not come with some extra tools covering the most frequent use cases around the syntax, and I guess that pressure will develop to have some kind of Emacs Lisp interpreter, hardly complete, probably only mild or even ridiculous.

The interesting idea in your comments is that, *if* we had an Emacs Lisp interpreter of serious quality, that interpreter could use "the original Org-parser and nothing else". That would solve maintenance, as the parser would be wholly external, to be found in the Org mode distribution, all standard. But this avenue is quite unlikely: it looks like a major undertaking to me, and while such an interpreter would be useful on small data excerpts within an Org file, it might be inordinately slow if it had to interpret a lot of Lisp code while deciphering big Org files.

Worse, keeping a Python parser in sync with the true Emacs Lisp parser would require much energy, maybe only once in a while, but extended over a long period of time. Unless great enthusiasm exists, shared among many people, such projects are doomed to fail. Not many people are ready to commit themselves for life to the required maintenance.

François

---
[1] http://pages.cpsc.ucalgary.ca/~aycock/spark/
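To make the recursive descent remark concrete, here is a minimal sketch in which a small lexer classifies lines so that the parser only has to deal with nesting. It covers headlines and plain text only, which is an assumption made for brevity and far from the full Org syntax; the names `lex`, `parse_section` and `parse` are hypothetical.

```python
import re

HEADLINE = re.compile(r"^(\*+)\s+(.*)$")


def lex(text):
    """Turn each line into a token: (kind, depth, value)."""
    for line in text.splitlines():
        match = HEADLINE.match(line)
        if match:
            yield ("headline", len(match.group(1)), match.group(2))
        else:
            yield ("text", 0, line)


def parse_section(tokens, pos, level):
    """Parse body text and deeper headlines; return (body, children, pos)."""
    body, children = [], []
    while pos < len(tokens):
        kind, depth, value = tokens[pos]
        if kind == "text":
            body.append(value)
            pos += 1
        elif depth > level:
            # Descend: the subsection consumes everything nested below it.
            sub_body, sub_children, pos = parse_section(tokens, pos + 1, depth)
            children.append({"title": value, "level": depth,
                             "body": sub_body, "children": sub_children})
        else:
            break  # a sibling or ancestor headline ends this section
    return body, children, pos


def parse(text):
    """Parse text into a nested tree rooted at a level-0 document node."""
    body, children, _ = parse_section(list(lex(text)), 0, 0)
    return {"title": None, "level": 0, "body": body, "children": children}


tree = parse("intro\n* A\ntext a\n** A1\n* B\n")
```

Because the lexer has already measured the depth of each headline, the parser needs only one comparison to decide whether to descend or return.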
[O] Parsing Org-mode in Python (was: Implementing Org-mode tools in languages other than ELISP)
Hi!

* Daniel Clemente wrote:
>>
>> I dream of having a general Python parser for Org mode files, knowing
>> every bit about the current syntax for Org files, surrounded by enough
>> Python machinery to make it useful.

Oh, this would be great since there are way more Python-coders out there than ELISP coders.

> Try PyOrgMode (https://github.com/bjonnh/PyOrgMode), it works for
> some files (but still needs corrections: it crashes with date
> formats, with bold markers, etc.).

For the blogging system I am implementing [4], I did some research on current Org-parsers in Python. My notes about PyOrgMode (2013-05) were that there is not much documentation on how to use it properly and that its list of open todos contains rather basic things, too basic to consider it elaborated enough.

I consider my own Python parser[1] the most advanced Python parser so far (unfortunately). However, I am completely aware of its downsides:

- it's a very primitive line-by-line parser and not using any classical parsing tool at all (works for me so far!)
- it's currently limited to a few Org-mode elements so that I can continue to develop my blogging system
- more Org-mode elements (not all!) will be added when my blogging system gets stable enough to add Org-mode syntax features such as tables.
- it's not written with the premise to be a stand-alone Org-mode parser since I only need it for my blogging system
- feel free to use it and modify it to be a stand-alone parser

I do think that for a more general approach, somebody should develop an Org-mode Python parser with classical parsing engines. I do have some experience with ply[2]. Unfortunately, I have to say that using ply feels a bit awkward in Python. I did not get the impression that this is a parsing engine that is done the Python way. A lot of things are done by convention (naming stuff, and so on) which has certain limitations in details. And AFAIR there were more things that puzzled me. However, it got my (simple) job [3] done back then.

> You don't need a Lisp interpreter written in Python, only Python
> code that understands org syntax without getting confused.

I am no expert in this. I do feel that if you are going to use a ELISP interpreter to parse Org-mode syntax for Python, this should completely re-use the original Org-parser and nothing else. I have no idea if this is possible or not. If you have to implement a parser on your own, you probably should stick to Python-only.

In order to avoid confusion, your own Python parser should implement only a very well defined and documented sub-set of Org-mode syntax and should accept/parse everything else as ordinary text (content). IMHO.

HTH.

1. https://github.com/novoid/lazyblorg/blob/master/lib/orgparser.py
2. http://www.dabeaz.com/ply/
3. https://github.com/novoid/2011-04-tagstore-formal-experiment/tree/master/analysis_and_derived_data/scripts
4. https://github.com/novoid/lazyblorg

--
mail|git|SVN|photos|postings|SMS|phonecalls|RSS|CSV|XML to Org-mode:
> get Memacs from https://github.com/novoid/Memacs <
https://github.com/novoid/extract_pdf_annotations_to_orgmode + more on github
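The "well defined sub-set, everything else is ordinary text" idea can be sketched as a line-by-line classifier. This is only an illustration under assumptions: the three rules below are a hypothetical subset chosen for brevity, and the code is not taken from lazyblorg's orgparser.py.

```python
import re

# The documented subset: each rule is (kind, pattern). Order matters.
RULES = [
    ("headline", re.compile(r"^(?P<stars>\*+) (?P<title>.+)$")),
    ("keyword", re.compile(r"^#\+(?P<key>\w+):\s*(?P<value>.*)$")),
    ("list_item", re.compile(r"^\s*[-+] (?P<text>.*)$")),
]


def classify(line):
    """Return (kind, fields) for a line; unknown syntax becomes plain text."""
    for kind, pattern in RULES:
        match = pattern.match(line)
        if match:
            return kind, match.groupdict()
    return "text", {"text": line}


def parse_lines(text):
    """Classify every line of an Org document independently."""
    return [classify(line) for line in text.splitlines()]


parsed = parse_lines("#+TITLE: Demo\n* Head\n- item\n| a | b |")
```

The table row in the last line is not part of the subset, so it falls through to plain text instead of confusing the parser; supporting tables later would mean adding one more rule.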