Re: [O] Parsing Org-mode in Python

2014-01-09 Thread Brett Viren
Hi Daniel,

Daniel Clemente n142...@gmail.com writes:

   Are there already Python parsers for it?

Parsing generic JSON is fairly trivial in Python.

  import json
  with open('file.json') as f:
      data = json.load(f)

The resulting data is then a bunch of Python lists and/or dicts
matching whatever structure was output from org and is in the .json
file.  The schema in these three contexts is (will be) identical.

At this point, Pythonistas can do what they want with data.  Although,
as I mentioned, I'd like to put another layer on this raw data
structure which expresses/enforces the org schema as understood by the
org-exporter.  If I can figure out how to dump a representation of this
schema from org I'll express it as a set of generated
collections.namedtuple instances.  We'll see.

   Should ox-json's output be as raw as possible (e.g. what your code
 produces now) or transformed to simpler JSON?
   (I think both formats should coexist).

I suppose it may be useful to winnow down the structure.  One
thing I'm thinking about here is the narrowing done to support the "blog
from anywhere" feature of Karl's lazyblorg mentioned in this thread.

That can be done either on the emacs side or Python side (or both, in
principle).  However, my intention is to do as little modification of
the org document structure on the emacs-side in order to preserve
details that may possibly be interesting on the Python-side in the
future.  Also, I'm still learning LISP but know Python fairly well so
would rather do as much processing as possible on the Python side. :)

So far the only thing I see that needs to be stripped is the :parent
property (and the :structure, which really should be resolved as a copy
instead of being stripped) which cause the emacs-side data structure to
become a Circular Object and thus break the emacs JSON dumper.  
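
For illustration, here is a minimal sketch of the same stripping done on
the Python side.  It assumes a node is a dict whose back-reference lives
under a "parent" key; that key name, the "structure" key and the sample
tree are assumptions for the example, not the actual output of the
exporter.

```python
import json


def strip_keys(node, drop=("parent", "structure")):
    """Return a copy of a nested dict/list structure without the keys in `drop`.

    The dropped keys are never descended into, so a back-reference stored
    under one of them cannot send the recursion into a loop.
    """
    if isinstance(node, dict):
        return {k: strip_keys(v, drop) for k, v in node.items() if k not in drop}
    if isinstance(node, list):
        return [strip_keys(v, drop) for v in node]
    return node


# Hypothetical two-node tree with a circular parent link.
child = {"type": "paragraph", "contents": ["some text"]}
root = {"type": "headline", "contents": [child]}
child["parent"] = root

# json.dumps(root) would raise ValueError here because of the cycle.
clean = strip_keys(root)
text = json.dumps(clean)
```

The original tree is left untouched, so the parent links stay available
to any Python code that wants them.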

I just noticed that Python's JSON dumper can do this kind of stripping
implicitly and in general.  It might be nice if someone were to add such
a feature to the emacs JSON dumper but I don't plan to try this.

-Brett.





Re: [O] Parsing Org-mode in Python

2014-01-08 Thread François Pinard
Brett Viren b...@bnl.gov writes:

 I'm also (slowly) working toward some Python-based org processing.  My
 strategy is to produce an intermediate file in JSON format which is
 designed to capture the full org document structure.  I am calling
 this a shunt export as it is meant to do as little interpretation of
 the document as possible.

Might be interesting, indeed!

   http://permalink.gmane.org/gmane.emacs.orgmode/79838

This yields:

,----
| Not Found
| 
| The requested URL /gmane.emacs.orgmode/79838 was not found on this server.
`----

 At the end of the day one will have a DOM-style data structure
 representing the initial org document.

Keep me (us!) posted! :-)

François



Re: [O] Parsing Org-mode in Python

2014-01-08 Thread Brett Viren
François Pinard pin...@iro.umontreal.ca writes:

 Brett Viren b...@bnl.gov writes:

   http://permalink.gmane.org/gmane.emacs.orgmode/79838

 This yields:

 ,----
 | Not Found
 | 
 | The requested URL /gmane.emacs.orgmode/79838 was not found on this server.
 `----

Huh, maybe a transient failure?  It's there for me right now.  Here is
the same message from GNU's archive:

  http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html

In any case, here is the salient chunk:

#+BEGIN_SRC elisp
  (require 'json)
  (let* ((tree (org-element-parse-buffer 'object nil)))
    ;; Overwrite the back-references that make the tree circular.
    (org-element-map tree (append org-element-all-elements
                                  org-element-all-objects '(plain-text))
      (lambda (x)
        (if (org-element-property :parent x)
            (org-element-put-property x :parent "none"))
        (if (org-element-property :structure x)
            (org-element-put-property x :structure "none"))))
    ;; Dump the now acyclic tree as JSON text into a file.
    (write-region
     (json-encode tree)
     nil "foo.dat"))
#+END_SRC

This test is meant to run from inside an org-mode buffer which itself
provides the fodder for the test.  But, it shows the steps that I'll
need to integrate into some new org export mechanism.  The important
part is nulling out the :parent and :structure (and maybe others?)
properties in order to break their circular references.  The heavy
lifting is all in org-element-parse-buffer and json-encode.

 At the end of the day one will have a DOM-style data structure
 representing the initial org document.

 Keep me (us!) posted! :-)

Definitely!  
-Brett.




Re: [O] Parsing Org-mode in Python

2014-01-08 Thread François Pinard
2014/1/8 Brett Viren b...@bnl.gov

Huh, maybe a transient failure?  It's there for me right now.  Here is
 the same message from GNU's archive:

   http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html


Got it, thanks! :-)

-- 
François Pinard http://pinard.progiciels-bpi.ca


Re: [O] Parsing Org-mode in Python

2014-01-08 Thread Daniel Clemente
On Wed, 08 Jan 2014 10:42:17 -0500 Brett Viren wrote:
 
   http://lists.gnu.org/archive/html/emacs-orgmode/2013-12/msg00415.html
 
 In any case, here is the salient chunk:
 
 #+BEGIN_SRC elisp
   (require 'json)
   (let* ((tree (org-element-parse-buffer 'object nil)))
     (org-element-map tree (append org-element-all-elements
                                   org-element-all-objects '(plain-text))
       (lambda (x)
         (if (org-element-property :parent x)
             (org-element-put-property x :parent "none"))
         (if (org-element-property :structure x)
             (org-element-put-property x :structure "none"))))
     (write-region
      (json-encode tree)
      nil "foo.dat"))
 #+END_SRC
 

  I like this very much. This output is much easier to parse than the source 
.org file, and it's still using the original Elisp parser (so you don't need a 
Python parser).
  I hope ox-json.el gets into org-mode some day.

  Are there already Python parsers for it?
  Should ox-json's output be as raw as possible (e.g. what your code produces 
now) or transformed to simpler JSON?
  (I think both formats should coexist).
  



Re: [O] Parsing Org-mode in Python

2014-01-07 Thread Brett Viren
Hi Karl,

Karl Voit devn...@karl-voit.at writes:

 Hi!

 * Daniel Clemente n142...@gmail.com wrote:
 
 I dream of having a general Python parser for Org mode files, knowing
 every bit about the current syntax for Org files, surrounded by enough
 Python machinery to make it useful.

 Oh, this would be great since there are way more Python-coders out
 there than ELISP coders.

I agree.

I'm also (slowly) working toward some Python-based org processing.  My
strategy is to produce an intermediate file in JSON format which is
designed to capture the full org document structure.  I am calling this
a shunt export as it is meant to do as little interpretation of the
document as possible.

If this is interesting to you and you haven't already seen it, please
check the thread from December where I got a lot of help to output this
JSON via the new org export mechanism (I'm a LISP newbie).  Here is the
concluding post with a working example:

  http://permalink.gmane.org/gmane.emacs.orgmode/79838

Besides any eventual Python-side development, one remaining gap in my
plan is how to produce some kind of schema description using the org
exporter machinery.  I want to have this description generated
automatically so that any future changes to the org format can be
accommodated with some level of automation.

So, my current thinking is to find a way to exploit org export machinery
to generate this schema (call it a meta-shunt export?).  If I can find
that I'll output it as another JSON file.  Then, on the Python-side, I
will read this schema file in and generate instances of
collections.namedtuple.  Finally a reader of the JSON org document will
be developed to produce objects of these namedtuple classes.

At the end of the day one will have a DOM-style data structure
representing the initial org document.
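
A rough sketch of how that Python side could look.  Everything about
the formats here is assumed for illustration: the schema is taken to be
a JSON object mapping each element type to its property names, and each
document node is taken to be a JSON object with a "type" key.  The real
exporter output may well differ.

```python
import json
from collections import namedtuple

# Hypothetical schema file content: element type -> property names.
SCHEMA_JSON = '{"headline": ["level", "title", "contents"], "paragraph": ["contents"]}'

# Hypothetical document file content using that schema.
DOC_JSON = """
{"type": "headline", "level": 1, "title": "Intro",
 "contents": [{"type": "paragraph", "contents": ["Hello"]}]}
"""


def make_classes(schema):
    """Generate one namedtuple class per element type in the schema.

    Hyphens in type names are replaced so they form valid class names;
    property names are assumed to be valid Python identifiers already.
    """
    return {name: namedtuple(name.replace("-", "_"), fields)
            for name, fields in schema.items()}


def build(node, classes):
    """Recursively turn raw JSON data into namedtuple instances."""
    if isinstance(node, list):
        return [build(item, classes) for item in node]
    if isinstance(node, dict) and node.get("type") in classes:
        cls = classes[node["type"]]
        # Properties missing from the node come out as None.
        return cls(**{f: build(node.get(f), classes) for f in cls._fields})
    return node


classes = make_classes(json.loads(SCHEMA_JSON))
doc = build(json.loads(DOC_JSON), classes)
print(doc.title, doc.contents[0].contents)
```

Since the classes are generated from the schema at run time, a change in
the org format would only require regenerating the schema file, not
editing the Python code.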

-Brett.




[O] Parsing Org-mode in Python (was: Implementing Org-mode tools in languages other than ELISP)

2014-01-06 Thread Karl Voit
Hi!

* Daniel Clemente n142...@gmail.com wrote:
 
 I dream of having a general Python parser for Org mode files, knowing
 every bit about the current syntax for Org files, surrounded by enough
 Python machinery to make it useful.

Oh, this would be great since there are way more Python-coders out
there than ELISP coders.

 Try PyOrgMode (https://github.com/bjonnh/PyOrgMode), it works for
 some files (but still needs corrections: it crashes with date
 formats, with bold markers, etc.).

For my blogging system I am implementing [4] I was doing some
research on current Org-parsers in Python.

My notes about PyOrgMode (2013-05) were that there is little
documentation on how to use it properly, and that its list of open
todos contains items too basic for it to be considered mature.

I consider my own Python parser[1] the most advanced Python parser
so far (unfortunately). However, I am completely aware of its
downsides:

- it's a very primitive line-by-line parser and not using any classical
  parsing tool at all (works for me so far!)
- it's currently limited to a few Org-mode elements so that I can
  continue to develop my blogging system
  - more Org-mode elements (not all!) will be added when my blogging
    system gets stable enough to add Org-mode syntax features such
    as tables.
- it's not written with the premise to be a stand-alone Org-mode
  parser since I only need it for my blogging system
  - feel free to use it and modify it to be a stand-alone parser

I do think that for a more general approach, somebody should develop
an Org-mode Python parser with classical parsing engines. I do have
some experience with ply[2]. Unfortunately, I have to say that using
ply feels a bit awkward in Python. I did not get the impression that
this is a parsing engine that is done the Python way. A lot of
things are done by convention (naming stuff, and so on) which has
certain limitations in details. And AFAIR there were more things that
puzzled me. However, it got my (simple) job [3] done back then.

 You don't need a Lisp interpreter written in Python, only Python
 code that understands org syntax without getting confused.

I am no expert in this. I do feel that if you are going to use an
ELISP interpreter to parse Org-mode syntax for Python, this should
completely re-use the original Org-parser and nothing else. I have
no idea if this is possible or not.

If you have to implement a parser on your own, you probably should
stick to Python-only.

In order to avoid confusion, your own Python parser should implement
only a very well defined and documented sub-set of Org-mode syntax and
should accept/parse everything else as ordinary text (content).
IMHO.
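
To make the idea concrete, here is a minimal line-by-line sketch that
recognizes exactly one element (headings with an optional TODO/DONE
keyword and optional tags) and passes everything else through as text.
It is not the lazyblorg parser; the regular expression, the node layout
and the keyword list are assumptions made for this example.

```python
import re

# One or more stars, whitespace, optional keyword, title, optional :tag:list:
HEADING = re.compile(
    r"^(\*+)\s+(?:(TODO|DONE)\s+)?(.*?)(?:\s+:([\w:]+):)?\s*$"
)


def parse_lines(lines):
    """Classify each line as a heading or as ordinary text."""
    nodes = []
    for line in lines:
        match = HEADING.match(line)
        if match:
            stars, keyword, title, tags = match.groups()
            nodes.append({
                "type": "heading",
                "level": len(stars),
                "keyword": keyword,  # None when the heading has no keyword
                "title": title,
                "tags": tags.split(":") if tags else [],
            })
        else:
            # Anything outside the supported subset stays plain content.
            nodes.append({"type": "text", "value": line})
    return nodes


sample = [
    "* TODO Write parser :python:org:",
    "Some *bold* text and | a | table | row",
    "** Notes",
]
nodes = parse_lines(sample)
```

Bold markers and table rows are not interpreted at all here, which is
the point: unsupported syntax cannot confuse the parser.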

HTH.

  1. https://github.com/novoid/lazyblorg/blob/master/lib/orgparser.py
  2. http://www.dabeaz.com/ply/
  3. https://github.com/novoid/2011-04-tagstore-formal-experiment/tree/master/analysis_and_derived_data/scripts
  4. https://github.com/novoid/lazyblorg
-- 
mail|git|SVN|photos|postings|SMS|phonecalls|RSS|CSV|XML to Org-mode:
get Memacs from https://github.com/novoid/Memacs 

https://github.com/novoid/extract_pdf_annotations_to_orgmode + more on github




Re: [O] Parsing Org-mode in Python

2014-01-06 Thread François Pinard
Karl Voit devn...@karl-voit.at writes:

 I did not get the impression that [ply] is a parsing engine that is
 done the Python way.

PLY has pros and cons.  SPARK[1] always attracted me as being more
elegant.  While it accepts a wider set of grammars than PLY, SPARK can
become quite slow on grammars which are less natural (admittedly a
very fuzzy, subjective term).  For simpler grammars, recursive descent
does the job at good enough speed, and often, grammars can be rearranged
a bit so the lexer could cleverly help the parser.  Of course, it looks
like more work writing a recursive descent parser, yet many times in my
experience, the programmer is amply repaid with simplicity and clarity.
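
As a small illustration of recursive descent with the lexer doing part
of the work, here is a sketch that reads a simple S-expression into
nested Python lists.  The grammar is deliberately tiny and is my own
assumption for the example (parentheses, double-quoted strings,
integers, symbols); it is nowhere near full Emacs Lisp syntax.

```python
import re


def tokenize(text):
    """Split text into parentheses, quoted strings and bare atoms."""
    return re.findall(r'[()]|"(?:[^"\\]|\\.)*"|[^\s()"]+', text)


def atom(token):
    """Convert a single non-parenthesis token to a Python value."""
    if token.startswith('"'):
        return token[1:-1]  # string literal, escape sequences left as-is
    try:
        return int(token)
    except ValueError:
        return token  # symbol, kept as a plain str in this sketch


def parse(tokens, pos=0):
    """Parse one expression at tokens[pos]; return (value, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        items = []
        pos += 1
        # One recursive call per element until the matching ")".
        while pos < len(tokens) and tokens[pos] != ")":
            item, pos = parse(tokens, pos)
            items.append(item)
        if pos >= len(tokens):
            raise SyntaxError("missing closing parenthesis")
        return items, pos + 1
    if token == ")":
        raise SyntaxError("unexpected closing parenthesis")
    return atom(token), pos + 1


tree, _ = parse(tokenize("(+ 1 (* 2 3))"))
print(tree)  # ['+', 1, ['*', 2, 3]]
```

The parser itself is a single function mirroring the single grammar
rule, which is where the simplicity and clarity come from.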

 You don't need a Lisp interpreter written in Python, only Python
 code that understands org syntax without getting confused.

 if you are going to use an ELISP interpreter to parse Org-mode syntax
 for Python, this should completely re-use the original Org-parser and
 nothing else.  I have no idea if this is possible or not.  If you have
 to implement a parser on your own, you probably should stick to
 Python-only.

Hey hey, it's fun! :-) You misunderstood me, but this is constructive
actually, as you raise good points.  In my dreams, a pure Python parser
parses Org mode files.  However, here and there in the parsed files, as
data, we can see bits of Emacs Lisp code, or even Calc syntax at some
places.  That Emacs Lisp code could be mere constants or identifiers,
but sometimes more complex, evalable S-expressions.

A parser is probably of limited use if it does not come with some
extra-tools covering most frequent use cases around the syntax, and I
guess that pressure will develop to have some kind of Emacs Lisp
interpreter, hardly complete, probably only mild or even ridiculous.

The interesting idea in your comments is that, *if* we had an Emacs Lisp
interpreter of serious quality, that interpreter could use the original
Org-parser and nothing else.  That would solve maintenance, as the
parser would be wholly external, to be found in Org mode distribution,
all standard.  But this avenue is quite unlikely: it looks like a major
undertaking to me, and while such a parser would be useful on small data
excerpts within an Org file, it might be inordinately slow if it had to
interpret a lot of Lisp code while deciphering big Org files.

Worse, keeping a Python parser in sync with the true Emacs Lisp parser
would require much energy, maybe only once in a while, but extended over
a long period of time.  Unless a great enthusiasm exists, distributed on
many people, such projects are always doomed to fail.  Not many people
are ready to commit themselves for life in the required maintenance.

François

---
[1] http://pages.cpsc.ucalgary.ca/~aycock/spark/