Re: eLyXer for Document Parsing

slitt Sat, 04 Feb 2012 16:16:50 -0800

On Sat, 4 Feb 2012 14:00:24 -0700
Rob Oakes <[email protected]> wrote:


> Hi Steve,
[clip]
> > One more question: You sure you want to go in-memory? What happens
> > if a guy has a 1200 page book with 100 chapters each containing 10
> > sections, each containing 10 subsections, and tries to parse it on
> > a machine with 512 MB RAM? 
> 
> I pity this poor man's decision to convert the whole mess to Word,
> rather than splitting it out into individual chapters.
> 
> But I appreciate the voice of reason, sanity, and best
> practice. Short answer: no, I'm not convinced that I want to go
> in-memory. My first pass was just to become comfortable with eLyXer
> and see if it might meet my needs. I'm still trying to get comfortable
> with the structure of LyX documents and .docx documents. I've found a
> nice little Python library with support for basic docx features and
> was going to try to refine that into something slightly more usable.
> 
> > You in a heap of trouble son. He'll be swapped half way into the
> > next century. If instead you used an event parser (e.g. SAX) with a
> > few stacks, it would probably be slower, and it would be much
> > harder to write, but for practical purposes there wouldn't be an upper
> > limit on input file size.
> 
> Good points. The Python library makes use of lxml, which supports
> SAX. After I've got a better handle on my constraints, I'll spend the
> time required to design something more robust. 

On my lyx2kindle program
(http://www.troubleshooters.com/projects/lyx2kindle/) I used Python's
HTMLParser XML event parser tool. It was easy, though I think your lxml
idea is faster with big documents. For my 11K-word book "Rules of the
Happiness Highway", conversion took maybe a second. Anyway, my
lyx2kindle.py illustrates the use of HTMLParser and of a stack to
keep track of levels and maintain a poor man's state machine, and
another part of it implements the kludge of the century.
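
For anyone following along, here is a minimal sketch of the stack idea. It
is not the code from lyx2kindle.py: the names OutlineParser, HEADING_DEPTH
and outline_of are made up for illustration, and it assumes HTML-like input
that marks sections with h1 to h3 headings. The point is only that an event
parser holds the open-tag stack and the current heading, never the whole
document.

```python
from html.parser import HTMLParser

# Hypothetical mapping of heading tags to outline depth (an assumption for
# this sketch, not something taken from lyx2kindle.py).
HEADING_DEPTH = {"h1": 1, "h2": 2, "h3": 3}


class OutlineParser(HTMLParser):
    """Collect (depth, title) pairs while keeping only a stack of open tags."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, innermost last
        self.outline = []  # (depth, title) results
        self._title = []   # text fragments of the heading being read

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if tag in HEADING_DEPTH:
            self._title = []

    def handle_data(self, data):
        # The stack is the state machine: keep text only inside a heading.
        if any(t in HEADING_DEPTH for t in self.stack):
            self._title.append(data)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray close tag, ignore it
        # Pop back to the matching open tag, discarding unclosed children.
        while self.stack.pop() != tag:
            pass
        if tag in HEADING_DEPTH:
            title = "".join(self._title).strip()
            self.outline.append((HEADING_DEPTH[tag], title))


def outline_of(chunks):
    """Feed the input piece by piece so it never sits in memory whole."""
    parser = OutlineParser()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.outline
```

In real use the chunks would come from reading the file in blocks, for
example with repeated f.read(65536) calls, so memory use depends on nesting
depth rather than on file size.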

SteveT
