On Sat, 4 Feb 2012 14:00:24 -0700 Rob Oakes <lyx-de...@oak-tree.us> wrote:
> Hi Steve,

[clip]

> > One more question: You sure you want to go in-memory? What happens
> > if a guy has a 1200 page book with 100 chapters each containing 10
> > sections, each containing 10 subsections, and tries to parse it on
> > a machine with 512 MB RAM?
>
> I pity this poor man's decision to convert the whole mess to Word,
> rather than splitting it out into individual chapters.
>
> But I appreciate the voice of reason, sanity, and best practice.
> Short answer: no, I'm not convinced that I want to go in memory. My
> first pass was just to become comfortable with eLyXer and see if it
> might meet my needs. I'm still trying to get comfortable with the
> structure of LyX documents and .docx documents. I've found a nice
> little Python library with support for basic docx features and was
> going to try to refine that into something slightly more usable.
>
> > You in a heap of trouble son. He'll be swapped half way into the
> > next century. If instead you used an event parser (e.g. SAX) with a
> > few stacks, it will probably be slower, and it will be much harder
> > to write, but for practical purposes there won't be an upper limit
> > on input file size.
>
> Good points. The Python library makes use of lxml, which supports
> SAX. After I've got a better handle on my constraints, I'll spend the
> time required to design something more robust.

On my lyx2kindle program
(http://www.troubleshooters.com/projects/lyx2kindle/) I used Python's
HTMLParser, an event-driven parser for HTML-style markup. It was easy,
though I think your lxml idea would be faster with big documents. For
my 11K word book "Rules of the Happiness Highway", conversion took
maybe a second.

Anyway, my lyx2kindle.py illustrates the use of HTMLParser, shows how
to use a stack to keep track of nesting levels and maintain a poor
man's state machine, and, in another part, implements the kludge of
the century.

SteveT
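
For anyone following along, here is a minimal sketch of the "event parser plus a stack" idea discussed above. It is not taken from lyx2kindle.py: the class name OutlineParser, the heading tags it watches, and the sample markup are all made up for illustration, and it uses the Python 3 module name html.parser.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag; pushing them would corrupt the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class OutlineParser(HTMLParser):
    """Hypothetical example: record headings and their nesting depth."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags, innermost last
        self.headings = []  # (depth, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # The top of the stack is the "state": are we inside a heading?
        if text and self.stack and self.stack[-1] in ("h1", "h2", "h3"):
            self.headings.append((len(self.stack), text))


# Made-up sample input, fed in one call for simplicity.
sample = (
    "<html><body><div><h1>Part One</h1><p>text</p>"
    "<div><h2>Chapter 1</h2></div></div></body></html>"
)
parser = OutlineParser()
parser.feed(sample)
parser.close()
print(parser.headings)
```

Only the stack of open tags and whatever you choose to collect stay in memory, which is the point of the event approach. feed() can also be called repeatedly on chunks read from a file, but then handle_data may deliver one run of text in several pieces, so a real converter would need to buffer text until the end tag arrives.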