OMG, emacs lisp beats perl/python again! Hiya all, another little emacs lisp tutorial from the tiny Xah's Edu Corner.
〈Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and “figcaption” Tags〉
xahlee.org/emacs/elisp_batch_html5_tag_transform.html

Plain text version follows.

------------------------------------------
Emacs Lisp: Processing HTML: Transform Tags to HTML5 “figure” and “figcaption” Tags

Xah Lee, 2011-07-03

Another triumph of using elisp for text processing over perl/python.

----------------------------
The Problem

--------------
Summary

I want to batch transform the image tags in 5 thousand HTML files to use HTML5's new “figure” and “figcaption” tags. I want to be able to view each change interactively, with the option of giving it a “go ahead” to do the whole job in batch.

Interactive eye-ball verification on many cases lets me be reasonably sure the transform is done correctly. And I don't want to spend days thinking/writing/testing a mathematically correct program for a job that can otherwise be finished in 30 min with human interaction.

--------------
Detail

HTML5 has the following new tags: “figure” and “figcaption”. They are used like this:

<figure>
<img src="cat.jpg" alt="my cat" width="167" height="106">
<figcaption>my cat!</figcaption>
</figure>

(For detail, see: HTML5 “figure” & “figurecaption” Tags Browser Support)

On my website, I used a similar structure. It looks like this:

<div class="img">
<img src="cat.jpg" alt="my cat" width="167" height="106">
<p class="cpt">my cat!</p>
</div>

So, I want to replace it with HTML5's new tags. This can be done with a regex. Here's the “find” regex:

<div class="img">
?<img src="\([^.]+?\)\.jpg" alt="\([^"]+?\)" width="\([0-9]+?\)" height="\([0-9]+?\)">?
<p class="cpt">\([^<]+?\)</p>
?</div>

Here's the replacement string:

<figure>
<img src="\1.jpg" alt="\2" width="\3" height="\4">
<figcaption>\5</figcaption>
</figure>

Then, you can use “find-file” and dired's “dired-do-query-replace-regexp” to work on your 5 thousand pages. Nice. (See: Emacs: Interactively Find & Replace String Patterns on Multiple Files.)

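The find/replace pair above can be tried on a string before pointing it at real files. What follows is only a minimal sketch: the names “my-figure-demo” and “my-figure-demo-input” are made up for illustration, and putting an optional newline (\n?) after each tag is an assumption about how the regex is meant to be laid out.

```elisp
;; Sketch only. `my-figure-demo-input' and `my-figure-demo' are
;; hypothetical names, not part of the original code.
(defvar my-figure-demo-input
  (concat "<div class=\"img\">\n"
          "<img src=\"cat.jpg\" alt=\"my cat\" width=\"167\" height=\"106\">\n"
          "<p class=\"cpt\">my cat!</p>\n"
          "</div>")
  "Sample div.img block of the simple kind: one jpg image, plain-text caption.")

(defun my-figure-demo (html)
  "Return HTML with simple div.img blocks rewritten as figure blocks.
Text that does not match the regex is returned unchanged."
  (replace-regexp-in-string
   ;; Assumption: each tag may be followed by one optional newline.
   (concat "<div class=\"img\">\n?"
           "<img src=\"\\([^.]+?\\)\\.jpg\" alt=\"\\([^\"]+?\\)\""
           " width=\"\\([0-9]+?\\)\" height=\"\\([0-9]+?\\)\">\n?"
           "<p class=\"cpt\">\\([^<]+?\\)</p>\n?"
           "</div>")
   (concat "<figure>\n"
           "<img src=\"\\1.jpg\" alt=\"\\2\" width=\"\\3\" height=\"\\4\">\n"
           "<figcaption>\\5</figcaption>\n"
           "</figure>")
   html
   t)) ; FIXEDCASE: do not let the matched text change the replacement's case
```

Evaluating (my-figure-demo my-figure-demo-input) gives back the same block with the “figure” and “figcaption” tags in place of the div and p tags.
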
However, the problem here is more complicated. The image file may be jpg or png or gif. Also, there may be more than one image per group. Also, the caption part may contain complicated HTML. Here are some examples:

<div class="img">
<img src="cat1.jpg" alt="my cat" width="200" height="200">
<img src="cat2.jpg" alt="my cat" width="200" height="200">
<p class="cpt">my 2 cats</p>
</div>

<div class="img">
<img src="jamie_cat.jpg" alt="jamie's cat" width="167" height="106">
<p class="cpt">jamie's cat! Her blog is <a href="http://example.com/jamie/">http://example.com/jamie/</a></p>
</div>

So, a solution by regex is out.

----------------------------
Solution

The solution is pretty simple. Here are the major steps:

* Use “find-lisp-find-files” to traverse a dir.
* For each file, open it.
* Search for the string <div class="img">
* Use “sgml-skip-tag-forward” to jump to its closing tag.
* Save the begin/end positions of these two tags.
* Ask the user if she wants to replace. If so, do it. (using “delete-region” and “insert”)
* Repeat.

Here's the code:

;; -*- coding: utf-8 -*-
;; 2011-07-03
;; replace image tags to use html5's “figure” and “figcaption” tags.

;; Example. This:
;; <div class="img">…</div>
;; should become this
;; <figure>…</figure>

;; do this for all files in a dir.

;; rough steps:
;; find the <div class="img">
;; use sgml-skip-tag-forward to move to the ending tag.
;; save their positions.

(defun my-process-file (fpath)
  "process the file at fullpath FPATH ..."
  (let (mybuff p1 p2 p3 p4)
    (setq mybuff (find-file fpath))
    (widen)
    (goto-char (point-min)) ;; in case buffer already open
    (while (search-forward "<div class=\"img\">" nil t)
      (setq p2 (point))
      (backward-char 17) ; beginning of “div” tag
      (setq p1 (point))
      (forward-char 1)
      (sgml-skip-tag-forward 1) ; move to the closing tag
      (setq p4 (point))
      (backward-char 6) ; beginning of the closing div tag
      (setq p3 (point))
      (narrow-to-region p1 p4)
      (when (y-or-n-p "replace?")
        ;; replace the closing tag first, so p1 and p2 stay valid
        (delete-region p3 p4)
        (goto-char p3)
        (insert "</figure>")
        (delete-region p1 p2)
        (goto-char p1)
        (insert "<figure>"))
      ;; widen on either answer; if the buffer stays narrowed after a “n”,
      ;; the search cannot reach the rest of the file
      (widen))
    (when (not (buffer-modified-p mybuff))
      (kill-buffer mybuff))))

(require 'find-lisp)
(require 'sgml-mode) ; defines sgml-skip-tag-forward

(let (outputBuffer)
  (setq outputBuffer "*xah img/figure replace output*")
  (with-output-to-temp-buffer outputBuffer
    (mapc 'my-process-file (find-lisp-find-files "~/web/xahlee_org/emacs/" "\\.html$"))
    (princ "Done deal!")))

Seems pretty simple, right?

The “p1” and “p2” variables are the start/end positions of <div class="img">. The “p3” and “p4” variables are the start/end positions of its closing tag </div>.

We also used a little trick with “widen” and “narrow-to-region”. It lets me see just the part that I'm interested in. It narrows to the beginning/end of the div.img. This makes eye-balling a bit easier.

The real time-saver is the “sgml-skip-tag-forward” function from “html-mode” (it is defined in sgml-mode.el). Without that, one'd have to write a mini-parser to deal with HTML's nested ways to be able to locate the proper ending tag.

Using the above code, I can comfortably eye-ball and press “y” at the rate of about 5 per second. That makes 300 replacements per minute. I have 5000+ files. If we presume there are 6k replacements to be made, then at 5 per second that means 20 minutes of sitting there pressing “y”. Quite tiresome.

So, now, the next step is simply to remove the asking (y-or-n-p "replace?"). Or, if I'm absolutely paranoid, I can make emacs write into a log buffer for every replacement it makes (together with the file path).
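
The no-prompt, logging variant described above might look something like this. It is a rough sketch, not the code actually used for the job: the function name “my-figure-replace-in-buffer”, its LABEL argument, and the format of the log lines are all made up, and the check that the text before point really is </div> is an extra safety step added here.

```elisp
;; Sketch only. `my-figure-replace-in-buffer' and its LABEL argument are
;; hypothetical names, and the log line format is made up.
(require 'sgml-mode) ; defines `sgml-skip-tag-forward'

(defun my-figure-replace-in-buffer (label)
  "Turn each div.img pair in the current buffer into a figure pair.
Do not prompt.  Print one line per replacement to `standard-output',
prefixed with LABEL (for example the file path).
Return the number of replacements made."
  (let ((open-tag "<div class=\"img\">")
        (close-tag "</div>")
        (count 0)
        p1 p2 p3 p4)
    (goto-char (point-min))
    (while (search-forward open-tag nil t)
      (setq p2 (point)
            p1 (- p2 (length open-tag)))
      (goto-char p1)
      (sgml-skip-tag-forward 1)         ; move past the matching closing tag
      (setq p4 (point)
            p3 (- p4 (length close-tag)))
      (if (and (>= p3 p2)
               (string= (buffer-substring-no-properties p3 p4) close-tag))
          (progn
            ;; Closing tag first, so that p1 and p2 are still valid after.
            (delete-region p3 p4)
            (goto-char p3)
            (insert "</figure>")
            (delete-region p1 p2)
            (goto-char p1)
            (insert "<figure>")
            (setq count (1+ count))
            (princ (format "%s: replaced div.img at position %d\n" label p1)))
        ;; No proper closing tag found: leave the text alone and say so.
        (goto-char p2)
        (princ (format "%s: SKIPPED div.img at position %d\n" label p1))))
    count))
```

Inside “my-process-file”, the whole “while” loop could then be swapped for a single call such as (my-figure-replace-in-buffer fpath). Since “with-output-to-temp-buffer” binds standard-output to the output buffer, the log lines would end up there.
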
When the batch replacement is done (probably under 3 minutes), I can simply scan thru the log to see if any replacement went wrong. For how to do that, see: Emacs Lisp: Multi-Pair String Replacement with Report.

But what about replacing <p class="cpt">…</p> with <figcaption>…</figcaption>?

I simply copied and pasted the above code into a new file, and made changes in just 4 places. So, the figcaption replacement is a separate batch job. Of course, one could spend an extra hour or so to make the code do both in one pass, but is that extra hour of thinking & coding worthwhile for this one-time job?

I ♥ Emacs, do you?

---------------------------------
PS perl and python solutions welcome. I haven't looked at perl or python's HTML parser libs for 5+ years. Though, 2 little requirements:

1. It must be correct, of course. I cannot tolerate the possibility that maybe one out of a thousand replacements introduces a mismatched tag. (But you can assume that all the input HTML files are W3C valid.)

2. It must not change the formatting of the HTML pages, i.e. no adding/removing of spaces or tabs.

 Xah