Just reviewed old posts and found this one about RegEx parsing HTML and cr (return) characters. I wanted to pass along a little trick I found useful to extract tables into tab delimited format.

Premise: An HTML document is formatted with spaces and cr's for the benefit of the programmer. The browser ignores this white space, so an HTML table will display correctly even when extra characters, such as multiple cr's, are sprinkled about.

The (?s) directive is good for searching past cr's, but the extra cr's still matter if you wish to end up with a single cr defining each table row, rather than 2 or 3 or 4 cr's. Also, this is cr specific.

One of my first steps is to replace each cr with the string "MMMM". This makes the entire block of HTML text a single line, so there is no need for (?s). It also makes spurious cr's easily identifiable by subsequent search commands, not to mention easily visible when checking your results. BBEdit in softwrap mode allows you to see all of the text even without the returns.

Of course, you could simply replace cr with "" in htmlTextBlock and there is no need for (?s) either. The browser will display the same page, with or without the returns present.
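The two replacements described above might be sketched as follows. This is only an illustration: the handler names flattenWithMarker and flattenPlain are made up for this example, and it assumes the HTML has already been read into a variable with cr line endings.

```livecode
-- hypothetical helper: turn every cr into the visible marker "MMMM"
function flattenWithMarker pHtml
   replace cr with "MMMM" in pHtml
   return pHtml
end function

-- hypothetical helper: simply drop every cr
function flattenPlain pHtml
   replace cr with empty in pHtml
   return pHtml
end function
```

Either result is a single line, so a later matchText or matchChunk pattern should not need (?s). The marker version keeps a record of where the returns were.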

As I mine data from HTML I find it useful to re-establish cr's at specific points, thus the MMMM replacement allows me to reinsert cr's where desired and use loops that "repeat for the number of lines" for patterned data blocks.
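As a rough sketch of re-establishing cr's at chosen points, the handler below turns a simple table into tab-delimited lines. It is not the exact script I use: the name tableToTabText is invented for the example, and it assumes pHtml holds just one un-nested table whose rows and cells close with plain </tr> and </td> tags.

```livecode
function tableToTabText pHtml
   local tRow, tOut
   replace cr with "MMMM" in pHtml       -- flatten the source to one line
   replace "</tr>" with cr in pHtml      -- a real cr at the end of each table row
   replace "</td>" with tab in pHtml     -- a tab at the end of each cell
   replace "MMMM" with empty in pHtml    -- discard the source-formatting returns
   put replaceText(pHtml, "<[^>]*>", empty) into pHtml  -- strip the remaining tags
   -- keep only the lines that still hold data
   repeat for each line tRow in pHtml
      if tRow is not empty then put tRow & cr after tOut
   end repeat
   delete last char of tOut              -- drop the trailing cr
   return tOut
end function
```

Each row comes out with a trailing tab, and any indentation spaces from the source are left in place, so real pages will likely need more cleanup than this.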


Further, the pattern MMM[M]+ (the same as MMMM+) will locate any run of four or more M's, no matter how many cr's were in a row,
using
get matchChunk(temp,"(MMM[M]+)", startChar, endChar) ==> the run found is from 4 to however many chars long


-------------thus------
   put fld "htmlTextToParse" into temp
   replace return with "MMMM" in temp --> flatten to a single line first (skip if already done)

   put empty into charList
   -- note: the startChar and endChar vars do not have to be defined before matchChunk
   -- loop on the result of matchChunk so the repeat stops when no run is left
   repeat while matchChunk(temp,"(MMM[M]+)", startChar, endChar)
      -->you have to use parens in the regex string
      put return into char startChar to endChar of temp
      put return & startChar & "," & endChar after charList -->for demo purposes only
   end repeat
   put return & "startChar, endChar list " & charList after temp -->for demo purposes only
   put temp --> view the replacement, and the char list at the bottom


 -------------------will convert a run of cr's to a single cr.
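If the char positions are not needed, the same collapse can probably be done in one call with the built-in replaceText function. A minimal sketch, using a made-up sample string in place of the field contents:

```livecode
on mouseUp
   local temp
   -- sample text with a run of three cr's and a single cr
   put "<tr>" & cr & cr & cr & "<td>one</td>" & cr & "</tr>" into temp
   replace cr with "MMMM" in temp                  -- markers instead of returns
   put replaceText(temp, "MMMM+", cr) into temp    -- each run of markers becomes one cr
   put temp
end mouseUp
```

Note that spaces or tabs sitting between two markers break the run, so those would still leave two cr's behind.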

Nested tables can be problematic, but I find that this technique allows me to establish my true output cr's in the cacophony of HTML source code formatting.

Hope this helps those who need to learn a bit more about the power of RegEx and Rev

Jim Ault
Las Vegas


On 11/20/04 8:20 PM, "Sivakatirswami" <[EMAIL PROTECTED]> wrote:

> I am using Rev to repurpose old html to new CSS compliant mark up. The
> old pages are incredibly inconsistent.  Fortunately grep is our
> friend.. I need a grep expression that will pass out the content from
> both #1:
>
> <title> some title </title>
>
> and #2
>
> <title> some title
> </title>
>
> where the first instance has no line break but the second one does

Use the "(?s)" directive:

on mouseUp
  local tTitle, tXML
  put "<title>some title"&cr&"</title>" into tXML
  get matchText(tXML,"(?s)title>(.*?)</title",tTitle)
  put tTitle
end mouseUp

Note that you'll get the trailing CR after "some title" as well, so you'd
have to strip that out if you want to.
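One possible way to do that stripping inside the pattern itself is to let \s* absorb the white space on either side of the capture. This is a sketch only, a variation on the handler above rather than a tested recipe for every page:

```livecode
on mouseUp
   local tTitle, tXML
   put "<title> some title" & cr & "</title>" into tXML
   -- \s* on each side keeps leading and trailing spaces and cr's out of the capture
   if matchText(tXML, "(?s)<title>\s*(.*?)\s*</title>", tTitle) then
      put tTitle
   end if
end mouseUp
```

Because the capture is lazy, it stops before any white space that sits directly in front of the closing tag.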

Check the docs at http://www.prce.org/man.txt - the "(?s)" directive
corresponds to PCRE_DOTALL, which causes the "." character to match all
characters, including newlines (CRs).

HTH,

Ken Ray
Sons of Thunder Software
Web site: http://www.sonsothunder.com/
Email: [EMAIL PROTECTED]


_______________________________________________ use-revolution mailing list use-revolution@lists.runrev.com http://lists.runrev.com/mailman/listinfo/use-revolution
