Just reviewed old posts and found this one about RegEx parsing HTML
and cr (return) characters. I wanted to pass along a little trick I
found useful to extract tables into tab delimited format.
Premise: An HTML document is formatted with spaces and cr's for the
benefit of the programmer. Basically the browser app ignores this
white space, so an HTML table will display correctly
even when extra characters, such as multiple cr's, are sprinkled
about.
The (?s) is good for searching past cr's, but it can make a
difference if you wish to end up with a single cr defining a table
row, rather than 2 or 3 or 4 cr's. Also, this is cr specific.
One of my first steps is to replace cr with string "". This
makes the entire block of HTML text a single line, so there is no need for
(?s). It also makes spurious cr's easily identifiable by subsequent
search commands, not to mention easily visible when checking your
results. BBEdit in softwrap mode allows you to see all of the text
even without the returns.
Of course, you could simply replace cr with "" in htmlTextBlock and
there is no need for (?s) either. The browser will display the same
page, with or without the returns present.
As I mine data from HTML I find it useful to re-establish cr's at
specific points, thus the replacement allows me to reinsert cr's
where desired and use loops that "repeat for the number of lines" for
patterned data blocks.
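
For anyone who wants to try the idea outside Rev, here is a minimal Python sketch of the flatten-then-reinsert approach. It is not the Rev code I use. The function name table_to_tsv, the sample HTML, and the choice to break lines at </tr> and insert tabs at </td> or </th> are all assumptions made for illustration.

```python
import re

def table_to_tsv(html: str) -> str:
    """Flatten HTML, then re-establish line breaks only at row ends."""
    # Step 1: drop every CR/LF so the source formatting no longer matters.
    flat = html.replace("\r", "").replace("\n", "")
    # Step 2: put a newline where a row ends and a tab where a cell ends.
    flat = re.sub(r"</tr\s*>", "\n", flat, flags=re.IGNORECASE)
    flat = re.sub(r"</t[dh]\s*>", "\t", flat, flags=re.IGNORECASE)
    # Step 3: remove whatever tags remain.
    flat = re.sub(r"<[^>]+>", "", flat)
    # Step 4: tidy each row: trim indentation, drop the tab left by the
    # last cell, and strip padding around each cell.
    rows = []
    for line in flat.split("\n"):
        line = line.strip(" ")
        if line.endswith("\t"):
            line = line[:-1]
        cells = [cell.strip() for cell in line.split("\t")]
        if any(cells):
            rows.append("\t".join(cells))
    return "\n".join(rows)

# Hypothetical sample with extra blank lines sprinkled about.
sample = """<table>
  <tr>
    <td>Name</td>

    <td>City</td>
  </tr>
  <tr><td>Jim</td><td>Las Vegas</td></tr>
</table>"""

print(table_to_tsv(sample))
```

Each output line is one table row, so a loop that repeats for the number of lines walks the rows directly. The sketch assumes a single, non-nested table.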
Further, MMM[M]+ will locate all or longer, no matter how
many cr's were in a row, using
get matchChunk(temp,"(MMM[M]+)", startChar, endChar) ==> 4 to howevermany
Thus:
put fld "htmlTextToParse" into temp
-- note: the startChar and endChar vars do not have to be defined
-- before matchChunk, but you do have to use parens in the regex string
-- matchChunk returns false when no run is left, which ends the loop
repeat while matchChunk(temp,"(MMM[M]+)", startChar, endChar)
  -- replace the whole run with a single cr
  put return into char startChar to endChar of temp
  put return & startChar & "," & endChar after temp -- for demo purposes only
end repeat
put return & "startChar, endChar list " after temp -- for demo purposes only
put temp -- view the replacement, and the char list at the bottom
This will convert a run of cr's to a single cr.
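
The same collapse can be sketched in Python for comparison. This is a sketch under assumptions, not the Rev code above: it assumes each cr was first replaced by the marker "MMMM", so that MMM[M]+ matches one marker or any run of them. The names MARKER, collapse_marker_runs and marker_spans are made up for the example.

```python
import re

MARKER = "MMMM"  # assumed stand-in for one cr; the post does not confirm it

def collapse_marker_runs(text: str) -> str:
    # "MMM[M]+" matches four or more M's, so a whole run is replaced
    # by a single newline in one pass.
    return re.sub(r"MMM[M]+", "\n", text)

def marker_spans(text: str) -> list:
    # 1-based, inclusive start and end positions of each run, similar in
    # spirit to the startChar and endChar values matchChunk fills in.
    return [(m.start() + 1, m.end()) for m in re.finditer(r"MMM[M]+", text)]

# Hypothetical input: three cr's after the first row, one after the second.
html = "<tr><td>a</td></tr>\n\n\n<tr><td>b</td></tr>\n"
flat = html.replace("\n", MARKER)

print(marker_spans(flat))
print(repr(collapse_marker_runs(flat)))
```

One caveat with any marker of this kind: if the page text itself contains four or more M's in a row, that text is replaced as well, so pick a marker that cannot occur in the data.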
Nested tables can be problematic, but I find that this technique
allows me to establish my true output cr's in the cacophony of HTML
source code formatting.
Hope this helps those who need to learn a bit more about the power of
RegEx and Rev
Jim Ault
Las Vegas
On 11/20/04 8:20 PM, "Sivakatirswami" <[EMAIL PROTECTED]> wrote:
> I am using Rev to repurpose old html to new CSS compliant mark up. The
> old pages are incredibly inconsistent. Fortunately grep is our
> friend.. I need a grep expression that will pass out the content from
> both #1:
> some title
> and #2
> some title
>
> where the first instance has no line break but the second one does
Use the "(?s)" directive:
on mouseUp
local tTitle
put "some title"&cr&"" into tXML
get matchText(tXML,"(?s)title>(.*?)
Note that you'll get the trailing CR after "some title" as well, so you'd
have to strip that out if you want to.
Check the docs at http://www.prce.org/man.txt - the "?s" directive
corresponds to PCRE_DOTALL, which causes the "." character to match all
characters, including newlines (CRs).
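
The DOTALL behaviour can be checked in Python, whose re module accepts the same inline (?s) flag. This is only an illustration, not Rev code. The <title> tags in the sample strings and the helper name extract_title are assumptions.

```python
import re
from typing import Optional

def extract_title(text: str) -> Optional[str]:
    # (?s) switches on DOTALL, so "." also matches newlines.
    # ".*?" is non-greedy, so the match stops at the first closing tag.
    match = re.search(r"(?s)<title>(.*?)</title>", text)
    # strip() removes the trailing line break that the capture picks up.
    return match.group(1).strip() if match else None

one_line = "<title>some title</title>"
two_lines = "<title>some title\n</title>"

print(extract_title(one_line))
print(extract_title(two_lines))

# Without (?s) the "." stops at the newline, so the two-line case fails.
print(re.search(r"<title>(.*?)</title>", two_lines))
```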
HTH,
Ken Ray
Sons of Thunder Software
Web site: http://www.sonsothunder.com/
Email: [EMAIL PROTECTED]
___
use-revolution mailing list
use-revolution@lists.runrev.com
http://lists.runrev.com/mailman/listinfo/use-revolution