Re: RegEx Help--Across Lines

2005-05-04 Thread Jim Ault
Just reviewed old posts and found this one about RegEx parsing HTML 
and cr (return) characters.  I wanted to pass along a little trick I 
found useful to extract tables into tab delimited format.

Premise:  An HTML document is formatted with spaces and cr's for the 
benefit of the programmer.  The browser app basically ignores this 
white space, so an HTML table will display correctly even when extra 
characters, such as multiple cr's, are sprinkled about.

The (?s) directive is good for searching past cr's, but the cr's it 
leaves behind can make a difference if you wish to end up with a 
single cr defining a table row, rather than 2 or 3 or 4 cr's.  Also, 
this is cr-specific.

One of my first steps is to replace cr with string "".  This 
makes the entire block of HTML text a single line, so there is no 
need for (?s).  It also makes spurious cr's easily identifiable by 
subsequent search commands, not to mention easily visible when 
checking your results.  BBEdit in softwrap mode allows you to see 
all of the text even without the returns.

Of course, you could simply replace cr with "" in htmlTextBlock and 
there is no need for (?s) either.  The browser will display the same 
page, with or without the returns present.

As I mine data from HTML I find it useful to re-establish cr's at 
specific points, thus the  replacement allows me to reinsert cr's 
where desired and use loops that "repeat for the number of lines" for 
patterned data blocks.
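
A minimal sketch of that flatten-then-rebuild idea, assuming a single un-nested table whose rows close with </tr> and whose cells close with </td>.  The handler name tableToTabbed is made up for illustration, and stripping the leftover tags with replaceText is an addition that the post above does not describe:

```livecode
-- Hypothetical helper, not from the original post.
-- Assumes one un-nested table: rows end in </tr>, cells end in </td>.
function tableToTabbed pHTML
   -- drop the source formatting so the block becomes a single line
   replace return with empty in pHTML
   -- re-establish a return at the end of each table row
   replace "</tr>" with return in pHTML
   -- turn each cell boundary into a tab
   replace "</td>" with tab in pHTML
   -- remove every remaining tag
   return replaceText(pHTML, "<[^>]*>", empty)
end function

on mouseUp
   local tData, tRow
   put tableToTabbed(fld "htmlTextToParse") into tData
   repeat for each line tRow in tData
      -- per-row handling goes here; each row still ends in a tab
   end repeat
end mouseUp
```

Because the returns are put back only at the row ends, each line of the result is one table row, which is what makes a line-by-line repeat loop usable afterwards.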

Further   MMM[M]+   will locate all  or longer, no matter how 
many cr's were in a row,
using
get matchChunk(temp,"(MMM[M]+)", startChar, endChar)  ==> 4 to howevermany

-thus--
   put fld htmlTextToParse into temp
   -- note: the startChar and endChar vars do not have to be defined
   -- before matchChunk, but you have to use parens in the regex string
   repeat while matchChunk(temp,"(MMM[M]+)", startChar, endChar)
      put return into char startChar to endChar of temp
      put return & startChar & "," & endChar after temp  -->for demo purposes only
   end repeat
   put return & "startChar, endChar list " after temp  -->for demo purposes only
   put temp  --> view the replacement, and the char list at the bottom

This will convert a run of cr's to a single cr.
Nested tables can be problematic, but I find that this technique 
allows me to establish my true output cr's in the cacophony of HTML 
source code formatting.
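
For comparison, when no placeholder string is involved, a run of cr's can also be collapsed in a single call.  This is only a sketch: it uses replaceText rather than the matchChunk loop, collapseReturns is a made-up name, and it assumes the text holds Rev's internal linefeed for each return:

```livecode
-- Hypothetical helper; uses replaceText instead of the matchChunk loop.
function collapseReturns pText
   -- inside Rev a return is a linefeed, so \n+ matches a whole run
   return replaceText(pText, "\n+", return)
end function
```

The loop version is still the one to use when you want the startChar and endChar positions for checking your results.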

Hope this helps those who need to learn a bit more about the power of 
RegEx and Rev

Jim Ault
Las Vegas

On 11/20/04 8:20 PM, "Sivakatirswami" <[EMAIL PROTECTED]> wrote:

> I am using Rev to repurpose old html to new CSS compliant mark up. The
> old pages are incredibly inconsistent.  Fortunately grep is our
> friend.. I need a grep expression that will pass out the content from
>
> both #1:
>
>  some title
>
> and #2
>
>  some title
>
>
> where the first instance has no line break but the second one does

Use the "(?s)" directive:

on mouseUp
  local tTitle
  put "some title"&cr&"" into tXML
  get matchText(tXML,"(?s)title>(.*?)

Note that you'll get the trailing CR after "some title" as well, so you'd
have to strip that out if you want to.

Check the docs at http://www.prce.org/man.txt - the "?s" directive
corresponds to PCRE_DOTALL, which causes the "." character to match all
characters, including newlines (CRs).

HTH,

Ken Ray
Sons of Thunder Software
Web site: http://www.sonsothunder.com/
Email: [EMAIL PROTECTED]
___
use-revolution mailing list
use-revolution@lists.runrev.com
http://lists.runrev.com/mailman/listinfo/use-revolution


Re: RegEx Help--Across Lines

2004-11-21 Thread Mark Greenberg
There is a wonderful book called Mastering Regular Expressions by 
Jeffrey E. F. Friedl that covers Regex in detail with an emphasis on 
the Perl standard.  I found it very helpful with the Regex 
implementation that we use in Transcript.

Still, there are some quirks to get used to.  For example, Regex terms 
in Rev are strings, which differs from other implementations.  Also, 
figuring out which parts of the Regex will assign substrings to 
variables was, at least for me, an exercise in trial and error.
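
A small sketch of how the parenthesized groups map to variables in matchText.  The variable names and the sample date are made up for illustration:

```livecode
on mouseUp
   local tStamp, tYear, tMonth, tDay
   put "2004-11-21" into tStamp
   -- the regex is an ordinary quoted string;
   -- each (...) group fills the next variable, left to right
   if matchText(tStamp, "([0-9]+)-([0-9]+)-([0-9]+)", tYear, tMonth, tDay) then
      put tYear && tMonth && tDay  -- shows the three captured pieces
   end if
end mouseUp
```

Only the parts of the pattern inside parentheses are handed to variables; the hyphens are matched but not captured.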

Between the book and experimenting in the Message Box I've been able 
to do just about everything I attempted with Regex in Rev.

Mark Greenberg


Re: RegEx Help--Across Lines

2004-11-20 Thread Ken Ray
On 11/20/04 8:20 PM, "Sivakatirswami" <[EMAIL PROTECTED]> wrote:

> I am using Rev to repurpose old html to new CSS compliant mark up. The
> old pages are incredibly inconsistent.  Fortunately grep is our
> friend.. I need a grep expression that will pass out the content from
> 
> both #1:
> 
>  some title 
> 
> and #2
> 
>  some title
> 
> 
> where the first instance has no line break but the second one does

Use the "(?s)" directive:

on mouseUp
  local tTitle
  put "some title"&cr&"" into tXML
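  get matchText(tXML,"(?s)title>(.*?)

Note that you'll get the trailing CR after "some title" as well, so you'd
have to strip that out if you want to.

Check the docs at http://www.prce.org/man.txt - the "?s" directive
corresponds to PCRE_DOTALL, which causes the "." character to match all
characters, including newlines (CRs).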
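
A self-contained sketch of the same idea, with the trailing CR stripped afterwards.  The <title> tags here are an assumption about what the sample markup looked like, and trimming with "word 1 to -1" is one possible way to drop the CR, not necessarily the one intended above:

```livecode
on mouseUp
   local tXML, tTitle
   -- assumed markup: the <title> tags are a guess at the sample
   put "<title>some title" & cr & "</title>" into tXML
   -- (?s) lets "." match the cr, so the match can span the line break
   if matchText(tXML, "(?s)<title>(.*?)</title>", tTitle) then
      -- tTitle ends with the cr; word 1 to -1 trims surrounding white space
      put word 1 to -1 of tTitle
   end if
end mouseUp
```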

HTH,

Ken Ray
Sons of Thunder Software
Web site: http://www.sonsothunder.com/
Email: [EMAIL PROTECTED]

