[Proto-Scripty] Re: help with basic regular expression

RobG Mon, 16 Mar 2009 23:17:11 -0700

On Mar 17, 2:25 pm, kangax <kan...@gmail.com> wrote:
> On Mar 16, 9:44 pm, RobG <rg...@iinet.net.au> wrote:
>
> > On Mar 17, 4:10 am, kangax <kan...@gmail.com> wrote:
>
> > > On Mar 16, 1:13 pm, arkady <arkad...@gmail.com> wrote:
>
> > > > if trying to strip off the everything before the <body> and everything
> > > > after </body>
>
> > > response.replace(/.*(?=<body>)/, '').replace(/(<\/body>).*/, '$1');
>
> > That seems a bit risky, the string may not always have lower case tag
> > names and the body opening tag may include attributes.  New lines in
>
> I actually took OP's issue too literally; i.e. - "strip off everything
> before the <body> and after </body>" : )
>
> > the string might trip it up too.  In any case, it doesn't work for me
> > at all in Firefox 3 or IE 6.
>
> Which string did you feed it with? dot doesn't match newlines, does
> it? [\s\S] should match:
>
> response.replace(/[\s\S]*(?=<body)/i, '');

I was giving it the innerHTML of the doc it was in (with new lines,
returns, etc.), and the above does the trick (I guess any pair of
complementary character classes would do - [\d\D] works too).  For a
more general case (and the OP’s), it also needs to trim everything
after the closing </body> (which *should* only ever be </html> with
maybe some whitespace, but who knows what a server might send?), as
you’ve done below.
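
A minimal sketch of the two replaces combined into one function, in
case it helps. The function name is made up here and is not from
anyone's post in this thread; it assumes the response contains exactly
one body element:

```javascript
// Hypothetical helper: the name extractBodyWithRegex is an assumption.
function extractBodyWithRegex(response) {
  return response
    // [\s\S] matches any character including new lines (dot does not).
    // Greedy match backtracks to the last "<body", case-insensitively,
    // so attributes on the opening tag are kept.
    .replace(/[\s\S]*(?=<body)/i, '')
    // Keep the closing tag itself ($1) and drop whatever follows it.
    .replace(/(<\/body>)[\s\S]*/i, '$1');
}

var html = '<html>\n<head><title>t</title></head>\n' +
           '<BODY class="x">\n<p>hi</p>\n</BODY>\n</html>';
console.log(extractBodyWithRegex(html));
```

If there is no body tag at all, neither pattern matches and the string
comes back unchanged, so the caller still has to check for that.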

According to the innerHTML, Firebug puts a div between the head and
body elements - not sure I like that.  If fed back to the browser, it
will be dealt with by error correction (moved into the body or perhaps
ignored completely).


> > An alternative, provided all new lines are removed, is:
>
> >   response.match(/<body.*body>/i)[0];
>
> > or
>
> >   response.replace(/\s/g,' ').match(/\<body.+body\>/i)[0];
>
> > A sub-string version is:
>
> >   var start = response.toLowerCase().indexOf('<body');
> >   var end = response.toLowerCase().indexOf('</body>') + 7;
> >   var theBody = response.substring(start, end);
>
> Obviously, string-based matching should be marginally faster than
> regex, especially when that regex is based on a relatively slow
> positive lookahead : )

But the substring stuff just *looks* clunky.  :-p
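
Wrapped in a function it is at least hidden away. A rough sketch
follows; the function name and the null return for a missing tag are
assumptions of mine, not something from the posts above:

```javascript
// Hypothetical helper: name and null-on-failure behaviour are assumptions.
function extractBodyWithSubstring(response) {
  // Lower-case once so the search is case-insensitive,
  // but slice the original string so its case is preserved.
  var lower = response.toLowerCase();
  var start = lower.indexOf('<body');
  var end = lower.indexOf('</body>');
  // indexOf returns -1 when the tag is missing; bail out rather than
  // return a meaningless slice.
  if (start === -1 || end === -1 || end < start) {
    return null;
  }
  return response.substring(start, end + '</body>'.length);
}

var html = '<html>\n<head></head>\n<BODY id="b">\n<p>hi</p>\n</BODY>\n</html>';
console.log(extractBodyWithSubstring(html));
```

The guard matters because without it a missing closing tag gives an
end of 6 (-1 + 7) and substring quietly returns the wrong slice.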


>
> var response = document.documentElement.innerHTML;
> console.time(1);
> for (var i=0; i<100; i++) {
>   var l = response.toLowerCase();
>   response.substring(l.indexOf('<body'), l.indexOf('</body>') + 7);
> }
> console.timeEnd(1);
>
> var response = document.documentElement.innerHTML;
> console.time(2);
> for (var i=0; i<100; i++) {
>   response.replace(/[\s\S]*(?=<body)/i, '')
>     .replace(/(<\/body>)[\s\S]*/i, '$1');
> }
> console.timeEnd(2);
>
> //1: 186ms
> //2: 2664ms

For that sort of speed gain, I’d use substring every time - match is
about 50% slower again.
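
For anyone wanting to repeat the comparison without Firebug's
console.time, here is a rough sketch that times all three approaches
with Date.now().  The sample string, the timeIt helper and the
candidates object are made up for illustration (the runs above used
document.documentElement.innerHTML), and the match variant uses [\s\S]
so new lines need not be stripped first.  Numbers will differ between
browsers and documents:

```javascript
// Made-up sample input: roughly 200 paragraphs inside a body element.
var sample = '<html>\n<head><title>t</title></head>\n<body class="x">\n' +
  new Array(201).join('<p>some text</p>\n') + '</body>\n</html>';

// The three approaches discussed in this thread.
var candidates = {
  substring: function (s) {
    var l = s.toLowerCase();
    return s.substring(l.indexOf('<body'), l.indexOf('</body>') + 7);
  },
  replace: function (s) {
    return s.replace(/[\s\S]*(?=<body)/i, '')
            .replace(/(<\/body>)[\s\S]*/i, '$1');
  },
  match: function (s) {
    // Assumes a body element is present; match returns null otherwise.
    return s.match(/<body[\s\S]*<\/body>/i)[0];
  }
};

// Hypothetical helper: run fn(input) repeatedly, return elapsed ms.
function timeIt(fn, input, iterations) {
  var started = Date.now();
  for (var i = 0; i < iterations; i++) {
    fn(input);
  }
  return Date.now() - started;
}

var results = {};
for (var name in candidates) {
  results[name] = timeIt(candidates[name], sample, 100);
  console.log(name + ': ' + results[name] + 'ms');
}
```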


--
Rob