Re: containsregex and concat

Brian Agnew Wed, 29 Nov 2006 01:37:26 -0800

If your input file is XHTML, or can be transformed into XHTML (sorry, I
haven't been following the thread), then you can use XMLTask and specify
the XPath of the element you wish to change or insert at.


http://www.oopsconsultancy.com/software/xmltask
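
A minimal sketch of what that might look like, assuming xmltask.jar is on
Ant's classpath, the input is well-formed XHTML, and each file holds one
<table class="summary"> as described further down the thread. The file names
and the buffer name are made up for illustration, and the attribute names
should be checked against the XMLTask documentation before relying on this.

```xml
<project name="summary-xmltask" default="extract">
  <!-- Assumption: xmltask.jar is available to Ant (for example in ANT_HOME/lib). -->
  <taskdef name="xmltask" classname="com.oopsconsultancy.xmltask.ant.XmlTask"/>

  <target name="extract">
    <!-- Copy the summary table from a hypothetical report into a named buffer.
         local-name() is used so the XPath matches whether or not the XHTML
         elements are in a namespace. -->
    <xmltask source="reports/a.html">
      <copy path="//*[local-name()='table'][@class='summary']"
            buffer="summary.tables"/>
    </xmltask>

    <!-- Insert the buffered table under the body of a hypothetical,
         well-formed template and write the result to summary.html. -->
    <xmltask source="summary-template.html" dest="summary.html">
      <insert path="//*[local-name()='body']" buffer="summary.tables"/>
    </xmltask>
  </target>
</project>
```

This only handles one input file. Collecting tables from several files would
need the buffer to accumulate across calls, and a DOCTYPE pointing at a remote
DTD may need extra handling; neither is shown here.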

Brian

On Wed, November 29, 2006 05:07, George Bills wrote:
> Thanks Gilbert - except that doesn't work when tables span more than one
> line. "byline true" splits the text into multiple tokens, and the regex
> is applied independently to each token. So if the start of the
> expression (<table>) is on line one, the middle of the expression
> (<tr>blah</tr>...etc) is on line two, and the end of the expression
> (</table>) is on line three, then no one individual line / token
> matches, so nothing comes out (correct me if I'm wrong, but that's what
> seemed to happen in my testing). If byline is false, the entire text is
> one big token - so if I match the token, I get the entire token (the
> original input) back. Also, I wanted the entire table, not just the
> contents. I tried using "replace="\0"", but that just means that within
> the token I'm replacing the matching text with the matching text - not
> very useful.
>
> What I really wanted was a way of saying "give me the matching text and
> only the matching text, not the token that matches". I sort of solved it
> by writing a regular expression to match the entirety of input:
> "(.*?)(<table[^</]*class="summary"[^</]*>(.*?)</table>)(.*)". With the Ant
> encoding that ends up as:
> "(.*?)(&lt;table[^&lt;/]*class=&quot;summary&quot;[^&lt;/]*&gt;(.*?)&lt;/table&gt;)(.*)".
> I don't know that each table will only take one line of input (in fact,
> they won't), but since I know that there's only one table in each input
> file, I can match the entire file and use "replace="\2"" to replace the
> entire match (all input) with the second matching group (the table).
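
For reference, a sketch of that whole-file approach for a single file. It
uses <replaceregex> inside an explicit <tokenfilter> with a <filetokenizer>
rather than the <containsregex> from the thread, so it is a swapped-in
variant, not the exact setup described above. The file name input.html and
the property name summary.table are hypothetical.

```xml
<project name="one-table" default="show">
  <target name="show">
    <!-- input.html: hypothetical file containing exactly one summary table -->
    <loadfile srcfile="input.html" property="summary.table">
      <filterchain>
        <tokenfilter>
          <!-- Treat the whole file as one token so the match can span lines. -->
          <filetokenizer/>
          <!-- flags="s" lets "." match newlines. The pattern covers the whole
               input, and \2 keeps only the second group (the table itself). -->
          <replaceregex
            pattern="(.*?)(&lt;table[^&lt;/]*class=&quot;summary&quot;[^&lt;/]*&gt;(.*?)&lt;/table&gt;)(.*)"
            replace="\2"
            flags="s"/>
        </tokenfilter>
      </filterchain>
    </loadfile>
    <echo>${summary.table}</echo>
  </target>
</project>
```

As in the description above, this relies on there being only one summary
table per file; if the pattern does not match, the input passes through
unchanged.
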
>
> So, that works for one file. The problem I have now is getting it to
> work for multiple files - each file that I concatenate has exactly one
> summary table that I want to extract and place in a single HTML summary
> file. I tried:
> (A) Concatenating (<concat>) all of the files and applying a filterchain
> - but the filterchain filters all the input once, not once per file. So
> I concatenate the files first, then apply the regex - which means I only
> get the one matching table from the entire concatenation, not one
> matching table from each file that I concatenate.
> (B) Copying (<copy>) all of the files to a single file - in this case,
> the filterchain extracts the individual table from each file - but I
> only end up with one file's table, because I can't make it concatenate
> them all to one destination (even with a mergemapper).
> "enablemultiplemappings" doesn't seem to help.
>
> If there was some way of saying "for each file, apply the transform
> *before* concatenating, not after", then that would work, but as far as
> I can see, there isn't. Any ideas?
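
One possible way round this, sketched under the assumption that an
intermediate directory is acceptable: let <copy> apply the filterchain once
per file (which is the behaviour reported in (B)), writing each result to its
own file, and then <concat> the filtered copies. The directory names
(reports, build/tables) and the output name are hypothetical, and
<replaceregex> stands in for the <containsregex> used earlier in the thread.

```xml
<project name="summary-tables" default="summary">
  <!-- Hypothetical locations; adjust to the real layout. -->
  <property name="src.dir" value="reports"/>
  <property name="tmp.dir" value="build/tables"/>

  <target name="summary">
    <!-- Step 1: copy each file to its own destination, so the filterchain
         runs once per source file and each copy holds only that file's table. -->
    <copy todir="${tmp.dir}" overwrite="true">
      <fileset dir="${src.dir}" includes="**/*.html"/>
      <filterchain>
        <tokenfilter>
          <filetokenizer/>
          <replaceregex
            pattern="(.*?)(&lt;table[^&lt;/]*class=&quot;summary&quot;[^&lt;/]*&gt;(.*?)&lt;/table&gt;)(.*)"
            replace="\2"
            flags="s"/>
        </tokenfilter>
      </filterchain>
    </copy>

    <!-- Step 2: concatenate the already-filtered files into one output file. -->
    <concat destfile="build/summary.html">
      <fileset dir="${tmp.dir}" includes="**/*.html"/>
    </concat>
  </target>
</project>
```

The output here contains only the tables, in whatever order the fileset
yields them; any surrounding <html> wrapper would have to be added separately.
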
>
> Rebhan, Gilbert wrote:
>> Hi,
>>
>> <target name="depends">
>>     <echo file="Y:/test.html">
>>         <![CDATA[
>>         <html>
>>         <head>
>>         <title>summary</title>
>>         <link rel="stylesheet" href="summary.css" type="text/css">
>>         </head>
>>         <body>
>>         <a name="overview"></a>
>>         <center>
>>         <table class="summary"> was wrong </table>
>>         </center>
>>         </body>
>>         </html>
>>         ]]>
>>     </echo>
>> </target>
>>
>> <target name="main" depends="depends">
>>     <loadfile srcfile="Y:/test.html" property="summary">
>>         <filterchain>
>>             <containsregex
>>               pattern='&lt;table[^&lt;/]*&gt;(.*?)&lt;/table&gt;'
>>               replace="\1"
>>               byline="true"
>>               />
>>             <tokenfilter>
>>                 <!-- to get rid of whitespace in ${summary} -->
>>                 <trim/>
>>             </tokenfilter>
>>         </filterchain>
>>     </loadfile>
>>
>>     <echo>Summary == ${summary}</echo>
>> </target>
>>
>> gives only the text =
>>
>> depends:
>> main:
>>      [echo] Summary == was wrong
>> BUILD SUCCESSFUL
>> Total time: 407 milliseconds
>>
>>
>> you have to use \1 and byline=true
>>
>> Regards, Gilbert
>>
>> -----Original Message-----
>> From: George Bills [mailto:[EMAIL PROTECTED]
>> Sent: Tuesday, November 28, 2006 6:14 AM
>> To: Ant Users List
>> Subject: Re: containsregex and concat
>>
>> Thanks: the regular expression works now, which is progress.
>> Unfortunately I'm getting all of the concatenated text, not just the
>> matching text. If I use replace:
>> <filterchain>
>>   <!--<tokenfilter><filetokenizer />-->
>>     <!-- byline="false" implies filetokenizer -->
>>     <containsregex flags="isg"
>>       pattern="${summary.regex}"
>>       replace="SUMMARYTABLE"
>>       byline="false"
>>       />
>>     <!-- </tokenfilter>-->
>> </filterchain>
>>
>> I end up getting something like:
>> [concat] <html>
>> [concat] <head>
>> [concat] <title>summary</title>
>> [concat] <link rel="stylesheet" href="summary.css" type="text/css">
>> [concat] </head>
>> [concat] <body>
>> [concat] <a name="overview"></a>
>> [concat] <center>
>> [concat] SUMMARYTABLE
>> [concat] </center>
>> [concat] ...more HTML here...
>> [concat] </html>
>>
>> I'm assuming it's because the file is just one big token - but if I use
>> a line tokenizer, will I be able to match regular expressions over
>> multiple lines?
>>
>> Thanks for the help.
>>
>> Rebhan, Gilbert wrote:
>>
>>> Hi,
>>>
>>> <table[^>/]*>(.*?)</table>
>>>
>>> should match :
>>>
>>> <table class="summary">foobar</table>
>>>
>>> also with more than one attribute
>>>
>>> <table class="summary" foo="bar">foobar</table>
>>>
>>>
>>> foobar is \1 (group 1)
>>>
>>>
>>> Regards, Gilbert
>>>
>>>
>>> -----Original Message-----
>>> From: George Bills [mailto:[EMAIL PROTECTED]
>>> Sent: Monday, November 27, 2006 6:41 AM
>>> To: Ant Users List
>>> Subject: Re: containsregex and concat
>>>
>>> Hrm, it probably isn't since advanced regexes are still black magic to
>>> me. The "." was supposed to match any character, including a newline
>>> (with the s flag), the * to say match 0-n of them and the ? to say be
>>> lazy, match as little as possible (so that I don't pull in
>>> <table>...</table><table>...</table> in one match).
>>>
>>> I just tried [^<], but it doesn't seem to work - I think because of such
>>> things as "<table><tr>...</tr></table>" - the opening bracket of <tr>
>>> conflicts. I tried [.&lt;&gt]*? to make sure that the "regex.body" part
>>> was matching the brackets, but that didn't work either.
>>>
>>> Also, <table class="summary"> was wrong - <table class="summary"(.*?)>
>>> is a little better since the tables can have more than the class
>>> attribute (in fact, all of them do). But after changing that I'm
>>> matching the entire document - <html> through to </html>. That might
>>> just be because I'm using filetokenizer - if I make one match within
>>> filetokenizer, do I end up getting the entire document? If so, how do I
>>> get only the matching text?
>>>
>>> Regex is now: <table class="summary".*?>.*?</table>
>>>
>>> Thanks for the help, I appreciate it.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>>> For additional commands, e-mail: [EMAIL PROTECTED]


-- 
Brian Agnew                  http://www.oopsconsultancy.com
OOPS Consultancy Ltd
Tel: +44 (0)7720 397526
Fax: +44 (0)20 8682 0012

