Re: Text parsing: RegEx, filters, maybe third party software?

Januk Aggarwal Sun, 02 Nov 2003 01:20:06 -0800

Hello Peter,

On Friday, October 31, 2003 at 15:23 GMT +0100, audiences applauded as
Peter Fjelsten [PF] announced:


PF> I have figured this out: In the template for saving the file I have made
PF> it so the date is saved in YYYYMMDD. It can be extracted with
PF> [The order date] = (^Date:\s{4}).{8}

Ok, so you've got this part.

P>> [Multi-line comment field, including blank lines, that may or may not be
P>> there]
PF> 
PF> This poses a problem. It could look like this:

It is not really a problem if you remember that you can use the other
anchors.  I'll use your notation and assume that you know how to
figure out subpatterning and TB's macros to get it working in a
template.

[Comment] = (?ism)^(Date Ordered|Ordre modtaget):[^\n]*\n\s*(.*?)\s*\n(Products|Produkter):
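
To see which part of the text the comment expression picks up, here is a minimal sketch in Python. Python's re module is only a stand-in for TB's regexp engine, and the sample order text (including the colon after "Products") is an assumption, not the real form layout.

```python
import re

# The [Comment] pattern from the message, written on one line.
COMMENT_RE = re.compile(
    r"(?ism)^(Date Ordered|Ordre modtaget):[^\n]*\n\s*(.*?)\s*\n(Products|Produkter):"
)

def extract_comment(order_text):
    """Return the comment block (subpattern 2), or None if the anchors are missing."""
    match = COMMENT_RE.search(order_text)
    return match.group(2) if match else None

# Hypothetical order text; the real layout may differ.
SAMPLE = (
    "Date Ordered: 20031031\n"
    "\n"
    "Please deliver after 5pm.\n"
    "\n"
    "Ring the bell twice.\n"
    "\n"
    "Products:\n"
    "------\n"
)
NO_COMMENT = "Date Ordered: 20031031\n\nProducts:\n------\n"

print(repr(extract_comment(SAMPLE)))
print(repr(extract_comment(NO_COMMENT)))
```

The blank line inside the comment is kept, while surrounding whitespace is trimmed. An order without a comment gives an empty string rather than a failed match.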

P>> Products#3
P>> ------------------------------------------------------
P>> [Number ordered] x Single tank adapter ([Item model name]) = 280dkk
PF> 
PF> I have:
PF> [Number ordered] = ^\d+(\sx\s)
PF> [Item model name] = (\d{1,2}\sx\s.+)(\s\().+(\)\s)

Yes, except that your subpatterning is not uniquely capturing the part
of the string that you want.  Also, have you put some sort of anchors
to make sure your regexps aren't confused by similar text elsewhere in
the message?  Also, how are you dealing with the fact that there could
be an arbitrary number of models ordered?
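
The subpatterning point can be illustrated with a short Python sketch (Python's re module standing in for TB's regexp engine). The item line and the model name "STA-100" are made up for illustration.

```python
import re

line = "2 x Single tank adapter (STA-100) = 280dkk"  # hypothetical item line

# PF's pattern: the parentheses wrap the separator, not the number.
pf_match = re.search(r"^\d+(\sx\s)", line)

# Parentheses moved onto the parts that are wanted; the rest is context only.
ITEM_RE = re.compile(r"^(\d+)\sx\s.+\s\((.+)\)\s")
fixed = ITEM_RE.search(line)

print(repr(pf_match.group(1)))          # ' x '
print(fixed.group(1), fixed.group(2))   # 2 STA-100
```

With the parentheses placed this way, subpattern 1 is the number ordered and subpattern 2 is the model name, so each field has its own subpattern.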

If you haven't thought about it, you'll be best off using a recursive
engine for this part.


P>> Sub-Total#4: 3.640dkk
P>> [Shipping method] (Shipping (5-7 days) to NO : 9.72 kg): [Shipping
P>> Price]
PF> 
PF> [Shipping method] = (Sub(-Total|total).+\n).+(\s\(\(D+)
PF> [Shipping Price] = (Sub(-Total|total).+\n)(.*:\s).+(dkk$)

Does this work for you?

P>> Total: 3.865dkk
P>> {3}
PF> 
PF> How can I test for the text strings "Moms" or "DK moms/VAT" at the
PF> position {3} and use that later?

Well, your best bet is to capture that string and put it into a
variable (ah the pride and joy of v2.0x).  Then when you want to use
the condition, use a %IF statement.


PF> = (Delivery Address|Leveringsadresse)(.*\n)(-+\n)(^.+\n)(^.+\n)^.+
PF> 
P>> [Delivery Address3, if applicable]
PF> 
PF> I don't know how to end the extraction of these addresses - how many of
PF> them there are, and how to stop at the right time. I suspect that I can use
PF> the Delivery Post Code as a stop clause, but how to do this, i.e. the
PF> recursive element escapes me.

Well, you can do them all at once by looking at the subpatterns.  If
you hard-code it, then you must choose the maximum number of lines, i.e.
how many times you repeat the .+\n part.  For this, I don't think
you'll be best served by a recursive engine.  Your entire form is
probably generated by a bot/webpage.  So presumably the address lines
aren't completely arbitrary.

I'd use something like:
(?i-s)(Delivery Address|Leveringsadresse)(.*\n)(-+\n)(.+\n)?(.+\n)?(.+\n)?(.+\n)?((\D{1,2}.\d{3,6})|\d{3,6})\s(.*?)\s*\n(.*?)\s*\n

This looks ugly, it is horribly long, but it should* get all the
address info at once.  So:
 Name -> Subpatt 4
 Add1 -> Subpatt 5
 Add2 -> Subpatt 6
 Add3 -> Subpatt 7
 Post Code -> Subpatt 8
 City -> Subpatt 10
 Country -> Subpatt 11

* untested...
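
One way to check the subpattern numbering outside TB is a small Python sketch. Python's re module is only a stand-in here, (?i) replaces (?i-s) because the dot already excludes newlines by default in Python, and the sample address is invented.

```python
import re

# The address expression, split over three string literals for readability.
ADDRESS_RE = re.compile(
    r"(?i)(Delivery Address|Leveringsadresse)(.*\n)(-+\n)"
    r"(.+\n)?(.+\n)?(.+\n)?(.+\n)?"
    r"((\D{1,2}.\d{3,6})|\d{3,6})\s(.*?)\s*\n(.*?)\s*\n"
)

# Subpattern number -> field label, as listed in the message.
LABELS = {4: "Name", 5: "Add1", 6: "Add2", 7: "Add3",
          8: "Post Code", 10: "City", 11: "Country"}

def parse_address(text):
    """Return a dict of address fields, or None if the block is not found."""
    m = ADDRESS_RE.search(text)
    if m is None:
        return None
    # Unused optional lines come back as None; show them as empty strings.
    return {label: (m.group(n) or "").strip() for n, label in LABELS.items()}

# Hypothetical address block with only two lines before the post code.
SAMPLE = (
    "Delivery Address\n"
    "----------------\n"
    "Ola Nordmann\n"
    "Storgata 1\n"
    "0155 Oslo\n"
    "Norway\n"
)

print(parse_address(SAMPLE))
```

For this sample the optional lines fill from the top: Name and Add1 are set, Add2 and Add3 stay empty, and the post code, city and country land in subpatterns 8, 10 and 11.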

The subpatterning should be right, but I recommend creating a test
template with the regexp above, followed by a list of subpatterns.
Something like:

=====[Begin regexp test template]=====
%IF:'%_Text'='':'%_Text="%ClipBoard"'%-
%setpattregexp='...'%-
%RegExpBlindMatch='%_Text'%-
%_Text%-

SubPatt 0 = <%SUBPATT="0">
        1 = <%SUBPATT="1">
        2 = <%SUBPATT="2">
        3 = <%SUBPATT="3">
        4 = <%SUBPATT="4">
%REM='        5 = <%SUBPATT="5">
        6 = <%SUBPATT="6">
        7 = <%SUBPATT="7">
        8 = <%SUBPATT="8">
        9 = <%SUBPATT="9">
       10 = <%SUBPATT="10">
       11 = <%SUBPATT="11">
       12 = <%SUBPATT="12">
       13 = <%SUBPATT="13">
       14 = <%SUBPATT="14">
       15 = <%SUBPATT="15">
       16 = <%SUBPATT="16">
       17 = <%SUBPATT="17">
       18 = <%SUBPATT="18">
       19 = <%SUBPATT="19">
       20 = <%SUBPATT="20">
'
=====[ End  regexp test template]=====

Obviously you need to move the %REM around to expose as many
subpatterns as you need.  I recommend exposing a couple more than you
expect, just in case you counted wrong.
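
For testing expressions outside TB, a rough Python counterpart of the test template might look like the sketch below. The function name dump_subpatterns is made up, and Python's regexp dialect is close to, but not guaranteed identical to, the one TB uses.

```python
import re

def dump_subpatterns(pattern, text):
    """List subpattern 0..N of the first match, one line per subpattern."""
    m = re.search(pattern, text)
    if m is None:
        return ["no match"]
    # m.re.groups is the number of capturing groups in the pattern,
    # so there is no need to guess how many lines to expose.
    return ["%2d = <%s>" % (n, m.group(n) or "") for n in range(m.re.groups + 1)]

for row in dump_subpatterns(r"^(\d+)\sx\s(.+)$", "2 x adapter"):
    print(row)
```

Subpatterns that did not take part in the match are shown as empty angle brackets, much like an unused %SUBPATT line in the template.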

P>> [Delivery Post code]{4}
PF> 
PF> = (\n)\D{1,2}.\d{3,6}(\s)|(\n)\d{3,6}(\s) - except I have a problem
PF> here. This will not find "SE-123 23". Can anybody help?

Your expression is almost there, just some minor changes as I've
included in the expression above.
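
To see what the post code part of that expression captures on its own, here is a small Python sketch. Python's re module is a stand-in for TB's engine, the ^ and $ anchors are added for the single-line test, and the sample lines are invented.

```python
import re

# Post code and city part of the address expression, anchored to one line.
POSTCODE_RE = re.compile(r"^((\D{1,2}.\d{3,6})|\d{3,6})\s(.*?)\s*$")

def split_postcode_line(line):
    """Return (post code, rest of line), or None if the line does not match."""
    m = POSTCODE_RE.search(line)
    return (m.group(1), m.group(3)) if m else None

for sample in ("8000 Aarhus", "DK-8000 Aarhus", "SE-123 23 Stockholm"):
    print(sample, "->", split_postcode_line(sample))
```

In this sketch the line "SE-123 23 Stockholm" does match, but only "SE-123" is captured as the post code, and the trailing "23" ends up at the front of the city subpattern.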

P>>  [Delivery City]
<snip>
P>> [Delivery country]

See above.

P>> Billing Address#6

Copy the expression above, just change the anchor names.

PF> I don't know how to make a QT that takes the RegExp above and saves it
PF> in variables so it can be formatted as below.

Well, your best bet is to do a whole bunch of smaller regexps when you
need them.  Otherwise your expression will get unwieldy and buggy.

PF> In other words - I need the "shell" for all my little extractions - the
PF> main QT that handles all of this.

Most of them should use:
FieldID=%-
%SetPattRegexp="..."%-
%RegexpBlindMatch="%Text"%-
%Subpatt="..."

There are a couple where you need more sophisticated templates, so
just put the formatting/recursive templates in QTs and use:
FieldID=%QInclude="..."
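
The same "one small regexp per field" shell can be sketched in Python to show the overall shape. Python stands in for the TB macros, the field IDs DATO and TOTAL are hypothetical, and the patterns are illustrative rather than tested against the real order mails.

```python
import re

# FieldID -> (pattern, subpattern number). IDs and patterns are illustrative.
FIELDS = {
    "DATO": (r"(?m)^Date:\s{4}(.{8})", 1),
    "TOTAL": (r"(?m)^Total:\s*([\d.]+)dkk", 1),
}

def render(text, fields=FIELDS):
    """Run each small regexp on the text and emit one FieldID=value line per field."""
    lines = []
    for field_id, (pattern, group) in fields.items():
        m = re.search(pattern, text)
        lines.append("%s=%s" % (field_id, m.group(group) if m else ""))
    return "\n".join(lines)

SAMPLE = "Date:    20031031\nTotal: 3.865dkk\n"  # hypothetical saved order
print(render(SAMPLE))
```

Each field has its own short pattern and its own subpattern number, so one failing expression leaves that field empty without affecting the others.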

<snip>

PF> This is a problem. How do I make a statement:
PF> 
PF>      If "Moms" or "DK moms/VAT" is present in the order
PF>      Then make a FRAGTMOMSPLIGTIG=X, and let X be 0.8*[Shipping
PF>      Price] 1)
PF> 
PF>      If "Moms" or "DK moms/VAT" is NOT present in the order
PF>      Then make a FRAGTMOMSFRI=[Shipping Price] 1)

Use %IF and %CALC.  You can extract just the number part for use with
%Calc, and use another subpattern to capture any other text.  In the
test condition for %IF, do the search for the "Moms" or "DK..."
strings.
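
The logic of the %IF and %CALC combination can be sketched in Python as follows. This is not TB macro syntax. It assumes that "." in the prices is a thousands separator (as "3.865dkk" suggests) and that the VAT search may ignore case.

```python
import re
from decimal import Decimal

VAT_RE = re.compile(r"DK moms/VAT|Moms", re.IGNORECASE)
PRICE_RE = re.compile(r"([\d.]+)\s*dkk", re.IGNORECASE)

def shipping_fields(order_text, shipping_price_text):
    """Pick the shipping field name and value depending on the VAT marker."""
    m = PRICE_RE.search(shipping_price_text)
    if m is None:
        return {}
    # Assumption: "." is a thousands separator, so it is dropped before the maths.
    price = Decimal(m.group(1).replace(".", ""))
    if VAT_RE.search(order_text):
        return {"FRAGTMOMSPLIGTIG": price * Decimal("0.8")}
    return {"FRAGTMOMSFRI": price}

print(shipping_fields("Total: 3.865dkk\nDK moms/VAT: 773dkk", "225dkk"))
print(shipping_fields("Total: 3.865dkk", "225dkk"))
```

The number is pulled out with one subpattern before any arithmetic, and the presence test for the VAT strings is a separate search, which mirrors the split between %CALC and the %IF condition.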

P>> <VARE>
P>> ANTAL=[Number ordered]
PF> 
P>> VARENUMMER=[Item model name]
PF> 
PF> How do I generate a <VARE> with ANTAL=^\d+(\sx\s) and
PF> VARENUMMER=(\d{1,2}\sx\s.+)(\s\().+(\)\s) for each instance of ANTAL?

You'll have to generate this section with a recursive template.  The
"easiest" thing to do is to grab all the items with one regexp, then
feed that string to the recursive template.  The recursive part can
then analyze each line, pulling out the number and model name while
adding the repeating tags.
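
The two-step idea can be sketched in Python, with a plain loop standing in for the recursive template. The sample order, the model names and the exact boundaries of the item block (a dashed rule above, "Sub-Total" below) are assumptions.

```python
import re

# Step 1: grab everything between the dashed rule under "Products" and "Sub-Total".
BLOCK_RE = re.compile(r"(?ims)^(?:Products|Produkter)[^\n]*\n-+\n(.*?)^Sub-?Total")
# Step 2: one item per line, number ordered and model name captured separately.
ITEM_RE = re.compile(r"(?m)^(\d+)\sx\s.+\s\((.+)\)\s")

def format_items(order_text):
    """Emit a <VARE> entry with ANTAL and VARENUMMER for every item line."""
    block = BLOCK_RE.search(order_text)
    if block is None:
        return ""
    out = []
    for qty, model in ITEM_RE.findall(block.group(1)):
        out += ["<VARE>", "ANTAL=" + qty, "VARENUMMER=" + model]
    return "\n".join(out)

# Hypothetical order fragment.
SAMPLE = (
    "Products\n"
    "------\n"
    "2 x Single tank adapter (STA-100) = 280dkk\n"
    "1 x Twin tank adapter (TTA-200) = 450dkk\n"
    "Sub-Total: 730dkk\n"
)

print(format_items(SAMPLE))
```

Restricting the second expression to the captured block keeps it from being confused by similar text elsewhere in the message.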

PF> Also I need to be able to distinguish the different orders in the mail.
PF> I need to format each <ORDRE> within the scope of each mail saved to
PF> disc: these can be limited by ===-===-===... to start each mail and
PF> ---=---=--- to end each mail - and consequently <ORDRE>. How do I do
PF> that?

I'm not sure I follow.  Are you using a filter to extract all this
info from a message and append it to a file which would then contain
all orders?  If so, then just add those delimiters to the top and
bottom of your template as necessary.

PF> I am very proud of myself for having reached so far - the visual RegExp
PF> 3.0 really helps, but now I am probably at my wits' end - please
PF> somebody help me!

You're doing well.  I recommend using my template above to test your
regexps with TB.  Give it a simple name, like regexp, then insert it
into a new message by typing: regexp<ctrl><space>

Depending on which macros you want to test for %RegexpBlindMatch, you
might want to reply to the message you want to process.  Just remember
not to send out these test messages.  ;-)

Good luck!

-- 
Thanks for writing,
 Januk Aggarwal




________________________________________________________

http://www.silverstones.com/thebat/TBUDLInfo.html
