Re: Text parsing: RegEx, filters, maybe third party software?

2003-11-11 Thread Peter Fjelsten
Januk [J],

On 10-11-2003 05:27, you wrote in mid:[EMAIL PROTECTED]:
J That's because the address regexps can't handle the colon after the
J labels in the second example.  There are a couple of other errors
J too. Try the following (note: I haven't tested this much, so it may
J need polishing):

I works better now.

The Post code (POSTNR) thing needs a bit of tweaking (it returns an
empty string), but I think I can manage that myself.

I'm in greater trouble concerning the FRAGTMOMSFRI or FRAGTMOMSPLIGTIGT.
This string is also empty.

J =[Begin template fragment]= %Rem= Get Name and address from
J Billing Address %- %-

I have wrapped this for readability reasons:


J %SetPattRegexp=(?im-s)^(Billing\sAddress|Fakturaadresse)
J [^\n]*?\n-{50,}\s*?\n(.*?)\s*?\n((.*?)\s*?\n)?((.*?)\s*?\n)?((.*?)\s*?\n)?
J ((\D{1,2}.\d{3,6})|\d{3,6})\s(.*?)\s*\n\s*(.*?)\s*?\n

This last line is for the post code, right? I guess the missing string
stems from the problem with spaces in the possibilities, right?

Possible post codes:
5678
N-5678
S-23 456
S-345 56
SE-234 56
SE-45 456
- each followed by a space and a non-integer character.

J %RegexpBlindMatch=%Text%-

What does this line do? Does it say where to extract the result from?

J %_Shipping='%SetPattRegexp=(?im-s)shipping.*?(\d*([\.\,]d*)?)dkk\s*\n%-
J %___%RegexpMatch=%Text'%-
J %-
J %If:'%RegexpText=(?im-s)^Total:.*\n\s*?(Moms|DK\smoms\/VAT)\n\n''':'%-
J FRAGTMOMSPLIGTIGT=%Calc=%_Shipping*0.8dkk':'%-
J FRAGTMOMSFRI=%_Shipping'

This is wrong. Although the string DK moms/VAT: is present the pure
number is returned and it is set to FRAGTMOMSFRI (without *0,8)

PF As I don't really understand how you make sub-patterns and variables
PF it's a bit hard for me to change your code.

J Subpatterns are simply parts of the regexp surrounded by round
J brackets.  Counting them is also very easy, just count the number of
J opening brackets.

... and they are numbered sequentially?

J I actually used a fair bit of what you wrote, just combined it and
J cleaned some of it up a bit.  But you had done quite a bit of the
J logic.  So hang in there, look for things that look similar to yours
J and go from there.

Any hint on a recursive element for the goods, each instance of which
is preceded by "varer"?



-- 
Best regards,
Peter Fjelsten
The Bat! version 2.01.20
Windows XP 5.1.2600








Re: Text parsing: RegEx, filters, maybe third party software?

2003-11-11 Thread Januk Aggarwal
Hello Peter,

On Tuesday, November 11, 2003 at 22:22 GMT +0100, authorities charged
Peter Fjelsten [PF] for writing:

PF I works better now.

I'm glad you're working better, but what about the regexps?! ;-)
(Sorry, I tried to resist, really...)

PF I'm in greater trouble concerning the FRAGTMOMSFRI or
PF FRAGTMOMSPLIGTIGT. This string is also empty.

Hmm, this isn't good.  I don't see the error.  Can you please give me
again the exact strings that *could* be in the message?  Maybe I'm
looking for a space that doesn't exist or something...

PF I have wrapped this for readability reasons:

We can use the PCRE extended mode if you really want to make it much
more readable.  You have to be careful where you place the
closing quotes if you use comments.  Also, remember that in extended
mode, whitespace in the expression is ignored.  For example, the
regexp you quoted could be written:

%SetPattRegexp=(?imx-s)
^(Billing\sAddress|Fakturaadresse)[^\n]*?\n   # Billing address label
-{50,}\s*?\n                                  # Separator line
(.*?)\s*?\n                                   # Name
((.*?)\s*?\n)?                                # Address line 1 (optional)
((.*?)\s*?\n)?                                # Address line 2 (optional)
((.*?)\s*?\n)?                                # Address line 3 (optional)
((\D{1,2}.\d{3,6})|\d{3,6})\s                 # Postal Code
(.*?)\s*\n                                    # City
\s*(.*?)\s*?\n                                # Country
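
To illustrate the point about extended mode outside The Bat!, here is a minimal Python sketch. It assumes Python's `re.X` flag behaves like PCRE's `x` option for this pattern (layout whitespace and `#` comments are ignored), and the sample text is hypothetical, not taken from a real order mail. Only the first three lines of the expression are used to keep it short.

```python
import re

# Compact form: everything on one line, no comments.
compact = re.compile(
    r"^(Billing\sAddress|Fakturaadresse)[^\n]*?\n-{50,}\s*?\n(.*?)\s*?\n",
    re.I | re.M,
)

# Extended form: re.X ignores layout whitespace and "#" comments,
# so a literal space has to be written as \s (as the pattern already does).
extended = re.compile(
    r"""
    ^(Billing\sAddress|Fakturaadresse)[^\n]*?\n   # billing address label
    -{50,}\s*?\n                                  # separator line
    (.*?)\s*?\n                                   # name
    """,
    re.I | re.M | re.X,
)

# Hypothetical sample text for illustration only.
sample = "Fakturaadresse\n" + "-" * 60 + "\nJane Doe\n1 Example Street\n"

compact_name = compact.search(sample).group(2)
extended_name = extended.search(sample).group(2)
print(compact_name, extended_name)
```

Both forms pick up the same name, which suggests the commented layout can be used freely as long as no literal whitespace is needed in the pattern.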


PF This last line is for the post code, right?

I imagine it is much more obvious with the expression above...

PF I guess the missing string stems from the problem with spaces in
PF the possibilities, right?

Yes, probably.  Hmm, ok well you can try replacing the postal code and
city lines above with:

((\D{1,2}-)?\d{2,4}(\s\d{2,3})?)\s # Postal Code
\s*?(\D.*)\s*?\n                   # City

The subpattern numbers of the City and Country both increase by 1,
i.e. they are now subpatterns 12 and 13 respectively.
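
As a quick check of the postal code idea against the formats Peter listed, here is a small Python sketch. It is only an approximation: it uses Python's `re` module rather than The Bat!'s PCRE, it treats the second digit group as optional so that plain codes such as 5678 are accepted, and the city names are made up.

```python
import re

# Postal code followed by a city, one address line at a time.
POSTAL = re.compile(
    r"""
    ^((\D{1,2}-)?\d{2,4}(\s\d{2,3})?)\s   # postal code -> group 1
    \s*?(\D.*?)\s*$                       # city        -> group 4
    """,
    re.X,
)

def split_postal(line):
    """Return (postal_code, city), or None if the line does not match."""
    m = POSTAL.match(line)
    return (m.group(1), m.group(4)) if m else None

# The code formats from the list; the city names are hypothetical.
samples = [
    "5678 Oslo",
    "N-5678 Oslo",
    "S-23 456 Malmo",
    "S-345 56 Malmo",
    "SE-234 56 Malmo",
    "SE-45 456 Malmo",
]
results = [split_postal(s) for s in samples]
for r in results:
    print(r)
```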

J %RegexpBlindMatch=%Text%-

PF What does this line do? Does it say where to extract the result from?

Yes, and it runs the expression so the subpatterns get filled with
their results.  Because we did a blind match, nothing is returned
unless we explicitly call a %Subpatt macro.
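
A rough Python analogy of that behaviour may help. This is not how The Bat! is implemented; the class and method names are hypothetical and merely mirror the macro names. The blind match runs the expression and remembers the result but contributes nothing to the output, and the subpatterns are read back afterwards.

```python
import re

class PatternState:
    """Hypothetical stand-in for the template engine's regexp state."""

    def __init__(self):
        self._pattern = None
        self._match = None

    def set_patt_regexp(self, pattern, flags=0):
        # Analogue of %SetPattRegexp: store the expression for later use.
        self._pattern = re.compile(pattern, flags)

    def regexp_blind_match(self, text):
        # Analogue of %RegexpBlindMatch: run the match, fill the
        # subpatterns, but return an empty string (nothing is output).
        self._match = self._pattern.search(text)
        return ""

    def subpatt(self, n):
        # Analogue of %Subpatt: return subpattern n, or "" if unset.
        if self._match is None:
            return ""
        return self._match.group(n) or ""

state = PatternState()
state.set_patt_regexp(r"^(\d+)\s*x\s*(.*)$", re.M)
output = state.regexp_blind_match("2 x Widget")
quantity = state.subpatt(1)
item = state.subpatt(2)
print(repr(output), quantity, item)
```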

PF This is wrong. Although the string DK moms/VAT: is present the pure
PF number is returned and it is set to FRAGTMOMSFRI (without *0,8)

Ok, obviously the %IF condition isn't being met.  The following regexp
should return the DK moms/VAT string if it is present:
%RegexpText=(?im-s)^Total:.*\n\s*?(Moms|DK\smoms\/VAT)\n\n

That's what we're using to make the choice of output.  Maybe we should
add a \s* (without quotes) immediately before the closing \n\n pair.
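
The effect of that extra `\s*` can be tried out in Python. This is a sketch under assumptions: Python's `re` stands in for PCRE, the `(?im-s)` options are passed as flags, and the sample text (including the trailing spaces after the label) is hypothetical.

```python
import re

FLAGS = re.I | re.M  # equivalent of (?im-s): dot does not match newline

original = re.compile(r"^Total:.*\n\s*?(Moms|DK\smoms/VAT)\n\n", FLAGS)
relaxed = re.compile(r"^Total:.*\n\s*?(Moms|DK\smoms/VAT)\s*\n\n", FLAGS)

# Hypothetical message fragments.
clean = "Total: 100,00 DKK\nDK moms/VAT\n\nnext line"
trailing = "Total: 100,00 DKK\nDK moms/VAT  \n\nnext line"

original_on_clean = original.search(clean) is not None
original_on_trailing = original.search(trailing) is not None
relaxed_on_clean = relaxed.search(clean) is not None
relaxed_label = relaxed.search(trailing).group(1)
print(original_on_clean, original_on_trailing, relaxed_on_clean, relaxed_label)
```

If the relaxed version still fails on the real message, the text between the label and the blank line contains something other than whitespace, which is why the exact strings matter.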

J Subpatterns are simply parts of the regexp surrounded by round
J brackets.  Counting them is also very easy, just count the number of
J opening brackets.

PF ... and they are numbered sequentially?

Yes.
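
The counting rule can be checked mechanically. The Python sketch below counts opening brackets in the billing address expression quoted earlier and compares that with the number of groups the regex engine reports. The helper `count_opening_brackets` is a naive, hypothetical function: it does not handle brackets inside character classes, which this pattern does not contain.

```python
import re

BILLING = (
    r"^(Billing\sAddress|Fakturaadresse)[^\n]*?\n-{50,}\s*?\n"
    r"(.*?)\s*?\n((.*?)\s*?\n)?((.*?)\s*?\n)?((.*?)\s*?\n)?"
    r"((\D{1,2}.\d{3,6})|\d{3,6})\s(.*?)\s*\n\s*(.*?)\s*?\n"
)

def count_opening_brackets(pattern):
    # Count "(" that is neither escaped nor the start of a "(?...)" construct.
    return len(re.findall(r"(?<!\\)\((?!\?)", pattern))

manual_count = count_opening_brackets(BILLING)
engine_count = re.compile(BILLING, re.I | re.M).groups

# Numbering follows the opening brackets from left to right:
# group 1 = whole postal code, group 2 = prefixed form, group 3 = city.
postal = re.compile(r"((\D{1,2}.\d{3,6})|\d{3,6})\s(.*)")
with_prefix = postal.match("N-5678 Oslo").groups()
without_prefix = postal.match("5678 Oslo").groups()
print(manual_count, engine_count, with_prefix, without_prefix)
```

A group that did not take part in the match (the prefixed form, for a plain code) simply comes back empty; the numbering of the later groups does not shift.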

PF Any hint on a recursive element for the goods, each instance of
PF which is preceded by "varer"?

The basic skeleton of a recursive template is something like:

=[Begin generalized recursive template]=
%REM='
   TB v2.0x recursive template template
   Written November 2003 by Januk Aggarwal
'%-
%-
%REM='
   This is an initialization segment.  It can be removed if you
   know that you'll initialize the text variable outside of this
   QT
'%-
%IF:'%_QTName_FirstTime'='':'%-
%___%IF:%_QTName_Text=:%_QTName_Text(%Clipboard)%-
%___%_QTName_FirstTime=No'%-
%-
%-
%REM='
   This is the main processing segment
'%-
%-
%REM=' Remaining text condition '%-
%IF:'%-
%SetPattRegexp=(?is-m)[^\n]+%-
%RegexpMatch(%_QTName_Text)''':'%-
%-
%REM=  Main regexp.  You can pull out all interesting sections of
text here.  Just do it line by line/repetition by repetition
%-
%-
%SetPattRegexp=(?is-m)^(.*?)\n(.*)$%-
%RegexpBlindMatch(%_QTName_Text)%-
%-
%REM=  Put the subpatterns into useful variables  %-
%-
%_QTName_Line_1(%Subpatt(1))%-
%_QTName_Remainder_Of_Text(%Subpatt(2))%-
%-
%REM=  Do formatting and output here  %-
Some initial text%-
%_QTName_Line_1%-
some closing text
%-
%Rem=  Put remaining text into text variable %-
%-
%_QTName_Text(%_QTName_Remainder_Of_Text)%-
%-
%REM=  Call template recursively and then close if  %-
%QInclude(QTName)'%-
=[ End  generalized recursive template]=
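
The control flow of the template above can be sketched in Python for readers who find the macro syntax hard to follow. This is an analogy, not a translation: the function name and the output format are hypothetical, and the handling of a last line without a trailing newline is my own assumption, since the template's behaviour in that case is not stated here.

```python
import re

HAS_TEXT = re.compile(r"[^\n]+")                    # remaining text condition
FIRST_LINE = re.compile(r"^(.*?)\n(.*)$", re.S)     # first line + remainder

def process(text, out=None):
    """Peel off one line per call, format it, then recurse on the rest."""
    out = [] if out is None else out
    if not HAS_TEXT.search(text):
        return out                      # nothing but newlines left: stop
    m = FIRST_LINE.match(text)
    if m:
        line, rest = m.group(1), m.group(2)
    else:
        # Assumption: a final line without "\n" is processed as-is.
        line, rest = text, ""
    out.append(f"Item: {line}")         # "do formatting and output here"
    return process(rest, out)           # recursive call with the remainder

result = process("2 x Widget\n1 x Gadget\n3 x Gizmo\n")
print(result)
```

Each call handles exactly one repetition and hands the remainder to the next call, which is the same shape as the `%QInclude(QTName)` at the end of the template. In Python a plain loop would be more idiomatic; recursion is kept only to mirror the template.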

Look at my TB v2 wrap template* or the recipient list template* for
an example.  Both are more complicated than what you need.

Your main regexp would probably be something like:
%SetPattRegexp=(?isx-m)  # Option setting    -   Subpattern
^\s*(\d+)\s*             # Quantity          -   1
x\s*                     # Times
(.*?)\s*                 # Item description  -   2
\((.*?)\)\s*             # Item model name   -   3
=\s*