Message: 14
Date: Sat, 11 Jun 2016 15:48:00 -0400
From: Gregory Lypny <gregory.ly...@videotron.ca>
To: LiveCode Discussion List <use-livecode@lists.runrev.com>
Subject: Need Help With String Pattern Matching
Message-ID: <19a0e5fc-e4ce-42e8-9dd1-1b4d9040b...@videotron.ca>
Content-Type: text/plain; charset=utf-8

Hello everyone,

> I used to do some basic text analysis of files where the lines containing 
> strings of interest were consistent and therefore easy to spot. I am now 
> working on files where the chunk of text that contains the data I want is 
> more ambiguous.…

> The chunk starts with the word *owner* or the phrase *beneficial owner*.
>
> The chunk ends with *all directors* or *less than one percent*.
>
> The chunk contains all of the following:
> - At least four or five big numbers, e.g., 234,879
> - At least two percentages, e.g., 3.4%, or percentage signs

MatchChunk uses regular expressions ("regex" for short). I don't claim to be a 
master of regex, but hopefully the following will be of some help to you.

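First off, "owner" or "beneficial owner". Alternatives go inside parentheses, 
separated by a vertical bar:

(owner|beneficial owner)

Don't use square brackets for this: [owner|beneficial owner] is a character 
class, which matches just one character out of the ones listed. Since the 
parenthesized bit is the start of the chunks you're interested in, you'll put 
it at the beginning of your regex filter. Next is "all directors" or "less 
than one percent". That's going to be similar:

(all directors|less than one percent)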
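
To see the difference between parentheses and square brackets, here is a small sketch. It uses Python's re module purely as a stand-in for LiveCode, on the assumption that both treat these basic constructs the same way; the sample strings are made up.

```python
import re

# Parentheses group whole alternatives separated by "|".
start = re.compile(r"(owner|beneficial owner)")

# Square brackets define a set of single characters instead.
char_class = re.compile(r"[owner|beneficial owner]")

# The grouped form finds the whole phrase.
start_match = start.search("Name of beneficial owner").group(0)

# The bracketed form matches one character that happens to be in the set.
stray_match = char_class.search("xyz-a").group(0)

print(repr(start_match))
print(repr(stray_match))
```

The bracketed version would therefore "match" almost any line of English text, which is why it is no use as a start or end marker.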

And *that* bit goes at the *end* of your regex filter. In between the start-bit 
and the end-bit, you have "four or five big numbers", and "percentages" or 
"percentage signs". "Big number" isn't really a well-defined concept, but 
here's one way to go for "big numbers": 

[0-9][0-9],[0-9][0-9][0-9]

In regex, that bit matches exactly two digits, a comma, and three more digits, 
but it can match anywhere inside a longer string. So it'll match XX,XXX (where 
"X" is any digit at all); it'll find a match inside XXX,XXX (the last two 
digits before the comma are enough); it'll find one inside XX,XXXX (the first 
three digits after the comma are enough); and so on. Note that this bit *will 
not* match XXXXX, which is a string of five digits in a row *without* any 
commas. As for percentages, this will work for matching a percent sign:

%

And this will work for matching a single digit followed by a percent sign:

[0-9]%
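
Here is a rough check of what those two patterns actually pick up, again using Python's re module as a stand-in for LiveCode. The sample line is invented, and the two "whole" patterns at the bottom are my own stricter variants, not anything required by the approach above.

```python
import re

big_number = re.compile(r"[0-9][0-9],[0-9][0-9][0-9]")
percent = re.compile(r"[0-9]%")

sample = "Holds 234,879 shares (3.4%) and 12345 options"

# Only the part of each number that fits the pattern is returned.
big_hits = big_number.findall(sample)
pct_hits = percent.findall(sample)

# Assumed stricter variants that capture the whole number or percentage.
whole_number = re.compile(r"[0-9]{1,3}(?:,[0-9]{3})+")
whole_percent = re.compile(r"[0-9]+(?:\.[0-9]+)?%")
whole_number_hits = whole_number.findall(sample)
whole_percent_hits = whole_percent.findall(sample)

print(big_hits, pct_hits)
print(whole_number_hits, whole_percent_hits)
```

For simply detecting that a chunk contains big numbers and percentages, the short patterns are enough; the stricter ones only matter if you want to pull the values out.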

I'm going to assume that you don't know exactly where the "big number"s or 
"percentage"s will be within the chunks you're interested in, or how many 
characters will occur in between the bits of interest. If you want your regex 
filter to ignore what occurs between the bits of interest, this will do the 
trick:

.*

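The period will match any character (except a newline character), and the 
asterisk is regex for "at least 0 of that thing just previous". If the bits of 
interest sit on different lines, which seems likely for a multi-line chunk, 
the period has to be allowed to match newlines too; as far as I know, putting 
(?s) at the very start of the filter does that in PCRE-style regex, which is 
what LiveCode uses. So if you want to match Big Number followed by Percentage, 
this should do the trick: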

[0-9][0-9],[0-9][0-9][0-9].*[0-9]%
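
The newline caveat is easy to trip over, so here is a small sketch of it. Python's re module stands in for LiveCode, and the two sample strings are made up; re.DOTALL is Python's name for the behaviour that an inline (?s) switches on.

```python
import re

pattern = r"[0-9][0-9],[0-9][0-9][0-9].*[0-9]%"

one_line = "Smith 234,879 shares 3.4%"
two_lines = "Smith 234,879 shares\nwhich is 3.4%"

# Number and percentage on the same line: the spacer bridges them.
same_line = re.search(pattern, one_line) is not None

# Split across lines: by default "." stops at the newline.
across_default = re.search(pattern, two_lines) is not None

# With "dot matches newline" switched on, the spacer can cross lines.
across_dotall = re.search(pattern, two_lines, re.DOTALL) is not None
across_inline = re.search("(?s)" + pattern, two_lines) is not None

print(same_line, across_default, across_dotall, across_inline)
```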

If you at least know what order your Big Numbers and Percentages are going to be 
found in, you can build a regex filter for that sequence by fitting the bits 
together like Lego bricks, with the period-asterisk "spacer" in between the 
important bits.
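
If you don't know the order, one possible way around it is to use regex only to cut out the start-to-end chunk and then count the numbers and percentages inside it. The sketch below does that in Python as a stand-in for LiveCode. The function name find_ownership_chunk, the thresholds, the lazy spacer .*? and the sample text are all my own assumptions, so treat it as an illustration rather than a finished filter.

```python
import re

# Start marker, shortest possible filler, end marker; DOTALL lets "." cross lines.
CHUNK = re.compile(
    r"(beneficial owner|owner).*?(all directors|less than one percent)",
    re.DOTALL,
)
BIG_NUMBER = re.compile(r"[0-9][0-9],[0-9][0-9][0-9]")
PERCENT = re.compile(r"[0-9]%")


def find_ownership_chunk(text, min_numbers=4, min_percents=2):
    """Return the first candidate chunk with enough numbers and percentages, else None."""
    for m in CHUNK.finditer(text):
        chunk = m.group(0)
        enough_numbers = len(BIG_NUMBER.findall(chunk)) >= min_numbers
        enough_percents = len(PERCENT.findall(chunk)) >= min_percents
        if enough_numbers and enough_percents:
            return chunk
    return None


sample = (
    "Name of beneficial owner   Shares   Percent\n"
    "A. Smith   234,879   3.4%\n"
    "B. Jones   120,500   1.7%\n"
    "C. Wong    98,000\n"
    "D. Patel   45,210\n"
    "all directors as a group\n"
    "Unrelated text"
)
chunk = find_ownership_chunk(sample)
too_sparse = find_ownership_chunk("owner 12,345 1% all directors")
print(chunk is not None, too_sparse)
```

One limitation to be aware of: candidates are taken left to right without overlapping, so a stray early "owner" can swallow the text up to the first end marker.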
   
"Bewitched" + "Charlie's Angels" - Charlie = "At Arm's Length"
    
Read the webcomic at [ http://www.atarmslength.net ]!
    
If you like "At Arm's Length", support it at [ 
http://www.patreon.com/DarkwingDude ].
