Hi all,
Find below a draft of the manpage of the class Regexp::Ignore.
I think this class is very cool class, but maybe I am totally wrong. Read
the start of the DESCRIPTION of the manpage, and probably you will
understand why it was written.
The class is already programmed, and we actually use it here in our company.
Yet, it is very very new class.
So please - your opinion - is it needed at all, or maybe there are better
ways to get the same results?
Is the name Regexp::Ignore good name for it?
Any other suggestions?
My plan is to submit it next week (according to the replies of this mail of
course)
Thanks,
Rani
NAME
Regexp::Ignore - Let us ignore unwanted parts, while parsing text.
SYNOPSIS
use Regexp::IgnoreXXX;
my $rei = new Regexp::IgnoreXXX($text,
"<!-- __INDEX__ -->");
# split the wanted text from the unwanted text
$rei->split();
# use substitution function
$rei->s('(var)_(\d+)', '$2$1', 'gi');
$rei->s('(\d+):(\d+)', '$2:$1');
# merge back to get the resulted text
my $changed_text = $rei->merge();
DESCRIPTION
Markup languages, like HTML, are difficult to parse. The reason is that
you can have a line like:
<font size=+1>H</font>ello <font size=+1>W</font>orld
How can we find the string "Hello World", in the above line, and replace
it by "Hello Universe" (which is a lot deeper)? Or how can we run a
speller on the text and replace the mistakes with suggestions for the
correct spelling?
This module come to help you doing exactly that.
Actually the module let you first split the text to the parts you are
interested in and the unwanted parts. For example, all the HTML tags can
be taken as unwanted parts.
Then it let you parse the part you are interested in (while totally
ignoring the unwanted parts).
In the end it let you merge back the unwanted parts with the possibly
changed parts you were interested in.
There is just one catch. It uses the assumption that when you replace
the above "Hello World" to "Hello Universe", all the unwanted parts
between the start of the match to the end of the match, will be pushed
after the text that will replace the match. This is not really
understood right? Look at the example:
The text:
<font size=+1>H</font>ello <font size=+1>W</font>orld
will be first split and we will get the "cleaned" text:
Hello World
Then we can parse it using something like:
s/Hello World/Hello Universe/;
This will give us the changed "cleaned" text:
Hello Universe
When we will merge with the unwanted parts we will get
<font size=+1>Hello Universe</font><font size=+1></font>
So, the unwanted parts in the match were pushed after the replacer.
Why this assumption?
Because. Actually, I could not find any better assumption. I can not
guess what will be the unwanted parts in a match and the replacer of the
match might be longer or shorter then the match itself. So, in fact, we
have three reasonable possibilities: 1. Push the unwanted parts before
the replacer. 2. Push the unwanted parts after the replacer. 3. Spread
the unwanted parts in the replacer in the same proportions that they are
spread in the match.
So I chose the second option. It is very similar to the first, and by
far a lot simpler (to implement and to use) then the third.
As you see in the example above, usually it should not break the markup
language. It might, though, give some surprises - in the example above,
"Hello Universe" is all marked to be with bigger fonts.
All in all, I believe that it provides big help when parsing formatted
texts.
So now, that we know what the module can give us, let's check how we use
the module.
The class Regexp::Ignore is an abstract class: there is a method,
get_tokens, in the class that is not implemented. So the user of this
class must inherit it and implement the get_tokens method. The
get_tokens method actually splits the text into tokens and mark them
"wanted" or "unwanted".
Don't panic - it might sound very difficult, but it is not. Moreover,
the module comes with some classes that already inherit from
Regexp::Ignore, and you can use them. For more details about
implementing the get_tokens method and an implementation example, see
below.
After we have the inherited class that implements the get_tokens method,
and we call split to split the text, we can go on with our parsing like
the SYNOPSIS above. We can use the method s which is parallel to the
perl s// operator, and if we need more complex text manipulation, we can
replace text directly using the b<replace> method.
When we finish to change the text, we can call the merge method that
will build the resulted text from the changed "cleaned" text and the
unwanted parts.
HOW IT WORKS
OK, you don't have to read this part if you just want to use the class.
However, if you are the curious type, you might find it interesting.
The get_tokens method splits the text to tokens that are kept in a list.
It also creates other list that contains "wanted" flags. So actually we
get a list of tokens and for each the information if it is wanted or
unwanted.
The split method uses the get_tokens to create the CLEANED_TEXT and the
DELIMITED_TEXT.
Let's take the example:
<p><b>bla</b><b>_</b><b>123</b></p>
<p><b>bLa</b><b>_</b><b>1234567</b></p>
And assuming our get_tokens mark all the HTML tags as unwanted, we will
get:
tokens list flags list
--------------------- ---------------
0: <p> 0
1: <b> 0
2: bla 1
3: </b> 0
4: <b> 0
5: _ 1
6: </b> 0
7: <b> 0
8: 123 1
9: </b> 0
10: </p> 0
11: <p> 0
12: <b> 0
13: bLa 1
14: </b> 0
15: <b> 0
16: _ 1
17: </b> 0
18: <b> 0
19: 123456 1
20: </b> 0
21: </p> 0
The CLEANED_TEXT will be:
bla_123bLa_1234567
And if the delimiter pattern is "<!-- __INDEX__ -->" the DELIMITED_TEXT
will be:
<!-- 000000000 --><!-- 000000001 -->bla
<!-- 000000003 --><!-- 000000004 -->_
<!-- 000000006 --><!-- 000000007 -->123
<!-- 000000009 --><!-- 000000010 -->
<!-- 000000011 --><!-- 000000012 -->bLa
<!-- 000000014 --><!-- 000000015 -->_
<!-- 000000017 --><!-- 000000018 -->1234567
<!-- 000000020 --><!-- 000000021 -->
Now the split method generates an array that contains a translation of
the positions between the cleaned text and the delimited text:
CLEANED_TO_DELIMITED_POSITIONS array
------------------------------------
0: 36
1: 37
2: 38
3: 75
4: 112
5: 113
6: 114
7: 187
8: 188
9: 189
10: 226
11: 263
12: 264
13: 265
14: 266
15: 267
16: 268
17: 269
The following rulers with the cleaned and delimited texts might help you
understand this the translation table:
The CLEANED_TEXT:
1
012345678901234567
bla_123bLa_1234567
The DELIMITED_TEXT:
0 1 2 3
012345678901234567890123456789012345678
<!-- 000000000 --><!-- 000000001 -->bla
4 5 6 7
9012345678901234567890123456789012345
<!-- 000000003 --><!-- 000000004 -->_
8 9 0 1
678901234567890123456789012345678901234
<!-- 000000006 --><!-- 000000007 -->123
2 3 4 5
567890123456789012345678901234567890
<!-- 000000009 --><!-- 000000010 -->
6 7 8
123456789012345678901234567890123456789
<!-- 000000011 --><!-- 000000012 -->bLa
9 0 1 2
0123456789012345678901234567890123456
<!-- 000000014 --><!-- 000000015 -->_
3 4 5 6
7890123456789012345678901234567890123456789
<!-- 000000017 --><!-- 000000018 -->1234567
7 8 9 0
012345678901234567890123456789012345
<!-- 000000020 --><!-- 000000021 -->
As an example, we call now the s method with something similar to:
s/(bla)_(\d+)/<font color=red>$2</font>_$1/gi
which will be the call:
$rei->s('(bla)_(\d+)','<font color=red>$2</font>_$1','gi');
the following will happen:
We will use the m// operator to have the match against the cleaned text:
m/(bla)_(\d+)/i
This will match first with 'bla_123' in the cleaned text. Now we keep
the matching variables $& and $1..$9. Then we create the replacer string
by substituting those variables in the string:
'<font color=red>$2</font>_$1'
We will also keep the exact position where the match happened in the
cleaned text, and the length of the match.
Using the positions of the start and end of the match, we define a
region in the clean text where the match happened, and where the
replacer should be placed.
In our example this region is 0 to 6.
We can now use the translation array to translate this region to
positions in the delimited text.
We will get the region 36 to 114 in the delimited text.
Now we can get deal with those two regions:
In the clean text it is simple to place the replacer instead of anything
that was in that region.
In the delimited text, we will first put all the delimiters in that
region together. Then we add the replacer before them, and we place all
of this in the region.
Now the only thing we have to do is to fix the translation table - the
translation table will not be correct from the start of the matched
region, and if the replacer is different in size from the match, also
after the matched region.
This is why we use the TRANSLATION_POSITION_FACTOR data member. It keeps
the built up difference between the match regions and the replacers
while we parse along the text.
The fix of the translation table is boring indexing manipulations. We
first fix the region of the replacer to represent the new replacer, and
then if there is a difference between the lengths of the match and the
replacer, we fix all the indexes after the match.
After we finish to manipulate the text, we build back our text by
replacing the delimiters in the delimited text by the tokens that those
delimiters represent. This is done by the merge method.
And voila! We get back our text manipulated.
CONSTRUCTOR
new (TEXT, DELIMITER_PATTERN)
Constructs an object of the class. TEXT is the text that we want to
parse. DELIMITER_PATTERN is a string that will be used to create
delimiters while processing the text. It should contain the string
'__INDEX__' that will be replaced by an index, for example:
'000000073'.
That delimiter should be chosen to fit the text that should be
parsed, and to the get_tokens results. For example for HTML text we
can choose '<!-- __INDEX__ -->' or even <__INDEX__>. This might be a
good delimiter if our get_tokens takes all the HTML tags as unwanted
tokens.
So our choice for a delimiter should be anything that can be used as
a delimiter for the "cleaned" text (after the unwanted parts were
taken away from the text).
METHODS
get_tokens ( )
This is an abstract method. It should be implemented in a daughter
class of this class. Moreover, you will never call this method
directly in your code. The split method will call the get_tokens
method that you implement.
The method should use the text method to get the text it takes as
input. It should return a list of two array references. The first
reference refers to a list of all the tokens, and the second
reference refers to a list of flags (perl TRUE or FALSE, so one or
zero for example). If the flag is FALSE, it means that the token in
the other list in the same index is unwanted.
As one example is better then many words, here is an implementation
of the get_tokens method that takes all the HTML tags as unwanted
parts:
sub get_tokens {
my $self = shift;
my $tokens = [];
my $flags = [];
my $index = 0;
# we should create tokens from the TEXT.
my $text = $self->text();
while (defined($text) &&
# the regular expression will try to match:
# - HTML remarks - all the remark will be matched.
# - HTML other tags
$text =~ /(<\!\-\-[\s\S]+?\-\->)|(<\/?[^\>]*?>)/i) {
if ($`) { # if there is text before, take it as clean
$tokens->[$index] = $`;
# the text before the match is clean.
$flags->[$index] = 1;
$index++; # increment the index
}
$tokens->[$index] = $&;
$flags->[$index] = 0; # the match itself is unwanted.
$index++; # increment the index again
$text = $'; # update the original text to after the match.
}
# if we are done or we had no match at all, check if there is
# still something in the $text. this will be also a clean text.
if (defined($text) && $text) {
$tokens->[$index] = $text;
$flags->[$index] = 1;
}
# return the two list
return ($tokens, $flags);
} # of get_tokens
Classes that implement the get_tokens come with this module. Check
first if one of them does not implement the get_tokens you need.
And if you feel you wrote a get_tokens that might be useful for the
rest of us, please let me know about it.
split ( )
This method should be called before the s or replace methods are
called. It will use the get_tokens method to split the text to
unwanted tokens and the "cleaned" text. After this method is called
the CLEANED_TEXT and the DELIMITED_TEXT data members are available.
s (PATTERN, REPLACEMENT, SWITCHES)
This method implements the perl s// operator while ignoring the
unwanted tokens. See the INTRODUCTION section above, and the perlop
manpage for more details.
You can call this method several times between a call to split and a
call to merge.
Important Note: The 'e' and the double 'e' switches are not yet
implemented. It is very difficult to implement and maybe impossible
without a very sophisticated hack as the method s suppose to see the
values of lexical variables in the code that calls that method. I do
not know how to do that. If someone has ideas - please contact me or
send the patch. Other problem is the way to correctly eval the
REPLACEMENT. It is not totally clear to me how to do that correctly.
Again - if someone can help - please! Meanwhile, though, you can use
the replace method below.
replace ( BUFFER_REF, LAST_POSITION_REF, START_MATCH_POSITION,
END_MATCH_POSITION, REPLACER )
The replace method is used by the s method, and usually should not
be used directly. However, it might be that the advanced programmer
will want to have special manipulation that is done better using the
replace. It also gives us a way to by-pass my failure to implement
the 'e' and double 'e' switches in the s method.
The replace builds a buffer every time it is called. This buffer is
the manipulated cleaned text till the place of the last match and
replace. It does not work directly on the CLEANED_TEXT data member
in order not to change the cleaned text between the matches (so to
gain in performance).
Before we call the replace, we suppose to zero the
TANSLATION_POSITION_FACTOR, so previous replaces along the text will
not affect the current replaces.
Then we should prepare an empty buffer, and a variable that will
hold the position after the last match. This variable should be zero
as well.
Now we should send to the replace method a reference to the buffer,
a reference to the last position variable, the positions of the
start and end of a match in the cleaned text, and a replacer.
The replace method will place the replacer instead of the match, and
will build the buffer till the end of the replacer. It will also set
the last position variable to the correct value.
Again, example might make it a lot simpler:
my $name = "Rani";
...
$rei->translation_position_factor(0);
my $cleaned_text = $rei->cleaned_text();
my $after_the_matach;
my $buffer = "";
my $last_position = 0;
# for each word
while ($cleaned_text =~ /$pattern/g) {
my $match = $&;
my $end_match_position = pos($cleaned_text) - 1;
my $match_length = length($match);
my $start_match_position =
$end_match_position - $match_length + 1;
# as an example we call a function
my $replacer = func($name, $2, $1);
$rei->replace(\$buffer,
\$last_position,
$start_match_position,
$end_match_position,
$replacer);
}
$buffer .= substr($rei->cleaned_text(), $last_position);
$rei->cleaned_text($buffer);
This will actually do the same as calling the s method like this:
s/$pattern/&func($name,$2,$1)/ge;
Of course the replace method can be more useful in other cases. For
example, if we change our regular expression in the above while
block. Or if , before the while block, we copy part of the
CLEANED_TEXT to the buffer and set the last position variable
accordingly in order to start to match from the middle of the
CLEANED_TEXT.
merge ( )
This method will build back our text from the manipulated
CLEANED_TEXT and the unwanted tokens. It saves the resulted text in
the TEXT data member and also returns it.
ACCESS METHODS
text ( TEXT )
Represents the text we input in order to manipulate, and the
resulted text we get after we had the manipulations and merged.
delimited_text ( )
Represents the "cleaned" text after we called the split method, with
delimiters that represent the unwanted tokens.
cleaned_text ( )
Represents the "cleaned" text after we called the split method and
took out the unwanted parts.
delimiter_pattern ( DELIMITER_PATTERN )
Represents the DELIMITER_PATTERN data member. See the CONSTRUCTOR
for more details.
BUGS AND OTHER PROBLEMS
Who knows?!? You should tell me. Please!
I guess there are bugs because this module is new - a baby module - that
was created in the holidays of the end of 2001. And also because the
algorithm that is implemented in it is not simple for me.
Besides, I am quite certain it does not perform as you expect. So, part
of this problem is in your expectations ;-) This module come to kill a
huge problem, that if you try to solve it other way, it will probably
perform less good (and if not - tell me how you do it!). However, many
parts in it can be for sure implemented differently to give better
performances. Please - let me know what you think, send me patches or
ideas.
AUTHOR
Rani Pinchuk, <[EMAIL PROTECTED]>
COPYRIGHT
Copyright (c) 2002 Rani Pinchuk. All rights reserved. This package is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself.
SEE ALSO
the perl manpage, the perlop manpage, the perlre manpage.
------------------------------------------------------------
Rani Pinchuk http://www.wamnet.be
Phone: +32-15-28-18-20 Fax: +32-15-28-18-21
------------------------------------------------------------