As the current maintainer of Text::Balanced, I'm proposing for public review the following changes (or clarifications) to the external interface of Text::Balanced to be effective in version 1.96:

(1) All extractors will return (undef, $$textref) on failure in list context. A subtle point: extractors that normally return more than three values (e.g. extract_tagged and extract_quotelike) will also return (undef, $$textref) rather than, for example, (undef, $$textref, undef, undef, undef). Previously, the POD was ambiguous here, but the actual code returned ('', $$textref, '') in all cases. In scalar context, however, extractors have returned, and will continue to return, a single undef on failure.

(2) extract_multiple will recognize only the empty list and (undef, ...) return values from extractor functions as match failures. This is what the POD currently states, but the actual code previously also treated ('', ...) as a match failure. Under the new proposal, no built-in extractor returns ('', ...) on either success or failure, so this will usually make no difference. Custom extractors will be allowed to return ('', ...) on success in the trivial case, even though I don't see much practical application for that.
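
As a rough illustration of (1), the sketch below forces a failure and inspects the result in both contexts. It is written so that it does not depend on which failure convention the installed Text::Balanced uses; the input string and variable names are made up for the example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Illustrative input: contains no parentheses, so extraction must fail.
my $text = "no brackets here";

# List context: the first element signals failure ('' or undef, depending
# on the version), and the second element is the untouched input.
my ($extracted, $remainder) = extract_bracketed($text, '()');
my $list_failed = (!defined($extracted) || $extracted eq '') ? 1 : 0;

# Scalar context: a single undef on failure.
my $scalar_result = extract_bracketed($text, '()');

print "list context failed:    $list_failed\n";
print "remainder is the input: ", ($remainder eq $text ? 1 : 0), "\n";
print "scalar result defined:  ", (defined $scalar_result ? 1 : 0), "\n";
```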


These changes are meant to clear up ambiguities. Other approaches could be taken. For example, the source code could be left untouched, and the POD could be changed to use ('', $$textref, ''). Damian's suggestion was to return (undef, $$textref, undef) in the general case. Use of undefs here is more consistent with scalar context, which returns a single undef on failure. Not all extractors return three values on success, though. To keep things consistent and efficient, I propose (undef, $$textref) as the failure return value, which is behaviorally identical to (undef, $$textref, undef) in all but obscure cases. That is, the following typical code behaves identically in both scenarios:


my($a, $b, $c) = extract_variable(...);

Code that wants to be compatible with the new and old implementations of Text::Balanced could do

if (defined($a) && $a ne '') ...

or

if($a) ...

(presuming $a cannot feasibly be '0') or use extract_multiple which will correctly handle the return values.
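
A defined-and-non-empty check can be packaged as a small helper that accepts either failure convention without tripping an "uninitialized value" warning. The helper name extraction_ok and the sample return lists are invented for this sketch; nothing here is part of Text::Balanced.

```perl
use strict;
use warnings;

# Hypothetical helper: true only when the first element of an extractor's
# list-context result is a defined, non-empty string.
sub extraction_ok {
    my ($extracted) = @_;
    return (defined($extracted) && $extracted ne '') ? 1 : 0;
}

# Hand-written stand-ins for what an extractor might return.
my @old_failure = ('', 'some text', '');   # what the 1.95 code returns
my @new_failure = (undef, 'some text');    # proposed for 1.96
my @success     = ('0', ' rest', '');      # a legitimate match of '0'

printf "old failure: %d\n", extraction_ok(@old_failure);
printf "new failure: %d\n", extraction_ok(@new_failure);
printf "success:     %d\n", extraction_ok(@success);
```

Unlike a plain if($a), this still treats an extracted '0' as a match.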

Most existing code "out there" seems to use the "if($a)" approach or extract_multiple, so it is safe. Some code (like HTTP-WebTest) does do this:

        my($extracted) = extract_delimited($_[0]);
        die "Can't find string terminator \"$delim\"\n"
            if $extracted eq '';

which would break under the new proposal. Of course, one could argue it is broken already since it relies on an ambiguous specification.

Other code such as Shell-POSIX-Select is actually forward thinking:

  ( $loop_var, @rest ) = extract_variable( $_ );
  if (defined $loop_var and $loop_var ne "" ) ...

Still other code may generate a warning on stringifying undef or comparing it to ''.

I would have preferred the return value on failure in list context to be the empty list (like the private _match_* functions) since that would permit code like

  elsif ($grammar =~ m/(?=$ACTION)/gco
                        and do { ($code) = extract_codeblock($grammar); $code })

in Parse::RecDescent to be rewritten as

  elsif ($grammar =~ m/(?=$ACTION)/gco
                        and ($code) = extract_codeblock($grammar))


but some code actually does rely on the $$textref in (undef, $$textref, undef) being there. Perhaps a 'use' option could be given to enable this behavior.
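
The rewrite above hinges on a Perl detail: a list assignment evaluated in boolean (scalar) context yields the number of elements on its right-hand side, not the value assigned. The sketch below uses two hypothetical stand-in functions (not part of Text::Balanced) that only mimic the two failure conventions.

```perl
use strict;
use warnings;

# Hypothetical extractors that only mimic the two failure conventions.
sub fail_with_empty_list { return () }
sub fail_with_undef_pair { return (undef, 'remaining text') }

my $code;

# The list assignment's truth value is the count of right-hand elements.
my $empty_is_true = (($code) = fail_with_empty_list()) ? 1 : 0;
my $pair_is_true  = (($code) = fail_with_undef_pair()) ? 1 : 0;

print "empty list: $empty_is_true, undef pair: $pair_is_true\n";
```

The empty list makes the condition false, while a two-element failure list makes it true even though $code ends up undef. That is why the do { ...; $code } wrapper is needed as long as failure returns a non-empty list.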



For reference, below are the relevant POD snippets in Text::Balanced 1.95 relating to the failure return value in list context:


BEGIN>>>>>>>>>>

General behaviour in list contexts

In a list context, all the subroutines return a list, the first three elements of which are always:
[0]
The extracted string, including the specified delimiters. If the extraction fails an empty string is returned.


[1]
The remainder of the input string (i.e. the characters after the extracted string). On failure, the entire string is returned.


[2]
The skipped prefix (i.e. the characters before the extracted string). On failure, the empty string is returned.


---

However, the call in:

@result = extract_bracketed( $text, '{([<' );

would fail, returning:

( undef , "{ an '[irregularly :-(] {} parenthesized >:-)' string }" );

---

extract_tagged...

[1] ... [5]
On failure, all of these values (except the remaining text) are undef.

---

extract_variable...

[1] ... [3]
On failure, all of these values (except the remaining text) are undef.

---

extract_tagged...

[1] ... [3]
On failure, all of these values (except the remaining text) are undef.

---

extract_quotelike...

[1] ... [10]
On failure, all of these values (except the remaining text) are undef.

---

extract_multiple...

If an extractor returns a defined value, that value is immediately treated as the next extracted field and pushed onto the list of fields. If
...
If the extractor fails to match (in the case of a regex extractor), or returns an empty list or an undefined value (in the case of a subroutine extractor), it is assumed to have failed to extract.


---

DIAGNOSTICS

In a list context, all the functions return (undef,$original_text) on failure.

<<<<<<<<<<<END



The only other major change planned for 1.96 (besides bug fixes) is the addition of a "$non_code" regular expression that will recognize POD, __DATA__ sections, and comments. Something like this, although I might want to create an accessor function for it:

# note: compare this to $Filter::Simple::pod_or_DATA.
# [Added in v1.96]
our $_find_cut;
$_find_cut = qr/ =cut [^\n]* \n? | [^\n]* \n (??{ $_find_cut }) | \z /xs;
our $non_code = qr/
    ^ (?: =[a-zA-Z] [^\n]* \n? $_find_cut
          | __(?:DATA|END)__ \n .* )
    | \# [^\n]* \n? (?: \s* \# [^\n]* \n?)* # combine adjacent
/xms;
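
A quick way to try the pattern is to copy it into a standalone script and run it against small samples. The sketch below repeats the regex from above verbatim; the helper first_non_code and the sample strings are made up for illustration, and the results only reflect this draft of the pattern.

```perl
use strict;
use warnings;

our $_find_cut;
$_find_cut = qr/ =cut [^\n]* \n? | [^\n]* \n (??{ $_find_cut }) | \z /xs;
our $non_code = qr/
    ^ (?: =[a-zA-Z] [^\n]* \n? $_find_cut
          | __(?:DATA|END)__ \n .* )
    | \# [^\n]* \n? (?: \s* \# [^\n]* \n?)* # combine adjacent
/xms;

# Hypothetical helper: return the first non-code chunk in $text, or undef.
sub first_non_code {
    my ($text) = @_;
    return $text =~ $non_code ? substr($text, $-[0], $+[0] - $-[0]) : undef;
}

# Two adjacent comment lines followed by code.
my $comments = first_non_code("# first\n  # second\nmy \$x = 1;\n");

# A POD block terminated by =cut, followed by code.
my $pod = first_non_code("=head1 NAME\n\nSome docs\n\n=cut\nprint 1;\n");

print "comment chunk length: ", length($comments), "\n";
print "POD chunk length:     ", length($pod), "\n";
```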


With this addition, Text::Balanced will provide most of the primitives needed to fairly reliably (e.g. 99%) perform certain types of source code filtering, such as the identification of quote-likes:


   my @frags = extract_multiple($_, [
           qr/\s+/,
           # not used: {VAR => \&extract_variable},
           # note: "my $x = ... = ... =" not a regex.
           [EMAIL PROTECTED],
           # note: $', $", $`, $/ vars not quotelikes
           #       */ not quotelike (see English.pm)
           #       $# not comment
           qr{\$\s*[\'\"\`\/\#]|\*\s*/}s,
           qr/-s\b/,       # note: "-s $ ... $ ... $" not a regex
           qr/sub\s+m\b/,  # note: sub m {...} not a regex
           {NONCODE => $Text::Balanced::non_code},
           {QUOT => \&extract_quotelike},
           qr/[a-z_]\w+/i,
   ]);


-davidm




