RFC: Perl should support non-linear text

Roland Giersig Fri, 03 Nov 2000 06:06:34 -0800

Hi folks,

I know, the RFC period is over, but still...
Please, read this through and tell me if it's a good idea or not.
Actually, it's not mine, I just wrote it down.  But see for yourself...

Roland

--snip--

=head1 TITLE

Perl should support non-linear text.

=head1 VERSION

  Maintainer: Roland Giersig <[EMAIL PROTECTED]>
  Date: 19 Oct  2000
  Version: 1
  Mailing List: perl6-internals ?
  Number: ?

=head1 ABSTRACT

Right now, Perl performs its magic only upon linear strings of ASCII
and Unicode text. As Ilya Zakharevich has stated in his recent
interview (http://www.perl.com/pub/2000/09/ilya.html), the new feature
that would help today's Perl programmers most is the ability to
perform Perl's mighty string operations on marked-up (non-linear)
text consisting of linear chunks of text strings that carry different
attributes.

This could very well be THE new feature that justifies the complete
Perl6 rewrite!

=head1 DESCRIPTION

When Perl first came into being, the world was full of ASCII text,
so Perl became strong in manipulating ASCII text.  But this has changed.
Nowadays even the simplest documents (e.g. mail messages) tend to
be in some marked-up format or other, and programmers worldwide
are struggling to find a way to manipulate those.

To aid these efforts I therefore propose to enhance the string format
used in Perl: non-linear text, consisting of chunks of linear text
(Unicode, of course) that have attributes attached.

Take this HTML for example: 

  <html>Text with a <b>larger <font size=+1>l</font>etter</b> in it. </html>

and try to find a way to substitute the word `letter' with `word',
with outside formatting (<b>) preserved.

Next to impossible?  I found no easy (but general) way, not even with
HTML::Parser et al.
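
To make the difficulty concrete, here is a minimal sketch in today's
Perl 5.  The variable names are illustrative only.  The plain
substitution finds nothing because the markup splits the word, and the
tag-stripping variant matches but throws the formatting away.

```perl
use strict;
use warnings;

my $html = '<html>Text with a <b>larger <font size=+1>l</font>etter</b> in it. </html>';

# Naive attempt: the markup sits between `l' and `etter',
# so the pattern finds nothing and the string stays unchanged.
(my $naive = $html) =~ s/letter/word/;
print $naive eq $html ? "no match\n" : "replaced\n";

# Matching on the tag-stripped text works, but all markup is lost.
(my $plain = $html) =~ s/<[^>]*>//g;
$plain =~ s/letter/word/;
print "$plain\n";
```

Neither result is what the task asks for: the first leaves the text
untouched, the second loses the <b> that was supposed to be preserved.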

If perl could handle non-linear strings, this could be done in a
simple s/letter/word/.  Ain't that time-saving!!  For example

  s/(l)etter/${"w":${1:}}ord/

could do the magic (see below for a syntax proposal).

Or, to make formulas more readable:

  s/\b(\w+)\^(\d+)/$1${2:raised=>1}/


=head1 IMPLEMENTATION

Ugh, you got me there.  I know very little about Perl internals, so I
can't even pretend to know.  Maybe Ilya has already started on a
prototype? ;-)

Anyway, the current document parsers (HTML::Parser et al.) already
build non-linear text data structures.  Basically these structs are
lists of strings interspersed with refs to embedded structs (and
attributes) of the same type.  It has to be discussed if this
structure is flexible enough for most purposes.

Attributes could be simply stored as hashes, so the chunks would have
hash refs attached.  This sounds rather easy to accomplish.

So, what today is a string would become an array of strings with
attached hashes internally.  This doesn't sound too strange, but
again, this is for others to decide.
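
As a rough illustration in plain Perl 5, the structure described above
could be modelled like this.  The layout (`attr', `parts') and the
helper `flatten' are assumptions made up for this sketch; they are not
part of any existing module or of the proposal itself.

```perl
use strict;
use warnings;

# A node is a hash ref: attributes plus a list of parts, where each
# part is either a plain string or another node (hypothetical layout).
my $doc = {
    attr  => {},
    parts => [
        'Text with a ',
        { attr => { b => 1 }, parts => [
            'larger ',
            { attr => { size => '+1' }, parts => ['l'] },
            'etter',
        ] },
        ' in it. ',
    ],
};

# Flatten the tree into linear chunks [text, \%attrs]; inner
# attributes override the ones inherited from enclosing nodes.
sub flatten {
    my ($node, %inherited) = @_;
    my %attr = (%inherited, %{ $node->{attr} });
    my @out;
    for my $part (@{ $node->{parts} }) {
        if (ref $part) { push @out, flatten($part, %attr) }
        else           { push @out, [ $part, {%attr} ] }
    }
    return @out;
}

my @chunks = flatten($doc);
my $text   = join '', map { $_->[0] } @chunks;
print "$text\n";    # the linear text that `eq' or a regex would see
```

The flattened form is the "array of strings with attached hashes";
the nested form is closer to what a parser hands back.  Which of the
two the core should use is exactly the open question.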

=head1 SYNTAX

We need a way to specify attributes to chunks of text in a backward
compatible way.  But how can we specify it in a compact way?  Hmm, as
variable access by name is deprecated anyhow, we could use ${var} to
mean $var and ${"text"} to mean "text".

Now we can use `:' to separate the varname from the attributes:

  ${foo:size} # accesses attribute `size' in variable `foo'

  # set attribute `size'
  ${foo:size} = $fontsize;

  # copy attribute `a1' of text in var `bar' to attribute `a2' in var `foo'
  ${foo:a2} = ${bar:a1};  

  # copy all attributes, but leave text as-is
  ${foo:} = ${bar:};

Now for literal strings with embedded attributes:

  $foo = "just another string";
  ${foo:size} = 12;

or

  $foo =  ${"just another string":size=>10};

This can nest:

  $bar = ${"${"L":size=>12}arge":size=>10};

  ${bar:size} gives 10

How to loop over all chunks? Hmm, seems like split could handle it OK
if the regex engine can match chunk borders. Seems like another
special token is needed.  How about `\C' for chunk?  Or is this
already taken?

  $astring = ${"${"L":size=>12}arge ${"S":size=>8}mall":size=>10};
  foreach my $chunk (split /\C/, $astring) {
    print "$chunk: ${chunk:size}\n";
  }

would print

  L: 12
  arge: 10
  S: 8
  mall: 10
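
For comparison, the same loop can be emulated in today's Perl 5 with a
hand-built list of chunks.  This is only a sketch: the [text, \%attrs]
layout is an assumption, and the outer size=>10 has been copied by
hand onto the chunks that carry no size of their own.

```perl
use strict;
use warnings;

# Hand-built equivalent of
#   ${"${"L":size=>12}arge ${"S":size=>8}mall":size=>10}
# as a flat list of [text, \%attrs] chunks (hypothetical layout).
my @astring = (
    [ 'L',     { size => 12 } ],
    [ 'arge ', { size => 10 } ],
    [ 'S',     { size => 8  } ],
    [ 'mall',  { size => 10 } ],
);

my @lines;
for my $chunk (@astring) {
    my ($text, $attr) = @$chunk;
    push @lines, "$text: $attr->{size}";
}
print "$_\n" for @lines;
```

Note that in this emulation the second chunk keeps the space that
separates the two words, so its text is `arge ' rather than `arge'.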


What if an attributed string is split in half?  Well, in that case,
the attributes must be duplicated.

  $foo = ${"no attrib here ${"ATTRIBUTES":size=>12} nothing here":size=>8};
  $firsthalf = substr($foo, 0, length($foo)/2);

should set $firsthalf to

  ${"no attrib here ${"ATTR":size=>12}":size=>8}

and

  substr($foo, length($foo)/2, 14, "really  ${"nothing":attrib=>1}");

should set $foo to

  ${"no attrib here ${"ATTR":size=>12}${"really  ${"nothing":attrib=>1}":} here":size=>8}
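
The "duplicate the attributes when a chunk is cut" rule can be tried
out in today's Perl 5.  The following is a sketch under assumptions:
the flat [text, \%attrs] layout and the helper name `chunk_substr' are
invented for illustration, and only the three-argument (read-only)
form of substr is emulated.

```perl
use strict;
use warnings;

# Flat [text, \%attrs] chunks standing in for
#   ${"no attrib here ${"ATTRIBUTES":size=>12} nothing here":size=>8}
my @foo = (
    [ 'no attrib here ', { size => 8  } ],
    [ 'ATTRIBUTES',      { size => 12 } ],
    [ ' nothing here',   { size => 8  } ],
);

# Hypothetical helper: return the chunks covering $len characters
# starting at $off, copying the attributes of any chunk that is cut.
sub chunk_substr {
    my ($chunks, $off, $len) = @_;
    my @out;
    my $pos = 0;                          # offset of the current chunk
    for my $chunk (@$chunks) {
        my ($text, $attr) = @$chunk;
        my $start = $off - $pos;          # range start relative to chunk
        $start = 0 if $start < 0;
        my $end = $off + $len - $pos;     # range end relative to chunk
        $end = length $text if $end > length $text;
        push @out, [ substr($text, $start, $end - $start), {%$attr} ]
            if $end > $start;
        $pos += length $text;
    }
    return @out;
}

my $total = 0;
$total += length $_->[0] for @foo;
my @firsthalf = chunk_substr(\@foo, 0, int($total / 2));

my $shown = '';
$shown .= "[$_->[0]|$_->[1]{size}]" for @firsthalf;
print "$shown\n";
```

The text is 38 characters long, so the first half ends after `ATTR',
and that fragment carries its own copy of size=>12.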


Hmm, what about string comparisons?  `eq' and friends should simply
continue to work as usual on the string contents.  Do we need some
kind of meta-eq to be able to compare the attribs also?
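
One possible reading of such a meta-eq, sketched in today's Perl 5:
two attributed strings are equal if every character carries the same
attributes, no matter where the chunk borders fall.  The helper names
`text_of', `signature' and `meta_eq' and the chunk layout are
assumptions for this sketch, and ignoring chunk borders is just one
design choice among several.

```perl
use strict;
use warnings;

# Attributed strings as flat [text, \%attrs] chunks (hypothetical layout).
my @x = ( [ 'L',  { size => 12 } ], [ 'arge', { size => 10 } ] );
my @y = ( [ 'Large', { size => 10 } ] );
my @z = ( [ 'La', { size => 10 } ], [ 'rge',  { size => 10 } ] );

# What plain `eq' would compare: the text only.
sub text_of { return join '', map { $_->[0] } @_ }

# One entry per character, tagged with its sorted attribute list.
sub signature {
    my @sig;
    for my $chunk (@_) {
        my ($text, $attr) = @$chunk;
        my $key = join ',', map { $_ . '=' . $attr->{$_} } sort keys %$attr;
        push @sig, $_ . '{' . $key . '}' for split //, $text;
    }
    return join "\0", @sig;
}

# Hypothetical meta-eq: same text and same attributes per character.
sub meta_eq {
    my ($p, $q) = @_;
    return signature(@$p) eq signature(@$q);
}

print "text equal\n"      if text_of(@x) eq text_of(@y);
print "x and y differ\n"  unless meta_eq(\@x, \@y);
print "y and z are same\n" if meta_eq(\@y, \@z);
```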

There are a lot of other issues to work out, but I'd like to first get
some approval from the gurus, so I'll stop here.


=head1 REFERENCES

  http://www.perl.com/pub/2000/09/ilya.html

--snip--