Hi folks,

I know the RFC period is over, but still... Please read this through
and tell me whether it's a good idea or not. Actually, it's not mine, I
just wrote it down. But see for yourself...

Roland

--snip--

=head1 TITLE

Perl should support non-linear text.

=head1 VERSION

  Maintainer: Roland Giersig <[EMAIL PROTECTED]>
  Date: 19 Oct 2000
  Version: 1
  Mailing List: perl6-internals ?
  Number: ?

=head1 ABSTRACT

Right now, Perl performs its magic only upon linear strings of ASCII
and Unicode text. As Ilya Zakharevich stated in his recent interview
(http://www.perl.com/pub/2000/09/ilya.html), the new feature that would
help today's Perl programmers most is the ability to perform Perl's
mighty string operations on marked-up (non-linear) text consisting of
linear chunks of text strings that carry different attributes.

This could very well be THE new feature that justifies the complete
Perl6 rewrite!

=head1 DESCRIPTION

When Perl first came into being, the world was full of ASCII text, so
Perl became strong in manipulating ASCII text. But this has changed.
Nowadays even the simplest documents (e.g. mail messages) tend to be in
some marked-up format or other, and programmers worldwide are
struggling to find a way to manipulate those.

To aid these efforts I therefore propose to enhance the string format
used in Perl: non-linear text, consisting of chunks of linear text
(Unicode, of course) that have attributes attached.

Take this HTML for example:

    <html>Text with a <b>larger <font size=+1>l</font>etter</b> in it.
    </html>

and try to find a way to substitute the word `letter' with `word', with
the outside formatting (<b>) preserved. Next to impossible? I found no
easy (but general) way, not even with HTML::Parser et al.

If perl could handle non-linear strings, this could be done in a simple
s/letter/word/. Ain't that time-saving!!

For example

    s/(l)etter/${"w":${1:}}ord/

could do the magic (see below for a syntax proposal).
Or, to make formulas more readable:

    s/\b(\w+)\^(\d+)/$1${2:raised=>1}/

=head1 IMPLEMENTATION

Ugh, you got me there. I know very little about Perl internals, so I
can't even pretend to know how this should be done. Maybe Ilya has
already started on a prototype? ;-)

Anyway, the current document parsers (HTML::Parser et al.) already
build non-linear text data structures. Basically these structures are
lists of strings interspersed with refs to embedded structures (and
attributes) of the same type. Whether this structure is flexible enough
for most purposes still has to be discussed.

Attributes could simply be stored as hashes, so the chunks would have
hash refs attached. This sounds rather easy to accomplish.

So what is a string today would internally become an array of strings
with attached hashes. This doesn't sound too strange, but again, this
is for others to decide.

=head1 SYNTAX

We need a way to attach attributes to chunks of text that is both
backward compatible and compact.

Hmm, as variable access by name (symbolic references) is deprecated
anyhow, we could use ${var} to mean $var and ${"text"} to mean "text".
Now we can use `:' to separate the varname from the attributes:

    # access attribute `size' in variable `foo'
    ${foo:size}

    # set attribute `size'
    ${foo:size} = $fontsize;

    # copy attribute `a1' of text in var `bar' to attribute `a2' in var `foo'
    ${foo:a2} = ${bar:a1};

    # copy all attributes, but leave text as-is
    ${foo:} = ${bar:};

Now for literal strings with embedded attributes:

    $foo = "just another string";
    ${foo:size} = 12;

or

    $foo = ${"just another string":size=>10};

This can nest:

    $bar = ${"${"L":size=>12}arge":size=>10};

    ${bar:size}    # gives 10

How to loop over all chunks? Hmm, it seems split could handle it if the
regex engine can match chunk borders, so another special token is
needed. How about `\C' for chunk? Or is this already taken?
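Until such a token exists, the chunk walk can be mimicked in plain Perl 5. What follows is only a minimal sketch, assuming the nested "strings interspersed with refs" layout from the IMPLEMENTATION section and assuming that inner attributes override the ones inherited from the enclosing chunk. The helper names `chunk' and `flatten' and the hash layout are made up for illustration; they are not part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: build a node from an attribute hash and a list of
# parts, where each part is a plain string or another node.
sub chunk {
    my ($attrs, @parts) = @_;
    return { attrs => $attrs, parts => [@parts] };
}

# Hypothetical helper: walk a node depth-first and return a flat list of
# [text, \%attrs] pairs. Attributes set on an inner node override the
# ones inherited from the enclosing node.
sub flatten {
    my ($node, %inherited) = @_;
    my %attrs = (%inherited, %{ $node->{attrs} });
    my @flat;
    for my $part (@{ $node->{parts} }) {
        if (ref $part) {
            push @flat, flatten($part, %attrs);
        }
        else {
            push @flat, [ $part, { %attrs } ];
        }
    }
    return @flat;
}

# Stands in for ${"${"L":size=>12}arge ${"S":size=>8}mall":size=>10}
my $astring = chunk({ size => 10 },
    chunk({ size => 12 }, 'L'), 'arge ',
    chunk({ size => 8 },  'S'), 'mall');

my @flat = flatten($astring);
for my $c (@flat) {
    my ($text, $attrs) = @$c;
    print "$text: $attrs->{size}\n";
}
```

This yields four chunks: `L' with size 12, `arge ' (trailing space kept) with size 10, `S' with size 8 and `mall' with size 10. With the proposed syntax, the same loop would read: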
    $astring = ${"${"L":size=>12}arge ${"S":size=>8}mall":size=>10};
    foreach my $chunk (split /\C/, $astring) {
        print "$chunk: ${chunk:size}\n";
    }

would print

    L: 12
    arge: 10
    S: 8
    mall: 10

What if an attributed string is split in half? Well, in that case the
attributes must be duplicated.

    $foo = ${"no attrib here ${"ATTRIBUTES":size=>12} nothing here":size=>8};
    $firsthalf = substr($foo, 0, length($foo)/2);

should set $firsthalf to

    ${"no attrib here ${"ATTR":size=>12}":size=>8}

and

    substr($foo, length($foo)/2, 14, "really ${"nothing":attrib=>1}");

should set $foo to

    ${"no attrib here ${"ATTR":size=>12}${"really ${"nothing":attrib=>1}":} here":size=>8}

(the 14 replaced characters are `IBUTES nothing', so the trailing
` here' survives).

Hmm, what about string comparisons? `eq' and friends should simply
continue to work as usual on the string contents. Do we need some kind
of meta-eq to be able to compare the attribs also?

There are a lot of other issues to work out, but I'd like to first get
some approval from the gurus, so I'll stop here.

=head1 REFERENCES

http://www.perl.com/pub/2000/09/ilya.html

--snip--
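
For what it's worth, the substr rule from the SYNTAX section (attributes get duplicated when a chunk is cut) can be mocked up in plain Perl 5. This is a rough sketch only, assuming the string has already been flattened into a list of [text, attributes] pairs, so the nesting itself is not modelled. The name `chunk_substr' and the data layout are invented for illustration and are not part of the proposal.

```perl
use strict;
use warnings;

# Hypothetical helper: return the chunks covering $len characters
# starting at offset $off. Each chunk is [text, \%attrs]; a chunk cut by
# the range boundary is shortened, and its attribute hash is copied
# rather than shared.
sub chunk_substr {
    my ($chunks, $off, $len) = @_;
    my @out;
    my $pos = 0;    # offset of the current chunk within the whole text
    for my $c (@$chunks) {
        my ($text, $attrs) = @$c;
        my $start = $off > $pos ? $off - $pos : 0;
        my $end   = $off + $len - $pos;
        $end = length($text) if $end > length($text);
        push @out, [ substr($text, $start, $end - $start), { %$attrs } ]
            if $start < $end;
        $pos += length($text);
    }
    return @out;
}

# Flattened stand-in for
# ${"no attrib here ${"ATTRIBUTES":size=>12} nothing here":size=>8}
my @foo = (
    [ 'no attrib here ', { size => 8 } ],
    [ 'ATTRIBUTES',      { size => 12 } ],
    [ ' nothing here',   { size => 8 } ],
);

my $total = 0;
$total += length($_->[0]) for @foo;

my @firsthalf = chunk_substr(\@foo, 0, int($total / 2));
for my $c (@firsthalf) {
    print "'$c->[0]' has size $c->[1]{size}\n";
}
```

The result is two chunks, `no attrib here ' with size 8 and `ATTR' with size 12, which matches the $firsthalf value given in the RFC text.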