Re: SvPV*

2000-11-28 Thread David Mitchell

Nick Ing-Simmons [EMAIL PROTECTED] wrote:
 Dave Storrs [EMAIL PROTECTED] writes:
 On Tue, 21 Nov 2000, Jarkko Hietaniemi wrote:
 
  Yet another bummer of the current SVs is that they poorly fit into
  'foreign memory' situations where the buffer is managed by something
  other than Perl.  "No, thank you, Perl, keep your greedy fingers off
  this chunk.  No, you may not play with it."
 
 
  Out of curiosity, when might such a situation arise?  When you
 are embedding C in Perl, perhaps?
 
 Or calling an external library which returns a pointer to data.
 Right now we _have_ to copy it as there is no way to tell perl 
 to (say) XFree() it rather than Safefree() it. Which is a pain when data
 is big.

As long as destroy() is one of the vtable methods, then it should be
fairly easy for someone to write an SV wrapper type that calls a specific
free() - either fixed per type, or per SV.
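A minimal sketch of that idea, assuming a vtable with a destroy() slot and a per-SV release routine. Every name here (foreign_sv, sv_vtable, foreign_sv_wrap, sv_free, counting_release) is hypothetical and is not part of the perl5 or any perl6 API; counting_release stands in for something like a wrapper around XFree().

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical types: not the real perl5 or perl6 structures. */
typedef struct foreign_sv foreign_sv;

typedef struct {
    void (*destroy)(foreign_sv *sv);   /* called when the scalar goes away */
} sv_vtable;

struct foreign_sv {
    const sv_vtable *vtable;
    char            *buf;              /* buffer owned by an external library */
    size_t           len;
    void           (*release)(void *); /* per-SV free routine for buf */
};

/* destroy() for this wrapper type: hand the buffer back to its owner. */
static void foreign_destroy(foreign_sv *sv)
{
    if (sv->release && sv->buf)
        sv->release(sv->buf);
    sv->buf = NULL;
    sv->len = 0;
}

static const sv_vtable foreign_vtable = { foreign_destroy };

/* Wrap an externally allocated buffer without copying it. */
foreign_sv foreign_sv_wrap(char *buf, size_t len, void (*release)(void *))
{
    foreign_sv sv = { &foreign_vtable, buf, len, release };
    assert(release != NULL);
    return sv;
}

/* What the core would call: it only ever goes through the vtable. */
void sv_free(foreign_sv *sv)
{
    sv->vtable->destroy(sv);
}

/* Demo release routine standing in for a library-specific free function. */
int release_calls = 0;

void counting_release(void *p)
{
    release_calls++;
    free(p);
}
```

Storing the release pointer in each SV gives the "per SV" variant; a "fixed per type" variant would hard-code the call inside that type's destroy() instead.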

Roll on perl6 :-)

Dave M.




Re: SvPV*

2000-11-24 Thread Bart Lateur

On Fri, 24 Nov 2000 08:54:43 +0100, Roland Giersig wrote:

Maybe the title should be :

"Perl should use XML as its basic data type instead of linear strings"

Horrible.

I kinda liked your original proposal. But you should NOT focus on XML.
That leaves out too many other possible data sources: RTF, for example,
or TeX. What is typical is that it is marked-up text in the form of a
tree, i.e. properly nested.

The internal structure might as well be easily representable as XML.

I do think that the term "non-linear text" is absolutely unclear.

-- 
Bart.



Re: SvPV*

2000-11-23 Thread David Mitchell

It could be argued that the way to implement "enhanced" strings, e.g.
strings with embedded attributes (HTML, RTF), is for someone
to write a specific SV class to deal with that kind of string.
As has been pointed out, a difficulty with this is that standard
regexes must be able to operate on that SV, leading to all sorts of
problems related to the extraction of the string representation,
and the definition of the semantics of matching and substituting on
that string.

One way round this is to leave the semantics to the implementor of the SV type.
This could be done by having vtable methods for *all* string ops
known to Perl; in particular m//, s// and tr//.

The way this could work is for the Perl core to provide a generic regex
library, which uses only the public interface to SVs to extract
and manipulate its contents. Standard string SVs would have the relevant
vtable entries point to these generic regex functions.
However, if someone wants to implement an HTML SV type, say, then
(if they are keen enough) they can write their own m//, s// methods
which are efficient (because they can access the internal representation),
and can have whatever semantics the author wishes.

However, since the internals of regexes are a dark art to me, I don't know
whether it is sensible to have a single regex compiler, but multiple
regex executors (if that's the right terminology).
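A rough sketch of the split described above, assuming a vtable with length, char_at and find slots, where find is only a toy stand-in for m//. All names (str_obj, str_vtable, generic_find, plain_vtable, markup_vtable) are hypothetical. The generic routine touches the string only through the public accessors, so it works on a toy markup type that hides anything between '<' and '>'.

```c
#include <stddef.h>
#include <string.h>

typedef struct str_obj str_obj;

typedef struct {
    size_t (*length)(const str_obj *);
    char   (*char_at)(const str_obj *, size_t);
    long   (*find)(const str_obj *, const char *needle); /* toy stand-in for m// */
} str_vtable;

struct str_obj {
    const str_vtable *vt;
    const char       *data;
};

/* Generic "engine": uses only the public length/char_at interface. */
long generic_find(const str_obj *s, const char *needle)
{
    size_t n = s->vt->length(s), m = strlen(needle);
    if (m > n)
        return -1;
    for (size_t i = 0; i + m <= n; i++) {
        size_t j = 0;
        while (j < m && s->vt->char_at(s, i + j) == needle[j])
            j++;
        if (j == m)
            return (long)i;
    }
    return -1;
}

/* Plain strings: direct access to the buffer. */
static size_t plain_length(const str_obj *s) { return strlen(s->data); }
static char plain_char_at(const str_obj *s, size_t i) { return s->data[i]; }

/* Toy markup strings: the visible text excludes anything inside <...>. */
static size_t markup_length(const str_obj *s)
{
    size_t n = 0;
    int in_tag = 0;
    for (const char *p = s->data; *p; p++) {
        if (*p == '<') in_tag = 1;
        else if (*p == '>') in_tag = 0;
        else if (!in_tag) n++;
    }
    return n;
}

static char markup_char_at(const str_obj *s, size_t i)
{
    int in_tag = 0;
    for (const char *p = s->data; *p; p++) {
        if (*p == '<') in_tag = 1;
        else if (*p == '>') in_tag = 0;
        else if (!in_tag && i-- == 0) return *p;
    }
    return '\0';
}

/* Both types point their find slot at the generic routine; a keen author
 * could install a faster, representation-aware function instead. */
const str_vtable plain_vtable  = { plain_length,  plain_char_at,  generic_find };
const str_vtable markup_vtable = { markup_length, markup_char_at, generic_find };
```

Here markup_char_at rescans from the start on every call, which is exactly the inefficiency a type-specific find method could avoid.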




Re: SvPV*

2000-11-23 Thread Roland Giersig

Nicholas Clark wrote:
 
 On Wed, Nov 22, 2000 at 01:24:50PM -0500, Chaim Frenkel wrote:
  I'd offer the possibility that there are two (or perhaps more)
  different problems here.  One is the current bunch of bytes (string,
  executable to be twiddled).  Another, which attributes on strings
  seem to be, is structured data.
 
  Squeezing attributes onto a buffer seems to be shoehorning a more
  general problem onto a specific implementation.
 
  Getting an efficient representation of a meaningful structure should
  be done as a new data type.
 
  (I'm thinking of representing COBOL records/data, or even XML documents)

That's (XML) what I was thinking also when writing the proposal.
Hmm, I should modify it to use the XML buzzword, this could greatly
enhance its obvious value.  Maybe the title should be :

"Perl should use XML as its basic data type instead of linear strings"

How does that sound?

 Have I misunderstood you if I suggest that "two or more" is actually a
 continuous range of representation from
 
 1 (contiguous linear) string data with 0 or more attributes attached to each
   character where the string's text is the backbone
   [and the global and local order of the characters in the string is crucial
    to the value and equality with other variables]
 
 2 structured data (eg XML) where the string's text is just part of the data
   held in the structure, and you could sort the data in different ways
   without changing its value
 
 Are those end members in a continuum? or are hybrids of the 2 impossible?
 Am I barking up the wrong tree completely?

I would see that (1) is the simplest form of (2), so once handling (2)
is solved, (1) is also handled.  This is from a functional point of view;
performance is another issue.  It could well be that the solution to
(2) needs only minor tweaking to be fast enough for (1) compared to
the current solution.  Or a completely separate implementation may be
warranted.

I'm with Chaim Frenkel, who wrote:
 If for no other reason, there are many ways of having the attributes
 distribute across deletions, additions, and moves. That is a policy
 decision that should not be made at the perl internal level.

This means IMHO, that the basic data structure for (1) must be
extensible in a way that it can be morphed into the one for (2).
But the implementation of functions that work on (2) are separable
from those that work on (1).

David Mitchell has a proposal for how this could be done:
 One way round this is to leave the semantics to the implementor of the SV type.
 This could be done by having vtable methods for *all* string ops
 known to Perl; in particular m//, s// and tr//.
 
 The way this could work is for the Perl core to provide a generic regex
 library, which uses only the public interface to SVs to extract
 and manipulate its contents. Standard string SVs would have the relevant
 vtable entries point to these generic regex functions.
 However, if someone wants to implement an HTML SV type, say, then
 (if they are keen enough) they can write their own m//, s// methods
 which are efficient (because they can access the internal representation),
 and can have whatever semantics the author wishes.
 
 However, since the internals of regexes are a dark art to me, I don't know
 whether it is sensible to have a single regex compiler, but multiple
 regex executors (if that's the right terminology).

I'm very happy with how this discussion is going.  Are you guys also
feeling that this could be of immense value for a lot of Perl users
out there?

Best regards,

Roland
--
[EMAIL PROTECTED]



Re: SvPV*

2000-11-22 Thread Jarkko Hietaniemi

 2) An attached table of attributes and ranges to which they apply?
Uses less memory for sparse attributes, but means that it's hard work
every time we have to interrogate or shuffle characters as we need to
check all the ranges each time to see if the characters we are
manipulating have metadata.

I believe this alternative has been discussed once in a while.  Which
ranges an operation affects is a log(N) operation on the character
position (binary search), and the ranges can also be kept sorted among
themselves on (primary key start position, secondary key end
position), so that finding out the victim ranges is also a log(N).
Admittedly, log(N) tends to be larger than 1, and certainly larger
than 0 :-)  Also, using UTF-8 (or any variable length encoding) is
a pain since you can't any more just happily offset to the data.

One could also implement SVs as balanced trees, splitting and merging
as the scalar grows and shrinks.
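A small sketch of the range-table lookup, assuming the ranges are non-overlapping, sorted by start position, and indexed by character position (not byte offset, which is where a variable-length encoding gets painful). The attr_range type and find_range function are hypothetical names.

```c
#include <stddef.h>

/* One attribute applied to the half-open character range [start, end). */
typedef struct {
    size_t start;   /* first character position covered */
    size_t end;     /* one past the last position covered */
    int    attr;    /* opaque attribute id */
} attr_range;

/* Binary search for the range covering pos.
 * Assumes ranges[] is sorted by start and ranges do not overlap.
 * Returns the index of the covering range, or -1 if pos has no metadata. */
long find_range(const attr_range *ranges, size_t n, size_t pos)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (pos < ranges[mid].start)
            hi = mid;               /* pos lies before this range */
        else if (pos >= ranges[mid].end)
            lo = mid + 1;           /* pos lies after this range */
        else
            return (long)mid;       /* start <= pos < end */
    }
    return -1;
}
```

If ranges were allowed to overlap, a plain binary search would no longer be enough and something like an interval tree would be the usual substitute.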

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: SvPV*

2000-11-22 Thread Dan Sugalski

At 10:45 AM 11/22/00 -0600, Jarkko Hietaniemi wrote:
  2) An attached table of attributes and ranges to which they apply?
 Uses less memory for sparse attributes, but means that it's hard work
 every time we have to interrogate or shuffle characters as we need to
 check all the ranges each time to see if the characters we are
 manipulating have metadata.

I believe this alternative has been discussed once in a while.  Which
ranges an operation affects is a log(N) operation on the character
position (binary search), and the ranges can also be kept sorted among
themselves on (primary key start position, secondary key end
position), so that finding out the victim ranges is also a log(N).
Admittedly, log(N) tends to be larger than 1, and certainly larger
than 0 :-)  Also, using UTF-8 (or any variable length encoding) is
a pain since you can't any more just happily offset to the data.

This strikes me as an excellent candidate for a custom scalar type. I like 
the idea, and it could be really useful in some circumstances, but I'd not 
want to burden the default scalar with the code for this.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: SvPV*

2000-11-22 Thread Jarkko Hietaniemi

 I believe this alternative has been discussed once in a while.  Which
 ranges an operation affects is a log(N) operation on the character
 position (binary search), and the ranges can also be kept sorted among
 themselves on (primary key start position, secondary key end
 position), so that finding out the victim ranges is also a log(N).
 Admittedly, log(N) tends to be larger than 1, and certainly larger
 than 0 :-)  Also, using UTF-8 (or any variable length encoding) is
 a pain since you can't any more just happily offset to the data.
 
 This strikes me as an excellent candidate for a custom scalar type. I like 
 the idea, and it could be really useful in some circumstances, but I'd not 
 want to burden the default scalar with the code for this.

Agreed.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: SvPV*

2000-11-22 Thread Dan Sugalski

At 05:07 PM 11/22/00 +, Nicholas Clark wrote:
On Wed, Nov 22, 2000 at 11:02:16AM -0600, Jarkko Hietaniemi wrote:
   Dan:
   This strikes me as an excellent candidate for a custom scalar type. I like
   the idea, and it could be really useful in some circumstances, but I'd not
   want to burden the default scalar with the code for this.
 
  Agreed.

How does the regexp replacement engine cope with this? By implementing
all replacements as substr() type ops?

By punting really, *really* hard and using a lot of handwaving at this 
point. :)

You're likely right that a series of substituting substr() calls will be 
the end result, but that's OK.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: SvPV*

2000-11-22 Thread David Mitchell

Nicholas Clark [EMAIL PROTECTED] writes:

 How does the regexp replacement engine cope with this? By implementing
 all replacements as substr() type ops?
 [or behaving as if it implements... whilst cheating and doing it direct for
 scalars it understands?]
 
 Or don't we need to work this out at this time?

This sounds like something that *does* need working out - it is
essentially the problem of defining the string-related parts of the
vtable API (which I seem to recall is where this thread started, anyway!)

A possible approach would be to have per-scalar attribute(s) saying
whether the SV's string value is
* simple bytes
* variable-length characters (eg UTF-8)
* complex (eg embedded attributes)
with different bits of the "PV" part of the API legal for each of these
options. For simple bytes, you can use the "here's a pointer to a buffer"
approach, which the regex engine etc can handle efficiently; for the others
there are more fancy (and less efficient) access methods.
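One possible shape for that per-scalar attribute, as a sketch only. The str_kind enum, the str_sv struct and both accessors are hypothetical names, and the variable-width case assumes well-formed UTF-8. The direct-pointer call is legal only for simple bytes; the other kinds must go through a slower accessor.

```c
#include <stddef.h>

/* Hypothetical per-scalar attribute describing the string value. */
typedef enum { STR_SIMPLE_BYTES, STR_VARIABLE_WIDTH, STR_COMPLEX } str_kind;

typedef struct {
    str_kind             kind;
    const unsigned char *buf;
    size_t               len;   /* length in bytes */
} str_sv;

/* Fast path: "here's a pointer to a buffer".
 * Legal only for simple bytes; returns NULL for the other kinds. */
const unsigned char *sv_direct_buffer(const str_sv *sv, size_t *len_out)
{
    if (sv->kind != STR_SIMPLE_BYTES)
        return NULL;
    *len_out = sv->len;
    return sv->buf;
}

/* Slower generic accessor: number of characters in the value.
 * Returns 0 on success, -1 if this sketch cannot answer for that kind. */
int sv_char_count(const str_sv *sv, size_t *count_out)
{
    size_t i, n = 0;
    switch (sv->kind) {
    case STR_SIMPLE_BYTES:
        *count_out = sv->len;           /* one byte per character */
        return 0;
    case STR_VARIABLE_WIDTH:
        for (i = 0; i < sv->len; i++)   /* count non-continuation bytes */
            if ((sv->buf[i] & 0xC0) != 0x80)
                n++;
        *count_out = n;
        return 0;
    default:
        return -1;                      /* complex types need their own method */
    }
}
```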

Then chuck in more complications for shared copy-on-write strings, etc etc.

Anyone want to volunteer to knock up an API in the next 5 minutes   :-)




Re: SvPV*

2000-11-22 Thread Chaim Frenkel

 "JH" == Jarkko Hietaniemi [EMAIL PROTECTED] writes:

 2) An attached table of attributes and ranges to which they apply?
 Uses less memory for sparse attributes, but means that it's hard work
 every time we have to interrogate or shuffle characters as we need to
 check all the ranges each time to see if the characters we are
 manipulating have metadata.

JH I believe this alternative has been discussed once in a while.  Which
JH ranges an operation affects is a log(N) operation on the character
JH position (binary search), and the ranges can also be kept sorted among
JH themselves on (primary key start position, secondary key end
JH position), so that finding out the victim ranges is also a log(N).
JH Admittedly, log(N) tends to be larger than 1, and certainly larger
JH than 0 :-)  Also, using UTF-8 (or any variable length encoding) is
JH a pain since you can't any more just happily offset to the data.

JH One could also implement SVs as balanced trees, splitting and merging
JH as the scalar grows and shrinks.

I'd offer the possibility that there are two (or perhaps more)
different problems here.  One is the current bunch of bytes (string,
executable to be twiddled).  Another, which attributes on strings
seem to be, is structured data.

Squeezing attributes onto a buffer seems to be shoehorning a more
general problem onto a specific implementation.

Getting an efficient representation of a meaningful structure should
be done as a new data type.

(I'm thinking of representing COBOL records/data, or even XML documents)

chaim
-- 
Chaim Frenkel    Nonlinear Knowledge, Inc.
[EMAIL PROTECTED]   +1-718-236-0183



Re: SvPV*

2000-11-22 Thread Nicholas Clark

On Wed, Nov 22, 2000 at 01:24:50PM -0500, Chaim Frenkel wrote:
 I'd offer the possibility that there are two (or perhaps more)
 different problems here.  One is the current bunch of bytes (string,
 executable to be twiddled).  Another, which attributes on strings
 seem to be, is structured data.
 
 Squeezing attributes onto a buffer seems to be shoehorning a more
 general problem onto a specific implementation.
 
 Getting an efficient representation of a meaningful structure should
 be done as a new data type.
 
 (I'm thinking of representing COBOL records/data, or even XML documents)

Have I misunderstood you if I suggest that "two or more" is actually a
continuous range of representation from

1 (contiguous linear) string data with 0 or more attributes attached to each
  character where the string's text is the backbone
  [and the global and local order of the characters in the string is crucial
   to the value and equality with other variables]

2 structured data (eg XML) where the string's text is just part of the data
  held in the structure, and you could sort the data in different ways
  without changing its value

Are those end members in a continuum? or are hybrids of the 2 impossible?
Am I barking up the wrong tree completely?

Nicholas Clark



Re: SvPV*

2000-11-22 Thread Chaim Frenkel

 "NC" == Nicholas Clark [EMAIL PROTECTED] writes:

NC Have I misunderstood you if I suggest that "two or more" is actually a
NC continuous range of representation from

NC 1 (contiguous linear) string data with 0 or more attributes attached to each
NC   character where the string's text is the backbone
NC   [and the global and local order of the characters in the string is crucial
NC    to the value and equality with other variables]

NC 2 structured data (eg XML) where the string's text is just part of the data
NC   held in the structure, and you could sort the data in different ways
NC   without changing its value

NC Are those end members in a continuum? or are hybrids of the 2 impossible?
NC Am I barking up the wrong tree completely?

That's one way of looking at it.

But I'm more inclined to think of the structured data type as a layer
above the raw bits. I see the association of attributes with the underlying
data as an extra 'service'.

If for no other reason, there are many ways of having the attributes
distribute across deletions, additions, and moves. That is a policy
decision that should not be made at the perl internal level.

chaim
-- 
Chaim Frenkel    Nonlinear Knowledge, Inc.
[EMAIL PROTECTED]   +1-718-236-0183



Re: SvPV*

2000-11-22 Thread Russ Allbery

Dan Sugalski [EMAIL PROTECTED] writes:

 More often vice versa. INN embeds perl, for example, and uses it for
 spam detection. When it builds scalars for perl to use, it uses the copy
 of the article already in memory to avoid copies. (Given the volume of
 news and the size of some news articles this can save a lot.)  You
 wouldn't want perl messing with it in that case, since the string memory
 really isn't perl's to manage.

INN marks such "windowed" scalars as read-only, which I think only makes
sense for that situation.  I guess I could think of cases where you might
want to do in-place modifications without changing the allocation, but
that sounds a lot iffier.

-- 
Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/



SvPV*

2000-11-21 Thread Nicholas Clark

(I'm not sure if I've missed all the fun here before I subscribed, but
I can't find anything on the RFC list that mentions the following)

perl5 has a tangle of SvPV macros to allow C code to get a pointer
to the scalar. (or the "private", with or without the length, and
more relating to utf8 that don't even appear to be documented)

Has any thought yet been given to the API to get scalars?

Jarkko posted an idea on p5p of "Virtual Values" which would permit a
scalar to point to another scalar's buffer, rather than its own.
Currently the perl5 API assumes that you get a read-write pointer, and that
the thing it points to is "\0" terminated. This makes it hard to implement
copy on write, or to allow a pointer to a sub-length of the parent
scalar's buffer.

IIRC Ilya mailed p5p bemoaning the fact that perl's SVs use a contiguous
buffer. A split-buffer representation (where a hole is allowed in the
middle of the buffer data) permits much faster replacement type operations,
as there is less copying, and you can move the hole around to suit your
needs.
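A compact sketch of such a split (gap) buffer, to show why replacements get cheaper: an insert or delete at the hole copies nothing outside the hole, and only moving the hole costs a memmove. The gap_buffer type and gb_* functions are hypothetical names, and no existing perl structure is being described.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *buf;
    size_t cap;        /* total allocated bytes */
    size_t gap_start;  /* first byte of the hole */
    size_t gap_end;    /* one past the last byte of the hole */
} gap_buffer;

int gb_init(gap_buffer *gb, size_t cap)
{
    if (cap == 0) cap = 1;
    gb->buf = malloc(cap);
    if (!gb->buf) return -1;
    gb->cap = cap;
    gb->gap_start = 0;
    gb->gap_end = cap;          /* the whole buffer starts out as hole */
    return 0;
}

size_t gb_length(const gap_buffer *gb)
{
    return gb->cap - (gb->gap_end - gb->gap_start);
}

/* Move the hole so that it starts at logical position pos (0..length). */
static void gb_move_gap(gap_buffer *gb, size_t pos)
{
    if (pos < gb->gap_start) {
        size_t n = gb->gap_start - pos;
        memmove(gb->buf + gb->gap_end - n, gb->buf + pos, n);
        gb->gap_start -= n;
        gb->gap_end -= n;
    } else if (pos > gb->gap_start) {
        size_t n = pos - gb->gap_start;
        memmove(gb->buf + gb->gap_start, gb->buf + gb->gap_end, n);
        gb->gap_start += n;
        gb->gap_end += n;
    }
}

/* Insert n bytes at logical position pos, growing the hole if needed. */
int gb_insert(gap_buffer *gb, size_t pos, const char *s, size_t n)
{
    if (pos > gb_length(gb)) return -1;
    if (gb->gap_end - gb->gap_start < n) {
        size_t tail = gb->cap - gb->gap_end;
        size_t new_cap = (gb->cap + n) * 2;
        char *nb = realloc(gb->buf, new_cap);
        if (!nb) return -1;
        memmove(nb + new_cap - tail, nb + gb->gap_end, tail);
        gb->buf = nb;
        gb->gap_end = new_cap - tail;
        gb->cap = new_cap;
    }
    gb_move_gap(gb, pos);
    memcpy(gb->buf + gb->gap_start, s, n);
    gb->gap_start += n;
    return 0;
}

/* Delete n bytes at logical position pos by widening the hole. */
int gb_delete(gap_buffer *gb, size_t pos, size_t n)
{
    if (pos > gb_length(gb) || n > gb_length(gb) - pos) return -1;
    gb_move_gap(gb, pos);
    gb->gap_end += n;
    return 0;
}

/* Copy the logical contents into out, which must hold gb_length()+1 bytes. */
void gb_copy_out(const gap_buffer *gb, char *out)
{
    size_t tail = gb->cap - gb->gap_end;
    memcpy(out, gb->buf, gb->gap_start);
    memcpy(out + gb->gap_start, gb->buf + gb->gap_end, tail);
    out[gb->gap_start + tail] = '\0';
}
```

A caller that wants one contiguous, "\0"-terminated block has to pay for gb_copy_out (or for moving the hole to the end), which is why the caller needs a way to say it can cope with a hole.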

So I was wondering if perl6 was going to replace SvPV* with something that
allows the caller to say whether they'd like

* read only or read write
* buffer all in one block or can cope with a hole (plus tell me where it is)
* null terminated buffer or don't care

and possibly

* data must be in utf8 or tell me what the data is in

although this might be better done as caller specifies 1 or more acceptable
encodings they could cope with, and SvPV* returns data in whatever
requires least work to translate consistent with maintaining accuracy.
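As a sketch of what such a request could look like, the caller might pass a set of flags and the scalar could decide whether it can hand out its buffer as-is or must first copy or normalise it. The REQ_* flags, the buf_state struct and request_needs_work are hypothetical names, and encoding negotiation is left out.

```c
/* What the caller asks for (hypothetical flag names). */
enum {
    REQ_WRITE          = 1,  /* caller intends to modify the buffer */
    REQ_CONTIGUOUS     = 2,  /* caller cannot cope with a hole */
    REQ_NUL_TERMINATED = 4   /* caller needs a trailing "\0" */
};

/* What the scalar currently has. */
typedef struct {
    int shared;          /* buffer is shared with another scalar */
    int has_hole;        /* split-buffer representation in use */
    int nul_terminated;  /* a "\0" follows the data */
} buf_state;

/* Returns 1 if the buffer must be copied or normalised before it can be
 * handed out under this request, 0 if it can be returned as it stands. */
int request_needs_work(const buf_state *st, unsigned req)
{
    if ((req & REQ_WRITE) && st->shared)
        return 1;   /* unshare first (copy on write) */
    if ((req & REQ_CONTIGUOUS) && st->has_hole)
        return 1;   /* close the hole first */
    if ((req & REQ_NUL_TERMINATED) && !st->nul_terminated)
        return 1;   /* append a terminator first */
    return 0;
}
```

A read-only caller that can cope with a hole and does not care about termination never forces any work, which is the case the current SvPV* interface cannot express.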

In particular specifying read/write versus read only would allow
perl to treat scalars as copy-on-write which would mean things like
$a=$b wouldn't actually copy anything (wasting time and (shared) memory
pages) until either $a or $b got changed.
[I have this feeling that there's a bit of this already in sv.c, but I'm
not sure how much]
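A minimal reference-counted sketch of the copy-on-write behaviour, assuming read and write access are requested through separate calls. The cow_buf and cow_sv types and the cow_* functions are hypothetical names, and this makes no claim about what sv.c actually does. Assignment only bumps a count; the copy is deferred until someone asks for write access to a shared buffer.

```c
#include <stdlib.h>
#include <string.h>

typedef struct {
    size_t refs;   /* number of scalars sharing this buffer */
    size_t len;
    char  *data;
} cow_buf;

typedef struct {
    cow_buf *buf;
} cow_sv;

static cow_buf *cow_buf_new(const char *s, size_t len)
{
    cow_buf *b = malloc(sizeof *b);
    if (!b) return NULL;
    b->data = malloc(len + 1);
    if (!b->data) { free(b); return NULL; }
    memcpy(b->data, s, len);
    b->data[len] = '\0';
    b->len = len;
    b->refs = 1;
    return b;
}

int cow_init(cow_sv *sv, const char *s)
{
    sv->buf = cow_buf_new(s, strlen(s));
    return sv->buf ? 0 : -1;
}

void cow_release(cow_sv *sv)
{
    if (sv->buf && --sv->buf->refs == 0) {
        free(sv->buf->data);
        free(sv->buf);
    }
    sv->buf = NULL;
}

/* The $a = $b case: share the buffer, copy nothing. */
void cow_assign(cow_sv *dst, const cow_sv *src)
{
    if (dst->buf == src->buf) return;
    cow_release(dst);
    dst->buf = src->buf;
    if (dst->buf) dst->buf->refs++;
}

/* Read-only access never copies. */
const char *cow_read(const cow_sv *sv)
{
    return sv->buf->data;
}

/* Write access: take a private copy first if the buffer is shared. */
char *cow_write(cow_sv *sv)
{
    if (sv->buf->refs > 1) {
        cow_buf *copy = cow_buf_new(sv->buf->data, sv->buf->len);
        if (!copy) return NULL;
        sv->buf->refs--;
        sv->buf = copy;
    }
    return sv->buf->data;
}
```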

Nicholas Clark



Re: SvPV*

2000-11-21 Thread Jarkko Hietaniemi

On Tue, Nov 21, 2000 at 05:04:32PM +, Nicholas Clark wrote:
 (I'm not sure if I've missed all the fun here before I subscribed, but
 I can't find anything on the RFC list that mentions the following)
 
 perl5 has a tangle of SvPV macros to allow C code to get a pointer
 to the scalar. (or the "private", with or without the length, and
 more relating to utf8 that don't even appear to be documented)
 
 Has any thought yet been given to the API to get scalars?
 
 Jarkko posted an idea on p5p of "Virtual Values" which would permit a
 scalar to point to another scalar's buffer, rather than its own.

That was one half, yes.  The other half was that a VV would
point to a 'window' or 'slice' of the other scalar's buffer, not
necessarily the whole buffer.
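A sketch of what such a window might look like at the C level, assuming it is just a pointer plus a length into the parent's buffer. The sv_window type, window_make and window_equals are hypothetical names. The slice is not "\0" terminated, so it has to be compared by length rather than with strcmp().

```c
#include <stddef.h>
#include <string.h>

/* A read-only view onto part of another scalar's buffer. */
typedef struct {
    const char *start;  /* points into the parent's buffer, not a copy */
    size_t      len;    /* the slice is NOT "\0" terminated */
} sv_window;

/* Create a window of len bytes at offset within the parent buffer.
 * Returns 0 on success, -1 if the window would run past the parent. */
int window_make(sv_window *w, const char *parent, size_t parent_len,
                size_t offset, size_t len)
{
    if (offset > parent_len || len > parent_len - offset)
        return -1;
    w->start = parent + offset;
    w->len = len;
    return 0;
}

/* Compare the window's contents with a C string, by length. */
int window_equals(const sv_window *w, const char *s)
{
    size_t n = strlen(s);
    return n == w->len && memcmp(w->start, s, n) == 0;
}
```

The window stays valid only while the parent's buffer is neither freed nor reallocated, so a real design would also need the window to hold a reference on its parent.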

 Currently the perl5 API assumes that you get a read-write pointer, and that
 the thing it points to is "\0" terminated. This makes it hard to implement
 copy on write, or to allow a pointer to a sub-length of the parent
 scalar's buffer.

What he said.

 IIRC Ilya mailed p5p bemoaning the fact that perl's SVs use a contiguous
 buffer. A split-buffer representation (where a hole is allowed in the
 middle of the buffer data) permits much faster replacement type operations,
 as there is less copying, and you can move the hole around to suit your
 needs.

Yet another bummer of the current SVs is that they poorly fit into
'foreign memory' situations where the buffer is managed by something
other than Perl.  "No, thank you, Perl, keep your greedy fingers off
this chunk.  No, you may not play with it."

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen