Re: buffered input
On Sat, 05 Feb 2011 00:46:40 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. [snip] What do you all think? I haven't read many of the responses, but I'll say again what I've always said. The range concept does not fit streams very well. I think a range can be built on a stream, but I think a buffered stream should be it's own type (similar to how you laid it out a few weeks ago). IMO, a stream should be a simple class hierarchy that defines input/output and buffering. Then ranges can be built on top of the stream to interface with other parts of phobos. Now, I think as an *input parameter* for algorithms that wish to work with streams (or other ranges), a range of T[] is most appropriate. -Steve
Re: buffered input
On Sun, 06 Feb 2011 10:25:15 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: How does the stream decide between 1 and 2? Clearly it's undesirable to grow the buffer too much and it's also undesirable to copy too much data. A simple approach is to establish a bound on losses, for example copy data only if size to copy is 5% of the entire buffer. I think also, you should test to see if appending will copy the data anyways (i.e. reallocate). We may need to add a runtime function for this. -Steve
Re: buffered input
On Sat, 05 Feb 2011 10:02:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/5/11 2:45 AM, Michel Fortin wrote: One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Generally when one says I want the stream to copy data straight into my buffers this is the same as I want the stream to be unbuffered. So if you want to provide your own buffers to be filled, we need to discuss refining the design of unbuffered input - for example by adding an optional routine for bulk transfer to input ranges. I may want to store 1% of a very large file. You are saying I must either a) unbuffer the entire file (handling the buffering on my own) or b) take the penalty and double copy the data. I want c) temporarily use my buffer for buffering until I say to stop. The range interface doesn't make this easy... -Steve
Re: buffered input
On 02/07/2011 02:01 PM, Steven Schveighoffer wrote: On Sat, 05 Feb 2011 10:02:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/5/11 2:45 AM, Michel Fortin wrote: One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Generally when one says I want the stream to copy data straight into my buffers this is the same as I want the stream to be unbuffered. So if you want to provide your own buffers to be filled, we need to discuss refining the design of unbuffered input - for example by adding an optional routine for bulk transfer to input ranges. I may want to store 1% of a very large file. You are saying I must either a) unbuffer the entire file (handling the buffering on my own) or b) take the penalty and double copy the data. I want c) temporarily use my buffer for buffering until I say to stop. The range interface doesn't make this easy... This looks very similar to a possible current use case of mine. (lookahead on demand -- possible backtracking) Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input (examples?)
On 2011-02-07 08:24:32 -0500, spir denis.s...@gmail.com said: Does this have anything to do with currently discussed buffered input ranges? If yes, how does such a design, or any alternative, fit their proposed interface? You can build all of this on top of a buffered input range. The buffered input range is not an alternative for your complex parsing algorithm, it's just the component managing the buffer. Having the underlying range manage the buffer (as opposed to having your parser algorithm doing it) means that the buffer implementation can vary depending on the type of range. For instance, if you're parsing directly data in memory, the buffered range can use directly this data in memory as the buffer, requiring no allocation and no copy; if you're parsing from a file or a network stream it'll use a more standard buffer. But how exactly the buffer is implemented does not affect your parsing algorithm in any way. That's the great thing about it, separation of concerns: your algorithm will work independently of the buffer implementation used. All your parser algorithm has to do is say shiftFront and appendToFront to control what's available in the buffer. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: buffered input
Andrei Alexandrescu Wrote: With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. This surely satisfies most needs for buffered input. I'm thinking about adaptive caller which can do tricks based on the stream content. Say, strings are usually serialized as |length|data| so the caller can preallocate buffer of exact length and fill it directly. Will unbuffered file stream bypass system cache? If yes, you'll have poor performance if the file is already in cache. If no, you'll have double copy if the caller wants to save the data: from system cache to the range's buffer and from the range's buffer to the caller's buffer.
Re: buffered input
On 2/7/11 7:53 AM, Steven Schveighoffer wrote: On Sat, 05 Feb 2011 00:46:40 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. [snip] What do you all think? I haven't read many of the responses, but I'll say again what I've always said. The range concept does not fit streams very well. I think a range can be built on a stream, but I think a buffered stream should be it's own type (similar to how you laid it out a few weeks ago). IMO, a stream should be a simple class hierarchy that defines input/output and buffering. Then ranges can be built on top of the stream to interface with other parts of phobos. One good thing about the current proposal is that it integrates seamlessly with that one. Andrei
Re: buffered input (examples?)
On 2/7/11 9:40 AM, Michel Fortin wrote: On 2011-02-07 08:24:32 -0500, spir denis.s...@gmail.com said: Does this have anything to do with currently discussed buffered input ranges? If yes, how does such a design, or any alternative, fit their proposed interface? You can build all of this on top of a buffered input range. The buffered input range is not an alternative for your complex parsing algorithm, it's just the component managing the buffer. Having the underlying range manage the buffer (as opposed to having your parser algorithm doing it) means that the buffer implementation can vary depending on the type of range. For instance, if you're parsing directly data in memory, the buffered range can use directly this data in memory as the buffer, requiring no allocation and no copy; if you're parsing from a file or a network stream it'll use a more standard buffer. But how exactly the buffer is implemented does not affect your parsing algorithm in any way. That's the great thing about it, separation of concerns: your algorithm will work independently of the buffer implementation used. All your parser algorithm has to do is say shiftFront and appendToFront to control what's available in the buffer. Very well put, thanks. Andrei
Re: buffered input (examples?)
On 02/07/2011 03:40 PM, Michel Fortin wrote: On 2011-02-07 08:24:32 -0500, spir denis.s...@gmail.com said: Does this have anything to do with currently discussed buffered input ranges? If yes, how does such a design, or any alternative, fit their proposed interface? You can build all of this on top of a buffered input range. The buffered input range is not an alternative for your complex parsing algorithm, it's just the component managing the buffer. Having the underlying range manage the buffer (as opposed to having your parser algorithm doing it) means that the buffer implementation can vary depending on the type of range. For instance, if you're parsing directly data in memory, the buffered range can use directly this data in memory as the buffer, requiring no allocation and no copy; if you're parsing from a file or a network stream it'll use a more standard buffer. But how exactly the buffer is implemented does not affect your parsing algorithm in any way. That's the great thing about it, separation of concerns: your algorithm will work independently of the buffer implementation used. All your parser algorithm has to do is say shiftFront and appendToFront to control what's available in the buffer. Right, that's nice. But I was certainly unclear. My question was in fact how to make the lexeme stream be a buffered range. So that, precisely, the parser does not have to cope about buffering (or rather has to cope about it the std way clients will use to cope with buffered ranges once they are standardised). In my hypothetical design, store() and unstore() are commands used to manage buffering, thus similar in purpose, but different from what Andrei's proposal provifdes. They just match the way I see a parser's needs, naively. Now, how to translate this design to make the lexeme stream a buffered range is very unclear for me; needed operations do not directly translate to appendToFront / shiftFront (unless I do not understand their action). I'm not even sure whether it would be simply correct to use a buffered range. Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input (examples?)
On 2011-02-07 13:25:53 -0500, spir denis.s...@gmail.com said: On 02/07/2011 03:40 PM, Michel Fortin wrote: On 2011-02-07 08:24:32 -0500, spir denis.s...@gmail.com said: Does this have anything to do with currently discussed buffered input ranges? If yes, how does such a design, or any alternative, fit their proposed interface? You can build all of this on top of a buffered input range. The buffered input range is not an alternative for your complex parsing algorithm, it's just the component managing the buffer. Having the underlying range manage the buffer (as opposed to having your parser algorithm doing it) means that the buffer implementation can vary depending on the type of range. For instance, if you're parsing directly data in memory, the buffered range can use directly this data in memory as the buffer, requiring no allocation and no copy; if you're parsing from a file or a network stream it'll use a more standard buffer. But how exactly the buffer is implemented does not affect your parsing algorithm in any way. That's the great thing about it, separation of concerns: your algorithm will work independently of the buffer implementation used. All your parser algorithm has to do is say shiftFront and appendToFront to control what's available in the buffer. Right, that's nice. But I was certainly unclear. My question was in fact how to make the lexeme stream be a buffered range. So that, precisely, the parser does not have to cope about buffering (or rather has to cope about it the std way clients will use to cope with buffered ranges once they are standardised). In my hypothetical design, store() and unstore() are commands used to manage buffering, thus similar in purpose, but different from what Andrei's proposal provifdes. They just match the way I see a parser's needs, naively. Now, how to translate this design to make the lexeme stream a buffered range is very unclear for me; needed operations do not directly translate to appendToFront / shiftFront (unless I do not understand their action). I'm not even sure whether it would be simply correct to use a buffered range. You'd probably need a wrapper around the buffered range to handle backtracking nicely. Something like this: struct Backtracker(R) { R bufRange; size_t backtrackLength; ElementType!R front() @property { bufRange.front[0..$-backtrackLength]; } void shiftFront(n) { bufRange.shiftFront(size_t n); } void backtrackFront(size_t n) { backtrackLength += n; } void appendToFront(size_t n) { if (backtrackLength = n) { backtrackLength -= n; } else { bufRange.appendToFront(n-backtrackLength); backtrackLength = 0; } } void popFront() { assert(backtrackLength == 0, mixin backtracking with by-chunk iteration not implemented); buffRange.popFront(); } } Perhaps this backtrackFront() function could be an optional part of the standard buffered range interface. If an algorithm requires a backtracking buffered input range and the input range you're provided with does not support backtracking, then you can use a wrapper like the one I drafted above to make the input acceptable to the algorithm. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: buffered input
On Saturday 05 February 2011 12:57:21 Jonathan M Davis wrote: On Saturday 05 February 2011 07:16:45 Andrei Alexandrescu wrote: On 2/5/11 5:09 AM, Jonathan M Davis wrote: Hmm. I think that I'd have to have an actual implementation to mess around with to say much. My general take on buffered input is that I don't want to worry about it. I want it to be buffered so that it's more efficient, but I don't want to have to care about it in how I use it. I would have expected a buffered input range to be exactly the same as an input range except that it doesn't really just pull in one character behind the scenes. It pulls in 1024 or whatever when popFront() would result in the end of the buffer being reached, and you just get the first one with front. The API doesn't reflect the fact that it's buffered at all except perhaps in how you initialize it (by telling how big the buffer is, though generally I don't want to have to care about that either). Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon The thing is though that if I want to be iterating over a string which is buffered (from a file or stream or whatever), I want front to be immutable(char) or char, not immutable(char)[] or char[]. I can see how having an interface which allows startsWith to efficiently check whether the buffered string starts with a particular string makes good sense, but generally, as far as I'm concerned, that's startsWith's problem. How would I even begin to use a buffered range of string[] as a string? Normally, when I've used buffered anything, it's been purely for efficiency reasons. All I've cared about is having a stream or file or whatever. The fact that reading it from the file (or wherever it came from) in a buffered manner is more efficient means that I want it buffered, but that hasn't had any effect on how I've used it. If I want x characters from the file, I ask for x characters. It's the buffered object's problem how many reads that does or doesn't do. You must be thinking of a use case which I don't normal think of or am not aware of. In my experience, buffering has always been an implementation detail that you use because it's more efficient, but you don't worry about it beyond creating a buffered stream rather than an unbuffered one. Okay. I think that I've been misunderstanding some stuff here. I forgot that we were dealing with input ranges rather than forward ranges, and many range functions just don't work with input ranges, since they lack save(). Bleh. Okay. Honestly, what I'd normally want to be dealing with when reading a stream or file is a buffered forward range which is implemented in a manner which minimized copies. Having to deal with a input range, let alone what Andrei is suggesting here would definitely be annoying to say the least. Couldn't we do something which created a new buffer each time that it read in data from a file, and then it could be a forward range with infinite look-ahead. The cost of creating a new buffer would likely be minimal, if not outright neglible, in comparison to reading in the data from a file, and having multiple buffers would allow it to be a forward range. Perhaps, the creation of a new buffer could even be skipped if save had never been called and therefore no external references to the buffer would exist - at least as long as we're talking about bytes or characters or other value types. Maybe there's some major flaw in that basic idea. I don't know. But Andrei's suggestion sounds like a royal pain for basic I/O. If that's all I had to deal with when trying to lazily read in a file and process it, I'd just use readText() instead, since it would just be way easier to use. But that's not exactly ideal, because it doesn't work well with large files. Maybe Andrei's idea is great to have and maybe it _should_ be in Phobos, but I really think that we need a higher level abstraction that makes a stream into a forward range so that it's actually simple to use buffered I/O. As efficient as Andrei's suggestion may be, it sure sounds like a royal pain to use - especially in comparison to readText(). So, maybe I'm still misunderstanding or missing something here, but what _I_ want to see for I/O streams is a _forward_ range which is buffered and which reads in the file or whatever the data comes from in a lazy manner. The more I think about it, the less I like input ranges. They're just so painfully restrictive. They may be necessary at times, but I'd _much_ prefer to deal with forward ranges. On a related note, perhaps we should add a function to some subset of ranges which is
Re: buffered input
On 02/06/2011 04:11 AM, Andrei Alexandrescu wrote: On 2/5/11 5:22 PM, Nick Sabalausky wrote: Heywood Floydsoul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all. Better names are always welcome! If append actually appends into buffer (what I understand as of now), then appendToBuf(er). For shiftFront, maybe eatSlice? popFrontSlice would be very good (esp as opposed to a hypothetical poBackSlice), to hint the operation happens at head (because pop alone means at back). But as others have said front in the context of ranges has now an established sense of first (just like head or car for lists, by the way), or maybe more exactly current. Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
Nick Sabalausky napisał: discard and fetch? I like that.
Re: buffered input
On 2011-02-06 03:22:08 -0500, Jonathan M Davis jmdavisp...@gmx.com said: So, maybe I'm still misunderstanding or missing something here, but what _I_ want to see for I/O streams is a _forward_ range which is buffered and which reads in the file or whatever the data comes from in a lazy manner. The more I think about it, the less I like input ranges. They're just so painfully restrictive. They may be necessary at times, but I'd _much_ prefer to deal with forward ranges. It's true that forward ranges are much easier to deal with. That said, if the underlying abstraction does not support saving a position to return to it later (think of a network stream), implementing them is rather impractical, even though it might be possible with buffering. But still, let's let's see how we could implement a forward range on top of the buffered range abstraction could be done: 1. Wrap the buffered range in a data structure that keeps track of the absolute offset for the first element in the range and contains a sorted list of offsets representing all the forward ranges iterating on it. 2. Make is so that each time the smallest offset in the list is advanced you call shiftFront to to remove from the buffer the no longer referenced elements; and each time an offset goes beyond what's available in the buffer you call appendToFront to make elements up to that index available in the buffer. This ensures the buffer always cover all of the offsets in the list. To make this efficient, the list must be kept sorted at all times. 3. Then you can create forward ranges as reference to that data structure: all they have to do is maintain their offset in the list. Creating a new range would just create another offset in the list and ensures the buffer is preserved. When the range advances or gets destroyed it updates or destroy its offset in the list accordingly. This way the buffer always covers all forward ranges. While it is indeed possible to do what you want, it's hardly a good abstraction to build an efficient parsing algorithm. There's just too much bookkeeping to do. And you must be sure saved ranges get destroyed as soon as possible to avoid the buffer from growing too much, something you usually don't have to care about with forward ranges. -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: buffered input
On 2/6/11 3:22 EST, Jonathan M Davis wrote: Okay. I think that I've been misunderstanding some stuff here. I forgot that we were dealing with input ranges rather than forward ranges, and many range functions just don't work with input ranges, since they lack save(). Bleh. Okay. Honestly, what I'd normally want to be dealing with when reading a stream or file is a buffered forward range which is implemented in a manner which minimized copies. Having to deal with a input range, let alone what Andrei is suggesting here would definitely be annoying to say the least. Couldn't we do something which created a new buffer each time that it read in data from a file, and then it could be a forward range with infinite look-ahead. The cost of creating a new buffer would likely be minimal, if not outright neglible, in comparison to reading in the data from a file, and having multiple buffers would allow it to be a forward range. Perhaps, the creation of a new buffer could even be skipped if save had never been called and therefore no external references to the buffer would exist - at least as long as we're talking about bytes or characters or other value types. APIs predicated on the notion that I/O is very expensive and extra overheads are not measurable have paid dearly for it (e.g. C++'s iostreams). Maybe there's some major flaw in that basic idea. I don't know. But Andrei's suggestion sounds like a royal pain for basic I/O. If that's all I had to deal with when trying to lazily read in a file and process it, I'd just use readText() instead, since it would just be way easier to use. Clearly reading the entire file in an in-memory structure simplifies things. But the proposed streaming interface is extremely convenient as it always was; the two added APIs help people who need extra flexibility without hurting efficiency. If you want to read a file in Java: http://www.java-tips.org/java-se-tips/java.io/how-to-read-file-in-java.html In C (with many caveats): http://www.phanderson.com/files/file_read.html In D: foreach (line; File(name).byLine()) { ... } I plan to add a simpler API: foreach (line; File.byLine(name)) { ... } To read fixed-sized chunks, use byChunk. This covers the vast majority of file I/O needs. There are two limitations of the current APIs: 1. You can't add a new line to the existing line (or a buffer to the existing buffer) if you sometimes want to process multiple lines as a logical unit (some programs and file formats need that, as well as composing streams). 2. You can't comfortably read data of user-specified size if that size varies. This is the case for e.g. binary formats where you need to read doped chunks, i.e. chunks prefixed by their lengths. My proposal addresses 1 and makes 2 possible. Andrei
Re: buffered input
On 2/6/11 0:01 EST, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iil64l$1f6s$1...@digitalmars.com... On 2/5/11 7:28 PM, Don wrote: Circular buffers don't seem like an 'optional' use case to me. Most real I/O works that way. I think if the abstraction can't handle it, the abstraction is a failure. The abstraction does handle it implicitly, except that it doesn't fix the buffer size. If you ask for appendToFront() with large numbers without calling shiftFront() too, the size of the buffer will ultimately increase to accommodate the entire input. That's the infinite in infinite lookahead. But what about when the window straddles the border? Ex: The circular buffer's internal size is 1000, the current starting point is 900 and the window (ie, front()) is 200. I guess that could work fine if front() is a random-access range, but if it's an array (which I think is what you proposed unless I misunderstood), then front() would have to return a new allocation: buf[900..$]~buf[0..100]. Say the buffer has 1000 elements of which the last 100 contain data (and the other 900 are past data that's not used). Then say this request comes: stream.appendToFront(150); At this point the stream may go two ways: 1. Append to the internal in-memory buffer, making it larger: _buf.length += 150; ... read into _buf[$ - 150 .. $] ... Now we have a buffer that has 1100 elements, of which the last 250 are used. 2. Move the data to the beginning of the buffer and then read 150 more elements starting at position 100: _buf[0 .. 100] = _buf[$ - 100 .. $]; ... read into _buf[100 .. 250] ... Now _buf has still 1000 elements, of which the first 250 contain data. How does the stream decide between 1 and 2? Clearly it's undesirable to grow the buffer too much and it's also undesirable to copy too much data. A simple approach is to establish a bound on losses, for example copy data only if size to copy is 5% of the entire buffer. Andrei
Re: buffered input
On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei
Re: buffered input
On 02/06/2011 04:25 PM, Andrei Alexandrescu wrote: Say the buffer has 1000 elements of which the last 100 contain data (and the other 900 are past data that's not used). Then say this request comes: stream.appendToFront(150); At this point the stream may go two ways: 1. Append to the internal in-memory buffer, making it larger: _buf.length += 150; ... read into _buf[$ - 150 .. $] ... Now we have a buffer that has 1100 elements, of which the last 250 are used. 2. Move the data to the beginning of the buffer and then read 150 more elements starting at position 100: _buf[0 .. 100] = _buf[$ - 100 .. $]; ... read into _buf[100 .. 250] ... Now _buf has still 1000 elements, of which the first 250 contain data. How does the stream decide between 1 and 2? Clearly it's undesirable to grow the buffer too much and it's also undesirable to copy too much data. A simple approach is to establish a bound on losses, for example copy data only if size to copy is 5% of the entire buffer. Isn't absolute size also relevant, if not more? I mean, as long as buffer size is small (if not insignificant) in absolute value, compared to eg CPU cache or available RAM, may it be good strategy to grow it whatever the relative growth in proportion of current size? Also: could a (truely) circular buffer help solve the above copy problem, concretely? Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On 2/6/11 12:42 PM, spir wrote: On 02/06/2011 04:25 PM, Andrei Alexandrescu wrote: Say the buffer has 1000 elements of which the last 100 contain data (and the other 900 are past data that's not used). Then say this request comes: stream.appendToFront(150); At this point the stream may go two ways: 1. Append to the internal in-memory buffer, making it larger: _buf.length += 150; ... read into _buf[$ - 150 .. $] ... Now we have a buffer that has 1100 elements, of which the last 250 are used. 2. Move the data to the beginning of the buffer and then read 150 more elements starting at position 100: _buf[0 .. 100] = _buf[$ - 100 .. $]; ... read into _buf[100 .. 250] ... Now _buf has still 1000 elements, of which the first 250 contain data. How does the stream decide between 1 and 2? Clearly it's undesirable to grow the buffer too much and it's also undesirable to copy too much data. A simple approach is to establish a bound on losses, for example copy data only if size to copy is 5% of the entire buffer. Isn't absolute size also relevant, if not more? I mean, as long as buffer size is small (if not insignificant) in absolute value, compared to eg CPU cache or available RAM, may it be good strategy to grow it whatever the relative growth in proportion of current size? Of course. But mathematically you want to find bounds as a fraction of the input size. Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Andrei
Re: buffered input
Andrei Alexandrescu napisał: Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Truely circular, probably not, but a wrap-around slice (circular view of length at most underlying.length) does offer that and solves the copy problem with style. -- Tomek
Re: buffered input
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote in message news:iimnm6$1m4a$2...@digitalmars.com... On 2/6/11 12:42 PM, spir wrote: Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Agreed, but not everything always needs infinite lookahead. And sometimes space guarantees are important.
Re: buffered input
On 02/06/2011 08:49 PM, Tomek Sowiński wrote: Andrei Alexandrescu napisał: Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Truely circular, probably not, but a wrap-around slice (circular view of length at most underlying.length) does offer that and solves the copy problem with style. Right. _ vita es estrany spir.wikidot.com
Re: buffered input
On 2/6/11 4:13 PM, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iimnm6$1m4a$2...@digitalmars.com... On 2/6/11 12:42 PM, spir wrote: Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Agreed, but not everything always needs infinite lookahead. And sometimes space guarantees are important. If you don't use infinite lookahead you won't consume infinite memory. You stand to consume more memory though but I believe we're not supposed to optimize for that. To implement O(n) buffering for n elements of lookahead you can use the circular buffer already present in std.range as backend, coupled with routines to replenish it. The disadvantage is that the client doesn't see true T[] buffers, it sees lookalike random-access ranges that use modulo operations for indexing. All in all this is an abstraction worth providing but not as general as discardFromFront() and appendToFront(). Andrei
Re: buffered input
On 2/6/11 4:51 PM, spir wrote: On 02/06/2011 08:49 PM, Tomek Sowiński wrote: Andrei Alexandrescu napisał: Also: could a (truely) circular buffer help solve the above copy problem, concretely? Not if you want infinite lookahead, which I think is what any modern buffering system should offer. Truely circular, probably not, but a wrap-around slice (circular view of length at most underlying.length) does offer that and solves the copy problem with style. Right. With fixed lookahead you can't do a lot of things - such as line continuation in C programs or CSV files. There are plenty of other examples. Generally I believe k-lookahead is a thing of the past and infinite-lookahead is where the future is. Andrei
Re: buffered input
On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty.
Re: buffered input
2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Maybe you could use appendToFront if you knew that you needed to expand the stream's buffer, but that doesn't sound like a common case. Torarin
Re: buffered input
On 2/6/11 8:57 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Both byLine and byChunk are buffered inputs, are often used, and will have seldom use for the added functions. Andrei
Re: buffered input
2011/2/7 Andrei Alexandrescu seewebsiteforem...@erdani.org: On 2/6/11 9:47 PM, Torarin wrote: 2011/2/7 Robert Jacquessandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Maybe you could use appendToFront if you knew that you needed to expand the stream's buffer, but that doesn't sound like a common case. Torarin Exactly. Consider line continuations. Most of the time you read one line at a time and everything goes swell. On occasion you'll have a line continuation convention (line ends with a backslash, unmatched quote etc.) and you need to expand the current buffer to gobble the next line. Andrei Yes, it's really convenient. You don't have to start messing with your own buffer, and you avoid the temptation of doing one-element reads. What seems more unlikely is using it on an empty front, though. Torarin
Re: buffered input
On Sun, 06 Feb 2011 21:53:01 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 8:57 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Both byLine and byChunk are buffered inputs, are often used, and will have seldom use for the added functions. Andrei Yes, but byLine is a higher order meta-range designed primarily for end users. I believe that internally, byLine would use shiftFront/appendToFront, as would virtually all library code. I do apologize for thinking only as a library writer, but you need to remember those of us writing the JSON, XML, separated value, etc parsers too, when designing the api.
Re: buffered input
On 2/6/11 10:37 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 21:53:01 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 8:57 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Both byLine and byChunk are buffered inputs, are often used, and will have seldom use for the added functions. Andrei Yes, but byLine is a higher order meta-range designed primarily for end users. I believe that internally, byLine would use shiftFront/appendToFront, as would virtually all library code. I do apologize for thinking only as a library writer, but you need to remember those of us writing the JSON, XML, separated value, etc parsers too, when designing the api. I understand. Ultimately such widely used parsers would be themselves encapsulated in library (Phobos?) modules. The general idea is that straight reading tasks only need very simple code, and more sophisticated reading tasks would need to use a couple more primitives. That's not that bad, is it? Andrei
Re: buffered input
On Sun, 06 Feb 2011 21:47:48 -0500, Torarin torar...@gmail.com wrote: 2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Because even reading UTF-8 requires more than 1-byte of information. Basically, any routine that processes a raw stream is going to have to handle the case where what they're processing straddles the buffer boundary. Now, if the routine wraps the raw stream in some higher-order range, such as byLine, which guarantees them that none of their inputs straddle, great. But it would be negligent to neglect those coders writing the higher-level ranges.
Re: buffered input
On 2/6/11 10:59 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 21:47:48 -0500, Torarin torar...@gmail.com wrote: 2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Because even reading UTF-8 requires more than 1-byte of information. Basically, any routine that processes a raw stream is going to have to handle the case where what they're processing straddles the buffer boundary. Now, if the routine wraps the raw stream in some higher-order range, such as byLine, which guarantees them that none of their inputs straddle, great. But it would be negligent to neglect those coders writing the higher-level ranges. They aren't being neglected. If you need to get more stuff in the current unit of work, use appendToFront(). Let's restart this. What do you want to do that you can't do with the proposed API? Andrei
Re: buffered input
2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 21:47:48 -0500, Torarin torar...@gmail.com wrote: 2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Because even reading UTF-8 requires more than 1-byte of information. Basically, any routine that processes a raw stream is going to have to handle the case where what they're processing straddles the buffer boundary. Now, if the routine wraps the raw stream in some higher-order range, such as byLine, which guarantees them that none of their inputs straddle, great. But it would be negligent to neglect those coders writing the higher-level ranges. But popFront also reads more than 1 byte. Unless you call appendToFront with a larger value than the current buffer size, unless I have misunderstood, they should end up doing the same thing. Torarin
Re: buffered input
On Sun, 06 Feb 2011 23:01:12 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 10:59 PM, Robert Jacques wrote: On Sun, 06 Feb 2011 21:47:48 -0500, Torarin torar...@gmail.com wrote: 2011/2/7 Robert Jacques sandf...@jhu.edu: On Sun, 06 Feb 2011 10:43:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org wrote: On 2/6/11 6:01 EST, Tomek Sowiński wrote: Nick Sabalausky napisał: discard and fetch? I like that. What's missing is the part that they refer to front. Maybe discardFromFront() and fetchToFront()? But then I like discardFromFront() and appendToFront() better - the latter is about as long and more informative. Don't forget that these are relatively rarely used. Andrei Actually, I don't think these functions would be relatively rarely used. I don't see that many people using a buffered input's popFront. Instead I see shiftFront in its place and an appendToFront call has to be made whenever buffer.front.empty. Why not popFront if empty? Because even reading UTF-8 requires more than 1-byte of information. Basically, any routine that processes a raw stream is going to have to handle the case where what they're processing straddles the buffer boundary. Now, if the routine wraps the raw stream in some higher-order range, such as byLine, which guarantees them that none of their inputs straddle, great. But it would be negligent to neglect those coders writing the higher-level ranges. They aren't being neglected. If you need to get more stuff in the current unit of work, use appendToFront(). Let's restart this. What do you want to do that you can't do with the proposed API? Andrei Nothing. I like the API and it makes things like byChunk actually usable in my mind. My objection was to your 'relatively rarely used' comment and the implications of that with regard to API design. I have also just realized that made a typo when talking about appendToFront; I meant a call to appendToFront would be common when your working slice of .front has run out, but you're not ready to call shiftFront (because you're in the middle of parsing something, etc). The way I stated it, a call to popFront and a call to appendToFront would be (mostly) equivalent. Which may has confused people. Sorry.
Re: buffered input
On Friday 04 February 2011 21:46:40 Andrei Alexandrescu wrote: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. First, let's start with the humblest abstraction of all - an input range, which only defines the troika empty/front/popFront with the known semantics. An input range consumes input destructively and has a one-element horizon. It may as well considered a buffered stream with the buffer length exactly one. Going from there, we may say that certain streaming can be done by using an input range of ubyte (or dchar for text). That would be the UTFpowered equivalent of getchar(). The readf function operates that way - it only needs to look one character ahead. Incidentally, the CSV format also requires lookahead of 1, so it also can operate on a range of dchar. At this point we need to ask ourselves an essential question. Since we have this input range abstraction for a 1-element buffer, what would its n-elements buffer representation look like? How do we go from input range of T (which really is unbuffered input range of T to buffered input range of T? Honestly, the answer was extremely unclear to me for the longest time. I thought that such a range would be an extension of the unbuffered one, e.g. a range that still offers T from front() but also offers some additional functions - e.g. a lookahead in the form of a random-access operator. I still think something can be defined along those lines, but today I came together with a design that is considerably simpler both for the client and the designer of the range. I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Finally, this design is very easy to experiment with and causes no disruption to ranges. I can readily add the primitives to byLine and byChunk so we can see what streaming we can do that way. What do you all think? Hmm. I think that I'd have to have an actual implementation to mess around with to say much. My general take on buffered input is that I don't want to worry about it. I want it to be buffered so that it's more efficient, but I don't want to have to care about it in how I use it. I would have expected a buffered input range to be exactly the same as an input range except that it doesn't really just pull in one character behind the scenes. It pulls in 1024 or whatever when popFront() would result in the end of the buffer being reached, and you just get the first one with front. The API doesn't reflect the fact that it's buffered at all except perhaps in how you initialize it (by telling how big the buffer is, though generally I don't want to have to care about that either). Now, there may be some sort of use case where you actually need to care about the buffering, so using buffered
Re: buffered input
On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On 02/05/2011 08:22 AM, Ellery Newcomer wrote: 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. Does shiftFront literally move element n to index 0 and so on? It seems to me that if you do, its going to have horrid performance, and if you don't, then you will eventually run into situations where appendToFront will require a wrap around, which loses you your contiguity, or a reallocation of the buffer. Is this really what it means? I naively understood discards as meaning buf = buf[n..$]; or similar. Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
Andrei: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. This is an important part of the range design. This range is useful for other things too, like: - increasing efficiency of some lazy operations, as already done in Clojure. A buffer is meant to be CPU cache friendly, increasing performance of numeric code too. - Buffered I/O - The chunked lazy parallel map dsimcha is working on - Creating a chunked interface in Phobos for DBMSs See some of my posts about it: http://www.digitalmars.com/d/archives/digitalmars/D/Vectorized_Laziness_100525.html http://www.digitalmars.com/d/archives/digitalmars/D/Re_Vectorized_Laziness_100676.html http://www.digitalmars.com/pnews/read.php?server=news.digitalmars.comgroup=digitalmars.Dartnum=103882 http://www.digitalmars.com/webnews/newsgroups.php?art_group=digitalmars.Darticle_id=125876 Bye, bearophile
Re: buffered input
On 02/05/2011 11:09 AM, Jonathan M Davis wrote: Hmm. I think that I'd have to have an actual implementation to mess around with to say much. My general take on buffered input is that I don't want to worry about it. I want it to be buffered so that it's more efficient, but I don't want to have to care about it in how I use it. I would have expected a buffered input range to be exactly the same as an input range except that it doesn't really just pull in one character behind the scenes. It pulls in 1024 or whatever when popFront() would result in the end of the buffer being reached, and you just get the first one with front. The API doesn't reflect the fact that it's buffered at all except perhaps in how you initialize it (by telling how big the buffer is, though generally I don't want to have to care about that either). [...] Regardless, a more normal range could be built on top of what you're suggesting, and it could do essentially what I was thinking buffered ranges would do. So, perhaps doing what you're suggesting and building what I was thinking of on top of that would be the way to go. That way, if you actually care about messing with the buffer, you can, but if not, you just use it normally and the buffering is dealt with underneath. Exactly. I would love something like: auto bufInputRange (R) (R inputRange, size_t capacity=0) if (...) Meaning one can specify (max) buffering capacity; else there is a standard (re)sizing scheme. Just like dyn array (re)sizing. Side-question to specialists: What should actual buf capacity depend on? Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On 02/05/2011 08:45 AM, Michel Fortin wrote: One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Does this also makes sense when one needs to iterate over a whole set of source data via buferred input rangeS? I mean the same buffer can be reused, avoiding repeted allocation (or is this wrong or irrelevant?). Deins -- _ vita es estrany spir.wikidot.com
Re: buffered input
Does shiftFront literally move element n to index 0 and so on? It seems to me that if you do, its going to have horrid performance, and if you don't, then you will eventually run into situations where appendToFront will require a wrap around, which loses you your contiguity, or a reallocation of the buffer. I think it is basically popFrontN(), and appendToFront() a just append.
Re: buffered input
Andrei Alexandrescu napisał: I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. I don't see a clear need for the two to be separate. Could they fold into popFront(n, m) meaning shiftFront(n); appendToFront(m) ? Nullary popFront() discards all and loads any number it pleases. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. Some users would benefit if they could just pass in a buffer and say fill'er up. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. -- Tomek
Re: buffered input
Tomek Sowiński napisał: Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. I meant: when n + front.length buf.length. -- Tomek
Re: buffered input
Tomek Sowiñski Wrote: Andrei Alexandrescu napisa³: I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. I don't see a clear need for the two to be separate. Could they fold into popFront(n, m) meaning shiftFront(n); appendToFront(m) ? Nullary popFront() discards all and loads any number it pleases. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. Some users would benefit if they could just pass in a buffer and say fill'er up. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. I find this discussion interesting. There's one idea for an application I'd like to try at some point. Basically a facebook chat thingie, but with richer gaming features. The expected audience will be 10 - 100K simultaneous clients connecting to a single server. Not sure if DOM or SAX will be better. After seeing the Tango's XML benchmarks I was convinced that the implementation platform will be D1/Tango, but now it looks like Phobos is also getting there, propably even outperforming Tango by a clear margin. Since even looking at Tango's documentation has intellectual property problems and likely causes taint, I could make an independent benchmark comparing the two and their interfaces later. But I propaply need to avoid going into too much details, otherwise the Phobos developers wouldn't be able to read it without changing their license. From what I've read so far, the proposed design looks very much like what Tango has now in their I/O framework. But probably Phobos's TLS default and immutable strings improve multithreaded performance even more.
Re: buffered input
On 2011-02-05 07:01:24 -0500, spir denis.s...@gmail.com said: On 02/05/2011 08:45 AM, Michel Fortin wrote: One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Does this also makes sense when one needs to iterate over a whole set of source data via buferred input rangeS? I mean the same buffer can be reused, avoiding repeted allocation (or is this wrong or irrelevant?). As I said in my post, whether a temporary buffer or a user-supplied buffer is better depends on whether you plan to store the data beyond the temporary buffer's lifetime or not. If you just iterate to calculate the SHA1 hash, the temporary buffer is fine (and possibly better depending on the range's implementation). If you iterate to calculate the SHA1 hash *and* also want to store the file in memory, then it's better if you can provide your own buffer which can point directly to the permanent storage location and bypass copying to the temporary buffer (if the range's implementation allows it). -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: buffered input
On 2/5/11 2:22 AM, Ellery Newcomer wrote: Does shiftFront literally move element n to index 0 and so on? It seems to me that if you do, its going to have horrid performance, and if you don't, then you will eventually run into situations where appendToFront will require a wrap around, which loses you your contiguity, or a reallocation of the buffer. No, it's a mere internal operation bufpos += n or so. Andrei
Re: buffered input
On 2/5/11 2:45 AM, Michel Fortin wrote: One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Generally when one says I want the stream to copy data straight into my buffers this is the same as I want the stream to be unbuffered. So if you want to provide your own buffers to be filled, we need to discuss refining the design of unbuffered input - for example by adding an optional routine for bulk transfer to input ranges. Andrei
Re: buffered input
On 2011-02-05 10:02:47 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: Generally when one says I want the stream to copy data straight into my buffers this is the same as I want the stream to be unbuffered. So if you want to provide your own buffers to be filled, we need to discuss refining the design of unbuffered input - for example by adding an optional routine for bulk transfer to input ranges. You're right, this is a different thing. My major gripe with ranges at this time is that it's almost impossible to design an algorithm that can take slices *or* make copies depending on whether the range supports slicing or not, and whether the slices are stable (not going to be mutated when popping elements from the range). At least not without writing two implementations of it. I reread your initial post to get a clearer idea of what it meant. It seems to me that your buffered range design could be made to fix that hole. If the data you want to parse is all in memory, the buffered range could simply use the original array as its buffer; shiftFront would simply just the whole array to remove the first n elements while appendToFront would do nothing (as the buffer already contains all of the content). And if the data is immutable, then it's safe to just take a slice of it to preserve it instead of doing a copy. So you can't really be more efficient than that, it's just great. As for getting the data in bulk directly so you can avoid needless copies... I think the same optimization is possible with a buffered range. All you need is a buffered range that doesn't reuse the buffer, presumably one of immutable(T)[]. With it, you can slice at will without fear of the data being overwritten at a later time. So my rereading of your proposal convinced me. Go ahead, I can't wait to use it. :-) -- Michel Fortin michel.for...@michelf.com http://michelf.com/
Re: buffered input
Andrei Alexandrescu napisał: Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon Broken sentence?
Re: buffered input
Andrei Alexandrescu napisał: I don't see a clear need for the two to be separate. Could they fold into popFront(n, m) meaning shiftFront(n); appendToFront(m) ? Nullary popFront() discards all and loads any number it pleases. I think combining the two into one hurts usability as often you want to do one without the other. OK, but if you go this way, what would popFront() do? Some users would benefit if they could just pass in a buffer and say fill'er up. Correct. That observation applies to unbuffered input as well. Right. Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. I think circularity is an implementation detail that is poor as a client-side abstraction. I fear efficiency will get abstracted out. Say this is my internal buffer (pipes indicate front() slice): [ooo|oo|oo] Now I do appendToFront(3) -- how do you expose the expected front() without moving data? -- Tomek
Re: buffered input
Jean Crystof napisał: I find this discussion interesting. There's one idea for an application I'd like to try at some point. Basically a facebook chat thingie, but with richer gaming features. The expected audience will be 10 - 100K simultaneous clients connecting to a single server. Not sure if DOM or SAX will be better. After seeing the Tango's XML benchmarks I was convinced that the implementation platform will be D1/Tango, but now it looks like Phobos is also getting there, propably even outperforming Tango by a clear margin. Thanks for having faith ;-) Since even looking at Tango's documentation has intellectual property problems and likely causes taint, I could make an independent benchmark comparing the two and their interfaces later. But I propaply need to avoid going into too much details, otherwise the Phobos developers wouldn't be able to read it without changing their license. That would be helpful. From what I've read so far, the proposed design looks very much like what Tango has now in their I/O framework. But probably Phobos's TLS default and immutable strings improve multithreaded performance even more. Well, immutability doesn't help much because a buffer must be written to. Speaking of multithreading, I was thinking of an implementation where an internal thread is doing I/O. It loads data in front of the current front() slice, as much as the internal buffer can hold. The motivation is to overlap content processing and I/O operations so that less time is spent in total. Although there is some interaction overhead: locking, syncing caches so that cores see the same buffer. -- Tomek
Re: buffered input
Dang, you beat me to my post on what I have run into trying to provide a slice-able, assignable, buffered Forward Range. I was doing some work on a CSV parser. It is rather simple to build a proper parser from an input range. But providing the ability to use custom separators which could be of any length did not work well with a forward range. It was no longer a look-ahead of one. So I started examining how Splitter[1] work with slice-able ranges. Ok enough of the background. So basically I tried to make a range that would provide everything I needed for the new CSV design[2] and the result[3] didn't work. It actually works better with my CSV parser than it does splitter. The main issue I was having is, if you save the range and move forward, how do you keep the buffer of all instances in sync. Can we turn an input range into a forward range? If not, how would you get splitter working on an input range? (I probably need to file a bug, but my InputFileRange[3, bottom] didn't work with splitter either) The next issue is with slicing. If we can't get an input range to become a forward range then we can't have slicing either. A slice of [0..$] should give me a copy of the range. But even if you could do this, how would you know that the slice should be made of the entire range, or of just what is available in the buffer? So I guess the question is, with the proposal. Can a hasSlicing!R be created from an InputRange!R such that auto range = Hello, World; auto len = countUntil(range, ,); assert(range[0..len] == Hello); where range is replaced by a buffered Input Range. And as an added bonus: range = range[len..$]; assert(range == ,World); You can of course use the Range for equality, instead of strings like Hello. 1. https://github.com/D-Programming-Language/phobos/blob/master/std/algorithm.d#L1317 2. https://github.com/he-the-great/JPDLibs/blob/csvoptimize/csv/csv.d 3. https://gist.github.com/812681
Re: buffered input
On Saturday 05 February 2011 07:16:45 Andrei Alexandrescu wrote: On 2/5/11 5:09 AM, Jonathan M Davis wrote: Hmm. I think that I'd have to have an actual implementation to mess around with to say much. My general take on buffered input is that I don't want to worry about it. I want it to be buffered so that it's more efficient, but I don't want to have to care about it in how I use it. I would have expected a buffered input range to be exactly the same as an input range except that it doesn't really just pull in one character behind the scenes. It pulls in 1024 or whatever when popFront() would result in the end of the buffer being reached, and you just get the first one with front. The API doesn't reflect the fact that it's buffered at all except perhaps in how you initialize it (by telling how big the buffer is, though generally I don't want to have to care about that either). Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon The thing is though that if I want to be iterating over a string which is buffered (from a file or stream or whatever), I want front to be immutable(char) or char, not immutable(char)[] or char[]. I can see how having an interface which allows startsWith to efficiently check whether the buffered string starts with a particular string makes good sense, but generally, as far as I'm concerned, that's startsWith's problem. How would I even begin to use a buffered range of string[] as a string? Normally, when I've used buffered anything, it's been purely for efficiency reasons. All I've cared about is having a stream or file or whatever. The fact that reading it from the file (or wherever it came from) in a buffered manner is more efficient means that I want it buffered, but that hasn't had any effect on how I've used it. If I want x characters from the file, I ask for x characters. It's the buffered object's problem how many reads that does or doesn't do. You must be thinking of a use case which I don't normal think of or am not aware of. In my experience, buffering has always been an implementation detail that you use because it's more efficient, but you don't worry about it beyond creating a buffered stream rather than an unbuffered one. - Jonathan M Davis
Re: buffered input
Nice! And evenin'! Layman's view: - - - - - - - - - - - (I'm serious, please don't take my post too seriously. I'm not a heavy user of D and I don't want to pollute. I know in NGs exposure means influence and I babble a lot. Still, my thoughts, or rather reactions, could be of interest, I assume, or I wouldn't be writing this : ) I'm not sure how these buffered input ranges are supposed to be used (some mockup sample code would be very cool!), but it seems to me, and please correct me if I'm wrong, that it's very desirable for these ranges to be interchangeable? As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? Reusability, generics, bend the spoon neo and all that? If not, then ok. But if yes, then I think these buffered ranges look very troublesome! Naughty even! * * * Then there's the sneaky break of consistency of the D semantics. Even if these ranges are not intended to be interchangeable, still, changing the (human langauge) semantics that the input ranges already define is not good! This makes D a difficult language to get an intuitive feel for, I think. By the definition of input ranges, the word front symbolizes the first _element_ in a forwards facing queue of elements. | 1:st |-- front() | 2:nd | v-- hidden --v | 3:rd | | . | | n:th | -- back ..as front() returns a T. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! So it seems these buffered ranges have redefined the semantics for the word front, as meaning the view window into the front part of the queue. Sneaky! I mean, imagine being new with D and skimming through the API docs for ranges, and picking up these function names at a glance. You'd be setting yourself up for one of those aaahaa-now-I-get-why-I-didn't-get-it-moments for sure. Hmmm. Still, front() could very well refer to the front part—to a list of elements (or the view window), and first() could refer to the first element. Actually, that would make the most sense! Then an input range would be first()/popFirst()/empty(), and a buffered one would have all those, but also amend something like front(n)/widenFront(n)/popFront(n), but yeah, erhm. I call for stricter and more consistent semantics! Decide what front means when talking about ranges, and stick to it! (And I'm talking about human language semantics, not what a function (or primitive?) does.) Erh, I tried to sound resolute there. Not my thing really. * * * Besides that, shiftFront got me thinking about sliding windows, and that would actually be cool! As in | 1st | '\ -- first() | 2nd | |-- front() // view window | 3rd | ./ | 4th | v-- hidden --v | 5th | | .. | | n:th | and then calling shiftFront(2) would shift the view window 2 elements forward (thus fetching 2 and discarding 2). Seems like a useful feature when parsing some encoding with variable point width and known distance to the event horizon, no? As in code.viewDistance = 8; do{ auto p = code.front() if(isLongPoint(p)){ processLong(p) code.shiftFront(8); }else if(isPoint(p)){ process(p) code.shiftFront(4); }else break; }while(p); or something like that. But the semantic that shiftFront would mean the same as popFront(), but on a list of elements? Confusing! Surely, at least popFront(n)... Hm, yeah Ok I'm all out of coffee!!! Thanks for your time! BR /HF
Re: buffered input
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote in message news:iijq99$1a5o$1...@digitalmars.com... On 2/5/11 6:46 AM, spir wrote: On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis The buffered range interface as I defined it supports infinite lookahead. The interface mentioned by Nick has lookahead between 1 and 2048. So I don't think my interface is appropriate for that. Infinite lookahead is a wonderful thing. Consider reading lines from a file. Essentially what you need to do is to keep on reading blocks of data until you see \n (possibly followed by some more stuff). Then you offer the client the line up to the \n. When the client wants a new line, you combine the leftover data you already have with new stuff you read. On occasion you need to move over leftovers, but if your internal buffers are large enough that is negligible (I actually tested this recently). Another example: consider dealing with line continuations in reading CSV files. Under certain conditions, you need to read one more line and stitch it with the existing one. This is easy with infinite lookahead, but quite elaborate with lookahead 1. I think I can see how it might be worthwhile to discourage the traditional buffer interface I described in favor of the above. It wouldn't be as trivial to use as what people are used to, but I can see that it could avoid a lot of unnessisary copying, especially with other people's suggestion of allowing the user to provide their own buffer to be filled (and it seems easy enough to learn). But what about when you want a circular buffer? Ie, When you know a certain maximum lookahead is fine and you want to minimize memory usage and buffer-appends. Circular buffers don't do infinite lookahead so the interface maybe doesn't work as well. Plus you probably wouldn't want to provide an interface for slicing into the buffer, since the slice could straddle the wrap-around point which would require a new allocation (ie return buffer[indexOfFront+sliceStart..$] ~ buffer[0..sliceLength-($-(frontIndex+sliceStart))]). I guess maybe that would just call for another type of range.
Re: buffered input
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote in message news:iijpp7$197f$1...@digitalmars.com... On 2/5/11 5:09 AM, Jonathan M Davis wrote: Hmm. I think that I'd have to have an actual implementation to mess around with to say much. My general take on buffered input is that I don't want to worry about it. I want it to be buffered so that it's more efficient, but I don't want to have to care about it in how I use it. I would have expected a buffered input range to be exactly the same as an input range except that it doesn't really just pull in one character behind the scenes. It pulls in 1024 or whatever when popFront() would result in the end of the buffer being reached, and you just get the first one with front. The API doesn't reflect the fact that it's buffered at all except perhaps in how you initialize it (by telling how big the buffer is, though generally I don't want to have to care about that either). Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon That shouldn't be a problem for the cases where a lookahead of 1 is all that's needed. So both types can exist (with the traditional/automatic type most likely built on top of Andrei's type). Thus, I think the only question is Are the appropriate use-cases for the traditional/automatic type minor enough and infrequent enough to actively discourage it by not providing it? That I can't answer.
Re: buffered input
Jean Crystof a@a.a wrote in message news:iijl2t$10np$1...@digitalmars.com... Tomek Sowiñski Wrote: Andrei Alexandrescu napisa³: I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. I don't see a clear need for the two to be separate. Could they fold into popFront(n, m) meaning shiftFront(n); appendToFront(m) ? Nullary popFront() discards all and loads any number it pleases. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. Some users would benefit if they could just pass in a buffer and say fill'er up. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. I find this discussion interesting. There's one idea for an application I'd like to try at some point. Basically a facebook chat thingie, but with richer gaming features. The expected audience will be 10 - 100K simultaneous clients connecting to a single server. Not sure if DOM or SAX will be better. After seeing the Tango's XML benchmarks I was convinced that the implementation platform will be D1/Tango, but now it looks like Phobos is also getting there, propably even outperforming Tango by a clear margin. I don'r mean to derail the topic, but if I were aiming for that many simultaneous users I wouldn't even consider using XML at all. Despite MS's, Java's and AJAX's infatuation with it, XML is really only appropriate in two situations: 1. When memory/bandwidth/speed/etc don't matter and 2. When you don't have a choice in the matter.
Re: buffered input
Heywood Floyd soul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all.
Re: buffered input
2011/2/5 Andrei Alexandrescu seewebsiteforem...@erdani.org: I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. This is really cool. I realise now that appendToFront fills the gap in the design providing only shiftFront/advance. I also thought their names were well-chosen. Torarin
Re: buffered input
On 02/05/2011 10:44 PM, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iijq99$1a5o$1...@digitalmars.com... On 2/5/11 6:46 AM, spir wrote: On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis The buffered range interface as I defined it supports infinite lookahead. The interface mentioned by Nick has lookahead between 1 and 2048. So I don't think my interface is appropriate for that. Infinite lookahead is a wonderful thing. Consider reading lines from a file. Essentially what you need to do is to keep on reading blocks of data until you see \n (possibly followed by some more stuff). Then you offer the client the line up to the \n. When the client wants a new line, you combine the leftover data you already have with new stuff you read. On occasion you need to move over leftovers, but if your internal buffers are large enough that is negligible (I actually tested this recently). Another example: consider dealing with line continuations in reading CSV files. Under certain conditions, you need to read one more line and stitch it with the existing one. This is easy with infinite lookahead, but quite elaborate with lookahead 1. I think I can see how it might be worthwhile to discourage the traditional buffer interface I described in favor of the above. It wouldn't be as trivial to use as what people are used to, but I can see that it could avoid a lot of unnessisary copying, especially with other people's suggestion of allowing the user to provide their own buffer to be filled (and it seems easy enough to learn). But what about when you want a circular buffer? Ie, When you know a certain maximum lookahead is fine and you want to minimize memory usage and buffer-appends. Circular buffers don't do infinite lookahead so the interface maybe doesn't work as well. Plus you probably wouldn't want to provide an interface for slicing into the buffer, since the slice could straddle the wrap-around point which would require a new allocation (ie return buffer[indexOfFront+sliceStart..$] ~ buffer[0..sliceLength-($-(frontIndex+sliceStart))]). I guess maybe that would just call for another type of range. Becomes too complicated, doesnt it? Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On 02/05/2011 11:00 PM, Nick Sabalausky wrote: Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon That shouldn't be a problem for the cases where a lookahead of 1 is all that's needed. So both types can exist (with the traditional/automatic type most likely built on top of Andrei's type). Thus, I think the only question is Are the appropriate use-cases for the traditional/automatic type minor enough and infrequent enough to actively discourage it by not providing it? That I can't answer. And what about backtracking (eg for parsing the source)? Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On 2/5/11 12:59 PM, Tomek Sowiński wrote: Andrei Alexandrescu napisał: Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon Broken sentence? Sorry. Well it was nothing interesting anyway. Andrei
Re: buffered input
spir wrote: On 02/05/2011 10:44 PM, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iijq99$1a5o$1...@digitalmars.com... On 2/5/11 6:46 AM, spir wrote: On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis The buffered range interface as I defined it supports infinite lookahead. The interface mentioned by Nick has lookahead between 1 and 2048. So I don't think my interface is appropriate for that. Infinite lookahead is a wonderful thing. Consider reading lines from a file. Essentially what you need to do is to keep on reading blocks of data until you see \n (possibly followed by some more stuff). Then you offer the client the line up to the \n. When the client wants a new line, you combine the leftover data you already have with new stuff you read. On occasion you need to move over leftovers, but if your internal buffers are large enough that is negligible (I actually tested this recently). Another example: consider dealing with line continuations in reading CSV files. Under certain conditions, you need to read one more line and stitch it with the existing one. This is easy with infinite lookahead, but quite elaborate with lookahead 1. I think I can see how it might be worthwhile to discourage the traditional buffer interface I described in favor of the above. It wouldn't be as trivial to use as what people are used to, but I can see that it could avoid a lot of unnessisary copying, especially with other people's suggestion of allowing the user to provide their own buffer to be filled (and it seems easy enough to learn). But what about when you want a circular buffer? Ie, When you know a certain maximum lookahead is fine and you want to minimize memory usage and buffer-appends. Circular buffers don't do infinite lookahead so the interface maybe doesn't work as well. Plus you probably wouldn't want to provide an interface for slicing into the buffer, since the slice could straddle the wrap-around point which would require a new allocation (ie return buffer[indexOfFront+sliceStart..$] ~ buffer[0..sliceLength-($-(frontIndex+sliceStart))]). I guess maybe that would just call for another type of range. Becomes too complicated, doesnt it? Denis Circular buffers don't seem like an 'optional' use case to me. Most real I/O works that way. I think if the abstraction can't handle it, the abstraction is a failure.
Re: buffered input
On 2/5/11 1:18 PM, Tomek Sowiński wrote: Andrei Alexandrescu napisał: I don't see a clear need for the two to be separate. Could they fold into popFront(n, m) meaning shiftFront(n); appendToFront(m) ? Nullary popFront() discards all and loads any number it pleases. I think combining the two into one hurts usability as often you want to do one without the other. OK, but if you go this way, what would popFront() do? Discard everything in the current buffer and fill a new buffer. The new size depends on the stream; for byLine a new line would be read, for byChunk(4096) 4096 more bytes would be read. Some users would benefit if they could just pass in a buffer and say fill'er up. Correct. That observation applies to unbuffered input as well. Right. Contiguous, yes. But I'd rather see front() exposing, say, a circular buffer so that appendToFront(n) reallocates only when n buf.length. I think circularity is an implementation detail that is poor as a client-side abstraction. I fear efficiency will get abstracted out. Say this is my internal buffer (pipes indicate front() slice): [ooo|oo|oo] Now I do appendToFront(3) -- how do you expose the expected front() without moving data? You do end up moving data, but proportionally little if the buffer is large enough. Andrei
Re: buffered input
On 02/05/2011 11:22 PM, Nick Sabalausky wrote: Heywood Floydsoul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all. Same here; thought: maybe he meant shiftBuf() appendToBuf(), or such?. (Then, as nobody reacted about that point, thought: You're the stupid one; shut your mouth!) I also agree with Heywood about first() / popFirst(). Then, shiftFront() / appendToFront() would be less confusing --but still hard to guess (for me). I wonder if his view window is the whole or part of the buffer. Well... (Else, I actually share most of Heywood's views, I guess, at least at first read.) Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
On Sat, 05 Feb 2011 17:22:01 -0500, Nick Sabalausky a@a.a wrote: Heywood Floyd soul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all. See point of Andrei's post: 1. R is an input range of T[] Which means that front returns an array, not a single element. So they sound like they operate on the first element, because that's exactly what they do. Conceptually, you need to think of buffered inputs as range of ranges, not a range of elements.
Re: buffered input
On 02/05/2011 06:42 PM, Heywood Floyd wrote: Nice! And evenin'! Layman's view: - - - - - - - - - - - (I'm serious, please don't take my post too seriously. I'm not a heavy user of D and I don't want to pollute. I know in NGs exposure means influence and I babble a lot. Still, my thoughts, or rather reactions, could be of interest, I assume, or I wouldn't be writing this : ) I'm not sure how these buffered input ranges are supposed to be used (some mockup sample code would be very cool!), but it seems to me, and please correct me if I'm wrong, that it's very desirable for these ranges to be interchangeable? As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? Reusability, generics, bend the spoon neo and all that? If not, then ok. But if yes, then I think these buffered ranges look very troublesome! Naughty even! * * * Then there's the sneaky break of consistency of the D semantics. Even if these ranges are not intended to be interchangeable, still, changing the (human langauge) semantics that the input ranges already define is not good! This makes D a difficult language to get an intuitive feel for, I think. By the definition of input ranges, the word front symbolizes the first _element_ in a forwards facing queue of elements. | 1:st |-- front() | 2:nd | v-- hidden --v | 3:rd | | . | | n:th |-- back ..as front() returns a T. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! So it seems these buffered ranges have redefined the semantics for the word front, as meaning the view window into the front part of the queue. Sneaky! I mean, imagine being new with D and skimming through the API docs for ranges, and picking up these function names at a glance. You'd be setting yourself up for one of those aaahaa-now-I-get-why-I-didn't-get-it-moments for sure. Hmmm. Still, front() could very well refer to the front part—to a list of elements (or the view window), and first() could refer to the first element. Actually, that would make the most sense! Then an input range would be first()/popFirst()/empty(), and a buffered one would have all those, but also amend something like front(n)/widenFront(n)/popFront(n), but yeah, erhm. +++ (everything above) I call for stricter and more consistent semantics! Decide what front means when talking about ranges, and stick to it! (And I'm talking about human language semantics, not what a function (or primitive?) does.) Erh, I tried to sound resolute there. Not my thing really. pleased to see there is at least one other programmer still considering that semantic applies to human thoughts, rather than machine process... * * * Besides that, shiftFront got me thinking about sliding windows, and that would actually be cool! As in | 1st | '\-- first() | 2nd | |-- front() // view window | 3rd | ./ | 4th | v-- hidden --v | 5th | | .. | | n:th | There is an off-by-one error between 1st first, I guess ;-) What's your view window? Is it buffer, or the needed amount of lookahead, or what else? How would you draw the buffer, on the first picture or the one above? Sliding window is, for me, the mental picture my brain intuitively forms when thinking at buffered input. But the sliding move may not be smooth (element per element), instead could happen as is most practical or efficient; as long as it remains a point not exposed on the interface (or only on request by client code). Meaning there would be an independant index pointing to current/first/front element, in the buffer or the window, automagically maintained when sliding happens (index -= offset). and then calling shiftFront(2) would shift the view window 2 elements forward (thus fetching 2 and discarding 2). Seems like a useful feature when parsing some encoding with variable point width and known distance to the event horizon, no? As in code.viewDistance = 8; do{ auto p = code.front() if(isLongPoint(p)){ processLong(p) code.shiftFront(8); }else if(isPoint(p)){ process(p) code.shiftFront(4); }else break; }while(p); or something like that. But the semantic that shiftFront would mean the same as popFront(), but on a list of elements? Confusing! Surely, at least popFront(n)... Hm, yeah Ok I'm all out of coffee!!! Thanks for your time! BR /HF Veeery interesting message, thank you. I share your care for correct naming. And the rest, actually. Wish you would post regularly. Denis -- _ vita es estrany
Re: buffered input
On 02/06/2011 01:28 AM, Don wrote: spir wrote: On 02/05/2011 10:44 PM, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iijq99$1a5o$1...@digitalmars.com... On 2/5/11 6:46 AM, spir wrote: On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis The buffered range interface as I defined it supports infinite lookahead. The interface mentioned by Nick has lookahead between 1 and 2048. So I don't think my interface is appropriate for that. Infinite lookahead is a wonderful thing. Consider reading lines from a file. Essentially what you need to do is to keep on reading blocks of data until you see \n (possibly followed by some more stuff). Then you offer the client the line up to the \n. When the client wants a new line, you combine the leftover data you already have with new stuff you read. On occasion you need to move over leftovers, but if your internal buffers are large enough that is negligible (I actually tested this recently). Another example: consider dealing with line continuations in reading CSV files. Under certain conditions, you need to read one more line and stitch it with the existing one. This is easy with infinite lookahead, but quite elaborate with lookahead 1. I think I can see how it might be worthwhile to discourage the traditional buffer interface I described in favor of the above. It wouldn't be as trivial to use as what people are used to, but I can see that it could avoid a lot of unnessisary copying, especially with other people's suggestion of allowing the user to provide their own buffer to be filled (and it seems easy enough to learn). But what about when you want a circular buffer? Ie, When you know a certain maximum lookahead is fine and you want to minimize memory usage and buffer-appends. Circular buffers don't do infinite lookahead so the interface maybe doesn't work as well. Plus you probably wouldn't want to provide an interface for slicing into the buffer, since the slice could straddle the wrap-around point which would require a new allocation (ie return buffer[indexOfFront+sliceStart..$] ~ buffer[0..sliceLength-($-(frontIndex+sliceStart))]). I guess maybe that would just call for another type of range. Becomes too complicated, doesnt it? Denis Circular buffers don't seem like an 'optional' use case to me. Most real I/O works that way. I think if the abstraction can't handle it, the abstraction is a failure. Sorry, I meant the way we start to draw the picture; not circular buffers, can see the point about them. Think Heywood's view window is a helpful image and a good modelling starting point. (maybe it's only me) Denis -- _ vita es estrany spir.wikidot.com
Re: buffered input
Andrei Alexandrescu napisał: I fear efficiency will get abstracted out. Say this is my internal buffer (pipes indicate front() slice): [ooo|oo|oo] Now I do appendToFront(3) -- how do you expose the expected front() without moving data? You do end up moving data, but proportionally little if the buffer is large enough. It still matters for frequent big munches. I'd like a minimum memory option if that's neccessary. -- Tomek
Re: buffered input
spir denis.s...@gmail.com wrote in message news:mailman.1321.1296950957.4748.digitalmar...@puremagic.com... On 02/05/2011 11:00 PM, Nick Sabalausky wrote: Transparent buffering sounds sensible but in fact it robs you of important capabilities. It essentially forces you to use grammars with lookahead 1 for all input operations. Being able to peek forward into the stream without committing to read from it allows you to e.g. do operations like does this stream start with a specific word etc. As soon That shouldn't be a problem for the cases where a lookahead of 1 is all that's needed. So both types can exist (with the traditional/automatic type most likely built on top of Andrei's type). Thus, I think the only question is Are the appropriate use-cases for the traditional/automatic type minor enough and infrequent enough to actively discourage it by not providing it? That I can't answer. And what about backtracking (eg for parsing the source)? Like I said, there are certainly cases where a lookahead of 1 isn't sufficient, and for those, something more like Andrei's proposal can be used. (FWIW, LR doesn't usually need backtracking. That's more typically an LL thing. Not that LL is any less important, though. Of course, if the lexical grammer supports non-consuming lookahead, then you'd still need lookahead 1 no matter what parsing algorithm is used.)
Re: buffered input
On 2/5/11 5:22 PM, Nick Sabalausky wrote: Heywood Floydsoul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all. Better names are always welcome! Andrei
Re: buffered input
This sounds similar to how my network code works. I called the functions fetchMore() to append to the buffer and eat(int n) to advance the front position.
Re: buffered input
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote in message news:iil3lv$1bb1$1...@digitalmars.com... On 2/5/11 5:22 PM, Nick Sabalausky wrote: Heywood Floydsoul...@gmail.com wrote in message news:mailman.1318.1296941395.4748.digitalmar...@puremagic.com... As in, you've built some library that passes around ranges, but then some part of it is slow and needs buffered ones. Isn't the point that you can then swap out your ranges for buffered ones here and there, but continue to have a functioning library? The problem with that is that in many many cases it forces unnessisary copying. We can get much better performance with this slightly more hands-on version. But that said, if the traditional hands-free automatic buffering really is all you need, then such a thing [should] be easily to construct out of the Andrei's style of buffered range. Then follows that popFront() means discard the first _element_, so that the element that was second now becomes first. And if we can agree to that, then shiftFront() is possibly very confusing, and so is appendToFront(). They sound like they operate on the first element! I completely agree. The names of those functions confused the hell out of me until I read Andrei's descriptions of them. Now I understand what they do...but I still don't understand their names at all. Better names are always welcome! Borrowing slightly from Adam: discard and fetch?
Re: buffered input
On 2/5/11 7:28 PM, Don wrote: spir wrote: On 02/05/2011 10:44 PM, Nick Sabalausky wrote: Andrei Alexandrescuseewebsiteforem...@erdani.org wrote in message news:iijq99$1a5o$1...@digitalmars.com... On 2/5/11 6:46 AM, spir wrote: On 02/05/2011 10:36 AM, Nick Sabalausky wrote: On a separate note, I think a good way of testing the design (and end up getting something useful anyway) would be to try to use it to create a range that's automatically-buffered in a more traditional way. Ie, Given any input range 'myRange', buffered(myRange, 2048) (or something like that) would wrap it in a new input range that automatically buffers the next 2048 elements (asynchronously?) whenever its internal buffer is exhausted. Or something like that. It's late and I'm tired and I can't think anymore ;) That's exactly what I'm expecting. Funnily enough, I was about to start a thread on the topic after reading related posts. My point was: I'm not a specialist in efficiency (rather the opposite), I just know there is --theoretically-- relevant performance loss to expect from unbuffered input in various cases. Could we define a generic input-buffering primitive allowing people to benefit from others' competence? Just like Appender. Denis The buffered range interface as I defined it supports infinite lookahead. The interface mentioned by Nick has lookahead between 1 and 2048. So I don't think my interface is appropriate for that. Infinite lookahead is a wonderful thing. Consider reading lines from a file. Essentially what you need to do is to keep on reading blocks of data until you see \n (possibly followed by some more stuff). Then you offer the client the line up to the \n. When the client wants a new line, you combine the leftover data you already have with new stuff you read. On occasion you need to move over leftovers, but if your internal buffers are large enough that is negligible (I actually tested this recently). Another example: consider dealing with line continuations in reading CSV files. Under certain conditions, you need to read one more line and stitch it with the existing one. This is easy with infinite lookahead, but quite elaborate with lookahead 1. I think I can see how it might be worthwhile to discourage the traditional buffer interface I described in favor of the above. It wouldn't be as trivial to use as what people are used to, but I can see that it could avoid a lot of unnessisary copying, especially with other people's suggestion of allowing the user to provide their own buffer to be filled (and it seems easy enough to learn). But what about when you want a circular buffer? Ie, When you know a certain maximum lookahead is fine and you want to minimize memory usage and buffer-appends. Circular buffers don't do infinite lookahead so the interface maybe doesn't work as well. Plus you probably wouldn't want to provide an interface for slicing into the buffer, since the slice could straddle the wrap-around point which would require a new allocation (ie return buffer[indexOfFront+sliceStart..$] ~ buffer[0..sliceLength-($-(frontIndex+sliceStart))]). I guess maybe that would just call for another type of range. Becomes too complicated, doesnt it? Denis Circular buffers don't seem like an 'optional' use case to me. Most real I/O works that way. I think if the abstraction can't handle it, the abstraction is a failure. The abstraction does handle it implicitly, except that it doesn't fix the buffer size. If you ask for appendToFront() with large numbers without calling shiftFront() too, the size of the buffer will ultimately increase to accommodate the entire input. That's the infinite in infinite lookahead. Most uses, however, stick with front/popFront (i.e. let the range choose the buffer size and handle circularity transparently) and on occasion call appendToFront()/shiftFront(). Whenever a part of the buffer has been released by calling shiftFront(), the implementation may use it as a circular buffer. Circularity is all transparent, as I think it should be. But the power is there. Consider FIE* for contrast. The C API provides setbuf and setvbuf, but no way to otherwise take advantage of buffering - at all. This really hurts; you can only call fungetc() once and be guaranteed success - i.e. all FILE* offer L(1) capabilities no matter how big a buffer you set! Andrei
Re: buffered input
Andrei Alexandrescu seewebsiteforem...@erdani.org wrote in message news:iil64l$1f6s$1...@digitalmars.com... On 2/5/11 7:28 PM, Don wrote: Circular buffers don't seem like an 'optional' use case to me. Most real I/O works that way. I think if the abstraction can't handle it, the abstraction is a failure. The abstraction does handle it implicitly, except that it doesn't fix the buffer size. If you ask for appendToFront() with large numbers without calling shiftFront() too, the size of the buffer will ultimately increase to accommodate the entire input. That's the infinite in infinite lookahead. But what about when the window straddles the border? Ex: The circular buffer's internal size is 1000, the current starting point is 900 and the window (ie, front()) is 200. I guess that could work fine if front() is a random-access range, but if it's an array (which I think is what you proposed unless I misunderstood), then front() would have to return a new allocation: buf[900..$]~buf[0..100].
Re: buffered input
Interesting. I was just writing LazyMap and AsyncBuf in std.parallelism, and I ran into these exact issues. (A LazyMap computes a map lazily and in parallel and stores the result in a buffer. An AsyncBuf reads from an unbuffered input range in a background thread and buffers the results for when you need them. I wanted to optimize chaining LazyMaps and AsyncBufs for pipelining parallelism.) I solved them with very ad-hoc way with lots of static if statements and encapsulation violation within the module. One thing that would make a more principled approach in std.parallelism possible is a swapBuffer() primitive. You provide the range with a new buffer to fill, and it gives you complete ownership of the old buffer. This is basically how LazyMap/AsyncBuf pipelining works under the hood, though I never gave any serious consideration to the more general case. On 2/5/2011 12:46 AM, Andrei Alexandrescu wrote: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. First, let's start with the humblest abstraction of all - an input range, which only defines the troika empty/front/popFront with the known semantics. An input range consumes input destructively and has a one-element horizon. It may as well considered a buffered stream with the buffer length exactly one. Going from there, we may say that certain streaming can be done by using an input range of ubyte (or dchar for text). That would be the UTFpowered equivalent of getchar(). The readf function operates that way - it only needs to look one character ahead. Incidentally, the CSV format also requires lookahead of 1, so it also can operate on a range of dchar. At this point we need to ask ourselves an essential question. Since we have this input range abstraction for a 1-element buffer, what would its n-elements buffer representation look like? How do we go from input range of T (which really is unbuffered input range of T to buffered input range of T? Honestly, the answer was extremely unclear to me for the longest time. I thought that such a range would be an extension of the unbuffered one, e.g. a range that still offers T from front() but also offers some additional functions - e.g. a lookahead in the form of a random-access operator. I still think something can be defined along those lines, but today I came together with a design that is considerably simpler both for the client and the designer of the range. I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Finally, this design is very easy to experiment with and causes no disruption to ranges. I can readily add the primitives to byLine and byChunk so we can see what streaming we can do that way. What do you all think? Andrei
Re: buffered input
On 02/04/2011 11:46 PM, Andrei Alexandrescu wrote: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. First, let's start with the humblest abstraction of all - an input range, which only defines the troika empty/front/popFront with the known semantics. An input range consumes input destructively and has a one-element horizon. It may as well considered a buffered stream with the buffer length exactly one. Going from there, we may say that certain streaming can be done by using an input range of ubyte (or dchar for text). That would be the UTFpowered equivalent of getchar(). The readf function operates that way - it only needs to look one character ahead. Incidentally, the CSV format also requires lookahead of 1, so it also can operate on a range of dchar. At this point we need to ask ourselves an essential question. Since we have this input range abstraction for a 1-element buffer, what would its n-elements buffer representation look like? How do we go from input range of T (which really is unbuffered input range of T to buffered input range of T? Honestly, the answer was extremely unclear to me for the longest time. I thought that such a range would be an extension of the unbuffered one, e.g. a range that still offers T from front() but also offers some additional functions - e.g. a lookahead in the form of a random-access operator. I still think something can be defined along those lines, but today I came together with a design that is considerably simpler both for the client and the designer of the range. I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. Does shiftFront literally move element n to index 0 and so on? It seems to me that if you do, its going to have horrid performance, and if you don't, then you will eventually run into situations where appendToFront will require a wrap around, which loses you your contiguity, or a reallocation of the buffer.
Re: buffered input
On 2011-02-05 00:46:40 -0500, Andrei Alexandrescu seewebsiteforem...@erdani.org said: I've had the opportunity today to put some solid hours of thinking into the relationship (better said the relatedness) of what would be called buffered streams and ranges. They have some commonalities and some differences, but it's been difficult to capture them. I think now I have a clear view, caused by a few recent discussions. One was the CSV reader discussed on the Phobos list; another was the discussion on defining the right std.xml. First, let's start with the humblest abstraction of all - an input range, which only defines the troika empty/front/popFront with the known semantics. An input range consumes input destructively and has a one-element horizon. It may as well considered a buffered stream with the buffer length exactly one. Going from there, we may say that certain streaming can be done by using an input range of ubyte (or dchar for text). That would be the UTFpowered equivalent of getchar(). The readf function operates that way - it only needs to look one character ahead. Incidentally, the CSV format also requires lookahead of 1, so it also can operate on a range of dchar. At this point we need to ask ourselves an essential question. Since we have this input range abstraction for a 1-element buffer, what would its n-elements buffer representation look like? How do we go from input range of T (which really is unbuffered input range of T to buffered input range of T? Honestly, the answer was extremely unclear to me for the longest time. I thought that such a range would be an extension of the unbuffered one, e.g. a range that still offers T from front() but also offers some additional functions - e.g. a lookahead in the form of a random-access operator. I still think something can be defined along those lines, but today I came together with a design that is considerably simpler both for the client and the designer of the range. I hereby suggest we define buffered input range of T any range R that satisfies the following conditions: 1. R is an input range of T[] 2. R defines a primitive shiftFront(size_t n). The semantics of the primitive is that, if r.front.length = n, then shiftFront(n) discards the first n elements in r.front. Subsequently r.front will return a slice of the remaining elements. 3. R defines a primitive appendToFront(size_t n). Semantics: adds at most n more elements from the underlying stream and makes them available in addition to whatever was in front. For example if r.front.length was 1024, after the call r.appendToFront(512) will have r.front have length 1536 of which the first 1024 will be the old front and the rest will be newly-read elements (assuming that the stream had enough data). If n = 0, this instructs the stream to add any number of elements at its own discretion. This is it. I like many things about this design, although I still fear some fatal flaw may be found with it. With these primitives a lot of good operating operating on buffered streams can be written efficiently. The range is allowed to reuse data in its buffers (unless that would contradict language invariants, e.g. if T is invariant), so if client code wants to stash away parts of the input, it needs to make a copy. One great thing is that buffered ranges as defined above play very well with both ranges and built-in arrays - two quintessential parts of D. I look at this and say, this all makes sense. For example the design could be generalized to operate on some random-access range other than the built-in array, but then I'm thinking, unless some advantage comes about, why not giving T[] a little special status? Probably everyone thinks of contiguous memory when thinking buffers, so here generalization may be excessive (albeit meaningful). Finally, this design is very easy to experiment with and causes no disruption to ranges. I can readily add the primitives to byLine and byChunk so we can see what streaming we can do that way. What do you all think? One thing I'm wondering is whether it'd be more efficient if we could provide our own buffer to be filled. In cases where you want to preserve the data, this could let you avoid double-copying: first copy in the temporary buffer and then at the permanent storage location. If you need the data only temporarily however providing your buffer to be filled might be less efficient for a range that can't avoid copying to the temporary buffer for some reason.. Overall, it looks like a good design. It's quite low-level, but that's not a bad thing. I'll have to think a little to see how I could integrate it into my XML parser (which only deal with complete files in memory at this time). Being able to say fill this buffer would certainly make things easier for me. -- Michel Fortin michel.for...@michelf.com http://michelf.com/