Didn't mean for this to end up under the tree... I planned to post long ago but barely had the time to flesh it out.

TL;DR: Solve the input side of the I/O problem with a new kind of range.
Specifically, I'm taking on the buffering primitive built on top of low-level unbuffered I/O sources.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmar...@puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkp...@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design can be found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text and binary alike) in D soon finds out that while the slice-em-up approach with random-access ranges works very nicely, any attempt to extend it beyond arrays and in-memory data gets really awkward. It amounts to implementing no small amount of bookkeeping (buffering), all while having to use primitives that just don't map well to the underlying "hardware" - a buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? Save the range, do 3 front/popFront calls with empty checks, then copy the saved range back. Painful, especially when you know full well that it could be just a length check plus 3 direct array reads. Even if the underlying buffer allows you to slice up the contents and do array-wise operations on them, the dumbed-down forward range interface just won't let you.
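For comparison, here is roughly what that dance looks like over a plain forward range (a sketch; peek3 is a made-up helper, not part of any library):

```d
import std.range;

// The forward-range way to check the next 3 elements: save the range,
// pop three times with empty checks. On an array this could be a
// single length check plus three direct reads.
bool peek3(R, E)(R input, E a, E b, E c)
    if (isForwardRange!R)
{
    auto saved = input.save; // independent copy just to look ahead
    foreach (expected; [a, b, c])
    {
        if (saved.empty || saved.front != expected)
            return false;
        saved.popFront();
    }
    return true;
}
```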

And once all of the hard (and useless) work to support forward ranges is done, you find there is a second parser that should support forward ranges as well. The conclusion is that parsers simply don't want to work on forward ranges; they need something bigger, and all the extra work you did on buffering really belongs to that bigger primitive.

To put the last nail in the coffin - most of the interesting sources can't even be forward ranges! A stream over a TCP socket - how do you 'save' that, Sherlock? Keeping in mind that 'save' must return the same type, we'd have been better off calling it "fork" back then.

To sum it up - the current selection of ranges is not good enough to _efficiently_ work with I/O. Yet ranges are very neat, and we'd better retain their strengths and capitalize on the existing work done in this area (e.g. the slew of good stuff in std.algorithm and std.range). We've got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side of the I/O problem. Output IMHO might need some tweaks, but with OutputRange it's in a much better shape than input.

The said new input primitive (from now on "buffer range") must naturally and efficiently support the following:
1) The zero-copy, insanely fast and popular slice-em-up approach for in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like.
3) Memory-mapped files, especially ones that don't fit into memory.
4) The primitives that (all kinds of) parsers need, including lookahead and lookbehind.

It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we will have grown something powerful indeed.

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for are ways to simplify the set of primitives and/or better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with, e.g., a slightly trimmed-down std.regex using this abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there is no more need to work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before: fgets + std.regex.matchFirst. Note that this (old) solution cheats twice: it seeks only 1 match per line, and it knows the line length ahead of time.

For me this proves that surpassing the C run-time performance-wise is well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives, but my goal was to be sufficient first and minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing (and foreach). This also takes us a long way toward backwards compatibility with existing ranges. :)

Accept that, given requirements (1)-(3), we are working with a "sliding window" over whatever the true source of data is. Thus the sliding window can be a whole array, a mapped area of a file, or the buffer of a network stream. A sliding window may be moved across the input stream (~ reloading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the sliding-window model and match them with the primitives a parser would "like".
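As a mental model, it can be pictured like this (illustrative only - the names are made up for the sketch, not the actual datapicked API):

```d
// Illustrative model of a sliding window over a file.
struct FileWindow
{
    ubyte[] window;  // the part of the source currently in memory

    // slide the window forward over the source (~ reload the buffer),
    // keeping any pinned data
    void advance() { /* read the next chunk, drop the unpinned prefix */ }

    // grow the window in place to reach further into the source
    void extend(size_t more) { /* read 'more' bytes, append to window */ }
}
```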

2. Slicing "after the fact".

This means the ability to mark relevant positions in the buffer as the start of something "useful", requiring the underlying implementation, when the time comes to move the window, to keep the data starting from there. Typically, later "down the stream" when the boundaries of the slice are established, it's extracted, examined and (rarely!) copied over. The ability to avoid a copy unless absolutely necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner), rewritten with my primitives, looks like this:

{
    //pin this position so that buffering won't lose it
    auto m = input.mark();
    while(!input.empty && isAlpha(input.front)){
        input.popFront();
    }
    //get a slice from 'm' to the current position
    auto word = input.slice(m);
    //grab a copy if we haven't seen it before
    if(word !in identifiers)
        identifiers.insert(word.idup);
} //here m's destructor unpins the position in input

To address slicing (a parser requirement) I had to introduce the marking and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save the iteration state was a bad idea. It's not just wasteful memory-wise, it's _semantically_ costly. In the case of a buffer range it would also imply some "re-reading" of the I/O source, as the copy has to be a completely independent view of the same input(!). Hence the 'save' of forward ranges is an inadequate primitive that must be dropped.

However, when the time comes to backtrack (and many parsers do, quite often) the state has to be restored. To minimize the primitive count and not break requirement (2), reuse 'mark' to save the state and add 'seek' to restore the position to the marked point. Since it was pinned, it's always in the buffer and readily accessible, keeping the ability to work over streams.
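A sketch of how a backtracking parser would use this pair (assuming mark/seek as described above; tryMatch is a made-up helper):

```d
// Try to match a literal word; on failure restore the position via
// seek and report false. Uses only mark/seek/empty/front/popFront.
bool tryMatch(Buffer)(ref Buffer input, const(ubyte)[] word)
{
    auto m = input.mark();       // pin the current position
    foreach (b; word)
    {
        if (input.empty || input.front != b)
        {
            input.seek(m);       // backtrack to the pinned position
            return false;
        }
        input.popFront();
    }
    return true;
}   // here m's destructor unpins the position
```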

4. Even cheaper save/restore.

Something I discovered the hard way in a practical setting is that setting up the boundaries of stuff to slice (captures in regex) must be dirt cheap - integer-assignment cheap.

This means that using marking for every sub-slice is prohibitively costly, as each mark has to communicate with the buffer (currently bumping a counter in some array). The idea that worked out in std.regex is to use a single "vantage point" mark, take one big slice off of it, and then make sub-slices of that as the plain array it is. Then, from time to time, the "vantage point" is updated when there are no sub-slices in mid-air.

This leads to a 'tell' primitive that gives the offset from a given mark to the current position, plus 'seek' overloads that work with relative offsets (positive and negative).
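In pseudo-code, the vantage-point trick looks roughly like this (assuming mark/tell/slice as described; a sketch, not the std.regex source):

```d
auto vantage = input.mark();        // one real mark, communicated to the buffer
// ... parsing proceeds ...
size_t start = input.tell(vantage); // capture boundary: integer-assignment cheap
// ... parsing proceeds ...
size_t end = input.tell(vantage);
// one big slice off the vantage point, then plain array sub-slicing
auto capture = input.slice(vantage)[start .. end];
// once no sub-slices are in mid-air, move the vantage point forward
```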

Another use for relative seek is skipping data without looking at it, potentially dropping whole buffers along the way (there is overlap with popFrontN, though). Also, fixed-width backtracking is easily had with relative seek if we can instruct the buffer range to keep at least the last K bytes in the buffer (require a minimal history).

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an array, hence one may easily take a peek at some portion of it as a plain ubyte[]. The implementation makes sure it buffers up as much data as required and/or returns an empty range if that's not possible. This supports things like LL(k) lookahead and the fixed-length lookbehind that is common in regex zero-width assertions.
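For illustration, an LL(3) disambiguation could then look like this ('lookahead' here is a stand-in name for the peeking primitive):

```d
// Peek at the next 3 bytes as a plain array; empty if the source
// can't provide that many. Nothing is consumed from the input.
ubyte[] ahead = input.lookahead(3);
if (ahead.length == 3 && ahead[0] == '<' && ahead[1] == '!' && ahead[2] == '-')
{
    // take the comment branch, then consume for real
}
```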

These two could also be implemented with relative 'seek' + front/popFront; the question that remains is how effective that would be.

--
Dmitry Olshansky
