Didn't mean for this to end up under the tree... I planned to post long ago but barely had the time to flesh it out.

TL;DR: Solve the input side of the I/O problem with a new kind of range.
Specifically, I'm taking on the buffering primitive built on top of low-level unbuffered I/O sources.

Links

Prior related discussions:
http://forum.dlang.org/thread/op.wegg9vyneav7ka@steves-laptop?page=4

http://forum.dlang.org/thread/mailman.17.1317564725.22016.digitalmar...@puremagic.com?page=1

http://forum.dlang.org/thread/vnpriguleebpbzhkp...@forum.dlang.org#post-kh5g46:242prc:241:40digitalmars.com

An early precursor of the proposed design can be found in DScanner:
https://github.com/Hackerpilot/Dscanner/blob/master/stdx/d/lexer.d#L2388

Intro and motivation

Anyone doing serious work on parsers (text and binary alike) in D soon finds out that while the slice-em-up approach with random-access ranges works very nicely, any attempt to extend it beyond arrays and in-memory data gets really awkward. It amounts to implementing no small amount of bookkeeping (buffering), all while having to use primitives that just don't map well to the underlying "hardware" - a buffered stream.

For instance - want to peek 3 bytes ahead (as in LL(3) disambiguation)? Save the range, do 3 front/popFront calls with empty checks, then copy the saved range back. Painful, especially when you know full well that it could be just a length check plus 3 direct array reads. Even if the underlying buffer allows you to slice up the contents and do array-wise operations on them, the dumbed-down forward range interface just won't let you.
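For comparison, here is roughly what that dance looks like over a plain forward range (a sketch; peek3 is a made-up helper, not part of any library):

```d
import std.range;

// The forward-range way to check the next 3 elements: save the range,
// pop three times with empty checks. On an array this could be a
// single length check plus three direct reads.
bool peek3(R, E)(R input, E a, E b, E c)
    if (isForwardRange!R)
{
    auto saved = input.save; // independent copy just to look ahead
    foreach (expected; [a, b, c])
    {
        if (saved.empty || saved.front != expected)
            return false;
        saved.popFront();
    }
    return true;
}
```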

And once all of the hard (and useless) work to support forward ranges is done, you find there is a second parser that should support forward ranges as well. The conclusion is that parsers simply don't want to work on forward ranges; they need something bigger, and all the extra work you did on buffering really belongs to that bigger primitive.

To put the last nail in the coffin - most of the interesting sources can't even be forward ranges! A stream over a TCP socket - how do you 'save' that, Sherlock? Keeping in mind that 'save' must return the same type, we'd have been better off calling it "fork" back then.

To sum it up - the current selection of ranges is not good enough to _efficiently_ work with I/O. Yet ranges are very neat, and we'd better retain their strengths and capitalize on the existing work done in this area (e.g. the slew of good stuff in std.algorithm and std.range). We've got to extend the notion then.

Goals

Introduce a better primitive aimed squarely at solving the _input_ side of the I/O problem. Output IMHO might need some tweaks, but with OutputRange it's in a much better shape than input.

The said new input primitive (from now on "buffer range") must naturally and efficiently support the following:
1) The zero-copy, insanely fast and popular slice-em-up approach for in-memory use.
2) "Over the wire" sources such as pipes, sockets and the like.
3) Memory-mapped files, especially ones that don't fit into memory.
4) The primitives that (all kinds of) parsers need, including lookahead and lookbehind.

It would be awesome if we could also:
5) Retain backwards compatibility and integrate well with existing ranges.
6) Avoid dependency on the C run-time and surpass it performance-wise.
7) Remove extra burden and decouple dependencies in the chain:

input-source <--> buffer range <--> parser/consumer

Meaning that if we can mix and match parsers with buffer ranges, and buffer ranges with input sources, we will have grown something powerful indeed.

Spoiler

I have a proof of concept implemented. It *works*. What I'm looking for are ways to simplify the set of primitives and/or better suggestions.

Code: https://github.com/blackwhale/datapicked/tree/master/dpick/buffer
Docs: http://blackwhale.github.io/datapicked/

See it in action with, e.g., a slightly trimmed-down std.regex using this abstraction and working directly over files:

grep-like tool
https://github.com/blackwhale/datapicked/blob/master/dgrep.d

and the module itself
https://github.com/blackwhale/datapicked/blob/master/regex.d

(for usage see build.sh and bench.sh)

It's as efficient as it was with arrays, but there is no more need to work line by line and/or load the whole file.

In fact it's faster than the fastest line-by-line solution I had before: fgets + std.regex.matchFirst. Note that this (old) solution cheats twice: it seeks only 1 match per line, and it knows the line length ahead of time.

For me this proves that surpassing the C run-time performance-wise is well within our reach.

Proposal

The BufferRange concept itself (for now called simply Buffer) is defined here:
http://blackwhale.github.io/datapicked/dpick.buffer.traits.html

I'm not comfortable with the sheer number of primitives, but my goal was to be sufficient first and minimal second.

Rationale for the selection of the primitives follows.

1. Start with InputRange, as there is no need to break a good thing (and foreach). This also takes us a long way toward backwards compatibility with existing ranges. :)

Accept that, given requirements (1)-(3), we are working with a "sliding window" over whatever the true source of data is. Thus the sliding window can be a whole array, a mapped area of a file, or the buffer of a network stream. A sliding window may be moved across the input stream (~ reloading the buffer) or extended to reach further.

Now what we need is to properly exploit the capabilities of the sliding-window model and match them with the primitives a parser would "like".
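As a mental model, it can be pictured like this (illustrative only - the names are made up for the sketch, not the actual datapicked API):

```d
// Illustrative model of a sliding window over a file.
struct FileWindow
{
    ubyte[] window;  // the part of the source currently in memory

    // slide the window forward over the source (~ reload the buffer),
    // keeping any pinned data
    void advance() { /* read the next chunk, drop the unpinned prefix */ }

    // grow the window in place to reach further into the source
    void extend(size_t more) { /* read 'more' bytes, append to window */ }
}
```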

2. Slicing "after the fact".

This means the ability to mark relevant positions in the buffer as the start of something "useful", requiring the underlying implementation, when the time comes to move the window, to keep the data starting from there. Typically, later "down the stream" when the boundaries of the slice are established, it's extracted, examined and (rarely!) copied over. The ability to avoid a copy unless absolutely necessary is _crucial_.

Motivating (pseudo-)code I've seen in lexers (e.g. DScanner), rewritten with my primitives, looks like this:

{
    //pin this position so that buffering won't lose it
    auto m = input.mark();
    while(!input.empty && isAlpha(input.front)){
        input.popFront();
    }
    //get a slice from 'm' to the current position
    auto word = input.slice(m);
    //grab a copy if we haven't seen it before
    if(word !in identifiers)
        identifiers.insert(word.idup);
} //here m's destructor unpins the position in input

To address slicing (a parser requirement) I had to introduce the marking and pinning concept explicitly. See mark/slice in the docs.

3. Cheap save/restore.

Copying the whole range to save the iteration state was a bad idea. It's not just wasteful memory-wise, it's _semantically_ costly. In the case of a buffer range it would also imply some "re-reading" of the I/O source, as the copy has to be a completely independent view of the same input(!). Hence the 'save' of forward ranges is an inadequate primitive that must be dropped.

However, when the time comes to backtrack (and many parsers do, quite often) the state has to be restored. To minimize the primitive count and not break requirement (2), reuse 'mark' to save the state and add 'seek' to restore the position to the marked point. Since it was pinned, it's always in the buffer and readily accessible, keeping the ability to work over streams.
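A sketch of how a backtracking parser would use this pair (assuming mark/seek as described above; tryMatch is a made-up helper):

```d
// Try to match a literal word; on failure restore the position via
// seek and report false. Uses only mark/seek/empty/front/popFront.
bool tryMatch(Buffer)(ref Buffer input, const(ubyte)[] word)
{
    auto m = input.mark();       // pin the current position
    foreach (b; word)
    {
        if (input.empty || input.front != b)
        {
            input.seek(m);       // backtrack to the pinned position
            return false;
        }
        input.popFront();
    }
    return true;
}   // here m's destructor unpins the position
```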

4. Even cheaper save/restore.

Something I discovered the hard way in a practical setting is that setting up the boundaries of stuff to slice (captures in regex) must be dirt cheap - integer-assignment cheap.

This means that using marking for every sub-slice is prohibitively costly, as each mark has to communicate with the buffer (currently bumping a counter in some array). The idea that worked out in std.regex is to use a single "vantage point" mark, take one big slice off of it, and then make sub-slices of that as the plain array it is. Then, from time to time, the "vantage point" is updated when there are no sub-slices in mid-air.

This leads to a 'tell' primitive that gives the offset from a given mark to the current position, plus 'seek' overloads that work with relative offsets (positive and negative).
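In pseudo-code, the vantage-point trick looks roughly like this (assuming mark/tell/slice as described; a sketch, not the std.regex source):

```d
auto vantage = input.mark();        // one real mark, communicated to the buffer
// ... parsing proceeds ...
size_t start = input.tell(vantage); // capture boundary: integer-assignment cheap
// ... parsing proceeds ...
size_t end = input.tell(vantage);
// one big slice off the vantage point, then plain array sub-slicing
auto capture = input.slice(vantage)[start .. end];
// once no sub-slices are in mid-air, move the vantage point forward
```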

Another use for relative seek is skipping data without looking at it, potentially dropping whole buffers along the way (there is overlap with popFrontN, though). Also, fixed-width backtracking is easily had with relative seek if we can instruct the buffer range to keep at least the last K bytes in the buffer (require a minimal history).

5. Cheap lookahead/lookbehind.

This exploits the fact that the underlying buffer is nothing but an array, hence one may easily take a peek at some portion of it as a plain ubyte[]. The implementation makes sure it buffers up as much data as required and/or returns an empty range if that's not possible. This supports things like LL(k) lookahead and the fixed-length lookbehind that is common in regex zero-width assertions.
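For illustration, an LL(3) disambiguation could then look like this ('lookahead' here is a stand-in name for the peeking primitive):

```d
// Peek at the next 3 bytes as a plain array; empty if the source
// can't provide that many. Nothing is consumed from the input.
ubyte[] ahead = input.lookahead(3);
if (ahead.length == 3 && ahead[0] == '<' && ahead[1] == '!' && ahead[2] == '-')
{
    // take the comment branch, then consume for real
}
```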

These two could also be implemented with relative 'seek' + front/popFront; the question that remains is how effective that would be.

--
Dmitry Olshansky
