On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer wrote:
Not trying to give too much away about the library I'm writing, but the problem I'm trying to solve is parsing out tokens from a buffer. I want to delineate the whole, as well as the parts, but it's difficult to get back to the original buffer once you split and slice up the buffer using phobos functions.

I wonder if there are some parallels in the tsv utilities I wrote. The tsv parser is extremely simple: byLine and splitter on a char buffer. Most of the tools just iterate the split result in order, but a couple do things like operate on a subset of fields, potentially reordered. For these a separate structure is created that maps back to the original buffer to avoid copying. Likely quite simple compared to what you are doing.
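A rough sketch of that kind of mapping, not the actual tsv utilities code. The helper name `selectFields` and its signature are made up for illustration. The point is only that each stored field is a slice into the line buffer, so nothing is copied and the offset into the original is still recoverable:

```d
import std.algorithm : splitter;
import std.stdio : writeln;

/// Hypothetical helper: fills `fields` with slices of `line` for the
/// requested 0-based field indices, in the order they were requested.
/// Returns false if the line does not contain every requested field.
bool selectFields(const(char)[] line, char delim,
                  const size_t[] wanted, const(char)[][] fields)
{
    assert(fields.length == wanted.length);
    size_t found = 0;
    size_t index = 0;
    foreach (field; line.splitter(delim))
    {
        foreach (i, w; wanted)
        {
            if (w == index)
            {
                fields[i] = field;   // a slice of `line`, not a copy
                ++found;
            }
        }
        ++index;
    }
    return found == wanted.length;
}

void main()
{
    const(char)[] line = "a\tbb\tccc\tdddd";
    size_t[] wanted = [2, 0];        // third field first, then the first field
    const(char)[][2] fields;

    assert(selectFields(line, '\t', wanted, fields[]));
    writeln(fields);

    // Each slice still points into the original buffer, so its offset
    // can be recovered with pointer arithmetic.
    assert(fields[0].ptr - line.ptr == 5);
    assert(fields[1].ptr - line.ptr == 0);
}
```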

The csv2tsv tool may be more interesting. Parsing is relatively simple, mostly identifying field values in the context of CSV escape syntax. It's modeled as reading an infinite stream of utf-8 characters, byte-by-byte. Occasionally the bytes forming the value need to be modified due to the escape syntax, but most of the time the characters in the original buffer remain untouched and parsing is identifying the start and end positions.

The infinite stream is constructed by reading fixed size blocks from the input stream and concatenating them with joiner. This eliminates the need to worry about utf-8 characters spanning block boundaries, but it comes at a cost: either write byte-at-a-time, or make an extra copy (also byte-at-a-time). Making an extra copy is faster, so that's what the code does. But, as a practical matter, large blocks could usually be written directly from the original input buffer.
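To make the trade-off concrete, here is a minimal sketch of the byte-at-a-time copy. It is not the csv2tsv code: `splitJoined` is a hypothetical name, the in-memory `blocks` array stands in for something like `File.byChunk(blockSize)`, and CSV escape handling is left out entirely.

```d
import std.algorithm : joiner;
import std.array : appender;
import std.stdio : writeln;

/// Hypothetical helper: treats `blocks` as one continuous byte stream
/// (via joiner) and splits it on `delim`, copying each field a byte at a time.
string[] splitJoined(ubyte[][] blocks, ubyte delim)
{
    auto field = appender!(ubyte[])();
    string[] fields;
    foreach (ubyte b; blocks.joiner)
    {
        if (b == delim)
        {
            fields ~= cast(string) field.data.idup;
            field.clear();
        }
        else
        {
            field.put(b);            // the extra byte-at-a-time copy
        }
    }
    fields ~= cast(string) field.data.idup;
    return fields;
}

void main()
{
    // Two fixed-size blocks; the field "cd" spans the block boundary.
    ubyte[][] blocks = [
        cast(ubyte[]) "ab,c".dup,
        cast(ubyte[]) "d,ef".dup,
    ];
    auto fields = splitJoined(blocks, cast(ubyte) ',');
    assert(fields == ["ab", "cd", "ef"]);
    writeln(fields);
}
```

The loop never sees the block boundary, which is the convenience joiner buys, and also why it cannot hand back a slice of the original block.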

If I wanted to make it faster than it is currently, I'd do this. But I don't see an easy way to do it with phobos ranges. At minimum I'd have to be able to run code when the joiner operation hits block boundaries. And it'd also be necessary to create a mapping back to the original input buffer.
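For illustration only, the direct-write idea could look something like the following if the joiner is dropped and blocks are walked by hand. This is an assumption-laden sketch, not something csv2tsv does: `translateBlock` is a made-up name, and it only swaps one delimiter byte for another with no CSV quoting or escapes.

```d
import std.array : appender;
import std.stdio : writeln;

/// Hypothetical sketch: replaces every `from` byte with `to`, emitting the
/// untouched runs in between as whole slices of the input block.
void translateBlock(Sink)(const(ubyte)[] block, ubyte from, ubyte to, ref Sink sink)
{
    size_t runStart = 0;
    foreach (i, b; block)
    {
        if (b == from)
        {
            sink.put(block[runStart .. i]);  // bulk write of the unmodified run
            sink.put(to);
            runStart = i + 1;
        }
    }
    // Trailing run: this is where the block boundary is handled explicitly.
    sink.put(block[runStart .. $]);
}

void main()
{
    ubyte[][] blocks = [
        cast(ubyte[]) "ab,c".dup,
        cast(ubyte[]) "d,ef".dup,
    ];
    auto output = appender!(ubyte[])();
    foreach (block; blocks)
        translateBlock(block, cast(ubyte) ',', cast(ubyte) '\t', output);

    assert(cast(string) output.data.idup == "ab\tcd\tef");
    writeln(cast(string) output.data.idup);
}
```

Copying raw bytes through like this is indifferent to a utf-8 character straddling two blocks, but any real parser state (inside a quoted field, pending escape) would have to be carried across the boundary by hand.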

Autodecoding comes into play of course. Basically, splitter on char arrays is fine, but in a number of cases it's necessary to work using ubyte to avoid the performance penalty.
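A small sketch of the ubyte route, using `std.string.representation` to view a string as bytes. The helper name `splitBytes` is made up, and it assumes an ASCII delimiter, which is safe because ASCII bytes never occur inside a multi-byte utf-8 sequence.

```d
import std.algorithm : splitter;
import std.range : ElementType;
import std.stdio : writeln;
import std.string : representation;

// With autodecoding, a string as a range yields dchar; its byte
// representation yields plain ubytes and skips the decoding step.
static assert(is(ElementType!string == dchar));
static assert(is(ElementType!(immutable(ubyte)[]) == immutable(ubyte)));

/// Hypothetical helper: splits `line` on an ASCII delimiter by working on
/// its bytes. The returned strings are slices of `line`, not copies.
string[] splitBytes(string line, char delim)
{
    string[] fields;
    foreach (field; line.representation.splitter(cast(ubyte) delim))
        fields ~= cast(string) field;
    return fields;
}

void main()
{
    auto fields = splitBytes("naïve\tcafé\tz", '\t');
    assert(fields == ["naïve", "café", "z"]);
    writeln(fields);
}
```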

--Jon
