Re: More fun with autodecoding

Jon Degenhardt via Digitalmars-d Thu, 09 Aug 2018 10:36:02 -0700

On Wednesday, 8 August 2018 at 21:01:18 UTC, Steven Schveighoffer wrote:

Not trying to give too much away about the library I'm writing, but the problem I'm trying to solve is parsing out tokens from a buffer. I want to delineate the whole, as well as the parts, but it's difficult to get back to the original buffer once you split and slice up the buffer using phobos functions.

I wonder if there are some parallels in the tsv utilities I wrote. The tsv parser is extremely simple: byLine and splitter on a char buffer. Most of the tools just iterate the split result in order, but a couple do things like operate on a subset of fields, potentially reordered. For these a separate structure is created that maps back to the original buffer to avoid copying. Likely quite simple compared to what you are doing.
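A minimal sketch of that kind of mapping, not the tsv utilities' actual code: the helper name `selectFields` and the sample line are hypothetical. It relies on splitter returning slices of the buffer it was given, so the selected fields can be stored, reordered, and traced back to an offset without copying.

```d
import std.algorithm : splitter;

// Hypothetical helper: returns the requested fields, in the requested order,
// as slices of `line`. No field data is copied.
const(char)[][] selectFields(const(char)[] line, const size_t[] wanted)
{
    const(char)[][] selected;
    selected.length = wanted.length;

    size_t fieldIndex = 0;
    foreach (field; line.splitter('\t'))
    {
        foreach (slot, w; wanted)
        {
            if (w == fieldIndex)
                selected[slot] = field;   // store the slice, not a copy
        }
        ++fieldIndex;
    }
    return selected;
}

void main()
{
    const(char)[] line = "alpha\tbeta\tgamma\tdelta".dup;
    size_t[] wanted = [2, 0];             // third field first, then the first field

    auto fields = selectFields(line, wanted);
    assert(fields[0] == "gamma" && fields[1] == "alpha");

    // Each slice still points into the original buffer, so its start
    // offset can be recovered by pointer arithmetic.
    assert(fields[0].ptr - line.ptr == 11);
    assert(fields[1].ptr is line.ptr);
}
```

With byLine the buffer is reused on each iteration, so slices like these are only valid until the next line is read.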

The csv2tsv tool may be more interesting. Parsing is relatively simple, mostly identifying field values in the context of CSV escape syntax. It's modeled as reading an infinite stream of utf-8 characters, byte-by-byte. Occasionally the bytes forming the value need to be modified due to the escape syntax, but most of the time the characters in the original buffer remain untouched and parsing is identifying the start and end positions.

The infinite stream is constructed by reading fixed-size blocks from the input stream and concatenating them with joiner. This eliminates the need to worry about utf-8 characters spanning block boundaries, but it comes at a cost: either write byte-at-a-time, or make an extra copy (also byte-at-a-time). Making an extra copy is faster, so that's what the code does. But, as a practical matter, most of the time large blocks could be written directly from the original input buffer.
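A small sketch of the blocks-plus-joiner arrangement, under assumptions: `chunks` over an in-memory array stands in for fixed-size reads from a file, and `copyThroughJoiner` is a hypothetical name. The block size of 2 is chosen so that the two-byte character "é" straddles a block boundary, which the flattened stream hides from the consumer.

```d
import std.algorithm : joiner;
import std.array : appender;
import std.range : chunks;
import std.string : representation;

// Hypothetical helper: flattens fixed-size blocks into one byte stream and
// copies it out one byte at a time (the "extra copy" cost described above).
ubyte[] copyThroughJoiner(const(ubyte)[] input, size_t blockSize)
{
    auto stream = input.chunks(blockSize).joiner;   // range of single bytes
    auto output = appender!(ubyte[])();
    foreach (b; stream)
        output.put(b);                              // byte-at-a-time copy
    return output.data;
}

void main()
{
    auto input = "aé,b".representation;   // immutable(ubyte)[]
    assert(input.length == 5);            // 'é' occupies two bytes

    // Blocks are [a, 0xC3], [0xA9, ','], [b]: the character is split across
    // two blocks, yet the joined stream reproduces the input byte for byte.
    assert(copyThroughJoiner(input, 2) == input);
}
```

The consumer never sees a block boundary, which is the convenience; it also never sees a contiguous slice, which is why bulk writes are not available on this path.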

If I wanted to make it faster than it currently is, that's what I'd do. But I don't see an easy way to do this with phobos ranges. At minimum I'd have to be able to run code when the joiner operation hits block boundaries. And it'd also be necessary to create a mapping back to the original input buffer.
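For illustration only, here is roughly what the hand-written alternative might look like when the block loop is explicit rather than hidden behind joiner. This is a heavily simplified stand-in, not csv2tsv logic: it merely replaces commas with tabs and ignores CSV quoting, and `commasToTabs` is a hypothetical name. The point is that untouched runs are written as whole slices of the current block, and the end of each block is a place where code can run.

```d
import std.array : appender;
import std.range : chunks;
import std.string : representation;

// Hypothetical helper: walks the input block by block and writes unmodified
// runs as slices instead of one byte at a time.
ubyte[] commasToTabs(const(ubyte)[] input, size_t blockSize)
{
    auto output = appender!(ubyte[])();

    foreach (block; input.chunks(blockSize))
    {
        size_t runStart = 0;
        foreach (i, b; block)
        {
            if (b == ',')
            {
                output.put(block[runStart .. i]);   // bulk write of the untouched run
                output.put(cast(ubyte) '\t');       // the one modified byte
                runStart = i + 1;
            }
        }
        // Block boundary: flush whatever remains of this block.
        output.put(block[runStart .. $]);
    }
    return output.data;
}

void main()
{
    auto input = "aé,b,c".representation;
    // Only ASCII bytes are inspected, so a multi-byte character split
    // across blocks passes through intact.
    assert(cast(string) commasToTabs(input, 2) == "aé\tb\tc");
}
```

Real CSV handling would also need to carry parser state (for example, "inside a quoted field") across block boundaries, which this sketch does not attempt.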

Autodecoding comes into play of course. Basically, splitter on char arrays is fine, but in a number of cases it's necessary to work using ubyte to avoid the performance penalty.
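A short sketch of the ubyte approach, assuming `std.string.representation` is an acceptable way to get at the bytes (a cast would also work); `firstField` is a hypothetical name. Viewed as a range, a string has element type dchar, while its ubyte representation is iterated as plain bytes with no decoding step.

```d
import std.algorithm : splitter;
import std.range : ElementType;
import std.string : representation;

// As a range, a string yields decoded code points (autodecoding) ...
static assert(is(ElementType!string == dchar));
// ... while the ubyte view of the same memory yields raw bytes.
static assert(is(ElementType!(immutable(ubyte)[]) == immutable(ubyte)));

// Hypothetical helper: splits on tab at the byte level and returns the
// first field as a slice of the original string.
string firstField(string line)
{
    auto bytes = line.representation;               // no copy, same memory
    auto fields = bytes.splitter(cast(ubyte) '\t');
    return cast(string) fields.front;               // view the bytes as text again
}

void main()
{
    string line = "naïve\tcafé";
    auto first = firstField(line);
    assert(first == "naïve");
    assert(first.ptr is line.ptr);                  // still the original buffer
}
```

Splitting on an ASCII byte such as tab is safe in UTF-8, since bytes below 0x80 never appear inside a multi-byte sequence.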


--Jon
