On 5/15/18 4:36 PM, Dennis wrote:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:

- You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.
- decodeFront doesn't work on inputRanges like `byChunk(4096).joiner`
- std.algorithm.splitter doesn't work on inputRanges either
- When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units

Is there a simple way to do this?


Using iopipe, you can split on N lines (iopipe doesn't autodecode when searching for newlines), or split on a pre-determined chunk size (and ensure you don't split a code point).

Splitting on N lines:

import iopipe.bufpipe;
import iopipe.textpipe;
import iopipe.stream; // assumed home of openDev in iopipe releases of this era

auto infile = openDev("filename").bufd.assumeText.byLine;

foreach(i; 0 .. N) infile.extend(0); // ensure N lines in the buffer

Splitting on pre-determined chunk size:

auto infile = openDev("filename")
    .bufd!(ubyte, chunkSize) // use chunkSize as minimum read size
    .assumeText // it's text, not ubyte
    .ensureDecodeable; // do not end in the middle of a codepoint
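
To make "do not end in the middle of a codepoint" concrete, here is a minimal sketch of the underlying idea in plain D, with no iopipe involved. It is not how ensureDecodeable is implemented; the function name completePrefixLength is made up for illustration. It only inspects the last few bytes of a buffer and reports how many bytes can be handed off without cutting a UTF-8 sequence in half. Bytes that are already invalid are left alone, which suits a dirty file.

```d
/// Hypothetical helper: length of the longest prefix of `buf` that does
/// not end inside a multi-byte UTF-8 sequence.
size_t completePrefixLength(const(ubyte)[] buf)
{
    size_t i = buf.length;
    size_t back = 0;

    // Step back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 && (buf[i - 1] & 0xC0) == 0x80)
    {
        --i;
        ++back;
    }

    // Nothing but continuation bytes: no lead byte to judge by, keep all.
    if (i == 0)
        return buf.length;

    // The byte before the continuation run should be the lead byte.
    immutable ubyte lead = buf[i - 1];
    immutable size_t need =
        (lead & 0x80) == 0x00 ? 1 :   // ASCII
        (lead & 0xE0) == 0xC0 ? 2 :   // 110xxxxx
        (lead & 0xF0) == 0xE0 ? 3 :   // 1110xxxx
        (lead & 0xF8) == 0xF0 ? 4 :   // 11110xxx
        1;                            // invalid lead byte: treat as complete

    immutable size_t have = buf.length - (i - 1);

    // If the final sequence is truncated, stop just before its lead byte.
    return have < need ? i - 1 : buf.length;
}
```

The bytes after the returned length would stay in the buffer and be joined with the next read.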

The output isn't as straightforward. Ideally you would want to simply create an output pipe that splits into multiple files, and process the whole thing at once. I haven't created such a thing yet though (will add an enhancement request to do so).

Easiest thing to do is to write the entire window of the input pipe into an output pipe, or cast it back to ubyte[] and write directly to an output device.

e.g.:

auto infile = ... // one of the above ideas
   .encodeText; // convert to ubyte

auto outfile = openDev("outputFilename1", "w");
outfile.write(infile.window);
outfile.close;
infile.release(infile.window.length); // flush the input buffer
... // refill the buffer using the chosen technique above.
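
For comparison, the same fill / write / repeat cycle can be sketched with Phobos alone. This is a swapped-in technique, not the iopipe API: it reads raw bytes with std.stdio's byChunk and scans for '\n' without decoding, which is safe for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence. Invalid characters and null bytes are copied through untouched. The name splitByLines, the outputPrefix naming scheme and the 64 KiB chunk size are all assumptions made for this example.

```d
import std.format : format;
import std.stdio : File;

/// Hypothetical helper: copies `inputName` into files named
/// outputPrefix ~ "0", outputPrefix ~ "1", ... with at most
/// `linesPerFile` lines each. Returns the number of files written.
size_t splitByLines(string inputName, string outputPrefix, size_t linesPerFile)
{
    auto input = File(inputName, "rb");
    File output;
    size_t fileCount = 0;
    size_t linesInCurrent = 0;

    // Open the next output file lazily, only when there is data for it.
    void ensureOpen()
    {
        if (!output.isOpen)
            output = File(format("%s%s", outputPrefix, fileCount++), "wb");
    }

    foreach (ubyte[] chunk; input.byChunk(64 * 1024))
    {
        size_t start = 0;
        foreach (i, b; chunk)
        {
            if (b != '\n')
                continue;

            // Write everything up to and including this newline.
            ensureOpen();
            output.rawWrite(chunk[start .. i + 1]);
            start = i + 1;

            // Enough lines in this file: close it and start a new one.
            if (++linesInCurrent == linesPerFile)
            {
                output.close();
                linesInCurrent = 0;
            }
        }

        // Partial line at the end of the chunk; it continues in the next one.
        if (start < chunk.length)
        {
            ensureOpen();
            output.rawWrite(chunk[start .. $]);
        }
    }
    return fileCount;
}
```

A call such as splitByLines("filename", "outputFilename", 100_000) would then produce outputFilename0, outputFilename1, and so on. Only one chunk is held in memory at a time, so the size of the input does not matter.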

-Steve
