On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:

- You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.

Can you show the program you are using that throws when using byLine? I tried a very simple program that reads and outputs line-by-line, then fed it a file containing invalid UTF-8. No exception was thrown.

I created the invalid UTF-8 by taking part of this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a commonly used file with UTF-8 edge cases) and adding a number of random hex characters, including null.

The program I used:

int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
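        // Read from stdin when no file is given or the argument is "-".
        auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
        // Echo each line unchanged, keeping its line terminator.
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);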
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
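
If byLine does turn out to work on your file, the same loop could be extended toward the splitting you describe. What follows is only a minimal sketch, not something I have run on your data: the function name `splitByLines`, the `prefix` parameter, and the `%04d.txt` naming scheme are all made up for illustration. It assumes that byLine hands back the raw bytes of each line, as it appeared to in my test, and uses rawWrite so the bytes are copied without being interpreted as text.

```d
import std.stdio;
import std.exception : enforce;
import std.format : format;

/// Hypothetical helper: copies `input` into numbered files of at most
/// `linesPerFile` lines each and returns the number of files written.
size_t splitByLines(File input, string prefix, size_t linesPerFile)
{
    enforce(linesPerFile > 0, "linesPerFile must be positive");

    size_t fileCount = 0;
    size_t lineCount = 0;
    File output;

    foreach (line; input.byLine(KeepTerminator.yes))
    {
        // Start a new output file every `linesPerFile` lines.
        // Reassigning `output` releases the previous file handle.
        if (lineCount % linesPerFile == 0)
        {
            output = File(format("%s%04d.txt", prefix, fileCount), "wb");
            ++fileCount;
        }
        // Copy the bytes of the line as-is, terminator included.
        output.rawWrite(line);
        ++lineCount;
    }
    return fileCount;
}
```

A call such as `splitByLines(File("big.txt", "rb"), "part_", 100_000)` would then produce part_0000.txt, part_0001.txt, and so on (the file names here are placeholders). Only one line is held in memory at a time, so the size of the input should not matter.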


