On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
> I have a file with two problems:
> - It's too big to fit in memory (apparently, I thought 1.5 Gb
> would fit but I get an out of memory error when using
> std.file.read)
> - It is dirty (contains invalid Unicode characters, null bytes
> in the middle of lines)
>
> I want to write a program that splits it up into multiple
> files, with the splits happening every n lines. I keep
> encountering roadblocks though:
>
> - You can't give Yes.useReplacementChar to `byLine` and
> `byLine` (or `readln`) throws an Exception upon encountering an
> invalid character.

Can you show the program that throws when using byLine? I tried a
very simple program that reads its input and writes it back out
line by line, then fed it a file containing invalid UTF-8. I did
not see an exception. I created the invalid UTF-8 by taking part
of this file:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a
commonly used file of UTF-8 edge cases) and adding a number of
random bytes, including null.

The program I used:

int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ?
            stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
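
For the splitting itself, something along these lines might work.
It is only a sketch and I have not run it on a 1.5 GB file:
`splitByLines`, the output prefix, and the `0000.txt` naming scheme
are all made up for illustration. It assumes that byLine with the
default '\n' terminator hands back the raw bytes of each line
without validating them, as in the test above, and it uses rawWrite
so that nothing tries to re-encode the data on the way out. Only
one line is held in memory at a time.

```d
import std.stdio;
import std.conv : to;
import std.exception : enforce;
import std.format : format;

/// Copies `input` into numbered files named `<outPrefix>NNNN.txt`,
/// starting a new file every `linesPerFile` lines.
/// Returns the number of files written.
/// (Hypothetical helper; the name and naming scheme are illustrative.)
size_t splitByLines(File input, string outPrefix, size_t linesPerFile)
{
    enforce(linesPerFile > 0, "linesPerFile must be at least 1");

    size_t fileCount = 0;
    size_t linesInCurrent = 0;
    File output;

    // KeepTerminator.yes preserves the original line endings byte for byte.
    foreach (line; input.byLine(KeepTerminator.yes))
    {
        if (fileCount == 0 || linesInCurrent == linesPerFile)
        {
            // Assigning a new File releases (and closes) the previous one.
            output = File(format("%s%04d.txt", outPrefix, fileCount), "wb");
            ++fileCount;
            linesInCurrent = 0;
        }
        // rawWrite copies the bytes as-is, invalid UTF-8 and nulls included.
        output.rawWrite(line);
        ++linesInCurrent;
    }
    return fileCount;
}

int main(string[] args)
{
    if (args.length != 4)
    {
        stderr.writefln("Usage: %s <input> <output-prefix> <lines-per-file>",
                        args[0]);
        return 1;
    }
    try
    {
        auto count = splitByLines(File(args[1], "rb"), args[2],
                                  args[3].to!size_t);
        writefln("Wrote %s file(s)", count);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
```

Opening the input with "rb" and writing with rawWrite keeps the
whole path byte-oriented, so the dirty data is passed through
rather than repaired. If you want the invalid sequences replaced
instead, that would be a separate step on each line before writing.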