On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 GB
would fit but I get an out of memory error when using
std.file.read)
Memory mapping should work. The raw calls are in
core.sys.posix.sys.mman for Posix systems, and std.mmfile (MmFile)
wraps both those and the Windows equivalent. (But nobody uses
Windows, right?)
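
A minimal sketch of the memory-mapped approach, assuming std.mmfile is acceptable for your use case. The helper name countByte is made up for illustration. Note that a 32-bit build may still fail to map 1.5 GB for lack of address space, so this is not guaranteed to fix the out of memory error on its own.

```d
import std.algorithm : count;
import std.mmfile : MmFile;

/// Hypothetical helper: counts how often `needle` occurs in the file,
/// without reading the whole file into a GC-allocated array.
size_t countByte(string path, ubyte needle)
{
    auto mm = new MmFile(path);      // maps the whole file read-only
    scope (exit) destroy(mm);        // unmap as soon as we are done

    // mm[] is a void[] view of the mapping; treat it as raw bytes.
    auto data = cast(const(ubyte)[]) mm[];
    return data.count(needle);
}
```

The slice is only valid while the MmFile object is alive, so anything you want to keep has to be copied out before it is destroyed.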
- It is dirty (contains invalid Unicode characters, null bytes
in the middle of lines)
std.algorithm should generally work with sequences of anything,
not just strings. So memory map, cast to ubyte[], and deal with
it that way?
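
Something like the following is what I have in mind: a rough sketch, not tested against your data. cleanLines is a hypothetical name, and dropping the null bytes is only one guess at what "dealing with it" means for you.

```d
import std.algorithm : filter, map, splitter;
import std.array : array;

/// Hypothetical helper: splits raw bytes on '\n' and drops NUL bytes
/// from each line. Nothing here decodes Unicode, so invalid sequences
/// pass through untouched instead of throwing.
auto cleanLines(const(ubyte)[] data)
{
    return data
        .splitter(cast(ubyte) '\n')
        .map!(line => line.filter!(b => b != 0).array)
        .array;
}
```

This version copies every line eagerly to keep the example short. For a 1.5 GB input you would rather leave off the outer .array and consume the lines lazily.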
- When you convert chunks to arrays, you have the risk of a
split being in the middle of a character with multiple code
units
It's straightforward to scan for the start of a Unicode
character; you just skip past bytes where the highest bit is
set and the next-highest is not (0b1000_0000 through 0b1011_1111,
the continuation bytes). 0b1100_0000 through 0b1111_0111 is the
start of a multibyte character (valid UTF-8 only uses 0xC2 through
0xF4 of that range); 0b0000_0000 through 0b0111_1111 is a
single-byte character.
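
As a sketch, the boundary scan could look like this. The function names are my own, and it assumes the chunk is a const(ubyte)[] as above.

```d
/// True for UTF-8 continuation bytes, i.e. those of the form 0b10xx_xxxx.
bool isContinuation(ubyte b)
{
    return (b & 0b1100_0000) == 0b1000_0000;
}

/// Hypothetical helper: returns the first index >= pos that is not a
/// continuation byte, or data.length if there is none.
size_t nextCharStart(const(ubyte)[] data, size_t pos)
{
    while (pos < data.length && isContinuation(data[pos]))
        ++pos;
    return pos;
}
```

In well-formed UTF-8 this skips at most three bytes. Since your file is dirty, a long run of stray continuation bytes would be skipped as well, so you may want to cap the loop.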
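That said, you seem to only need to split based on a newline
character, so you might be able to ignore this entirely, even if
you go by chunks. Every byte of a multibyte UTF-8 sequence has its
highest bit set, so a 0x0A byte is always a real newline and never
part of another character. The only thing left to handle is a line
that straddles two chunks.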
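
If you do go by chunks, here is a sketch of splitting on newline bytes while carrying an unfinished line over to the next chunk. eachLine and its sink callback are hypothetical, not anything from Phobos; only File.byChunk and countUntil are library calls.

```d
import std.algorithm : countUntil;
import std.stdio : File;

/// Hypothetical helper: calls `sink` once per line (without the '\n').
/// The slice passed to `sink` is only valid during the call.
void eachLine(File f, size_t chunkSize,
              scope void delegate(const(ubyte)[]) sink)
{
    ubyte[] carry;                       // unfinished line from earlier chunks
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
    {
        const(ubyte)[] rest = chunk;
        for (;;)
        {
            auto nl = rest.countUntil(cast(ubyte) '\n');
            if (nl < 0)
                break;                   // no complete line left in this chunk
            if (carry.length)
            {
                carry ~= rest[0 .. nl];  // finish the carried-over line
                sink(carry);
                carry.length = 0;
            }
            else
                sink(rest[0 .. nl]);
            rest = rest[nl + 1 .. $];
        }
        carry ~= rest;                   // byChunk reuses its buffer, so copy
    }
    if (carry.length)
        sink(carry);                     // last line without trailing newline
}
```

You would call it as eachLine(File(path, "rb"), 64 * 1024, (line) { ... }) and .dup the line inside the callback if you need to keep it.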