On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:
I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 GB would fit but I get an out of memory error when using std.file.read)

Memory mapping should work. That's in core.sys.posix.sys.mman for Posix systems, and the Windows equivalent is CreateFileMapping plus MapViewOfFile. (But nobody uses Windows, right?)
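If you'd rather not call mmap by hand, Phobos has std.mmfile.MmFile, which wraps the platform-specific calls. A minimal sketch, assuming a 64-bit build so there is enough address space for the whole file; countNewlines is just a name made up for illustration:

```d
import std.algorithm.searching : count;
import std.mmfile : MmFile;

/// Counts '\n' bytes in a file by mapping it instead of reading it
/// into the GC heap. (Hypothetical helper, not part of Phobos.)
size_t countNewlines(string path)
{
    auto mm = new MmFile(path);     // this constructor maps the file read-only
    scope (exit) destroy(mm);       // unmap as soon as we are done

    // opSlice() hands back the whole mapping as void[]; view it as raw bytes.
    auto bytes = cast(const(ubyte)[]) mm[];
    return bytes.count(cast(ubyte) '\n');
}
```

The slice is only valid while the MmFile object is alive, so don't let it escape past the destroy.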

- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

std.algorithm should generally work with sequences of anything, not just strings. So memory map, cast to ubyte[], and deal with it that way?
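Something along these lines, as a rough sketch (byRawLine is a made-up name). Since the elements are ubyte rather than char, nothing tries to decode UTF-8, so invalid sequences and embedded null bytes just pass through:

```d
import std.algorithm.iteration : splitter;

/// Lazily splits raw bytes into lines on '\n'. Each line is a slice of
/// the input, so nothing is copied. (Hypothetical helper.)
auto byRawLine(const(ubyte)[] data)
{
    return data.splitter(cast(ubyte) '\n');
}
```

Note that input ending in '\n' yields one empty slice at the end, which you may want to drop.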

- When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units

It's straightforward to scan for the start of a Unicode character; you just skip past bytes where the highest bit is set and the next-highest is not (0b1000_0000 through 0b1011_1111, the continuation bytes). (0b1100_0000 through 0b1111_0111 is the start of a multibyte character; 0b0000_0000 through 0b0111_1111 is a single-byte character; 0b1111_1000 and up never appears in UTF-8.)
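As a sketch of that scan (both function names are made up). The cap of three steps is my own assumption: valid UTF-8 has at most three continuation bytes per character, and with dirty data you probably don't want a long run of stray continuation bytes to drag the boundary arbitrarily far:

```d
/// True for UTF-8 continuation bytes, i.e. the bit pattern 0b10xx_xxxx.
bool isContinuation(ubyte b) pure nothrow @nogc @safe
{
    return (b & 0b1100_0000) == 0b1000_0000;
}

/// Moves `pos` forward to the next byte that is not a continuation byte,
/// taking at most 3 steps (an assumption for dirty input, see above).
size_t alignToCharStart(const(ubyte)[] data, size_t pos) pure nothrow @nogc @safe
{
    size_t steps = 0;
    while (pos < data.length && steps < 3 && isContinuation(data[pos]))
    {
        ++pos;
        ++steps;
    }
    return pos;
}
```

You would call alignToCharStart on a tentative chunk boundary and cut the chunk at the returned index instead.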

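That said, you seem to only need to split based on a newline character, so you might be able to ignore this entirely, even if you go by chunks. Every byte of a multibyte UTF-8 sequence has the high bit set, so the byte 0x0A can never show up in the middle of a character.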
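If you do go by chunks, the only bookkeeping is carrying an unfinished line over to the next chunk. A rough sketch using File.byChunk; eachRawLine and the sink delegate are names I made up, and the chunk size is whatever suits you:

```d
import std.algorithm.searching : countUntil;
import std.stdio : File;

/// Calls `sink` once per line (without the '\n'), reading the file
/// `chunkSize` bytes at a time. The slice passed to `sink` is only
/// valid during the call; copy it if you need to keep it.
void eachRawLine(string path, size_t chunkSize,
                 scope void delegate(const(ubyte)[]) sink)
{
    ubyte[] carry;                          // unfinished line from earlier chunks
    foreach (ubyte[] chunk; File(path, "rb").byChunk(chunkSize))
    {
        const(ubyte)[] rest = chunk;
        for (;;)
        {
            auto nl = rest.countUntil(cast(ubyte) '\n');
            if (nl < 0)
                break;                      // no complete line left in this chunk
            if (carry.length)
            {
                carry ~= rest[0 .. nl];     // finish the line started earlier
                sink(carry);
                carry = null;
            }
            else
                sink(rest[0 .. nl]);
            rest = rest[nl + 1 .. $];
        }
        carry ~= rest;                      // byChunk reuses its buffer, so copy
    }
    if (carry.length)
        sink(carry);                        // last line without a trailing '\n'
}
```

Memory use is bounded by the chunk size plus the longest line, so this should work even where mapping the whole file does not.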
