Re: Splitting up large dirty file

Neia Neutuladh via Digitalmars-d-learn Thu, 17 May 2018 14:48:01 -0700

On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 GB would fit but I get an out of memory error when using std.file.read)

Memory mapping should work. That's in core.sys.posix.sys.mman for Posix systems, and Windows has some equivalent probably. (But nobody uses Windows, right?)
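A minimal sketch of the memory-mapping idea, assuming Phobos' std.mmfile (which wraps mmap on Posix and the file-mapping API on Windows) rather than calling core.sys.posix.sys.mman directly. The helper name countByte and the command-line handling are made up for illustration:

```d
import std.algorithm.searching : count;
import std.mmfile : MmFile;
import std.stdio : writeln;

/// Counts occurrences of `needle` in the file at `path` without copying the
/// file into a heap buffer. (Hypothetical helper, not from the original post.)
size_t countByte(string path, ubyte needle)
{
    auto mmf = new MmFile(path);             // maps the whole file read-only
    scope (exit) destroy(mmf);               // unmap when we are done
    auto data = cast(const(ubyte)[]) mmf[];  // view the mapping as raw bytes
    return data.count(needle);
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: countnl <file>");
        return;
    }
    writeln(countByte(args[1], cast(ubyte) 0x0A), " newline bytes");
}
```

Note that a mapping still needs contiguous address space, so a 1.5 GB file may not map in a 32-bit process.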

- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

std.algorithm should generally work with sequences of anything, not just strings. So memory map, cast to ubyte[], and deal with it that way?
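As a rough illustration of treating the data as bytes, the sketch below splits a ubyte slice on 0x0A with std.algorithm's splitter and drops null bytes from each line. No UTF-8 decoding happens, so invalid sequences pass through untouched. The function name cleanLines and the sample bytes are invented for the example:

```d
import std.algorithm.iteration : filter, splitter;
import std.array : array;
import std.stdio : writeln;

/// Splits raw bytes on '\n' (0x0A) and removes null bytes from every line.
/// (Hypothetical helper; works on bytes, so invalid UTF-8 is not an issue.)
const(ubyte)[][] cleanLines(const(ubyte)[] data)
{
    const(ubyte)[][] lines;
    foreach (line; data.splitter(cast(ubyte) 0x0A))
        lines ~= line.filter!(b => b != 0).array;
    return lines;
}

void main()
{
    // Stand-in for the mapped file: 'a', NUL, 'b', '\n', invalid byte 0xFF, 'c'
    const(ubyte)[] data = [0x61, 0x00, 0x62, 0x0A, 0xFF, 0x63];
    auto lines = cleanLines(data);
    writeln(lines.length, " lines");
}
```

This version copies each line; for a file of this size you would more likely process each slice inside the loop instead of collecting them.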

- When you convert chunks to arrays, you have the risk of a split being in the middle of a character with multiple code units

It's straightforward to scan for the start of a Unicode character; you just skip past bytes where the highest bit is set and the next-highest is not. (0b1100_0000 through 0b1111_0111 is the start of a multibyte character, though valid UTF-8 only uses lead bytes 0xC2 through 0xF4; 0b0000_0000 through 0b0111_1111 is a single-byte character.)
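The scan described above could look something like this. The names isContinuation and nextCharStart are made up here; the only rule the code relies on is that UTF-8 continuation bytes have the bit pattern 10xx_xxxx:

```d
/// True for UTF-8 continuation bytes: highest bit set, next-highest clear.
bool isContinuation(ubyte b)
{
    return (b & 0b1100_0000) == 0b1000_0000;
}

/// Returns the smallest index >= pos that is not a continuation byte, i.e.
/// a place where the data can be cut without splitting a character.
/// (Hypothetical helper for adjusting a proposed chunk boundary.)
size_t nextCharStart(const(ubyte)[] data, size_t pos)
{
    while (pos < data.length && isContinuation(data[pos]))
        ++pos;
    return pos;
}

void main()
{
    // 'a', the euro sign (E2 82 AC), 'b'
    const(ubyte)[] data = [0x61, 0xE2, 0x82, 0xAC, 0x62];
    assert(nextCharStart(data, 2) == 4); // inside the euro sign: move on to 'b'
    assert(nextCharStart(data, 1) == 1); // already on a lead byte
}
```

On dirty input this only finds a plausible boundary; a stray continuation byte with no lead byte before it is simply skipped like any other.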

That said, you seem to only need to split based on a newline character, so you might be able to ignore this entirely, even if you go by chunks.
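Splitting on the newline byte is safe because 0x0A never occurs inside a multibyte UTF-8 sequence. If you go by chunks, the only bookkeeping is carrying a partial line over to the next chunk. A sketch, assuming std.stdio's File.byChunk; the function name eachLine, its parameters, and the 64 KiB chunk size are invented for the example:

```d
import std.algorithm.searching : countUntil;
import std.stdio : File, writeln;

/// Reads `path` in chunks of `chunkSize` bytes and calls `sink` once per
/// line (without the trailing 0x0A). Each line is handed over as raw bytes.
void eachLine(string path, size_t chunkSize,
              scope void delegate(const(ubyte)[]) sink)
{
    auto f = File(path, "rb");
    ubyte[] carry; // bytes of a line that is not yet complete
    foreach (ubyte[] chunk; f.byChunk(chunkSize))
    {
        const(ubyte)[] rest = chunk;
        for (;;)
        {
            immutable nl = rest.countUntil(cast(ubyte) 0x0A);
            if (nl < 0)
                break;
            carry ~= rest[0 .. nl]; // copy, because byChunk reuses its buffer
            sink(carry);
            carry.length = 0;
            rest = rest[nl + 1 .. $];
        }
        carry ~= rest; // unfinished line continues in the next chunk
    }
    if (carry.length)
        sink(carry); // last line without a trailing newline
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: lines <file>");
        return;
    }
    size_t n;
    eachLine(args[1], 64 * 1024, (const(ubyte)[] line) { ++n; });
    writeln(n, " lines");
}
```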
