Re: Splitting up large dirty file

Jon Degenhardt via Digitalmars-d-learn Tue, 15 May 2018 19:50:34 -0700

On Tuesday, 15 May 2018 at 20:36:21 UTC, Dennis wrote:

I have a file with two problems:
- It's too big to fit in memory (apparently, I thought 1.5 Gb would fit but I get an out of memory error when using std.file.read)
- It is dirty (contains invalid Unicode characters, null bytes in the middle of lines)

I want to write a program that splits it up into multiple files, with the splits happening every n lines. I keep encountering roadblocks though:
- You can't give Yes.useReplacementChar to `byLine` and `byLine` (or `readln`) throws an Exception upon encountering an invalid character.

Can you show the program you are using that throws when using byLine? I tried a very simple program that reads and outputs line-by-line, then fed it a file that contained invalid utf-8. I did not see an exception. The invalid utf-8 was created by taking part of this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt (a commonly used file with utf-8 edge cases), plus adding a number of random hex characters, including null. I don't see exceptions thrown.


The program I used:

int main(string[] args)
{
    import std.stdio;
    import std.conv : to;
    try
    {
        auto inputStream = (args.length < 2 || args[1] == "-") ? stdin : args[1].File;
        foreach (line; inputStream.byLine(KeepTerminator.yes))
            write(line);
    }
    catch (Exception e)
    {
        stderr.writefln("Error [%s]: %s", args[0], e.msg);
        return 1;
    }
    return 0;
}
