On Wednesday, 16 May 2018 at 08:20:06 UTC, drug wrote:
What is the purpose of `.drop(4)`? I'm pretty sure this is the reason for the exception.

The file in question is a .json database dump with an array "rows" of 10 million 8-line objects. The newlines in the string fields are escaped, but the fields still contain other invalid characters, which makes std.json reject the file.

The first 4 lines of the file are basically a "header" and the last 2 lines are a closing ] and }, so I want to split every 4 + 8*(10_000_000/amountOfFiles) lines and also remove the trailing comma, add brackets, drop the last 2 lines, etc.

I thought it wouldn't be hard to crudely split this file using D's range functions and basic string manipulation, but the combination of being too large for a string and having invalid encoding seems to defeat most simple solutions. For now I decided to use Git Bash and do:
tail -n80000002 inputfile.json | split -l 8000000 - outputfile

And now I have files that do fit in memory. I'm still interested in complete D solutions though. Thanks for the iopipe and memory-mapped file suggestions, Steven and Jonathan; I will check those out.
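
For anyone who wants to check the line arithmetic of that tail/split command without a 10-million-object file, here is a rough, scaled-down sketch. Everything in it is made up for illustration: the file name sample.json, the part_ prefix, the placeholder line contents, and the choice of 3 objects with one object per output file. It only mirrors the layout described above (4 header lines, 8 lines per object, 2 closing lines). It writes its files into the current directory.

```shell
#!/bin/sh
# Scaled-down stand-in for the real dump (hypothetical contents):
# 4 header lines, 3 objects of 8 lines each, then the closing ] and }.
set -e

objects=3
lines_per_object=8

{
  # printf repeats its format once per argument, giving 4 header lines.
  printf 'header line %s\n' 1 2 3 4
  i=1
  while [ "$i" -le "$objects" ]; do
    j=1
    while [ "$j" -le "$lines_per_object" ]; do
      printf 'object %s line %s\n' "$i" "$j"
      j=$((j + 1))
    done
    i=$((i + 1))
  done
  printf ']\n}\n'
} > sample.json

# Keep the object lines plus the 2 closing lines (which skips the header),
# then cut the stream into pieces of 8 lines, i.e. one object per piece here.
tail -n $((objects * lines_per_object + 2)) sample.json |
  split -l "$lines_per_object" - part_

# Show how many lines ended up in each piece.
wc -l part_*
```

With these numbers, split produces one piece per object plus a final short piece that holds only the closing ] and }. The real command behaves the same way: 80000002 lines cut at 8000000 leaves a last file with just those 2 lines. The trailing commas and missing brackets still have to be fixed up separately.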
