Parsing and splitting textfile
Hello,

Can you point me to an efficient way to parse a text file and split it by a certain expression (for example, `\n\nFrom\ .+@.+$`), copying what has already been read to a separate file, and so on until the end of the file?

I am trying to implement a mailbox-to-maildir conversion application in D, but I would like to avoid loading each mbox completely into memory.

Regards, Hugo
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 13:52:45 -0500, Hugo Florentino h...@acdam.cu wrote:
> Can you point me to an efficient way to parse a text file and split it by certain expression (for example, `\n\nFrom\ .+@.+$`) [...]

std.regex

-Steve
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 14:00:09 -0500, Steven Schveighoffer wrote:
> std.regex

Specifically, std.regex.splitter[1] creates a lazy range over the input. You can couple this with lazy file reading (e.g. `File(mailbox).byChunk(1024).joiner`).

Justin
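For reference, here is a minimal sketch of what std.regex.splitter does on an ordinary in-memory string. The sample text, the helper name `splitMessages`, and the "m" flag (added so that `$` can match at the end of a line rather than only at the end of the input) are assumptions made for illustration; they are not taken from the thread.

```d
import std.array : array;
import std.regex : regex, splitter;
import std.stdio : writeln;

// Hypothetical helper: splits `text` on the separator pattern from the
// original question and collects the pieces into an array.
string[] splitMessages(string text)
{
    // "m" is an assumption: multi-line mode lets `$` match at line ends.
    auto re = regex(`\n\nFrom .+@.+$`, "m");
    // splitter yields slices of `text` lazily; .array forces them all.
    return text.splitter(re).array;
}

void main()
{
    // Made-up sample: two messages separated by a blank line + "From " line.
    string text = "From a@x.org\nbody one\n\nFrom b@y.org\nbody two\n";
    foreach (piece; splitMessages(text))
        writeln("[", piece, "]");
}
```

Note that the text matched by the separator pattern is not part of any piece, which matters if the "From " line has to be kept.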
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 14:00:09 -0500, Steven Schveighoffer wrote:
> std.regex

I should have explained myself better. I have already used regular expressions a couple of times. My question here is how to parse the file progressively, without loading it completely into memory. If this can be done solely with std.regex, please clarify further.

I was thinking of using byLine, but for that I see I must first use something like:

    auto myfile = File(usermbox);

Doesn't that load the whole file into memory?

Regards, Hugo
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 19:08:16 +0000 (UTC), Justin Whear wrote:
> Specifically std.regex.splitter[1] creates a lazy range over the input. You can couple this with lazy file reading (e.g. `File(mailbox).byChunk(1024).joiner`).

Interesting, thanks.
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 14:17:14 -0500, Hugo Florentino h...@acdam.cu wrote:
> I should have explained myself better. I have already used regular expressions a couple of times. My question here is how to parse the file progressively, without loading it completely into memory.

OK, I did not understand that.

> If this can be done solely with std.regex, please clarify further.

I'm not completely sure; Justin may have a solution.

> I was thinking of using byLine, but for that I see I must first use something like: auto myfile = File(usermbox); Doesn't that load the whole file into memory?

I do know the answer to this, and it's no. File wraps a buffered C FILE *.

-Steve
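To illustrate the point about File, here is a small sketch that reads a file one line at a time with byLine. The function name `countFromLines` and the simple "starts with `From `" check are assumptions for the example; the check is much looser than the regular expression in the question.

```d
import std.stdio : File, writeln;

// Hypothetical helper: counts lines that begin with "From ",
// holding only the current line in memory.
size_t countFromLines(string path)
{
    size_t count = 0;
    // Constructing File only opens the file; byLine then pulls one
    // line at a time through the C FILE * buffer.
    foreach (line; File(path).byLine)
    {
        if (line.length >= 5 && line[0 .. 5] == "From ")
            ++count;
    }
    return count;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: countfrom <mbox>");
        return;
    }
    writeln(countFromLines(args[1]));
}
```

byLine reuses its line buffer between iterations, so a line that must outlive the loop body has to be copied (for example with `.idup`).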
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 19:08:16 +0000 (UTC), Justin Whear wrote:
> Specifically std.regex.splitter[1] creates a lazy range over the input. You can couple this with lazy file reading (e.g. `File(mailbox).byChunk(1024).joiner`).

Would something like this work? (I cannot test it right now)

    auto themailbox = args[1];
    immutable uint chunksize = 1024 * 64;
    static auto re = regex(`\n\nFrom .+@.+$`);
    auto mailbox;
    auto mail;
    while (mailbox = File(themailbox).byChunk(chunksize).joiner) != EOF)
    {
        mail = splitter(mailbox, re);
    }

If so, I have a couple of further questions:

Using splitter actually removes the expression from the string. How could I reinsert it at the beginning of each resulting string in an efficient way (i.e. without copying something which is already loaded in memory)?

I see that the splitter function returns a struct. How could I progressively dump each resulting string to disk and remove it from the struct, so that the full mailbox does not end up loaded into memory, in this case as a struct?

Regards, Hugo
Re: Parsing and splitting textfile
On Mon, 24 Feb 2014 15:19:06 -0500, Hugo Florentino wrote:
> Would something like this work? (I cannot test it right now) [...]

The code you've posted won't work, primarily because you don't need to loop over the file-reading range, nor will it ever return EOF. Also, if you don't actually want to remove the regex matches, you can just use the matchAll function. Here's some _untested_ sample code to set you on the right track.

    import std.algorithm, std.range, std.stdio, std.regex;

    void main(string[] args)
    {
        auto mailboxPath = args[1];
        immutable size_t chunksize = 1024 * 64;
        auto re = regex(`\n\nFrom .+@.+$`); // you might want to try using ctRegex
        auto mailStarts = File(mailboxPath).byChunk(chunksize).joiner
            .matchAll(re);
    }

This code won't actually do any work--no data will be loaded from the file (caveat: the first chunk might be prefetched, not sure), no matches will actually be performed.
If you use `take(10)` on the mailStarts variable, the code will load only as much of the file (down to the granularity of chunksize) as is needed to find the first 10 instances of the regular expression. The regex matches will not copy, but rather provide slices over the data that is in memory.

And, thinking about this further, you don't want to use my code either, partly because byChunk reuses its buffer and partly because the functions in std.regex provide slices over the input data. I think what you'll want to do is: load the data from File lazily, chunk by chunk; scan each chunk with the regex; if you don't find a match, copy that data into an overlap buffer and repeat; if you do find a match, then the contents of the overlap buffer plus the slice up to the current match is one mail. Rinse and repeat.

You should be able to encapsulate all of this in a clean, lazy range, but I don't have the time right now to work out whether it can be done by simply composing existing functions from Phobos.

Justin
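As a rough illustration of streaming the mailbox without holding it in memory, here is a sketch that swaps the chunk-plus-overlap-buffer scheme for a simpler line-by-line variant, so a separator can never straddle a buffer boundary. It is not the regex-based approach described above: the function name `splitMbox`, the numbered `.eml` output names, and the plain "blank line followed by a line starting with `From `" test are all assumptions for the example, and the output directory is assumed to exist already.

```d
import std.format : format;
import std.path : buildPath;
import std.stdio : File, KeepTerminator, writeln;

// Hypothetical helper: copies each message of the mbox at `mboxPath`
// into its own numbered file in `outDir`; returns the message count.
size_t splitMbox(string mboxPath, string outDir)
{
    File current;            // output file of the message being copied
    size_t count = 0;
    bool afterBlank = true;  // the start of the file counts as "after a blank line"

    foreach (line; File(mboxPath, "rb").byLine(KeepTerminator.yes))
    {
        // Simplified separator test (assumption): no "@" check, unlike the regex.
        if (afterBlank && line.length >= 5 && line[0 .. 5] == "From ")
        {
            // Assigning a new File closes the previous output file.
            current = File(buildPath(outDir, format("%06d.eml", count)), "wb");
            ++count;
        }
        // The "From " line itself is written too, so nothing is reinserted later.
        if (current.isOpen)
            current.rawWrite(line);
        afterBlank = (line == "\n" || line == "\r\n");
    }
    return count;
}

void main(string[] args)
{
    if (args.length < 3)
    {
        writeln("usage: splitmbox <mbox> <existing-output-dir>");
        return;
    }
    writeln(splitMbox(args[1], args[2]), " message(s) written");
}
```

Only one line is held in memory at a time, and each line is written out before byLine reuses its buffer, so no copy needs to be kept. A real maildir conversion would also need proper maildir file names and would probably drop the "From " line, which this sketch deliberately keeps.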