I would consider using appender for pushed and npushed. Can you post the file you are running the benchmark on?
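Roughly what I mean by appender, as a minimal sketch rather than a drop-in replacement. The names collectNotPushed and NotPushed and the sample lines are made up for illustration, since I have not seen the real log. It only covers the SOC_NOT_PUSHED branch; pushed would be handled the same way. Note that byLine reuses its buffer between iterations, so anything stored for later should be copied (here with .idup).

```d
import std.array : appender, split;
import std.conv : to;
import std.stdio : writeln;
import std.string : indexOf;
import std.typecons : Tuple, tuple;

// Hypothetical record type: strings instead of char[] so that the stored
// fields do not alias byLine's reused buffer.
alias NotPushed = Tuple!(string, int, string);

// Collects SOC_NOT_PUSHED records from any range of lines
// (for example File.byLine, or a plain string[]).
NotPushed[] collectNotPushed(Lines)(Lines lines)
{
    // appender keeps track of capacity itself, which is usually cheaper
    // than repeated ~= on a built-in array.
    auto npushed = appender!(NotPushed[])();
    foreach (line; lines)
    {
        if (line.indexOf("SOC_NOT_PUSHED") == -1)
            continue;
        auto tarr = line.split();
        // .idup copies the data out of the line buffer.
        npushed.put(tuple((tarr[2] ~ tarr[3]).idup,
                          to!int(tarr[$ - 1]),
                          tarr[$ - 2].idup));
    }
    return npushed.data;
}

void main()
{
    // Made-up sample lines; the real log format is an assumption here.
    auto sample = [
        "a b host 01 SOC_NOT_PUSHED reason 42",
        "unrelated line",
    ];
    auto result = collectNotPushed(sample);
    writeln(result.length);
}
```

With a real file this would be called as collectNotPushed(File(filename, "r").byLine), with File imported from std.stdio. Whether it makes a measurable difference on a 52 MB log is something only a benchmark can tell.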
On Fri, Jun 9, 2017 at 9:50 AM, rikki cattermole via Digitalmars-d-learn
<digitalmars-d-learn@puremagic.com> wrote:

> On 09/06/2017 8:34 AM, uncorroded wrote:
>
>> Hi guys,
>>
>> I am a beginner in D. As a project, I converted a log-parsing script in
>> Python which we use at work, to D. This link was helpful - (
>> https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ ) I
>> compiled it with dmd and ldc. The log file is 52 MB. With dmd (not release
>> build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
>>
>> The Python script (run with system python, not Pypy) takes 0.75 sec. The
>> D and Python functions are here and on pastebin ( D -
>> https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
>>
>> Basically, i am reading a line, checking for 2 constants. If either one
>> is found, some processing is done on line and stored to an array for later
>> analysis. I tried reading the file entirely in one go using std.file :
>> readText and using std.algorithm : splitter for lazily splitting newline
>> but there is no difference in speed, so I used the byLine approach
>> mentioned in the linked blog. Is there a better way of doing this in D?
>>
>> Note:
>> I ran GC profiling as mentioned in linked blog. The results were:
>>
>> Number of collections: 3
>> Total GC prep time: 0 milliseconds
>> Total mark time: 0 milliseconds
>> Total sweep time: 0 milliseconds
>> Total page recovery time: 0 milliseconds
>> Max Pause Time: 0 milliseconds
>> Grand total GC time: 2 milliseconds
>> GC summary: 12 MB, 3 GC 2 ms, Pauses 0 ms < 0 ms
>>
>> So GC does not seem to be an issue.
>>
>> Here's the D script:
>>
>> import std.stdio;
>> import std.string;
>> import std.array;
>> import std.algorithm : splitter;
>> import std.typecons : tuple, Tuple;
>> import std.conv : to;
>>
>> void read_log(string filename) {
>>     File file = File(filename, "r");
>>     Tuple!(char[], int, char[])[] npushed;
>>     Tuple!(int, char[], int, bool, bool)[] pushed;
>>     foreach (line; file.byLine) {
>>         if (line.indexOf("SOC_NOT_PUSHED") != -1) {
>>             auto tarr = line.split();
>>             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]),
>>                              tarr[$ - 2]);
>>             continue;
>>         }
>>         if (line.indexOf("SYNC_PUSH:") != -1) {
>>             auto rel = line.split("SYNC_PUSH:")[1].strip();
>>             auto att = rel.split(" at ");
>>             auto ina = att[1].split(" in ");
>>             auto msa = ina[1].split(" ms ");
>>             pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
>>                             msa[1].indexOf("PA-SOC_POP") != -1,
>>                             msa[1].indexOf("CU-SOC_POP") != -1);
>>         }
>>     }
>>     // Using the arrays later on in production script
>>     writeln(npushed.length);
>>     writeln(pushed.length);
>> }
>>
>>
>> Here is Python function:
>>
>> def read_log(fname):
>>     try:
>>         with open(fname, 'r') as f:
>>             raw = f.read().splitlines()
>>         ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
>>         ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw
>>               if 'SYNC_PUSH:' in w]
>>         not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
>>         ww = [(int(e.split(' at ')[0]),
>>                e.split(' at ')[1].split(' in ')[0],
>>                int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
>>                set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split()))
>>               for e in ss]
>>         pushed = [[w[0], w[1], w[2],
>>                    1 if 'PA-SOC_POP' in w[3] else 0,
>>                    1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
>>         return not_pushed, pushed
>>     except:
>>         return []
>>
>>
> The code isn't entirely 1:1. Any usage of IO (includes stdout via writeln)
> is expensive. Your python code doesn't write anything to stdout (or perform
> any calls). It would also be good to get the results of dmd -release as
> well.
>
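On the timing point quoted above: one way to keep stdout out of the measurement is to time only the parsing and print afterwards. A minimal sketch, assuming a Phobos recent enough to ship std.datetime.stopwatch. parseOnly and the sample lines are hypothetical stand-ins for read_log and the real log.

```d
import std.algorithm.searching : canFind, count;
import std.datetime.stopwatch : AutoStart, StopWatch;
import std.stdio : writeln;

// Hypothetical stand-in for the parsing work: counts lines that contain
// the SYNC_PUSH: marker without doing any output.
size_t parseOnly(string[] lines)
{
    return lines.count!(l => l.canFind("SYNC_PUSH:"));
}

void main()
{
    // Made-up sample data; replace with lines read from the real log.
    auto lines = ["x SYNC_PUSH: 1 at a in 2 ms", "unrelated line"];

    auto sw = StopWatch(AutoStart.yes);
    auto n = parseOnly(lines);
    sw.stop();

    // Output happens after the stopwatch has stopped, so it is not timed.
    writeln(n, " matching lines in ", sw.peek.total!"usecs", " usecs");
}
```

For the dmd numbers, building with dmd -O -release -inline would make the comparison with ldc and Python fairer than a plain debug build.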