Re: Can I speed up this log parsing script further?

rikki cattermole via Digitalmars-d-learn Fri, 09 Jun 2017 00:55:59 -0700

On 09/06/2017 8:34 AM, uncorroded wrote:

Hi guys,
I am a beginner in D. As a project, I converted a log-parsing script in Python, which we use at work, to D. This link was helpful: https://dlang.org/blog/2017/05/24/faster-command-line-tools-in-d/ . I compiled it with dmd and ldc. The log file is 52 MB. With dmd (not a release build), it takes 1.1 sec and with ldc, it takes 0.3 sec.
The Python script (run with system Python, not PyPy) takes 0.75 sec. The D and Python functions are here and on pastebin (D - https://pastebin.com/SeUR3wFP , Python - https://pastebin.com/F5JbfBmE ).
Basically, I am reading a line and checking for 2 constants. If either one is found, some processing is done on the line and the result is stored in an array for later analysis. I tried reading the file entirely in one go using std.file : readText and using std.algorithm : splitter for lazily splitting on newlines, but there was no difference in speed, so I used the byLine approach mentioned in the linked blog. Is there a better way of doing this in D?
Note:
I ran GC profiling as mentioned in linked blog. The results were:

Number of collections:  3
     Total GC prep time:  0 milliseconds
     Total mark time:  0 milliseconds
     Total sweep time:  0 milliseconds
     Total page recovery time:  0 milliseconds
     Max Pause Time:  0 milliseconds
     Grand total GC time:  2 milliseconds
GC summary:   12 MB,    3 GC    2 ms, Pauses    0 ms <    0 ms

So GC does not seem to be an issue.

Here's the D script:

import std.stdio;
import std.string;
import std.array;
import std.algorithm : splitter;
import std.typecons : tuple, Tuple;
import std.conv : to;

void read_log(string filename) {
     File file = File(filename, "r");
     Tuple!(char[], int, char[])[] npushed;
     Tuple!(int, char[], int, bool, bool)[] pushed;
     foreach (line; file.byLine) {
         if (line.indexOf("SOC_NOT_PUSHED") != -1) {
             auto tarr = line.split();
             npushed ~= tuple(tarr[2] ~ tarr[3], to!int(tarr[$ - 1]), tarr[$ - 2]);
             continue;
         }
         if (line.indexOf("SYNC_PUSH:") != -1) {
             auto rel = line.split("SYNC_PUSH:")[1].strip();
             auto att = rel.split(" at ");
             auto ina = att[1].split(" in ");
             auto msa = ina[1].split(" ms ");
             pushed ~= tuple(to!int(att[0]), ina[0], to!int(msa[0]),
                             msa[1].indexOf("PA-SOC_POP") != -1,
                             msa[1].indexOf("CU-SOC_POP") != -1);
         }
     }
     // Using the arrays later on in production script
     writeln(npushed.length);
     writeln(pushed.length);
}


Here is Python function:

def read_log(fname):
     try:
         with open(fname, 'r') as f:
             raw = f.read().splitlines()
             ns = [s.split() for s in raw if 'SOC_NOT_PUSHED' in s]
             ss = [w.split("SYNC_PUSH:")[1].strip() for w in raw if 'SYNC_PUSH:' in w]
             not_pushed = [[s[2]+s[3], int(s[-1]), s[-2]] for s in ns]
             ww = [(int(e.split(' at ')[0]),
                    e.split(' at ')[1].split('in ')[0],
                    int(e.split(' at ')[1].split(' in ')[1].split(' ms ')[0]),
                    set(e.split(' at ')[1].split(' in ')[1].split(' ms ')[1].split()))
                   for e in ss]
             pushed = [[w[0], w[1], w[2],
                        1 if 'PA-SOC_POP' in w[3] else 0,
                        1 if 'CU-SOC_POP' in w[3] else 0] for w in ww]
             return not_pushed, pushed
     except:
         return []

The code isn't entirely 1:1. Any use of IO (including stdout via writeln) is expensive. Your Python code doesn't write anything to stdout (or perform any calls). It would also be good to get the results for dmd -release as well.
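
One way to bring the two closer to 1:1 might be to give the Python side the same single-pass shape as the D loop and to time only the parsing, keeping any printing outside the timed region. The following is a minimal sketch under assumptions, not your production code: the names parse_lines, timed_parse and SAMPLE are hypothetical, the two sample lines are invented to fit the split patterns in your script (the real log format is not shown in this thread), and the substring checks follow the D indexOf logic rather than the set lookup in your Python version.

```python
import time

# Invented sample lines; the real log format is not shown in the thread.
SAMPLE = [
    "2017-06-09 00:55:59 host1 SOC_NOT_PUSHED queueA 17",
    "2017-06-09 00:56:00 SYNC_PUSH: 42 at nodeB in 130 ms PA-SOC_POP",
    "2017-06-09 00:56:01 unrelated line",
]

def parse_lines(lines):
    """Single pass over the lines, mirroring the structure of the D loop."""
    not_pushed, pushed = [], []
    for line in lines:
        if "SOC_NOT_PUSHED" in line:
            parts = line.split()
            not_pushed.append((parts[2] + parts[3], int(parts[-1]), parts[-2]))
            continue
        if "SYNC_PUSH:" in line:
            rel = line.split("SYNC_PUSH:")[1].strip()
            att = rel.split(" at ")
            ina = att[1].split(" in ")
            msa = ina[1].split(" ms ")
            # Substring tests, as in the D version (indexOf != -1).
            pushed.append((int(att[0]), ina[0], int(msa[0]),
                           "PA-SOC_POP" in msa[1], "CU-SOC_POP" in msa[1]))
    return not_pushed, pushed

def timed_parse(lines):
    """Time only the parsing; no stdout writes happen inside the timed region."""
    start = time.perf_counter()
    result = parse_lines(lines)
    elapsed = time.perf_counter() - start
    return result, elapsed

if __name__ == "__main__":
    (not_pushed, pushed), elapsed = timed_parse(SAMPLE)
    # Output happens after the clock has stopped.
    print(len(not_pushed), len(pushed), elapsed)
```

The same split could be applied on the D side by stopping the timer before the writeln calls, so that both measurements cover parsing only.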
