Re: Searching through two logfiles in parallel?

2013-01-08 Thread darnold
i don't think in iterators (yet), so this is a bit wordy.
same basic idea, though: for each message (set of parameters), build a
list of transactions consisting of matching send/receive times.

mildly tested:


from datetime import datetime, timedelta

sendData = '''\
05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9
05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:10 Message sent - Value A: 3.0, Value B: 0.4, Value C: 5.4   #orphan
05:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4
07:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4
'''

receiveData = '''\
05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9
05:00:12 Message received - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:15 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4
07:00:18 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4
07:00:30 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4   #orphan
07:00:30 Message received - Value A: 17.0, Value B: 0.4, Value C: 5.4  #orphan
'''

def parse(line):
    timestamp, rest = line.split(' Message ')
    action, params = rest.split(' - ')
    params = params.split('#')[0]
    return timestamp.strip(), params.strip()

def isMatch(sendTime, receiveTime, maxDelta):
    if sendTime is None:
        return False

    sendDT = datetime.strptime(sendTime, '%H:%M:%S')
    receiveDT = datetime.strptime(receiveTime, '%H:%M:%S')
    return receiveDT - sendDT <= maxDelta

results = {}

for line in sendData.split('\n'):
    if not line.strip():
        continue

    timestamp, params = parse(line)
    if params not in results:
        results[params] = [{'sendTime': timestamp, 'receiveTime': None}]
    else:
        results[params].append({'sendTime': timestamp, 'receiveTime': None})

for line in receiveData.split('\n'):
    if not line.strip():
        continue

    timestamp, params = parse(line)
    if params not in results:
        results[params] = [{'sendTime': None, 'receiveTime': timestamp}]
    else:
        for tranNum, transaction in enumerate(results[params]):
            if isMatch(transaction['sendTime'], timestamp, timedelta(seconds=5)):
                results[params][tranNum]['receiveTime'] = timestamp
                break
        else:
            results[params].append({'sendTime': None, 'receiveTime': timestamp})

for params in sorted(results):
    print params
    for transaction in results[params]:
        print '\t%s' % transaction


>>> ============================== RESTART ==============================

Value A: 1.0, Value B: 0.4, Value C: 5.4
    {'sendTime': '05:00:14', 'receiveTime': '05:00:15'}
    {'sendTime': '07:00:14', 'receiveTime': '07:00:18'}
    {'sendTime': None, 'receiveTime': '07:00:30'}
Value A: 17.0, Value B: 0.4, Value C: 5.4
    {'sendTime': None, 'receiveTime': '07:00:30'}
Value A: 3.0, Value B: 0.4, Value C: 5.4
    {'sendTime': '05:00:10', 'receiveTime': None}
Value A: 3.3, Value B: 4.3, Value C: 2.3
    {'sendTime': '05:00:08', 'receiveTime': '05:00:12'}
Value A: 5.6, Value B: 6.2, Value C: 9.9
    {'sendTime': '05:00:06', 'receiveTime': '05:00:09'}


HTH,
Don
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Searching through two logfiles in parallel?

2013-01-08 Thread Oscar Benjamin
On 8 January 2013 19:16, darnold <darnold992...@yahoo.com> wrote:
 i don't think in iterators (yet), so this is a bit wordy.
 same basic idea, though: for each message (set of parameters), build a
 list of transactions consisting of matching send/receive times.

The advantage of an iterator based solution is that we can avoid
loading all of both log files into memory.
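
For example, each file can be wrapped in a generator that parses
lazily, so only the current line is ever held in memory (untested
sketch; parse() is the helper from the quoted code):

def parsed_lines(path):
    # nothing beyond the current line is kept in memory
    with open(path) as logfile:
        for line in logfile:
            if ' Message ' in line:  # skip unrelated loglines
                yield parse(line)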

[SNIP]

 results = {}

 for line in sendData.split('\n'):
     if not line.strip():
         continue

     timestamp, params = parse(line)
     if params not in results:
         results[params] = [{'sendTime': timestamp, 'receiveTime': None}]
     else:
         results[params].append({'sendTime': timestamp, 'receiveTime': None})
[SNIP]

This kind of logic is made a little easier (and more efficient) if you
use a collections.defaultdict instead of a dict since it saves needing
to check if the key is in the dict yet. Example:

>>> import collections
>>> results = collections.defaultdict(list)
>>> results
defaultdict(<type 'list'>, {})
>>> results['asd'].append(1)
>>> results
defaultdict(<type 'list'>, {'asd': [1]})
>>> results['asd'].append(2)
>>> results
defaultdict(<type 'list'>, {'asd': [1, 2]})
>>> results['qwe'].append(3)
>>> results
defaultdict(<type 'list'>, {'qwe': [3], 'asd': [1, 2]})
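
Applied to the send loop quoted above, the whole membership test
collapses to a single append (untested):

import collections

results = collections.defaultdict(list)
for line in sendData.split('\n'):
    if not line.strip():
        continue
    timestamp, params = parse(line)
    # a missing key is created as an empty list automatically
    results[params].append({'sendTime': timestamp, 'receiveTime': None})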


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Searching through two logfiles in parallel?

2013-01-07 Thread Victor Hooi
Hi,

I'm trying to compare two logfiles in Python.

One logfile will have lines recording the message being sent:

05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9

the other logfile has lines recording the message being received:

05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9

The goal is to compare the time stamp between the two - we can safely assume 
the timestamp on the message being received is later than the timestamp on 
transmission.

If it was a direct line-by-line, I could probably use itertools.izip(), right?
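
Something like this, I guess (untested; compare_stamps here is just a
stand-in for whatever the actual comparison would be):

import itertools

with open('sent.log') as sent, open('received.log') as received:
    for sent_line, received_line in itertools.izip(sent, received):
        compare_stamps(sent_line, received_line)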

However, it's not a direct line-by-line comparison of the two files - the lines 
I'm looking for are interspersed among other loglines, and the time difference 
between sending/receiving is quite variable.

So the idea is to iterate through the sending logfile - then iterate through 
the receiving logfile from that timestamp forwards, looking for the matching 
pair. Obviously I want to minimise the amount of back-and-forth through the file.

Also, there is a chance that certain messages could get lost - so I assume 
there's a threshold after which I want to give up searching for the matching 
received message, and then just try to resync to the next sent message.

Is there a Pythonic way, or some kind of idiom that I can use to approach this 
problem?

Cheers,
Victor
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Searching through two logfiles in parallel?

2013-01-07 Thread Oscar Benjamin
On 7 January 2013 22:10, Victor Hooi <victorh...@gmail.com> wrote:
 Hi,

 I'm trying to compare two logfiles in Python.

 One logfile will have lines recording the message being sent:

 05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9

 the other logfile has lines recording the message being received:

 05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9

 The goal is to compare the time stamp between the two - we can safely assume 
 the timestamp on the message being received is later than the timestamp on 
 transmission.

 If it was a direct line-by-line, I could probably use itertools.izip(), right?

 However, it's not a direct line-by-line comparison of the two files - the 
 lines I'm looking for are interspersed among other loglines, and the time 
 difference between sending/receiving is quite variable.

 So the idea is to iterate through the sending logfile - then iterate through 
 the receiving logfile from that timestamp forwards, looking for the matching 
 pair. Obviously I want to minimise the amount of back-and-forth through the file.

 Also, there is a chance that certain messages could get lost - so I assume 
 there's a threshold after which I want to give up searching for the matching 
 received message, and then just try to resync to the next sent message.

 Is there a Pythonic way, or some kind of idiom that I can use to approach 
 this problem?

Assuming that you can impose a maximum time between the send and
receive timestamps, something like the following might work
(untested):

def find_matching(logfile1, logfile2, maxdelta):
    buf = {}
    logfile2 = iter(logfile2)
    for msg1 in logfile1:
        if msg1.key in buf:
            yield msg1, buf.pop(msg1.key)
            continue
        maxtime = msg1.time + maxdelta
        for msg2 in logfile2:
            if msg2.key == msg1.key:
                yield msg1, msg2
                break
            buf[msg2.key] = msg2
            if msg2.time > maxtime:
                break
        else:
            yield msg1, 'No match'
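
Hypothetical usage, assuming each argument is an iterable of message
objects with .time (a datetime) and .key attributes (building those
objects is left out here):

from datetime import timedelta

for sent, received in find_matching(sends, receives, timedelta(seconds=5)):
    if received == 'No match':
        print sent.key, 'was never received'
    else:
        print sent.key, 'round trip:', received.time - sent.time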


Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Searching through two logfiles in parallel?

2013-01-07 Thread Victor Hooi
Hi Oscar,

Thanks for the quick reply =).

I'm trying to understand your code properly, and it seems like for each line in 
logfile1, we loop through all of logfile2?

The idea was that it would remember its position in logfile2 as well - since 
we can assume that the loglines are in chronological order - we only need to 
search forwards in logfile2 each time, not from the beginning each time.

So for example - logfile1:

05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9 
05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4

logfile2:

05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9 
05:00:12 Message received - Value A: 3.3, Value B: 4.3, Value C: 2.3
05:00:15 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4

The idea is that I'd iterate through logfile1 - I'd get the 05:00:06 logline - 
I'd search through logfile2 and find the 05:00:09 logline.

Then, back in logfile1 I'd find the next logline at 05:00:08. Then in logfile2, 
instead of searching back from the beginning, I'd start from the next line, 
which happens to be 05:00:12.

In reality, I'd need to handle missing messages in logfile2, but that's the 
general idea.

Does that make sense? (There's also a chance I've misunderstood your buf code, 
and it does do this - in that case, I apologise - is there any chance you could 
explain it, please?)

Cheers,
Victor

On Tuesday, 8 January 2013 09:58:36 UTC+11, Oscar Benjamin wrote:
 On 7 January 2013 22:10, Victor Hooi <victorh...@gmail.com> wrote:
 [SNIP]
 Assuming that you can impose a maximum time between the send and
 receive timestamps, something like the following might work
 (untested):

 def find_matching(logfile1, logfile2, maxdelta):
     buf = {}
     logfile2 = iter(logfile2)
     for msg1 in logfile1:
         if msg1.key in buf:
             yield msg1, buf.pop(msg1.key)
             continue
         maxtime = msg1.time + maxdelta
         for msg2 in logfile2:
             if msg2.key == msg1.key:
                 yield msg1, msg2
                 break
             buf[msg2.key] = msg2
             if msg2.time > maxtime:
                 break
         else:
             yield msg1, 'No match'

 Oscar
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Searching through two logfiles in parallel?

2013-01-07 Thread Oscar Benjamin
On 7 January 2013 23:41, Victor Hooi <victorh...@gmail.com> wrote:
 Hi Oscar,

 Thanks for the quick reply =).

 I'm trying to understand your code properly, and it seems like for each line 
 in logfile1, we loop through all of logfile2?

No we don't. It iterates once through both files but keeps a buffer of
lines that are within maxdelta time of the current message.

The important line is the one that calls iter(logfile2). Since logfile2
is replaced by an iterator, when we break out of the inner for loop and
later re-enter it, our place in the iterator is preserved. If you can
follow the interactive session below it should make sense:

>>> a = [1,2,3,4,5]
>>> for x in a:
...     print x,
...
1 2 3 4 5
>>> for x in a:
...     print x,
...
1 2 3 4 5
>>> it = iter(a)
>>> next(it)
1
>>> for x in it:
...     print x,
...
2 3 4 5
>>> next(it)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
>>> for x in it:
...     print x,
...
>>> it = iter(a)
>>> for x in it:
...     print x,
...     if x == 2: break
...
1 2
>>> for x in it:
...     print x,
...
3 4 5


I'll repeat the code (with a slight fix):


def find_matching(logfile1, logfile2, maxdelta):
    buf = {}
    logfile2 = iter(logfile2)
    for msg1 in logfile1:
        if msg1.key in buf:
            yield msg1, buf.pop(msg1.key)
            continue
        maxtime = msg1.time + maxdelta
        for msg2 in logfile2:
            if msg2.key == msg1.key:
                yield msg1, msg2
                break
            buf[msg2.key] = msg2
            if msg2.time > maxtime:
                yield msg1, 'No match'
                break
        else:
            yield msg1, 'No match'
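
To feed the log format from earlier in the thread into this, a small
adapter could build the message objects (untested sketch; Msg and
messages are names I'm making up here):

from collections import namedtuple
from datetime import datetime

Msg = namedtuple('Msg', ['time', 'key'])

def messages(lines):
    # adapt raw loglines to the .time/.key objects find_matching expects
    for line in lines:
        if ' Message ' not in line:
            continue  # skip unrelated loglines
        stamp, rest = line.split(' Message ', 1)
        key = rest.split(' - ', 1)[1].strip()
        yield Msg(datetime.strptime(stamp.strip(), '%H:%M:%S'), key)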


Oscar



 [SNIP]
-- 
http://mail.python.org/mailman/listinfo/python-list