how to find difference in number of characters

2010-10-09 Thread harryos
hi
I am trying to write a compare method which takes two strings and find
how many characters have changed.


def compare_strings(s1,s2):
pass


text1=goat milk
text2=cow milk
print compare_strings(text1,text2)

This must give 3 ,since 3 characters are changed between strings.I was
advised to use levenshtein algorithm ..but then the matrix ops take a
long time for nontrivial strings of say 2 characters ..Can this
comparison be implemented using difflib module?..I am at a loss as to
how to implement this using difflib  .Is there some way I can get the
difference as a number ?
Can somebody help?
thanks
harry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to find difference in number of characters

2010-10-09 Thread Peter Otten
harryos wrote:

 I am trying to write a compare method which takes two strings and find
 how many characters have changed.
 
 
 def compare_strings(s1,s2):
 pass
 
 
 text1=goat milk
 text2=cow milk
 print compare_strings(text1,text2)
 
 This must give 3 ,since 3 characters are changed between strings.I was
 advised to use levenshtein algorithm ..but then the matrix ops take a
 long time for nontrivial strings of say 2 characters ..Can this

I tried it with a string of 2 chars and the python-levenshtein package 
that comes with ubuntu. It took about one second to calculate the distance:

import functools
import random
import string
import time

from Levenshtein import distance

def make_strings(n, delete, insert, swap, replace, 
charset=string.ascii_letters):
def gen_chars():
while True:
yield random.choice(charset)
chars = gen_chars()
a = [next(chars) for i in xrange(n)]
s = .join(a)
for i in range(delete):
del a[random.randrange(len(a))]
for i in range(insert):
a.insert(random.randrange(len(a)+1), next(chars))
for i in range(swap):
p = random.randrange(len(a)-1)
a[p], a[p+1] = a[p+1], a[p]
for i in range(replace):
a[random.randrange(len(a))] = next(chars)
t = .join(a)
return s, t

N = 2
M = 100
ms = functools.partial(make_strings, N, M, M, M, M)

def measure(f, *args):
start = time.time()
try:
return f(*args)
finally:
print time.time() - start

if __name__ == __main__:
import sys
args = sys.argv[1:]
if args:
N, M = map(int, args)

s, t = make_strings(N, M, M, M, M)
print measure(distance, s, t)

$ python levenshtein_demo.py 1 1000
0.225363969803
3644
$ python levenshtein_demo.py 2 1000
1.05217313766
4197
$ python levenshtein_demo.py 3 1000
2.38736391068
4390
$ python levenshtein_demo.py 4 1000
4.1686527729
4558

What would be an acceptable time?

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to find difference in number of characters

2010-10-09 Thread harryos
On Oct 9, 2:45 pm, Peter Otten __pete...@web.de wrote:


 What would be an acceptable time?


Thanks for the reply Peter,
I was using python functions I came across the net..not cpython
implementations..Probably my low config machine is also to blame..(I
am no expert at judging algorithm performance either),but is there a
way I can use difflib module to do this job?Even though I went through
the docs I couldn't make out how..

regards
harry

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to find difference in number of characters

2010-10-09 Thread Peter Otten
harryos wrote:

 but is there a way I can use difflib module to do this job?

I'm afraid I can't help you with that. 

You might get more/better answers if you tell us more about the context of 
the problem and add some details that may be relevant.

Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to find difference in number of characters

2010-10-09 Thread harryos
On Oct 9, 4:52 pm, Peter Otten __pete...@web.de wrote:


 You might get more/better answers if you tell us more about the context of
 the problem and add some details that may be relevant.

 Peter

I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text..I was wondering if difflib could offer something
in the way of determining the size of delta..
Thanks again for the reply..
harry
-- 
http://mail.python.org/mailman/listinfo/python-list


Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread Stefan Behnel

harryos, 09.10.2010 14:24:

I am trying to determine if a wep page is updated by x number of
characters..Mozilla firefox plugin 'update scanner' has a similar
functionality ..A user can specify the x ..I think this would be done
by reading from the same url at two different times and finding the
change in body text.


Number of characters sounds like a rather useless measure here. I'd 
rather apply an XPath, CSS selector or PyQuery expression to the parsed 
page and check if the interesting subtree of it has changed at all or not, 
potentially disregarding any structural changes by stripping all tags and 
normalising the resulting text to ignore whitespace and case differences.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread harryos
On Oct 9, 5:41 pm, Stefan Behnel stefan...@behnel.de wrote:

 Number of characters sounds like a rather useless measure here.

What I meant by number of characters was the number of edits happened
between the two versions..Levenshtein distance may be one way for
this..but I was wondering if difflib could do this
regards

harry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread Emmanuel Surleau
 On Oct 9, 5:41 pm, Stefan Behnel stefan...@behnel.de wrote:
  Number of characters sounds like a rather useless measure here.
 
 What I meant by number of characters was the number of edits happened
 between the two versions..Levenshtein distance may be one way for
 this..but I was wondering if difflib could do this
 regards

As pointed out above, you also need to consider how the structure of the web 
page has changed. If you are only looking at plain text, the Levenshtein 
distance measures the number of edit operations (insertion, deletion or 
substition) necessary to transform string A into string B.

Cheers,

Emm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread Seebs
On 2010-10-09, harryos oswald.ha...@gmail.com wrote:
 What I meant by number of characters was the number of edits happened
 between the two versions..

Consider two strings:

Hello, world!

Yo, there.

What is the number of edits happened between the two versions?  It could
be:

* Zero.  I just typed them both from scratch, no editing occurred between
  them.
* Two.  Two words are different.
* Ten or so -- counting changed characters.
* Three.  Two words and a punctuation mark are different.

In other words, your problem here is that you haven't actually described
what you want.  Slow down.  Think!  Describe what you want clearly enough
that any other person who reads your description can always come up with
the same answer you would for a given set of inputs.

-s
-- 
Copyright 2010, all wrongs reversed.  Peter Seebach / usenet-nos...@seebs.net
http://www.seebs.net/log/ -- lawsuits, religion, and funny pictures
http://en.wikipedia.org/wiki/Fair_Game_(Scientology) -- get educated!
I am not speaking for my employer, although they do rent some of my opinions.
-- 
http://mail.python.org/mailman/listinfo/python-list


Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread geremy condra
On Sat, Oct 9, 2010 at 10:47 AM, Seebs usenet-nos...@seebs.net wrote:
 On 2010-10-09, harryos oswald.ha...@gmail.com wrote:
 What I meant by number of characters was the number of edits happened
 between the two versions..

 Consider two strings:

 Hello, world!

 Yo, there.

 What is the number of edits happened between the two versions?  It could
 be:

 * Zero.  I just typed them both from scratch, no editing occurred between
  them.
 * Two.  Two words are different.
 * Ten or so -- counting changed characters.
 * Three.  Two words and a punctuation mark are different.

 In other words, your problem here is that you haven't actually described
 what you want.  Slow down.  Think!  Describe what you want clearly enough
 that any other person who reads your description can always come up with
 the same answer you would for a given set of inputs.

 -s

He mentioned L distance earlier, I'm sure he means 'number of edits'
in that context...

Geremy Condra
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: how to find difference in number of characters

2010-10-09 Thread Diez B. Roggisch
harryos oswald.ha...@gmail.com writes:

 On Oct 9, 4:52 pm, Peter Otten __pete...@web.de wrote:


 You might get more/better answers if you tell us more about the context of
 the problem and add some details that may be relevant.

 Peter

 I am trying to determine if a wep page is updated by x number of
 characters..Mozilla firefox plugin 'update scanner' has a similar
 functionality ..A user can specify the x ..I think this would be done
 by reading from the same url at two different times and finding the
 change in body text..I was wondering if difflib could offer something
 in the way of determining the size of delta..

If you normalize the data, this might be worth trying.

Make all tags appear on one single line, possibly re-order attributes so
that they are in alphabetical order. Each text child git's also
normalized, by replacing all whitespace with a single space.

Then run difflib over these, and count the number of diffrences.


Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Testing for changes on a web page (was: how to find difference in number of characters)

2010-10-09 Thread D'Arcy J.M. Cain
On 09 Oct 2010 17:47:56 GMT
Seebs usenet-nos...@seebs.net wrote:
 In other words, your problem here is that you haven't actually described
 what you want.  Slow down.  Think!  Describe what you want clearly enough
 that any other person who reads your description can always come up with
 the same answer you would for a given set of inputs.

Better yet, write the unit tests for us.

-- 
D'Arcy J.M. Cain da...@druid.net |  Democracy is three wolves
http://www.druid.net/darcy/|  and a sheep voting on
+1 416 425 1212 (DoD#0082)(eNTP)   |  what's for dinner.
-- 
http://mail.python.org/mailman/listinfo/python-list