Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
My Py solution:

=

import psyco
psyco.full()

def foo(s):
n = len(s)
s = s + ' '
a = [[] for i in xrange(128)]
ans = 0
for i in xrange(n - 1, -1, -1):
lev = 0
for st in xrange(len(a[ord(s[i])]) - 1, -1, -1):
j = a[ord(s[i])][st]
if (n - j = lev):
break
if (s[j + lev] != s[i + lev]):
continue
if (s[j + lev / 2] != s[i + lev / 2]):
continue
k = 0
while (s[j + k] == s[i + k]):
k += 1
if (k  lev):
lev = k
a[ord(s[i])] += [i]
ans += n - i - lev
return ans


from time import time
t = time()

import sys
sys.stdin = open('D:/88.txt', 'rt')
f = sys.stdin.read().split()
sys.stdin.close()

z = open('D:/99.txt', 'wt')

for tc in range(int(f[0])):
s = f[tc + 1]
print  z, foo(s)

print  z, time() - t
z.close()


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
This worked out in 5.28s
Imo it's not that *much* slower
(of course, Psyco can't help here)
===

import itertools
def subs(s):
return len(set(itertools.chain(
s[i:j]
for i in xrange(len(s))
for j in xrange(i, len(s)+1 - 1


from time import time
t = time()

import sys
sys.stdin = open('D:/88.txt', 'rt')
f = sys.stdin.read().split()
sys.stdin.close()

z = open('D:/99.txt', 'wt')

for tc in range(int(f[0])):
s = f[tc + 1]
print  z, subs(s)

print  z, time() - t
z.close()

==
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread Bearophile
n00m:
 This worked out in 5.28s
 Imo it's not that *much* slower
 (of course, Psyco can't help here)

There's no need of a chain here, so you can rewrite this:

import itertools
def subs(s):
return len(set(itertools.chain(
s[i:j]
for i in xrange(len(s))
for j in xrange(i, len(s)+1 - 1

as:

def subs(s):
return len(set(s[i:j] for i in xrange(len(s))
  for j in xrange(i, len(s)+1))) - 1


Psyco helps a little (about 10% on my PC) if you use it like this:

from time import time
import psyco; psyco.full()

def main():
t = time()
fin = open(input.txt)
fout = open(output.txt, w)
fin.next()
for line in fin:
r = set()
s = line.rstrip()
len_s = len(s)
for i in xrange(len_s):
for j in xrange(i, len_s + 1):
r.add(s[i : j])
print  fout, len(r) - 1
fout.close()
print time() - t

main()

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread Bearophile
n00m:
 My Py solution:
 ...

Cute. You can replace:
a[ord(s[i])] += [i]
With:
a[ord(s[i])].append(i)

If you want to micro-optimize this with Psyco may be a little faster:
(lev  1)
Than:
lev // 2

But to increase speed it's usually better to reduce memory
allocations. So you can try to find a way to pull this out of the
function:
a = [[] for i in xrange(128)]
And to avoid a true append:
a[ord(s[i])] += [i]

(There are many ways to avoid an append. One of them is to have an
already allocated list/array and just bump forward an index variable
that starts from 0. This is worth doing especially when you use
Psyco).

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread Bearophile
n00m:
 my home tests proved Python is a fellow fast beast of C++.
 Quite unexpected (I expected Py would be by ~10 times slower).
 PS
 Both my codes are identical in their algorithms.

 =
 0.016         0.0150001049042   --- exec times

Maybe in your C++ code there's something that can be improved, this is
a 1:1 translation to D (V.1) language (using dlibs) and it's about 2.2
times faster than the Psyco version:
http://codepad.org/MQLj0ydB
Using a smarter usage of memory (that is avoiding all or most memory
allocations inside all loops), the performance difference will surely
grow.

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
On Nov 29, 3:15 pm, Bearophile bearophileh...@lycos.com wrote:
 Maybe in your C++ code there's something that can be improved, this is
 a 1:1 translation to D (V.1) language (using dlibs) and it's about 2.2
 times faster than the Psyco version:http://codepad.org/MQLj0ydB
 Using a smarter usage of memory (that is avoiding all or most memory
 allocations inside all loops), the performance difference will surely
 grow.

Very interesting. Thanks.
D code looks pretty neat. Btw D lang is among accepted langs there.
Even if Py by 4x *slower* -- it's still a perfect Ok for it (C# will
be
much (much) slower than Python).

Micro-optimizations.
Of course, the best optimization would be to implement Suffix Tree:
http://en.wikipedia.org/wiki/Trie
Currently I hardly understand/know/read/etc its core idea. My algo is
plainly stupid as soldier muddy boots.

My C++ code:

BEGIN
#include stdlib.h
#include stdio.h
//#include string.h
//#include ctype.h
//#include math.h
#include time.h
#include iostream
#include vector
#include string
using namespace std;

int main() {
clock_t start_time = clock();
freopen(88.txt, rt, stdin);
freopen(99.txt, wt, stdout);

int tcs;
string s;
cin  tcs;
while (tcs--  0) {
cin  s;
int n = s.size();
s = s + ' ';
vector vectorint  a(128);
int ans = 0;
for (int i = n - 1; i = 0; --i) {
int lev = 0;
for (int st = (int)a[s[i]].size() - 1; st = 0; --st) {
int j = a[s[i]][st];
if (n - j = lev) break;
if (s[j + lev] != s[i + lev]) continue;
if (s[j + lev / 2] != s[i + lev / 2]) continue;
int k = 0;
while (s[j + k] == s[i + k]) ++k;
if (k  lev) lev = k;
}
a[s[i]].push_back(i);
ans += n - i - lev;
}
cout  ans  endl;
}

cout  (clock() - start_time) / CLOCKS_PER_SEC  endl;

return 0;
}
END







-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
http://en.wikipedia.org/wiki/Suffix_tree

Looks not very friendly appealing :-)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread inhahe
On Sun, Nov 29, 2009 at 9:10 AM, n00m n...@narod.ru wrote:


 Even if Py by 4x *slower* -- it's still a perfect Ok for it (C# will
 be
 much (much) slower than Python).


How do you figure? As far as I know C# is many, many times faster than
Python.  (i was disappointed to find out that even IronPython is dozens of
times slower than other .net langs)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
Tested both my codes against
a random string of length = 1.

===

from random import choice
s = ''
for i in xrange(1):
s += choice(('a','b','c','d','e','f'))

===

C++: ~0.28s
Python: ~0.48s

PS
I suspect that building of Suffix Tree would
be a big exec.time-consuming overhead
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread Bearophile
n00m:

 I suspect that building of Suffix Tree would
 be a big exec.time-consuming overhead

In C/D/C++ there are ways to allocate memory in smarter ways, using
pools, arenas, stacks, freelists, etc. With that, using a compact
encoding of tree nodes (there are many different tricks to reduce
space used by a suffix tree), you can go fast enough. Probably a basic
implementation too will suffice in many cases :-)

Anyway, you may try a pure-Python2.x implementation:
http://suffixtree.googlecode.com/files/suffixtree-0.1.py
If you find time  will to try to use that (or similar code) to
improve your Python solution to the problem 99, you can show us the
code here...

Bye,
bearophile
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Number of distinct substrings of a string [continuation]

2009-11-29 Thread n00m
On Nov 29, 11:43 pm, Bearophile bearophileh...@lycos.com wrote:
 Anyway, you may try a pure-Python2.x 
 implementation:http://suffixtree.googlecode.com/files/suffixtree-0.1.py

Ouch, Bearie... 214 lines of code are there.
Ok.. I tested it.
Firstly I replaced all print  with pass##print 
(aiming to avoid intensive printing from inside of the code).

Then I left in its final part only building of suffix trees.

==
...
...
...

from time import time
t = time()

import sys
sys.stdin = open('D:/88.txt', 'rt')
f = sys.stdin.read().split()
sys.stdin.close()

z = open('D:/99.txt', 'wt')

for tc in range(int(f[0])):
s = f[tc + 1]
test_str = s + '$'
POSITIVE_INFINITY = len(test_str) - 1
suffix_tree = SuffixTree(test_str)
print  z, 'len(s) =', len(s)

print  z, 'time =', time() - t
z.close()


Output:

len(s) = 1000
len(s) = 1000
len(s) = 1
time = 0.64132425


0.64s  0.48s (of my algo)
I.e. the code can't help in my very special and narrow case.

But of course it is worthy to be studied and memoized.
E.g.:
from huge string s we built its Suffix Tree.
Now we are given arbitrary string ss.
Task: to find the largest its prefix occured in s,
traversing its Suffix Tree.




-- 
http://mail.python.org/mailman/listinfo/python-list