On Tue, 20 Jul 2004 17:28:25 -0400, Chris Santerre <[EMAIL PROTECTED]> writes:

> (Scott just letting you know in case you want to improve your already great
> script.)
> 
> I just posted another update and noticed one key thing, there were no ?: in
> the regex! As in 
> 
> (?:a|b|c|d)
> 
OUCH! I added them in, so that should definitely speed things up for people!
> I'm actually running BE again on my server after a small memory upgrade and
> this tweak. 
> 


> Now, for the heck of it I pulled out all the "\.com"'s from the file. It was
> 100k. What I'm looking for is a script to try to fix this. Check every line
> starting with URI; if all that line includes is the .com TLD, then change
> that rule so only one \.com shows up at the end. Does anyone understand what
> I mean AND know how to do this? Or even more advanced, group the whole line
> of regex by ending TLD instead of alpha order.

It could be done, but it's hard to tune an optimizer to know whether to
right-factor or left-factor. Right now it always left-factors.
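To illustrate the difference (toy names, not from the real list), a quick Python sketch; both directions shrink the pattern, and which one wins depends on where the strings actually overlap:

```python
import re

# Left-factoring pulls a shared PREFIX out of the alternation:
# "homeland.com" and "homeless.com" share "homel".
left = re.compile(r"homel(?:and|ess)\.com")

# Right-factoring pulls a shared SUFFIX out instead:
# "homeland.com" and "wasteland.com" share "land.com".
right = re.compile(r"(?:home|waste)land\.com")
```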

I can offer a compromise position to help with common
TLD's... Something like:

# cat moreDomains | ../prefixStringFactor | wc   # Original decomposition version

# (for i in com net biz org us info ; do cat moreDomains | grep "\.$i\$" \
#     | sed -e "s/\\.$i//g" | ../prefixStringFactor | sed -e "s/$/\\.${i}/" ; done)
# cat moreDomains | grep -v -E '\.(com|net|biz|org|us|info)$' | ../prefixStringFactor

Shrinks the regexp size from 400kb to about 300kb for the bigevil
list. The idea is to grep and sed out a few common suffixes,
prefix-factor them, then add the suffixes back on. A second catch-all
pass then gets everything without one of those suffixes.
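The bucketing step can be sketched in Python (a hypothetical helper standing in for the grep/sed plumbing above; the actual factoring is still prefixStringFactor's job):

```python
from collections import defaultdict

def group_by_tld(domains, common=("com", "net", "biz", "org", "us", "info")):
    """Bucket domains by common TLD (suffix stripped); the rest go to a catch-all."""
    buckets = defaultdict(list)
    for d in domains:
        name, _, tld = d.rpartition(".")
        if tld in common:
            buckets[tld].append(name)   # suffix stripped; factor these, then re-append \.tld
        else:
            buckets[None].append(d)     # catch-all bucket, kept whole
    return buckets

b = group_by_tld(["evil.com", "bad.com", "spam.net", "odd.example"])
```

Each non-catch-all bucket gets factored on its own and the literal suffix glued back on, exactly like the shell loop does.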

As a bonus, this should improve performance. As an optimization, Perl
does a literal-string search before it runs the regexp engine, and
that search can check for both a known prefix and a known suffix.

Ex:

 (abcd|a123d|axyzd) is worse than a(bcd|123d|xyzd) is worse than a(bc|123|xyz)d

In the first case, it has to run the regexp engine at all points in the
input. In the second case, it runs only at points in the input starting
with 'a', and in the third, only at points both starting with 'a' and
with a 'd' 2-3 characters later.
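The three forms match exactly the same strings, so the transformation is safe. (The speed claim is about Perl's literal pre-scan, which Python's re won't demonstrate, but the equivalence is easy to check:)

```python
import re

# The three equivalent forms from the example above, least to most factored.
flat     = re.compile(r"(?:abcd|a123d|axyzd)")
prefixed = re.compile(r"a(?:bcd|123d|xyzd)")
both     = re.compile(r"a(?:bc|123|xyz)d")
```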

> The code we are using to make the rule is like one step. I wish I could make
> it 2-4 steps. What do I mean? It could have written these more streamlined:
> 
> It should have SUB sections, but our parser is only one level deep :(
> 
> /\bhomel(?:oanace\.com|andunited\.com|anddefensejournal\.com|anddefenseradio\.com|andsecurityresearch\.com|ead\.net|essprelates\.com|essteens\.com)\b/i

> Could be written
> 
> /\bhomel(?:oanace\.com|and(?:united|defensejournal|defenseradio|securityresearch)\.com|e(?:ad\.net|ss(?:prelates|teens)\.com))\b/i

I doubt this will help much. A subsection requires the perl regexp
engine to create a new node, and incurs extra overhead matching
it. That means it should only be done when enough regexps share a
common prefix that the gain from skipping them en masse outweighs
those overheads. I've guessed that tradeoff to be about 5 or more,
and for those, I already peel them off as separate rules. (See
the cutpointSMALL parameter in the source.)

Thus, I don't think that recursive decomposition is much of a win.
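A one-level, hypothetical analogue of that threshold in Python (the real cutpointSMALL logic lives in the OCaml source and is more general):

```python
from collections import defaultdict

def factor_common_prefixes(words, cutpoint=5):
    """Group words by first character and factor only groups with at
    least `cutpoint` members; smaller groups stay flat, since the new
    regexp node would cost more than it saves."""
    groups = defaultdict(list)
    for w in words:
        groups[w[0]].append(w)
    parts = []
    for head, ws in sorted(groups.items()):
        if len(ws) >= cutpoint:
            parts.append(head + "(?:" + "|".join(w[1:] for w in ws) + ")")
        else:
            parts.extend(ws)   # below the cutpoint: leave unfactored
    return "(?:" + "|".join(parts) + ")"
```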

> And I still can't thank Scott enough for giving me this script in
> the first place. Without which BigEvil would have died with the
> start of ws.surbl.org.

You're welcome. I didn't know it was even being used. When I posted
it, it was greeted with silence. Glad it helped.

Scott

> Scott's source:
> http://www.cs.rice.edu/~scrosby/datamining/src/prefixStringFactor/prefixStringFactor.ml
