On Fri, 23 Jan 2009, Dennis Hardy wrote:

Here is what I have been using (from previous help from this mail list!):

   uri SSS_URI30 /\bhttp:\/\/[^\.\/]+\.(?i:com|net|info|biz)\/\w{30}\b/
   score SSS_URI30 1.5

This uri rule works very well, but they sometimes change the length, so I have a few rules that handle different lengths. Maybe I should use {29,31} instead of just {30}, for example?
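
For illustration only, here is a rough Python sketch of what folding the 29-to-31 range into a single pattern would do. The pattern is the rule body from above with the quantifier changed to {29,31}; the example.com URLs are made up, and Python's re module stands in for the Perl regex engine SpamAssassin uses, so treat it as an approximation rather than a test of the real rule.

```python
import re

# Assumption: same pattern as SSS_URI30, with {30} widened to {29,31}.
URI_RANGE = re.compile(r"\bhttp://[^./]+\.(?i:com|net|info|biz)/\w{29,31}\b")


def matches(length):
    """Return True if a made-up URL with a path of `length` letters matches."""
    url = "http://example.com/" + "a" * length
    return URI_RANGE.search(url) is not None


if __name__ == "__main__":
    for n in range(27, 34):
        print(n, matches(n))
```

Because of the trailing \b, a run of 32 or more word characters does not match, so the one pattern should cover exactly the three lengths and could replace three separate rules.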

Am I being too conservative? Should I consider bumping the score of this up more? And my meta up more perhaps?

Again, I'd have to see more examples to comment meaningfully. I would be especially interested in whether or not the part after the domain name is indeed free from punctuation.

A long string of unpunctuated letters is less likely to FP than a long string of letters, numbers and underscores.

You might want to anchor your rule with a $ as it may FP if there is stuff in the URI following the string of gibberish. Try it against this very legitimate looking (if overly verbose) URI:

  http://fnord.com/retrieve_document_as_pdf3_file.php?123456
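
A quick way to see the difference is to run both forms of the pattern against that URI. The following is a sketch in Python, not SpamAssassin itself: the names ORIGINAL and ANCHORED are just labels for this example, and the "spammy" URL is invented.

```python
import re

# The rule as posted: 30 word characters followed by a word boundary.
ORIGINAL = re.compile(r"\bhttp://[^./]+\.(?i:com|net|info|biz)/\w{30}\b")
# Same rule, but the 30 characters must run to the end of the URI.
ANCHORED = re.compile(r"\bhttp://[^./]+\.(?i:com|net|info|biz)/\w{30}$")

legit = "http://fnord.com/retrieve_document_as_pdf3_file.php?123456"
spammy = "http://fnord.com/" + "abcdefghij" * 3  # invented 30-letter path


def hits(pattern, uri):
    """Return True if the pattern matches anywhere in the URI."""
    return pattern.search(uri) is not None


if __name__ == "__main__":
    print(hits(ORIGINAL, legit))   # True: the file name is exactly 30 \w characters
    print(hits(ANCHORED, legit))   # False: ".php?123456" follows the 30 characters
    print(hits(ANCHORED, spammy))  # True: nothing follows the 30 characters
```

The unanchored rule fires on the legitimate URI because the "." before "php" counts as a word boundary; the anchored one only fires when the gibberish is the last thing in the URI.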

And the rule I suggested makes an attempt to detect gibberish by looking for a "q" that is not followed by a "u", which is rare in English words.
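
The rule itself is not repeated in this message, so the following is only a sketch of the heuristic, not the rule: a negative lookahead that finds a "q" with no "u" after it. The function name and sample strings are made up for the example.

```python
import re

# "q" (either case) that is not immediately followed by "u".
Q_NOT_U = re.compile(r"q(?!u)", re.IGNORECASE)


def has_q_without_u(text):
    """Return True if the text contains a q that is not followed by u."""
    return Q_NOT_U.search(text) is not None


if __name__ == "__main__":
    print(has_q_without_u("quick_question"))                  # False
    print(has_q_without_u("retrieve_document_as_pdf3_file"))  # False
    print(has_q_without_u("xkqzvbnmwq"))                      # True
```

Note that a "q" at the very end of the string also counts, since nothing follows it, so on its own this is a weak signal and is better combined with the length check than scored by itself.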

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhar...@impsec.org    FALaholic #11174     pgpk -a jhar...@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Vista: because the audio experience is *far* more important than
  network throughput.
-----------------------------------------------------------------------
 4 days until Wolfgang Amadeus Mozart's 253rd Birthday
