Re: Url regex keeps django busy/crashing

2012-08-02 Thread Melvyn Sopacua
On 3-8-2012 2:34, Melvyn Sopacua wrote:

Correction:
> url(r'^(?P<item_url>\w[\w-]+-/$', 'detail')
insert question mark here -->  ?

-- 
Melvyn Sopacua

-- 
You received this message because you are subscribed to the Google Groups 
"Django users" group.
To post to this group, send email to django-users@googlegroups.com.
To unsubscribe from this group, send email to 
django-users+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/django-users?hl=en.



Re: Url regex keeps django busy/crashing

2012-08-02 Thread Melvyn Sopacua
On 26-7-2012 16:45, Joe wrote:
> Hey, I have a url regex like this which is keeping django extremely busy 
> (20secs to 1min to handle a request). On some urls it even crashes.
> 
> my regex:
> 
> url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),

Turn the * into a + and you'll see great improvements. I also think
you don't want to match '//' as a valid URL part.
Also, I think this example will satisfy your requirements in practice:

url(r'^(?P<item_url>\w[\w-]+-/$', 'detail')

The only difference is that dashes are allowed to follow each other. I
can only think of one valid reason to not use the above URL and that is
if "multiple dashes" are captured in another URL.
Remember that URL patterns are not your validators. It's nice if you can
prevent a view from being called by carefully constructing your URL
patterns, but if parsing the regex takes longer than calling the view
you lose performance instead.
Also, validating if a URL contains two or more consecutive dashes is
easily done in a view and does not even need regular expressions:
from django.http import Http404

def detail(request, item_url):
    if '--' in item_url:
        raise Http404

You get even more improvement if you keep the URLs lower case (or upper
case, but not mixed case) and use [a-z0-9_] instead of \w.
-- 
Melvyn Sopacua




Re: Url regex keeps django busy/crashing

2012-07-28 Thread Tim Chase
On 07/26/12 09:45, Joe wrote:
> url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),
> 
> replaced with:
> 
> url(r'^(?P<item_url>[\w-]+)/$', 'detail'),

Russell gave you good background on the why (including that Django
was stung by the same issue).  It would help if you more clearly
defined what you wanted to target.  Your first one can match things like

  x-x-x-x-

with trailing dashes, and your second one can match things like

  --  # pure dashes
  ---xxx  # leading dashes
  --xx--  # leading and trailing dashes

I suspect you want an expression something like

  (?P<item_url>\w+(?:-\w+)*)/$

perhaps having a "?" after the terminal slash to make it optional.
This expression is roughly "one or more \words separated by one
dash."  You might change "\w" to "[a-zA-Z]" to ensure you can't
match odd things like

  _-_-_-_

("\w" includes underscores).
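
A quick way to check the suggested expression against the examples
above is to run it with Python's re module. This is only a sketch: the
leading "^" and the group name item_url (taken from the view signature
in this thread) are assumptions, and the candidate paths are made up
for illustration.

```python
import re

# The suggested pattern, anchored at the start of the path.
# Dashes are unambiguous separators here, so there is only one way
# to split any given input into words.
WORDS_WITH_SINGLE_DASHES = re.compile(r'^(?P<item_url>\w+(?:-\w+)*)/$')

candidates = [
    "x-x-x-x/",    # words separated by single dashes
    "x-x-x-x-/",   # trailing dash
    "--/",         # pure dashes
    "---xxx/",     # leading dashes
    "--xx--/",     # leading and trailing dashes
    "_-_-_-_/",    # underscores count as \w
]

for path in candidates:
    matched = WORDS_WITH_SINGLE_DASHES.match(path) is not None
    print(path, matched)
```

Only the first and the last candidate match; swapping "\w" for
"[a-zA-Z]" would reject the last one as well.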

-tkc






Re: Url regex keeps django busy/crashing

2012-07-27 Thread Russell Keith-Magee
On Thu, Jul 26, 2012 at 10:45 PM, Joe wrote:
> Hey, I have a url regex like this which is keeping django extremely busy
> (20secs to 1min to handle a request). On some urls it even crashes.
>
> my regex:
>
> url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),
>
>
> view:
>
> def detail(request, item_url):
>    i = get_object_or_404(Page, url=item_url,published=True)
>    return render_to_response('item/detail.html', {'item':i},
>    context_instance=RequestContext(request))
>
> replaced with:
>
> url(r'^(?P<item_url>[\w-]+)/$', 'detail'),
>
>
> The replacement works like a charm. What is wrong with the first regex?

Hi Joe,

There's nothing strictly *wrong* with the first regex -- it just
describes a very complex lookup strategy, and as a result, it takes
extra time to compute it.

In the second regex, you're asking for "a string of 1 or more
characters that are either word-like or '-'". That's a very easy thing
to check - if you think of how you would manually implement code that
checks that policy, it could be done with a simple if inside a while
loop; as soon as you find a character that doesn't match, you can bail
out.
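
A minimal sketch of that manual check, assuming "word-like" means ASCII
letters, digits and underscore. The function name and the ALLOWED set
are made up for illustration; this is not how the regex engine is
implemented, only the hand-written equivalent of the policy.

```python
import string

# Assumption: "word-like" is taken to mean ASCII letters, digits and "_".
ALLOWED = set(string.ascii_letters + string.digits + "_-")

def is_simple_slug(text):
    """Return True if text is one or more allowed characters."""
    if not text:
        return False
    i = 0
    while i < len(text):
        if text[i] not in ALLOWED:
            # First character that doesn't fit: bail out immediately.
            return False
        i += 1
    return True

print(is_simple_slug("my-page_1"))
print(is_simple_slug("my page"))
```

Each character is looked at once, so the work grows linearly with the
length of the string.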

However, the first regex is asking for "0 or more groups of word-like
characters, each of which might be followed by a '-'". Consider a
trivial case, matching against the string abcde. It can match the
first regex in an incredible number of ways:

(a)(b)(c)(d)(e)
(ab)(c)(d)(e)
(abc)(d)(e)
(abcd)(e)
(abcde)
(a)(bc)(d)(e)
(a)(bcd)(e)
(a)(bcde)
(a)(b)(cde)
…
and so on. Because you're asking the regex to preserve groups, the
algorithm needs to essentially work out every single one of these
groups, and then determine which set will be reported as the actual
match. As you can guess, this can take some time, which you're
observing as a 1 minute delay in serving a URL.
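
To get a feel for the growth, here is a small sketch using Python's re
module, which is a backtracking engine. A run of n word characters can
be cut into groups in 2 ** (n - 1) ways, so every extra character
doubles the count. The cost is easiest to see when the overall match
fails, because the engine then tries the alternative groupings before
giving up. The names count_splits, timed_match and bad_path are
invented for this example, and the timings will vary by machine.

```python
import re
import time

def count_splits(n):
    """Ways to cut a run of n word characters into consecutive groups."""
    # Each of the n - 1 gaps between characters is either a cut or not.
    return 2 ** (n - 1)

SLOW = re.compile(r'^(?P<item_url>(\w+-?)*)/$')
FAST = re.compile(r'^(?P<item_url>[\w-]+)/$')

def timed_match(pattern, path):
    """Return (match object or None, seconds spent matching)."""
    start = time.perf_counter()
    match = pattern.match(path)
    return match, time.perf_counter() - start

# No trailing slash, so neither pattern can match; the slow one only
# finds that out after working through the different groupings.
bad_path = "a" * 16

print(count_splits(5))  # groupings of "abcde"
for name, pattern in (("slow", SLOW), ("fast", FAST)):
    match, seconds = timed_match(pattern, bad_path)
    print(name, match, round(seconds, 4))
```

Lengthening bad_path by a few characters at a time shows the slow
pattern's time roughly doubling per character, while the fast pattern
stays flat.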

This is one of the gotchas that comes from using regular expressions.
They're a very powerful language for expressing constraints, but you
need to be careful that you don't accidentally fall into a trap where
you're asking for something very complex.

And don't worry - you're in good company being bitten by this problem.
There was a Django security release caused *specifically* by a regular
expression like yours. Django uses regular expressions to validate
URLs and email form inputs, and at one point, the regex that was used
to validate email addresses was constructed in such a way that it was
possible to provide a very simple string that would cause the
validator to take 30 seconds to confirm that it wasn't valid. Write a
tool that hits the same URL and validates the same string 100 times,
and you've got yourself a DDOS attack.

So - when you're building your URL patterns, you should be trying to
keep your regular expressions as simple as possible -- i.e., simple
linear probes. If you really do need to match a complex pattern, you'd
be better served using a simple regex in the URL pattern, and then
doing more specific validation in the view (and raising 404 if the
pattern doesn't match what you need it to).
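
As a rough, framework-free sketch of that split: the simple pattern
does the routing, and a plain string check does the stricter
validation. The names is_valid_item_url and resolve are hypothetical,
the specific rules (no leading, trailing or doubled dashes) are only an
example of "more specific validation", and returning 404 stands in for
raising Http404 in a real Django view.

```python
import re

# Simple, linear pattern of the kind used in the URLconf.
SIMPLE = re.compile(r'^(?P<item_url>[\w-]+)/$')

def is_valid_item_url(item_url):
    """Stricter check for the view: no leading, trailing or doubled dashes."""
    return (not item_url.startswith('-')
            and not item_url.endswith('-')
            and '--' not in item_url)

def resolve(path):
    """Stand-in for URL routing followed by the view's own validation."""
    match = SIMPLE.match(path)
    if match is None:
        return 404
    item_url = match.group('item_url')
    if not is_valid_item_url(item_url):
        return 404  # in a real Django view: raise Http404
    return item_url

for path in ("my-page/", "my--page/", "-page/", "bad path/"):
    print(path, resolve(path))
```

Both steps look at each character a bounded number of times, so a
hostile URL cannot make either of them blow up.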

Yours,
Russ Magee %-)




Url regex keeps django busy/crashing

2012-07-26 Thread Joe
Hey, I have a url regex like this which is keeping django extremely busy 
(20secs to 1min to handle a request). On some urls it even crashes.

my regex:

url(r'^(?P<item_url>(\w+-?)*)/$', 'detail'),


view:

def detail(request, item_url):
   i = get_object_or_404(Page, url=item_url,published=True)
   return render_to_response('item/detail.html', {'item':i},
   context_instance=RequestContext(request))

replaced with:

url(r'^(?P<item_url>[\w-]+)/$', 'detail'),


The replacement works like a charm. What is wrong with the first regex?

Thanks in advance.

-- 
To view this discussion on the web visit 
https://groups.google.com/d/msg/django-users/-/lVIrewdZipMJ.