28tommy wrote:
> Hi,
> I'm trying to find scripts in html source of a page retrieved from the
> web.
> I'm trying to use the following rule:
>
> match = re.compile('<script [re.DOTALL]+ src=[re.DOTALL]+>')
>
> I'm testing it on a page that includes the following source:
>
> <script language="JavaScript1.2"
> src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"
> type="text/javascript"></script>
>
> But I get - 'None' as my result.
> Here's (in words) what I'm trying to do: '<script ' followed by any
> type and a number of characters, and then followed by ' src=' followed
> by any type and a number of characters, and then finished by '>'
>
> What am I doing wrong?
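Several things.

First, re.DOTALL is a flag, a _parameter_ to be passed to the compile
function, not something you stick inside the RE itself. Inside a
pattern, [re.DOTALL] is just a character class that matches one of the
characters r, e, ., D, O, T, A or L. Pass the flag as the second
argument instead:

    re.compile('<script .+ src=.+>', re.DOTALL)

Second, this still won't match every script tag. The .+ needs at least
one character and the pattern then wants a literal space before src, so
it fails when src appears immediately after script (as in
<script src="...">), and also when src is preceded by a newline instead
of a space, which may be the case in your example. So you probably want
something like

    re.compile('<script .*src=.+>', re.DOTALL)

Third, IIRC * and + are _greedy_ by default, which means they will
"eat up" as many characters as possible. With DOTALL the match can run
all the way to the last '>' in the page. Try it and see what I mean.
The solution is to use the non-greedy variants, *? and +?:

    re.compile('<script .*?src=.+?>', re.DOTALL)

All this and more at http://docs.python.org/lib/module-re.html and, I'm
sure, several online tutorials. RTFM is never a bad idea.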
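For what it's worth, here is a small sketch comparing the three patterns
discussed above. The sample string and the variable names are made up
for illustration, and the newline before src is only an assumption about
how the page source was wrapped.

```python
import re

# Hypothetical page fragment built around the tag from the question.
# Assumption: src is preceded by a newline, not a space.
html = (
    '<p>before</p>\n'
    '<script language="JavaScript1.2"\n'
    'src="http://i.cnn.net/cnn/.element/ssi/js/1.3/mainVideoMod.js"\n'
    'type="text/javascript"></script>\n'
    '<p>after <b>bold</b></p>\n'
)

# Flag passed correctly, but the pattern still demands a literal space
# right before "src=".
strict = re.compile(r'<script .+ src=.+>', re.DOTALL)

# Greedy quantifiers: the match runs to the last '>' in the string.
greedy = re.compile(r'<script .*src=.+>', re.DOTALL)

# Non-greedy quantifiers: the match stops at the first '>' after src=.
lazy = re.compile(r'<script .*?src=.+?>', re.DOTALL)

strict_result = strict.search(html)
greedy_match = greedy.search(html).group(0)
lazy_match = lazy.search(html).group(0)

print(strict_result)
print(repr(greedy_match))
print(repr(lazy_match))
```

On this sample the first pattern finds nothing, the greedy one swallows
everything up to the end of the last tag in the fragment, and the
non-greedy one stops at the end of the opening script tag.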