Re: need help with re module
On Sat, 23 Jun 2007 01:12:17 -0300, samwyse [EMAIL PROTECTED] wrote:

> Speak for yourself. If I'm writing an HTML syntax checker, I think I'll skip BeautifulSoup and use something that gives me the results that I expect, not the results that you expect.

Sure! By the way, I'm looking for a different sound. I have an Ibanez, but I think the Jackson is far better for thrash metal. The Jackson Kelly Pro series KE3 sounds good, and it's a classic. Maybe as a self-gift for my birthday next month.

-- 
Gabriel Genellina
-- 
http://mail.python.org/mailman/listinfo/python-list
Re: need help with re module
Gabriel Genellina wrote:

> On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler [EMAIL PROTECTED] wrote:
>
>> On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:
>> [snip]
>> I agree that BeautifulSoup is probably the best tool for the job, but this doesn't sound right to me. Since the OP doesn't care about tags being properly nested, I don't see why a regex (albeit a tricky one) wouldn't work. For example:
>> [snip]
>> Granted, this misses out a few things (e.g. DOCTYPE declarations), but those should be straightforward to handle.
>
> It doesn't handle a lot of things. For this input (not very special, just a few simple mistakes):
>
>   <html>
>   <a href="http://foo.com/baz.html>click here</a>
>   <p>What if price<100? You lose.
>   <p>What if HitPoints<-10? You are dead.
>   <p>Assignment: target <-- any_expression
>   Just a few last words.
>   </html>
>
> the BeautifulSoup version gives:
>
>   click here
>   What if price<100? You lose.
>   What if HitPoints<-10? You are dead.
>   Assignment: target <-- any_expression
>   Just a few last words.
>
> and the regular expression version gives:
>
>   <a href="http://foo.com/baz.html>click here
>   What if priceWhat if HitPointsAssignment: target
>
> Clearly the BeautifulSoup version gives the right result, or the expected one. It's hard to get that with only a regular expression; you need more power, and BeautifulSoup fills the gap.

Speak for yourself. If I'm writing an HTML syntax checker, I think I'll skip BeautifulSoup and use something that gives me the results that I expect, not the results that you expect.
Re: need help with re module
On Jun 20, 9:58 am, linuxprog [EMAIL PROTECTED] wrote:

> hello
>
> i have that string "<html>hello</a>world<anytag>ok" and i want to extract all the text, without html tags. the result should be something like this: helloworldok
>
> i have tried that:
>
>   from re import findall
>   chaine = "<html>hello</a>world<anytag>ok"
>   print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>   ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct! what would be the correct regex to use?

This: [^(<.*>)] is a set that contains everything but the characters "(", "<", ".", "*", ">", and ")". It most certainly doesn't do what you want it to. Is it absolutely necessary that you use a regular expression? There are a few HTML parsing libraries out there. The easiest approach using re might be to do a search and replace on all tags: just replace the tags with nothing.

Matt
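To make Matt's point about the character set concrete, here is a small Python 3 sketch. It assumes the original pattern and string contained the angle brackets that the quoted results imply; the variable names are only illustrative.

```python
import re

# Inside [...] the characters ( . * ) are literals, and a leading ^ negates
# the set. This class therefore matches any ONE character that is not one
# of: ( < . * > )
negated_set = re.compile(r'[^(<.*>)]')

excluded = [ch for ch in '(<.*>)' if negated_set.fullmatch(ch) is None]
allowed = [ch for ch in 'a1 /' if negated_set.fullmatch(ch)]
print(excluded)  # ['(', '<', '.', '*', '>', ')']
print(allowed)   # ['a', '1', ' ', '/']

# The attempted pattern thus means: a letter (A-z also admits a few
# punctuation characters), one character outside the set, at least one
# arbitrary character, then a letter.
chaine = "<html>hello</a>world<anytag>ok"
matches = re.findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]', chaine)
print(matches)   # ['html', 'hell', 'worl', 'anyt', 'ag>o']
```

The pattern never describes "a tag" or "text between tags", which is why the matches are four-character fragments.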
Re: need help with re module
On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED] wrote:

> i have that string "<html>hello</a>world<anytag>ok" and i want to extract all the text, without html tags. the result should be something like this: helloworldok
>
> i have tried that:
>
>   from re import findall
>   chaine = "<html>hello</a>world<anytag>ok"
>   print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>   ['html', 'hell', 'worl', 'anyt', 'ag>o']
>
> the result is not correct! what would be the correct regex to use?

You can't use a regular expression for this task (no matter how complicated you write it). Use BeautifulSoup, which can handle invalid HTML like yours:

  py> from BeautifulSoup import BeautifulSoup
  py> chaine = "<html>hello</a>world<anytag>ok"
  py> soup = BeautifulSoup(chaine)
  py> soup.findAll(text=True)
  [u'hello', u'world', u'ok']

Get it from http://www.crummy.com/software/BeautifulSoup/

-- 
Gabriel Genellina
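For readers without BeautifulSoup installed, a similar result can be sketched with the standard library's html.parser module in Python 3. This swaps in a different parser from the one recommended above; `TextCollector` and `extract_text` are hypothetical names, and the string is assumed to be the OP's input with its angle brackets in place.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: keeps text nodes and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text found between tags.
        self.chunks.append(data)


def extract_text(markup):
    parser = TextCollector()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at the end of input
    return parser.chunks


result = extract_text("<html>hello</a>world<anytag>ok")
print(result)  # ['hello', 'world', 'ok']
```

Like BeautifulSoup, this parser tolerates the stray closing tag and the unknown tag, though it makes no attempt to repair the document tree.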
Re: need help with re module
Gabriel Genellina wrote:

> py> from BeautifulSoup import BeautifulSoup
> py> chaine = "<html>hello</a>world<anytag>ok"
> py> soup = BeautifulSoup(chaine)
> py> soup.findAll(text=True)
> [u'hello', u'world', u'ok']

Wow. That *is* beautiful. :)
Re: need help with re module
Here is an example:

  >>> import re
  >>> s = "<html>Hello</a>world<anytag>ok"
  >>> matchtags = re.compile(r"<[^>]+>")
  >>> matchtags.findall(s)
  ['<html>', '</a>', '<anytag>']
  >>> matchtags.sub('',s)
  'Helloworldok'

I probably shouldn't have shown you that. It may not work for all HTML, and you should probably be looking at something like BeautifulSoup.

Matt
Re: need help with re module
On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:

> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED] wrote:
>
>> i have that string "<html>hello</a>world<anytag>ok" and i want to extract all the text, without html tags. the result should be something like this: helloworldok
>>
>> i have tried that:
>>
>>   from re import findall
>>   chaine = "<html>hello</a>world<anytag>ok"
>>   print findall('[a-zA-z][^(<.*>)].+?[a-zA-Z]',chaine)
>>   ['html', 'hell', 'worl', 'anyt', 'ag>o']
>>
>> the result is not correct! what would be the correct regex to use?
>
> You can't use a regular expression for this task (no matter how complicated you write it).
> [snip]

I agree that BeautifulSoup is probably the best tool for the job, but this doesn't sound right to me. Since the OP doesn't care about tags being properly nested, I don't see why a regex (albeit a tricky one) wouldn't work. For example:

  regex = re.compile(r'''
      <[^!]            # beginning of normal tag
      ([^'">]*         # unquoted text...
      |'[^']*'         # or single-quoted text...
      |"[^"]*")*       # or double-quoted text
      >                # end of tag
      |<!--            # beginning of comment
      ([^-]|-[^-])*
      --\s*>           # end of comment
  ''', re.VERBOSE)
  text = regex.sub('', html)

Granted, this misses out a few things (e.g. DOCTYPE declarations), but those should be straightforward to handle.

-- David
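A runnable Python 3 sketch of the verbose pattern above, assuming its delimiters are `<`, `>` and `"` as the inline comments suggest. The name `TAG_OR_COMMENT` and the sample strings are illustrative only.

```python
import re

TAG_OR_COMMENT = re.compile(r'''
    <[^!]            # beginning of normal tag
    ([^'">]*         # unquoted text...
    |'[^']*'         # or single-quoted text...
    |"[^"]*")*       # or double-quoted text
    >                # end of tag
    |<!--            # beginning of comment
    ([^-]|-[^-])*
    --\s*>           # end of comment
''', re.VERBOSE)

# The OP's string: every tag is removed, nesting is irrelevant.
plain = TAG_OR_COMMENT.sub('', '<html>hello</a>world<anytag>ok')
print(plain)   # helloworldok

# A ">" inside a quoted attribute value does not end the tag early,
# and comments are removed as a unit.
quoted = TAG_OR_COMMENT.sub('', '<a href="x>y">link</a><!-- note -->done')
print(quoted)  # linkdone
```

The quoting alternatives are what make this more robust than a plain `<[^>]+>`, but only for markup whose quotes and brackets are balanced.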
Re: need help with re module
On Wed, 20 Jun 2007 17:24:27 -0300, John Salerno [EMAIL PROTECTED] wrote:

> Gabriel Genellina wrote:
>
>> py> from BeautifulSoup import BeautifulSoup
>> py> chaine = "<html>hello</a>world<anytag>ok"
>> py> soup = BeautifulSoup(chaine)
>> py> soup.findAll(text=True)
>> [u'hello', u'world', u'ok']
>
> Wow. That *is* beautiful. :)

Thanks to Leonard Richardson, the main author.

-- 
Gabriel Genellina
Re: need help with re module
On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler [EMAIL PROTECTED] wrote:

> On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:
>
>> On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED] wrote:
>>
>>> i have that string "<html>hello</a>world<anytag>ok" and i want to extract all the text, without html tags. the result should be something like this: helloworldok
>>
>> You can't use a regular expression for this task (no matter how complicated you write it).
>> [snip]
>
> I agree that BeautifulSoup is probably the best tool for the job, but this doesn't sound right to me. Since the OP doesn't care about tags being properly nested, I don't see why a regex (albeit a tricky one) wouldn't work. For example:
>
>   regex = re.compile(r'''
>       <[^!]            # beginning of normal tag
>       ([^'">]*         # unquoted text...
>       |'[^']*'         # or single-quoted text...
>       |"[^"]*")*       # or double-quoted text
>       >                # end of tag
>       |<!--            # beginning of comment
>       ([^-]|-[^-])*
>       --\s*>           # end of comment
>   ''', re.VERBOSE)
>   text = regex.sub('', html)
>
> Granted, this misses out a few things (e.g. DOCTYPE declarations), but those should be straightforward to handle.

It doesn't handle a lot of things. For this input (not very special, just a few simple mistakes):

  <html>
  <a href="http://foo.com/baz.html>click here</a>
  <p>What if price<100? You lose.
  <p>What if HitPoints<-10? You are dead.
  <p>Assignment: target <-- any_expression
  Just a few last words.
  </html>

the BeautifulSoup version gives:

  click here
  What if price<100? You lose.
  What if HitPoints<-10? You are dead.
  Assignment: target <-- any_expression
  Just a few last words.

and the regular expression version gives:

  <a href="http://foo.com/baz.html>click here
  What if priceWhat if HitPointsAssignment: target

Clearly the BeautifulSoup version gives the right result, or the expected one. It's hard to get that with only a regular expression; you need more power, and BeautifulSoup fills the gap.

-- 
Gabriel Genellina
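The failure described above can be reproduced in isolation with a few lines of Python 3. This sketch uses a deliberately simple, hypothetical tag stripper rather than the verbose pattern from the thread, and the sample strings are made up for illustration: an unescaped "<" in the text is taken for the start of a tag, and everything up to the next ">" disappears.

```python
import re

# Hypothetical minimal tag stripper: "<", then anything up to the next ">".
STRIP_TAGS = re.compile(r'<[^>]+>')


def strip_tags(markup):
    return STRIP_TAGS.sub('', markup)


well_formed = '<p>What if price&lt;100? You lose.</p>'
sloppy = '<p>What if price<100? You lose.</p>'

print(strip_tags(well_formed))  # What if price&lt;100? You lose.
print(strip_tags(sloppy))       # What if price
```

The pattern has no way to tell a stray "<" from a real tag opener, so the rest of the sentence is discarded along with the closing tag.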