Re: need help with re module

2007-06-23 Thread Gabriel Genellina
On Sat, 23 Jun 2007 01:12:17 -0300, samwyse [EMAIL PROTECTED] wrote:

 Speak for yourself.  If I'm writing an HTML syntax checker, I think I'll
 skip BeautifulSoup and use something that gives me the results that I
 expect, not the results that you expect.

Sure! By the way, I'm looking for a different sound. I have an Ibanez but  
I think the Jackson is far better for thrash metal, the Jackson Kelly Pro  
series KE3 sounds good, and it's a classic. Maybe as a self-gift for my  
birthday next month.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-22 Thread samwyse
Gabriel Genellina wrote:
 On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler [EMAIL PROTECTED]
 wrote:
 
 On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:

[snip]
 I agree that BeautifulSoup is probably the best tool for the job, but
 this doesn't sound right to me. Since the OP doesn't care about tags
 being properly nested, I don't see why a regex (albeit a tricky one)
 wouldn't work. For example:

[snip]

 Granted, this misses out a few things (e.g. DOCTYPE declarations), but
 those should be straightforward to handle.
 
 It doesn't handle a lot of things. For this input (not very special, 
 just  a few simple mistakes):
 
 <html>
 <a href="http://foo.com/baz.html>click here</a>
 <p>What if price<100? You lose.
 <p>What if HitPoints<-10? You are dead.
 <p>Assignment: target <-- any_expression
 Just a few last words.
 </html>
 
 the BeautifulSoup version gives:
 
 click here
 What if price<100? You lose.
 What if HitPoints<-10? You are dead.
 Assignment: target <-- any_expression
 Just a few last words.
 
 and the regular expression version gives:
 
 <a href="http://foo.com/baz.html>click here
 What if priceWhat if HitPointsAssignment: target
 
 Clearly the BeautifulSoup version gives the right result, or at least the
 expected one.
 It's hard to get that with only a regular expression; you need more
 power, and BeautifulSoup fills the gap.

Speak for yourself.  If I'm writing an HTML syntax checker, I think I'll 
skip BeautifulSoup and use something that gives me the results that I 
expect, not the results that you expect.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread Matimus
On Jun 20, 9:58 am, linuxprog [EMAIL PROTECTED] wrote:
 hello

 i have that string "<html>hello</a>world<anytag>ok" and i want to
 extract all the text, without html tags, the result should be
 something like that: "helloworldok"

 i have tried that:

 from re import findall

 chaine = "<html>hello</a>world<anytag>ok"

 print findall('[a-zA-z][^(.*)].+?[a-zA-Z]',chaine)

 ['html', 'hell', 'worl', 'anyt', 'ago']

 the result is not correct! what would be the correct regex to use?

This: [^(.*)] is a set that contains everything but the characters
"(", ".", "*", and ")". It most certainly doesn't do what you
want it to. Is it absolutely necessary that you use a regular
expression? There are a few HTML parsing libraries out there. The
easiest approach using re might be to do a search and replace on all
tags. Just replace the tags with nothing.
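
Untested, but the idea is roughly this (treat anything between '<' and '>'
as a tag and replace it with nothing):

>>> import re
>>> chaine = "<html>hello</a>world<anytag>ok"
>>> re.sub(r"<[^>]*>", "", chaine)
'helloworldok'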

Matt

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread Gabriel Genellina
On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED]
wrote:

 i have that string "<html>hello</a>world<anytag>ok" and i want to
 extract all the text, without html tags, the result should be
 something like that: "helloworldok"

 i have tried that:

 from re import findall

 chaine = "<html>hello</a>world<anytag>ok"

 print findall('[a-zA-z][^(.*)].+?[a-zA-Z]',chaine)
['html', 'hell', 'worl', 'anyt', 'ago']

 the result is not correct! what would be the correct regex to use?

You can't use a regular expression for this task (no matter how  
complicated you write it).
Use BeautifulSoup, which can handle invalid HTML like yours:

py> from BeautifulSoup import BeautifulSoup
py> chaine = "<html>hello</a>world<anytag>ok"
py> soup = BeautifulSoup(chaine)
py> soup.findAll(text=True)
[u'hello', u'world', u'ok']

Get it from http://www.crummy.com/software/BeautifulSoup/

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread John Salerno
Gabriel Genellina wrote:

 py> from BeautifulSoup import BeautifulSoup
 py> chaine = "<html>hello</a>world<anytag>ok"
 py> soup = BeautifulSoup(chaine)
 py> soup.findAll(text=True)
 [u'hello', u'world', u'ok']

Wow. That *is* beautiful. :)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread Matimus
Here is an example:

>>> import re
>>> s = "<html>Hello</a>world<anytag>ok"
>>> matchtags = re.compile(r"<[^>]+>")
>>> matchtags.findall(s)
['<html>', '</a>', '<anytag>']
>>> matchtags.sub('',s)
'Helloworldok'

I probably shouldn't have shown you that. It may not work for all
HTML, and you should probably be looking at something like
BeautifulSoup.
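
For instance (a made-up example, untested), a '>' inside a quoted attribute
value already trips it up:

>>> s2 = '<img alt="x > y" src="cmp.png">compare'
>>> matchtags.sub('', s2)
' y" src="cmp.png">compare'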

Matt

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread David Wahler
On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:
 On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED]
 wrote:

  i have that string "<html>hello</a>world<anytag>ok" and i want to
  extract all the text, without html tags, the result should be
  something like that: "helloworldok"
 
  i have tried that:
 
  from re import findall
 
  chaine = "<html>hello</a>world<anytag>ok"
 
  print findall('[a-zA-z][^(.*)].+?[a-zA-Z]',chaine)
 ['html', 'hell', 'worl', 'anyt', 'ago']
 
  the result is not correct! what would be the correct regex to use?

 You can't use a regular expression for this task (no matter how
 complicated you write it).
[snip]

I agree that BeautifulSoup is probably the best tool for the job, but
this doesn't sound right to me. Since the OP doesn't care about tags
being properly nested, I don't see why a regex (albeit a tricky one)
wouldn't work. For example:

regex = re.compile(r'''
    <[^!]           # beginning of normal tag
    ([^'">]*        # unquoted text...
    |'[^']*'        # or single-quoted text...
    |"[^"]*")*      # or double-quoted text
    >               # end of tag
  |<!--             # beginning of comment
    ([^-]|-[^-])*
    -->\s*          # end of comment
''', re.VERBOSE)
text = regex.sub('', html)

Granted, this misses out a few things (e.g. DOCTYPE declarations), but
those should be straightforward to handle.
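
For example, a separate pass along these lines (my own rough, untested
addition, not part of the pattern above) could take care of <!DOCTYPE ...>
and similar declarations:

>>> import re
>>> declaration_re = re.compile(r'<![^>]*>')   # naive: a declaration ends at the first '>'
>>> declaration_re.sub('', '<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">hi')
'hi'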

-- David
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread Gabriel Genellina
On Wed, 20 Jun 2007 17:24:27 -0300, John Salerno
[EMAIL PROTECTED] wrote:

 Gabriel Genellina wrote:

 py> from BeautifulSoup import BeautifulSoup
 py> chaine = "<html>hello</a>world<anytag>ok"
 py> soup = BeautifulSoup(chaine)
 py> soup.findAll(text=True)
 [u'hello', u'world', u'ok']

 Wow. That *is* beautiful. :)

Thanks to Leonard Richardson, the main author.

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: need help with re module

2007-06-20 Thread Gabriel Genellina
On Wed, 20 Jun 2007 17:56:30 -0300, David Wahler [EMAIL PROTECTED]
wrote:

 On 6/20/07, Gabriel Genellina [EMAIL PROTECTED] wrote:
 On Wed, 20 Jun 2007 13:58:34 -0300, linuxprog [EMAIL PROTECTED]
 wrote:

  i have that string "<html>hello</a>world<anytag>ok" and i want to
  extract all the text, without html tags, the result should be
  something like that: "helloworldok"

 You can't use a regular expression for this task (no matter how
 complicated you write it).
 [snip]

 I agree that BeautifulSoup is probably the best tool for the job, but
 this doesn't sound right to me. Since the OP doesn't care about tags
 being properly nested, I don't see why a regex (albeit a tricky one)
 wouldn't work. For example:

 regex = re.compile(r'''
     <[^!]           # beginning of normal tag
     ([^'">]*        # unquoted text...
     |'[^']*'        # or single-quoted text...
     |"[^"]*")*      # or double-quoted text
     >               # end of tag
   |<!--             # beginning of comment
     ([^-]|-[^-])*
     -->\s*          # end of comment
 ''', re.VERBOSE)
 text = regex.sub('', html)

 Granted, this misses out a few things (e.g. DOCTYPE declarations), but
 those should be straightforward to handle.

It doesn't handle a lot of things. For this input (not very special, just  
a few simple mistakes):

<html>
<a href="http://foo.com/baz.html>click here</a>
<p>What if price<100? You lose.
<p>What if HitPoints<-10? You are dead.
<p>Assignment: target <-- any_expression
Just a few last words.
</html>

the BeautifulSoup version gives:

click here
What if price<100? You lose.
What if HitPoints<-10? You are dead.
Assignment: target <-- any_expression
Just a few last words.

and the regular expression version gives:

<a href="http://foo.com/baz.html>click here
What if priceWhat if HitPointsAssignment: target

Clearly the BeautifulSoup version gives the right result, or at least the
expected one.
It's hard to get that with only a regular expression; you need more power,
and BeautifulSoup fills the gap.
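
For the record, the two listings above can be reproduced with something like
this (roughly; here html holds the test input above as a string, and regex is
the compiled pattern quoted from David's message):

py> from BeautifulSoup import BeautifulSoup
py> print ''.join(BeautifulSoup(html).findAll(text=True))   # BeautifulSoup version
py> print regex.sub('', html)                               # regex version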

-- 
Gabriel Genellina

-- 
http://mail.python.org/mailman/listinfo/python-list