Re: Issue with regular expressions

2008-04-29 Thread harvey . thomas
On Apr 29, 2:46 pm, Julien [EMAIL PROTECTED] wrote:
 Hi,

 I'm fairly new in Python and I haven't used the regular expressions
 enough to be able to achieve what I want.
 I'd like to select terms in a string, so I can then do a search in my
 database.

 query = '   "  some words"  with and "without    quotes   "  '
 p = re.compile(magic_regular_expression)   # <--- the magic happens
 m = p.match(query)

 I'd like m.groups() to return:
 ('some words', 'with', 'and', 'without quotes')

 Is that achievable with a single regular expression, and if so, what
 would it be?

 Any help would be much appreciated.

 Thanks!!

 Julien

You can't do it simply and completely with regular expressions alone
because of the requirement to strip the quotes and normalize
whitespace, but it's not too hard to write a function to do it. Viz:

import re

wordre = re.compile('"[^"]+"|[a-zA-Z]+').findall
def findwords(src):
    ret = []
    for x in wordre(src):
        if x[0] == '"':
            #strip off the quotes and normalise spaces
            ret.append(' '.join(x[1:-1].split()))
        else:
            ret.append(x)
    return ret

query = '   "  Some words"  with and "without    quotes   "  '
print findwords(query)

Running this gives
['Some words', 'with', 'and', 'without quotes']
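
For readers on Python 3, here is a minimal sketch of the same approach. It assumes (this is not in the original post) that unquoted terms may contain any characters other than whitespace and double quotes, rather than ASCII letters only. The name find_terms is made up for illustration.

```python
import re

# A quoted phrase, or a run of characters that are neither whitespace nor quotes.
TERM_RE = re.compile(r'"[^"]*"|[^\s"]+')

def find_terms(src):
    """Return the search terms in src, treating a "quoted phrase" as one term."""
    terms = []
    for token in TERM_RE.findall(src):
        if token.startswith('"'):
            # Strip the quotes and normalise internal whitespace.
            token = ' '.join(token[1:-1].split())
        if token:  # skip empty phrases such as ""
            terms.append(token)
    return terms

print(find_terms('   "  some words"  with and "without    quotes   "  '))
```

The whitespace normalisation still happens outside the regular expression, which is the point made above: the pattern only finds the tokens.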

HTH

Harvey
--
http://mail.python.org/mailman/listinfo/python-list


Re: Matching XML Tag Contents with Regex

2007-12-11 Thread harvey . thomas
On Dec 11, 4:05 pm, Chris [EMAIL PROTECTED] wrote:
 I'm trying to find the contents of an XML tag. Nothing fancy. I don't
 care about parsing child tags or anything. I just want to get the raw
 text. Here's my script:

 import re

 data = """
 <?xml version='1.0'?>
 <body>
 <div class='default'>
 here&apos;s some text&#33;
 </div>
 <div class='default'>
 here&apos;s some text&#33;
 </div>
 <div class='default'>
 here&apos;s some text&#33;
 </div>
 </body>
 """

 tagName = 'div'
 pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))

 matches = pattern.finditer(data)
 for m in matches:
 contents = data[m.start():m.end()]
 print repr(contents)
 assert tagName not in contents

 The problem I'm running into is that the [^%(tagName)s]* portion of my
 regex is being ignored, so only one match is being returned, starting
 at the first <div> and ending at the end of the text, when it should
 end at the first </div>. For this example, it should return three
 matches, one for each div.

 Is what I'm trying to do possible with Python's Regex library? Is
 there an error in my Regex?

 Thanks,
 Chris

print re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', data)

["<div class='default'>", "<div class='default'>", "<div class='default'>"]
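
The pattern above finds the opening tags only. A minimal Python 3 sketch that also captures the text between each opening tag and its closing tag is given here. It assumes the tag is never nested inside itself and never self-closing; for anything beyond that an XML parser is the safer tool. The function name tag_contents and the sample data are illustrative.

```python
import re

data = """<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s more text&#33;
</div>
</body>
"""

def tag_contents(tag_name, text):
    """Return the raw text between each <tag_name ...> and the next </tag_name>."""
    name = re.escape(tag_name)
    # (.*?) is non-greedy, so each match stops at the first closing tag.
    pattern = re.compile(r'<%s(?=[\s/>])[^>]*>(.*?)</%s\s*>' % (name, name),
                         re.DOTALL)
    return pattern.findall(text)

contents = tag_contents('div', data)
for chunk in contents:
    print(repr(chunk))
```

The character class in the original question cannot express "anything up to the closing tag"; a non-greedy quantifier with re.DOTALL can.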

HTH

Harvey


Re: just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)

2007-05-25 Thread harvey . thomas
On May 25, 12:03 pm, sim.sim [EMAIL PROTECTED] wrote:
 On 25 May, 12:45, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:

  In [EMAIL PROTECTED], sim.sim wrote:
   Below is the code that tries to parse a well-formed XML document, but it
   fails with the error message:
   not well-formed (invalid token): line 3, column 85

  How did you verify that it is well-formed?  `xmllint` barfs on it too.

 you can try to write iMessage to file and open it using Mozilla
 Firefox (web-browser)


   The problem is within the CDATA section: it contains part of a UTF-8
   encoded string which was split (widely used for memory-limited
   devices).

   When minidom parses the XML string, it fails because it tries to convert
   the data within the CDATA section into unicode, instead of just returning
   the value of the section as is. The conversion contradicts the
   specification http://www.w3.org/TR/REC-xml/#sec-cdata-sect

  An XML document contains unicode characters, and so does the CDATA section.
  CDATA is not meant for putting arbitrary bytes into a document.  It must
  contain valid characters of this type:
  http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of CDATA in
  your link above).

  Ciao,
  Marc 'BlackJack' Rintsch

 my CDATA-section contains only symbols in the range specified for
 Char:
 Char ::=   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
 [#x10000-#x10FFFF]

 filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

You need to explicitly convert the string of UTF-8 encoded bytes to a
Unicode string before parsing, e.g.
unicodestring = unicode(encodedbytes, 'utf8')

Unless I messed up copying and pasting, your original string had an
erroneous byte immediately before ]]>. With that corrected I was able
to process the string correctly - the CDATA marked section consists
entirely of spaces and Cyrillic characters. As I noted earlier, you
will lose \r characters as part of the basic XML processing.
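
A minimal Python 3 sketch of that decode-first step follows. The sample document and the helper name decode_or_locate_error are made up for illustration; the broken input simulates a multi-byte character cut in half just before ]]>, which is an assumption about what the erroneous byte was.

```python
from xml.dom import minidom

def decode_or_locate_error(encoded_bytes):
    """Return (text, None) on success, or (None, offset of the bad byte)."""
    try:
        return encoded_bytes.decode('utf-8'), None
    except UnicodeDecodeError as err:
        return None, err.start

good = '<Data><![CDATA[Привет]]></Data>'.encode('utf-8')
# Drop the second byte of the last Cyrillic letter to simulate a split character.
last_letter = 'т'.encode('utf-8')
bad = good.replace(last_letter, last_letter[:1])

text, _ = decode_or_locate_error(good)
cdata_value = minidom.parseString(text).documentElement.firstChild.data
print(cdata_value)

broken_text, broken_offset = decode_or_locate_error(bad)
print(broken_text, broken_offset)
```

Decoding explicitly reports the position of the stray byte, which is easier to act on than the parser's "invalid token" message.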

HTH

Harvey



Re: xml.dom.minidom: how to preserve CRLF's inside CDATA?

2007-05-22 Thread harvey . thomas
On May 22, 2:45 pm, sim.sim [EMAIL PROTECTED] wrote:
 Hi all.
 I'm having trouble using minidom:

 # I have a string (XML) with a CDATA section, and the section includes \r\n:
 iInStr = '<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n'

 # After I create the DOM object, I get the value of Data without \r\n

 from xml.dom import minidom
 iDoc = minidom.parseString(iInStr)
 iDoc.childNodes[0].childNodes[0].data # it gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'

 according to http://www.w3.org/TR/REC-xml/#sec-line-ends

 it looks normal, but another part of the documentation says that only
 the CDEnd string is recognized as markup:
 http://www.w3.org/TR/REC-xml/#sec-cdata-sect

 so the parser must (IMHO) return the value of the CDATA section as is
 (the two parts of the document do not contradict each other).

 How can I get the value of the CDATA section with all characters
 preserved? (Perhaps use another parser; which one?)

 Many thanks for any help.

You will lose the \r characters. From the document you referred to:

This section defines some symbols used widely in the grammar.

S (white space) consists of one or more space (#x20) characters,
carriage returns, line feeds, or tabs.

White Space
[3]   S   ::=   (#x20 | #x9 | #xD | #xA)+

Note:

The presence of #xD in the above production is maintained purely for
backward compatibility with the First Edition. As explained in 2.11
End-of-Line Handling, all #xD characters literally present in an XML
document are either removed or replaced by #xA characters before any
other processing is done. The only way to get a #xD character to match
this production is to use a character reference in an entity value
literal.
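
The end-of-line handling described in the quoted note can be observed with minidom itself. This is a minimal Python 3 sketch; the helper name element_text and the second sample string (which uses &#13; character references outside a CDATA section) are illustrative assumptions, not part of the original thread.

```python
from xml.dom import minidom

def element_text(xml_string):
    """Concatenate the data of all text/CDATA children of the root element."""
    doc = minidom.parseString(xml_string)
    return ''.join(node.data for node in doc.documentElement.childNodes)

# Literal \r\n inside CDATA: the parser normalises each pair to \n.
literal_crlf = '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>'

# \r written as a character reference: it is not a literal #xD, so it survives.
char_ref_cr = '<Data>BEGIN:VCALENDAR&#13;\nEND:VCALENDAR&#13;\n</Data>'

print(repr(element_text(literal_crlf)))
print(repr(element_text(char_ref_cr)))
```

Character references are not recognised inside a CDATA section, so preserving \r this way means writing the content as ordinary escaped text instead.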




Re: XML Parsing

2007-03-28 Thread harvey . thomas
On Mar 28, 10:51 am, Diez B. Roggisch [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
  I want to parse this XML file:

  <?xml version="1.0" ?>

  <text>

  <text:one>
  <file>filename</file>
  <contents>
  Hello
  </contents>
  </text:one>

  <text:two>
  <file>filename2</file>
  <contents>
  Hello2
  </contents>
  </text:two>

  </text>

  This XML will be in a file called filecreate.xml

  As you might have guessed, I want to create files from this XML file
  contents, so how can I do this?
  What modules should I use? What options do I have? Where can I find
  tutorials? Will I be able to put
  this on the internet (on a googlepages server)?

  Thanks in advance to everyone who helps me.
  And yes I have used Google but I am unsure what to use.

 The above file is not valid XML. It is missing an xmlns:text namespace
 declaration. So you won't be able to parse it regardless of what parser you
 use.

 Diez

The example is well-formed XML. It is permitted to use the :
character in element names. Whether one should in a non-namespace
context is a different matter.
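
Both views in this exchange can be checked from Python 3. In the sketch below, expat with namespace processing switched off accepts the colon as part of the element name, while xml.etree.ElementTree, which enables namespace processing, rejects the undeclared prefix. The helper name element_names and the shortened sample document are illustrative.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

doc = """<text>
<text:one>
<file>filename</file>
<contents>
Hello
</contents>
</text:one>
</text>"""

def element_names(xml_string):
    """Parse without namespace processing and list element names in order."""
    names = []
    parser = expat.ParserCreate()  # no namespace_separator: namespaces are off
    parser.StartElementHandler = lambda name, attrs: names.append(name)
    parser.Parse(xml_string, True)
    return names

names = element_names(doc)
print(names)

try:
    ET.fromstring(doc)  # namespace-aware: the "text" prefix is unbound
    namespace_aware_ok = True
except ET.ParseError as err:
    namespace_aware_ok = False
    print('ElementTree rejected it:', err)
```

So whether the file "parses" depends on whether the chosen parser applies the Namespaces in XML rules on top of XML 1.0.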

Harvey



Re: Match 2 words in a line of file

2007-01-19 Thread harvey . thomas

Rickard Lindberg wrote:

 I see two potential problems with the non regex solutions.

 1) Consider a line: "foo (bar)". When you split it you will only get
 two strings, as split by default only splits the string on white space
 characters. Thus "'bar' in words" will return false, even though bar is
 a word in that line.

 2) If you have a line something like this: "foobar hello" then "'foo'
 in line" will return true, even though foo is not a word (it is part of
 a word).

Here's a solution using re.split:

import re
import StringIO

wordsplit = re.compile(r'\W+').split

def matchlines(fh, w1, w2):
    # print each line of fh that contains both w1 and w2 as whole words
    w1 = w1.lower()
    w2 = w2.lower()
    for line in fh:
        words = [x.lower() for x in wordsplit(line)]
        if w1 in words and w2 in words:
            print line.rstrip()

test = """1st line of text (not matched)
2nd line of words (not matched)
3rd line (Word test) should match (case insensitivity)
4th line simple test of word's (matches)
5th line simple test of words not found (plural words)
6th line tests produce strange words (no match - plural)
7th line word test should find this
"""

matchlines(StringIO.StringIO(test), 'test', 'word')
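
The two pitfalls quoted at the top of this message can be reproduced in a few lines. This is a Python 3 sketch using the two example lines from the quoted text; the helper name words is made up for illustration.

```python
import re

def words(line):
    """Split on runs of non-word characters, dropping empty strings."""
    return [w for w in re.split(r'\W+', line) if w]

line1 = 'foo (bar)'
line2 = 'foobar hello'

# Pitfall 1: str.split() keeps the punctuation attached to the word.
print(line1.split(), 'bar' in line1.split())
print(words(line1), 'bar' in words(line1))

# Pitfall 2: a substring test matches inside a longer word.
print('foo' in line2)
print(words(line2), 'foo' in words(line2))
```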



Re: One more regular expressions question

2007-01-18 Thread harvey . thomas
Victor Polukcht wrote:

 My pattern now is:

 (?P<var1>[^(]+)(?P<var2>\d+)\)\s+(?P<var3>\d+)

 And i expect to get:

 var1 = "Unassigned Number "
 var2 = 1
 var3 = 32

 I'm sure my regexp is incorrect, but can't understand where exactly.

 Regex.debug shows that even the first block is incorrect.

 Thanks in advance.

 On Jan 18, 1:15 pm, Roberto Bonvallet [EMAIL PROTECTED]
 wrote:
  Victor Polukcht wrote:
   My actual problem is i can't get how to include space, comma, slash.

  Post here what you have written already, so we can tell you what the
  problem is.

  --
  Roberto Bonvallet

You are missing \( after the first group. The RE should be:

'(?P<var1>[^(]+)\((?P<var2>\d+)\)\s+(?P<var3>\d+)'
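
A quick Python 3 check of both patterns. The input line is not shown in this message, so the sample string below is a hypothetical one chosen to match the expected values.

```python
import re

broken = re.compile(r'(?P<var1>[^(]+)(?P<var2>\d+)\)\s+(?P<var3>\d+)')
fixed = re.compile(r'(?P<var1>[^(]+)\((?P<var2>\d+)\)\s+(?P<var3>\d+)')

sample = 'Unassigned Number (1)   32'  # hypothetical input line

# Without \( the first group stops at "(" and \d+ has nothing to match.
print(broken.match(sample))

match = fixed.match(sample)
print(match.groupdict())
```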



Re: re.sub and empty groups

2007-01-16 Thread harvey . thomas

Hugo Ferreira wrote:

 Hi!

 I'm trying to do a search-replace in places where some groups are
 optional... Here's an example:

 >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola").groups()
 ('ola', None)

 >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola|").groups()
 ('ola', '')

 >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola|ole").groups()
 ('ola', 'ole')

 The second and third results are right, but not the first one, where
 it should be equal to the second (i.e., it should be an empty string
 instead of None). This is because I want to use re.sub() and when the
 group is None, it blows up with a stack trace...

 Maybe I'm not getting the essence of groups and non-grouping groups.
 Someone care to explain (and, give the correct solution :)) ?

 Thanks in advance,

 Hugo Ferreira

 --
 GPG Fingerprint: B0D7 1249 447D F5BB 22C5  5B9B 078C 2615 504B 7B85

From the documentation:
groups( [default])
Return a tuple containing all the subgroups of the match, from 1 up to
however many groups are in the pattern. The default argument is used
for groups that did not participate in the match; it defaults to None.

Your second group is optional and does not take part in the match in
your first example. You can, however, still use this regular expression
if you use groups('') rather than groups().

A better way is probably to use a simplified regular expression:

re.match(r"Image:([^\|]+)\|?(.*)", "Image:ola").groups()

i.e. match the text "Image:" followed by at least one character not
matching | followed by an optional | followed by any remaining
characters.
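
Both suggestions side by side, as a Python 3 sketch. The replacement template passed to sub is a made-up example. Note that Python 3.5 and later substitute an empty string for an unmatched group in re.sub, so the error described in the question is specific to older versions.

```python
import re

optional = re.compile(r"Image:([^|]+)(?:\|(.*))?")
simplified = re.compile(r"Image:([^|]+)\|?(.*)")

# Option 1: keep the pattern, supply a default for groups that did not match.
print(optional.match("Image:ola").groups())
print(optional.match("Image:ola").groups(''))

# Option 2: the second group always participates, so it is '' rather than None.
print(simplified.match("Image:ola").groups())
print(simplified.match("Image:ola|ole").groups())

# With the simplified pattern both groups are always strings in a substitution.
print(simplified.sub(r'[\1|\2]', "Image:ola"))
```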



Re: Insert characters into string based on re ?

2006-10-13 Thread harvey . thomas

Matt wrote:
 I am attempting to reformat a string, inserting newlines before certain
 phrases. For example, in formatting SQL, I want to start a new line at
 each JOIN condition. Noting that strings are immutable, I thought it
 best to split the string at the key points, then join with '\n'.

 Regexps can seem the best way to identify the points in the string
 ('LEFT.*JOIN' to cover 'LEFT OUTER JOIN' and 'LEFT JOIN'), since I need
 to identify multiple locations in the string. However, the re.split
 method returns the list without the split phrases, and re.findall does
 not seem useful for this operation.

 Suggestions?

I think that re.sub is a more appropriate method than split and join.

Trivial example (non-SQL):

>>> addnlre = re.compile(r'LEFT\s.*?\s*JOIN|RIGHT\s.*?\s*JOIN', re.DOTALL + re.IGNORECASE).sub
>>> addnlre(lambda x: x.group() + '\n', '... LEFT JOIN x RIGHT OUTER join y')
'... LEFT JOIN\n x RIGHT OUTER join\n y'
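
The example above appends the newline after each phrase. A variant that starts a new line before each JOIN phrase, as the question asks, might look like this in Python 3. The pattern only covers LEFT/RIGHT with an optional OUTER, which is an assumption about the SQL being formatted; the function name and the sample query are illustrative.

```python
import re

# Whitespace before the phrase is consumed and replaced by a single newline.
JOIN_RE = re.compile(r'\s+((?:LEFT|RIGHT)\s+(?:OUTER\s+)?JOIN)\b', re.IGNORECASE)

def break_before_joins(sql):
    """Start a new line at each LEFT/RIGHT [OUTER] JOIN phrase."""
    return JOIN_RE.sub(r'\n\1', sql)

sql = 'SELECT * FROM a LEFT JOIN b ON a.id = b.id RIGHT OUTER join c ON b.id = c.id'
print(break_before_joins(sql))
```

Using a group reference in the replacement keeps the matched phrase, which is what re.split discards.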



Re: Need a Regular expression to remove a char for Unicode text

2006-10-13 Thread harvey . thomas

శ్రీనివాస wrote:
 Hi friends,
 Can anyone tell me how I can remove a character from a Unicode text?
 కల్‌హార is a Telugu word in Unicode. Here I want to
 remove '&' but not replace it with a zero-width char. One more thing:
 if there is any whitespace before or after the '&' char, the text should
 be kept as it is. Please tell me how I can work this out with regular
 expressions.

 Thanks and regards
 Srinivasa Raju Datla

Don't know anything about Telugu, but is this the approach you want?

>>> x=u'\xfe\xff & \xfe\xff &\xfe\xff&\xfe\xff'
>>> noampre = re.compile('(?<!\s)&(?!\s)', re.UNICODE).sub
>>> noampre('', x)
u'\xfe\xff & \xfe\xff &\xfe\xff\xfe\xff'

The regular expression uses negative lookbehind and lookahead
assertions to check that there is no whitespace immediately before or
after the '&' character. Each match found is then replaced with the
empty string.
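
The same idea as a Python 3 sketch, where str patterns are Unicode-aware by default so the re.UNICODE flag is not needed. It assumes the character to remove is '&'; the function name and the sample strings are illustrative.

```python
import re

# '&' with no whitespace immediately before it and none immediately after it.
AMP_RE = re.compile(r'(?<!\s)&(?!\s)')

def strip_inner_ampersands(text):
    """Remove each '&' that has no whitespace on either side."""
    return AMP_RE.sub('', text)

print(strip_inner_ampersands('a & b &c&d'))
print(strip_inner_ampersands('\u0c15\u0c32\u0c4d&\u0c39\u0c3e\u0c30'))
```

Only the '&' between c and d is removed in the first string; the other two each touch a space and are left alone.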
