Re: Issue with regular expressions
On Apr 29, 2:46 pm, Julien [EMAIL PROTECTED] wrote:
> Hi,
>
> I'm fairly new to Python and I haven't used regular expressions enough
> to be able to achieve what I want. I'd like to select terms in a string,
> so I can then do a search in my database.
>
>     query = ' "some words" with and "without quotes" '
>     p = re.compile(magic_regular_expression)  # <--- the magic happens
>     m = p.match(query)
>
> I'd like m.groups() to return:
>
>     ('some words', 'with', 'and', 'without quotes')
>
> Is that achievable with a single regular expression, and if so, what
> would it be? Any help would be much appreciated.
>
> Thanks!!
> Julien

You can't do it simply and completely with regular expressions alone, because of the requirement to strip the quotes and normalize whitespace, but it's not too hard to write a function to do it. Viz:

    import re

    # match either a double-quoted phrase or a bare run of letters
    wordre = re.compile('"[^"]+"|[a-zA-Z]+').findall

    def findwords(src):
        ret = []
        for x in wordre(src):
            if x[0] == '"':
                # strip off the quotes and normalise spaces
                ret.append(' '.join(x[1:-1].split()))
            else:
                ret.append(x)
        return ret

    query = ' "Some words" with and "without quotes" '
    print(findwords(query))

Running this gives

    ['Some words', 'with', 'and', 'without quotes']

HTH

Harvey
-- 
http://mail.python.org/mailman/listinfo/python-list
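For readers on Python 3, the same approach can be sketched as follows. This is a straight port of the function in the reply, not a different algorithm; the names `WORD_RE`, `terms` and `token` and the extra spaces inside `query` are my own, added so that the whitespace normalisation is visible.

```python
import re

# Either a double-quoted phrase or a bare run of letters.
WORD_RE = re.compile(r'"[^"]+"|[a-zA-Z]+')

def findwords(src):
    """Return the search terms; quoted phrases lose their quotes and extra spaces."""
    terms = []
    for token in WORD_RE.findall(src):
        if token.startswith('"'):
            # Drop the surrounding quotes and collapse runs of whitespace.
            terms.append(' '.join(token[1:-1].split()))
        else:
            terms.append(token)
    return terms

query = ' "Some   words"  with and "without    quotes" '
result = findwords(query)
print(result)
```

Like the original, this only recognises ASCII letters outside quotes; digits or accented characters in bare words would need a wider character class.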
Re: Matching XML Tag Contents with Regex
On Dec 11, 4:05 pm, Chris [EMAIL PROTECTED] wrote:
> I'm trying to find the contents of an XML tag. Nothing fancy. I don't
> care about parsing child tags or anything. I just want to get the raw
> text. Here's my script:
>
>     import re
>
>     data = """<?xml version='1.0'?>
>     <body>
>     <div class='default'>
>     here&apos;s some text&#33;
>     </div>
>     <div class='default'>
>     here&apos;s some text&#33;
>     </div>
>     <div class='default'>
>     here&apos;s some text&#33;
>     </div>
>     </body>"""
>
>     tagName = 'div'
>     pattern = re.compile('<%(tagName)s\s[^>]*>[.\n\r\w\s\d\D\S\W]*[^(%(tagName)s)]*' % dict(tagName=tagName))
>
>     matches = pattern.finditer(data)
>     for m in matches:
>         contents = data[m.start():m.end()]
>         print(repr(contents))
>         assert tagName not in contents
>
> The problem I'm running into is that the [^%(tagName)s]* portion of my
> regex is being ignored, so only one match is being returned, starting
> at the first <div> and ending at the end of the text, when it should
> end at the first </div>. For this example, it should return three
> matches, one for each div.
>
> Is what I'm trying to do possible with Python's Regex library? Is
> there an error in my Regex?
>
> Thanks,
> Chris

    # r holds the XML text (data in your script)
    print(re.findall(r'<%s(?=[\s/>])[^>]*>' % 'div', r))

    ["<div class='default'>", "<div class='default'>", "<div class='default'>"]

HTH

Harvey
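The pattern in the reply finds the start tags only. A minimal sketch of how it might be extended to capture the raw text between the tags, using a non-greedy `(.*?)` with `re.DOTALL`, is shown below. The extension, the shortened `data` sample and the names `start_tags` and `contents` are my own assumptions, not part of the thread, and the approach does not cope with nested elements of the same name.

```python
import re

data = """<?xml version='1.0'?>
<body>
<div class='default'>
here&apos;s some text&#33;
</div>
<div class='default'>
here&apos;s some more text&#33;
</div>
</body>"""

tag = 'div'

# Start tags only, as in the reply: the lookahead stops 'div' matching 'divider'.
start_tags = re.findall(r'<%s(?=[\s/>])[^>]*>' % tag, data)

# Assumed extension: lazily capture everything up to the first closing tag.
# re.DOTALL lets '.' match the newlines inside the element.
contents = re.findall(r'<%s(?=[\s/>])[^>]*>(.*?)</%s>' % (tag, tag),
                      data, re.DOTALL)

print(start_tags)
print(contents)
```

Character references such as `&apos;` are returned undecoded, which matches the "raw text" requirement; for anything beyond flat, well-behaved input a real XML parser is the safer choice.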
Re: just a bug (was: xml.dom.minidom: how to preserve CRLF's inside CDATA?)
On May 25, 12:03 pm, sim.sim [EMAIL PROTECTED] wrote:
> On 25 May, 12:45, Marc 'BlackJack' Rintsch [EMAIL PROTECTED] wrote:
> > In [EMAIL PROTECTED], sim.sim wrote:
> > > Below is the code that tries to parse a well-formed xml, but it
> > > fails with the error message:
> > > "not well-formed (invalid token): line 3, column 85"
> >
> > How did you verify that it is well formed? `xmllint` barfs on it too.
>
> You can try to write iMessage to a file and open it using Mozilla
> Firefox (web browser).
>
> > > The problem is within the CDATA section: it contains part of a
> > > utf-8 encoded string which was split (widely used for memory
> > > limited devices). When minidom parses the xml string, it fails
> > > because it tries to convert the data within the CDATA section into
> > > unicode, instead of just returning the value of the section as is.
> > > The conversion contradicts the specification
> > > http://www.w3.org/TR/REC-xml/#sec-cdata-sect
> >
> > An XML document contains unicode characters, so does the CDATA
> > section. CDATA is not meant to put arbitrary bytes into a document.
> > It must contain valid characters of this type
> > http://www.w3.org/TR/REC-xml/#NT-Char (linked from the grammar of
> > CDATA in your link above).
> >
> > Ciao,
> > Marc 'BlackJack' Rintsch
>
> My CDATA section contains only symbols in the range specified for Char:
>
>     Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>
>     filter(lambda x: ord(x) not in range(0x20, 0xD7FF), iMessage)

You need to explicitly convert the string of UTF-8 encoded bytes to a Unicode string before parsing, e.g.

    unicodestring = unicode(encodedbytes, 'utf8')

Unless I messed up copying and pasting, your original string had an erroneous byte immediately before ]]>. With that corrected I was able to process the string correctly: the CDATA marked section consists entirely of spaces and Cyrillic characters.

As I noted earlier, you will lose \r characters as part of the basic XML processing.

HTH

Harvey
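The advice to decode the bytes before parsing can be sketched in Python 3 as below, where `bytes.decode` plays the role of the Python 2 `unicode(...)` call. The sample text, the `<Data>` wrapper and the helper name `parse_utf8` are illustrative assumptions; the truncated payload only imitates the kind of split multi-byte sequence described in the thread, not the poster's actual data.

```python
from xml.dom import minidom

text = 'привет мир'  # hypothetical Cyrillic payload

# A document whose bytes are complete, valid UTF-8.
good = ('<Data><![CDATA[%s]]></Data>' % text).encode('utf-8')

# Drop the final byte of the payload so the last two-byte character is cut
# in half, leaving a stray lead byte immediately before ']]>'.
bad = b'<Data><![CDATA[' + text.encode('utf-8')[:-1] + b']]></Data>'

def parse_utf8(raw):
    """Decode explicitly, then parse; invalid UTF-8 fails before XML parsing."""
    return minidom.parseString(raw.decode('utf-8'))

doc = parse_utf8(good)
value = ''.join(node.data for node in doc.documentElement.childNodes)

try:
    parse_utf8(bad)
    bad_decodes = True
except UnicodeDecodeError:
    bad_decodes = False

print(repr(value), bad_decodes)
```

Decoding first separates the two failure modes: a `UnicodeDecodeError` points at broken bytes, while a parser error points at the XML itself.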
Re: xml.dom.minidom: how to preserve CRLF's inside CDATA?
On May 22, 2:45 pm, sim.sim [EMAIL PROTECTED] wrote:
> Hi all. I'm having trouble using minidom:
>
>     # I have a string (xml) with a CDATA section, and the section includes "\r\n":
>     iInStr = '<?xml version="1.0"?>\n<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n'
>
>     # After I create the DOM object, I get the value of "Data" without "\r\n"
>     from xml.dom import minidom
>     iDoc = minidom.parseString(iInStr)
>     iDoc.childNodes[0].childNodes[0].data  # it gives u'BEGIN:VCALENDAR\nEND:VCALENDAR\n'
>
> According to http://www.w3.org/TR/REC-xml/#sec-line-ends it looks
> normal, but another part of the documentation says that "only the
> CDEnd string is recognized as markup":
> http://www.w3.org/TR/REC-xml/#sec-cdata-sect
>
> So the parser must (IMHO) give the value of the CDATA section as is
> (the two parts of the document do not contradict each other).
>
> How can I get the value of the CDATA section with all symbols within
> it preserved? (Perhaps use another parser; which one?)
>
> Many thanks for any help.

You will lose the \r characters. From the document you referred to:

    This section defines some symbols used widely in the grammar.

    S (white space) consists of one or more space (#x20) characters,
    carriage returns, line feeds, or tabs.

    White Space
    [3] S ::= (#x20 | #x9 | #xD | #xA)+

    Note: The presence of #xD in the above production is maintained
    purely for backward compatibility with the First Edition. As
    explained in 2.11 End-of-Line Handling, all #xD characters literally
    present in an XML document are either removed or replaced by #xA
    characters before any other processing is done. The only way to get
    a #xD character to match this production is to use a character
    reference in an entity value literal.
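The end-of-line handling quoted from the specification can be observed directly. The sketch below (Python 3) parses the poster's document, then a variant in which each carriage return is written as the character reference `&#13;` in ordinary element content. The variant and the helper name `element_text` are my own illustration: character references are not recognised inside CDATA, so this only works if the producer of the document can be changed.

```python
from xml.dom import minidom

def element_text(xml_text):
    """Join the character data of the document element's children."""
    doc = minidom.parseString(xml_text)
    return ''.join(node.data for node in doc.documentElement.childNodes)

# Literal CR LF pairs inside CDATA: normalised to LF before other processing.
cdata_doc = ('<?xml version="1.0"?>\n'
             '<Data><![CDATA[BEGIN:VCALENDAR\r\nEND:VCALENDAR\r\n]]></Data>\n')

# Assumed alternative: CR written as a character reference in plain content.
ref_doc = ('<?xml version="1.0"?>\n'
           '<Data>BEGIN:VCALENDAR&#13;\nEND:VCALENDAR&#13;\n</Data>\n')

cdata_value = element_text(cdata_doc)
ref_value = element_text(ref_doc)

print(repr(cdata_value))
print(repr(ref_value))
```

If the document cannot be changed, another option is to restore the line endings after parsing (for example, replacing `'\n'` with `'\r\n'`), which is only safe when the original is known to use CR LF throughout.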
Re: XML Parsing
On Mar 28, 10:51 am, Diez B. Roggisch [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
> > I want to parse this XML file:
> >
> >     <?xml version="1.0" ?>
> >     <text>
> >       <text:one>
> >         <file>filename</file>
> >         <contents>
> >           Hello
> >         </contents>
> >       </text:one>
> >       <text:two>
> >         <file>filename2</file>
> >         <contents>
> >           Hello2
> >         </contents>
> >       </text:two>
> >     </text>
> >
> > This XML will be in a file called filecreate.xml
> >
> > As you might have guessed, I want to create files from this XML
> > file's contents, so how can I do this? What modules should I use?
> > What options do I have? Where can I find tutorials? Will I be able
> > to put this on the internet (on a googlepages server)?
> >
> > Thanks in advance to everyone who helps me. And yes, I have used
> > Google but I am unsure what to use.
>
> The above file is not valid XML. It misses an xmlns:text namespace
> declaration. So you won't be able to parse it regardless of what
> parser you use.
>
> Diez

The example is well-formed XML. It is permitted to use the : character in element names. Whether one should do so outside a namespace context is a different matter.

Harvey
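Both observations in this exchange can be checked with the standard library (Python 3 shown). The sketch below is only an illustration under my own assumptions: the document is the thread's example with whitespace trimmed, and `names` and `namespace_aware_ok` are hypothetical names. A raw expat parser created without a namespace separator does no namespace processing and accepts the colon in element names, while `xml.etree.ElementTree`, which is namespace-aware, rejects the undeclared `text:` prefix.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

doc = """<?xml version="1.0" ?>
<text>
  <text:one><file>filename</file><contents>Hello</contents></text:one>
  <text:two><file>filename2</file><contents>Hello2</contents></text:two>
</text>"""

# Plain XML 1.0 parsing: no namespace_separator means no namespace processing,
# so 'text:one' is just an element name that happens to contain a colon.
names = []
parser = expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: names.append(name)
parser.Parse(doc, True)

# Namespace-aware parsing: the 'text' prefix is never declared.
try:
    ET.fromstring(doc)
    namespace_aware_ok = True
except ET.ParseError:
    namespace_aware_ok = False

print(names)
print(namespace_aware_ok)
```

In practice, adding an `xmlns:text="..."` declaration to the root element, or renaming the elements, avoids the question and lets any parser be used.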
Re: Match 2 words in a line of file
Rickard Lindberg wrote:
> I see two potential problems with the non-regex solutions.
>
> 1) Consider a line: "foo (bar)". When you split it you will only get
> two strings, as split by default only splits the string on white space
> characters. Thus "'bar' in words" will return false, even though bar
> is a word in that line.
>
> 2) If you have a line something like this: "foobar hello" then "'foo'
> in line" will return true, even though foo is not a word (it is part
> of a word).

Here's a solution using re.split:

    import re
    import StringIO  # Python 2; on Python 3 use io.StringIO instead

    # split on runs of non-word characters, so punctuation never sticks to a word
    wordsplit = re.compile(r'\W+').split

    def matchlines(fh, w1, w2):
        w1 = w1.lower()
        w2 = w2.lower()
        for line in fh:
            words = [x.lower() for x in wordsplit(line)]
            if w1 in words and w2 in words:
                print(line.rstrip())

    test = """1st line of text (not matched)
    2nd line of words (not matched)
    3rd line (Word test) should match (case insensitivity)
    4th line simple test of word's (matches)
    5th line simple test of words not found (plural words)
    6th line tests produce strange words (no match - plural)
    7th line "word test" should find this
    """

    matchlines(StringIO.StringIO(test), 'test', 'word')
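A Python 3 sketch of the same idea follows. It differs from the posted code in two ways that are my own choices rather than the author's: it returns the matching lines instead of printing them, and it stores each line's words in a set. The sample text reuses the test lines from the message.

```python
import io
import re

# Split on runs of non-word characters so punctuation never sticks to a word.
wordsplit = re.compile(r'\W+').split

def matchlines(fh, w1, w2):
    """Return the lines of fh that contain both words, ignoring case."""
    w1, w2 = w1.lower(), w2.lower()
    matched = []
    for line in fh:
        words = {x.lower() for x in wordsplit(line)}
        if w1 in words and w2 in words:
            matched.append(line.rstrip())
    return matched

test = """1st line of text (not matched)
2nd line of words (not matched)
3rd line (Word test) should match (case insensitivity)
4th line simple test of word's (matches)
5th line simple test of words not found (plural words)
6th line tests produce strange words (no match - plural)
7th line "word test" should find this
"""

matched = matchlines(io.StringIO(test), 'test', 'word')
for line in matched:
    print(line)
```

Note that `word's` counts as a match because `\W+` splits on the apostrophe, leaving `word` and `s` as separate tokens.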
Re: One more regular expressions question
Victor Polukcht wrote:
> My pattern now is:
>
>     (?P<var1>[^(]+)(?P<var2>\d+)\)\s+(?P<var3>\d+)
>
> And I expect to get:
>
>     var1 = Unassigned Number
>     var2 = 1
>     var3 = 32
>
> I'm sure my regexp is incorrect, but can't understand where exactly.
> Regex.debug shows that even the first block is incorrect.
>
> Thanks in advance.
>
> On Jan 18, 1:15 pm, Roberto Bonvallet [EMAIL PROTECTED] wrote:
> > Victor Polukcht wrote:
> > > My actual problem is I can't get how to include space, comma, slash.
> >
> > Post here what you have written already, so we can tell you what the
> > problem is.
> >
> > --
> > Roberto Bonvallet

You are missing \( after the first group. The RE should be:

    '(?P<var1>[^(]+)\((?P<var2>\d+)\)\s+(?P<var3>\d+)'
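The effect of the missing `\(` can be seen by running both patterns against a sample line. The input string below is a guess reconstructed from the expected values in the question, since the actual data is not shown in the thread; the variable names are likewise my own.

```python
import re

line = 'Unassigned Number (1) 32'  # hypothetical input, inferred from the question

broken = re.compile(r'(?P<var1>[^(]+)(?P<var2>\d+)\)\s+(?P<var3>\d+)')
fixed = re.compile(r'(?P<var1>[^(]+)\((?P<var2>\d+)\)\s+(?P<var3>\d+)')

# The broken pattern never consumes the literal '(' that sits between
# the text matched by var1 and the digits wanted by var2, so it fails.
broken_match = broken.match(line)

fixed_match = fixed.match(line)
groups = fixed_match.groupdict()

print(broken_match)
print(groups)
```

Note that `var1` keeps the trailing space before the parenthesis; call `.strip()` on it if that is unwanted.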
Re: re.sub and empty groups
Hugo Ferreira wrote:
> Hi!
>
> I'm trying to do a search-replace in places where some groups are
> optional... Here's an example:
>
>     >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola").groups()
>     ('ola', None)
>     >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola|").groups()
>     ('ola', '')
>     >>> re.match(r"Image:([^\|]+)(?:\|(.*))?", "Image:ola|ole").groups()
>     ('ola', 'ole')
>
> The second and third results are right, but not the first one, where
> it should be equal to the second (i.e., it should be an empty string
> instead of None). This is because I want to use re.sub() and when the
> group is None, it blows up with a stack trace...
>
> Maybe I'm not getting the essence of groups and non-grouping groups.
> Someone care to explain (and give the correct solution :)) ?
>
> Thanks in advance,
>
> Hugo Ferreira
>
> --
> GPG Fingerprint: B0D7 1249 447D F5BB 22C5 5B9B 078C 2615 504B 7B85

From the documentation:

    groups([default])
        Return a tuple containing all the subgroups of the match, from 1
        up to however many groups are in the pattern. The default
        argument is used for groups that did not participate in the
        match; it defaults to None.

Your second group is optional and does not take part in the match in your first example. You can, however, still use this regular expression if you use groups('') rather than groups().

A better way is probably to use a simplified regular expression:

    re.match(r"Image:([^\|]+)\|?(.*)", "Image:ola").groups()

i.e. match the text "Image:" followed by at least one character not matching | followed by an optional | followed by any remaining characters.
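Both suggestions can be tried side by side. In the sketch below the `render` function and its HTML output are purely hypothetical, standing in for whatever replacement the original poster had in mind; passing a function to `sub()` and calling `groups('')` inside it avoids handling None altogether. On Python 3.5 and later, an unmatched group in a replacement template is substituted with an empty string, so the failure described in the question is specific to older versions.

```python
import re

optional = re.compile(r"Image:([^\|]+)(?:\|(.*))?")
simplified = re.compile(r"Image:([^\|]+)\|?(.*)")

# Original pattern: the second group does not participate in the match.
with_none = optional.match("Image:ola").groups()
with_default = optional.match("Image:ola").groups('')

# Simplified pattern: the second group always participates, possibly empty.
simple_bare = simplified.match("Image:ola").groups()
simple_full = simplified.match("Image:ola|ole").groups()

def render(match):
    """Hypothetical replacement: build a tag from the name and optional caption."""
    name, caption = match.groups('')
    return '<img src="%s" alt="%s">' % (name, caption)

rendered = optional.sub(render, "Image:ola")

print(with_none, with_default, simple_bare, simple_full)
print(rendered)
```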
Re: Insert characters into string based on re ?
Matt wrote:
> I am attempting to reformat a string, inserting newlines before
> certain phrases. For example, in formatting SQL, I want to start a new
> line at each JOIN condition. Noting that strings are immutable, I
> thought it best to split the string at the key points, then join with
> '\n'.
>
> Regexps seem the best way to identify the points in the string
> ('LEFT.*JOIN' to cover 'LEFT OUTER JOIN' and 'LEFT JOIN'), since I
> need to identify multiple locations in the string. However, the
> re.split method returns the list without the split phrases, and
> re.findall does not seem useful for this operation.
>
> Suggestions?

I think that re.sub is a more appropriate method than split and join.

Trivial example (non SQL):

    >>> addnlre = re.compile('LEFT\s.*?\s*JOIN|RIGHT\s.*?\s*JOIN', re.DOTALL + re.IGNORECASE).sub
    >>> addnlre(lambda x: x.group() + '\n', '... LEFT JOIN x RIGHT OUTER join y')
    '... LEFT JOIN\n x RIGHT OUTER join\n y'
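The example in the reply appends the newline after each JOIN phrase. Since the question asks for the line break before the phrase, here is a hedged variant that does that with a backreference. The tightened pattern (an optional OUTER instead of `.*?`), the name `JOIN_RE` and the sample SQL are my own assumptions, not taken from the thread.

```python
import re

# Swallow the whitespace in front of the phrase, then re-emit the phrase
# on a fresh line. Assumed grammar: LEFT or RIGHT, optional OUTER, then JOIN.
JOIN_RE = re.compile(r'\s*\b((?:LEFT|RIGHT)\s+(?:OUTER\s+)?JOIN)\b',
                     re.IGNORECASE)

def break_before_joins(sql):
    """Start a new line at each LEFT/RIGHT [OUTER] JOIN."""
    return JOIN_RE.sub(r'\n\1', sql)

sql = ('SELECT * FROM a LEFT JOIN b ON a.id = b.id '
       'RIGHT OUTER join c ON b.id = c.id')
formatted = break_before_joins(sql)
print(formatted)
```

Restricting the middle of the pattern to an optional OUTER avoids the risk, present with `.*?` and `re.DOTALL`, of a stray LEFT elsewhere in the statement pairing up with a distant JOIN.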
Re: Need a Regular expression to remove a char for Unicode text
శ్రీనివాస wrote:
> Hai friends,
>
> Can anyone tell me how I can remove a character from a Unicode text?
> కల్హార is a Telugu word in Unicode. Here I want to remove '' but not
> replace it with a zero width char. And one more thing: if there is any
> whitespace before and after the '' char, the text should be kept as it
> is. Please tell me how I can work this out with regular expressions.
>
> Thanks and regards
> Srinivasa Raju Datla

Don't know anything about Telugu, but is this the approach you want?

    >>> x = u'\xfe\xff & \xfe\xff \xfe\xff&\xfe\xff'
    >>> noampre = re.compile('(?<!\s)&(?!\s)', re.UNICODE).sub
    >>> noampre('', x)
    u'\xfe\xff & \xfe\xff \xfe\xff\xfe\xff'

The regular expression has negative look-behind and look-ahead assertions to check that there is no whitespace surrounding the '&' character. Each match found is then replaced with the empty string.
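A small Python 3 sketch of the look-around technique described in the reply. The character being removed is assumed to be `&`, inferred from the helper's name `noampre`; the archive has lost the actual character, so substitute the real one as needed. The sample strings and the name `strip_char` are hypothetical.

```python
import re

# Remove '&' only when the character before it is not whitespace AND the
# character after it is not whitespace. The '&' itself is an assumption.
NO_AMP_RE = re.compile(r'(?<!\s)&(?!\s)')

def strip_char(text):
    """Delete the target character wherever neither neighbour is whitespace."""
    return NO_AMP_RE.sub('', text)

cleaned = strip_char('ab & cd ef&gh')
print(cleaned)

# Whitespace on just one side is enough to keep the character in place.
print(strip_char('a &b'), strip_char('a& b'))
```

In Python 3, `str` patterns are Unicode-aware by default, so the `re.UNICODE` flag from the original reply is not needed for `\s` to recognise Unicode whitespace.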