Re: [Tutor] making a custom file parser?
> IIRC, Python's only non-regular feature is backreferences though

Probably. I'm not too familiar with a couple of other features or how
their semantics work, in particular the (?(id)yes|no) syntax.

> I'm not calling bs or anything, I don't know anything about .net
> regexes and I'll readily believe it can be done (I just want to see
> the code for myself).

They add the ability to push and pop from a stack, which makes their
regular expressions at least as powerful as push-down automata, which
are equivalent in power to context-free grammars, which means they can
match XML. I think this has been well-known in the .NET community for
years, but nobody had ever done it, and nobody ever mentioned it. It's
a dirty secret you don't tell the newbies because then they think
regexps are fine to use for everything.

It's also why I don't like the "this isn't regular so don't use
regular expressions" spiel. We call things regular expressions even
when they're context-free parsing expressions! The term has meaning,
but it's no longer tied to finite state automata, and any argument
along those lines is just waiting to be broken by the next feature
addition to the re module.

Anyway, I found the reference I was thinking of:
http://porg.es/blog/so-it-turns-out-that-dot-nets-regex-are-more-powerful-than-i-originally-thought

> Quite right. We haven't seen enough of it to be sure, but that little
> bite seems parseable enough with some basic string methods and one or
> two regexes. That's really all you need, and trying to do the whole
> thing with pure regex is just needlessly overcomplicating things (I'm
> pretty sure we all actually agree on that).

Oh I dunno. If the regex would be simple, it'd be the simplest
solution. As soon as you have order-independence though...

> You mean like flex/bison? May be overkill, but then again, maybe not.
> So much depends on the data.

Flex/Bison are a little old-school / difficult to deal with. I'm
thinking more of LEPL or PyMeta or something.
-- Devin

On Sun, Jan 8, 2012 at 9:06 PM, Hugo Arts wrote:
> On Mon, Jan 9, 2012 at 2:19 AM, Devin Jeanpierre wrote:
>>> Parsing XML with regular expressions is generally a very bad idea. In
>>> the general case, it's actually impossible. XML is not what is called
>>> a regular language, and therefore cannot be parsed with regular
>>> expressions. You can use regular expressions to grab a limited amount
>>> of data from a limited set of XML files, but this is dangerous, hard,
>>> and error-prone.
>>
>> Python regexes aren't regular, and this isn't XML.
>>
>> A working XML parser has been written using .NET regexes (sorry, no
>> citation -- can't find it), and they only have one extra feature
>> (recursion, of course). And it was dreadfully ugly and nasty and
>> probably terrible to maintain -- that's the real cost of regexes.
>>
>
> IIRC, Python's only non-regular feature is backreferences though; I'm
> pretty sure that isn't enough to parse XML. It does not make it
> powerful enough to parse context-free languages. I really would like
> that citation though, tried googling for it but not much turned up.
> I'm not calling bs or anything, I don't know anything about .net
> regexes and I'll readily believe it can be done (I just want to see
> the code for myself). But really I still wouldn't dare try without a
> feature set like perl 6's regexes. And even then..
>
> You're technically correct (it's the best kind), but I feel like it
> doesn't really take away the general correctness of my advice ;)
>
>> In particular, his data actually does look regular.
>>
>
> Quite right. We haven't seen enough of it to be sure, but that little
> bite seems parseable enough with some basic string methods and one or
> two regexes. That's really all you need, and trying to do the whole
> thing with pure regex is just needlessly overcomplicating things (I'm
> pretty sure we all actually agree on that).
>
>>> I'll assume that said "<unit>(.*)</unit>". There's still a few
>>> problems: < and > shouldn't be escaped, which is why you're not
>>> getting any matches. Also you shouldn't use * because it is greedy,
>>> matching as much as possible. So it would match everything in between
>>> the first <unit> and the last </unit> tag in the file, including
>>> other <unit> tags that might show up.
>>
>> On the "can you do work with this with regexes" angle: if units can be
>> nested, then neither greedy nor non-greedy matching will work. That's
>> a particular case where regular expressions can't work for your data.
>>
>>> Test it carefully, ditch elementtree, use as few regexes as
>>> possible (string functions are your friends! startswith, split, strip,
>>> et cetera) and you might end up with something that is only slightly
>>> ugly and mostly works. That said, I'd still advise against it. Turning
>>> the files into valid XML and then using whatever XML parser you fancy
>>> will probably be easier.
>>
>> He'd probably do that using regexes.
>>
>
> Yeah, that's what I was thinking when I said
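To make the "Python regexes aren't regular" point concrete, here is a minimal sketch of the two features mentioned in this message: a backreference and the (?(id)yes|no) conditional. The pattern names and the helper `matches` are made up for illustration; only the `re` syntax itself comes from the standard library.

```python
import re

# Backreference: \1 must repeat exactly what group 1 captured, so this
# accepts a^n b a^n (same number of a's on both sides), which no true
# regular expression (finite automaton) can describe.
same_on_both_sides = re.compile(r"^(a+)b\1$")

# Conditional: (?(1)\)) requires a closing ")" only if group 1, the
# optional opening "(", took part in the match.
optionally_parenthesised = re.compile(r"^(\()?\w+(?(1)\))$")


def matches(pattern, text):
    """Return True if the compiled pattern matches the whole text."""
    return pattern.match(text) is not None
```

Neither feature gives the pattern a stack, so this does not make arbitrarily nested tags matchable; it only shows that the engine already goes beyond the textbook definition.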
Re: [Tutor] making a custom file parser?
On Mon, Jan 9, 2012 at 2:19 AM, Devin Jeanpierre wrote:
>> Parsing XML with regular expressions is generally a very bad idea. In
>> the general case, it's actually impossible. XML is not what is called
>> a regular language, and therefore cannot be parsed with regular
>> expressions. You can use regular expressions to grab a limited amount
>> of data from a limited set of XML files, but this is dangerous, hard,
>> and error-prone.
>
> Python regexes aren't regular, and this isn't XML.
>
> A working XML parser has been written using .NET regexes (sorry, no
> citation -- can't find it), and they only have one extra feature
> (recursion, of course). And it was dreadfully ugly and nasty and
> probably terrible to maintain -- that's the real cost of regexes.
>

IIRC, Python's only non-regular feature is backreferences though; I'm
pretty sure that isn't enough to parse XML. It does not make it
powerful enough to parse context-free languages. I really would like
that citation though, tried googling for it but not much turned up.
I'm not calling bs or anything, I don't know anything about .net
regexes and I'll readily believe it can be done (I just want to see
the code for myself). But really I still wouldn't dare try without a
feature set like perl 6's regexes. And even then..

You're technically correct (it's the best kind), but I feel like it
doesn't really take away the general correctness of my advice ;)

> In particular, his data actually does look regular.
>

Quite right. We haven't seen enough of it to be sure, but that little
bite seems parseable enough with some basic string methods and one or
two regexes. That's really all you need, and trying to do the whole
thing with pure regex is just needlessly overcomplicating things (I'm
pretty sure we all actually agree on that).

>> I'll assume that said "<unit>(.*)</unit>". There's still a few
>> problems: < and > shouldn't be escaped, which is why you're not
>> getting any matches. Also you shouldn't use * because it is greedy,
>> matching as much as possible. So it would match everything in between
>> the first <unit> and the last </unit> tag in the file, including
>> other <unit> tags that might show up.
>
> On the "can you do work with this with regexes" angle: if units can be
> nested, then neither greedy nor non-greedy matching will work. That's
> a particular case where regular expressions can't work for your data.
>
>> Test it carefully, ditch elementtree, use as few regexes as
>> possible (string functions are your friends! startswith, split, strip,
>> et cetera) and you might end up with something that is only slightly
>> ugly and mostly works. That said, I'd still advise against it. Turning
>> the files into valid XML and then using whatever XML parser you fancy
>> will probably be easier.
>
> He'd probably do that using regexes.
>

Yeah, that's what I was thinking when I said it too. Something like,
one regex to quote attributes, and one that adds close tags at the
earliest opportunity. Like right before a newline? It looks okay based
on just that sample, but it's really hard to say. The viability of
regexes depends so much on the dataset you have. If you can make the
dataset valid XML with just three regexes (quotes, end tags, comments)
then just parse it that way, that sounds like the simplest possible
option.

> Easiest way is probably to write a real parser using some PEG or CFG
> thingy. Less error-prone.
>

You mean like flex/bison? May be overkill, but then again, maybe not.
So much depends on the data.

> Overall agree with advice, though. Just being picky. Sorry.
>
> -- Devin
>

I love being picky myself, so I don't mind, as long as there is a
disclaimer somewhere ;)

Cheers,
Hugo
___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
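The "make it valid XML with a few regexes, then parse it" idea could look roughly like the sketch below. Everything about the input is an assumption pieced together from the thread: tags of the form <name=value> that never close, container tags such as <unit> that do close, and whole-line comments starting with an apostrophe. The sample data, the "value" attribute name and the <units> wrapper element are invented for illustration, and the quoting and closing steps are folded into one substitution instead of two.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

# Hypothetical sample in the assumed format.
SAMPLE = """<unit>
<number=1>
<name=Scout>
<canmove=True>
'this line is a comment
</unit>
"""

COMMENT_LINE = re.compile(r"(?m)^\s*'.*$")         # whole-line ' comments
VALUE_TAG = re.compile(r"<([A-Za-z]+)=([^>]*)>")   # <name=value>


def to_xml(text):
    """Rewrite the xml-ish text into well-formed XML (under the assumptions above)."""
    text = COMMENT_LINE.sub("", text)
    # <name=Scout> becomes <name value="Scout"/>; quoteattr adds the
    # quotes and escapes characters such as & and <.
    text = VALUE_TAG.sub(
        lambda m: "<%s value=%s/>" % (m.group(1), quoteattr(m.group(2))), text)
    # XML needs a single root element, so wrap everything.
    return "<units>%s</units>" % text


def load_units(text):
    """Return one dict per <unit>, mapping tag name to its value string."""
    root = ET.fromstring(to_xml(text))
    return [{child.tag: child.get("value") for child in unit}
            for unit in root.findall("unit")]
```

Calling load_units(SAMPLE) yields a list with one dictionary per unit. Stray text between tags that contains & or < would still break the XML parser, which is the kind of thing that "depends on the dataset".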
Re: [Tutor] making a custom file parser?
> Parsing XML with regular expressions is generally a very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.

Python regexes aren't regular, and this isn't XML.

A working XML parser has been written using .NET regexes (sorry, no
citation -- can't find it), and they only have one extra feature
(recursion, of course). And it was dreadfully ugly and nasty and
probably terrible to maintain -- that's the real cost of regexes.

In particular, his data actually does look regular.

> I'll assume that said "<unit>(.*)</unit>". There's still a few
> problems: < and > shouldn't be escaped, which is why you're not
> getting any matches. Also you shouldn't use * because it is greedy,
> matching as much as possible. So it would match everything in between
> the first <unit> and the last </unit> tag in the file, including
> other <unit> tags that might show up.

On the "can you do work with this with regexes" angle: if units can be
nested, then neither greedy nor non-greedy matching will work. That's
a particular case where regular expressions can't work for your data.

> Test it carefully, ditch elementtree, use as few regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. Turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier.

He'd probably do that using regexes.

Easiest way is probably to write a real parser using some PEG or CFG
thingy. Less error-prone.

Overall agree with advice, though. Just being picky. Sorry.
-- Devin

On Sat, Jan 7, 2012 at 3:15 PM, Hugo Arts wrote:
> On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall wrote:
>> I had planned to parse myself, but am not sure how to go about it. I
>> assume regular expressions, but I couldn't even find the amount of
>> units in the file by using:
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>> unitCount=unitReg.search(fileContents)
>> print "number of units: "+unitCount.len(groups())
>>
>> I just get an exception that "None type object has no attribute
>> groups", meaning that the search was unsuccessful. What I was hoping
>> to do was to grab everything between the opening and closing unit
>> tags, then read it one at a time and parse further. There is a tag
>> inside a unit tag called AttackTable which also terminates, so I would
>> need to pull that out and work with it separately. I probably just
>> have misunderstood how regular expressions and groups work...
>>
>
> Parsing XML with regular expressions is generally a very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.
>
> As long as you realize this, though, you could possibly give it a shot
> (here be dragons, you have been warned).
>
>> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
>
> This is probably not what you actually did, because it fails with a
> different error:
>
> >>> a = re.compile(r"\<unit\>(*)\</unit\>")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 188, in compile
>   File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
> sre_constants.error: nothing to repeat
>
> I'll assume that said "<unit>(.*)</unit>". There's still a few
> problems: < and > shouldn't be escaped, which is why you're not
> getting any matches. Also you shouldn't use * because it is greedy,
> matching as much as possible. So it would match everything in between
> the first <unit> and the last </unit> tag in the file, including
> other <unit> tags that might show up. What you want is more like this:
>
> unit_reg = re.compile(r"<unit>(.*?)</unit>")
>
> Test it carefully, ditch elementtree, use as few regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. Turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier. Adding quotes and closing tags and removing
> comments with regexes is still bad, but easier than parsing the whole
> thing with regexes.
>
> HTH,
> Hugo
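The nesting problem raised in this message is easy to reproduce, and a small depth counter is one way around it without a parser generator. This is only a sketch: the tag name `unit` is taken from the thread, while `top_level_units` and the sample strings in the usage note are invented for illustration.

```python
import re

UNIT_TAG = re.compile(r"<(/?)unit>")   # matches <unit> or </unit>


def top_level_units(text):
    """Return the contents of each outermost <unit>...</unit> pair.

    A regex only finds the tags; the nesting depth is tracked by hand,
    which is the part a single greedy or non-greedy pattern cannot do.
    """
    units = []
    depth = 0
    start = None
    for match in UNIT_TAG.finditer(text):
        if not match.group(1):            # opening tag
            if depth == 0:
                start = match.end()
            depth += 1
        else:                             # closing tag
            if depth == 0:
                raise ValueError("unmatched </unit> at offset %d" % match.start())
            depth -= 1
            if depth == 0:
                units.append(text[start:match.start()])
    if depth != 0:
        raise ValueError("unclosed <unit>")
    return units
```

For an input such as "<unit>a<unit>b</unit>c</unit><unit>d</unit>", the non-greedy pattern <unit>(.*?)</unit> stops at the first closing tag and cuts the outer unit short, while the greedy version runs on to the last closing tag and swallows both units; the counter returns the two outer units intact.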
Re: [Tutor] making a custom file parser?
On 01/08/2012 04:53 AM, Alex Hall wrote:
> Hello all,
> I have a file with xml-ish code in it, the definitions for units in a
> real-time strategy game. I say xml-ish because the tags are like xml,
> but no quotes are used and most tags do not have to end. Also,
> comments in this file are prefaced by an apostrophe, and there is no
> multi-line commenting syntax. For example:
>
> 'this line is a comment

The format is closer to sgml than to xml, except for the tag being able
to have values. I'd say you probably would have a better chance of
transforming this into sgml than transforming it to xml.

Try this re:

    s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)

and use an SGML parser to parse the result. I find Fredrik Lundh's
sgmlop to be easier to use for this one; just use easy_install or pip
to install sgmlop.

    import re
    import sgmlop

    class Unit(object):
        pass

    class handler:
        def __init__(self):
            self.units = {}

        def finish_starttag(self, tag, attrs):
            # attrs arrives as (name, value) pairs
            attrs = dict(attrs)
            if tag == 'unit':
                self.current = Unit()
            elif tag == 'number':
                self.current.number = int(attrs['__attribute__'])
            elif tag == 'canmove':
                self.current.canmove = attrs['__attribute__'] == 'True'
            elif tag in ('name', 'cancarry'):
                setattr(self.current, tag, attrs['__attribute__'])
            else:
                print 'unknown tag', tag, attrs

        def finish_endtag(self, tag):
            # a closing unit tag completes the unit being built
            if tag == 'unit':
                self.units[self.current.name] = self.current
                del self.current

        def handle_data(self, data):
            if not data.isspace():
                print data.strip()

    s = '''
    'this line is a comment
    'this line is a comment
    'this line is a comment
    'this line is a comment
    '''

    # turn <tag=value> into <tag __attribute__="value">
    s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)

    parser = sgmlop.SGMLParser()
    h = handler()
    parser.register(h)
    parser.parse(s)
    print h.units
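The same rewrite-then-tolerant-parser idea can be tried without a third-party package. The sketch below swaps sgmlop for the standard library's html.parser.HTMLParser (Python 3), which also copes with tags that are never closed. It is not the code from this message: the class name `UnitParser`, the function `parse_units`, the "value" attribute name and the sample data are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample in the format described in the thread.
SAMPLE = """<unit>
<number=1>
<name=Scout>
<canmove=True>
'this line is a comment
</unit>
"""


class UnitParser(HTMLParser):
    """Collect each <unit> as a dict of tag name -> value string."""

    def __init__(self):
        super().__init__()
        self.units = []
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "unit":
            self.current = {}
        elif self.current is not None:
            # attrs is a list of (name, value) pairs
            self.current[tag] = dict(attrs).get("value")

    def handle_endtag(self, tag):
        if tag == "unit" and self.current is not None:
            self.units.append(self.current)
            self.current = None


def parse_units(text):
    # Same substitution idea as above: <tag=value> -> <tag value="value">
    text = re.sub(r"<([a-zA-Z]+)=([^>]+)>", r'<\1 value="\2">', text)
    parser = UnitParser()
    parser.feed(text)
    parser.close()
    return parser.units
```

Comment lines simply arrive as character data, which this handler ignores. One thing to keep in mind is that HTMLParser lowercases tag names, so a tag such as AttackTable would be reported as "attacktable".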
Re: [Tutor] making a custom file parser?
On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall wrote:
> I had planned to parse myself, but am not sure how to go about it. I
> assume regular expressions, but I couldn't even find the amount of
> units in the file by using:
> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
> unitCount=unitReg.search(fileContents)
> print "number of units: "+unitCount.len(groups())
>
> I just get an exception that "None type object has no attribute
> groups", meaning that the search was unsuccessful. What I was hoping
> to do was to grab everything between the opening and closing unit
> tags, then read it one at a time and parse further. There is a tag
> inside a unit tag called AttackTable which also terminates, so I would
> need to pull that out and work with it separately. I probably just
> have misunderstood how regular expressions and groups work...
>

Parsing XML with regular expressions is generally a very bad idea. In
the general case, it's actually impossible. XML is not what is called
a regular language, and therefore cannot be parsed with regular
expressions. You can use regular expressions to grab a limited amount
of data from a limited set of XML files, but this is dangerous, hard,
and error-prone.

As long as you realize this, though, you could possibly give it a shot
(here be dragons, you have been warned).

> unitReg=re.compile(r"\<unit\>(*)\</unit\>")

This is probably not what you actually did, because it fails with a
different error:

    >>> a = re.compile(r"\<unit\>(*)\</unit\>")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 188, in compile
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
    sre_constants.error: nothing to repeat

I'll assume that said "<unit>(.*)</unit>". There's still a few
problems: < and > shouldn't be escaped, which is why you're not getting
any matches. Also you shouldn't use * because it is greedy, matching as
much as possible. So it would match everything in between the first
<unit> and the last </unit> tag in the file, including other <unit>
tags that might show up. What you want is more like this:

    unit_reg = re.compile(r"<unit>(.*?)</unit>")

Test it carefully, ditch elementtree, use as few regexes as possible
(string functions are your friends! startswith, split, strip, et
cetera) and you might end up with something that is only slightly ugly
and mostly works. That said, I'd still advise against it. Turning the
files into valid XML and then using whatever XML parser you fancy will
probably be easier. Adding quotes and closing tags and removing
comments with regexes is still bad, but easier than parsing the whole
thing with regexes.

HTH,
Hugo
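The greedy versus non-greedy difference described above can be checked on a tiny made-up string. This is a sketch in Python 3; the sample text and variable names are invented. One addition that is not in the message: `.` does not match a newline unless re.DOTALL is passed, which matters as soon as a unit spans several lines.

```python
import re

# Two hypothetical units, each spanning more than one line.
text = "<unit>\nfirst\n</unit>\n<unit>\nsecond\n</unit>\n"

greedy = re.compile(r"<unit>(.*)</unit>", re.DOTALL)
lazy = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)
lazy_no_dotall = re.compile(r"<unit>(.*?)</unit>")

greedy_bodies = greedy.findall(text)      # one match, first <unit> to last </unit>
lazy_bodies = lazy.findall(text)          # one match per unit
unit_count = len(lazy_bodies)

# search() returns None when nothing matches, which is where the
# "None type object has no attribute groups" error comes from.
no_match = lazy_no_dotall.search(text)
```

To count units, findall (or finditer) is the better fit than search, which only ever returns the first match. In Python's re module a backslash before < or > is harmless but unnecessary, since both are ordinary characters.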
Re: [Tutor] making a custom file parser?
I had planned to parse myself, but am not sure how to go about it. I
assume regular expressions, but I couldn't even find the amount of
units in the file by using:

    unitReg=re.compile(r"\<unit\>(*)\</unit\>")
    unitCount=unitReg.search(fileContents)
    print "number of units: "+unitCount.len(groups())

I just get an exception that "None type object has no attribute
groups", meaning that the search was unsuccessful. What I was hoping
to do was to grab everything between the opening and closing unit
tags, then read it one at a time and parse further. There is a tag
inside a unit tag called AttackTable which also terminates, so I would
need to pull that out and work with it separately. I probably just
have misunderstood how regular expressions and groups work...

On 1/7/12, Chris Fuller wrote:
>
> If it's unambiguous as to which tags are closed and which are not, then it's
> pretty easy to preprocess the file into valid XML. Scan for the naughty
> bits (single quotes) and insert escape characters, replace with something
> else, etc., then scan for the unterminated tags and throw in a "/" at the
> end.
>
> Anyhow, if there's no tree structure, or it's only one level deep, using
> ElementTree is probably overkill and just gives you lots of leaking
> abstractions to plug for little benefit. Why not just scan the file
> directly?
>
> Cheers
>
> On Saturday 07 January 2012, Alex Hall wrote:
>> Hello all,
>> I have a file with xml-ish code in it, the definitions for units in a
>> real-time strategy game. I say xml-ish because the tags are like xml,
>> but no quotes are used and most tags do not have to end. Also,
>> comments in this file are prefaced by an apostrophe, and there is no
>> multi-line commenting syntax. For example:
>>
>> 'this line is a comment
>>
>> The game is not mine, but I would like to put together a python
>> interface to more easily manage custom units for it. To do that, I
>> have to be able to parse these files, but elementtree does not seem to
>> like them very much. I imagine it is due to the lack of quotes, the
>> non-standard commenting method, and the lack of closing tags. I think
>> my only recourse here is to create my own parser and tell elementtree
>> to use that. The docs say this is possible, but they also seem to
>> indicate that the parser has to already exist in the elementtree
>> package and there is no mention of making one's own method for
>> parsing. Even if this were possible, though, I am not sure how to go
>> about it. I can of course strip comments, but that is as far as I have
>> gotten.
>>
>> Bottom line: can I create a method and tell elementtree to parse using
>> it, and what would such a function look like (generally) if I can?
>> Thanks!
>
>

--
Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap
Re: [Tutor] making a custom file parser?
If it's unambiguous as to which tags are closed and which are not, then
it's pretty easy to preprocess the file into valid XML. Scan for the
naughty bits (single quotes) and insert escape characters, replace with
something else, etc., then scan for the unterminated tags and throw in
a "/" at the end.

Anyhow, if there's no tree structure, or it's only one level deep,
using ElementTree is probably overkill and just gives you lots of
leaking abstractions to plug for little benefit. Why not just scan the
file directly?

Cheers

On Saturday 07 January 2012, Alex Hall wrote:
> Hello all,
> I have a file with xml-ish code in it, the definitions for units in a
> real-time strategy game. I say xml-ish because the tags are like xml,
> but no quotes are used and most tags do not have to end. Also,
> comments in this file are prefaced by an apostrophe, and there is no
> multi-line commenting syntax. For example:
>
> 'this line is a comment
>
> The game is not mine, but I would like to put together a python
> interface to more easily manage custom units for it. To do that, I
> have to be able to parse these files, but elementtree does not seem to
> like them very much. I imagine it is due to the lack of quotes, the
> non-standard commenting method, and the lack of closing tags. I think
> my only recourse here is to create my own parser and tell elementtree
> to use that. The docs say this is possible, but they also seem to
> indicate that the parser has to already exist in the elementtree
> package and there is no mention of making one's own method for
> parsing. Even if this were possible, though, I am not sure how to go
> about it. I can of course strip comments, but that is as far as I have
> gotten.
>
> Bottom line: can I create a method and tell elementtree to parse using
> it, and what would such a function look like (generally) if I can?
> Thanks!
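"Just scan the file directly" might look like the sketch below when the structure really is one level deep. It uses only string methods. The assumptions are loud ones: one tag per line, units delimited by <unit> and </unit>, values written as <name=value>, and comments starting with an apostrophe. The function name `scan_units` and the sample lines in the usage note are invented for illustration.

```python
def scan_units(lines):
    """Collect one dict per unit from an iterable of text lines.

    Assumes one tag per line and no nesting beyond <unit> ... </unit>.
    """
    units = []
    current = None
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("'"):
            continue                       # blank line or ' comment
        if line == "<unit>":
            current = {}
        elif line == "</unit>" and current is not None:
            units.append(current)
            current = None
        elif (current is not None and line.startswith("<")
              and line.endswith(">") and "=" in line):
            # <name=Scout> -> key "name", value "Scout"
            key, value = line[1:-1].split("=", 1)
            current[key.strip()] = value.strip()
    return units
```

A file object can be passed in directly, since iterating over it yields lines; for a quick check, scan_units("<unit>\n<name=Scout>\n</unit>".splitlines()) gives a single dictionary. A nested block such as AttackTable would need its own branch, which is the point where a real parser starts to pay off.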
[Tutor] making a custom file parser?
Hello all,
I have a file with xml-ish code in it, the definitions for units in a
real-time strategy game. I say xml-ish because the tags are like xml,
but no quotes are used and most tags do not have to end. Also,
comments in this file are prefaced by an apostrophe, and there is no
multi-line commenting syntax. For example:

'this line is a comment

The game is not mine, but I would like to put together a python
interface to more easily manage custom units for it. To do that, I
have to be able to parse these files, but elementtree does not seem to
like them very much. I imagine it is due to the lack of quotes, the
non-standard commenting method, and the lack of closing tags. I think
my only recourse here is to create my own parser and tell elementtree
to use that. The docs say this is possible, but they also seem to
indicate that the parser has to already exist in the elementtree
package and there is no mention of making one's own method for
parsing. Even if this were possible, though, I am not sure how to go
about it. I can of course strip comments, but that is as far as I have
gotten.

Bottom line: can I create a method and tell elementtree to parse using
it, and what would such a function look like (generally) if I can?
Thanks!

--
Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap
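On the bottom-line question, one possible route is to skip ElementTree's XML parser entirely and feed a hand-written scanner's events into xml.etree.ElementTree.TreeBuilder, which builds ordinary Element objects. The sketch below is written under assumptions, since the real file format is only partly visible in the thread: tags with a value (<name=Scout>) are treated as leaf elements whose text is the value, tags without a value (<unit>) are containers that are closed explicitly, and comment lines start with an apostrophe. The names `build_tree`, `TOKEN`, `SAMPLE` and the <units> root element are made up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample in the assumed format.
SAMPLE = """<unit>
<number=1>
<name=Scout>
'this line is a comment
</unit>
"""

# group 1: "/" for a closing tag, group 2: tag name, group 3: optional value
TOKEN = re.compile(r"<(/?)([A-Za-z]+)(?:=([^>]*))?>")


def build_tree(text):
    """Scan the xml-ish text and return an ElementTree root element."""
    builder = ET.TreeBuilder()
    builder.start("units", {})             # artificial root element
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("'"):
            continue                       # skip blanks and ' comments
        for match in TOKEN.finditer(line):
            closing, name, value = match.groups()
            if closing:
                builder.end(name)
            elif value is None:
                builder.start(name, {})    # container, closed later
            else:
                builder.start(name, {})    # leaf: value becomes the text
                builder.data(value)
                builder.end(name)
    builder.end("units")
    return builder.close()
```

The returned root supports the usual find, findall and iteration calls, so build_tree(SAMPLE).find("unit/name").text gives the unit's name, and the rest of a program can use the ElementTree API as if the file had been real XML.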