Re: [Tutor] making a custom file parser?

2012-01-09 Thread Devin Jeanpierre
> IIRC, Python's only non-regular feature is backreferences though

Probably. I'm not too familiar with a couple other features or how
their semantics work, in particular the (?(id)yes|no) syntax.
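
For reference, a minimal sketch of how the (?(id)yes|no) conditional behaves in Python's re module. The pattern and the sample strings are made up for illustration: the closing ">" is required only when group 1 (the opening "<") took part in the match.

```python
import re

# Group 1 optionally matches "<"; the conditional (?(1)>) then demands a
# closing ">" only if group 1 matched something.
maybe_bracketed = re.compile(r"(<)?(\w+)(?(1)>)")

def is_well_bracketed(text):
    """True for 'name' or '<name>', False for '<name' or 'name>'."""
    return maybe_bracketed.fullmatch(text) is not None

for sample in ("<unit>", "unit", "<unit", "unit>"):
    print(sample, is_well_bracketed(sample))
```

Whether this construct adds power beyond what backreferences already give is a separate question; the sketch only shows the semantics.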

> I'm not calling bs or anything, I don't know anything about .net
> regexes and I'll readily believe it can be done (I just want to see
> the code for myself).

They add the ability to push and pop from a stack, which makes their
regular expressions at least as powerful as push-down automata,
which are equivalent in power to context-free grammars, which means
they can match XML. I think this has been well-known in the .NET
community for years, but nobody had ever done it, and nobody ever
mentioned it. It's a dirty secret you don't tell the newbies because
then they think regexps are fine to use for everything.
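
To make the stack point concrete, here is a rough Python sketch (not the .NET regex feature itself) of the bookkeeping that nesting requires. It assumes the only nesting tag is <unit>, so a depth counter stands in for the stack; the function name top_level_units is made up for illustration.

```python
import re

TAG = re.compile(r"</?unit>")

def top_level_units(text):
    """Return the contents of each outermost <unit>...</unit> pair."""
    depth = 0        # how many <unit> tags are currently open
    start = None     # offset just past the outermost opening tag
    units = []
    for m in TAG.finditer(text):
        if m.group() == "<unit>":
            if depth == 0:
                start = m.end()
            depth += 1
        else:
            if depth == 0:
                raise ValueError("unmatched </unit> at offset %d" % m.start())
            depth -= 1
            if depth == 0:
                units.append(text[start:m.start()])
    if depth != 0:
        raise ValueError("unclosed <unit>")
    return units

print(top_level_units("<unit>a<unit>b</unit>c</unit><unit>d</unit>"))
```

The counter can grow without bound, which is exactly what a finite state machine cannot track.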

It's also why I don't like the "this isn't regular so don't use
regular expressions" spiel. We call things regular expressions even
when they're context-free parsing expressions! The term has meaning,
but it's no longer tied to finite state automata, and any argument
along those lines is just waiting to be broken by the next feature
addition to the re module.

Anyway, I found the reference I was thinking of:
http://porg.es/blog/so-it-turns-out-that-dot-nets-regex-are-more-powerful-than-i-originally-thought

> Quite right. We haven't seen enough of it to be sure, but that little
> bite seems parseable enough with some basic string methods and one or
> two regexes. That's really all you need, and trying to do the whole
> thing with pure regex is just needlessly overcomplicating things (I'm
> pretty sure we all actually agree on that).

Oh I dunno. If the regex would be simple, it'd be the simplest
solution. As soon as you have order-independence though...

> You mean like flex/bison? May be overkill, but then again, maybe not.
> So much depends on the data.

Flex/Bison are a little old-school / difficult to deal with. I'm more
thinking LEPL or PyMeta or something.

-- Devin


Re: [Tutor] making a custom file parser?

2012-01-08 Thread Hugo Arts
On Mon, Jan 9, 2012 at 2:19 AM, Devin Jeanpierre  wrote:
>> Parsing XML with regular expressions is generally a very bad idea. In
>> the general case, it's actually impossible. XML is not what is called
>> a regular language, and therefore cannot be parsed with regular
>> expressions. You can use regular expressions to grab a limited amount
>> of data from a limited set of XML files, but this is dangerous, hard,
>> and error-prone.
>
> Python regexes aren't regular, and this isn't XML.
>
> A working XML parser has been written using .NET regexes (sorry, no
> citation -- can't find it), and they only have one extra feature
> (recursion, of course). And it was dreadfully ugly and nasty and
> probably terrible to maintain -- that's the real cost of regexes.
>

IIRC, Python's only non-regular feature is backreferences though; I'm
pretty sure that isn't enough to parse XML. It does not make it
powerful enough to parse context-free languages. I really would like
that citation though, tried googling for it but not much turned up.
I'm not calling bs or anything, I don't know anything about .net
regexes and I'll readily believe it can be done (I just want to see
the code for myself). But really I still wouldn't dare try without a
feature set like perl 6's regexes. And even then..
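
A small sketch of what a backreference does in Python's re module, for readers who have not used one. The tag names are illustrative; \1 must repeat exactly what group 1 captured, so the closing tag has to name the same element as the opening tag.

```python
import re

# \1 refers back to whatever (\w+) captured in the opening tag.
paired = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def first_pair(text):
    """Return (tag, contents) for the first matching open/close pair, or None."""
    m = paired.search(text)
    return None if m is None else (m.group(1), m.group(2))

print(first_pair("<name>Scout</name>"))
print(first_pair("<name>Scout</unit>"))
```

This handles "same name on both ends" but still does not count nesting depth.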

You're technically correct (it's the best kind), but I feel like it
doesn't really take away the general correctness of my advice ;)

> In particular, his data actually does look regular.
>

Quite right. We haven't seen enough of it to be sure, but that little
bite seems parseable enough with some basic string methods and one or
two regexes. That's really all you need, and trying to do the whole
thing with pure regex is just needlessly overcomplicating things (I'm
pretty sure we all actually agree on that).

>> I'll assume that said "<unit>(.*)</unit>". There's still a few problems: < and >
>> shouldn't be escaped, which is why you're not getting any matches.
>> Also you shouldn't use * because it is greedy, matching as much as
>> possible. So it would match everything in between the first <unit> and
>> the last </unit> tag in the file, including other <unit> tags
>> that might show up.
>
> On the "can you do work with this with regexes" angle: if units can be
> nested, then neither greedy nor non-greedy matching will work. That's
> a particular case where regular expressions can't work for your data.
>
>> Test it carefully, ditch elementtree, use as little regexes as
>> possible (string functions are your friends! startswith, split, strip,
>> et cetera) and you might end up with something that is only slightly
>> ugly and mostly works. That said, I'd still advise against it. turning
>> the files into valid XML and then using whatever XML parser you fancy
>> will probably be easier.
>
> He'd probably do that using regexes.
>

Yeah, that's what I was thinking when I said it too. Something like,
one regex to quote attributes, and one that adds close tags at the
earliest opportunity. Like right before a newline? It looks okay based
on just that sample, but it's really hard to say. The viability of
regexes depends so much on the dataset you have. If you can make the
dataset valid XML with just three regexes (quotes, end tags, comments)
then just parse it that way, that sounds like the simplest possible
option.
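
A rough sketch of that preprocessing idea, under several assumptions that the thread does not confirm: every value tag looks like <tag=value>, comments sit on their own lines and start with an apostrophe, and value tags never need a separate closing tag. The attribute name "value", the root element "units" and the sample data are invented for illustration. Quoting and closing are folded into one substitution here, so it uses two regexes rather than three.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

COMMENT = re.compile(r"^[ \t]*'.*$", re.MULTILINE)   # apostrophe comment lines
VALUE_TAG = re.compile(r"<(\w+)=([^>]*)>")           # <tag=value>

def to_xml(text):
    """Turn the xml-ish text into a well-formed XML string."""
    text = COMMENT.sub("", text)
    # <tag=value> becomes <tag value="..."/>, with the value escaped and quoted.
    text = VALUE_TAG.sub(
        lambda m: "<%s value=%s/>" % (m.group(1), quoteattr(m.group(2).strip())),
        text,
    )
    # XML needs a single root element, so wrap everything in one.
    return "<units>%s</units>" % text

sample = """\
<unit>
<number=1>
<name=Scout & Co>
'this line is a comment
</unit>
"""

root = ET.fromstring(to_xml(sample))
names = [unit.find("name").get("value") for unit in root.findall("unit")]
print(names)
```

If the real files contain stray text outside tags, or apostrophes at the start of value lines, this will need adjusting.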

> Easiest way is probably to write a real parser using some PEG or CFG
> thingy. Less error-prone.
>

You mean like flex/bison? May be overkill, but then again, maybe not.
So much depends on the data.

> Overall agree with advice, though. Just being picky. Sorry.
>
> -- Devin
>
>

I love being picky myself, so I don't mind, as long as there is a
disclaimer somewhere ;) Cheers,
Hugo
___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] making a custom file parser?

2012-01-08 Thread Devin Jeanpierre
> Parsing XML with regular expressions is generally a very bad idea. In
> the general case, it's actually impossible. XML is not what is called
> a regular language, and therefore cannot be parsed with regular
> expressions. You can use regular expressions to grab a limited amount
> of data from a limited set of XML files, but this is dangerous, hard,
> and error-prone.

Python regexes aren't regular, and this isn't XML.

A working XML parser has been written using .NET regexes (sorry, no
citation -- can't find it), and they only have one extra feature
(recursion, of course). And it was dreadfully ugly and nasty and
probably terrible to maintain -- that's the real cost of regexes.

In particular, his data actually does look regular.

> I'll assume that said "<unit>(.*)</unit>". There's still a few problems: < and >
> shouldn't be escaped, which is why you're not getting any matches.
> Also you shouldn't use * because it is greedy, matching as much as
> possible. So it would match everything in between the first <unit> and
> the last </unit> tag in the file, including other <unit> tags
> that might show up.

On the "can you work with this using regexes" angle: if units can be
nested, then neither greedy nor non-greedy matching will work. That's
a particular case where regular expressions can't work for your data.
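
A quick demonstration of the nesting problem, using a made-up string in which one unit contains another. The wanted result for the first outer unit would be "a<unit>b</unit>c", and neither quantifier produces it.

```python
import re

nested = "<unit>a<unit>b</unit>c</unit><unit>d</unit>"

# Greedy: runs to the LAST </unit>, swallowing the second top-level unit.
greedy = re.search(r"<unit>(.*)</unit>", nested).group(1)

# Non-greedy: stops at the FIRST </unit>, which closes the inner unit.
lazy = re.search(r"<unit>(.*?)</unit>", nested).group(1)

print(greedy)  # a<unit>b</unit>c</unit><unit>d
print(lazy)    # a<unit>b
```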

> Test it carefully, ditch elementtree, use as little regexes as
> possible (string functions are your friends! startswith, split, strip,
> et cetera) and you might end up with something that is only slightly
> ugly and mostly works. That said, I'd still advise against it. turning
> the files into valid XML and then using whatever XML parser you fancy
> will probably be easier.

He'd probably do that using regexes.

Easiest way is probably to write a real parser using some PEG or CFG
thingy. Less error-prone.
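
As an illustration of the "real parser" route, here is a small hand-written recursive-descent sketch rather than a PEG library. It assumes one tag per line, <name> opening a container, </name> closing it, and <name=value> being a property; AttackTable comes from the thread, while the "damage" property and the sample values are invented.

```python
import re

# One token per line: <name> opens, </name> closes, <name=value> is a property.
TOKEN = re.compile(r"<(/?)(\w+)(?:=([^>]*))?>")

def tokenize(text):
    tokens = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("'"):
            continue  # blank line or apostrophe comment
        m = TOKEN.fullmatch(line)
        if m is None:
            raise ValueError("cannot tokenize %r" % raw)
        closing, name, value = m.groups()
        if closing:
            tokens.append(("close", name, None))
        elif value is not None:
            tokens.append(("prop", name, value.strip()))
        else:
            tokens.append(("open", name, None))
    return tokens

def parse_element(tokens, pos):
    """element := OPEN (PROP | element)* CLOSE; returns (node, next position)."""
    kind, name, _ = tokens[pos]
    if kind != "open":
        raise ValueError("expected an opening tag, got %r" % (tokens[pos],))
    node = {"tag": name, "props": {}, "children": []}
    pos += 1
    while pos < len(tokens):
        kind, tag, value = tokens[pos]
        if kind == "prop":
            node["props"][tag] = value
            pos += 1
        elif kind == "open":
            child, pos = parse_element(tokens, pos)  # recurse for nested tags
            node["children"].append(child)
        elif tag == name:
            return node, pos + 1
        else:
            raise ValueError("</%s> closes <%s>" % (tag, name))
    raise ValueError("<%s> is never closed" % name)

def parse(text):
    tokens = tokenize(text)
    pos, roots = 0, []
    while pos < len(tokens):
        node, pos = parse_element(tokens, pos)
        roots.append(node)
    return roots

sample = """\
<unit>
<name=Scout>
'this line is a comment
<AttackTable>
<damage=5>
</AttackTable>
</unit>
"""
tree = parse(sample)
print(tree[0]["props"], tree[0]["children"][0]["tag"])
```

The recursion in parse_element is what handles arbitrary nesting, and mismatched or missing close tags raise an error instead of silently matching the wrong span.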

Overall agree with advice, though. Just being picky. Sorry.

-- Devin



Re: [Tutor] making a custom file parser?

2012-01-07 Thread Lie Ryan

On 01/08/2012 04:53 AM, Alex Hall wrote:

Hello all,
I have a file with xml-ish code in it, the definitions for units in a
real-time strategy game. I say xml-ish because the tags are like xml,
but no quotes are used and most tags do not have to end. Also,
comments in this file are prefaced by an apostrophe, and there is no
multi-line commenting syntax. For example:






'this line is a comment




The format is closer to sgml than to xml, except for the tag being able 
to have values. I'd say you probably would have a better chance of 
transforming this into sgml than transforming it to xml.


Try this re:

s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)

and use an SGML parser to parse the result. I find Fredrik Lundh's 
sgmlop to be easier to use for this one, just use easy_install or pip to 
install sgmlop.
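
To see what that substitution does on its own, here is a tiny sketch. The sample lines are made up, and the helper name add_attribute_names is only for illustration: tags of the form <tag=value> gain a named, quoted attribute, and everything else passes through unchanged.

```python
import re

def add_attribute_names(text):
    # <tag=value> becomes <tag __attribute__="value">
    return re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', text)

print(add_attribute_names("<number=1>"))
print(add_attribute_names("<unit>"))
```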


import re
import sgmlop

class Unit(object): pass

class handler:
    def __init__(self):
        self.units = {}
    def finish_starttag(self, tag, attrs):
        # turn the attribute pairs into a dict for lookup by name
        attrs = dict(attrs)
        if tag == 'unit':
            # a new unit starts; collect its fields on a fresh object
            self.current = Unit()
        elif tag == 'number':
            self.current.number = int(attrs['__attribute__'])
        elif tag == 'canmove':
            self.current.canmove = attrs['__attribute__'] == 'True'
        elif tag in ('name', 'cancarry'):
            setattr(self.current, tag, attrs['__attribute__'])
        else:
            print 'unknown tag', tag, attrs
    def finish_endtag(self, tag):
        if tag == 'unit':
            # the unit is complete; index it by name
            self.units[self.current.name] = self.current
            del self.current
    def handle_data(self, data):
        # text outside of tags (such as the comment lines) ends up here
        if not data.isspace(): print data.strip()

s = '''





'this line is a comment






'this line is a comment






'this line is a comment






'this line is a comment

'''
s = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', s)
parser = sgmlop.SGMLParser()
h = handler()
parser.register(h)
parser.parse(s)
print h.units
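
For anyone who cannot install sgmlop, a similar approach can be sketched with the standard library's html.parser on Python 3. This is a swapped-in parser, not the sgmlop API, and the sample values below are invented; it stores each unit as a plain dict of strings instead of a Unit object.

```python
import re
from html.parser import HTMLParser

class UnitParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.units = []
        self.current = None

    def handle_starttag(self, tag, attrs):
        if tag == "unit":
            self.current = {}
        elif self.current is not None:
            # the re.sub below stores each tag's value under __attribute__
            self.current[tag] = dict(attrs).get("__attribute__")

    def handle_endtag(self, tag):
        if tag == "unit" and self.current is not None:
            self.units.append(self.current)
            self.current = None

sample = """\
<unit>
<number=1>
<name=Scout>
<canmove=True>
'this line is a comment
</unit>
"""

prepared = re.sub('<([a-zA-Z]+)=([^>]+)>', r'<\1 __attribute__="\2">', sample)
parser = UnitParser()
parser.feed(prepared)
parser.close()
print(parser.units)
```

Note that HTMLParser lowercases tag names, so mixed-case tags such as AttackTable would arrive as "attacktable".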



Re: [Tutor] making a custom file parser?

2012-01-07 Thread Hugo Arts
On Sat, Jan 7, 2012 at 8:22 PM, Alex Hall  wrote:
> I had planned to parse myself, but am not sure how to go about it. I
> assume regular expressions, but I couldn't even find the amount of
> units in the file by using:
> unitReg=re.compile(r"\<unit\>(*)\</unit\>")
> unitCount=unitReg.search(fileContents)
> print "number of units: "+unitCount.len(groups())
>
> I just get an exception that "None type object has no attribute
> groups", meaning that the search was unsuccessful. What I was hoping
> to do was to grab everything between the opening and closing unit
> tags, then read it one at a time and parse further. There is a tag
> inside a unit tag called AttackTable which also terminates, so I would
> need to pull that out and work with it separately. I probably just
> have misunderstood how regular expressions and groups work...
>

Parsing XML with regular expressions is generally a very bad idea. In
the general case, it's actually impossible. XML is not what is called
a regular language, and therefore cannot be parsed with regular
expressions. You can use regular expressions to grab a limited amount
of data from a limited set of XML files, but this is dangerous, hard,
and error-prone.

As long as you realize this, though, you could possibly give it a shot
(here be dragons, you have been warned).

> unitReg=re.compile(r"\<unit\>(*)\</unit\>")

This is probably not what you actually did, because it fails with a
different error:

>>> a = re.compile(r"\<unit\>(*)\</unit\>")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 188, in compile
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/re.py", line 243, in _compile
sre_constants.error: nothing to repeat

I'll assume that said "<unit>(.*)</unit>". There's still a few problems: < and >
shouldn't be escaped, which is why you're not getting any matches.
Also you shouldn't use * because it is greedy, matching as much as
possible. So it would match everything in between the first <unit> and
the last </unit> tag in the file, including other <unit> tags
that might show up. What you want is more like this:

unit_reg = re.compile(r"<unit>(.*?)</unit>")
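
A short sketch of counting units with the non-greedy pattern. One caveat: by default "." does not match a newline, so when a unit spans several lines the pattern needs the re.DOTALL flag. The file contents below are invented stand-ins for the real data.

```python
import re

# DOTALL lets "." cross line breaks, so a multi-line unit can match.
unit_reg = re.compile(r"<unit>(.*?)</unit>", re.DOTALL)

file_contents = """\
<unit>
<number=1>
<name=Scout>
</unit>
<unit>
<number=2>
<name=Tank>
</unit>
"""

# findall returns one captured body per non-overlapping match.
units = unit_reg.findall(file_contents)
print("number of units: %d" % len(units))

# Without DOTALL the same pattern finds nothing in this multi-line input.
without_flag = re.findall(r"<unit>(.*?)</unit>", file_contents)
```

Using findall (or finditer) also avoids the original problem of calling search once and then trying to count groups.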

Test it carefully, ditch elementtree, use as few regexes as
possible (string functions are your friends! startswith, split, strip,
et cetera) and you might end up with something that is only slightly
ugly and mostly works. That said, I'd still advise against it. Turning
the files into valid XML and then using whatever XML parser you fancy
will probably be easier. Adding quotes and closing tags and removing
comments with regexes is still bad, but easier than parsing the whole
thing with regexes.

HTH,
Hugo


Re: [Tutor] making a custom file parser?

2012-01-07 Thread Alex Hall
I had planned to parse myself, but am not sure how to go about it. I
assume regular expressions, but I couldn't even find the amount of
units in the file by using:
unitReg=re.compile(r"\<unit\>(*)\</unit\>")
unitCount=unitReg.search(fileContents)
print "number of units: "+unitCount.len(groups())

I just get an exception that "None type object has no attribute
groups", meaning that the search was unsuccessful. What I was hoping
to do was to grab everything between the opening and closing unit
tags, then read it one at a time and parse further. There is a tag
inside a unit tag called AttackTable which also terminates, so I would
need to pull that out and work with it separately. I probably just
have misunderstood how regular expressions and groups work...




-- 
Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap


Re: [Tutor] making a custom file parser?

2012-01-07 Thread Chris Fuller

If it's unambiguous as to which tags are closed and which are not, then it's 
pretty easy to preprocess the file into valid XML.  Scan for the naughty bits 
(single quotes) and insert escape characters, replace with something else, 
etc., then scan for the unterminated tags and throw in a "/" at the end.

Anyhow, if there's no tree structure, or it's only one level deep, using 
ElementTree is probably overkill and just gives you lots of leaking 
abstractions to plug for little benefit.  Why not just scan the file directly?
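
A sketch of what scanning the file directly might look like, assuming a flat format: one tag per line, units delimited by <unit> and </unit>, values written as <tag=value>, and comment lines starting with an apostrophe. The function name scan_units and the sample values are made up for illustration.

```python
def scan_units(text):
    """Collect each unit's <tag=value> lines into a dict of strings."""
    units = []
    current = None   # dict for the unit being read, or None between units
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("'"):
            continue  # skip blank lines and apostrophe comments
        if line == "<unit>":
            current = {}
        elif line == "</unit>" and current is not None:
            units.append(current)
            current = None
        elif (current is not None and line.startswith("<")
              and line.endswith(">") and "=" in line):
            key, _, value = line[1:-1].partition("=")
            current[key.strip()] = value.strip()
        else:
            raise ValueError("unexpected line: %r" % raw)
    return units

sample = """\
<unit>
<number=1>
<name=Scout>
<cancarry=Medic, Sniper>
'this line is a comment
</unit>
"""
print(scan_units(sample))
```

This only uses startswith, endswith, strip and partition, and it stops with an error on anything it does not recognise. It does not handle nested tags.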

Cheers




[Tutor] making a custom file parser?

2012-01-07 Thread Alex Hall
Hello all,
I have a file with xml-ish code in it, the definitions for units in a
real-time strategy game. I say xml-ish because the tags are like xml,
but no quotes are used and most tags do not have to end. Also,
comments in this file are prefaced by an apostrophe, and there is no
multi-line commenting syntax. For example:






'this line is a comment


The game is not mine, but I would like to put together a python
interface to more easily manage custom units for it. To do that, I
have to be able to parse these files, but elementtree does not seem to
like them very much. I imagine it is due to the lack of quotes, the
non-standard commenting method, and the lack of closing tags. I think
my only recourse here is to create my own parser and tell elementtree
to use that. The docs say this is possible, but they also seem to
indicate that the parser has to already exist in the elementtree
package and there is no mention of making one's own method for
parsing. Even if this were possible, though, I am not sure how to go
about it. I can of course strip comments, but that is as far as I have
gotten.

Bottom line: can I create a method and tell elementtree to parse using
it, and what would such a function look like (generally) if I can?
Thanks!

-- 
Have a great day,
Alex (msg sent from GMail website)
mehg...@gmail.com; http://www.facebook.com/mehgcap