Re: [Tutor] module to parse XMLish text?

2011-01-15 Thread Karim


Hello,

I had not looked at the XML in detail before I posted the ElementTree
code.
In fact, with unclosed tags you will get errors at parse time, and the
parser will give you the location of the errors.
You could use Stefan's module (lxml), which is far superior to
ElementTree: it can validate against a DTD or XSD and has many other
features (speed, etc.).
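
As a rough illustration of the parse-time error described above, the sketch below feeds the sample to the standard-library ElementTree parser and reports where parsing fails. The angle brackets in SAMPLE are an assumed reconstruction of the sample text, and the helper name first_parse_error is made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample text, with angle brackets around the tags.
SAMPLE = """<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def first_parse_error(text):
    """Return the (line, column) of the first XML error, or None if text parses."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError carries the error location as a (line, column) tuple.
        return err.position
    return None


if __name__ == "__main__":
    print(first_parse_error(SAMPLE))
```

The parser stops at the first tag it cannot accept, so this only locates the problem; it does not recover any of the data.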


Regards
Karim

On 01/15/2011 07:53 AM, Stefan Behnel wrote:

Wayne Werner, 15.01.2011 03:25:

On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll wrote:

On Fri, 14 Jan 2011, Karim wrote:

  from xml.etree.ElementTree import ElementTree

I don't think straight XML parsing will work on this, as it's not valid
XML; it just looks XML-like enough to cause confusion.


It's worth trying out - most (good) parsers do the right thing even when
they don't have strictly valid code. I don't know if xml.etree is one, but
I'm fairly sure both lxml and BeautifulSoup would probably parse it
correctly.


They wouldn't. For the first tags, the text values would either not 
come out at all or they would be read as attributes and thus lose 
their order and potentially their whitespace as well. The other tags 
would likely get parsed properly, but the parser may end up nesting 
them as it hasn't found a closing tag for the previous tags yet.


So, in any case, you'd end up with data loss and/or a structure that 
would be much harder to handle than the (relatively) simple file 
structure.


Stefan

___
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor




Re: [Tutor] module to parse XMLish text?

2011-01-14 Thread Karim


Hello,

  from xml.etree.ElementTree import ElementTree

  # Parsing:
  doc = ElementTree()
  doc.parse(xmlFile)

  # Find tag element:
  doc.find('mytag')

  # Iteration over tag elements:
  lname = []
  for lib in doc.iter('LibTag'):
      libName = lib.attrib['name']
      lname.append(libName)

Regards
Karim

On 01/14/2011 03:55 AM, Terry Carroll wrote:
Does anyone know of a module that can parse out text with XML-like 
tags as in the example below?  I emphasize the -like in XML-like.  
I don't think I can parse this as XML (can I?).


Sample text between the dashed lines::

-
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
-

I'd like to be able to have a dictionary (or any other structure, 
really; as long as I can get to the parsed-out pieces) that would look 
something like:


 {"BING" : "ZEBRA",
  "BANG" : "ROOSTER",
  "BOOM" : "GARBONZO BEAN",
  "BLIP" : "SOMETHING ELSE",
  "BASH" : "SOMETHING DIFFERENT"}

The Blah, blah, blah can be tossed away, for all I care.

The basic rule is that the tag either has an operand (e.g., <BING 
ZEBRA>), in which case the name is the first word and the content is 
everything else that follows in the tag; or else the tag has no 
operand, in which case it is matched to a corresponding closing tag 
(e.g., <BLIP>SOMETHING ELSE</BLIP>), and the content is the material 
between the two tags.


I think I can assume there are no nested tags.

I could write a state machine to do this, I suppose, but life's short, 
and I'd rather not re-invent the wheel, if there's a wheel lying 
around somewhere.
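
For what it's worth, the rule described above can be sketched without a full state machine, using only string methods. This is a minimal sketch, assuming one tag per line and angle brackets around the tags (an assumed reconstruction of the sample); the function name parse_xmlish and the skip parameter are hypothetical.

```python
# Assumed reconstruction of the sample text, with angle brackets around the tags.
SAMPLE = """Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def parse_xmlish(lines, skip=("AAA",)):
    """Build a dict from one-tag-per-line XML-like text (no nested tags)."""
    result = {}
    for line in lines:
        line = line.strip()
        # Ignore free text, closing tags on their own line, and malformed lines.
        if not line.startswith("<") or line.startswith("</") or ">" not in line:
            continue
        head, _, rest = line[1:].partition(">")
        name, _, operand = head.partition(" ")
        if name in skip:
            continue  # container tags such as AAA carry no value
        if operand:
            # Operand form: the name is the first word, the rest is the content.
            result[name] = operand.strip()
        else:
            # Paired form: the content is whatever precedes the closing tag.
            content, _, _closing = rest.partition("</")
            result[name] = content
    return result


if __name__ == "__main__":
    print(parse_xmlish(SAMPLE.splitlines()))
```

A tag that spans several lines, or a nested tag, would need real state tracking; this sketch deliberately ignores both cases.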




Re: [Tutor] module to parse XMLish text?

2011-01-14 Thread Stefan Behnel

Terry Carroll, 14.01.2011 03:55:

Does anyone know of a module that can parse out text with XML-like tags as
in the example below? I emphasize the -like in XML-like. I don't think
I can parse this as XML (can I?).

Sample text between the dashed lines::

-
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
-


You can't parse this as XML because it's not XML. The three initial child 
tags are not properly closed.


If the format is really as you describe, i.e. one line per tag, regular 
expressions will work nicely. Something like (untested)


  import re
  parse_tag_and_text = re.compile(
      # accept a tag name and then either space+text+'>' or '>'+text+'</...'
      r'^<([^> ]+)(?: ([^>]+)>\s*|>([^<]+)</.*)$').match

  special_tags = set(['AAA'])

  result = {}
  for line in the_file:
      match = parse_tag_and_text(line)
      if match:
          if match.group(1) in special_tags:
              pass  # do something special?
          else:
              # don't care which format, take whatever text group matched
              result[match.group(1)] = match.group(2) or match.group(3)

Stefan



Re: [Tutor] module to parse XMLish text?

2011-01-14 Thread Terry Carroll

On Fri, 14 Jan 2011, Stefan Behnel wrote:


Terry Carroll, 14.01.2011 03:55:

Does anyone know of a module that can parse out text with XML-like tags as
in the example below? I emphasize the -like in XML-like. I don't think
I can parse this as XML (can I?).

Sample text between the dashed lines::

-
Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
-


You can't parse this as XML because it's not XML. The three initial child 
tags are not properly closed.


Yeah, that's what I figured.

If the format is really as you describe, i.e. one line per tag, regular 
expressions will work nicely.


Now there's an idea!  I hadn't thought of using regexes, probably because 
I'm terrible at all but the simplest ones.


As it happens, I'm only interested in four of the tags' contents, so I
could probably manage to write a series of regexes that even I could 
maintain, one for each of the pieces of data I want to extract; if I try 
to write a grand unified regex, I'm bound to shoot myself in the foot.
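
A minimal sketch of that one-regex-per-field idea follows. The thread does not say which four tags are of interest, so the four chosen here are an arbitrary pick for illustration; the angle brackets in SAMPLE are an assumed reconstruction of the sample text, and FIELD_PATTERNS and extract_fields are hypothetical names.

```python
import re

# Assumed reconstruction of the sample text, with angle brackets around the tags.
SAMPLE = """Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""

# One small, separately maintainable pattern per field of interest.
FIELD_PATTERNS = {
    "BING": re.compile(r"<BING ([^>]+)>"),          # operand form
    "BOOM": re.compile(r"<BOOM ([^>]+)>"),          # operand form
    "BLIP": re.compile(r"<BLIP>([^<]*)</BLIP>"),    # paired form
    "BASH": re.compile(r"<BASH>([^<]*)</BASH>"),    # paired form
}


def extract_fields(text):
    """Return the first match of each pattern; missing fields are left out."""
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1).strip()
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE))
```

Each pattern can be read and tested on its own, at the cost of scanning the text once per field.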


Thanks very much.

On Fri, 14 Jan 2011, Karim wrote:


from xml.etree.ElementTree import ElementTree


I don't think straight XML parsing will work on this, as it's not valid 
XML; it just looks XML-like enough to cause confusion.



Re: [Tutor] module to parse XMLish text?

2011-01-14 Thread Wayne Werner
On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll carr...@tjc.com wrote:

 <snip>

 On Fri, 14 Jan 2011, Karim wrote:

  from xml.etree.ElementTree import ElementTree


 I don't think straight XML parsing will work on this, as it's not valid
 XML; it just looks XML-like enough to cause confusion.


It's worth trying out - most (good) parsers do the right thing even when
they don't have strictly valid code. I don't know if xml.etree is one, but
I'm fairly sure both lxml and BeautifulSoup would probably parse it
correctly. Only one way to find out ;)

-Wayne


Re: [Tutor] module to parse XMLish text?

2011-01-14 Thread Stefan Behnel

Wayne Werner, 15.01.2011 03:25:

On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll wrote:

On Fri, 14 Jan 2011, Karim wrote:

  from xml.etree.ElementTree import ElementTree

I don't think straight XML parsing will work on this, as it's not valid
XML; it just looks XML-like enough to cause confusion.


It's worth trying out - most (good) parsers do the right thing even when
they don't have strictly valid code. I don't know if xml.etree is one, but
I'm fairly sure both lxml and BeautifulSoup would probably parse it
correctly.


They wouldn't. For the first tags, the text values would either not come 
out at all or they would be read as attributes and thus lose their order 
and potentially their whitespace as well. The other tags would likely get 
parsed properly, but the parser may end up nesting them as it hasn't found 
a closing tag for the previous tags yet.


So, in any case, you'd end up with data loss and/or a structure that would 
be much harder to handle than the (relatively) simple file structure.
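
To make the "read as attributes" point concrete, here is a small sketch that uses the standard-library html.parser as a stand-in for a lenient parser. It is not lxml or BeautifulSoup, whose behaviour may differ in detail, and it only shows the attribute problem, not the nesting one. The angle brackets in SAMPLE are an assumed reconstruction of the sample text, and the names StartTagCollector and collect_start_tags are hypothetical.

```python
from html.parser import HTMLParser

# Assumed reconstruction of the sample text, with angle brackets around the tags.
SAMPLE = """<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


class StartTagCollector(HTMLParser):
    """Record every start tag together with the attributes the parser saw."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))


def collect_start_tags(text):
    parser = StartTagCollector()
    parser.feed(text)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    for tag, attrs in collect_start_tags(SAMPLE):
        print(tag, attrs)
```

With this parser the operand of a tag such as BOOM arrives as separate, valueless, lower-cased attribute names rather than as a single text value, which is the kind of data loss described above.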


Stefan
