It's not clear from your question whether your goal is to learn to
parse XML in Python or to solve a particular problem. If your goal is
to learn Python XML processing, then go right ahead -- however, it looks
like you are using SAX below, and the sort of thing you describe might
be done better using a DOM parser (or maybe etree).

If what you want is not just to select some info from the XML file, but
to get it into a Python object so that you can then manipulate it
further, then DOM or etree is also probably a better model. It will
parse the XML (likely using SAX underneath) and give you an object that
encodes the whole file.
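
Here is a rough sketch of the etree approach, using the standard
library's xml.etree.ElementTree. The inline sample is an assumption: it
only mirrors the layout the feed appears to have (topalbums/album with a
rank attribute and name, playcount and artist/name children), and the
function name top_albums is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample: mirrors the apparent layout of topalbums.xml,
# trimmed to two albums so the example is self-contained.
SAMPLE = """\
<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>
"""


def top_albums(xml_text, limit=5):
    """Return a list of dicts for albums whose rank is <= limit."""
    root = ET.fromstring(xml_text)
    albums = []
    for album in root.findall("album"):
        if int(album.get("rank")) > limit:
            continue
        albums.append({
            # "name" is the album's own child; "artist/name" is nested,
            # so the two <name> tags never get confused.
            "album": album.findtext("name"),
            "artist": album.findtext("artist/name"),
            "playcount": int(album.findtext("playcount")),
        })
    return albums


if __name__ == "__main__":
    for entry in top_albums(SAMPLE):
        print("%(album)s by %(artist)s (%(playcount)d)" % entry)
```

Because each lookup is relative to the album element, the album name
and the artist name are addressed by different paths, which is the
distinction the SAX handler has to work out by hand.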

[ Not that it can't be done in SAX -- it's just that, as you
discovered, low-level SAX parsing requires that you keep track of the
containment hierarchy yourself, which is a lot of work to solve a
simple problem. ]
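
If you do want to stay with SAX, one way to keep track of the
containment hierarchy is a stack of open element names. The following
is only a sketch under that assumption, not a drop-in replacement for
your class: the names TopAlbumsHandler, path and albums are invented
here, and the sample XML is a one-album stand-in for the real feed.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Assumed one-album stand-in for the real topalbums.xml feed.
SAMPLE = """\
<topalbums>
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>
"""


class TopAlbumsHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.path = []    # stack of currently open element names
        self.text = []    # character chunks of the innermost element
        self.albums = []  # one dict per <album>

    def startElement(self, tag_name, attrs):
        self.path.append(tag_name)
        self.text = []
        if tag_name == "album":
            self.albums.append({})

    def characters(self, content):
        self.text.append(content)

    def endElement(self, tag_name):
        value = "".join(self.text).strip()
        parent = self.path[-2] if len(self.path) > 1 else None
        # The parent on the stack tells the two <name> tags apart.
        if tag_name == "name" and parent == "album":
            self.albums[-1]["album"] = value
        elif tag_name == "name" and parent == "artist":
            self.albums[-1]["artist"] = value
        elif tag_name == "playcount" and parent == "album":
            self.albums[-1]["playcount"] = int(value)
        self.path.pop()
        self.text = []


if __name__ == "__main__":
    handler = TopAlbumsHandler()
    xml.sax.parseString(SAMPLE.encode("utf-8"), handler)
    print(handler.albums)
```

The check on the parent element is the "if tag_name == album name tag
... elif tag_name == artist name tag" test you asked about.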

If you're just trying to work with XML, then most folks don't write
XML parsers for that sort of thing, but use higher-level tools: XSLT,
XPath and/or XQuery.

The Mac has xsltproc as a built-in XSLT (1.0) processor.
There is an xpath program written in Perl in Leopard/10.5
( /usr/bin/xpath ).
And Saxon is easily downloaded and does XSLT 2.0 and XQuery 1.0.

The following XSLT 1.0 stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- just select the top 5 albums -->
    <xsl:apply-templates select="/topalbums/album[@rank &lt; 6]"/>
  </xsl:template>
  <xsl:template match="/topalbums/album">
album: <xsl:value-of select="name"/>
artist: <xsl:value-of select="artist/name"/>
count=<xsl:value-of select="playcount"/>
<xsl:text>
</xsl:text> <!-- this is here to insert the blank line break -->
  </xsl:template>
</xsl:stylesheet>
Will, when run on that file, produce this output:
~$ xsltproc Untitled1.xsl topalbums.xml
album: Vheissu
artist: Thrice
count=332
album: The Artist in the Ambulance
artist: Thrice
count=289
album: Appeal To Reason
artist: Rise Against
count=286
album: Favourite Worst Nightmare
artist: Arctic Monkeys
count=210
album: The Sufferer & The Witness
artist: Rise Against
count=206
[ Not sure if that's anything like what you want. ]

I'm sure that the whole thing would reduce to an even more concise
XQuery request.

I was trying to do the whole thing as an XPath one-liner, but it didn't
like my attempts to include alternates in parentheses. I think this is
an XPath 1.0 vs. XPath 2.0 issue. Saxon is the only thing that supports
2.0. The Perl, Python and Java libraries only support XPath 1.0.

This sort of expression did work using XPath 2.0 (in the oXygen editor):

//album[@rank < 6]/(name|playcount|artist/name)

But I couldn't figure out a 1.0 syntax that would grab all three fields.
( And the Perl xpath seems to have a bug that interprets '@rank < 6'
as less-than-or-equal! )

-- Steve Majewski
On Feb 6, 2009, at 11:00 PM, Bryan Smith wrote:
Hi everyone,
I have another question I'm hoping someone would be kind enough to
answer. I am new to parsing XML (not to mention much of Python
itself) and I am trying to parse an XML file. The file I am trying
to parse is this one:
http://ws.audioscrobbler.com/2.0/user/bryansmith/topalbums.xml
So far, I have written up a class for parsing this file in my
attempts to present to the user a list of top albums on their
last.fm profile. If you note, the artist name and album name are
both signified by the <name> tag which makes my job harder. If the
tag names were different, I wouldn't have a problem. Listed below is
the class I have written to parse the file. My question then is
this: is there a way I can say something like "if tag_name == album
name tag then....elif tag_name == artist name tag....". I hope this
is clear.
As it stands right now, if I parse this file and print the results,
this is what I get (understandably) if I try to print out in the
following fashion - album (playcount): Vheissu (332), Thrice (289),
The Artist in the Ambulance (286), Thrice (210) and so on. Thrice is
the artist name. I want to be able to differentiate between the
"artist" name tag and the "album" name tag.
Class as it stands right now:
from xml.sax.handler import ContentHandler

class GetTopAlbums(ContentHandler):
    in_album_tag = False
    in_playcount_tag = False

    def __init__(self, album, playcount):
        ContentHandler.__init__(self)
        self.album = album
        self.playcount = playcount
        self.data = []

    def startElement(self, tag_name, attr):
        if tag_name == "name":
            self.in_album_tag = True
        elif tag_name == "playcount":
            self.in_playcount_tag = True

    def endElement(self, tag_name):
        if tag_name == "name":
            content = "".join(self.data)
            self.data = []
            self.album.append(content)
            self.in_album_tag = False
        elif tag_name == "playcount":
            content = "".join(self.data)
            self.data = []
            self.playcount.append(content)
            self.in_playcount_tag = False

    def characters(self, string):
        if self.in_album_tag == True:
            self.data.append(string)
        elif self.in_playcount_tag == True:
            self.data.append(string)

Thanks in advance!
Bryan
_______________________________________________
Pythonmac-SIG maillist - Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig