It's not clear from your question whether your goal is to learn to parse XML in Python or to solve a particular problem. If your goal is to learn Python XML processing, then go right ahead. However, it looks like you are using SAX below, and the sort of thing you describe might be done better using a DOM parser (or maybe etree).

If what you want is not just to select some info from the XML file, but to get it into a Python object that you can then manipulate further, then DOM or etree is also probably a better model. It will parse the XML (likely using an event-driven parser such as SAX or expat underneath) and give you an object that encodes the whole file.
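
As a rough sketch of the etree approach (not tested against the real feed): the sample document here is made up, and its layout (a topalbums root, album elements with a rank attribute, and name, playcount and artist/name children) is my assumption about what the file looks like. top_albums is just a name I picked for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the element layout is an assumption about the feed.
SAMPLE = """<topalbums user="bryansmith">
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
  <album rank="2">
    <name>The Artist in the Ambulance</name>
    <playcount>289</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>"""


def top_albums(xml_text, limit=5):
    """Return (album, artist, playcount) tuples for albums ranked <= limit."""
    root = ET.fromstring(xml_text)
    rows = []
    for album in root.findall("album"):
        if int(album.get("rank")) > limit:
            continue
        rows.append((
            album.findtext("name"),          # direct child: the album title
            album.findtext("artist/name"),   # nested under <artist>: the artist
            int(album.findtext("playcount")),
        ))
    return rows


for title, artist, count in top_albums(SAMPLE):
    print("%s by %s (%d)" % (title, artist, count))
```

Because "name" and "artist/name" are looked up relative to each album element, the two name tags never get confused.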

[ Not that it can't be done in SAX. It's just that, as you discovered, low-level SAX parsing requires that you keep track of the containment hierarchy yourself,
  which is a lot of work to solve a simple problem. ]
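
If you do want to stay with SAX, one way to track the containment hierarchy is to keep a stack of the currently open element names and look at the parent when a name tag closes. This is only a sketch under the same assumptions about the file layout as above; AlbumHandler, parse_albums and the sample data are hypothetical names and values, not part of your code.

```python
import xml.sax
from xml.sax.handler import ContentHandler

# Hypothetical sample data; the element layout is an assumption about the feed.
SAMPLE = """<topalbums user="bryansmith">
  <album rank="1">
    <name>Vheissu</name>
    <playcount>332</playcount>
    <artist><name>Thrice</name></artist>
  </album>
</topalbums>"""


class AlbumHandler(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.path = []       # stack of currently open element names
        self.text = []       # character data collected for the current element
        self.albums = []     # one dict per finished <album>
        self.current = None  # the album being built

    def startElement(self, name, attrs):
        self.path.append(name)
        self.text = []
        if name == "album":
            self.current = {"rank": int(attrs.get("rank", 0))}

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        parent = self.path[-2] if len(self.path) > 1 else None
        if name == "name" and parent == "album":
            self.current["album"] = value      # the album's own <name>
        elif name == "name" and parent == "artist":
            self.current["artist"] = value     # the <name> inside <artist>
        elif name == "playcount" and parent == "album":
            self.current["playcount"] = int(value)
        elif name == "album":
            self.albums.append(self.current)
            self.current = None
        self.text = []
        self.path.pop()


def parse_albums(xml_text):
    """Parse the XML text and return the list of album dicts."""
    handler = AlbumHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.albums


for album in parse_albums(SAMPLE):
    print("%(album)s by %(artist)s (%(playcount)d)" % album)
```

The parent check is what gives you the "if this is the album name tag ... elif this is the artist name tag" test you asked about.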


If you're just trying to work with XML, then most folks don't write XML parsers for that sort of thing, but use higher-level tools: XSLT, XPath and/or XQuery.

The Mac has xsltproc built in as an XSLT 1.0 processor.
There is an xpath program written in Perl in Leopard/10.5 (/usr/bin/xpath).
And Saxon is easily downloaded and does XSLT 2.0 and XQuery 1.0.


The following XSLT 1.0 stylesheet:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>

<xsl:template match="/">
    <xsl:apply-templates select="/topalbums/album[@rank &lt; 6]"/>
    <!-- just select the top 5 albums -->
</xsl:template>

<xsl:template match="/topalbums/album" >
  album: <xsl:value-of select="name"/>
  artist: <xsl:value-of select="artist/name"/>
  count=<xsl:value-of select="playcount"/>
  <xsl:text>
  </xsl:text> <!-- this is here to insert the blank line break -->
</xsl:template>

</xsl:stylesheet>


will, when run on that file, produce this output:
~$ xsltproc Untitled1.xsl  topalbums.xml

  album: Vheissu
  artist: Thrice
  count=332

  album: The Artist in the Ambulance
  artist: Thrice
  count=289

  album: Appeal To Reason
  artist: Rise Against
  count=286

  album: Favourite Worst Nightmare
  artist: Arctic Monkeys
  count=210

  album: The Sufferer & The Witness
  artist: Rise Against
  count=206

[ Not sure if that's anything like what you want. ]


I'm sure that the whole thing would reduce to an even more concise XQuery request.

I was trying to do the whole thing as an XPath one-liner, but it didn't like my attempts to include alternates in parentheses. I think this is an XPath 1.0 vs. XPath 2.0 issue. Saxon is the only thing that supports 2.0. The Perl, Python and Java libraries only support XPath 1.0.

This sort of expression did work using XPath 2.0 (in the oXygen editor):

        //album[@rank < 6]/(name|playcount|artist/name)

But I couldn't figure out a 1.0 syntax that would grab all three fields.

(And the Perl xpath seems to have a bug that interprets '@rank < 6' as less-than-or-equal!)


-- Steve Majewski



On Feb 6, 2009, at 11:00 PM, Bryan Smith wrote:

Hi everyone,

I have another question I'm hoping someone would be kind enough to answer. I am new to parsing XML (not to mention much of Python itself) and I am trying to parse an XML file. The file I am trying to parse is this one: http://ws.audioscrobbler.com/2.0/user/bryansmith/topalbums.xml .

So far, I have written up a class for parsing this file in my attempts to present to the user a list of top albums on their last.fm profile. If you note, the artist name and album name are both signified by the <name> tag which makes my job harder. If the tag names were different, I wouldn't have a problem. Listed below is the class I have written to parse the file. My question then is this: is there a way I can say something like "if tag_name == album name tag then....elif tag_name == artist name tag....". I hope this is clear.

As it stands right now, if I parse this file and print the results, this is what I get (understandably) if I try to print out in the following fashion - album (playcount): Vheissu (332), Thrice (289), The Artist in the Ambulance (286), Thrice (210) and so on. Thrice is the artist name. I want to be able to differentiate between the "artist" name tag and the "album" name tag.


Class as it stands right now:

from xml.sax.handler import ContentHandler

class GetTopAlbums(ContentHandler):

    in_album_tag = False
    in_playcount_tag = False

    def __init__(self, album, playcount):
        ContentHandler.__init__(self)
        self.album = album
        self.playcount = playcount
        self.data = []

    def startElement(self, tag_name, attr):
        if tag_name == "name":
            self.in_album_tag = True
        elif tag_name == "playcount":
            self.in_playcount_tag = True

    def endElement(self, tag_name):
        if tag_name == "name":
            content = "".join(self.data)
            self.data = []
            self.album.append(content)
            self.in_album_tag = False
        elif tag_name == "playcount":
            content = "".join(self.data)
            self.data = []
            self.playcount.append(content)
            self.in_playcount_tag = False

    def characters(self, string):
        if self.in_album_tag == True:
            self.data.append(string)
        elif self.in_playcount_tag == True:
            self.data.append(string)

Thanks in advance!
Bryan
_______________________________________________
Pythonmac-SIG maillist  -  Pythonmac-SIG@python.org
http://mail.python.org/mailman/listinfo/pythonmac-sig
