This is the first of what should be several posts centered on parsing
nif.xml:

https://raw.github.com/amorilia/nifxml/master/nif.xml

I have a variety of reasons for choosing this particular xml file, and
maybe at some point we can talk about some of them on the chat forum.

But, first, I'd like to walk through the use of J's xml/sax addon package.

sax is an xml handling library (http://www.saxproject.org/) which exists
outside of J, and knowledge of the api can be used in a variety of
languages and contexts. If you want a portable understanding of sax you
would probably implement projects with it in several different languages.
I'm not going to go that far, here. I just want to provide some basic
information about using it in J.

So here's a code sample:

require'xml/sax'

saxclass 'nifxml'

  startDocument=: 3 :0
    Interest=: 0
    Result=: ''
  )

    startElement=: 4 :0
      if. 'compound'-:y do.
        if. 'Header'-:x getAttribute 'name' do.
          Interest=: 1
        end.
      end.
      if. Interest do.
        atrs=. ([,'="',],'"'"_)&.>/"1 }.attributes x
        Result=: Result,'<',(;:inv (<y),atrs),'>'
      end.
    )

      characters=: 3 :0
        if. Interest do.
          Result=: Result,y
        end.
      )

    endElement=: 3 :0
      if. Interest do.
        Result=:Result,'</',y,'>'
        if. 'compound'-:y do. Interest=: 0 end.
      end.
    )

  endDocument=: 3 :0
    Result
  )

  extract=: 3 :0
    process fread y
  )

-----------------------------------------------------------

Notice how I have indented the code.

At the top level is this command:

saxclass 'nifxml'

saxclass erases the definition of nifxml and then begins a new definition
for that locale. If you are used to working in object oriented languages,
you should think of a named locale as a class (and a numbered locale as an
object). Otherwise you should probably think of a locale as being something
like a directory - it contains named things (but, unlike directories,
cannot contain other locales). The difference between a saxclass and a
regular class is that a saxclass automatically uses (or "inherits")
commands from the sax xml package.
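
As a small illustration of the locale idea, here is a sketch in plain J.
The locale name demo and the verb greet are made up for this example and
are not part of xml/sax:

```j
coclass 'demo'             NB. make demo the current locale (hypothetical name)

greet=: 3 : 0
  'hello, ',y
)

cocurrent 'base'           NB. switch back to the default locale

msg=: greet_demo_ 'world'  NB. name_locale_ reaches a word inside a locale
```

The extract_nifxml_ call used later in this post has the same
name_locale_ form.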

Indented within this class definition are several words:
startDocument, endDocument, and extract.

extract is just a cover function for the sax routine "process" which
processes an xml document. Each time process starts, it will run
startDocument once - so that is a good place to put initialization
commands. Each time process finishes, it calls endDocument to determine its
result.
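
To see the startDocument / endDocument flow in isolation, here is a minimal
sketch that only counts elements. The class name countel and the noun Count
are hypothetical. I am also assuming that process can be called from
outside the class as process_countel_ and that it accepts the xml text
directly (as extract does above with the result of fread):

```j
require 'xml/sax'

saxclass 'countel'     NB. hypothetical class name

startDocument=: 3 : 0
  Count=: 0            NB. initialization, run once per process call
)

startElement=: 4 : 0
  Count=: Count+1      NB. one more opening tag seen
)

endDocument=: 3 : 0
  Count                NB. this value becomes the result of process
)

cocurrent 'base'

n=: process_countel_ '<a><b/><b>text</b></a>'
```

If those assumptions hold, n should be the number of elements in the
sample text.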

Note that sax is defined to operate sequentially - it walks through the xml
file from start to finish, calling your handlers as it goes, and then
returns. This is very different from the usual J style of coding, so
experienced J programmers might be uncomfortable with it. On the other
hand, this is very much like how many other (but not all) programming
languages work, so this aspect should be comfortable to people with a
background in any of a variety of languages.

That said, in many cases the time to parse an xml file with sax is trivial.
It's usually better to concern yourself with code simplicity than with
time, at least for an initial draft of the code. In typical cases your
computer will take more time reading the file than parsing it.

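Indented one level deeper in the class definition are two
verbs: startElement and endElement. Each runs once for each xml
element: startElement runs at the opening tag (the name that follows a <
character) and endElement runs at the matching closing tag (the same name
following a </ character sequence). In both verbs y is the element name,
and in startElement x is what getAttribute and attributes work from.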

What I have done here is rig up some code to extract everything inside the
<compound> element if it has a name="Header" attribute. (I am not going to
go into great depth about xml syntax rules - but I wanted to cover some
basics for the occasional person who has not worked with them before.)
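
The tag-rebuilding line in startElement can be tried on its own. This
sketch assumes that the attribute table has one row per attribute, with
the name and the value in separate boxes. The table tbl here is made up by
hand rather than produced by sax:

```j
NB. hand-made stand-in for an attribute table (assumed shape: rows of name;value)
tbl=: ('type';'uint') ,: 'default';'0'

NB. join each row as name="value", leaving one box per attribute
fmt=: ([,'="',],'"'"_)&.>/"1
atrs=: fmt tbl

NB. ;:inv joins the boxed words with single spaces
tag=: '<',(;:inv (<'add'),atrs),'>'
NB. tag is now: <add type="uint" default="0">
```

The }. in the original line drops the first row of the table before this
formatting step is applied.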

Finally, indented deepest is a single verb definition: characters. This
captures the text that appears between xml tags.

If I run this code against the nif.xml file, I get this result:

   extract_nifxml_ 'c:\users\rdmiller\desktop\furniture\nif.xml'
<compound>
        The NIF file header.
        <add type="HeaderString">'NetImmerse File Format x.x.x.x' (versions
<= 10.0.1.2) or 'Gamebryo File Format x.x.x.x' (versions >= 10.1.0.0), with
x.x.x.x the version written out. Ends with a newline character (0x0A).</add>
        <add type="LineString" arr1="3" ver2="3.1"></add>
        <add type="FileVersion" default="0x04000002" ver1="3.3.0.13">The
NIF version, in hexadecimal notation: 0x04000002, 0x0401000C, 0x04020002,
0x04020100, 0x04020200, 0x0A000100, 0x0A010000, 0x0A020000, 0x14000004,
...</add>
        <add type="EndianType" default="ENDIAN_LITTLE"
ver1="20.0.0.4">Determines the endianness of the data in the file.</add>
        <add type="ulittle32" ver1="10.1.0.0">An extra version number, for
companies that decide to modify the file format.</add>
        <add type="ulittle32" ver1="3.3.0.13">Number of file objects.</add>
        <add type="ulittle32" default="0" cond="(User Version >= 10) ||
((User Version == 1) && (Version != 10.2.0.0))" ver1="10.1.0.0">This also
appears to be the extra user version number and must be set in some
circumstances. Probably used by Bethesda to denote the Havok version.</add>
        <add type="uint" default="0" ver1="30.0.0.2">Unknown. Possibly User
Version 2?</add>
        <add type="ExportInfo" ver1="10.0.1.2" ver2="10.0.1.2"></add>
        <add type="ExportInfo" ver1="10.1.0.0" cond="(User Version >= 10)
|| ((User Version == 1) && (Version != 10.2.0.0))"></add>
        <add type="ushort" ver1="10.0.1.0">Number of object types in this
NIF file.</add>
        <add type="SizedString" arr1="Num Block Types" ver1="10.0.1.0">List
of all object types used in this NIF file.</add>
        <add type="BlockTypeIndex" arr1="Num Blocks" ver1="10.0.1.0">Maps
file objects on their corresponding type: first file object is of type
object_types[object_type_index[0]], the second of
object_types[object_type_index[1]], etc.</add>
        <add type="uint" arr1="Num Blocks" ver1="20.2.0.7">Array of block
sizes?</add>
        <add type="uint" ver1="20.1.0.3">Number of strings.</add>
        <add type="uint" ver1="20.1.0.3">Maximum string length.</add>
        <add type="SizedString" arr1="Num Strings"
ver1="20.1.0.3">Strings.</add>
        <add type="uint" default="0" ver1="10.0.1.0">Unknown.</add>
    </compound>

Notice how the opening element is not indented - this is because I was not
capturing characters before the opening element. This should not bother
anyone.

For more information on the structure of this particular xml file, see:
http://niftools.sourceforge.net/wiki/Nif_Format/NifTools_XML_Format

But this is enough, for today.

Thanks,

-- 
Raul
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
