This looks promising. I'd probably want to represent that sort of information differently in J (parallel lists for starting offset, length, nesting depth and token type), but other than that minor detail, it's very much in the direction of what I was trying to conceptualize.
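
A minimal sketch of what such parallel lists might look like in J, for a small sample element. The names (xml, start, len, depth, type) and the type codes are illustrative assumptions, not VTD's actual encoding, and treating attributes and text as having the depth of their enclosing element is likewise just one possible convention:

   NB. hypothetical token table; one item per token, lists kept in parallel
   xml   =: '<ab cd="ef" gh="ijk">lmnop</ab>'
   start =: 1 4 8 12 16 21      NB. offset of each token within xml
   len   =: 2 2 2 2 3 5         NB. length of each token
   depth =: 0 0 0 0 0 0         NB. assumed: every token belongs to the top-level element
   type  =: 0 1 2 1 2 3         NB. assumed codes: 0 element name, 1 attr name, 2 attr value, 3 text

   NB. recover the token text: for each (start, len) pair, index xml at start + i. len
   start <@:({&xml)@:(+ i.)"0 len

   NB. offsets of the attribute names only
   (type = 1) # start

The first expression should give the six boxed tokens ab, cd, ef, gh, ijk and lmnop; the second selects the offsets 4 and 12. Nothing is copied out of xml until a token is actually needed, which is the appeal of this layout over nested boxes.
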
That said, I'm puzzling over the token type table http://vtd-xml.sourceforge.net/userGuide/6.html -- why, for example, do they not distinguish between the "element" name of a processing instruction and the "attribute" name of a processing instruction (or whatever those are called)? Also, I've an analogous question about the value of a namespace. But I can probably ignore those issues for my current project, since it uses neither (and perhaps the VTD developers also do not need those missing token types).

Thanks!

--
Raul

On Fri, Dec 5, 2014 at 8:44 PM, Joe Bogner <[email protected]> wrote:
> I found VTD-xml while researching this. Looks like an interesting
> alternative and reminds me of the work you did with segmented strings
>
> http://vtd-xml.sourceforge.net/VTD.html
>
> http://jsoftware.2058.n7.nabble.com/quot-Segmented-Strings-quot-td59863.html
> On Dec 5, 2014 4:28 PM, "Raul Miller" <[email protected]> wrote:
>
>> I would like to revisit the idea of using J to parse xml.
>>
>> The xml/sax addon was a nice idea, but not very stable. It represented
>> xml as a series of events (function calls), and left it up to the user
>> how they would structure the result. Unfortunately, it also rather
>> reliably crashes J.
>>
>> This can be mitigated in various ways. If what you are parsing is
>> simple enough, and you can live with 32 bit j602, xml/sax can work
>> great. But those are not always ideal constraints to work with.
>>
>> But... what's a good data structure in J, to represent xml?
>>
>> A problem is that xml is something of a living example of "the nice
>> thing about standards is that there are so many to choose from". The
>> standards documents describing xml are voluminous, and there are many
>> alternatives which are physically different but logically similar to
>> wade through.
>>
>> Still, at a basic level, xml is something of a nested sequence type of
>> a thing. So one approach might leverage boxed character arrays. This
>> will not be particularly efficient, but it's a start.
>>
>> For example, this xml snippet:
>>
>> <ab cd="ef" gh="ijk">lmnop</ab>
>>
>> Might be represented in J as:
>> 'ab';<('cd';'ef'),('gh';'ijk'),:'';<<'lmnop'
>>
>> (The extra boxing on the text is because that might in the general
>> case actually be a sequence of elements).
>>
>> Another approach might be:
>> 'ab';(('cd';'ef'),:('gh';'ijk'));<<'lmnop'
>>
>> Here, the [textual, in this case] content of the element is stored in
>> a separate box from the attributes, instead of treating it as a
>> blank-named attribute.
>>
>> But perhaps there are good non-boxed ways of representing the structure?
>>
>> Has anyone else been working with xml in J?
>>
>> Thanks,
>>
>> --
>> Raul
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
