This looks promising.

I'd probably want to represent that sort of information differently in
J (parallel lists for starting offset, length, nesting depth and token
type), but other than that minor detail, it's very much in the
direction of what I was trying to conceptualize.
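
As a rough sketch of that parallel-list layout, applied to the small
snippet quoted further down (with its closing tag taken to be </ab>):
the names xml, start, len, depth, type, tokens and attrnames are all
hypothetical, the type codes are made up for illustration and are not
VTD's, and the depth convention is just one plausible choice.

```j
NB. Sketch only: names and type codes are hypothetical, not VTD-XML's.
xml   =: '<ab cd="ef" gh="ijk">lmnop</ab>'

NB. One entry per token, in document order.
start =: 1 4 8 12 16 21     NB. offset of each token's first character
len   =: 2 2 2 2 3 5        NB. token length in characters
depth =: 0 0 0 0 0 1        NB. assumed convention: text is one level below its element
type  =: 0 1 2 1 2 3        NB. 0 element name, 1 attr name, 2 attr value, 3 text

NB. Recover the boxed text of each token from its offset and length.
tokens =: start <@({&xml)@(+ i.)"0 len

NB. Select tokens by type, for example all attribute names.
attrnames =: (type = 1) # tokens
```

Since the four lists are plain numeric vectors of equal length, the
original text is never copied or boxed until a token is actually
needed, and selections such as (type = 1) # start stay unboxed.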

That said, I'm puzzling over the token type table at
http://vtd-xml.sourceforge.net/userGuide/6.html -- why, for example,
do they not distinguish between the "element" name of a processing
instruction and the "attribute" name of a processing instruction (or
whatever those are called)? Also, I've an analogous question about the
value of a namespace. But I can probably ignore those issues for my
current project, since it uses neither (and perhaps the VTD developers
also do not need those missing token types).

Thanks!

-- 
Raul


On Fri, Dec 5, 2014 at 8:44 PM, Joe Bogner wrote:
> I found VTD-xml while researching this. Looks like an interesting
> alternative and reminds me of the work you did with segmented strings
>
> http://vtd-xml.sourceforge.net/VTD.html
>
> http://jsoftware.2058.n7.nabble.com/quot-Segmented-Strings-quot-td59863.html
> On Dec 5, 2014 4:28 PM, "Raul Miller" wrote:
>
>> I would like to revisit the idea of using J to parse xml.
>>
>> The xml/sax addon was a nice idea, but not very stable. It represented
>> xml as a series of events (function calls), and left it up to the user
>> how they would structure the result. Unfortunately, it also rather
>> reliably crashes J.
>>
>> This can be mitigated in various ways. If what you are parsing is
>> simple enough, and you can live with 32 bit j602, xml/sax can work
>> great. But those are not always ideal constraints to work with.
>>
>> But... what's a good data structure in J, to represent xml?
>>
>> A problem is that xml is something of a living example of "the nice
>> thing about standards is that there are so many to choose from". The
>> standards documents describing xml are voluminous, and there are many
>> alternatives which are physically different but logically similar to
>> wade through.
>>
>> Still, at a basic level, xml is something of a nested sequence type of
>> a thing. So one approach might leverage boxed character arrays. This
>> will not be particularly efficient, but it's a start.
>>
>> For example, this xml snippet:
>>
>> <ab cd="ef" gh="ijk">lmnop</ab>
>>
>> Might be represented in J as:
>>    'ab';<('cd';'ef'),('gh';'ijk'),:'';<<'lmnop'
>>
>> (The extra boxing on the text is because that might in the general
>> case actually be a sequence of elements).
>>
>> Another approach might be:
>>    'ab';(('cd';'ef'),:('gh';'ijk'));<<'lmnop'
>>
>> Here, the [textual, in this case] content of the element is stored in
>> a separate box from the attributes, instead of treating it as a
>> blank-named attribute.
>>
>> But perhaps there are good non-boxed ways of representing the structure?
>>
>> Has anyone else been working with xml in J?
>>
>> Thanks,
>>
>> --
>> Raul
>> ----------------------------------------------------------------------
>> For information about J forums see http://www.jsoftware.com/forums.htm
>>