Thanks everyone.
I've done a first cut at the parser. I've made the following
assumptions:
1. The input is a sequence of XML elements
2. There is no "recursion" of elements
3. Fixed depth of 1
My design is fairly simple: it reads records one by one and puts
them into Strings (almost identical to TextInputFormat &
LineRecordReader).
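For illustration, here is a minimal sketch of that design, assuming the
JDK's built-in StAX implementation and a record element name supplied by
the caller. The class XmlRecordPuller and its methods are hypothetical
names, not part of Hadoop, and the sketch leaves out the RecordReader
plumbing and any FileSplit handling.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

/** Hypothetical helper: pulls records one at a time and returns each as an XML string. */
class XmlRecordPuller {
    private final XMLEventReader events;
    private final String recordElement;
    private final XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();

    XmlRecordPuller(Reader in, String recordElement) throws XMLStreamException {
        this.events = XMLInputFactory.newInstance().createXMLEventReader(in);
        this.recordElement = recordElement;
    }

    /** Returns the next record as a String, or null when the input is exhausted. */
    String next() throws XMLStreamException {
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()
                    && event.asStartElement().getName().getLocalPart().equals(recordElement)) {
                return readRecord(event);
            }
        }
        return null;
    }

    /** Copies events from the record's start tag through its matching end tag. */
    private String readRecord(XMLEvent start) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = outputFactory.createXMLEventWriter(out);
        writer.add(start);
        int depth = 1; // depth is tracked so nested child elements are copied whole
        while (depth > 0) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                depth--;
            }
            writer.add(event);
        }
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<root><rec id=\"1\">hello</rec><rec id=\"2\">world</rec></root>";
        XmlRecordPuller puller = new XmlRecordPuller(new StringReader(xml), "rec");
        for (String record = puller.next(); record != null; record = puller.next()) {
            System.out.println(record);
        }
    }
}
```

Each call to next() skips forward to the next start tag with the given
name, so anything outside the records (including the top element) is
ignored.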
I decided to skip the whole FileSplit issue for now. The user
basically needs to specify the name of the element that denotes a
record. I've got a couple of questions for the community:
1. Do we see a lot of performance benefits by using FileSplit for
text files?
2. What StAX parser do people consider the fastest?
3. Does it make sense to "assume" that for an XML file, the first
"sequence" is the sequence of records? If so, I'm thinking about
putting in a convenience function that will "detect" what the element
name is for records.
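One possible reading of question 3 is sketched below, under the
assumption that the records are the direct children of the top element,
so the first child's name is taken as the record element name. The class
RecordElementDetector is a hypothetical name, and other definitions of
"first sequence" are possible.

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: guesses the record element name from the start of a document. */
class RecordElementDetector {

    /**
     * Returns the local name of the first child element of the root,
     * or null if the root has no child elements.
     */
    static String detect(Reader in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            int depth = 0;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    depth++;
                    if (depth == 2) {
                        // Assumption: the first element below the root is a record.
                        return reader.getLocalName();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    depth--;
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<feed><entry>a</entry><entry>b</entry></feed>";
        System.out.println(detect(new StringReader(xml)));
    }
}
```

Because the scan stops at the first child element, only the beginning of
the file has to be parsed.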
I'm going to do some performance testing as well. To the Yahoo guys -
is there a point of contact through whom I can get my changes
integrated into the trunk?
Thanks,
Alan Ho
On Nov 12, 2007, at 11:11 AM, Arkady Borkovsky wrote:
Alan,
Can you tell a little more about specific needs you try to cover?
Do you deal with full XML? Correct XML?
A pretty common situation is
-- the input is a sequence of XML elements ("records"), and the
application does not care about the "top element" that covers the
whole file
-- there is no "recursion" -- that is an element <A>...</A> never
appears inside another <A> element.
-- as a special case, the tree has a fixed depth (often 1)
-Arkady
On Nov 11, 2007, at 11:24 PM, Alan Ho wrote:
I've been looking long and hard for a good way to process XML. I've
looked at the Streaming XML Record reader, and frankly, it
doesn't look good.
Here's how far I got prototyping:
I've been using a StAX parser (the one that comes with J2EE 5).
DOM and SAX don't cut it because the RecordReader interface needs
the ability to "pull" record by record.
I've also been using JAXB 2.0 in order to bind the XML to real
Java objects.
Here are some of my dilemmas:
1. FileSplit - I'm not sure if I should even try to implement this
capability. I'm working off the LineRecordReader example, and the
low-level manipulation of bytes seems really tricky. With StAX, I'm
not able to track where in the file I've read up to, so I'm unable
to figure out when to stop parsing a section of the file. The only
way that I can see this working is to "extend" BufferedInputStream
with my own version that tracks how many bytes have been read.
2. Should I even bother with JAXB? If it's cumbersome, then I'd
rather not use it. Alternatively, a call to "next" could simply
return a single record represented as XML.
3. Is a StAX parser adequate? I'm not sure that it would
be fast enough.
4. Am I re-inventing the wheel - has someone else done this?
Please let me know.
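For dilemma 1, a byte-counting wrapper could look roughly like the
sketch below. It wraps the stream with FilterInputStream rather than
extending BufferedInputStream, and the class name CountingInputStream is
made up for this example. One caveat: a StAX parser typically reads
ahead in chunks, so the count is an upper bound on the parser's
position, not the exact offset of the current event.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical wrapper that counts how many bytes have been consumed from a stream. */
class CountingInputStream extends FilterInputStream {
    private long count;

    CountingInputStream(InputStream in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            count++; // end of stream (-1) is not counted
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            count += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        count += skipped;
        return skipped;
    }

    // mark/reset would make the count go stale, so this sketch disables them.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readLimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    /** Number of bytes read or skipped so far. */
    long getCount() {
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "<a><b/></a>".getBytes(StandardCharsets.UTF_8);
        CountingInputStream in = new CountingInputStream(new ByteArrayInputStream(data));
        while (in.read() != -1) {
            // drain the stream
        }
        System.out.println("bytes read: " + in.getCount());
    }
}
```

A reader working on a split could compare getCount() against the split
length to decide when to stop, keeping the read-ahead caveat in mind.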
If someone is interested in my work, I could contribute back to
the community.
Thanks,
Alan Ho