[ 
https://issues.apache.org/jira/browse/HADOOP-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673445#action_12673445
 ] 

Alan Ho commented on HADOOP-2439:
---------------------------------

The approach described doesn't use a StAX parser, and probably isn't going to 
be as robust to failure or as extensible as one that does. If you look at 
the code, my patch allows you to specify the XML element name, 
"namespace_prefix", and namespace_URI when identifying the correct tag.  My 
patch also makes it easier to massage the XML when reading in data.
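
For illustration, here is a minimal sketch of matching a tag by element name 
and namespace URI with a StAX pull parser. It is not taken from the attached 
patch: the class name StaxTagMatch, the method textOf, and its parameters are 
hypothetical, and it assumes the matched elements contain only text. It 
compares the namespace URI rather than the prefix, since the prefix is only a 
document-local alias for the URI.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only; not the code from the patch.
class StaxTagMatch {

    // Returns the text of every element whose local name and namespace URI
    // both match. Assumes the matched elements have text-only content.
    static List<String> textOf(String xml, String namespaceUri, String localName)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> matches = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && localName.equals(reader.getLocalName())
                        && namespaceUri.equals(reader.getNamespaceURI())) {
                    // getElementText() consumes up to the matching end tag and
                    // throws XMLStreamException if a child element is found.
                    matches.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return matches;
    }
}
```

Calling textOf with a document, a namespace URI, and a local name such as 
"title" would collect only the title elements bound to that namespace and 
skip same-named elements outside it. A record reader would additionally need 
to handle input splits, which this sketch does not attempt.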

Initially, when I set out to create an XML parser, I tried to hack something up 
like the approach described above. But after trying to parse real-world data 
(e.g. a dump of Wikipedia), I threw up my hands and decided to use a proper 
pull parser.



> Hadoop needs a better XML Input
> -------------------------------
>
>                 Key: HADOOP-2439
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2439
>             Project: Hadoop Core
>          Issue Type: New Feature
>          Components: mapred
>    Affects Versions: 0.15.1
>            Reporter: Alan Ho
>            Priority: Minor
>         Attachments: HADOOP-2439Patch.patch
>
>
> Hadoop does not have a good XML parser for XML input. The XML parser in the 
> streaming class is fairly difficult to work with and doesn't have proper test 
> cases around it.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
