[ https://issues.apache.org/jira/browse/HADOOP-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12673445#action_12673445 ]
Alan Ho commented on HADOOP-2439:
---------------------------------
The approach described doesn't use a StAX parser, and it probably won't be as
robust to failure or as extensible as one that does. If you look at the code,
my patch lets you specify the XML element name, "namespace_prefix", and
"namespace_URI" when identifying the correct tag. My patch also makes it easier
to massage the XML when reading in data.

When I first tried to create an XML parser, I tried to hack something up like
the previous approach described. But after trying to parse real-world data
(e.g. a dump of Wikipedia), I threw up my hands and decided to use a proper
pull parser.
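
To illustrate the general idea (this is not the code from the attached patch),
here is a minimal sketch of matching a tag with a StAX pull parser by local
name and namespace URI. The class name StaxTagMatchSketch and the method
extractText are made up for this example, and it assumes the matched elements
contain only text. Only the standard javax.xml.stream API is used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Hadoop or of the patch.
public class StaxTagMatchSketch {

    // Returns the text of every element whose local name and namespace URI
    // match. Pass "" as namespaceUri to match elements with no namespace.
    static List<String> extractText(String xml, String localName, String namespaceUri)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Namespace awareness is what lets us compare URIs instead of raw tag text.
        factory.setProperty(XMLInputFactory.IS_NAMESPACE_AWARE, Boolean.TRUE);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        List<String> matches = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event != XMLStreamConstants.START_ELEMENT) {
                    continue;
                }
                // getNamespaceURI() may return null for elements without a namespace.
                String uri = reader.getNamespaceURI() == null ? "" : reader.getNamespaceURI();
                if (localName.equals(reader.getLocalName()) && namespaceUri.equals(uri)) {
                    // Assumes text-only content; getElementText() throws
                    // XMLStreamException if it meets a child element.
                    matches.add(reader.getElementText());
                }
            }
        } finally {
            reader.close();
        }
        return matches;
    }
}
```

Because the comparison is on the namespace URI, the same element is found no
matter which prefix a given document binds to that URI; reader.getPrefix() is
available if the prefix itself also needs to be checked.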
> Hadoop needs a better XML Input
> -------------------------------
>
> Key: HADOOP-2439
> URL: https://issues.apache.org/jira/browse/HADOOP-2439
> Project: Hadoop Core
> Issue Type: New Feature
> Components: mapred
> Affects Versions: 0.15.1
> Reporter: Alan Ho
> Priority: Minor
> Attachments: HADOOP-2439Patch.patch
>
>
> Hadoop does not have a good XML parser for XML input. The XML parser in the
> streaming class is fairly difficult to work with and doesn't have proper test
> cases around it.