[
https://issues.apache.org/jira/browse/PIG-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ahmed Eldawy updated PIG-3865:
------------------------------
Attachment: XMLLoader.java
PIG-3865.patch
A patch for the test to add one more test case to make sure XMLLoader works
when a matching element spans two blocks.
The new code of XMLLoader is attached. It is not provided as a patch since it
is a complete remodel of the old code, not a modification of it.
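
To illustrate the case the new test covers, here is a minimal Java sketch of
an element that spans two blocks. It is not taken from the attached
XMLLoader.java. It assumes the usual Hadoop record reader convention that a
record belongs to the split in which it starts, and that the reader keeps
reading past the end of its split to finish that record. The names
SplitSketch and read are hypothetical, and it uses plain string search on an
in-memory string rather than an HDFS stream.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    /**
     * Returns every <tag>...</tag> element whose opening tag starts at an
     * offset in [start, end). An element may extend beyond 'end'.
     * Assumption: tags carry no attributes (keeps the sketch short).
     */
    public static List<String> read(String data, String tag, int start, int end) {
        String open = "<" + tag + ">";
        String close = "</" + tag + ">";
        List<String> out = new ArrayList<>();
        int pos = data.indexOf(open, start);
        // Only start an element if its opening tag lies inside this split.
        while (pos >= 0 && pos < end) {
            int stop = data.indexOf(close, pos);
            if (stop < 0) {
                break; // unterminated element, nothing more to return
            }
            stop += close.length();
            // The closing tag may be located after 'end'; read through it.
            out.add(data.substring(pos, stop));
            pos = data.indexOf(open, stop);
        }
        return out;
    }
}
```

With the input "<r><e>1</e><e>2</e><e>3</e></r>" and a split boundary at offset
15, the second element starts in the first split and ends in the second one.
Under the assumed convention the first split returns it and the second split
skips it, so each element is returned exactly once.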
> Remodel the XMLLoader to work to be faster and more maintainable
> ----------------------------------------------------------------
>
> Key: PIG-3865
> URL: https://issues.apache.org/jira/browse/PIG-3865
> Project: Pig
> Issue Type: Improvement
> Components: piggybank
> Reporter: Ahmed Eldawy
> Assignee: Ahmed Eldawy
> Priority: Minor
> Attachments: PIG-3865.patch, XMLLoader.java
>
>
> I recreated the XMLLoader in PiggyBank to work line by line instead of
> character by character. This makes it more efficient, as it uses precompiled
> regular expressions on each line instead of doing checks on a character by
> character basis. The code is also significantly smaller, which makes it more
> maintainable.
> Just to give you some context: I'm a PhD student at the University of Minnesota.
> I built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension
> to Hadoop that adds spatial data types and indexes in HDFS. The system is
> open source and has been downloaded more than 75,000 times so far. Part of it
> is to provide a simple high level language that works with spatial data.
> I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial
> extension to Pig. My case study is the planet file from OpenStreetMap. This
> is a 450GB XML file that contains all the information about the whole planet.
> I previously used XMLLoader to parse it. I found some bugs and fixed them in
> previous issues. Now, I found that it takes a lot of time to parse the XML
> file. To be a good citizen, I remodeled the XMLLoader to work line by line
> and use precompiled regular expressions, which makes it faster. The parsing
> time of the compressed OSM planet file drops from 5:30 hours to 3:30 hours on
> my cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE
> 2014 [http://ieee-icde2014.eecs.northwestern.edu/program.html], a top
> conference in data engineering.
> The code is now more maintainable. For example, I can easily modify it to
> accept a regular expression for the XML identifier so that it matches all
> tags that satisfy the regular expression instead of just returning a fixed
> static tag. In this version, I didn't add any new features, but they can be
> added in the future.
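
As an illustration of the line-by-line approach described above, here is a
minimal Java sketch. It is not the attached XMLLoader.java; the class name
LineXmlExtractor, its extract method, and the exact regular expressions are
assumptions made for this example. It compiles the opening and closing tag
patterns once and applies them to each line, and it takes a regular
expression for the tag name, along the lines of the possible extension
mentioned in the description. For brevity it ignores nested elements with the
same name and drops anything that follows a completed element on the same
line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineXmlExtractor {
    // Compiled once in the constructor and reused for every line.
    private final Pattern open;
    private final Pattern close;

    /** tagRegex is a regular expression for the tag name, e.g. "node" or "node|way". */
    public LineXmlExtractor(String tagRegex) {
        open = Pattern.compile("<(?:" + tagRegex + ")(?:\\s[^>]*)?/?>");
        close = Pattern.compile("</(?:" + tagRegex + ")\\s*>");
    }

    /** Returns the text of each matching element, from opening to closing tag. */
    public List<String> extract(Iterable<String> lines) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = null; // non-null while inside an element
        for (String line : lines) {
            if (cur == null) {
                Matcher m = open.matcher(line);
                if (!m.find()) {
                    continue; // no opening tag on this line
                }
                if (m.group().endsWith("/>")) {
                    out.add(m.group()); // self-closing element
                    continue;
                }
                cur = new StringBuilder();
                line = line.substring(m.start()); // drop text before the tag
            } else {
                cur.append('\n');
            }
            Matcher c = close.matcher(line);
            if (c.find()) {
                cur.append(line, 0, c.end()); // keep text up to the closing tag
                out.add(cur.toString());
                cur = null;
            } else {
                cur.append(line);
            }
        }
        return out;
    }
}
```

For example, new LineXmlExtractor("node|way").extract(lines) would return the
node and way elements found in lines, while an element such as <nodes> is not
matched because the pattern requires the tag name to be followed by
whitespace, "/" or ">".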