Ahmed Eldawy created PIG-3865:
---------------------------------
Summary: Remodel the XMLLoader to be faster and more
maintainable
Key: PIG-3865
URL: https://issues.apache.org/jira/browse/PIG-3865
Project: Pig
Issue Type: Improvement
Components: piggybank
Reporter: Ahmed Eldawy
Assignee: Ahmed Eldawy
Priority: Minor
I recreated the XMLLoader in PiggyBank to work line by line instead of
character by character. This makes it more efficient as it uses precompiled
regular expressions on each line instead of doing checks on a character by
character basis. The code is also significantly smaller which makes it more
maintainable.
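
To illustrate the line-by-line idea, here is a minimal sketch; it is not the
code from the patch. The class name LineBasedXmlSketch and its methods are
hypothetical, and it assumes that elements of the requested tag are not nested,
that at most one of them starts on any given line, and that attribute values
contain no '>' character. The two patterns are compiled once and reused for
every line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual PiggyBank XMLLoader.
public class LineBasedXmlSketch {

    // Returns every <tag ...>...</tag> (or self-closing <tag .../>) element,
    // reading the input one line at a time.
    public static List<String> extract(BufferedReader in, String tag) throws IOException {
        // Compiled once, then reused for each line.
        Pattern open = Pattern.compile("<" + Pattern.quote(tag) + "(\\s[^>]*)?/?>");
        Pattern close = Pattern.compile("</" + Pattern.quote(tag) + "\\s*>");

        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an element
        String line;
        while ((line = in.readLine()) != null) {
            if (current == null) {
                Matcher m = open.matcher(line);
                if (!m.find()) {
                    continue; // no element of interest starts on this line
                }
                if (m.group().endsWith("/>")) {
                    records.add(m.group()); // self-closing element is a full record
                    continue;
                }
                current = new StringBuilder();
                line = line.substring(m.start()); // drop text before the opening tag
            }
            Matcher c = close.matcher(line);
            if (c.find()) {
                current.append(line, 0, c.end()); // keep text up to the closing tag
                records.add(current.toString());
                current = null;
            } else {
                current.append(line).append('\n');
            }
        }
        return records;
    }

    // Convenience wrapper for in-memory input.
    public static List<String> extractFromString(String xml, String tag) {
        try {
            return extract(new BufferedReader(new StringReader(xml)), tag);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A call such as LineBasedXmlSketch.extractFromString(xml, "node") would return
one string per node element. A real loader would additionally have to deal
with input split boundaries, which this sketch ignores.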
To give you some context: I'm a PhD student at the University of Minnesota. I
built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension to
Hadoop that adds spatial data types and indexes in HDFS. The system is open
source and has been downloaded more than 75,000 times so far. Part of it is to
provide a simple high-level language that works with spatial data.
I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial
extension to Pig. My case study is the planet file from OpenStreetMap. This is
a 450GB XML file that contains all the information about the whole planet. I
previously used XMLLoader to parse it. I found some bugs and fixed them in
previous issues. More recently, I found that it takes a lot of time to parse
the XML file. To be a good citizen, I remodeled the XMLLoader to work line by
line and use precompiled regular expressions, which makes it faster. By the
way, Pigeon was presented at ICDE 2014
[http://ieee-icde2014.eecs.northwestern.edu/program.html], a top conference in
data engineering.
The code is now more maintainable. For example, I can easily modify it to
accept a regular expression for the XML identifier so that it matches all
tags that satisfy the regular expression instead of only a single fixed
tag. In this version, I didn't add any new features, but they can be
added in the future.
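
As a rough sketch of how such an extension might look (this is not part of the
patch, and the class TagRegexSketch and its methods are hypothetical), the
fixed tag name in the opening-tag pattern could be replaced by a
user-supplied regular expression, with the matched name captured in a group:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the actual XMLLoader.
public class TagRegexSketch {

    // Builds a pattern for opening tags whose name matches tagRegex.
    // Assumption: tagRegex is a trusted, valid regular expression.
    public static Pattern openingTagPattern(String tagRegex) {
        return Pattern.compile("<(" + tagRegex + ")(\\s[^>]*)?/?>");
    }

    // Returns the name of the first matching opening tag on the line,
    // or null if the line contains none.
    public static String firstMatchingTag(Pattern pattern, String line) {
        Matcher m = pattern.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

With the pattern built from "node|way", a line containing an opening way tag
yields "way", while a line with only a relation tag yields null.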