[jira] [Updated] (PIG-3865) Remodel the XMLLoader to work to be faster and more maintainable

Ahmed Eldawy (JIRA) Thu, 03 Apr 2014 09:08:35 -0700

     [ 
https://issues.apache.org/jira/browse/PIG-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ahmed Eldawy updated PIG-3865:
------------------------------

    Status: Patch Available  (was: Open)

A patch for the tests is attached. It adds a new test for the case where a 
matching element spans two blocks.

The new code of XMLLoader is attached. It is not provided as a patch since it 
is a complete remodel.

> Remodel the XMLLoader to work to be faster and more maintainable
> ----------------------------------------------------------------
>
>                 Key: PIG-3865
>                 URL: https://issues.apache.org/jira/browse/PIG-3865
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Ahmed Eldawy
>            Assignee: Ahmed Eldawy
>            Priority: Minor
>         Attachments: PIG-3865.patch, XMLLoader.java
>
>
> I recreated the XMLLoader in PiggyBank to work line by line instead of 
> character by character. This makes it more efficient, as it applies 
> precompiled regular expressions to each line instead of doing checks on a 
> character by character basis. The code is also significantly smaller, which 
> makes it more maintainable.
> To give you some context: I'm a PhD student at the University of Minnesota. 
> I built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an 
> extension to Hadoop that adds spatial data types and indexes in HDFS. The 
> system is open source and has been downloaded more than 75,000 times so far. 
> Part of the project is to provide a simple high-level language that works 
> with spatial data.
> I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial 
> extension to Pig. My case study is the planet file from OpenStreetMap. This 
> is a 450GB XML file that contains all the information about the whole planet. 
> I previously used XMLLoader to parse it. I found some bugs and fixed them in 
> previous issues. Now I have found that parsing the XML file takes a lot of 
> time. To be a good citizen, I remodeled the XMLLoader to work line by line 
> and use precompiled regular expressions, which makes it faster. The parsing 
> time of the compressed OSM planet file drops from 5 hours 30 minutes to 
> 3 hours 30 minutes on my cluster setup with Hadoop 1.2.1. By the way, Pigeon 
> was presented at ICDE 2014 
> [http://ieee-icde2014.eecs.northwestern.edu/program.html], a top 
> conference in data engineering.
> The code is now more maintainable. For example, I can easily modify it to 
> accept a regular expression for the XML identifier so that it matches all 
> tags that satisfy the regular expression instead of returning only a fixed 
> static tag. I didn't add any new features in this version, but they can be 
> added in the future.
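
For readers who want a feel for the line-by-line approach described in the issue, here is a minimal sketch. It is not the attached XMLLoader.java. The class name LineByLineXmlSketch and the method extractElements are hypothetical, and the sketch makes simplifying assumptions: elements with the same name are not nested, attribute values contain no '>' character, and input splits (blocks) are not handled at all. It only illustrates compiling the start-tag and end-tag patterns once and applying them to each line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration only; not the code attached to PIG-3865. */
class LineByLineXmlSketch {

    /**
     * Returns every element named tagName found in the given lines.
     * Assumes same-name elements are not nested and that attribute
     * values contain no '>' character.
     */
    static List<String> extractElements(List<String> lines, String tagName) {
        String quoted = Pattern.quote(tagName);
        // Both patterns are compiled once and reused for every line.
        Pattern start = Pattern.compile("<" + quoted + "(\\s[^>]*)?/?>");
        Pattern end = Pattern.compile("</" + quoted + "\\s*>");

        List<String> result = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an element

        for (String line : lines) {
            int pos = 0;
            while (pos < line.length()) {
                if (current == null) {
                    Matcher m = start.matcher(line);
                    if (!m.find(pos)) {
                        break; // no start tag in the rest of this line
                    }
                    if (m.group().endsWith("/>")) {
                        result.add(m.group()); // self-closing element
                    } else {
                        current = new StringBuilder(m.group());
                    }
                    pos = m.end();
                } else {
                    Matcher m = end.matcher(line);
                    if (!m.find(pos)) {
                        // Element continues on the next line.
                        current.append(line, pos, line.length());
                        break;
                    }
                    current.append(line, pos, m.end());
                    result.add(current.toString());
                    current = null;
                    pos = m.end();
                }
            }
            if (current != null) {
                current.append('\n'); // keep the line break inside the element
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "<osm>",
                "  <node id=\"1\"/>",
                "  <node id=\"2\">",
                "    <tag k=\"a\"/>",
                "  </node>",
                "</osm>");
        for (String element : extractElements(lines, "node")) {
            System.out.println(element);
        }
    }
}
```

In this sketch the tag name is passed through Pattern.quote, so it is matched literally. Accepting a regular expression for the tag name, as the issue suggests for the future, would amount to substituting a caller-supplied pattern at that point; whether the real loader would do it this way is an assumption.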



--
This message was sent by Atlassian JIRA
(v6.2#6252)
