[ 
https://issues.apache.org/jira/browse/PIG-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ahmed Eldawy updated PIG-3865:
------------------------------

    Attachment: PIG-3865-test.txt

Sorry, the previous attachment was the patch for XMLLoader, not for the test. 
This is the correct patch for the test. I made three main changes:
1- I removed the attribute 'patternString', which turned out to be unneeded.
2- I fixed an error in the tests testShouldReturn0TupleCountIfNoEndTagIsFound and 
testShouldReturn1ForIntermediateTagData. The identifier was passed as 
'</ignoreProperty>'; I changed it to 'ignoreProperty'.
3- I added a new test, needed for the new design of XMLLoader, that covers a 
multiline matching tag spanning two blocks.

> Remodel the XMLLoader to work to be faster and more maintainable
> ----------------------------------------------------------------
>
>                 Key: PIG-3865
>                 URL: https://issues.apache.org/jira/browse/PIG-3865
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Ahmed Eldawy
>            Assignee: Ahmed Eldawy
>            Priority: Minor
>         Attachments: PIG-3865-test.txt, XMLLoader.java
>
>
> I recreated the XMLLoader in PiggyBank to work line by line instead of 
> character by character. This makes it more efficient as it uses precompiled 
> regular expressions on each line instead of doing checks on a character by 
> character basis. The code is also significantly smaller which makes it more 
> maintainable.
> For context: I'm a PhD student at the University of Minnesota. 
> I built SpatialHadoop [http://spatialhadoop.cs.umn.edu], which is an extension 
> to Hadoop that adds spatial data types and indexes in HDFS. The system is 
> open source and has been downloaded more than 75,000 times so far. Part of it 
> is to provide a simple high-level language that works with spatial data.
> I proposed Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial 
> extension to Pig. My case study is the planet file from OpenStreetMap. This 
> is a 450GB XML file that contains all the information about the whole planet. 
> I previously used XMLLoader to parse it. I found some bugs and fixed them in 
> previous issues. Now, I found that it takes a lot of time to parse the XML 
> file. To be a good citizen, I remodeled the XMLLoader to work line by line 
> and use precompiled regular expressions, which makes it faster. The parsing 
> time of the compressed OSM planet file drops from 5:30 hours to 3:30 hours on 
> my cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE 
> 2014 [http://ieee-icde2014.eecs.northwestern.edu/program.html], a top 
> conference in data engineering.
> The code is now more maintainable. For example, I can easily modify it to 
> accept a regular expression for the XML identifier so that it matches all 
> tags that satisfy the regular expression instead of just returning a fixed 
> static tag. In this version, I didn't add any new features, but they can be 
> added in the future.
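
To illustrate the line-by-line approach described above, here is a minimal 
sketch in Java. It is not the code from the attached XMLLoader.java: the class 
LineTagScanner and its extract method are hypothetical names, and the sketch 
assumes each start tag and each end tag fits on a single line and that matching 
elements are not nested. The two patterns are compiled once from the bare 
identifier (for example 'ignoreProperty', without angle brackets) and then 
reused for every line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of Pig or PiggyBank).
final class LineTagScanner {
    private final Pattern startTag;
    private final Pattern endTag;

    // 'identifier' is the bare tag name, e.g. "ignoreProperty".
    LineTagScanner(String identifier) {
        String id = Pattern.quote(identifier);
        // Compile both patterns once; they are reused for every line.
        this.startTag = Pattern.compile("<" + id + "(\\s[^>]*)?>");
        this.endTag = Pattern.compile("</" + id + "\\s*>");
    }

    // Returns one string per complete element found in the given lines.
    // Assumption: start and end tags are not split across lines.
    List<String> extract(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an element
        for (String line : lines) {
            if (current == null && startTag.matcher(line).find()) {
                current = new StringBuilder();
            }
            if (current != null) {
                current.append(line).append('\n');
                if (endTag.matcher(line).find()) {
                    records.add(current.toString());
                    current = null;
                }
            }
        }
        // An element with no end tag is dropped in this sketch.
        return records;
    }
}
```

In this sketch, passing '</ignoreProperty>' as the identifier would build a 
start pattern beginning with '<</', which matches nothing. Self-closing tags 
and tags that span several lines or two input blocks are deliberately left out 
here and would need extra handling.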



--
This message was sent by Atlassian JIRA
(v6.2#6252)
