[ https://issues.apache.org/jira/browse/PIG-3865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmed Eldawy updated PIG-3865:
------------------------------
    Status: Patch Available  (was: Open)

A patch is attached for the test; it adds a new test case for a matching
element that spans two blocks. The new XMLLoader code is attached as a
complete file rather than as a patch, since it is a complete remodel.

> Remodel the XMLLoader to work to be faster and more maintainable
> ----------------------------------------------------------------
>
>                 Key: PIG-3865
>                 URL: https://issues.apache.org/jira/browse/PIG-3865
>             Project: Pig
>          Issue Type: Improvement
>          Components: piggybank
>            Reporter: Ahmed Eldawy
>            Assignee: Ahmed Eldawy
>            Priority: Minor
>         Attachments: PIG-3865.patch, XMLLoader.java
>
>
> I recreated the XMLLoader in PiggyBank to work line by line instead of
> character by character. This makes it more efficient, because it applies
> precompiled regular expressions to each line instead of checking the input
> one character at a time. The code is also significantly smaller, which
> makes it more maintainable.
>
> To give you some context: I'm a PhD student at the University of Minnesota.
> I built SpatialHadoop [http://spatialhadoop.cs.umn.edu], an extension to
> Hadoop that adds spatial data types and indexes in HDFS. The system is open
> source and has been downloaded more than 75,000 times so far. Part of it is
> a simple high-level language that works with spatial data. I proposed
> Pigeon [http://spatialhadoop.cs.umn.edu/pigeon] as a spatial extension to
> Pig. My case study is the planet file from OpenStreetMap, a 450GB XML file
> that contains all the information about the whole planet. I previously used
> XMLLoader to parse it; I found some bugs and fixed them in previous issues.
> I then found that parsing the XML file takes a lot of time. To be a good
> citizen, I remodeled the XMLLoader to work line by line and use precompiled
> regular expressions, which makes it faster. The parsing time of the
> compressed OSM planet file drops from 5:30 hours to 3:30 hours in my
> cluster setup with Hadoop 1.2.1. By the way, Pigeon was presented at ICDE
> 2014 [http://ieee-icde2014.eecs.northwestern.edu/program.html], a top
> conference in data engineering.
>
> The code is now more maintainable. For example, I can easily modify it to
> accept a regular expression for the XML identifier, so that it matches all
> tags that satisfy the regular expression instead of only a single fixed
> tag. I didn't add any new features in this version, but they can be added
> in the future.
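
For readers who want a concrete picture of the line-by-line approach described
in the issue, here is a minimal sketch. It is not the attached XMLLoader.java:
the class name LineTagScanner, its extract method, and the two patterns are
hypothetical, chosen only for illustration. It assumes elements with the target
tag are not nested inside each other, and it reads from an in-memory list of
lines rather than from a Hadoop input split, so the split-boundary case covered
by the attached test is not modeled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: collect <tag>...</tag> elements by scanning whole lines. */
public class LineTagScanner {
    private final Pattern startTag; // compiled once, reused for every line
    private final Pattern endTag;

    public LineTagScanner(String tag) {
        String q = Pattern.quote(tag);
        // Matches <tag>, <tag attr="..."> and the self-closing forms <tag/>, <tag attr="..."/>.
        this.startTag = Pattern.compile("<" + q + "(\\s[^>]*)?/?>");
        this.endTag = Pattern.compile("</" + q + "\\s*>");
    }

    public List<String> extract(Iterable<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null while inside an open element
        for (String line : lines) {
            int pos = 0;
            while (true) {
                if (current == null) {
                    Matcher open = startTag.matcher(line);
                    if (!open.find(pos)) {
                        break; // no element starts on the rest of this line
                    }
                    if (open.group().endsWith("/>")) {
                        records.add(open.group()); // self-closing element is a full record
                        pos = open.end();
                        continue;
                    }
                    current = new StringBuilder();
                    pos = open.start();
                }
                Matcher close = endTag.matcher(line);
                if (!close.find(pos)) {
                    // Element continues on the next line; keep the remainder of this one.
                    current.append(line, pos, line.length()).append('\n');
                    break;
                }
                current.append(line, pos, close.end());
                records.add(current.toString());
                current = null;
                pos = close.end();
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "<osm>", "  <node id=\"1\">", "    <tag k=\"a\"/>", "  </node>",
            "  <node id=\"2\"/>", "</osm>");
        for (String record : new LineTagScanner("node").extract(lines)) {
            System.out.println(record);
            System.out.println("----");
        }
    }
}
```

Because the partially built element is carried from one line to the next, an
element that spans several lines comes out as a single record. A real loader
would additionally have to decide which split owns an element that crosses a
block boundary.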