Awesome. This would be a great addition to Pig. Please create a JIRA. Russell Jurney http://datasyndrome.com
On May 29, 2013, at 8:51 AM, Ahmed Eldawy <[email protected]> wrote:

> Hi all,
>
> Nick has pointed out to me an alternative GIS package that can replace JTS. ESRI has recently released a GIS package <https://github.com/Esri/geometry-api-java> under the Apache license. I changed Pigeon to work with that new package. I think it could be easier now to integrate this work with the main branch of Apache Pig. I will go on with the current project and add more spatial functionality. We can then add a new datatype to Apache Pig and link it to those functions.
>
> The ESRI package contains a class OGCGeometry <http://esri.github.io/geometry-api-java/javadoc/com/esri/core/geometry/ogc/OGCGeometry.html> which can be linked to a new datatype 'Geometry'. Do you think we can rely on the new package and integrate the work with Apache Pig?
>
> On May 23, 2013 11:40 PM, "Ahmed Eldawy" <[email protected]> wrote:
>
>> Hi all,
>> Thanks for your help. I've started the project with minimal functionality as a start. It's currently hosted on GitHub. It is licensed under the Apache License to make it easier to merge with Pig. Currently it has only a very few functions. I implemented one function from each of several types of functions (e.g., aggregate and create). I'll keep adding functions, and any contributions to the project are welcome. To begin with, I need an Ant build file that runs the tests, compiles, and generates a jar file. I'm not familiar with Ant, so any help with this is encouraged.
>> Here's the project home page:
>> https://github.com/aseldawy/pigeon
>>
>> If you have any comments or suggestions, please contact me.
>>
>> Best regards,
>> Ahmed Eldawy
>>
>> On Mon, May 6, 2013 at 3:09 PM, Jonathan Coveney <[email protected]> wrote:
>>
>>> Nick: the only issue is that the way types are implemented in Pig doesn't allow us to easily "plug in" types externally. Adding support for that would be cool, but a fair bit of work.
>>>
>>> 2013/5/6 Nick Dimiduk <[email protected]>
>>>
>>>> I'm not a lawyer, but I see no reason why this cannot be an external extension to Pig. It would behave the same way PostGIS is an external extension to Postgres. Any Apache issues would be toward general purpose enhancements, not specific to your project.
>>>>
>>>> Good on you!
>>>> -n
>>>>
>>>> On Mon, May 6, 2013 at 10:12 AM, Ahmed Eldawy <[email protected]> wrote:
>>>>
>>>>> I contacted the Solr developers to see how JTS can be included in an Apache project. See http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
>>>>> As far as I understand, they did not include it in the main Solr project; rather, they created a separate project (Spatial4j) which is still licensed under the Apache license and refers to JTS. Users will have to download the JTS libraries separately to make it run. That's pretty much the same plan that Jonathan mentioned. We will still have the overhead of serializing/deserializing the shapes each time a function is called. Also, we will have to use the ugly bytearray data type for spatial data instead of creating its own data type (e.g., Geometry).
>>>>> I think using Spatial4j instead of JTS will not be sufficient for our case, as we need to provide access to all spatial functions of JTS such as Union, Intersection, Difference, etc. This way we can claim conformity with OGC standards, which gives us visibility and appreciation in the spatial community.
>>>>> I think also that this means I will not add any issues to JIRA, as it is now a separate project. I'm planning to host it on GitHub and have all the issues there.
>>>>> Let me know if you have any suggestions or comments.
>>>>>
>>>>> Thanks
>>>>> Ahmed
>>>>>
>>>>> Best regards,
>>>>> Ahmed Eldawy
>>>>>
>>>>> On Mon, May 6, 2013 at 9:53 AM, Jonathan Coveney <[email protected]> wrote:
>>>>>
>>>>>> You can give them all the same label or tag and filter on that later on.
>>>>>>
>>>>>> 2013/5/6 Ahmed Eldawy <[email protected]>
>>>>>>
>>>>>>> Thanks all for taking the time to respond. Daniel, I didn't know that Solr uses JTS. This is a good finding and we can definitely ask them to see if there is a workaround we can use. Jonathan, I thought of the same idea of serializing/deserializing a bytearray each time a UDF is called. The deserialization part is good for letting Pig auto-detect spatial types if not set explicitly in the schema. What is the best way to start this? I want to add an initial set of JIRA issues and start working on them, but I also need to keep the work grouped in some sense just for organization.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Ahmed
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Ahmed Eldawy
>>>>>>>
>>>>>>> On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <[email protected]> wrote:
>>>>>>>
>>>>>>>> I agree that this is cool, and if other projects are using JTS it is worth talking to them to see how. I also agree that licensing is very frustrating.
>>>>>>>>
>>>>>>>> In the short term, however, while it is annoying to have to manage the serialization and deserialization yourself, you can have the geometry type be passed around as a bytearray type. Your UDFs will have to know this and treat it accordingly, but if you did this then all of the tools could be in an external project on GitHub instead of a branch in Pig.
>>>>>>>> Then, if we can get the licensing done, we could add the Geometry type to Pig. Adding types, honestly, is kind of tedious but not super difficult, so once the rest is done, that shouldn't be too difficult.
>>>>>>>>
>>>>>>>> 2013/5/4 Russell Jurney <[email protected]>
>>>>>>>>
>>>>>>>>> If a way could be found, this would be an awesome addition to Pig.
>>>>>>>>>
>>>>>>>>> Russell Jurney http://datasyndrome.com
>>>>>>>>>
>>>>>>>>> On May 3, 2013, at 4:09 PM, Daniel Dai <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I am not sure how other Apache projects deal with it. It seems Solr also has some connector to JTS?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Daniel
>>>>>>>>>>
>>>>>>>>>> On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Alan for your interest. It's too bad that an open source licensing issue is holding me back from doing some open source work. I understand the issue and your workarounds make sense. However, as I mentioned in the beginning, I don't want to have my own branch of Pig because it makes my extension less portable. I'll think of another way to do it. I'll ask Vivid Solutions if they can dual-license their code, although I think the answer will be no. I'll also think of a way to ship my extension as a set of jar files without the need to change the core of Pig. This way, it can be easily ported to newer versions of Pig.
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>> Ahmed
>>>>>>>>>>>
>>>>>>>>>>> Best regards,
>>>>>>>>>>> Ahmed Eldawy
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I know this is frustrating, but the different licenses do have different requirements that make it so that Apache can't ship GPL code. A legal explanation is at
>>>>>>>>>>>> http://www.apache.org/licenses/GPL-compatibility.html
>>>>>>>>>>>> For additional info on the LGPL specific questions see
>>>>>>>>>>>> http://www.apache.org/legal/3party.html
>>>>>>>>>>>>
>>>>>>>>>>>> As far as pulling it in via ivy, the issue isn't so much where the code lives as much as what code we are requiring to make Pig work. If something that is [L]GPL is required for Pig it violates Apache rules as outlined above. It also would be a show stopper for a lot of companies that redistribute Pig and that are allergic to GPL software.
>>>>>>>>>>>>
>>>>>>>>>>>> So, as I said before, if you wanted to continue with that library and they are not willing to relicense it then it would have to be bolted on after Apache Pig is built. Nothing stops you from doing this by downloading Apache Pig, adding this library and your code, and redistributing, though it wouldn't then be open to all Pig users.
>>>>>>>>>>>>
>>>>>>>>>>>> Alan.
>>>>>>>>>>>>
>>>>>>>>>>>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your response. I was never good at differentiating all those open source licenses.
>>>>>>>>>>>>> I mean, what is the point of making open source licenses if they block me from using a library in an open source project? Anyway, I'm not going to debate that here. Just one question: if we use JTS as a library (jar file) without adding the code to Pig, is it still a violation? We'll use Ivy, for example, to download the jar file when compiling.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On May 1, 2013 7:50 PM, "Alan Gates" <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Passing on the technical details for a moment, I see a licensing issue. JTS is licensed under LGPL. Apache projects cannot contain or ship [L]GPL. Apache does not meet the requirements of GPL and thus we cannot repackage their code. If you wanted to go forward using that class, this would have to be packaged as an add-on that was downloaded separately and not from Apache. Another option is to work with the JTS community and see if they are willing to dual license their code under the BSD or Apache license so that Pig could include it. If neither of those is an option, you would need to come up with a new class to contain your spatial data.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Alan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>> First, sorry for the long email. I wanted to put all my thoughts here and get your feedback.
>>>>>>>>>>>>>>> I'm proposing a major addition to Pig that will greatly increase its functionality and user base. It is simply to add spatial support to the language and the framework. I've already started working on that, but I don't want it to be just another branch. I want it, eventually, to be merged with the trunk of Apache Pig. So, I'm sending this email mainly to reach out to the main contributors of Pig to see the feasibility of this.
>>>>>>>>>>>>>>> This addition is part of a big project we have been working on at the University of Minnesota; the project is called Spatial Hadoop (http://spatialhadoop.cs.umn.edu). It's about building a MapReduce framework (Hadoop) that is capable of maintaining and analyzing spatial data efficiently. I'm the main guy behind that project, and since we released its first version, we have received very encouraging responses from different groups in the research and industrial community. I'm sure the addition we want to make to Pig Latin will be widely accepted by the people in the spatial community.
>>>>>>>>>>>>>>> I'm proposing a plan here while we're still in the early phases of this task so that I can discuss it with the main contributors and see its feasibility.
>>>>>>>>>>>>>>> First of all, I think that we need to change the core of Pig to be able to support spatial data. Providing a set of UDFs only is not enough. The main reason is that Pig Latin does not provide a way to create a new data type, which is needed for spatial data. Once we have the spatial data types we need, the functionality can be expanded using more UDFs.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Here's the plan as I see it.
>>>>>>>>>>>>>>> 1- Introduce a new primitive data type Geometry which represents all spatial data types. In the underlying system, this will map to com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and efficient open source Java library for spatial data types and algorithms. It is very popular in the spatial community and a C++ port of it is used in PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also conforms to the standards of the Open Geospatial Consortium (OGC) [http://www.opengeospatial.org/], which are open standards for spatial data types. The Geometry data type is read from and written to text files using the Well Known Text (WKT) format. There is also a way to convert it to/from binary so that it can work with binary files and streams.
>>>>>>>>>>>>>>> 2- Add functions that manipulate spatial data types. These will be added as UDFs and we will not need to mess with the internals of Pig. Most probably, there will be one new class for each operation (e.g., union or intersection). I think it will be good to put these new operations inside the core of Pig so that users can use them without having to write the fully qualified class name. Also, since there is no way to implicitly cast a spatial data type to a non-spatial data type, there will not be any conflicts between existing operations and new operations. All new operations, and only the new operations, will work on spatial data types. Here is an initial list of operations that can be added. All those operations are already implemented in JTS, and the UDFs added to Pig will be just wrappers around them.
>>>>>>>>>>>>>>> **Predicates (used for spatial filtering)
>>>>>>>>>>>>>>> Equals
>>>>>>>>>>>>>>> Disjoint
>>>>>>>>>>>>>>> Intersects
>>>>>>>>>>>>>>> Touches
>>>>>>>>>>>>>>> Crosses
>>>>>>>>>>>>>>> Within
>>>>>>>>>>>>>>> Contains
>>>>>>>>>>>>>>> Overlaps
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> **Operations
>>>>>>>>>>>>>>> Envelope
>>>>>>>>>>>>>>> Area
>>>>>>>>>>>>>>> Length
>>>>>>>>>>>>>>> Buffer
>>>>>>>>>>>>>>> ConvexHull
>>>>>>>>>>>>>>> Intersection
>>>>>>>>>>>>>>> Union
>>>>>>>>>>>>>>> Difference
>>>>>>>>>>>>>>> SymDifference
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> **Aggregate functions
>>>>>>>>>>>>>>> Accum
>>>>>>>>>>>>>>> ConvexHull
>>>>>>>>>>>>>>> Union
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A Pig loader and Pig output classes will be created for those indexes. Note that currently we have SpatialOutputFormat and SpatialInputFormat for those indexes inside the Spatial Hadoop project, but we need to tweak them to work with Pig.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 4- (Advanced) Implement more sophisticated algorithms for spatial operations that utilize the indexes. For example, we can have a specific algorithm for spatial range query or spatial join. Again, we already have algorithms built for different operations implemented in Spatial Hadoop as MapReduce programs, but they will need to be modified to work in the Pig environment and to work with other operations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This is my whole plan for the spatial extension to Pig.
>>>>>>>>>>>>>>> I've already started with the first step, but as I mentioned earlier, I don't want to do the work for our project and then have it forgotten. I want to contribute to Pig and do my research at the same time. If you think the plan is plausible, I'll open JIRA issues for the above tasks and start submitting patches. I'll conform to the standards of the project, such as adding tests and commenting the code well.
>>>>>>>>>>>>>>> Sorry for the long email and hope to hear back from you.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>>>> Ahmed Eldawy
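
The bytearray workaround discussed in this thread (geometries travel as opaque bytes, and every UDF deserializes its input and serializes its output) can be sketched in a few lines. This is a minimal, standalone illustration in plain Python, not Pig, JTS, or ESRI code: the byte layout is made up for the example (it is not WKB), and polygon_to_bytes, polygon_from_bytes, area_udf, and envelope_udf are hypothetical names.

```python
import struct


def polygon_to_bytes(ring):
    """Pack a ring of (x, y) pairs as a uint32 point count followed by doubles.

    Assumed layout for illustration only; it is not the WKB format.
    """
    flat = [coord for point in ring for coord in point]
    return struct.pack("<I%dd" % len(flat), len(ring), *flat)


def polygon_from_bytes(blob):
    """Inverse of polygon_to_bytes: recover the list of (x, y) pairs."""
    (count,) = struct.unpack_from("<I", blob, 0)
    flat = struct.unpack_from("<%dd" % (2 * count), blob, 4)
    return list(zip(flat[0::2], flat[1::2]))


def area_udf(blob):
    """UDF-style function: bytes in, number out (shoelace formula)."""
    ring = polygon_from_bytes(blob)  # deserialization happens on every call
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def envelope_udf(blob):
    """UDF-style function: bytes in, bytes out (axis-aligned bounding box)."""
    ring = polygon_from_bytes(blob)  # deserialize the input
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    box = [(min(xs), min(ys)), (max(xs), min(ys)),
           (max(xs), max(ys)), (min(xs), max(ys))]
    return polygon_to_bytes(box)  # serialize the result again


triangle = polygon_to_bytes([(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)])
print(area_udf(triangle))                # area of the triangle
print(area_udf(envelope_udf(triangle)))  # area of its bounding box
```

Chaining envelope_udf into area_udf decodes and re-encodes the same shape twice, which is the per-call overhead that a native Geometry type would avoid.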
