Nick: the only issue is that the way types are implemented in Pig doesn't allow us to easily "plug in" types externally. Adding support for that would be cool, but a fair bit of work.
2013/5/6 Nick Dimiduk <ndimi...@gmail.com>

> I'm not a lawyer, but I see no reason why this cannot be an external extension to Pig. It would behave the same way PostGIS is an external extension to Postgres. Any Apache issues would be toward general purpose enhancements, not specific to your project.
>
> Good on you!
> -n
>
> On Mon, May 6, 2013 at 10:12 AM, Ahmed Eldawy <aseld...@gmail.com> wrote:
>
> > I contacted the Solr developers to see how JTS can be included in an Apache project. See
> > http://mail-archives.apache.org/mod_mbox/lucene-dev/201305.mbox/raw/%3C1367815102914-4060969.post%40n3.nabble.com%3E/
> > As far as I understand, they did not include it in the main Solr project; rather, they created a separate project (Spatial4j) which is still licensed under the Apache license and refers to JTS. Users will have to download the JTS libraries separately to make it run. That's pretty much the same plan that Jonathan mentioned. We will still have the overhead of serializing/deserializing the shapes each time a function is called. Also, we will have to use the ugly bytearray data type for spatial data instead of creating its own data type (e.g., Geometry).
> > I think using Spatial4j instead of JTS will not be sufficient for our case, as we need to provide access to all spatial functions of JTS such as Union, Intersection, Difference, etc. This way we can claim conformity with OGC standards, which gives visibility and appreciation from the spatial community.
> > I think also that this means I will not add any issues to JIRA as it is now a separate project. I'm planning to host it on GitHub and have all the issues there.
> > Let me know if you have any suggestions or comments.
> >
> > Thanks
> > Ahmed
> >
> > Best regards,
> > Ahmed Eldawy
> >
> > On Mon, May 6, 2013 at 9:53 AM, Jonathan Coveney <jcove...@gmail.com> wrote:
> >
> > > You can give them all the same label or tag and filter on that later on.
> > >
> > > 2013/5/6 Ahmed Eldawy <aseld...@gmail.com>
> > >
> > > > Thanks all for taking the time to respond. Daniel, I didn't know that Solr uses JTS. This is a good finding and we can definitely ask them to see if there is a workaround we can do. Jonathan, I thought of the same idea of serializing/deserializing a bytearray each time a UDF is called. The deserialization part is good for letting Pig auto-detect spatial types if not set explicitly in the schema. What is the best way to start this? I want to add an initial set of JIRA issues and start working on them, but I also need to keep the work grouped in some sense just for organization.
> > > >
> > > > Thanks
> > > > Ahmed
> > > >
> > > > Best regards,
> > > > Ahmed Eldawy
> > > >
> > > > On Sat, May 4, 2013 at 4:47 PM, Jonathan Coveney <jcove...@gmail.com> wrote:
> > > >
> > > > > I agree that this is cool, and if other projects are using JTS it is worth talking to them to see how. I also agree that licensing is very frustrating.
> > > > >
> > > > > In the short term, however, while it is annoying to have to manage the serialization and deserialization yourself, you can have the geometry type be passed around as a bytearray type. Your UDFs will have to know this and treat it accordingly, but if you did this then all of the tools could be in an external project on GitHub instead of a branch in Pig. Then, if we can get the licensing done, we could add the Geometry type to Pig.
> > > > > Adding types, honestly, is kind of tedious but not super difficult, so once the rest is done, that shouldn't be too difficult.
> > > > >
> > > > > 2013/5/4 Russell Jurney <russell.jur...@gmail.com>
> > > > >
> > > > > > If a way could be found, this would be an awesome addition to Pig.
> > > > > >
> > > > > > Russell Jurney http://datasyndrome.com
> > > > > >
> > > > > > On May 3, 2013, at 4:09 PM, Daniel Dai <da...@hortonworks.com> wrote:
> > > > > >
> > > > > > > I am not sure how other Apache projects are dealing with it. It seems Solr also has some connector to JTS?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Daniel
> > > > > > >
> > > > > > > On Thu, May 2, 2013 at 11:59 AM, Ahmed Eldawy <aseld...@gmail.com> wrote:
> > > > > > >
> > > > > > >> Thanks Alan for your interest. It's too bad that an open source licensing issue is holding me back from doing some open source work. I understand the issue and your workarounds make sense. However, as I mentioned in the beginning, I don't want to have my own branch of Pig because it makes my extension less portable. I'll think of another way to do it. I'll ask Vivid Solutions if they can dual-license their code, although I think the answer will be no. I'll also think of a way to ship my extension as a set of jar files without the need to change the core of Pig. This way, it can be easily ported to newer versions of Pig.
> > > > > > > >>
> > > > > > > >> Thanks
> > > > > > > >> Ahmed
> > > > > > > >>
> > > > > > > >> Best regards,
> > > > > > > >> Ahmed Eldawy
> > > > > > > >>
> > > > > > > >> On Thu, May 2, 2013 at 12:33 PM, Alan Gates <ga...@hortonworks.com> wrote:
> > > > > > > >>
> > > > > > > >>> I know this is frustrating, but the different licenses do have different requirements that make it so that Apache can't ship GPL code. A legal explanation is at
> > > > > > > >>> http://www.apache.org/licenses/GPL-compatibility.html
> > > > > > > >>> For additional info on the LGPL specific questions see
> > > > > > > >>> http://www.apache.org/legal/3party.html
> > > > > > > >>>
> > > > > > > >>> As far as pulling it in via ivy, the issue isn't so much where the code lives as much as what code we are requiring to make Pig work. If something that is [L]GPL is required for Pig it violates Apache rules as outlined above. It also would be a show stopper for a lot of companies that redistribute Pig and that are allergic to GPL software.
> > > > > > > >>>
> > > > > > > >>> So, as I said before, if you wanted to continue with that library and they are not willing to relicense it then it would have to be bolted on after Apache Pig is built. Nothing stops you from doing this by downloading Apache Pig, adding this library and your code, and redistributing, though it wouldn't then be open to all Pig users.
> > > > > > > >>>
> > > > > > > >>> Alan.
> > > > > > > >>>
> > > > > > > >>> On May 1, 2013, at 6:08 PM, Ahmed Eldawy wrote:
> > > > > > > >>>
> > > > > > > >>>> Thanks for your response.
> > > > > > > >>>> I was never good at differentiating all those open source licenses. I mean, what is the point of making open source licenses if they block me from using a library in an open source project? Anyway, I'm not going into a debate here. Just one question: if we use JTS as a library (jar file) without adding the code in Pig, is it still a violation? We'll use ivy, for example, to download the jar file when compiling.
> > > > > > > >>>> On May 1, 2013 7:50 PM, "Alan Gates" <ga...@hortonworks.com> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Passing on the technical details for a moment, I see a licensing issue. JTS is licensed under LGPL. Apache projects cannot contain or ship [L]GPL. Apache does not meet the requirements of GPL and thus we cannot repackage their code. If you wanted to go forward using that class this would have to be packaged as an add-on that was downloaded separately and not from Apache. Another option is to work with the JTS community and see if they are willing to dual license their code under BSD or Apache license so that Pig could include it. If neither of those is an option you would need to come up with a new class to contain your spatial data.
> > > > > > > >>>>>
> > > > > > > >>>>> Alan.
> > > > > > > >>>>>
> > > > > > > >>>>> On May 1, 2013, at 5:40 PM, Ahmed Eldawy wrote:
> > > > > > > >>>>>
> > > > > > > >>>>>> Hi all,
> > > > > > > >>>>>> First, sorry for the long email.
> > > > > > > >>>>>> I wanted to put all my thoughts here and get your feedback.
> > > > > > > >>>>>> I'm proposing a major addition to Pig that will greatly increase its functionality and user base. It is simply to add spatial support to the language and the framework. I've already started working on that but I don't want it to be just another branch. I want it, eventually, to be merged with the trunk of Apache Pig. So, I'm sending this email mainly to reach out to the main contributors of Pig to see the feasibility of this.
> > > > > > > >>>>>> This addition is part of a big project we have been working on at the University of Minnesota; the project is called Spatial Hadoop (http://spatialhadoop.cs.umn.edu). It's about building a MapReduce framework (Hadoop) that is capable of maintaining and analyzing spatial data efficiently. I'm the main guy behind that project and since we released its first version, we have received very encouraging responses from different groups in the research and industrial community. I'm sure the addition we want to make to Pig Latin will be widely accepted by the people in the spatial community.
> > > > > > > >>>>>> I'm proposing a plan here while we're still in the early phases of this task to be able to discuss it with the main contributors and see its feasibility. First of all, I think that we need to change the core of Pig to be able to support spatial data. Providing a set of UDFs only is not enough. The main reason is that Pig Latin does not provide a way to create a new data type, which is needed for spatial data. Once we have the spatial data types we need, the functionality can be expanded using more UDFs.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Here's the plan as I see it.
> > > > > > > >>>>>> 1- Introduce a new primitive data type Geometry which represents all spatial data types. In the underlying system, this will map to com.vividsolutions.jts.geom.Geometry. This is a class from Java Topology Suite (JTS) [http://www.vividsolutions.com/jts/JTSHome.htm], a stable and efficient open source Java library for spatial data types and algorithms. It is very popular in the spatial community and a C++ port of it is used in PostGIS [http://postgis.net/] (a spatial library for Postgres). JTS also conforms to the standards of the Open Geospatial Consortium (OGC) [http://www.opengeospatial.org/], which are open standards for spatial data types.
> > > > > > > >>>>>> The Geometry data type is read from and written to text files using the Well Known Text (WKT) format. There is also a way to convert it to/from binary so that it can work with binary files and streams.
> > > > > > > >>>>>> 2- Add functions that manipulate spatial data types. These will be added as UDFs and we will not need to mess with the internals of Pig. Most probably, there will be one new class for each operation (e.g., union or intersection). I think it will be good to put these new operations inside the core of Pig so that users can use them without having to write the fully qualified class name. Also, since there is no way to implicitly cast a spatial data type to a non-spatial data type, there will not be any conflicts in existing operations or new operations. All new operations, and only the new operations, will be working on spatial data types. Here is an initial list of operations that can be added. All those operations are already implemented in JTS and the UDFs added to Pig will be just wrappers around them.
> > > > > > > >>>>>> **Predicates (used for spatial filtering)
> > > > > > > >>>>>> Equals
> > > > > > > >>>>>> Disjoint
> > > > > > > >>>>>> Intersects
> > > > > > > >>>>>> Touches
> > > > > > > >>>>>> Crosses
> > > > > > > >>>>>> Within
> > > > > > > >>>>>> Contains
> > > > > > > >>>>>> Overlaps
> > > > > > > >>>>>>
> > > > > > > >>>>>> **Operations
> > > > > > > >>>>>> Envelope
> > > > > > > >>>>>> Area
> > > > > > > >>>>>> Length
> > > > > > > >>>>>> Buffer
> > > > > > > >>>>>> ConvexHull
> > > > > > > >>>>>> Intersection
> > > > > > > >>>>>> Union
> > > > > > > >>>>>> Difference
> > > > > > > >>>>>> SymDifference
> > > > > > > >>>>>>
> > > > > > > >>>>>> **Aggregate functions
> > > > > > > >>>>>> Accum
> > > > > > > >>>>>> ConvexHull
> > > > > > > >>>>>> Union
> > > > > > > >>>>>>
> > > > > > > >>>>>> 3- The third step is to implement spatial indexes (e.g., Grid or R-tree). A Pig loader and Pig output classes will be created for those indexes. Note that currently we have SpatialOutputFormat and SpatialInputFormat for those indexes inside the Spatial Hadoop project, but we need to tweak them to work with Pig.
> > > > > > > >>>>>>
> > > > > > > >>>>>> 4- (Advanced) Implement more sophisticated algorithms for spatial operations that utilize the indexes. For example, we can have a specific algorithm for spatial range query or spatial join. Again, we already have algorithms built for different operations implemented in Spatial Hadoop as MapReduce programs, but they will need to be modified to work in the Pig environment and get to work with other operations.
> > > > > > > >>>>>>
> > > > > > > >>>>>> This is my whole plan for the spatial extension to Pig. I've already started with the first step but, as I mentioned earlier, I don't want to do the work for our project and then have the work forgotten. I want to contribute to Pig and do my research at the same time. If you think the plan is plausible, I'll open JIRA issues for the above tasks and start shipping patches to do the stuff. I'll conform with the standards of the project, such as adding tests and commenting the code well.
> > > > > > > >>>>>> Sorry for the long email and hope to hear back from you.
> > > > > > > >>>>>>
> > > > > > > >>>>>>
> > > > > > > >>>>>> Best regards,
> > > > > > > >>>>>> Ahmed Eldawy
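
For reference, the bytearray workaround discussed in this thread (serialize the geometry to bytes, pass it around as a bytearray, deserialize inside every UDF call) can be sketched roughly as below. This is only an illustration in plain Python, not Pig or JTS code: it hand-encodes a single 2D point in the Well Known Binary layout (byte-order marker, geometry type code, two doubles), and the names point_to_bytes, bytes_to_point and within_box are hypothetical. A real UDF would presumably delegate the encoding and the predicates to JTS.

```python
import struct

WKB_POINT = 1       # geometry type code for Point in Well Known Binary
LITTLE_ENDIAN = 1   # WKB byte-order marker for little-endian data


def point_to_bytes(x, y):
    """Serialize a 2D point to WKB: the value a bytearray field would carry."""
    return struct.pack("<BIdd", LITTLE_ENDIAN, WKB_POINT, x, y)


def bytes_to_point(blob):
    """Deserialize WKB bytes back to (x, y); this sketch handles Point only."""
    order = "<" if blob[0] == LITTLE_ENDIAN else ">"
    geom_type, x, y = struct.unpack(order + "Idd", blob[1:21])
    if geom_type != WKB_POINT:
        raise ValueError("only WKB Point is supported in this sketch")
    return x, y


def within_box(blob, xmin, ymin, xmax, ymax):
    """Hypothetical UDF body: decode on every call, then test the predicate."""
    x, y = bytes_to_point(blob)
    return xmin <= x <= xmax and ymin <= y <= ymax


if __name__ == "__main__":
    field = point_to_bytes(1.5, -2.0)
    print(len(field), bytes_to_point(field), within_box(field, 0, -5, 2, 0))
```

The decode step at the top of within_box is the per-call overhead mentioned above; a native Geometry type would let a function receive an already-decoded object instead.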