Re: [DISCUSS] Put all GeoTools jars into a package on Maven Central

2021-02-11 Thread Jia Yu
OSGeo/LocationTech owns GeoTools. I am wondering whether I should publish my
own wrapper on Maven Central to bring the GeoTools jars that Sedona requires
to Maven Central. Since GeoTools is LGPL, it might be OK to do so.

On Thu, Feb 11, 2021 at 5:18 PM Felix Cheung  wrote:

> Who owns or manages GeoTools if it is LGPL?
>
> On Thu, Feb 11, 2021 at 12:01 PM Jia Yu  wrote:
>
>> Pawel,
>>
>> The Python-adapter module is always used by users, but it does not come
>> with GeoTools. To use it, users have to (1) compile the source code of
>> Python-adapter, or (2) add the GeoTools coordinates from the OSGEO repo via
>> config(""), or (3) download and copy the GeoTools jars to SPARK_HOME/jars/
>>
>> The easiest is (2), but it looks like it may not work in all environments
>> since it needs to search the OSGEO repo.
>>
>> What I am saying is that if we "move" the GeoTools jars to Maven Central,
>> Method 2 will work 100% of the time: users just need to add the
>> "sedona-python-adapter-1.0.0-incubating" and "geotools-24-wrapper-1.0.0"
>> coordinates in code.
>>
>> Do you think this is necessary?
>>
>> On Thu, Feb 11, 2021 at 11:40 AM Paweł Kociński <
>> pawel93kocin...@gmail.com> wrote:
>>
>>> Both options seem good to me, but we have to remember that not all
>>> Sedona users use cloud solutions; some of them are using Spark with
>>> Hadoop. What about the python-adapter module within the Sedona project,
>>> am I missing something?
>>> Regards,
>>> Paweł
>>>
>>> Thu, 11 Feb 2021 at 14:40 Netanel Malka
>>> wrote:
>>>
 I think that we can make it work on Databricks without any changes.
 After creating a cluster on Databricks, the user can install the
 geotools packages and provide the osgeo *(or any other repo)
 explicitly.*

 As you can see in the picture:

 [image: image.png]
 I can provide the details on how to install it.

 I think it will solve the problem.
 What do you think?


 On Thu, 11 Feb 2021 at 12:24, Jia Yu  wrote:

> Hi folks,
>
> As you can see from the recent discussion in the mailing list
> <[Bug][Python] Missing Java class>, in Sedona 1.0.0, because those LGPL
> GeoTools jars are not on Maven Central (only in OSGEO repo), Databricks
> cannot get GeoTools jars.
>
> I believe this will cause lots of trouble for our future Python users.
> Reading Shapefiles and doing CRS transformations are big selling points for
> Sedona.
>
> The easiest way to fix this, without violating ASF policy, is for me to
> publish a GeoTools wrapper on Maven Central using the old GeoSpark
> group ID: https://mvnrepository.com/artifact/org.datasyslab
>
> For example, org.datasyslab:geotools-24-wrapper:1.0.0
>
> 1. This GeoTools wrapper does nothing but bring the GeoTools jars
> needed by Sedona to Maven Central.
> 2. When the Python user calls Sedona, they can add one more
> package: org.datasyslab:geotools-24-wrapper:1.0.0
>
> Another good thing: this does not require a new source code
> release from Sedona. We only need to update the website and let the users
> know how to call it.
>
> Any better ideas?
>
> Thanks,
> Jia
>
>
>

 --
 Best regards,
 Netanel Malka.

>>>


Re: [DISCUSS] Put all GeoTools jars into a package on Maven Central

2021-02-11 Thread Felix Cheung
Who owns or manages GeoTools if it is LGPL?

On Thu, Feb 11, 2021 at 12:01 PM Jia Yu  wrote:

> Pawel,
>
> The Python-adapter module is always used by users, but it does not come
> with GeoTools. To use it, users have to (1) compile the source code of
> Python-adapter, or (2) add the GeoTools coordinates from the OSGEO repo via
> config(""), or (3) download and copy the GeoTools jars to SPARK_HOME/jars/
>
> The easiest is (2), but it looks like it may not work in all environments
> since it needs to search the OSGEO repo.
>
> What I am saying is that if we "move" the GeoTools jars to Maven Central,
> Method 2 will work 100% of the time: users just need to add the
> "sedona-python-adapter-1.0.0-incubating" and "geotools-24-wrapper-1.0.0"
> coordinates in code.
>
> Do you think this is necessary?
>
> On Thu, Feb 11, 2021 at 11:40 AM Paweł Kociński 
> wrote:
>
>> Both options seem good to me, but we have to remember that not all
>> Sedona users use cloud solutions; some of them are using Spark with
>> Hadoop. What about the python-adapter module within the Sedona project,
>> am I missing something?
>> Regards,
>> Paweł
>>
>> Thu, 11 Feb 2021 at 14:40 Netanel Malka
>> wrote:
>>
>>> I think that we can make it work on Databricks without any changes.
>>> After creating a cluster on Databricks, the user can install the
>>> geotools packages and provide the osgeo *(or any other repo)
>>> explicitly.*
>>>
>>> As you can see in the picture:
>>>
>>> [image: image.png]
>>> I can provide the details on how to install it.
>>>
>>> I think it will solve the problem.
>>> What do you think?
>>>
>>>
>>> On Thu, 11 Feb 2021 at 12:24, Jia Yu  wrote:
>>>
 Hi folks,

 As you can see from the recent discussion in the mailing list
 <[Bug][Python] Missing Java class>, in Sedona 1.0.0, because those LGPL
 GeoTools jars are not on Maven Central (only in OSGEO repo), Databricks
 cannot get GeoTools jars.

 I believe this will cause lots of trouble for our future Python users.
 Reading Shapefiles and doing CRS transformations are big selling points for
 Sedona.

 The easiest way to fix this, without violating ASF policy, is for me to
 publish a GeoTools wrapper on Maven Central using the old GeoSpark
 group ID: https://mvnrepository.com/artifact/org.datasyslab

 For example, org.datasyslab:geotools-24-wrapper:1.0.0

 1. This GeoTools wrapper does nothing but bring the GeoTools jars
 needed by Sedona to Maven Central.
 2. When the Python user calls Sedona, they can add one more
 package: org.datasyslab:geotools-24-wrapper:1.0.0

 Another good thing: this does not require a new source code
 release from Sedona. We only need to update the website and let the users
 know how to call it.

 Any better ideas?

 Thanks,
 Jia



>>>
>>> --
>>> Best regards,
>>> Netanel Malka.
>>>
>>


Re: [DISCUSS] Put all GeoTools jars into a package on Maven Central

2021-02-11 Thread Jia Yu
Pawel,

The Python-adapter module is always used by users, but it does not come
with GeoTools. To use it, users have to (1) compile the source code of
Python-adapter, or (2) add the GeoTools coordinates from the OSGEO repo via
config(""), or (3) download and copy the GeoTools jars to SPARK_HOME/jars/

The easiest is (2), but it looks like it may not work in all environments
since it needs to search the OSGEO repo.

What I am saying is that if we "move" the GeoTools jars to Maven Central,
Method 2 will work 100% of the time: users just need to add the
"sedona-python-adapter-1.0.0-incubating" and "geotools-24-wrapper-1.0.0"
coordinates in code, as in the sketch below.
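
As a sketch (assuming the proposed wrapper gets published; the adapter
coordinate below is the full one from the 1.0.0-incubating release), the
PySpark setup would reduce to:

from pyspark.sql import SparkSession

# Sketch only: org.datasyslab:geotools-24-wrapper:1.0.0 is the proposed
# wrapper artifact, not yet published. With both packages on Maven Central,
# no extra 'spark.jars.repositories' entry would be needed.
spark = SparkSession.builder \
    .appName('sedona-app') \
    .config('spark.jars.packages',
            'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
            'org.datasyslab:geotools-24-wrapper:1.0.0') \
    .getOrCreate()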

Do you think this is necessary?

On Thu, Feb 11, 2021 at 11:40 AM Paweł Kociński 
wrote:

> Both options seem good to me, but we have to remember that not all
> Sedona users use cloud solutions; some of them are using Spark with
> Hadoop. What about the python-adapter module within the Sedona project,
> am I missing something?
> Regards,
> Paweł
>
> Thu, 11 Feb 2021 at 14:40 Netanel Malka wrote:
>
>> I think that we can make it work on Databricks without any changes.
>> After creating a cluster on Databricks, the user can install the geotools
>> packages and provide the osgeo *(or any other repo) explicitly.*
>>
>> As you can see in the picture:
>>
>> [image: image.png]
>> I can provide the details on how to install it.
>>
>> I think it will solve the problem.
>> What do you think?
>>
>>
>> On Thu, 11 Feb 2021 at 12:24, Jia Yu  wrote:
>>
>>> Hi folks,
>>>
>>> As you can see from the recent discussion in the mailing list
>>> <[Bug][Python] Missing Java class>, in Sedona 1.0.0, because those LGPL
>>> GeoTools jars are not on Maven Central (only in OSGEO repo), Databricks
>>> cannot get GeoTools jars.
>>>
>>> I believe this will cause lots of trouble for our future Python users.
>>> Reading Shapefiles and doing CRS transformations are big selling points for
>>> Sedona.
>>>
>>> The easiest way to fix this, without violating ASF policy, is for me to
>>> publish a GeoTools wrapper on Maven Central using the old GeoSpark
>>> group ID: https://mvnrepository.com/artifact/org.datasyslab
>>>
>>> For example, org.datasyslab:geotools-24-wrapper:1.0.0
>>>
>>> 1. This GeoTools wrapper does nothing but bring the GeoTools jars
>>> needed by Sedona to Maven Central.
>>> 2. When the Python user calls Sedona, they can add one more
>>> package: org.datasyslab:geotools-24-wrapper:1.0.0
>>>
>>> Another good thing: this does not require a new source code
>>> release from Sedona. We only need to update the website and let the users
>>> know how to call it.
>>>
>>> Any better ideas?
>>>
>>> Thanks,
>>> Jia
>>>
>>>
>>>
>>
>> --
>> Best regards,
>> Netanel Malka.
>>
>


Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Netanel Malka
Hi Gregory,
Can you please try to install the jars on the Databricks Cluster?

For example:
On Clusters -> choose your cluster -> Libraries -> Install new:
1. Coordinates: org.geotools:gt-main:24.0
2. Repo: https://repo.osgeo.org/repository/release/

I successfully did it.

Please let me know if it solves your problem.
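
If you would rather script it than click through the UI, here is a minimal
sketch against the Databricks Libraries API (API 2.0); the workspace host,
token, and cluster ID are placeholders you must fill in:

import requests

# Sketch, assuming a workspace host, personal access token, and cluster ID.
# This mirrors the Clusters -> Libraries -> Install new UI flow.
host = "https://<your-workspace>.cloud.databricks.com"  # placeholder
token = "<personal-access-token>"                       # placeholder
cluster_id = "<cluster-id>"                             # placeholder

resp = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "cluster_id": cluster_id,
        "libraries": [{
            "maven": {
                "coordinates": "org.geotools:gt-main:24.0",
                "repo": "https://repo.osgeo.org/repository/release/",
            }
        }],
    },
)
resp.raise_for_status()  # success means the install request was accepted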

On 2021/02/10 13:16:50, Grégory Dugernier  wrote: 
> Thank you for the quick reply!
> 
> It seems my particular situation is a bit more complex than that, since I'm
> running the notebook on a Databricks cluster, and the default spark config
> doesn't seem to allow for more jar repositories (GeoTools isn't on Maven
> Central), nor does creating a new SparkSession appear to work. I've tried
> to download the jars and add them manually to the cluster but it doesn't
> seem to work either. But at least I know where the issue's at!
> 
> Thanks again for your help,
> Regards
> 
> On Wed, 10 Feb 2021 at 12:22, Jia Yu  wrote:
> 
> > Hi Gregory,
> >
> > Thanks for letting us know. This is not a bug. We cannot include GeoTools
> > jars due to license issues. But indeed we forgot to update the docs and
> > jupyter notebook examples. I just updated them. Please read them here:
> >
> >
> > https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaSQL.ipynb
> >
> > (Make sure you disable the browser cache or open it in an incognito
> > window)  http://sedona.apache.org/download/overview/#install-sedona-python
> >
> > In short, you need to add the following coordinates in the notebook:
> >
> > spark = SparkSession. \
> >     builder. \
> >     appName('appName'). \
> >     config("spark.serializer", KryoSerializer.getName). \
> >     config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
> >     config("spark.jars.repositories",
> >            'https://repo.osgeo.org/repository/release,'
> >            'https://download.java.net/maven/2'). \
> >     config('spark.jars.packages',
> >            'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
> >            'org.geotools:gt-main:24.0,'
> >            'org.geotools:gt-referencing:24.0,'
> >            'org.geotools:gt-epsg-hsql:24.0'). \
> >     getOrCreate()
> >
> > On Wed, Feb 10, 2021 at 2:35 AM Grégory Dugernier  wrote:
> >
> >> Hello,
> >>
> >> I've been trying to run Sedona for Python on Databricks for 2 days and I
> >> think I've stumbled upon a bug.
> >>
> >> *Configuration*:
> >>
> >>- Spark 3.0.1
> >>- Scala 2.12
> >>- Python 3.7
> >>
> >> *Libraries*:
> >>
> >>- apache-sedona (from PyPi)
> >>- org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating
> >>(from Maven)
> >>
> >> *What I'm trying to do:*
> >>
> >> I'm trying to load a series of Shapefile files into a dataframe for
> >> geospatial analysis. See the code snippet below, based on your example
> >> notebook
> >> <
> >> https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaCore.ipynb
> >> >
> >>
> >>
> >> > from sedona.core.formatMapper.shapefileParser import ShapefileReader
> >> > from sedona.register import SedonaRegistrator
> >> > from sedona.utils.adapter import Adapter
> >> >
> >> > SedonaRegistrator.registerAll(spark)
> >> > shape_rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext,
> >> > file_name)
> >> > df = Adapter.toDf(shape_rdd, spark)
> >> >
> >>
> >> *Bug*:
> >>
> >> The ShapefileReader.readToGeometryRDD() currently throws the following
> >> error:
> >>
> >> > Py4JJavaError: An error occurred while calling
> >> > z:org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD.
> >> > : java.lang.NoClassDefFoundError: org/opengis/referencing/FactoryException
> >> >   at org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD(ShapefileReader.java:79)
> >> >   at org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD(ShapefileReader.java:66)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >> >   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >> >   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >> >   at java.lang.reflect.Method.invoke(Method.java:498)
> >> >   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
> >> >   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
> >> >   at py4j.Gateway.invoke(Gateway.java:295)
> >> >   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
> >> >   at py4j.commands.CallCommand.execute(CallCommand.java:79)
> >> >   at py4j.GatewayConnection.run(GatewayConnection.java:251)
> >> >   at java.lang.Thread.run(Thread.java:748)
> >> > Caused by: java.lang.ClassNotFoundException: org.opengis.referencing.FactoryException
> >> >   at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> >> >   at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> >> >   at com.databricks.backend.daemon.driver.ClassLoaders$LibraryClassLoader.loadClass(ClassLoaders.scala:151)
> >> >   at 

Re: [DISCUSS] Put all GeoTools jars into a package on Maven Central

2021-02-11 Thread Netanel Malka
I think that we can make it work on Databricks without any changes.
After creating a cluster on Databricks, the user can install the geotools
packages and provide the osgeo *(or any other repo) explicitly.*

As you can see in the picture:

[image: image.png]
I can provide the details on how to install it.

I think it will solve the problem.
What do you think?


On Thu, 11 Feb 2021 at 12:24, Jia Yu  wrote:

> Hi folks,
>
> As you can see from the recent discussion in the mailing list
> <[Bug][Python] Missing Java class>, in Sedona 1.0.0, because those LGPL
> GeoTools jars are not on Maven Central (only in OSGEO repo), Databricks
> cannot get GeoTools jars.
>
> I believe this will cause lots of trouble for our future Python users.
> Reading Shapefiles and doing CRS transformations are big selling points for
> Sedona.
>
> The easiest way to fix this, without violating ASF policy, is for me to
> publish a GeoTools wrapper on Maven Central using the old GeoSpark group
> ID: https://mvnrepository.com/artifact/org.datasyslab
>
> For example, org.datasyslab:geotools-24-wrapper:1.0.0
>
> 1. This GeoTools wrapper does nothing but bring the GeoTools jars needed
> by Sedona to Maven Central.
> 2. When the Python user calls Sedona, they can add one more
> package: org.datasyslab:geotools-24-wrapper:1.0.0
>
> Another good thing: this does not require a new source code
> release from Sedona. We only need to update the website and let the users
> know how to call it.
>
> Any better ideas?
>
> Thanks,
> Jia
>
>
>

-- 
Best regards,
Netanel Malka.


[DISCUSS] Put all GeoTools jars into a package on Maven Central

2021-02-11 Thread Jia Yu
Hi folks,

As you can see from the recent discussion in the mailing list
<[Bug][Python] Missing Java class>, in Sedona 1.0.0, because those LGPL
GeoTools jars are not on Maven Central (only in OSGEO repo), Databricks
cannot get GeoTools jars.

I believe this will cause lots of trouble for our future Python users.
Reading Shapefiles and doing CRS transformations are big selling points for
Sedona.

The easiest way to fix this, without violating ASF policy, is for me to
publish a GeoTools wrapper on Maven Central using the old GeoSpark group
ID: https://mvnrepository.com/artifact/org.datasyslab

For example, org.datasyslab:geotools-24-wrapper:1.0.0

1. This GeoTools wrapper does nothing but bring the GeoTools jars needed
by Sedona to Maven Central.
2. When the Python user calls Sedona, they can add one more
package: org.datasyslab:geotools-24-wrapper:1.0.0

Another good thing: this does not require a new source code release
from Sedona. We only need to update the website and let the users know how
to call it.

Any better ideas?

Thanks,
Jia


Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Grégory Dugernier
Honestly, this is not the worst thing I've had to compile on Windows. Aside
from the Javadoc step and that weird issue with the dynamic versioning, it
went pretty smoothly; it just took a while because every try took a fair
bit of time.

I'll keep an eye on further announcements to see what comes next!

Thanks for your time

On Thu, 11 Feb 2021 at 11:06, Jia Yu  wrote:

> Thanks for letting us know. Yes, our source code is not supposed to be
> compiled on Windows. I didn't expect so much trouble to get this jar. We
> will figure out a better way to solve this issue soon.
>
> On Thu, Feb 11, 2021 at 1:46 AM Grégory Dugernier  wrote:
>
>> In fact, you should have let us know about your situation earlier.
>>> You can download the GeoTools jars manually and copy them to the
>>> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
>>> links are given in the comments:
>>> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
>>
>>
>> I did copy the GeoTools jars and added them to my cluster library, but
>> python-adapter didn't seem to find them in the FileStore. Placing the jars
>> inside SPARK_HOME on the cluster means first determining where the
>> environment variable points to inside the DBFS architecture, then most
>> likely adding them through CLI commands. This presented several short-term
>> obstacles, but also raised many issues down the line, because we are
>> deploying our clusters through Terraform and not all developers will have
>> the elevated permissions to perform CLI commands. A single, compiled jar
>> with all the dependencies within can easily be deployed at cluster creation
>> with a databricks_dbfs_file resource and using the library.jar property of
>> databricks_cluster. The jar ended up being a bit of a headache to produce,
>> but it keeps things high level and easier to maintain.
>>
>> That is, of course, unless I'm missing the obvious and there was an easy
>> way to add GeoTools jars on the Databricks cluster and let
>> sedona-python-adapter find them, which isn't entirely out of the question.
>>
>> On Thu, 11 Feb 2021 at 10:03, Jia Yu  wrote:
>>
>>> Thanks, Gregory. I think this behavior is not expected. We will look
>>> into this.
>>>
>>> In fact, you should have let us know about your situation earlier. You
>>> can download the GeoTools jars manually and copy them to the
>>> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
>>> links are given in the comments:
>>> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
>>>
>>> We should make our docs clearer.
>>>
>>>
>>> On Thu, Feb 11, 2021 at 12:44 AM Grégory Dugernier 
>>> wrote:
>>>
 Hi Jia,

 After much sweat and tears, I took the long road and compiled the code
 locally. I'm working on Windows, so I had to change a few things in the
 POM.xml:

- When trying to compile just the python-adapter lib, Maven didn't
like the dynamic versioning of sedona-core and sedona-sql, so I had to
hardcode the current version.
- For some reason, Maven couldn't find spark-version-converter from
within the python-adapter directory, so I just decided to compile the full
library. It might be possible to just compile the adapter, I just decided
pushing in this direction further seemed like it would take longer.
- When trying to compile the full library, the attach-javadoc goal
just kept erroring out, even with the latest version of
maven-javadoc-plugin, so I just removed it entirely.

 In the end, I got the jar, uploaded it to Databricks, and it works like
 a charm so far.

 I did, however, meet another issue: it seems that when using
 *ShapefileReader.readToGeometryRDD(spark.sparkContext, file_url)* to read
 multiple Shapefile files at once and then using the Adapter, same-named
 columns aren't combined in the resulting DataFrame (see example below). It
 might be normal RDD behavior (I have little experience using them instead
 of DataFrames), and I already found a workaround by creating multiple dfs
 and using union(), but I prefer to let you know in case it isn't the
 expected behavior.
 [image: image.png]

 Regards,
 Grégory

 On Thu, 11 Feb 2021 at 07:58, Jia Yu  wrote:

> Hi Gregory,
>
> Please let us know if you get your issue fixed. I know many of our
> users are also using Databricks cluster. We are also interested in the
> solution.
>
> Thanks,
> Jia
>
> On Wed, Feb 10, 2021 at 5:17 AM Grégory Dugernier 
> wrote:
>
>> Thank you for the quick reply!
>>
>> It seems my 

Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Jia Yu
Thanks for letting us know. Yes, our source code is not supposed to be
compiled on Windows. I didn't expect so much trouble to get this jar. We
will figure out a better way to solve this issue soon.

On Thu, Feb 11, 2021 at 1:46 AM Grégory Dugernier  wrote:

> In fact, you should have let us know about your situation earlier.
>> You can download the GeoTools jars manually and copy them to the
>> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
>> links are given in the comments:
>> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
>
>
> I did copy the GeoTools jars and added them to my cluster library, but
> python-adapter didn't seem to find them in the FileStore. Placing the jars
> inside SPARK_HOME on the cluster means first determining where the
> environment variable points to inside the DBFS architecture, then most
> likely adding them through CLI commands. This presented several short-term
> obstacles, but also raised many issues down the line, because we are
> deploying our clusters through Terraform and not all developers will have
> the elevated permissions to perform CLI commands. A single, compiled jar
> with all the dependencies within can easily be deployed at cluster creation
> with a databricks_dbfs_file resource and using the library.jar property of
> databricks_cluster. The jar ended up being a bit of a headache to produce,
> but it keeps things high level and easier to maintain.
>
> That is, of course, unless I'm missing the obvious and there was an easy
> way to add GeoTools jars on the Databricks cluster and let
> sedona-python-adapter find them, which isn't entirely out of the question.
>
> On Thu, 11 Feb 2021 at 10:03, Jia Yu  wrote:
>
>> Thanks, Gregory. I think this behavior is not expected. We will look into
>> this.
>>
>> In fact, you should have let us know about your situation earlier. You
>> can download the GeoTools jars manually and copy them to the
>> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
>> links are given in the comments:
>> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
>>
>> We should make our docs clearer.
>>
>>
>> On Thu, Feb 11, 2021 at 12:44 AM Grégory Dugernier 
>> wrote:
>>
>>> Hi Jia,
>>>
>>> After much sweat and tears, I took the long road and compiled the code
>>> locally. I'm working on Windows, so I had to change a few things in the
>>> POM.xml:
>>>
>>>- When trying to compile just the python-adapter lib, Maven didn't
>>>like the dynamic versioning of sedona-core and sedona-sql, so I had to
>>>hardcode the current version.
>>>- For some reason, Maven couldn't find spark-version-converter from
>>>within the python-adapter directory, so I just decided to compile the full
>>>library. It might be possible to just compile the adapter, I just decided
>>>pushing in this direction further seemed like it would take longer.
>>>- When trying to compile the full library, the attach-javadoc goal
>>>just kept erroring out, even with the latest version of
>>>maven-javadoc-plugin, so I just removed it entirely.
>>>
>>> In the end, I got the jar, uploaded it to Databricks, and it works like a
>>> charm so far.
>>>
>>> I did, however, meet another issue: it seems that when using
>>> *ShapefileReader.readToGeometryRDD(spark.sparkContext, file_url)* to read
>>> multiple Shapefile files at once and then using the Adapter, same-named
>>> columns aren't combined in the resulting DataFrame (see example below). It
>>> might be normal RDD behavior (I have little experience using them instead
>>> of DataFrames), and I already found a workaround by creating multiple dfs
>>> and using union(), but I prefer to let you know in case it isn't the
>>> expected behavior.
>>> [image: image.png]
>>>
>>> Regards,
>>> Grégory
>>>
>>> On Thu, 11 Feb 2021 at 07:58, Jia Yu  wrote:
>>>
 Hi Gregory,

 Please let us know if you get your issue fixed. I know many of our
 users are also using Databricks cluster. We are also interested in the
 solution.

 Thanks,
 Jia

 On Wed, Feb 10, 2021 at 5:17 AM Grégory Dugernier 
 wrote:

> Thank you for the quick reply!
>
> It seems my particular situation is a bit more complex than that,
> since I'm running the notebook on a Databricks cluster, and the default
> spark config doesn't seem to allow for more jar repositories (GeoTools
> isn't on Maven Central), nor does creating a new SparkSession appear to
> work. I've tried to download the jars and add them manually to the cluster
> but it doesn't seem to work either. But at least I know where the issue's
> at!
>
> Thanks 

Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Grégory Dugernier
>
> In fact, you should have let us know about your situation earlier. You
> can download the GeoTools jars manually and copy them to the
> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
> links are given in the comments:
> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240


I did copy the GeoTools jars and added them to my cluster library, but
python-adapter didn't seem to find them in the FileStore. Placing the jars
inside SPARK_HOME on the cluster means first determining where the
environment variable points to inside the DBFS architecture, then most
likely adding them through CLI commands. This presented several short-term
obstacles, but also raised many issues down the line, because we are
deploying our clusters through Terraform and not all developers will have
the elevated permissions to perform CLI commands. A single, compiled jar
with all the dependencies within can easily be deployed at cluster creation
with a databricks_dbfs_file resource and using the library.jar property of
databricks_cluster. The jar ended up being a bit of a headache to produce,
but it keeps things high level and easier to maintain.

That is, of course, unless I'm missing the obvious and there was an easy
way to add GeoTools jars on the Databricks cluster and let
sedona-python-adapter find them, which isn't entirely out of the question.

On Thu, 11 Feb 2021 at 10:03, Jia Yu  wrote:

> Thanks, Gregory. I think this behavior is not expected. We will look into
> this.
>
> In fact, you should have let us know about your situation earlier. You
> can download the GeoTools jars manually and copy them to the
> SPARK_HOME/jars/ folder... You don't have to compile the code. Download
> links are given in the comments:
> http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
>
> We should make our docs clearer.
>
>
> On Thu, Feb 11, 2021 at 12:44 AM Grégory Dugernier  wrote:
>
>> Hi Jia,
>>
>> After much sweat and tears, I took the long road and compiled the code
>> locally. I'm working on Windows, so I had to change a few things in the
>> POM.xml:
>>
>>- When trying to compile just the python-adapter lib, Maven didn't
>>like the dynamic versioning of sedona-core and sedona-sql, so I had to
>>hardcode the current version.
>>- For some reason, Maven couldn't find spark-version-converter from
>>within the python-adapter directory, so I just decided to compile the full
>>library. It might be possible to just compile the adapter, I just decided
>>pushing in this direction further seemed like it would take longer.
>>- When trying to compile the full library, the attach-javadoc goal
>>just kept erroring out, even with the latest version of
>>maven-javadoc-plugin, so I just removed it entirely.
>>
>> In the end, I got the jar, uploaded it to Databricks, and it works like a
>> charm so far.
>>
>> I did, however, meet another issue: it seems that when using
>> *ShapefileReader.readToGeometryRDD(spark.sparkContext, file_url)* to read
>> multiple Shapefile files at once and then using the Adapter, same-named
>> columns aren't combined in the resulting DataFrame (see example below). It
>> might be normal RDD behavior (I have little experience using them instead
>> of DataFrames), and I already found a workaround by creating multiple dfs
>> and using union(), but I prefer to let you know in case it isn't the
>> expected behavior.
>> [image: image.png]
>>
>> Regards,
>> Grégory
>>
>> On Thu, 11 Feb 2021 at 07:58, Jia Yu  wrote:
>>
>>> Hi Gregory,
>>>
>>> Please let us know if you get your issue fixed. I know many of our users
>>> are also using Databricks cluster. We are also interested in the solution.
>>>
>>> Thanks,
>>> Jia
>>>
>>> On Wed, Feb 10, 2021 at 5:17 AM Grégory Dugernier 
>>> wrote:
>>>
 Thank you for the quick reply!

 It seems my particular situation is a bit more complex than that, since
 I'm running the notebook on a Databricks cluster, and the default spark
 config doesn't seem to allow for more jar repositories (GeoTools isn't on
 Maven Central), nor does creating a new SparkSession appear to work. I've
 tried to download the jars and add them manually to the cluster but it
 doesn't seem to work either. But at least I know where the issue's at!

 Thanks again for your help,
 Regards

 On Wed, 10 Feb 2021 at 12:22, Jia Yu  wrote:

> Hi Gregory,
>
> Thanks for letting us know. This is not a bug. We cannot include
> GeoTools jars due to license issues. But indeed we forgot to update the
> docs and jupyter notebook examples. I just updated them. Please read them
> here:
>
>
> 

Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Jia Yu
Thanks, Gregory. I think this behavior is not expected. We will look into
this.

In fact, you should have let us know about your situation earlier. You
can download the GeoTools jars manually and copy them to the SPARK_HOME/jars/
folder... You don't have to compile the code. Download links are given in
the comments:
http://sedona.apache.org/download/GeoSpark-All-Modules-Maven-Central-Coordinates/#geotools-240
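
To make the manual copy concrete, here is a minimal sketch; the download
directory is an illustrative placeholder:

import glob
import os
import shutil

# Sketch: copy locally downloaded GeoTools jars into Spark's jars folder.
# '/tmp/geotools-jars' is a placeholder for wherever you saved the downloads.
spark_jars = os.path.join(os.environ["SPARK_HOME"], "jars")
for jar in glob.glob("/tmp/geotools-jars/*.jar"):
    shutil.copy(jar, spark_jars)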

We should make our docs clearer.


On Thu, Feb 11, 2021 at 12:44 AM Grégory Dugernier  wrote:

> Hi Jia,
>
> After much sweat and tears, I took the long road and compiled the code
> locally. I'm working on Windows, so I had to change a few things in the
> POM.xml:
>
>- When trying to compile just the python-adapter lib, Maven didn't
>like the dynamic versioning of sedona-core and sedona-sql, so I had to
>hardcode the current version.
>- For some reason, Maven couldn't find spark-version-converter from
>within the python-adapter directory, so I just decided to compile the full
>library. It might be possible to just compile the adapter, I just decided
>pushing in this direction further seemed like it would take longer.
>- When trying to compile the full library, the attach-javadoc goal
>just kept erroring out, even with the latest version of
>maven-javadoc-plugin, so I just removed it entirely.
>
> In the end, I got the jar, uploaded it to Databricks, and it works like a
> charm so far.
>
> I did, however, meet another issue: it seems that when using
> *ShapefileReader.readToGeometryRDD(spark.sparkContext, file_url)* to read
> multiple Shapefile files at once and then using the Adapter, same-named
> columns aren't combined in the resulting DataFrame (see example below). It
> might be normal RDD behavior (I have little experience using them instead
> of DataFrames), and I already found a workaround by creating multiple dfs
> and using union(), but I prefer to let you know in case it isn't the
> expected behavior.
> [image: image.png]
>
> Regards,
> Grégory
>
> On Thu, 11 Feb 2021 at 07:58, Jia Yu  wrote:
>
>> Hi Gregory,
>>
>> Please let us know if you get your issue fixed. I know many of our users
>> are also using Databricks cluster. We are also interested in the solution.
>>
>> Thanks,
>> Jia
>>
>> On Wed, Feb 10, 2021 at 5:17 AM Grégory Dugernier  wrote:
>>
>>> Thank you for the quick reply!
>>>
>>> It seems my particular situation is a bit more complex than that, since
>>> I'm running the notebook on a Databricks cluster, and the default spark
>>> config doesn't seem to allow for more jar repositories (GeoTools isn't on
>>> Maven Central), nor does creating a new SparkSession appear to work. I've
>>> tried to download the jars and add them manually to the cluster but it
>>> doesn't seem to work either. But at least I know where the issue's at!
>>>
>>> Thanks again for your help,
>>> Regards
>>>
>>> On Wed, 10 Feb 2021 at 12:22, Jia Yu  wrote:
>>>
 Hi Gregory,

 Thanks for letting us know. This is not a bug. We cannot include
 GeoTools jars due to license issues. But indeed we forgot to update the
 docs and jupyter notebook examples. I just updated them. Please read them
 here:


 https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaSQL.ipynb

 (Make sure you disable the browser cache or open it in an incognito
 window)
 http://sedona.apache.org/download/overview/#install-sedona-python

 In short, you need to add the following coordinates in the notebook:

 spark = SparkSession. \
     builder. \
     appName('appName'). \
     config("spark.serializer", KryoSerializer.getName). \
     config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
     config("spark.jars.repositories",
            'https://repo.osgeo.org/repository/release,'
            'https://download.java.net/maven/2'). \
     config('spark.jars.packages',
            'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
            'org.geotools:gt-main:24.0,'
            'org.geotools:gt-referencing:24.0,'
            'org.geotools:gt-epsg-hsql:24.0'). \
     getOrCreate()

 On Wed, Feb 10, 2021 at 2:35 AM Grégory Dugernier 
 wrote:

> Hello,
>
> I've been trying to run Sedona for Python on Databricks for 2 days and I
> think I've stumbled upon a bug.
>
> *Configuration*:
>
>- Spark 3.0.1
>- Scala 2.12
>- Python 3.7
>
> *Libraries*:
>
>- apache-sedona (from PyPi)
>- org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating
>(from Maven)
>
> *What I'm trying to do:*
>
> I'm trying to load a series of Shapefile files into a dataframe for
> geospatial analysis. See the code snippet below, based on your example
> notebook
> <
> https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaCore.ipynb
> >
>
>
> > from sedona.core.formatMapper.shapefileParser import ShapefileReader
> > 

Re: [Bug][Python] Missing Java Class?

2021-02-11 Thread Grégory Dugernier
Hi Jia,

After much sweat and tears, I took the long road and compiled the code
locally. I'm working on Windows, so I had to change a few things in the
POM.xml:

   - When trying to compile just the python-adapter lib, Maven didn't like
   the dynamic versioning of sedona-core and sedona-sql, so I had to hardcode
   the current version.
   - For some reason, Maven couldn't find spark-version-converter from
   within the python-adapter directory, so I just decided to compile the full
   library. It might be possible to just compile the adapter, I just decided
   pushing in this direction further seemed like it would take longer.
   - When trying to compile the full library, the attach-javadoc goal just
   kept erroring out, even with the latest version of maven-javadoc-plugin, so
   I just removed it entirely.

In the end, I got the jar, uploaded it to Databricks, and it works like a
charm so far.

I did, however, meet another issue: it seems that when using
*ShapefileReader.readToGeometryRDD(spark.sparkContext, file_url)* to read
multiple Shapefile files at once and then using the Adapter, same-named
columns aren't combined in the resulting DataFrame (see example below). It
might be normal RDD behavior (I have little experience using them instead
of DataFrames), and I already found a workaround by creating multiple dfs
and using union(), but I prefer to let you know in case it isn't the
expected behavior. A sketch of the union() workaround follows the example.
[image: image.png]
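
The workaround looks roughly like this (the paths are illustrative, and an
existing SparkSession named spark is assumed):

from functools import reduce

from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

# Illustrative placeholder paths; each directory holds one Shapefile dataset.
file_urls = ["/data/shapes_a", "/data/shapes_b"]

# Read each Shapefile into its own DataFrame, then combine them explicitly.
dfs = [Adapter.toDf(ShapefileReader.readToGeometryRDD(spark.sparkContext, url),
                    spark)
       for url in file_urls]

# union() matches columns by position, so this assumes the schemas line up.
combined = reduce(lambda a, b: a.union(b), dfs)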

Regards,
Grégory

On Thu, 11 Feb 2021 at 07:58, Jia Yu  wrote:

> Hi Gregory,
>
> Please let us know if you get your issue fixed. I know many of our users
> are also using Databricks cluster. We are also interested in the solution.
>
> Thanks,
> Jia
>
> On Wed, Feb 10, 2021 at 5:17 AM Grégory Dugernier  wrote:
>
>> Thank you for the quick reply!
>>
>> It seems my particular situation is a bit more complex than that, since
>> I'm running the notebook on a Databricks cluster, and the default spark
>> config doesn't seem to allow for more jar repositories (GeoTools isn't on
>> Maven Central), nor does creating a new SparkSession appear to work. I've
>> tried to download the jars and add them manually to the cluster but it
>> doesn't seem to work either. But at least I know where the issue's at!
>>
>> Thanks again for your help,
>> Regards
>>
>> On Wed, 10 Feb 2021 at 12:22, Jia Yu  wrote:
>>
>>> Hi Gregory,
>>>
>>> Thanks for letting us know. This is not a bug. We cannot include
>>> GeoTools jars due to license issues. But indeed we forgot to update the
>>> docs and jupyter notebook examples. I just updated them. Please read them
>>> here:
>>>
>>>
>>> https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaSQL.ipynb
>>>
>>> (Make sure you disable the browser cache or open it in an incognito
>>> window)
>>> http://sedona.apache.org/download/overview/#install-sedona-python
>>>
>>> In short, you need to add the following coordinates in the notebook:
>>>
>>> spark = SparkSession. \
>>>     builder. \
>>>     appName('appName'). \
>>>     config("spark.serializer", KryoSerializer.getName). \
>>>     config("spark.kryo.registrator", SedonaKryoRegistrator.getName). \
>>>     config("spark.jars.repositories",
>>>            'https://repo.osgeo.org/repository/release,'
>>>            'https://download.java.net/maven/2'). \
>>>     config('spark.jars.packages',
>>>            'org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating,'
>>>            'org.geotools:gt-main:24.0,'
>>>            'org.geotools:gt-referencing:24.0,'
>>>            'org.geotools:gt-epsg-hsql:24.0'). \
>>>     getOrCreate()
>>>
>>> On Wed, Feb 10, 2021 at 2:35 AM Grégory Dugernier 
>>> wrote:
>>>
 Hello,

 I've been trying to run Sedona for Python on Databricks for 2 days and I
 think I've stumbled upon a bug.

 *Configuration*:

- Spark 3.0.1
- Scala 2.12
- Python 3.7

 *Libraries*:

- apache-sedona (from PyPi)
- org.apache.sedona:sedona-python-adapter-3.0_2.12:1.0.0-incubating
(from Maven)

 *What I'm trying to do:*

 I'm trying to load a series of Shapefile files into a dataframe for
 geospatial analysis. See the code snippet below, based on your example
 notebook
 <
 https://github.com/apache/incubator-sedona/blob/master/python/ApacheSedonaCore.ipynb
 >


 > from sedona.core.formatMapper.shapefileParser import ShapefileReader
 > from sedona.register import SedonaRegistrator
 > from sedona.utils.adapter import Adapter
 >
 > SedonaRegistrator.registerAll(spark)
 > shape_rdd = ShapefileReader.readToGeometryRDD(spark.sparkContext,
 > file_name)
 > df = Adapter.toDf(shape_rdd, spark)
 >

 *Bug*:

 The ShapefileReader.readToGeometryRDD() currently throws the following
 error:

 > Py4JJavaError: An error occurred while calling
 > z:org.apache.sedona.core.formatMapper.shapefileParser.ShapefileReader.readToGeometryRDD.
 > : java.lang.NoClassDefFoundError: org/opengis/referencing/FactoryException
 > at