[incubator-sedona] branch master updated: [DOCS] Updated Databricks setup documentation (#558)

jiayu Fri, 05 Nov 2021 01:12:45 -0700

This is an automated email from the ASF dual-hosted git repository.

jiayu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-sedona.git



The following commit(s) were added to refs/heads/master by this push:
     new 609261b  [DOCS] Updated Databricks setup documentation (#558)
609261b is described below

commit 609261b9a0efb7bf1c19c2168975f726adf3a054
Author: Erni Durdevic <[email protected]>
AuthorDate: Fri Nov 5 09:12:23 2021 +0100

    [DOCS] Updated Databricks setup documentation (#558)
---
 docs/download/databricks.md | 102 +++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 97 insertions(+), 5 deletions(-)

diff --git a/docs/download/databricks.md b/docs/download/databricks.md
index 13c1676..9a1b0d5 100644
--- a/docs/download/databricks.md
+++ b/docs/download/databricks.md
@@ -4,13 +4,105 @@ You just need to install the Sedona jars and Sedona Python 
on Databricks using D
 
 ## Advanced editions
 
-If you are not using the free version of Databricks, there is an issue with 
the path where Sedona Python looks for the jar. Thanks to the report from 
Sedona user @amoyrand.
+### Databricks DBR 7.x (Recommended)
 
-Two steps to fix this:
+If you are using the commercial version of Databricks up to DBR 7.x, you can install the Sedona jars and Sedona Python from the default Databricks web UI, and everything should work.
 
-1. Upload the jars in /dbfs/FileStore/jars/
-2. Add this line to the config `.config("spark.jars", 
"/dbfs/FileStore/jars/sedona-python-adapter-3.0_2.12-{{ sedona.current_version 
}}.jar") \`
+### Databricks DBR 8.x, 9.x, 10.x
+
+If you are not using the free version of Databricks, there are currently some compatibility issues with DBR 8.x+. Specifically, an `ST_Intersects` join query written with the DataFrame API throws a `java.lang.NoSuchMethodError` exception. As a temporary workaround, you can mix the DataFrame API with the RDD API to perform spatial join queries (see [example](https://github.com/apache/incubator-sedona/blob/master/binder/ApacheSedonaSQL_SpatialJoin_AirportsPerCountry.ipynb)).
+
+
+## Install Sedona from the web UI
+
+1) From the Libraries tab install from Maven Coordinates
+    ```
+    org.apache.sedona:sedona-python-adapter-3.0_2.12:{{ sedona.current_version }}
+    org.datasyslab:geotools-wrapper:{{ sedona.current_geotools }}
+    ```
+
+2) From the Libraries tab install from PyPI
+    ```
+    apache-sedona
+    ```
+
+3) (Optional) You can speed up the serialization of geometry types by adding the following lines to your Spark configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options`):
+
+    ```
+    spark.serializer org.apache.spark.serializer.KryoSerializer
+    spark.kryo.registrator org.apache.sedona.core.serde.SedonaKryoRegistrator
+    ```
+
+    *These options are not compatible with the commercial Databricks DBR versions (8.x+).*
+
+## Initialise
+
+After you have installed the libraries and started the cluster, you can initialize the Sedona `ST_*` functions and types by running the following from your code:
+
+(scala)
+```Scala
+import org.apache.sedona.sql.utils.SedonaSQLRegistrator
+SedonaSQLRegistrator.registerAll(sparkSession)
+```
+
+(or python)
+```Python
+from sedona.register.geo_registrator import SedonaRegistrator
+SedonaRegistrator.registerAll(spark)
+```
 
 ## Pure SQL environment
+ 
+In order to use the Sedona `ST_*` functions from SQL without having to register them from a Python/Scala cell, you need to install the Sedona libraries from the [cluster init-scripts](https://docs.databricks.com/clusters/init-scripts.html) as follows.
+
+Download the Sedona jars to a DBFS location. You can do that manually via the UI or from a notebook with:
+
+```bash
+%sh 
+# Create JAR directory for Sedona
+mkdir -p /dbfs/jars/sedona/{{ sedona.current_version }}
+
+# Download the dependencies from Maven into DBFS
+curl -o /dbfs/jars/sedona/{{ sedona.current_version }}/geotools-wrapper-geotools-{{ sedona.current_geotools }}.jar "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/geotools-{{ sedona.current_geotools }}/geotools-wrapper-geotools-{{ sedona.current_geotools }}.jar"
+
+curl -o /dbfs/jars/sedona/{{ sedona.current_version }}/sedona-python-adapter-3.0_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-python-adapter-3.0_2.12/{{ sedona.current_version }}/sedona-python-adapter-3.0_2.12-{{ sedona.current_version }}.jar"
+
+curl -o /dbfs/jars/sedona/{{ sedona.current_version }}/sedona-viz-2.4_2.12-{{ sedona.current_version }}.jar "https://repo1.maven.org/maven2/org/apache/sedona/sedona-viz-2.4_2.12/{{ sedona.current_version }}/sedona-viz-2.4_2.12-{{ sedona.current_version }}.jar"
+```
+
+Create an init script in DBFS that loads the Sedona jars into the cluster's 
default jar directory. You can create that from any notebook by running: 
+
+```bash
+%sh 
+
+# Create init script directory for Sedona
+mkdir -p /dbfs/sedona/
+
+# Create init script
+cat > /dbfs/sedona/sedona-init.sh <<'EOF'
+#!/bin/bash
+#
+# File: sedona-init.sh
+# Author: Erni Durdevic
+# Created: 2021-11-01
+# 
+# On cluster startup, this script will copy the Sedona jars to the cluster's default jar directory.
+# In order to activate Sedona functions, remember to add to your Spark configuration the Sedona extensions: "spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions"
+
+cp /dbfs/jars/sedona/{{ sedona.current_version }}/*.jar /databricks/jars
+
+EOF
+```
+
+From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Spark`) activate the Sedona functions by adding the following to the Spark Config:
+```
+spark.sql.extensions org.apache.sedona.viz.sql.SedonaVizExtensions,org.apache.sedona.sql.SedonaSqlExtensions
+```
+
+From your cluster configuration (`Cluster` -> `Edit` -> `Configuration` -> `Advanced options` -> `Init Scripts`) add the newly created init script:
+```
+/dbfs/sedona/sedona-init.sh
+```
+
+*Note: You need to install the Sedona libraries via an init script because libraries installed via the UI are installed after the cluster has already started, so the classes specified by the `spark.sql.extensions` config are not available at startup time.*
 
-Currently, Sedona cannot be used in [a pure SQL 
environment](/tutorial/sql-pure-sql) (e.g., an SQL notebook) on Databricks. You 
have to mix it with Scala or Python in order to call 
`SedonaSQLRegistrator.registerAll(sparkSession)`. Please see a similar report 
on 
[Stackoverflow](https://stackoverflow.com/questions/66721168/sparksessionextensions-injectfunction-in-databricks-environment).
\ No newline at end of file
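
Editorial sketch (not part of the commit): the download step in the diff builds each jar's DBFS path and Maven Central URL from two template values. The Python helper below is a minimal illustration of that naming pattern, assuming the layout shown in the `curl` commands. The function name `jar_targets`, its parameters, and the `dbfs_root` default are hypothetical names introduced here; the version strings passed in stand for `{{ sedona.current_version }}` and `{{ sedona.current_geotools }}`.

```python
MAVEN_BASE = "https://repo1.maven.org/maven2"


def jar_targets(sedona_version, geotools_version, dbfs_root="/dbfs/jars/sedona"):
    """Return (dbfs_path, url) pairs for the three jars in the download step.

    Hypothetical helper: it only mirrors the path and URL pattern of the
    curl commands in the diff and does not download anything.
    """
    # (Maven group path, artifact id, version directory) for each jar.
    # The geotools-wrapper version directory carries a "geotools-" prefix,
    # as in the URL used by the diff.
    artifacts = [
        ("org/datasyslab", "geotools-wrapper", f"geotools-{geotools_version}"),
        ("org/apache/sedona", "sedona-python-adapter-3.0_2.12", sedona_version),
        ("org/apache/sedona", "sedona-viz-2.4_2.12", sedona_version),
    ]
    targets = []
    for group_path, artifact, version in artifacts:
        jar_name = f"{artifact}-{version}.jar"
        url = f"{MAVEN_BASE}/{group_path}/{artifact}/{version}/{jar_name}"
        # All jars land in one directory named after the Sedona version.
        targets.append((f"{dbfs_root}/{sedona_version}/{jar_name}", url))
    return targets


if __name__ == "__main__":
    # Placeholder versions; substitute the real release values.
    for path, url in jar_targets("X.Y.Z", "G.H"):
        print(f'curl -o {path} "{url}"')
```

Running the module prints one `curl` command per jar, which can be compared against the commands in the diff to catch a mismatched artifact name before the files are copied to `/databricks/jars`.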
