MDiakhate12 opened a new issue, #1685:
URL: https://github.com/apache/sedona/issues/1685

   ## Expected behavior
   I want to use Apache Sedona with PySpark in an AWS Glue environment.
   
   ## Actual behavior
   The Sedona library does not work when following the steps described in the doc: https://sedona.apache.org/latest-snapshot/setup/glue/
   
   Error raised:
   
   ```python
   Traceback (most recent call last):
     File "/path/to/my/file.py", line 23, in <module>
       sedona = SedonaContext.create(spark)
     File "/home/glue_user/.local/lib/python3.10/site-packages/sedona/spark/SedonaContext.py", line 38, in create
       spark._jvm.SedonaContext.create(spark._jsparkSession, "python")
   TypeError: 'JavaPackage' object is not callable
   ```
   
   The error occurs at the line `sedona = SedonaContext.create(spark)` in the following script:
   
   ```python
   # -*- coding: utf-8 -*-
   import sys
   from awsglue.context import GlueContext
   from awsglue.job import Job
   from awsglue.utils import getResolvedOptions
   from pyspark.context import SparkContext
   from sedona.spark import SedonaContext
   
   # JOB CONTEXT SETUP
   args = getResolvedOptions(sys.argv, ["JOB_NAME", "environment", "additional-python-modules", "extra-jars", "extra-py-files"])
   
   print(args)
   
   # Method 1
   glue_context = GlueContext(SparkContext())
   
   spark = glue_context.spark_session
   
   job = Job(glue_context)
   
   sedona = SedonaContext.create(spark)
   
   print(SedonaContext)
   print(sedona)
   ```
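   
   A quick way to confirm the diagnosis below (a debugging sketch, not part of the original job) is to inspect what actually reached the driver before calling `create`:
   
   ```python
   # Debugging sketch: check whether the Sedona JARs made it onto the classpath.
   # With the JARs missing, the second print shows a py4j JavaPackage, which is
   # exactly the non-callable object in the traceback above.
   print(spark.sparkContext.getConf().get("spark.jars", "<spark.jars not set>"))
   print(type(spark._jvm.SedonaContext))
   ```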
   
   It seems that apache-sedona cannot find the JAR files: the `'JavaPackage' object is not callable` error indicates that the Sedona classes were never loaded into the JVM.
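   
   A possible workaround (a sketch I have not verified on Glue; it assumes the driver can reach Maven Central and that Glue accepts a session built this way) is to let Spark resolve the JARs itself via `spark.jars.packages` instead of relying on `--extra-jars`:
   
   ```python
   from awsglue.context import GlueContext
   from sedona.spark import SedonaContext
   
   # Workaround sketch: resolve the Sedona JARs through Maven coordinates so
   # the driver downloads them itself, then hand the session to Glue.
   spark = (
       SedonaContext.builder()
       .config(
           "spark.jars.packages",
           "org.apache.sedona:sedona-spark-shaded-3.3_2.12:1.6.1,"
           "org.datasyslab:geotools-wrapper:1.6.1-28.2",
       )
       .getOrCreate()
   )
   glue_context = GlueContext(spark.sparkContext)
   sedona = SedonaContext.create(spark)
   ```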
   
   ## Steps to reproduce the problem
   
   Create a PySpark ETL job in the AWS Glue web console:
   
   1. Go to AWS Glue > Data Integration and ETL > ETL Jobs.
   2. Click Create job > Script editor > Engine = Spark.
   3. Follow the documentation steps:
      - From your job's page, navigate to the "Job details" tab. At the bottom of the page, expand the "Advanced properties" section. In the "Dependent JARs path" field, add the paths to the JARs, separated by a comma (here, the versions **[sedona-spark-shaded-3.3_2.12-1.6.1.jar](https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.3_2.12/1.6.1/sedona-spark-shaded-3.3_2.12-1.6.1.jar)** and **[geotools-wrapper-1.6.1-28.2.jar](https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.6.1-28.2/geotools-wrapper-1.6.1-28.2.jar)**).
      - Add the Sedona Python package by navigating to the "Job parameters" section and adding a new parameter with the key `--additional-python-modules` and the value `apache-sedona==1.6.1`.
   4. Use the code shown above.
   5. Click Save and Run.
   
   You can reproduce the same steps using AWS Glue locally in Docker by following [the official AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image-setup-visual-studio)
   
   and then running your script inside the container with:
   
   ```bash
   /usr/local/bin/python3 "$PYTHON_SCRIPT" \
       --JOB_NAME "$JOB_NAME" \
       --environment "$ENVIRONMENT" \
       --enable-glue-datacatalog \
       --extra-jars https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.3_2.12/1.6.1/sedona-spark-shaded-3.3_2.12-1.6.1.jar,https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.6.1-28.2/geotools-wrapper-1.6.1-28.2.jar \
       --additional-python-modules apache-sedona==1.6.1 \
       --job-language python
   ```
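   
   If the container does not download JARs from `https` URLs passed to `--extra-jars` (an assumption on my side; the Glue docs mostly show S3 paths there), a workaround sketch is to fetch them first and pass local file paths instead:
   
   ```python
   # Sketch: pre-download the Sedona JARs inside the container so --extra-jars
   # can point at local files instead of https URLs. Paths are hypothetical.
   import os
   import urllib.request
   
   JAR_URLS = [
       "https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.3_2.12/1.6.1/sedona-spark-shaded-3.3_2.12-1.6.1.jar",
       "https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.6.1-28.2/geotools-wrapper-1.6.1-28.2.jar",
   ]
   
   jar_dir = "/tmp/sedona-jars"  # hypothetical location
   os.makedirs(jar_dir, exist_ok=True)
   local_paths = []
   for url in JAR_URLS:
       dest = os.path.join(jar_dir, url.rsplit("/", 1)[-1])
       if not os.path.exists(dest):
           urllib.request.urlretrieve(url, dest)
       local_paths.append(dest)
   
   # Use this comma-separated value for --extra-jars:
   print(",".join(local_paths))
   ```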
   
   ## Settings
   
   Sedona version = 1.6.1
   
   Apache Spark version = 3.3.0
   
   Apache Flink version = ?
   
   API type = Python
   
   Scala version = 2.12
   
   Python version = 3.10
   
   Environment = AWS Glue
   

