angelosnm opened a new issue, #1723:
URL: https://github.com/apache/sedona/issues/1723
I have set up a standalone Spark cluster to which PySpark jobs are submitted. These jobs use the configuration below, with S3/MinIO serving as the distributed filesystem (via the S3A connector) for reading raster files:
```python
import socket

from sedona.spark import SedonaContext

config = (
    SedonaContext.builder()
    .master(spark_endpoint)
    .appName("RasterProcessingWithSedona")
    # advertise the driver's resolvable address and pin the ports so
    # executors can connect back
    .config("spark.driver.host", socket.gethostbyname(socket.gethostname()))
    .config("spark.driver.port", "2222")
    .config("spark.blockManager.port", "36859")
    .config("spark.executor.memory", "16g")
    .config("spark.executor.cores", "4")
    .config("spark.driver.memory", "10g")
    # S3A settings for the MinIO endpoint
    .config("spark.hadoop.fs.s3a.endpoint", s3_endpoint)
    .config("spark.hadoop.fs.s3a.access.key", s3_access_key_id)
    .config("spark.hadoop.fs.s3a.secret.key", s3_secret_access_key)
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config(
        "spark.jars.packages",
        "org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.6.1,"
        "org.datasyslab:geotools-wrapper:1.6.1-28.2",
    )
    .getOrCreate()
)
# the snippets below reference `sedona`, created from the configured session
sedona = SedonaContext.create(config)
```
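One thing I am not sure about: the package list above does not include the S3A connector itself. A minimal sketch, assuming the cluster runs Hadoop 3.3.x and does not already bundle `hadoop-aws` (both the version and the bundling are assumptions on my part):

```python
# Hypothetical variant of the spark.jars.packages value above, adding the S3A
# connector explicitly. The hadoop-aws version must match the Hadoop build the
# cluster ships with (3.3.4 here is an assumption).
packages = ",".join([
    "org.apache.sedona:sedona-spark-shaded-3.5_2.12:1.6.1",
    "org.datasyslab:geotools-wrapper:1.6.1-28.2",
    "org.apache.hadoop:hadoop-aws:3.3.4",
])
# ...and in the builder chain: .config("spark.jars.packages", packages)
```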
The raster (`.tif`) files are then read as follows:
```python
raster_path = "s3a://data/BFA"
rawDf = (
    sedona.read.format("binaryFile")
    .option("recursiveFileLookup", "true")
    .option("pathGlobFilter", "*.tif*")
    .load(raster_path)
)
rawDf.createOrReplaceTempView("rawdf")
rawDf.show()
```
Running this code produces the error shown under "Actual behavior". The same code runs normally in local mode.
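Since it only fails on the cluster, this looks like a driver/executor networking issue: the `TaskResultLost (result lost from block manager)` in the trace below generally means the driver could not fetch a completed task's result from an executor's block manager. A minimal sketch of builder settings that are commonly relevant in that situation (the bind address and the `driver_routable_ip` variable are assumptions for illustration, not values from my setup):

```python
# Hypothetical additions to the builder chain above, before .getOrCreate().
# Every executor must be able to reach the driver on spark.driver.port (2222)
# and spark.blockManager.port (36859) for task results to be fetched back.
builder = (
    SedonaContext.builder()
    .master(spark_endpoint)
    # assumption: the driver runs in a container or behind NAT, so bind all
    # interfaces while advertising a routable address via spark.driver.host
    .config("spark.driver.bindAddress", "0.0.0.0")
    .config("spark.driver.host", driver_routable_ip)  # hypothetical variable
)
```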
## Expected behavior
`rawDf.show()` should display the loaded raster files, as it does in local mode.
## Actual behavior
```bash
Py4JJavaError: An error occurred while calling o66.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 5) (192.168.18.112 executor 1): TaskResultLost (result lost from block manager)
Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:989)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2393)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2414)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2433)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:530)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:483)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:61)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:4334)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:3316)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:4324)
	at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:546)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:4322)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:4322)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:3316)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:3539)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:280)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:315)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
```
## Steps to reproduce the problem
Run the configuration and read snippets above against a standalone Spark cluster with S3A/MinIO storage.
## Settings
Sedona version = 1.6.1
Apache Spark version = 3.5.2
Apache Flink version = N/A
API type = Python
Scala version = 2.12
JRE version = 1.8.0_432
Python version = 3.11.10
Environment = Standalone