Kristin Cowalcijk created SEDONA-218:
----------------------------------------

             Summary: Flaky test caused by improper handling of null struct 
values in Adapter.toDf
                 Key: SEDONA-218
                 URL: https://issues.apache.org/jira/browse/SEDONA-218
             Project: Apache Sedona
          Issue Type: Bug
            Reporter: Kristin Cowalcijk


Test case {{can convert JavaPairRDD to DataFrame with user-supplied schema}} 
randomly fails when running on my laptop, and this failure could be 
consistently reproduced after switching the geometry serde to a [faster 
alternative|https://issues.apache.org/jira/browse/SEDONA-207] I'm currently 
working on:

{code:java}
- can convert JavaPairRDD to DataFrame with user-supplied schema *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 2290.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2290.0 
(TID 3859) (kontinuation executor driver): scala.MatchError: 
StructType(StructField(structText,StringType,true),StructField(structInt,IntegerType,true),StructField(structBool,BooleanType,true))
 (of class org.apache.spark.sql.types.StructType)
        at org.apache.sedona.sql.utils.Adapter$.parseString(Adapter.scala:287)
        at 
org.apache.sedona.sql.utils.Adapter$.$anonfun$castRowToSchema$1(Adapter.scala:271)
        at scala.collection.immutable.List.map(List.scala:297)
        at 
org.apache.sedona.sql.utils.Adapter$.castRowToSchema(Adapter.scala:268)
        at 
org.apache.sedona.sql.utils.Adapter$.$anonfun$toDf$7(Adapter.scala:189)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
        at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
        at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:760)
        at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:364)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
        at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
        at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{code}

This test fails only when the only row processed by {{Adapter.toDF}} contains a 
null struct value, and the test is flaky because not all rows have null struct 
values, and the order of rows in the data frame is non-deterministic. We can 
fix the null struct value handling problem and make the test code not only 
process the first row but all of them.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to