Kristin Cowalcijk created SEDONA-497:
----------------------------------------

             Summary: SpatialRDD read from multiple Shapefiles has incorrect 
fieldName property
                 Key: SEDONA-497
                 URL: https://issues.apache.org/jira/browse/SEDONA-497
             Project: Apache Sedona
          Issue Type: Bug
    Affects Versions: 1.5.1
            Reporter: Kristin Cowalcijk
             Fix For: 1.6.0
         Attachments: debug_shapefiles.zip

A user reported this issue on Discord. It can be reproduced easily with the 
shapefiles the user provided: [^debug_shapefiles.zip]

The following code loads a directory containing multiple shapefiles into a 
SpatialRDD at once, and then uses {{Adapter.toDf}} to convert the SpatialRDD to 
a Spark DataFrame:

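{code:python}
from sedona.core.formatMapper.shapefileParser import ShapefileReader
from sedona.utils.adapter import Adapter

# sc: SparkContext, sedona: SparkSession, parcel_path: directory of shapefiles
parcel_rdd = ShapefileReader.readToGeometryRDD(sc, parcel_path)
parcel_df = Adapter.toDf(parcel_rdd, sedona)
parcel_df.printSchema()
parcel_df.show()
{code}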

The above code yields the following output:

{code}
root
 |-- geometry: geometry (nullable = true)
 |-- id: string (nullable = true)
 |-- name id: string (nullable = true)
 |-- name id: string (nullable = true)
 |-- name: string (nullable = true)

24/01/31 14:09:24 WARN TaskSetManager: Lost task 0.0 in stage 32.0 (TID 43) 
(172.20.0.130 executor 0): org.apache.spark.SparkRuntimeException: Error while 
encoding: java.lang.ArrayIndexOutOfBoundsException: Index 3 out of bounds for 
length 3
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
newInstance(class org.apache.spark.sql.sedona_sql.UDT.GeometryUDT).serialize AS 
geometry#275
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 1, id), StringType, ObjectType(class 
java.lang.String)), true, false, true) AS id#276
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 2, name id), StringType, ObjectType(class 
java.lang.String)), true, false, true) AS name id#277
{code}
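
The exception message is consistent with a column-count mismatch: the schema 
lists five columns, while each record appears to carry only three values. The 
sketch below uses plain Python, not Sedona or Spark, to show where the lookup 
would run off the end of the row. The contents of {{row_values}} and the helper 
{{first_missing_index}} are illustrative assumptions, not taken from the 
reporter's data or from the Sedona code.

```python
# Field names reported by the SpatialRDD (copied from this issue).
field_names = ["id", "name id", "name id", "name"]

# The printed schema shows the geometry column first, then the field names.
schema_columns = ["geometry"] + field_names

# Assumption: each record holds the geometry plus the two real attributes.
row_values = ["<geometry>", "<id value>", "<name value>"]


def first_missing_index(columns, values):
    """Return the first column index with no matching row value, or None."""
    for index in range(len(columns)):
        if index >= len(values):
            return index
    return None


print(len(schema_columns), len(row_values))
print(first_missing_index(schema_columns, row_values))
```

Under these assumptions the first column without a value is at index 3, which 
lines up with "Index 3 out of bounds for length 3" in the log above.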

{{Adapter.toDf}} returns a DataFrame with this malformed schema because the 
{{fieldNames}} property of {{parcel_rdd}} is incorrect:

{code}
>>> parcel_rdd.fieldNames
['id', 'name id', 'name id', 'name']
{code}

The schema of the shapefiles should be ['id', 'name'], but the field list was 
apparently repeated 3 times and merged, leaving fused entries such as 
'name id'.
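
One guess at how exactly this list could arise, offered as a hypothesis and not 
a confirmed diagnosis: if each file's field names were joined with a tab, and 
the per-file strings were then combined with a space before being split on tabs 
again, the result would match the observed value. The function 
{{merge_field_names}} and both separators below are hypothetical and are not 
taken from the Sedona source.

```python
def merge_field_names(per_file_fields, file_count):
    """Simulate the hypothesized merge of per-file field name headers."""
    # Assumption: each file contributes one tab-separated header string.
    header = "\t".join(per_file_fields)
    # Assumption: the per-file headers are combined with a space, not a tab.
    merged = " ".join([header] * file_count)
    # Splitting on tabs then fuses the last name of one file with the
    # first name of the next.
    return merged.split("\t")


print(merge_field_names(["id", "name"], 3))
# prints ['id', 'name id', 'name id', 'name']
```

With a single file the same function returns ['id', 'name'], which would 
explain why the problem only shows up when several shapefiles are read at once.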




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
