umartin commented on issue #782:
URL: https://github.com/apache/sedona/issues/782#issuecomment-1451563470

   I would create a UDF to validate the WKT string before parsing.
   
   ```
   import pyspark.sql.functions as f
   from pyspark.sql.types import BooleanType
   import shapely.wkt
   
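   def _is_valid_wkt(wkt: str) -> bool:
       """Return True if shapely can parse the string as WKT."""
       try:
           shapely.wkt.loads(wkt)
           return True
       except Exception:
           # Catch Exception rather than using a bare except, so that
           # KeyboardInterrupt and SystemExit still propagate.
           return False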
   
   is_valid_wkt = f.udf(_is_valid_wkt, BooleanType())
   
   df = spark.createDataFrame([("POINT (1 1)",),("INVALID",)], ["wkt"])
   df = df.withColumn("valid", is_valid_wkt("wkt"))
   df = df.withColumn("geom", f.when(df.valid, f.expr("ST_GeomFromText(wkt)")))
   df.show()
   
   +-----------+-----+-----------+
   |        wkt|valid|       geom|
   +-----------+-----+-----------+
   |POINT (1 1)| true|POINT (1 1)|
   |    INVALID|false|       null|
   +-----------+-----+-----------+
   ```
   Failing on invalid input is a sensible default. Otherwise you would risk 
silent data corruption.
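
To illustrate the risk, here is a minimal sketch in plain Python. It assumes a toy parser that only understands `POINT (x y)` as a stand-in for real WKT parsing; the names `parse_strict` and `parse_lenient` are hypothetical and are not part of Sedona or shapely.

```python
import re
from typing import Optional, Tuple

# Toy stand-in for a WKT parser: it only accepts "POINT (x y)".
# Assumption for illustration only; real parsing is done by shapely or Sedona.
_POINT = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_strict(wkt: str) -> Tuple[float, float]:
    """Fail loudly on invalid input (the current default behaviour)."""
    m = _POINT.match(wkt)
    if m is None:
        raise ValueError(f"invalid WKT: {wkt!r}")
    return float(m.group(1)), float(m.group(2))


def parse_lenient(wkt: str) -> Optional[Tuple[float, float]]:
    """Hypothetical lenient mode: swallow the error and return None."""
    try:
        return parse_strict(wkt)
    except ValueError:
        return None


rows = ["POINT (1 1)", "INVALID", "POINT (2 3)"]

# Lenient parsing turns the bad row into None, which is easy to filter
# away downstream without anyone noticing that data went missing.
lenient = [parse_lenient(w) for w in rows]
kept = [p for p in lenient if p is not None]

# Validating explicitly keeps the rejected rows visible instead.
rejected = [w for w in rows if parse_lenient(w) is None]
```

With the strict parser the job stops at the bad row. With the lenient one it finishes, but `kept` is one row shorter than `rows`, and only the separate `rejected` list shows why.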
   
   Since ST_GeomFromText and ST_GeomFromWKT already have an optional srid
parameter, I don't think it would make sense to add another optional parameter
to control leniency.
   
   If Sedona were to support this use case, I would prefer a new function for
validating the WKT string, such as ST_IsValidWKT(wkt: String): Boolean.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
