umartin commented on issue #782:
URL: https://github.com/apache/sedona/issues/782#issuecomment-1451563470
I would create a UDF to validate the WKT string before parsing.
```
import pyspark.sql.functions as f
from pyspark.sql.types import BooleanType
import shapely.wkt


def _is_valid_wkt(wkt: str) -> bool:
    # True if shapely can parse the string, False otherwise.
    try:
        shapely.wkt.loads(wkt)
        return True
    except Exception:
        return False


is_valid_wkt = f.udf(_is_valid_wkt, BooleanType())

# Assumes an existing SparkSession `spark` with Sedona registered.
df = spark.createDataFrame([("POINT (1 1)",), ("INVALID",)], ["wkt"])
df = df.withColumn("valid", is_valid_wkt("wkt"))
# Only parse rows that passed validation; other rows get a null geom.
df = df.withColumn("geom", f.when(df.valid, f.expr("ST_GeomFromText(wkt)")))
df.show()
+-----------+-----+-----------+
|        wkt|valid|       geom|
+-----------+-----+-----------+
|POINT (1 1)| true|POINT (1 1)|
|    INVALID|false|       null|
+-----------+-----+-----------+
```
Failing on invalid input is a sensible default. Otherwise you would risk
silent data corruption.
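
To make that risk concrete, here is a small plain-Python sketch. `parse_strict` and `parse_lenient` are hypothetical stand-ins, not Sedona functions. They only mimic a parser that raises on bad input versus one that returns None, and they only understand `POINT (x y)`.

```python
from typing import Optional, Tuple

Point = Tuple[float, float]


def parse_strict(wkt: str) -> Point:
    """Hypothetical strict parser: raises ValueError unless given 'POINT (x y)'."""
    text = wkt.strip()
    if not (text.startswith("POINT (") and text.endswith(")")):
        raise ValueError(f"not a valid point: {wkt!r}")
    x, y = text[len("POINT ("):-1].split()
    return float(x), float(y)


def parse_lenient(wkt: str) -> Optional[Point]:
    """Hypothetical lenient parser: swallows the error and returns None."""
    try:
        return parse_strict(wkt)
    except ValueError:
        return None


rows = ["POINT (1 1)", "INVALID", "POINT (3 3)"]

# Lenient: the bad row silently disappears from downstream results.
lenient_points = [p for p in map(parse_lenient, rows) if p is not None]

# Strict: the bad row is reported, so it can be fixed or filtered on purpose.
try:
    strict_points = [parse_strict(r) for r in rows]
except ValueError as err:
    strict_error = str(err)
```

With the lenient variant, nothing tells you that one of the three rows was dropped. The strict variant stops at the bad row and names it.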
Since ST_GeomFromText and ST_GeomFromWKT already have an optional srid
parameter, I don't think it would make sense to add another optional parameter
to control leniency.
If Sedona were to support this use case, I would prefer a new function for
validating the WKT string, like ST_IsValidWKT(wkt: String): Boolean.