Doug Dennis created SEDONA-166:
----------------------------------

             Summary: Provide DataFrame Style API
                 Key: SEDONA-166
                 URL: https://issues.apache.org/jira/browse/SEDONA-166
             Project: Apache Sedona
          Issue Type: New Feature
            Reporter: Doug Dennis


Spark provides an API to operate on Column types. Especially in Python, this 
API is by far the most common pattern I have seen used when developing Spark 
applications. Currently, Sedona only provides the SQL API, which requires either 
generating a temporary view and using the sql method, using the expr function, 
or using the selectExpr method. There is no performance loss, but it does cause 
disruption when writing applications with Sedona, and it makes certain tasks 
tricky to accomplish.

I'll use an example of using a Sedona function inside of a transform function 
call to generate geometry from a list of coordinates. Assume the variable spark 
is a spark session. Here is how it can be accomplished today (I omit the 
version with expr since it is nearly identical to selectExpr):
{code:python}
df = spark.sql("SELECT array(array(0.0,0.0),array(1.1,2.2)) AS points_list")

# generate a temp view and use the sql method
df.createTempView("tbl")
spark.sql("SELECT transform(points_list, p -> ST_Point(p[0], p[1])) AS points_list FROM tbl")

# selectExpr
df.selectExpr("transform(points_list, p -> ST_Point(p[0], p[1])) AS points_list")
{code}

I propose implementing an API in the same style as Spark's that works with 
Columns. This would allow for something like this:

{code:python}
import pyspark.sql.functions as f
import sedona.sql.st_functions as stf

df.select(f.transform(f.col("points_list"), lambda x: stf.st_point(x[0], x[1])))
{code}

I believe Sedona can mirror the way Spark implements this functionality to 
accomplish this task.
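To make the idea concrete without a running cluster, here is a toy sketch of the wrapper pattern that pyspark.sql.functions itself follows: thin Python functions that fold Column arguments into a new column expression. Every name below (the toy Column class, _make_function) is an illustrative stand-in, not a proposed implementation; the real version would build JVM-backed Column objects rather than strings.

{code:python}
class Column:
    """Toy stand-in for pyspark.sql.Column: wraps an expression string."""

    def __init__(self, expr):
        self.expr = expr

    def __getitem__(self, i):
        # Indexing a column yields a new column expression, as in Spark.
        return Column(f"{self.expr}[{i}]")

    def __repr__(self):
        return self.expr


def _make_function(name):
    """Generate a wrapper that turns Column arguments into a call expression.

    This mirrors how pyspark.sql.functions exposes one Python function per
    SQL function; Sedona's ST_* catalog could be wrapped the same way.
    """
    def wrapper(*cols):
        args = ", ".join(c.expr for c in cols)
        return Column(f"{name}({args})")
    wrapper.__name__ = name.lower()
    return wrapper


st_point = _make_function("ST_Point")

p = Column("p")
print(st_point(p[0], p[1]))  # ST_Point(p[0], p[1])
{code}

Because each wrapper returns a Column, the results compose with higher-order functions like transform exactly as built-in Spark functions do.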




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
