[ https://issues.apache.org/jira/browse/SEDONA-739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17985052#comment-17985052 ]
James Willis commented on SEDONA-739:
-------------------------------------

You are hitting this case:
https://github.com/apache/sedona/blob/master/spark/common/src/main/scala/org/apache/spark/sql/sedona_sql/expressions/PhysicalFunction.scala#L56

Generally, the optimization rules should try to avoid this case by creating separate physical nodes: one that first creates the geometry column and another that then evaluates the DBSCAN function.

The underlying dbscan function should probably be converted to accept Columns instead of column names, to avoid creating extra nodes etc. I'm not sure if there are surprising challenges in that effort.

> ST_DBScan fails when selecting only subset of columns.
> ------------------------------------------------------
>
>                 Key: SEDONA-739
>                 URL: https://issues.apache.org/jira/browse/SEDONA-739
>             Project: Apache Sedona
>          Issue Type: Bug
>            Reporter: Paweł Kociński
>            Priority: Major
>
>
> {code:java}
> dbscan_df = sedona.sql(
>     """
>     SELECT
>         index,
>         geom AS geom,
>         ST_DBSCAN(geom, 0.5, 10, False) AS scan
>     FROM points
>     """
> )
> {code}
> Selecting all columns works just fine:
>
> {code:java}
> result = dbscan_df.select("scan.*", "index", "geom")
> {code}
> but selecting a subset
>
> {code:java}
> dbscan_df.select("scan.*", "index").show()
> {code}
> causes the following error:
> {code:java}
> IllegalArgumentException                  Traceback (most recent call last)
> Cell In[56], line 1
> ----> 1 dbscan_df.select("scan.*", "index").show()
>
> File /opt/spark/python/pyspark/sql/dataframe.py:947, in DataFrame.show(self, n, truncate, vertical)
>     887 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
>     888     """Prints the first ``n`` rows to the console.
>     889
>     890     .. versionadded:: 1.3.0
>    (...)
>     945     name | Bob
>     946     """
> --> 947     print(self._show_string(n, truncate, vertical))
>
> File /opt/spark/python/pyspark/sql/dataframe.py:965, in DataFrame._show_string(self, n, truncate, vertical)
>     959     raise PySparkTypeError(
>     960         error_class="NOT_BOOL",
>     961         message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
>     962     )
>     964 if isinstance(truncate, bool) and truncate:
> --> 965     return self._jdf.showString(n, 20, vertical)
>     966 else:
>     967     try:
>
> File /usr/local/lib/python3.10/dist-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
>    1316 command = proto.CALL_COMMAND_NAME +\
>    1317     self.command_header +\
>    1318     args_command +\
>    1319     proto.END_COMMAND_PART
>    1321 answer = self.gateway_client.send_command(command)
> -> 1322 return_value = get_return_value(
>    1323     answer, self.gateway_client, self.target_id, self.name)
>    1325 for temp_arg in temp_args:
>    1326     if hasattr(temp_arg, "_detach"):
>
> File /opt/spark/python/pyspark/errors/exceptions/captured.py:185, in capture_sql_exception.<locals>.deco(*a, **kw)
>     181 converted = convert_exception(e.java_exception)
>     182 if not isinstance(converted, UnknownException):
>     183     # Hide where the exception came from that shows a non-Pythonic
>     184     # JVM exception message.
> --> 185     raise converted from None
>     186 else:
>     187     raise
>
> IllegalArgumentException: geometry argument must be a named reference to an existing column
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)