[GitHub] [airflow] janvandervegt commented on a diff in pull request #30980: Db Partition Sensor

via GitHub Tue, 02 May 2023 01:09:11 -0700


janvandervegt commented on code in PR #30980:
URL: https://github.com/apache/airflow/pull/30980#discussion_r1182222111



##########
docs/apache-airflow-providers-databricks/operators/sql.rst:
##########
@@ -113,10 +113,53 @@ Configuring Databricks connection to be used with the Sensor.
     :start-after: [START howto_sensor_databricks_connection_setup]
     :end-before: [END howto_sensor_databricks_connection_setup]
 
-Poking the specific table for existence of data/partition:
+Poking the specific table with the SQL statement:
 
 .. exampleinclude:: /../../tests/system/providers/databricks/example_databricks_sensors.py
     :language: python
     :dedent: 4
     :start-after: [START howto_sensor_databricks_sql]
     :end-before: [END howto_sensor_databricks_sql]
+
+
+DatabricksPartitionSensor

Review Comment:
   How familiar are people with the concept of sensors? Should we include a
line or two explaining that this sensor can be used to wait until a partition is available?



##########
docs/apache-airflow-providers-databricks/operators/sql.rst:
##########
@@ -113,10 +113,53 @@ Configuring Databricks connection to be used with the Sensor.
     :start-after: [START howto_sensor_databricks_connection_setup]
     :end-before: [END howto_sensor_databricks_connection_setup]
 
-Poking the specific table for existence of data/partition:
+Poking the specific table with the SQL statement:
 
 .. exampleinclude:: /../../tests/system/providers/databricks/example_databricks_sensors.py
     :language: python
     :dedent: 4
     :start-after: [START howto_sensor_databricks_sql]
     :end-before: [END howto_sensor_databricks_sql]
+
+
+DatabricksPartitionSensor
+=========================
+
+Use the :class:`~airflow.providers.databricks.sensors.partition.DatabricksPartitionSensor` to run the sensor
+for a table accessible via a Databricks SQL warehouse or interactive cluster.
+
+Using the Sensor
+----------------
+
+The sensor accepts the table name and partition name(s), value(s) from the user and generates the SQL query to check if
+the specified partition name, value(s) exist in the specified table.
+
+The required parameters are:
+
+* ``table_name`` (name of the table for partition check).
+
+* ``partitions`` (name of the partitions to check).
+
+* ``partition_operator`` (comparison operator for partitions, such as >=).

Review Comment:
   Reading this line, it is not immediately clear to me how this comparison 
operator is used. What are you comparing?
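
To make the question concrete, here is my guess at what the operator does. This is a sketch based only on the ``partitions`` example in the docstring, not on the PR's actual implementation; ``build_partition_predicate`` and ``_quote`` are hypothetical helpers, and the handling of list values is an assumption.

```python
from __future__ import annotations


def _quote(value) -> str:
    # Naive quoting for illustration only; real code should use a proper escaper.
    if isinstance(value, (int, float)):
        return str(value)
    return "'" + str(value).replace("'", "''") + "'"


def build_partition_predicate(partitions: dict, operator: str = "=") -> str:
    """Hypothetical helper: turn a partitions dict into a SQL WHERE predicate."""
    clauses = []
    for column, value in partitions.items():
        if isinstance(value, (list, tuple)):
            # Assumption: a list of values becomes an IN (...) membership check.
            quoted = ", ".join(_quote(v) for v in value)
            clauses.append(f"{column} IN ({quoted})")
        else:
            # Assumption: a scalar value is compared using the given operator.
            clauses.append(f"{column} {operator} {_quote(value)}")
    return " AND ".join(clauses)


predicate = build_partition_predicate(
    {"date": "2023-01-03", "name": ["abc", "def"]}, operator=">="
)
print(predicate)
```

If that is roughly the behaviour, the docs could say that the operator compares each partition column against the supplied value.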



##########
airflow/providers/databricks/sensors/databricks_partition.py:
##########
@@ -0,0 +1,225 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+"""This module contains Databricks sensors."""
+
+from __future__ import annotations
+
+from datetime import datetime
+from typing import TYPE_CHECKING, Any, Callable, Sequence
+
+from databricks.sql.utils import ParamEscaper
+
+from airflow.compat.functools import cached_property
+from airflow.exceptions import AirflowException
+from airflow.providers.common.sql.hooks.sql import fetch_all_handler
+from airflow.providers.databricks.hooks.databricks_sql import DatabricksSqlHook
+from airflow.sensors.base import BaseSensorOperator
+
+if TYPE_CHECKING:
+    from airflow.utils.context import Context
+
+
+class DatabricksPartitionSensor(BaseSensorOperator):
+    """
+    Sensor to detect the presence of table partitions in Databricks.
+
+    :param databricks_conn_id: Reference to :ref:`Databricks
+        connection id<howto/connection:databricks>` (templated), defaults to
+        DatabricksSqlHook.default_conn_name.
+    :param sql_warehouse_name: Optional name of Databricks SQL warehouse. If not specified, ``http_path``
+        must be provided as described below, defaults to None
+    :param http_path: Optional string specifying HTTP path of Databricks SQL warehouse or All Purpose cluster.
+        If not specified, it should be either specified in the Databricks connection's
+        extra parameters, or ``sql_warehouse_name`` must be specified.
+    :param session_configuration: An optional dictionary of Spark session parameters. If not specified,
+        it could be specified in the Databricks connection's extra parameters, defaults to None
+    :param http_headers: An optional list of (k, v) pairs
+        that will be set as HTTP headers on every request. (templated).
+    :param catalog: An optional initial catalog to use.
+        Requires Databricks Runtime version 9.0+ (templated), defaults to ""
+    :param schema: An optional initial schema to use.
+        Requires Databricks Runtime version 9.0+ (templated), defaults to "default"
+    :param table_name: Name of the table to check partitions.
+    :param partitions: Name of the partitions to check.
+        Example: {"date": "2023-01-03", "name": ["abc", "def"]}
+    :param partition_operator: Optional comparison operator for partitions, such as >=.
+    :param handler: Handler for DbApiHook.run() to return results, defaults to fetch_all_handler
+    :param client_parameters: Additional parameters internal to Databricks SQL connector parameters.
+    """
+
+    template_fields: Sequence[str] = (
+        "databricks_conn_id",
+        "schema",
+        "http_headers",

Review Comment:
   I would bundle the table-related identifiers together, in the order "catalog",
"schema", "table_name", followed by the partitions argument, probably
above http_headers.
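
Something like the following is the ordering I have in mind. It is only a sketch: I am assuming ``catalog``, ``table_name`` and ``partitions`` are meant to be templated, and any fields that come after ``http_headers`` in the PR are left out here.

```python
from typing import Sequence

# Suggested ordering (assumption): connection id first, then the table
# identifiers from broadest to narrowest, then partitions, then the rest.
template_fields: Sequence[str] = (
    "databricks_conn_id",
    "catalog",
    "schema",
    "table_name",
    "partitions",
    "http_headers",
)
```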



##########
docs/apache-airflow-providers-databricks/operators/sql.rst:
##########
@@ -113,10 +113,53 @@ Configuring Databricks connection to be used with the Sensor.
     :start-after: [START howto_sensor_databricks_connection_setup]
     :end-before: [END howto_sensor_databricks_connection_setup]
 
-Poking the specific table for existence of data/partition:
+Poking the specific table with the SQL statement:
 
 .. exampleinclude:: /../../tests/system/providers/databricks/example_databricks_sensors.py
     :language: python
     :dedent: 4
     :start-after: [START howto_sensor_databricks_sql]
     :end-before: [END howto_sensor_databricks_sql]
+
+
+DatabricksPartitionSensor
+=========================
+
+Use the :class:`~airflow.providers.databricks.sensors.partition.DatabricksPartitionSensor` to run the sensor
+for a table accessible via a Databricks SQL warehouse or interactive cluster.
+
+Using the Sensor
+----------------
+
+The sensor accepts the table name and partition name(s), value(s) from the user and generates the SQL query to check if
+the specified partition name, value(s) exist in the specified table.
+
+The required parameters are:
+
+* ``table_name`` (name of the table for partition check).
+
+* ``partitions`` (name of the partitions to check).
+
+* ``partition_operator`` (comparison operator for partitions, such as >=).
+
+*   One of ``sql_warehouse_name`` (name of Databricks SQL warehouse to use) or ``http_path`` (HTTP path for Databricks SQL warehouse or Databricks cluster).
+
+Other parameters are optional and could be found in the class documentation.

Review Comment:
   could -> can



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
