andygrove commented on code in PR #2831:
URL: https://github.com/apache/datafusion-comet/pull/2831#discussion_r2640824998


##########
spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala:
##########
@@ -391,4 +391,315 @@ class CometStringExpressionSuite extends CometTestBase {
     }
   }
 
+  test("regexp_extract basic") {
+    withSQLConf(CometConf.COMET_REGEXP_ALLOW_INCOMPATIBLE.key -> "true") {
+      val data = Seq(
+        ("100-200", 1),
+        ("300-400", 1),
+        (null, 1), // NULL input
+        ("no-match", 1), // no match → should return ""
+        ("abc123def456", 1),
+        ("", 1) // empty string
+      )
+
+      withParquetTable(data, "tbl") {
+        // Test basic extraction: group 0 (full match)
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl")
+        // Test group 1
+        checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 1) FROM tbl")

Review Comment:
   ```suggestion
           checkSparkAnswerAndOperator("SELECT regexp_extract(_1, '(\\\\d+)-(\\\\d+)', 1) FROM tbl")
   ```
   
   The escaping is incorrect for many of these queries. Here is the debug logging for this query as currently written, which shows that it does not actually extract anything:
   
   ```
   SELECT regexp_extract(_1, '(\d+)-(\d+)', 1) FROM tbl
   *(1) CometColumnarToRow
   +- CometProject [regexp_extract(_1, (d+)-(d+), 1)#33], [regexp_extract(_1#6, (d+)-(d+), 1) AS regexp_extract(_1, (d+)-(d+), 1)#33]
      +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-46241ff7-ee37-48fe-ad88-ea06bceae81f], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>
   
   +--------------------------------+
   |regexp_extract(_1, (d+)-(d+), 1)|
   +--------------------------------+
   |                                |
   |                                |
   |                                |
   |                                |
   |                                |
   |NULL                            |
   +--------------------------------+
   ```
   
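   The issue is that the pattern goes through two levels of unescaping but is only escaped for one of them:
   - `(\\d+)-(\\d+)` in the Scala source → the Scala compiler produces `(\d+)-(\d+)`
   - The SQL parser then treats `\d` as an escape sequence and drops the backslash → the regex becomes `(d+)-(d+)`
   
   With four backslashes in the Scala source (`\\\\d`), the regex engine receives `\d` as intended.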
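
   To make the two unescaping passes concrete, here is a rough standalone sketch. `unescapeSqlLiteral` is a hypothetical, heavily simplified stand-in for what the SQL parser does to a string literal (it only drops the backslash and keeps the next character), and the object and value names are mine, not anything from Spark or Comet:

   ```scala
   object RegexEscapingSketch {
     // Hypothetical, simplified model of SQL string-literal unescaping:
     // a backslash is dropped and the character after it is kept as-is.
     // The real parser also handles sequences such as \n, \t and numeric escapes.
     def unescapeSqlLiteral(s: String): String = {
       val sb = new StringBuilder
       var i = 0
       while (i < s.length) {
         if (s.charAt(i) == '\\' && i + 1 < s.length) {
           sb.append(s.charAt(i + 1))
           i += 2
         } else {
           sb.append(s.charAt(i))
           i += 1
         }
       }
       sb.toString
     }

     // Two backslashes in Scala source: the SQL text contains \d.
     val twoBackslashes: String = "(\\d+)-(\\d+)"
     // Four backslashes in Scala source: the SQL text contains \\d.
     val fourBackslashes: String = "(\\\\d+)-(\\\\d+)"
     // A triple-quoted Scala string skips the Scala-level escaping,
     // so two backslashes here give the same SQL text as four above.
     val tripleQuoted: String = """(\\d+)-(\\d+)"""

     def main(args: Array[String]): Unit = {
       println(unescapeSqlLiteral(twoBackslashes))  // (d+)-(d+)
       println(unescapeSqlLiteral(fourBackslashes)) // (\d+)-(\d+)
       println(fourBackslashes == tripleQuoted)     // true
       // Only the four-backslash form still matches digits after both passes.
       println(unescapeSqlLiteral(twoBackslashes).r.findFirstIn("100-200"))  // None
       println(unescapeSqlLiteral(fourBackslashes).r.findFirstIn("100-200")) // Some(100-200)
     }
   }
   ```

   If the four-backslash form is hard to read in the tests, the triple-quoted variant may be an option, since it needs only the SQL-level escaping.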
   
   After fixing this, I see:
   
   ```
   SELECT regexp_extract(_1, '(\\d+)-(\\d+)', 0) FROM tbl
   *(1) CometColumnarToRow
   +- CometProject [regexp_extract(_1, (\d+)-(\d+), 0)#15], [regexp_extract(_1#6, (\d+)-(\d+), 0) AS regexp_extract(_1, (\d+)-(\d+), 0)#15]
      +- CometScan [native_iceberg_compat] parquet [_1#6] Batched: true, DataFilters: [], Format: CometParquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/spark-901feb7b-b26a-4f2a-99ff-fc594af56804], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string>
   
   +----------------------------------+
   |regexp_extract(_1, (\d+)-(\d+), 0)|
   +----------------------------------+
   |                                  |
   |                                  |
   |                                  |
   |100-200                           |
   |300-400                           |
   |NULL                              |
   +----------------------------------+
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

