Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
alamb commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4366221655 See ticket for details - https://github.com/apache/datafusion-sqlparser-rs/issues/2196 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4356918295 I'm putting this on hold while I focus on getting a new sqlparser-rs release out with Spark SQL dialect support, and then experiement with using Spark SQL syntax in the existing slt tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4254519503 We discussed this PR during the weekly sync today. There seemed to be support for the idea of having tests that use Spark SQL so we can run the same queries in Spark and DataFusion. Here is a PR towards that efffort, which adds Spark SQL dialect to the sqlparser crate. I used Comet's existing SQL tests to drive the features. https://github.com/apache/datafusion-sqlparser-rs/pull/2305 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
comphead commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4240611812 Thanks @andygrove how can I test this tool on #21522 ? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
parthchandra commented on PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4239762029
There are some tricky cases that will likely not get parsed correctly -
opening/closing parens inside a quoted string, `]` inside an array element,
nested cast.
```
arrow_cast(regexp_replace(x, ')', ''), 'Utf8')
SELECT ['a]b', 'c']
SELECT [['a]b', 'c'], ['d', 'e']]
SELECT coalesce(x, ')')::TEXT
SELECT coalesce('(', x)::TEXT
SELECT CAST(CAST(x AS FLOAT8) AS FLOAT8)
```
We could document these if they turn out to be hard to address.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4229845772
@shehabgamin @parthchandra @comphead PTAL. I made some improvements based on
feedback so far.
PR Review Feedback
- Fixed SQL translation bugs raised by @parthchandra:
- Handle escaped quotes in string literals ('Andy''s'::TYPE)
- Handle commas inside quoted strings in arrow_cast() parsing
- Skip unsupported cast types instead of passing them through and hoping
for the best
- Added type mappings per @shehabgamin: Utf8View→STRING, LargeUtf8→STRING,
BinaryView→BINARY, LargeBinary→BINARY
- Added Decimal Arrow type handling: Decimal32/64/128/256(p,s) →
DECIMAL(p,s) with precision validation
Multi-version CI
- Changed workflow from single PySpark version to a matrix strategy testing
3.4.4, 3.5.8, and 4.1.1
- Added version-conditional known-failures syntax ([spark>=4.0]
math/abs.slt) so entries can target specific Spark versions
- Added Spark runtime version detection (spark_version())
Unresolved Function Handling
- UNRESOLVED_ROUTINE errors from PySpark are now skipped instead of failed —
these indicate functions not available in the current Spark version (e.g.,
bitmap_bit_position in 3.4, try_parse_url in 3.5), not bugs
Code Cleanup
- Removed dead UNSUPPORTED_ARROW_TYPES empty set and its check
- Pre-compiled Decimal regex at module level
- Extracted _parse_version() helper
- Fixed in_quote type from bool|str to Optional[str]
- Decoupled known-failures loading from version resolution
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3068302707
##
datafusion/spark/scripts/validate_slt.py:
##
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+python validate_slt.py # Run all .slt files
+python validate_slt.py --path math/abs.slt # Single file
+python validate_slt.py --path string/ # All files in subdirectory
+python validate_slt.py --verbose # Show details
+python validate_slt.py --show-skipped# Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---
+# Arrow type -> Spark type mapping
+# ---
+ARROW_TO_SPARK_TYPE = {
+"Int8": "TINYINT",
+"Int16": "SMALLINT",
+"Int32": "INT",
+"Int64": "BIGINT",
+"UInt8": "SMALLINT",
+"UInt16": "INT",
+"UInt32": "BIGINT",
+"UInt64": "BIGINT",
+"Float16": "FLOAT",
+"Float32": "FLOAT",
+"Float64": "DOUBLE",
+"Utf8": "STRING",
+"Boolean": "BOOLEAN",
+"Binary": "BINARY",
+"Date32": "DATE",
+"Date64": "DATE",
+}
+
+# DataFusion cast type -> Spark type mapping
+DF_TO_SPARK_CAST_TYPE = {
+"TINYINT": "TINYINT",
+"SMALLINT": "SMALLINT",
+"INT": "INT",
+"INTEGER": "INT",
+"BIGINT": "BIGINT",
+"FLOAT": "FLOAT",
+"REAL": "FLOAT",
+"DOUBLE": "DOUBLE",
+"STRING": "STRING",
+"VARCHAR": "STRING",
+"TEXT": "STRING",
+"BOOLEAN": "BOOLEAN",
+"BINARY": "BINARY",
+"DATE": "DATE",
+"TIMESTAMP": "TIMESTAMP",
+# PostgreSQL-style aliases used in some .slt files
+"FLOAT8": "DOUBLE",
+"FLOAT4": "FLOAT",
+"INT8": "BIGINT",
+"INT4": "INT",
+"INT2": "SMALLINT",
+"BYTEA": "BINARY",
+}
+
+# Unsupported Arrow types for Spark
+UNSUPPORTED_ARROW_TYPES = {
+"Utf8View",
+"LargeUtf8",
+"LargeBinary",
+"BinaryView",
+}
+
+# ---
+# SLT record types
+# ---
+
+
+@dataclass
+class QueryRecord:
+"""A 'query [rowsort]' block."""
+
+type_codes: str
+sql: str
+expected: list[str]
+rowsort: bool
+line_number: int
+in_ansi_block: bool = False
+
+
+@dataclass
+class ErrorRecord:
+"""A 'query error ' or 'statement error ' block."""
+
+pattern: str
+sql: str
+line_number: int
+kind: str = "query" # "query" or "statement"
+in_ansi_block: bool = False
+
+
+@dataclass
+class StatementRecord:
+"""A 'statement ok' block (DDL/config)."""
+
+sql: str
+line_number: int
+in_ansi_block: bool = False
+
+
+# ---
+# 1. SLT Parser
+# ---
+
+
+def parse_slt(filepath: str) -> list:
+"""Parse an .slt file into a list of records."""
+with open(filepath) as f:
+lines = f.readlines()
+
+records = []
+i = 0
+in_ansi_mode = False
+
+while i < len(lines):
+line = lines[i].rstrip("\n")
+
+# Skip blank lines and comments
+if not line.strip() or line.strip().startswith("#"):
+i += 1
+continue
+
+# query error
+m = re.match(r"^query\s+error\s+(.*)", line)
+if m:
+pattern = m.group(1).strip()
+line_num = i + 1
+i += 1
+sql_lines = []
+while i < len(lines) and lines[i].strip() and not
lines[i].strip().startswith("#"):
+stripped = lines[i].rstrip("\n")
+if (
+re.match(r"^query\s", stripped)
+or re.match(r"^stateme
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3068282054
##
.github/workflows/spark-pyspark-validation.yml:
##
@@ -0,0 +1,63 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Spark PySpark Validation
+
+on:
+ push:
+branches-ignore:
+ - 'gh-readonly-queue/**'
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+ pull_request:
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+
+permissions:
+ contents: read
+
+concurrency:
+ group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{
github.workflow }}
+ cancel-in-progress: true
+
+jobs:
+ pyspark-validation:
+runs-on: ubuntu-latest
+name: Validate .slt tests against PySpark
+steps:
+ - uses: actions/checkout@v4
+
+ - name: Set up Python
+uses: actions/setup-python@v5
+with:
+ python-version: '3.11'
+
+ - name: Set up Java
+uses: actions/setup-java@v4
+with:
+ distribution: 'temurin'
+ java-version: '17'
+
+ - name: Install PySpark
+run: pip install pyspark==3.5.5
Review Comment:
Some of the current tests pass with Spark 3.5.x and fail with Spark 4.0.x
due to Spark 4 being stricter in some areas. Let's run for both versions and
add version-specific ignores.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
shehabgamin commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3066833812
##
datafusion/spark/scripts/known-failures.txt:
##
@@ -0,0 +1,74 @@
+# Known failures for PySpark SLT validation.
+# Each line is a .slt file path (relative to spark test dir) to skip.
+# Blank lines and lines starting with # are ignored.
+
+# format_string %t specifiers pass microseconds where Java expects milliseconds
+# https://github.com/apache/datafusion/issues/21515
+string/format_string.slt
+
+# format_string %f/%e/%g/%a not supported with Decimal types in Spark
+# format_string %c requires INT not BIGINT in Spark
+# (also covered by #21515)
+
+# substring handles large negative start positions differently from Spark
+# https://github.com/apache/datafusion/issues/21510
+string/substring.slt
+
+# array_repeat incorrectly returns NULL when element is NULL
+# https://github.com/apache/datafusion/issues/21512
+array/array_repeat.slt
+
+# mod/pmod returns NaN instead of NULL for float division by zero
+# https://github.com/apache/datafusion/issues/21514
+math/mod.slt
+
+# String literal escape sequences (\t, \n) not interpreted like Spark
+# https://github.com/apache/datafusion/issues/21516
+string/soundex.slt
+
+# array_contains rejects NULL typed arguments in Spark
+array/array_contains.slt
+
+# Spark's shuffle() only takes 1 argument, not 2 (no seed parameter)
+array/shuffle.slt
+
+# map(array, array) creates single-entry map in Spark, not map_from_arrays
+# Wrong test: uses map() where map_from_arrays() was intended
+collection/size.slt
+
+# date_add/date_sub overflow: Spark errors, DataFusion wraps
+# CAST(date AS INT) not supported in Spark
+datetime/date_add.slt
+datetime/date_sub.slt
+
+# Interval formatting differs between DataFusion and Spark
+datetime/make_dt_interval.slt
+datetime/make_interval.slt
+
+# TIME data type is Spark 4.0 only
Review Comment:
Related to
https://github.com/apache/datafusion/pull/21508#discussion_r3066824998
##
.github/workflows/spark-pyspark-validation.yml:
##
@@ -0,0 +1,63 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Spark PySpark Validation
+
+on:
+ push:
+branches-ignore:
+ - 'gh-readonly-queue/**'
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+ pull_request:
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+
+permissions:
+ contents: read
+
+concurrency:
+ group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{
github.workflow }}
+ cancel-in-progress: true
+
+jobs:
+ pyspark-validation:
+runs-on: ubuntu-latest
+name: Validate .slt tests against PySpark
+steps:
+ - uses: actions/checkout@v4
+
+ - name: Set up Python
+uses: actions/setup-python@v5
+with:
+ python-version: '3.11'
+
+ - name: Set up Java
+uses: actions/setup-java@v4
+with:
+ distribution: 'temurin'
+ java-version: '17'
+
+ - name: Install PySpark
+run: pip install pyspark==3.5.5
Review Comment:
We should validate with the latest version of Spark (4.1.x). Any test case
that passes on Spark 3.5 will also pass on later versions.
##
datafusion/spark/scripts/validate_slt.py:
##
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
parthchandra commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3061073535
##
datafusion/spark/scripts/validate_slt.py:
##
@@ -0,0 +1,1210 @@
+#!/usr/bin/env python3
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""
+Validate hardcoded expected values in .slt (sqllogictest) test files
+by running the same queries against PySpark and comparing results.
+
+Usage:
+python validate_slt.py # Run all .slt files
+python validate_slt.py --path math/abs.slt # Single file
+python validate_slt.py --path string/ # All files in subdirectory
+python validate_slt.py --verbose # Show details
+python validate_slt.py --show-skipped# Show skipped queries
+"""
+
+import argparse
+import math
+import os
+import re
+import sys
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import Optional
+
+# ---
+# Arrow type -> Spark type mapping
+# ---
+ARROW_TO_SPARK_TYPE = {
+"Int8": "TINYINT",
+"Int16": "SMALLINT",
+"Int32": "INT",
+"Int64": "BIGINT",
+"UInt8": "SMALLINT",
+"UInt16": "INT",
+"UInt32": "BIGINT",
+"UInt64": "BIGINT",
+"Float16": "FLOAT",
+"Float32": "FLOAT",
+"Float64": "DOUBLE",
+"Utf8": "STRING",
+"Boolean": "BOOLEAN",
+"Binary": "BINARY",
+"Date32": "DATE",
+"Date64": "DATE",
+}
+
+# DataFusion cast type -> Spark type mapping
+DF_TO_SPARK_CAST_TYPE = {
+"TINYINT": "TINYINT",
+"SMALLINT": "SMALLINT",
+"INT": "INT",
+"INTEGER": "INT",
+"BIGINT": "BIGINT",
+"FLOAT": "FLOAT",
+"REAL": "FLOAT",
+"DOUBLE": "DOUBLE",
+"STRING": "STRING",
+"VARCHAR": "STRING",
+"TEXT": "STRING",
+"BOOLEAN": "BOOLEAN",
+"BINARY": "BINARY",
+"DATE": "DATE",
+"TIMESTAMP": "TIMESTAMP",
+# PostgreSQL-style aliases used in some .slt files
+"FLOAT8": "DOUBLE",
+"FLOAT4": "FLOAT",
+"INT8": "BIGINT",
+"INT4": "INT",
+"INT2": "SMALLINT",
+"BYTEA": "BINARY",
+}
+
+# Unsupported Arrow types for Spark
+UNSUPPORTED_ARROW_TYPES = {
+"Utf8View",
+"LargeUtf8",
+"LargeBinary",
+"BinaryView",
+}
+
+# ---
+# SLT record types
+# ---
+
+
+@dataclass
+class QueryRecord:
+"""A 'query [rowsort]' block."""
+
+type_codes: str
+sql: str
+expected: list[str]
+rowsort: bool
+line_number: int
+in_ansi_block: bool = False
+
+
+@dataclass
+class ErrorRecord:
+"""A 'query error ' or 'statement error ' block."""
+
+pattern: str
+sql: str
+line_number: int
+kind: str = "query" # "query" or "statement"
+in_ansi_block: bool = False
+
+
+@dataclass
+class StatementRecord:
+"""A 'statement ok' block (DDL/config)."""
+
+sql: str
+line_number: int
+in_ansi_block: bool = False
+
+
+# ---
+# 1. SLT Parser
+# ---
+
+
+def parse_slt(filepath: str) -> list:
+"""Parse an .slt file into a list of records."""
+with open(filepath) as f:
+lines = f.readlines()
+
+records = []
+i = 0
+in_ansi_mode = False
+
+while i < len(lines):
+line = lines[i].rstrip("\n")
+
+# Skip blank lines and comments
+if not line.strip() or line.strip().startswith("#"):
+i += 1
+continue
+
+# query error
+m = re.match(r"^query\s+error\s+(.*)", line)
+if m:
+pattern = m.group(1).strip()
+line_num = i + 1
+i += 1
+sql_lines = []
+while i < len(lines) and lines[i].strip() and not
lines[i].strip().startswith("#"):
+stripped = lines[i].rstrip("\n")
+if (
+re.match(r"^query\s", stripped)
+or re.match(r"^stat
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
shehabgamin commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4216408338 > @mbutrovich @comphead @shehabgamin let me know what you think about this approach. It isn't as comprehensive as the testing approach in Comet, but should help catch some compatibility issues. Epic! Will review over the next couple days. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
comphead commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4216047929 I'm planning to add an `array_compact` function to DF today, and this function can also be used in Spark, however no plans to have extra `array_compact` in spark crate. My question is, can this tool be also used to check if DF first order builtin functions conforms to Spark semantics? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
github-advanced-security[bot] commented on code in PR #21508:
URL: https://github.com/apache/datafusion/pull/21508#discussion_r3059038680
##
.github/workflows/spark-pyspark-validation.yml:
##
@@ -0,0 +1,60 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+name: Spark PySpark Validation
+
+on:
+ push:
+branches-ignore:
+ - 'gh-readonly-queue/**'
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+ pull_request:
+paths:
+ - 'datafusion/spark/**'
+ - 'datafusion/sqllogictest/test_files/spark/**'
+ - '.github/workflows/spark-pyspark-validation.yml'
+
+concurrency:
+ group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{
github.workflow }}
+ cancel-in-progress: true
+
+jobs:
+ pyspark-validation:
+runs-on: ubuntu-latest
+name: Validate .slt tests against PySpark
+steps:
+ - uses: actions/checkout@v4
+
+ - name: Set up Python
+uses: actions/setup-python@v5
+with:
+ python-version: '3.11'
+
+ - name: Set up Java
+uses: actions/setup-java@v4
+with:
+ distribution: 'temurin'
+ java-version: '17'
+
+ - name: Install PySpark
+run: pip install pyspark==3.5.5
+
+ - name: Run PySpark validation
+run: python datafusion/spark/scripts/validate_slt.py
Review Comment:
## CodeQL / Workflow does not contain permissions
Actions job or workflow does not limit the permissions of the GITHUB_TOKEN.
Consider setting an explicit permissions block, using the following as a
minimal starting point: {{contents: read}}
[Show more
details](https://github.com/apache/datafusion/security/code-scanning/43)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
-
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4215551000 @comphead I added a GitHub workflow, and a known issues list so that we can skip the currently failing ones. This will at least catch issues with new tests. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4215456735 > I would say if we have github bot action, similar to `run benchmarks` on the the PR would help remove the local testing part. How this script is planned to be called? It would be nice to eventually add a GitHub workflow to run this, but for now, probably best just to make the script available for people to run. Many of the tests are written in such a way that we cannot support them in PySpark, which makes this quite challenging. The Comet approach is much nicer, but there is no way in this repo to actually run the DF expressions from within Spark, so we cannot use Spark SQL for the tests. I suppose we could update the sql parser crate to support Spark SQL and update the planner to support using Spark expressions and then the tests could be written in Spark SQL. Sounds like a lot of work though. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
comphead commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4215402723 Thanks @andygrove this is something that would be superhelpful. Currently I'm going the flow to feed the PR to LLM to convert queries to spark syntax + expected answers from PR, and then run them in local spark sandbox, and this is super annoying. I would say if we have github bot action, similar to `run benchmarks` on the the PR would help remove the local testing part. How this script is planned to be called? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
Re: [PR] feat: add PySpark validation script for datafusion-spark .slt tests [datafusion]
andygrove commented on PR #21508: URL: https://github.com/apache/datafusion/pull/21508#issuecomment-4214875803 @mbutrovich @comphead @shehabgamin let me know what you think about this approach. It isn't as comprehensive as the testing approach in Comet, but should help catch some compatibility issues. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] - To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
