This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 9a45da21dd1c [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs
9a45da21dd1c is described below

commit 9a45da21dd1c7dd93152f7126c8c611b8ba031e7
Author: Gengliang Wang <gengli...@apache.org>
AuthorDate: Sat May 4 11:54:49 2024 -0700

    [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs

    ### What changes were proposed in this pull request?

    Similar to https://github.com/apache/spark/pull/46375/, this PR provides a constant table schema
    in PySpark for querying structured logs. The doc of logging configuration is also updated.

    ### Why are the changes needed?

    Provide a convenient way to query Spark logs using PySpark.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Manual test

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46384 from gengliangwang/pythonLog.

    Authored-by: Gengliang Wang <gengli...@apache.org>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 docs/configuration.md  |  9 ++++++++-
 python/pyspark/util.py | 16 ++++++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index d07decf02505..a55ce89c096b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3677,8 +3677,15 @@ Starting from version 4.0.0, `spark-submit` has adopted the [JSON Template Layou
 To configure the layout of structured logging, start with the `log4j2.properties.template` file.
-To query Spark logs using Spark SQL, you can use the following Scala code snippet:
+To query Spark logs using Spark SQL, you can use the following Python code snippet:
+```python
+from pyspark.util import LogUtils
+
+logDf = spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")
+```
+
+Or using the following Scala code snippet:
 ```scala
 import org.apache.spark.util.LogUtils.LOG_SCHEMA

diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index f0fa4a2413ce..4920ba957c19 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -107,6 +107,22 @@ class VersionUtils:
     )


+class LogUtils:
+    """
+    Utils for querying structured Spark logs with Spark SQL.
+    """
+
+    LOG_SCHEMA = (
+        "ts TIMESTAMP, "
+        "level STRING, "
+        "msg STRING, "
+        "context map<STRING, STRING>, "
+        "exception STRUCT<class STRING, msg STRING, "
+        "stacktrace ARRAY<STRUCT<class STRING, method STRING, file STRING, line STRING>>>, "
+        "logger STRING"
+    )
+
+
 def fail_on_stopiteration(f: Callable) -> Callable:
     """
     Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError'

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
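As a minimal sketch of what `LOG_SCHEMA` describes, the snippet below parses one structured log line with plain Python (no Spark session required). The sample log line and all of its values are hypothetical, written to match the JSON Template Layout shape implied by the schema; only the top-level field names (`ts`, `level`, `msg`, `context`, `exception`, `logger`) come from the `LOG_SCHEMA` constant added in this commit.

```python
import json

# Top-level column names from the LOG_SCHEMA constant in this commit.
LOG_FIELDS = ["ts", "level", "msg", "context", "exception", "logger"]

# A hypothetical structured log line (not taken from a real Spark run),
# shaped like the JSON output that LOG_SCHEMA is intended to query.
sample_line = json.dumps({
    "ts": "2024-05-04T11:54:49.000Z",
    "level": "ERROR",
    "msg": "Executor lost",
    "context": {"executor_id": "1"},
    "exception": {
        "class": "java.io.IOException",
        "msg": "Connection reset",
        "stacktrace": [
            {"class": "ExampleTask", "method": "run",
             "file": "ExampleTask.scala", "line": "42"}
        ],
    },
    "logger": "org.apache.spark.scheduler.TaskSetManager",
})

record = json.loads(sample_line)
# Every top-level key of the record is covered by a schema column.
unknown = set(record) - set(LOG_FIELDS)
print(unknown)  # set()
```

With a Spark session available, lines of this shape would instead be loaded as a DataFrame via `spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")`, as in the docs change above, so that columns like `level` and `exception.stacktrace` can be queried with Spark SQL.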