This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 9a45da21dd1c [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs
9a45da21dd1c is described below

commit 9a45da21dd1c7dd93152f7126c8c611b8ba031e7
Author: Gengliang Wang <gengli...@apache.org>
AuthorDate: Sat May 4 11:54:49 2024 -0700

    [SPARK-48129][PYTHON] Provide a constant table schema in PySpark for querying structured logs

    ### What changes were proposed in this pull request?

    Similar to https://github.com/apache/spark/pull/46375/, this PR provides a constant table schema
    in PySpark for querying structured logs. The doc of logging configuration is also updated.

    ### Why are the changes needed?

    Provide a convenient way to query Spark logs using PySpark.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Manual test

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #46384 from gengliangwang/pythonLog.

    Authored-by: Gengliang Wang <gengli...@apache.org>
    Signed-off-by: Dongjoon Hyun <dh...@apple.com>
---
 docs/configuration.md  |  9 ++++++++-
 python/pyspark/util.py | 16 ++++++++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index d07decf02505..a55ce89c096b 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -3677,8 +3677,15 @@ Starting from version 4.0.0, `spark-submit` has adopted the [JSON Template Layou
 To configure the layout of structured logging, start with the `log4j2.properties.template` file.
-To query Spark logs using Spark SQL, you can use the following Scala code snippet:
+To query Spark logs using Spark SQL, you can use the following Python code snippet:
+```python
+from pyspark.util import LogUtils
+
+logDf = spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")
+```
+
+Or using the following Scala code snippet:
 ```scala
 import org.apache.spark.util.LogUtils.LOG_SCHEMA

diff --git a/python/pyspark/util.py b/python/pyspark/util.py
index f0fa4a2413ce..4920ba957c19 100644
--- a/python/pyspark/util.py
+++ b/python/pyspark/util.py
@@ -107,6 +107,22 @@ class VersionUtils:
     )


+class LogUtils:
+    """
+    Utils for querying structured Spark logs with Spark SQL.
+    """
+
+    LOG_SCHEMA = (
+        "ts TIMESTAMP, "
+        "level STRING, "
+        "msg STRING, "
+        "context map<STRING, STRING>, "
+        "exception STRUCT<class STRING, msg STRING, "
+        "stacktrace ARRAY<STRUCT<class STRING, method STRING, file STRING, line STRING>>>, "
+        "logger STRING"
+    )
+
+
 def fail_on_stopiteration(f: Callable) -> Callable:
     """
     Wraps the input function to fail on 'StopIteration' by raising a 'RuntimeError'

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
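As a minimal sketch of what `LOG_SCHEMA` describes, the snippet below parses one structured log line with plain Python (no Spark session required). The sample log line and all of its values are hypothetical, written to match the JSON Template Layout shape implied by the schema; only the top-level field names (`ts`, `level`, `msg`, `context`, `exception`, `logger`) come from the `LOG_SCHEMA` constant added in this commit.

```python
import json

# Top-level column names from the LOG_SCHEMA constant in this commit.
LOG_FIELDS = ["ts", "level", "msg", "context", "exception", "logger"]

# A hypothetical structured log line (not taken from a real Spark run),
# shaped like the JSON output that LOG_SCHEMA is intended to query.
sample_line = json.dumps({
    "ts": "2024-05-04T11:54:49.000Z",
    "level": "ERROR",
    "msg": "Executor lost",
    "context": {"executor_id": "1"},
    "exception": {
        "class": "java.io.IOException",
        "msg": "Connection reset",
        "stacktrace": [
            {"class": "ExampleTask", "method": "run",
             "file": "ExampleTask.scala", "line": "42"}
        ],
    },
    "logger": "org.apache.spark.scheduler.TaskSetManager",
})

record = json.loads(sample_line)
# Every top-level key of the record is covered by a schema column.
unknown = set(record) - set(LOG_FIELDS)
print(unknown)  # set()
```

With a Spark session available, lines of this shape would instead be loaded as a DataFrame via `spark.read.schema(LogUtils.LOG_SCHEMA).json("path/to/logs")`, as in the docs change above, so that columns like `level` and `exception.stacktrace` can be queried with Spark SQL.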