This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/branch-3.3 by this push:
     new 2672624931d  [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors
2672624931d is described below

commit 2672624931dd4784fad6cdd912e3669c83741060
Author: Xinrong Meng <xinrong.m...@databricks.com>
AuthorDate: Sun May 15 09:25:02 2022 +0900

    [SPARK-38953][PYTHON][DOC] Document PySpark common exceptions / errors

    ### What changes were proposed in this pull request?
    Document common PySpark (SQL, pandas API on Spark, and Py4J) exceptions/errors and their respective solutions.

    ### Why are the changes needed?
    To make PySpark debugging easier. There are common exceptions/errors in PySpark SQL, pandas API on Spark, and Py4J. Documenting these exceptions and their respective solutions helps users debug PySpark.

    ### Does this PR introduce _any_ user-facing change?
    No. Documentation change only.

    ### How was this patch tested?
    Manual test.

    <img width="1019" alt="image" src="https://user-images.githubusercontent.com/47337188/165145874-b0de33b1-835a-459d-9062-94086e62e254.png">

    Please refer to https://github.com/apache/spark/blob/7a1c7599a21cbbe2778707b72643cf98ac601ab1/python/docs/source/development/debugging.rst#common-exceptions--errors for the whole rendered page.

    Closes #36267 from xinrong-databricks/common_err.
Authored-by: Xinrong Meng <xinrong.m...@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
(cherry picked from commit f940d7adfd6d071bc3bdcc406e01263a7f03e955)
Signed-off-by: Hyukjin Kwon <gurwls...@apache.org>
---
 python/docs/source/development/debugging.rst | 280 +++++++++++++++++++++++++++
 1 file changed, 280 insertions(+)

diff --git a/python/docs/source/development/debugging.rst b/python/docs/source/development/debugging.rst
index 1e6571da028..05c47ae4bf7 100644
--- a/python/docs/source/development/debugging.rst
+++ b/python/docs/source/development/debugging.rst
@@ -332,3 +332,283 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``Ar
 This feature is not supported with registered UDFs.
+
+Common Exceptions / Errors
+--------------------------
+
+PySpark SQL
+~~~~~~~~~~~
+
+**AnalysisException**
+
+``AnalysisException`` is raised when failing to analyze a SQL query plan.
+
+Example:
+
+.. code-block:: python
+
+    >>> df = spark.range(1)
+    >>> df['bad_key']
+    Traceback (most recent call last):
+    ...
+    pyspark.sql.utils.AnalysisException: Cannot resolve column name "bad_key" among (id)
+
+Solution:
+
+.. code-block:: python
+
+    >>> df['id']
+    Column<'id'>
+
+**ParseException**
+
+``ParseException`` is raised when failing to parse a SQL command.
+
+Example:
+
+.. code-block:: python
+
+    >>> spark.sql("select * 1")
+    Traceback (most recent call last):
+    ...
+    pyspark.sql.utils.ParseException:
+    Syntax error at or near '1': extra input '1'(line 1, pos 9)
+
+    == SQL ==
+    select * 1
+    ---------^^^
+
+Solution:
+
+.. code-block:: python
+
+    >>> spark.sql("select *")
+    DataFrame[]
+
+**IllegalArgumentException**
+
+``IllegalArgumentException`` is raised when passing an illegal or inappropriate argument.
+
+Example:
+
+.. code-block:: python
+
+    >>> spark.range(1).sample(-1.0)
+    Traceback (most recent call last):
+    ...
+    pyspark.sql.utils.IllegalArgumentException: requirement failed: Sampling fraction (-1.0) must be on interval [0, 1] without replacement
+
+Solution:
+
+.. code-block:: python
+
+    >>> spark.range(1).sample(1.0)
+    DataFrame[id: bigint]
+
+**PythonException**
+
+``PythonException`` is thrown from Python workers.
+
+You can see the type of exception that was thrown from the Python worker and its stack trace, as ``TypeError`` below.
+
+Example:
+
+.. code-block:: python
+
+    >>> import pyspark.sql.functions as F
+    >>> from pyspark.sql.functions import udf
+    >>> def f(x):
+    ...     return F.abs(x)
+    ...
+    >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+    22/04/12 14:52:31 ERROR Executor: Exception in task 7.0 in stage 37.0 (TID 232)
+    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+    ...
+    TypeError: Invalid argument, not a string or column: -1 of type <class 'int'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
+
+Solution:
+
+.. code-block:: python
+
+    >>> def f(x):
+    ...     return abs(x)
+    ...
+    >>> spark.range(-1, 1).withColumn("abs", udf(f)("id")).collect()
+    [Row(id=-1, abs='1'), Row(id=0, abs='0')]
+
+**StreamingQueryException**
+
+``StreamingQueryException`` is raised when a StreamingQuery fails. Most often, the underlying error is thrown from a Python worker and wrapped as a ``PythonException``.
+
+Example:
+
+.. code-block:: python
+
+    >>> sdf = spark.readStream.format("text").load("python/test_support/sql/streaming")
+    >>> from pyspark.sql.functions import col, udf
+    >>> bad_udf = udf(lambda x: 1 / 0)
+    >>> (sdf.select(bad_udf(col("value"))).writeStream.format("memory").queryName("q1").start()).processAllAvailable()
+    Traceback (most recent call last):
+    ...
+    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+      File "<stdin>", line 1, in <lambda>
+    ZeroDivisionError: division by zero
+    ...
+    pyspark.sql.utils.StreamingQueryException: Query q1 [id = ced5797c-74e2-4079-825b-f3316b327c7d, runId = 65bacaf3-9d51-476a-80ce-0ac388d4906a] terminated with exception: Writing job aborted
+
+Solution:
+
+Fix the StreamingQuery and re-execute the workflow.
+
+**SparkUpgradeException**
+
+``SparkUpgradeException`` is thrown when behavior has changed between Spark versions, typically after upgrading to a newer Spark release.
+
+Example:
+
+.. code-block:: python
+
+    >>> from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
+    >>> df = spark.createDataFrame([("2014-31-12",)], ["date_str"])
+    >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+    >>> df2.collect()
+    Traceback (most recent call last):
+    ...
+    pyspark.sql.utils.SparkUpgradeException: You may get a different result due to the upgrading to Spark >= 3.0: Fail to recognize 'yyyy-dd-aa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+Solution:
+
+.. code-block:: python
+
+    >>> spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
+    >>> df2 = df.select("date_str", to_date(from_unixtime(unix_timestamp("date_str", "yyyy-dd-aa"))))
+    >>> df2.collect()
+    [Row(date_str='2014-31-12', to_date(from_unixtime(unix_timestamp(date_str, yyyy-dd-aa), yyyy-MM-dd HH:mm:ss))=None)]
+
+pandas API on Spark
+~~~~~~~~~~~~~~~~~~~
+
+There are specific common exceptions / errors in pandas API on Spark.
+
+**ValueError: Cannot combine the series or dataframe because it comes from a different dataframe**
+
+Operations involving more than one Series or DataFrame raise a ``ValueError`` if ``compute.ops_on_diff_frames`` is disabled (it is disabled by default). Such operations may be expensive because they require joining the underlying Spark DataFrames, so users should be aware of the cost and enable the flag only when necessary.
+
+Exception:
+
+.. code-block:: python
+
+    >>> import pyspark.pandas as ps
+    >>> ps.Series([1, 2]) + ps.Series([3, 4])
+    Traceback (most recent call last):
+    ...
+    ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
+
+Solution:
+
+.. code-block:: python
+
+    >>> with ps.option_context('compute.ops_on_diff_frames', True):
+    ...     ps.Series([1, 2]) + ps.Series([3, 4])
+    ...
+    0    4
+    1    6
+    dtype: int64
+
+**RuntimeError: Result vector from pandas_udf was not the required length**
+
+Exception:
+
+.. code-block:: python
+
+    >>> import numpy as np
+    >>> def f(x) -> ps.Series[np.int32]:
+    ...     return x[:-1]
+    ...
+    >>> ps.DataFrame({"x": [1, 2], "y": [3, 4]}).transform(f)
+    22/04/12 13:46:39 ERROR Executor: Exception in task 2.0 in stage 16.0 (TID 88)
+    org.apache.spark.api.python.PythonException: Traceback (most recent call last):
+    ...
+    RuntimeError: Result vector from pandas_udf was not the required length: expected 1, got 0
+
+Solution:
+
+.. code-block:: python
+
+    >>> def f(x) -> ps.Series[np.int32]:
+    ...     return x
+    ...
+    >>> ps.DataFrame({"x": [1, 2], "y": [3, 4]}).transform(f)
+       x  y
+    0  1  3
+    1  2  4
+
+Py4J
+~~~~
+
+**Py4JJavaError**
+
+``Py4JJavaError`` is raised when an exception occurs in the Java code invoked through Py4J.
+You can see the type of exception that was thrown on the Java side and its stack trace, as ``java.lang.NullPointerException`` below.
+
+Example:
+
+.. code-block:: python
+
+    >>> spark.sparkContext._jvm.java.lang.String(None)
+    Traceback (most recent call last):
+    ...
+    py4j.protocol.Py4JJavaError: An error occurred while calling None.java.lang.String.
+    : java.lang.NullPointerException
+    ..
+
+Solution:
+
+.. code-block:: python
+
+    >>> spark.sparkContext._jvm.java.lang.String("x")
+    'x'
+
+**Py4JError**
+
+``Py4JError`` is raised when any other error occurs, such as when the Python client program tries to access an object that no longer exists on the Java side.
+
+Example:
+
+.. code-block:: python
+
+    >>> from pyspark.ml.linalg import Vectors
+    >>> from pyspark.ml.regression import LinearRegression
+    >>> df = spark.createDataFrame(
+    ...     [(1.0, 2.0, Vectors.dense(1.0)), (0.0, 2.0, Vectors.sparse(1, [], []))],
+    ...     ["label", "weight", "features"],
+    ... )
+    >>> lr = LinearRegression(
+    ...     maxIter=1, regParam=0.0, solver="normal", weightCol="weight", fitIntercept=False
+    ... )
+    >>> model = lr.fit(df)
+    >>> model
+    LinearRegressionModel: uid=LinearRegression_eb7bc1d4bf25, numFeatures=1
+    >>> model.__del__()
+    >>> model
+    Traceback (most recent call last):
+    ...
+    py4j.protocol.Py4JError: An error occurred while calling o531.toString. Trace:
+    py4j.Py4JException: Target Object ID does not exist for this gateway :o531
+    ...
+
+Solution:
+
+Access an object that exists on the Java side.
+
+**Py4JNetworkError**
+
+``Py4JNetworkError`` is raised when a problem occurs during network transfer (e.g., the connection is lost). In this case, debug the network and rebuild the connection.
+
+Stack Traces
+------------
+
+There are Spark configurations to control stack traces:
+
+- ``spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled`` is true by default to simplify the traceback from Python UDFs.
+
+- ``spark.sql.pyspark.jvmStacktrace.enabled`` is false by default to hide the JVM stacktrace and to show a Python-friendly exception only.
+
+The Spark configurations above are independent of log-level settings. Control log levels through :meth:`pyspark.SparkContext.setLogLevel`.

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
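Editor's note: the documented `StreamingQueryException` comes from `bad_udf = udf(lambda x: 1 / 0)` raising `ZeroDivisionError` inside a Python worker, and the doc's advice is to "fix the StreamingQuery and re-execute". One common fix is to make the UDF body defensive so bad rows produce nulls instead of killing the query. The sketch below shows that pattern in plain Python so it runs without a Spark cluster; `safe_inverse` is a hypothetical helper, and wrapping it with `pyspark.sql.functions.udf` for the real streaming job is assumed, not shown.

```python
# Sketch (assumption-labeled): a defensive UDF body. Instead of letting
# ZeroDivisionError escape the Python worker -- where Spark wraps it as a
# PythonException and terminates the streaming query -- return None so the
# bad row becomes a null in the output column.
def safe_inverse(value):
    """Return 1/value, or None when the input cannot be inverted."""
    try:
        return 1 / float(value)
    except (ZeroDivisionError, TypeError, ValueError):
        return None  # null out bad rows rather than failing the whole query

print(safe_inverse("4"))    # 0.25
print(safe_inverse("0"))    # None (division by zero is swallowed)
print(safe_inverse("abc"))  # None (not a number)
```

In a real job this function would be passed to `udf(...)` and applied to the streaming column in place of `bad_udf`; whether to null out bad rows or repair the input data upstream is a design choice this sketch does not make for you.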