[ https://issues.apache.org/jira/browse/SPARK-41125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jamie updated SPARK-41125:
--------------------------
Description:
I am using Python's pytest library to write unit tests for a PySpark library I am building. pytest has a popular capability called fixtures, which lets us write reusable preparation steps for our tests. I have a simple fixture that creates a {{pyspark.sql.DataFrame}}; it works on Python 3.7, 3.8, 3.9 and 3.10 but fails on Python 3.11. The failing code is in a fixture called {{dataframe_of_purchases}}. Here is my fixtures code:
{code:python}
from decimal import Decimal

import pytest
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import (
    DecimalType,
    IntegerType,
    StringType,
    StructField,
    StructType,
)


@pytest.fixture(scope="session")
def purchases_schema():
    return StructType(
        [
            StructField("Customer", StringType(), True),
            StructField("Store", StringType(), True),
            StructField("Channel", StringType(), True),
            StructField("Product", StringType(), True),
            StructField("Quantity", IntegerType(), True),
            StructField("Basket", StringType(), True),
            StructField("GrossSpend", DecimalType(10, 2), True),
        ]
    )


@pytest.fixture(scope="session")
def dataframe_of_purchases(purchases_schema) -> DataFrame:
    spark = SparkSession.builder.getOrCreate()
    return spark.createDataFrame(
        data=[
            ("Leia", "Hammersmith", "Instore", "Cheddar", 2, "Basket1", Decimal(2.50))
        ],
        schema=purchases_schema,
    )
{code}
This code can be seen here: [https://github.com/jamiekt/jstark/blob/9e1d0e654195932a0765f66db6c8359ed8b60a3b/tests/conftest.py]

The tests run in a GitHub Actions CI pipeline against many different versions of Python on Linux, Windows and macOS. The tests fail only on Python 3.11, and they fail on all platforms:

!screenshot-1.png!
This run can be seen at: [https://github.com/jamiekt/jstark/actions/runs/3457011099]

The error is:
{{_pickle.PicklingError: Could not serialize object: IndexError: tuple index out of range}}

The full stacktrace is:
{quote}
Traceback (most recent call last):
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/serializers.py", line 458, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 602, in dump
    return Pickler.dump(self, obj)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 692, in reducer_override
    return self._function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 565, in _function_reduce
    return self._dynamic_function_reduce(obj)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 546, in _dynamic_function_reduce
    state = _function_getstate(func)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle_fast.py", line 157, in _function_getstate
    f_globals_ref = _extract_code_globals(func.__code__)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in _extract_code_globals
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/.local/share/hatch/env/virtual/jstark/fjzPEUEi/jstark/lib/python3.11/site-packages/pyspark/cloudpickle/cloudpickle.py", line 334, in <dictcomp>
    out_names = {names[oparg]: None for _, oparg in _walk_global_ops(co)}
                 ~~~~~^^^^^^^
IndexError: tuple index out of range
{quote}
This is not a huge blocker for me; however, I assume it will be for someone at some point, so I thought it prudent to report it here.

> Simple call to createDataFrame fails with PicklingError but only on python3.11
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-41125
>                 URL: https://issues.apache.org/jira/browse/SPARK-41125
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.3.1
>            Reporter: Jamie
>            Priority: Minor
>         Attachments: screenshot-1.png
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org