ybhaw opened a new pull request, #50726: URL: https://github.com/apache/spark/pull/50726
### What changes were proposed in this pull request?

The current implementation of [_type_verifier](https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/python/pyspark/sql/types.py#L2607) does not support classes extending the [acceptable types](https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/python/pyspark/sql/types.py#L2568). Here is a small test case that fails with the current implementation:

<details>
<summary>Sample test case that fails currently</summary>

```python
import unittest

from pyspark.sql.types import StructType, _make_type_verifier


class ExtendedStructType(StructType):
    ...


class SampleTest(unittest.TestCase):
    def test_extended_struct_type(self):
        schema = ExtendedStructType([])
        _make_type_verifier(schema)([])
```

<details>
<summary>Failure logs</summary>

```sh
Failure
Traceback (most recent call last):
  File ".../spark/python/pyspark/sql/tests/test_types.py", line 3016, in test_extended_struct_type
    _make_type_verifier(schema)([])
  File ".../spark/python/pyspark/sql/types.py", line 2947, in verify
    verify_value(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2878, in verify_struct
    assert_acceptable_types(obj)
  File ".../spark/python/pyspark/sql/types.py", line 2707, in assert_acceptable_types
    assert _type in _acceptable_types, new_msg(
AssertionError: unknown datatype: StructType([]) for object []
```

</details>
</details>

**This happens because** the current implementation uses `type(data_type)`, which does not evaluate to *StructType* for classes extending *StructType*, so the exact-type lookup misses. [(ref)](https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/python/pyspark/sql/types.py#L2704)

### Why are the changes needed?

**Proposal:** change the implementation to use `isinstance()` instead of `type()`.

I believe inheritance should be allowed for DataTypes, as it enables users to add behavior, validations, or schematic meaning to them.
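The difference is generic Python behavior, not Spark-specific. A minimal sketch (toy class names, not the real pyspark types) shows why an exact-`type` dictionary lookup rejects subclasses while an `isinstance`-style check accepts them:

```python
class Base:            # stand-in for StructType
    pass


class Extended(Base):  # stand-in for a user-defined subclass
    pass


# like _acceptable_types: a dict keyed by the exact DataType class
acceptable = {Base: (tuple, list)}

dt = Extended()

# Current check: type(dt) is Extended, not Base, so the exact-type
# lookup misses and the verifier reports "unknown datatype".
print(type(dt) in acceptable)             # False

# Proposed direction: isinstance resolves a subclass instance to its
# base class entry.
print(isinstance(dt, tuple(acceptable)))  # True
```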
<details>
<summary>Example: my use case that is failing currently</summary>

I was trying to achieve this behavior:

```python
class Schema(StructType):
    """Some implementation allowing class attributes as fields of StructType"""


class Person(Schema):
    name = StructField("name", StringType())


person = Person()  # Equivalent to StructType([StructField("name", StringType())])

# this was failing
df = spark.createDataFrame({...}, schema=Person())
```

</details>

> If you fix a bug, you can clarify why it is a bug.

The current implementation only needs to check the behavior of a data type, but by using `type` it also restricts inheritance; the same check can be performed with `isinstance`. If inheritance is not desirable, then the types should be annotated with `@final` instead. In either case, I would consider this a bug.

### Does this PR introduce _any_ user-facing change?

No. This PR does not change any existing user-facing behavior, but it allows users to extend DataTypes if they need to.

### How was this patch tested?

There are already a few unit tests for `_make_type_verifier` [(here)](https://github.com/apache/spark/blob/b634978936499f58f8cb2e8ea16339feb02ffb52/python/pyspark/sql/tests/test_types.py#L2488) that test against the directly supported data types. I created a copy of those tests that checks against extended data types instead of the direct ones.

### Was this patch authored or co-authored using generative AI tooling?

No.
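As a self-contained illustration of the testing approach, here is a toy verifier using the proposed `isinstance`-based lookup; a subclass instance now verifies where an exact-`type` lookup would fail. All names below are illustrative stand-ins, not the real pyspark code:

```python
class DataType:                      # toy stand-in hierarchy
    pass


class StructType(DataType):
    pass


class ExtendedStructType(StructType):
    pass


# like _acceptable_types: maps a DataType class to accepted Python types
_acceptable = {StructType: (tuple, list, dict)}


def make_verifier(data_type):
    """Toy verifier using an isinstance-based lookup instead of type(data_type)."""
    def verify(obj):
        for known, accepted in _acceptable.items():
            if isinstance(data_type, known):
                # data type is known; now check the value itself
                assert isinstance(obj, accepted), f"unacceptable value {obj!r}"
                return
        raise AssertionError(f"unknown datatype: {data_type!r}")
    return verify


# Mirrors the new tests: an extended type verifies successfully.
make_verifier(ExtendedStructType())([])
```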
