GitHub user gberger commented on the issue:

    https://github.com/apache/spark/pull/19792

Hey all,

## Error message

I revamped the error message and made it "recursive", similar to @HyukjinKwon. Here's an example:

```
>>> _merge_type(
...     StructType([StructField("f1", ArrayType(MapType(StringType(), LongType())))]),
...     StructType([StructField("f1", ArrayType(MapType(DoubleType(), LongType())))])
... )
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1129, in _merge_type
    for f in a.fields]
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1129, in <listcomp>
    for f in a.fields]
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1137, in _merge_type
    return ArrayType(_merge_type(a.elementType, b.elementType, field=field+'.arrayElement'), True)
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1140, in _merge_type
    return MapType(_merge_type(a.keyType, b.keyType, field=field+'.mapKey'),
  File "/Users/gberger/Projects/spark/python/pyspark/sql/types.py", line 1122, in _merge_type
    raise TypeError("%s: Can not merge type %s and %s" % (field, type(a), type(b)))
TypeError: .structField("f1").arrayElement.mapKey: Can not merge type <class 'pyspark.sql.types.StringType'> and <class 'pyspark.sql.types.DoubleType'>
```

Happy to iterate on the exact formatting or wording of the path shown.

## Tests

I also wrote a set of tests. I hope they are comprehensive enough, but I'm happy to add more if not. @ueshin

## Benchmark

On my machine, the time for one nested `_merge_type` call has increased from ~27.6 microseconds to ~28.9 microseconds (mean over the runs below, each of which makes 100,000 calls), around a 5% increase. This is most likely due to the string concatenation that happens every time `_merge_type` goes one level down from a StructType, ArrayType or MapType. I'm not sure if there's a better way to propagate this information down the stack; maybe a tuple?
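To make the tuple idea concrete, here is a minimal sketch. It is not the PR's code: `merge`, `format_path` and `Map` are hypothetical names, and schemas are plain Python values (a dict for a struct, a one-element list for an array, a string for an atomic type) rather than pyspark types. The path is carried as a tuple of segments and only turned into a string when an error is actually raised.

```python
from collections import namedtuple

# Hypothetical toy map type; NOT pyspark's MapType.
Map = namedtuple("Map", ["key", "value"])


def format_path(path):
    """Render a tuple of path segments as a dotted string (only on error)."""
    parts = []
    for seg in path:
        if isinstance(seg, tuple):  # ("structField", name)
            parts.append('.%s("%s")' % seg)
        else:  # "arrayElement", "mapKey" or "mapValue"
            parts.append("." + seg)
    return "".join(parts)


def merge(a, b, path=()):
    """Merge two toy schemas, tracking the current location as a tuple."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(b)  # fields present only in b are kept as-is
        for name, a_type in a.items():
            if name in b:
                merged[name] = merge(a_type, b[name],
                                     path + (("structField", name),))
            else:
                merged[name] = a_type
        return merged
    if isinstance(a, Map) and isinstance(b, Map):
        return Map(merge(a.key, b.key, path + ("mapKey",)),
                   merge(a.value, b.value, path + ("mapValue",)))
    if isinstance(a, list) and isinstance(b, list):
        return [merge(a[0], b[0], path + ("arrayElement",))]
    if a == b:
        return a
    # The path string is built here, once, instead of at every level.
    raise TypeError("%s: Can not merge type %s and %s"
                    % (format_path(path), a, b))


try:
    merge({"f1": [Map("string", "long")]}, {"f1": [Map("double", "long")]})
except TypeError as e:
    print(e)
# prints: .structField("f1").arrayElement.mapKey: Can not merge type string and double
```

Note that `path + (segment,)` still allocates a new tuple at each level, so whether this is actually cheaper than string concatenation would have to be measured rather than assumed.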
Code used for the benchmark:

```
from pyspark.sql.types import *
from pyspark.sql.types import _merge_type
import time


def test_f():
    # Merge two identical nested schemas (struct -> array -> map).
    return _merge_type(
        StructType([StructField("f1", ArrayType(MapType(StringType(), LongType())))]),
        StructType([StructField("f1", ArrayType(MapType(StringType(), LongType())))])
    )


def timing(f):
    # Wrap f so that each call runs it 100000 times and prints the total wall time.
    def wrap(*args):
        time1 = time.time()
        for __ in range(100000):
            ret = f(*args)
        time2 = time.time()
        print('took %0.3f ms' % ((time2-time1)*1000.0))
        return ret
    return wrap


# Ten runs of 100000 calls each.
for _ in range(10):
    timing(test_f)()
```

Before:

> took 2701.337 ms
> took 2905.867 ms
> took 2725.119 ms
> took 2796.098 ms
> took 2718.981 ms
> took 2773.560 ms
> took 2717.995 ms
> took 2796.466 ms
> took 2716.173 ms
> took 2744.121 ms

After:

> took 2865.038 ms
> took 2836.403 ms
> took 2871.871 ms
> took 2827.625 ms
> took 2820.170 ms
> took 2873.976 ms
> took 2833.609 ms
> took 2909.599 ms
> took 3162.108 ms
> took 2940.864 ms
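To put the runs above on a per-call basis, the following sketch converts each run (100,000 calls) to microseconds per call and compares the mean and median before and after. The timings are copied from the output above; the helper names (`per_call_us`, `summarize`) are just for illustration.

```python
from statistics import mean, median

CALLS_PER_RUN = 100000  # iterations of the timing loop in the benchmark

before_ms = [2701.337, 2905.867, 2725.119, 2796.098, 2718.981,
             2773.560, 2717.995, 2796.466, 2716.173, 2744.121]
after_ms = [2865.038, 2836.403, 2871.871, 2827.625, 2820.170,
            2873.976, 2833.609, 2909.599, 3162.108, 2940.864]


def per_call_us(total_ms):
    """Convert the wall time of one run (ms) to microseconds per call."""
    return total_ms * 1000.0 / CALLS_PER_RUN


def summarize(runs_ms):
    """Return mean and median per-call time (microseconds) over all runs."""
    per_call = [per_call_us(t) for t in runs_ms]
    return {"mean_us": mean(per_call), "median_us": median(per_call)}


before = summarize(before_ms)
after = summarize(after_ms)
for key in ("mean_us", "median_us"):
    change = (after[key] - before[key]) / before[key] * 100.0
    print("%s: %.2f -> %.2f (%+.1f%%)" % (key, before[key], after[key], change))
```

With only ten runs per side and one visible outlier in the "after" set (3162.108 ms), the median is probably the more robust of the two figures.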