Martin Bode created SPARK-54421:
-----------------------------------

             Summary: LocalDataToArrowConversion fails on Windows for very low and high datetime/TimestampType values
                 Key: SPARK-54421
                 URL: https://issues.apache.org/jira/browse/SPARK-54421
             Project: Spark
          Issue Type: Bug
          Components: Connect, PySpark
    Affects Versions: 4.0.1
         Environment: Windows 11
            Reporter: Martin Bode


Creating a `DataFrame` from a Python list of dicts fails when some datetime values are very low (< `1970-01-01`) or very high (> `3001-01-19`).

This seems to be specific to {*}Windows OS{*}.
h1. Reproduce
{code:python}
from datetime import datetime

data = [
    {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},  # ❌ causes error
    {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},  # ✅ works fine
    {"id": 3, "some_datetime": datetime(2025, 1, 2, 3, 4, 5)},  # ✅ works fine
    {"id": 4, "some_datetime": datetime(3001, 1, 19, 3, 4, 5)},  # ✅ works fine
    {"id": 5, "some_datetime": datetime(3001, 1, 20, 3, 4, 5)},  # ❌ causes 
error
    {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},  # ❌ causes error
]

df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")

df_testdata.show(truncate=False)
{code}
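The snippet above assumes an active Spark Connect session bound to `spark` (the issue occurs in the Connect client path). A minimal, hypothetical setup could look like this; the endpoint address is only a placeholder:
{code:python}
from pyspark.sql import SparkSession

# placeholder Spark Connect endpoint; adjust host/port to your environment
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()
{code}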
h1. Error
{code:python}
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[76], line 12
      1 from datetime import datetime
      3 data = [
      4     {"id": 1, "some_datetime": datetime(1970, 1, 1, 3, 4, 5)},  # ❌ causes error
      5     {"id": 2, "some_datetime": datetime(1970, 1, 2, 3, 4, 5)},  # ✅ works fine
   (...)      9     {"id": 6, "some_datetime": datetime(9999, 1, 2, 3, 4, 5)},  # ❌ causes error
     10 ]
---> 12 df_testdata = spark.createDataFrame(data=data, schema="id LONG, some_datetime TIMESTAMP")
     14 df_testdata.show(truncate=False)

File c:\...\.venv\Lib\site-packages\pyspark\sql\connect\session.py:707, in SparkSession.createDataFrame(self, data, schema, samplingRatio, verifySchema)
    700     from pyspark.sql.conversion import (
    701         LocalDataToArrowConversion,
    702     )
    704     # Spark Connect will try its best to build the Arrow table with the
    705     # inferred schema in the client side, and then rename the columns and
    706     # cast the datatypes in the server side.
--> 707     _table = LocalDataToArrowConversion.convert(_data, _schema, prefers_large_types)
    709 # TODO: Beside the validation on number of columns, we should also check
    710 # whether the Arrow Schema is compatible with the user provided Schema.
    711 if _num_cols is not None and _num_cols != _table.shape[1]:

File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:347, in LocalDataToArrowConversion.convert(data, schema, use_large_var_types)
    345 if isinstance(item, dict):
    346     for i, col in enumerate(column_names):
--> 347         pylist[i].append(column_convs[i](item.get(col)))
    348 else:
    349     if len(item) != len(column_names):

File c:\...\.venv\Lib\site-packages\pyspark\sql\conversion.py:222, in LocalDataToArrowConversion._create_converter.<locals>.convert_timestamp(value)
    220 else:
    221     assert isinstance(value, datetime.datetime)
--> 222     return value.astimezone(datetime.timezone.utc)

OSError: [Errno 22] Invalid argument
{code}
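For reference, the failing call can be reproduced without Spark at all: converting a naive `datetime` with `astimezone()` goes through the platform's local-time functions, and on Windows those reject pre-epoch (and far-future) timestamps. A minimal sketch, assuming a Windows host:
{code:python}
from datetime import datetime, timezone

# in-range naive value: the local-time conversion succeeds on Windows
print(datetime(2025, 1, 2, 3, 4, 5).astimezone(timezone.utc))

# pre-epoch naive value: on Windows this raises
# OSError: [Errno 22] Invalid argument, matching the traceback above
print(datetime(1970, 1, 1, 3, 4, 5).astimezone(timezone.utc))
{code}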


