[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable
[ https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anthony Smart updated SPARK-34195: -- Description: The base value for parsing a two-digit year date string is set to 2000 within Spark, so "10-JAN-97" is interpreted as 2097 instead of 1997. I'm unclear why this base value was changed in Spark, as the standard Python datetime module uses the more sensible value of 69 as the boundary cut-off for determining which century to apply. Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html] Other libraries, e.g. .NET Core, use 29 as the boundary cut-off. A base value of 2000 is effectively non-functional: most dates encountered in the real world span both centuries. I therefore propose this behaviour be reverted to match the Python datetime module, and/or that the base value be exposed as an option on the various date functions. This would ensure consistent behaviour between Python and PySpark.

Python:
{code:python}
import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
# Out[118]: datetime.date(1969, 1, 10)
{code}

PySpark:
{code:python}
from pyspark.sql import functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-yyyy")).collect()
# Out[117]: [Row(dt='10-JAN-69', newdate='10-01-2069')]
{code}

As a workaround I had to write my own solution. The code below is specific to my data pipeline, but it shows what was needed just to change the boundary cut-off and better handle two-digit years.
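The Python side of the comparison can be checked directly: CPython's documented {{%y}} pivot maps 69-99 to 1969-1999 and 00-68 to 2000-2068, which is the behaviour this issue asks Spark to match (or at least make configurable).

```python
import datetime

# CPython's %y pivot: two-digit years 69-99 resolve to 19xx, 00-68 to 20xx
print(datetime.datetime.strptime('10-JAN-97', '%d-%b-%y').date())  # 1997-01-10
print(datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date())  # 1969-01-10
print(datetime.datetime.strptime('10-JAN-68', '%d-%b-%y').date())  # 2068-01-10
```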
{code:python}
from pyspark.sql import functions as F
from pyspark.sql.functions import to_date, col, trim

def convert_dtypes(entity, schema, boundary="40"):
    # std_df is the pipeline DataFrame being converted
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    # slice a 'dd-MMM-yy' string into its parts
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    # two-digit years at or above the boundary are 19xx, below it 20xx
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            F.when(yy >= boundary,
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy))
                             .otherwise(
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))
                    )
                    cols.append(to_date(n, 'dd-MMM-yyyy').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
{code}
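The core of the workaround above is just a configurable century pivot. Stripped of the pipeline specifics, the proposal amounts to a helper like the following sketch ({{expand_two_digit_year}} is a name I've made up for illustration, not a Spark or Python API):

```python
def expand_two_digit_year(yy, boundary=69):
    """Map a two-digit year to a full year using a configurable pivot.

    Two-digit years at or above `boundary` fall in the 1900s, the rest
    in the 2000s. boundary=69 mirrors Python's datetime module; .NET
    Core's default is 29; the workaround above uses 40.
    """
    return (1900 if yy >= boundary else 2000) + yy

print(expand_two_digit_year(97))               # 1997
print(expand_two_digit_year(10))               # 2010
print(expand_two_digit_year(40, boundary=40))  # 1940
```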