[jira] [Updated] (SPARK-34195) Base value for parsing two-digit year should be made configurable

2021-01-21 Thread Anthony Smart (Jira)


 [ https://issues.apache.org/jira/browse/SPARK-34195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anthony Smart updated SPARK-34195:
--
Description: 
Within Spark, the base value for parsing a two-digit year date string is set to 2000. If we try to parse "10-JAN-97", it will be interpreted as 2097 instead of 1997.

I'm unclear as to why this base value has been changed within Spark, as the standard Python datetime module uses the more sensible value of 69 as the boundary cut-off for determining which century to apply.

Reference: [https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html]

Other libraries, e.g. .NET Core, use 29 as the boundary cut-off. A base value of 2000, however, is rather non-functional: most dates encountered in the real world span both centuries. I therefore propose that this behaviour be reverted to match the Python datetime module, and/or that the base value be exposed as an option on the various date functions. This would ensure consistent behaviour between Python and PySpark.
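
To make the proposal concrete, the sort of cut-off option being asked for can be approximated today with a small wrapper. The helper below is only a sketch of my own (not an existing Spark API) and assumes the 'dd-MMM-yy' layout used in the examples further down:

import pyspark.sql.functions as F

def to_date_with_cutoff(c, cutoff=69):
    # Rebuild a four-digit year from the trailing two digits, pivoting at `cutoff`
    # (Python-style: >= cutoff -> 19xx, otherwise 20xx), then parse as usual.
    yy = F.substring(c, 8, 2)
    century = F.when(yy.cast('int') >= cutoff, F.lit('19')).otherwise(F.lit('20'))
    return F.to_date(F.concat(F.substring(c, 1, 7), century, yy), 'dd-MMM-yyyy')

# e.g. df.select(to_date_with_cutoff('dt'))  # '10-JAN-69' -> 1969-01-10, '10-JAN-68' -> 2068-01-10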

 

Python:

import datetime
datetime.datetime.strptime('10-JAN-69', '%d-%b-%y').date()
Out[118]: datetime.date(1969, 1, 10)
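
For comparison, the cut-off itself is visible one step on the other side of the boundary (standard CPython behaviour, added here for completeness):

# %y maps 69-99 to 1969-1999 and 00-68 to 2000-2068
datetime.datetime.strptime('10-JAN-68', '%d-%b-%y').date()
# -> datetime.date(2068, 1, 10)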
  
PySpark:

import pyspark.sql.functions as F
df = spark.createDataFrame([('10-JAN-69',)], ['dt'])
df.withColumn("newdate", F.from_unixtime(F.unix_timestamp("dt", "dd-MMM-yy"), "dd-MM-yyyy")).collect()
Out[117]: [Row(dt='10-JAN-69', newdate='10-01-2069')]
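
On the Spark side, both values land in the 2000s regardless of which side of 69 they fall on. A quick check (assumes an active Spark 3.x session named spark; the results shown follow from the documented base value of 2000):

df2 = spark.createDataFrame([('10-JAN-68',), ('10-JAN-97',)], ['dt'])
df2.select('dt', F.to_date('dt', 'dd-MMM-yy').alias('parsed')).collect()
# -> [Row(dt='10-JAN-68', parsed=datetime.date(2068, 1, 10)),
#     Row(dt='10-JAN-97', parsed=datetime.date(2097, 1, 10))]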
  
  
  
As a workaround I had to write my own solution to deal with this. The code below is specific to my data pipeline, but it gives an idea of what was needed just to change the boundary cut-off so that two-digit years are handled sensibly.
  
from pyspark.sql.functions import to_date, col, trim
import pyspark.sql.functions as F  # F.* is used throughout the function below

def convert_dtypes(entity, schema, boundary="40"):
    # std_df is the pipeline DataFrame being converted (defined elsewhere in my pipeline)
    cols = []
    for x in schema[entity]:
        for c in std_df.columns:
            if x['name'] == c:
                if x['dtype'] == 'date':
                    # split a 'dd-MMM-yy' string into its parts
                    dd = F.substring(c, 1, 2)
                    MMM = F.substring(c, 4, 3)
                    yy = F.substring(c, 8, 2)
                    n = (
                        F.when(trim(col(c)) == "", None).otherwise(
                            # two-digit years >= boundary become 19xx, the rest 20xx
                            F.when(yy >= boundary,
                                   F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("19"), yy)).otherwise(
                                F.concat(dd, F.lit("-"), MMM, F.lit("-"), F.lit("20"), yy)))
                    )
                    cols.append(to_date(n, 'dd-MMM-yyyy').alias(c))
                else:
                    cols.append(col(c).cast(x['dtype']))
                    # cols[-1].nullable = x['nullable']
    return std_df.select(*cols)
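
For completeness, an illustrative call; the entity name, schema contents and sample data below are made up purely to show the input shape the function expects:

# Hypothetical schema/entity, only to illustrate the structure convert_dtypes reads.
schema = {
    'orders': [
        {'name': 'order_dt', 'dtype': 'date', 'nullable': True},
        {'name': 'amount', 'dtype': 'double', 'nullable': False},
    ]
}
std_df = spark.createDataFrame([('10-JAN-69', '12.50')], ['order_dt', 'amount'])
convert_dtypes('orders', schema).show()
# '69' >= '40', so order_dt resolves to 1969-01-10; amount is cast to double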
  
