Hi Jack,

You may use from_utc_timestamp and to_utc_timestamp and see if they help. For example:

from pyspark.sql.functions import from_utc_timestamp

# Read your Parquet file into a DataFrame
df = spark.read.parquet('parquet_file_path')

# Convert the timestamp column (assuming your column name) from UTC to the
# Pacific/Auckland timezone
df_with_local_timezone = df.withColumn(
    'timestamp',
    from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
)
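
If you need to do the same thing across many tables, something along these
lines may help. This is just a sketch: the convert_timestamps helper below is
my own illustration (not a Spark built-in), and it simply applies the same
conversion to every timestamp column it finds in the schema.

from pyspark.sql import DataFrame
from pyspark.sql.functions import from_utc_timestamp
from pyspark.sql.types import TimestampType

def convert_timestamps(df: DataFrame, tz: str = 'Pacific/Auckland') -> DataFrame:
    # Reinterpret every TimestampType column from UTC to the given timezone
    for field in df.schema.fields:
        if isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, from_utc_timestamp(df[field.name], tz))
    return df

df_with_local_timezone = convert_timestamps(spark.read.parquet('parquet_file_path'))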

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom


View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com> wrote:

> Hi,
>
> I've got a number of tables that I'm loading in from a SQL server. The
> timestamp in SQL server is stored like 2003-11-24T09:02:32. I get these as
> parquet files in our raw storage location and pick them up in Databricks.
> When I load the data in Databricks, the dataframe/spark assumes UTC or
> +0000 on the timestamp, like 2003-11-24T09:02:32+0000. The time and date
> are the same as what's in SQL server, however the offset is incorrect.
>
> I've tried various methods, like the code below, to set the JVM timezone to
> my local timezone, but when viewing the data it seems to just subtract the
> offset from the timestamp and add it to the offset part, like
> 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ has a +13 offset
> during daylight saving)
>
> spark = pyspark.sql.SparkSession \
>     .Builder() \
>     .appName('test') \
>     .master('local') \
>     .config('spark.driver.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>     .config('spark.executor.extraJavaOptions', '-Duser.timezone=Pacific/Auckland') \
>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>     .getOrCreate()
>
>
>
> I understand that in Parquet these are stored as UNIX time and aren't
> timezone aware; however, are there any settings that I can set in Spark that
> would implicitly convert/assume the timestamp from 2003-11-24T09:02:32+0000
> to 2003-11-24T09:02:32+1300? I know this can be done with transformations,
> however I'm trying to avoid doing transformations for every timestamp on
> 100's of tables.
>
> Any help much appreciated, thanks,
>
> Jack
>