Thanks Mich, sorry, I might have been a bit unclear in my original email.

The timestamps are getting loaded as, for example, 2003-11-24T09:02:32+0000,
but I want them loaded as 2003-11-24T09:02:32+1300. I know how to do this
with various transformations, but I'm wondering whether there are any Spark
or JVM settings I can change so that it assumes +1300 on load instead of
+0000 (the time in the column is relative to NZ local time, not UTC).

I inspected the Parquet column holding my created date with pyarrow; the
results are below.
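For reference, a minimal pyarrow sketch of the kind of inspection that
produces a dump like the one below (the file name is a placeholder):

import pyarrow.parquet as pq

meta = pq.read_metadata('created_table.parquet')  # placeholder file name
print(meta.row_group(0).column(0))  # the ColumnChunkMetaData shown below
# The Parquet-level schema, where a TIMESTAMP logical type (and its
# isAdjustedToUTC flag) would appear if the column carried one:
print(pq.ParquetFile('created_table.parquet').schema)

Note that the column's physical type is INT96, the legacy timestamp encoding
that predates the TIMESTAMP logical type, so there may be no isAdjustedToUTC
annotation on this column to change in the first place.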
<pyarrow._parquet.ColumnChunkMetaData object at 0xdeadbeef>
  file_offset: 6019
  file_path:
  physical_type: INT96
  num_values: 4
  path_in_schema: created
  is_stats_set: False
  statistics: None
  compression: SNAPPY
  encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 6019
  total_compressed_size: 90
  total_uncompressed_size: 103

I had a look in here
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md and it
looks like I need isAdjustedToUTC=false (maybe?) but am at a loss on how to
set it.

On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Jack,
>
> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>
> from pyspark.sql.functions import from_utc_timestamp
>
> You can read your Parquet file into a DataFrame:
>
> df = spark.read.parquet('parquet_file_path')
>
> # Convert timestamps (assuming your column name) from UTC to the
> # Pacific/Auckland timezone
> df_with_local_timezone = df.withColumn(
>     'timestamp',
>     from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
> )
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising
> from such loss, damage or destruction.
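For completeness, a sketch generalizing the withColumn call above to every
timestamp column in a frame, so the conversion needn't be written out per
column. The helper name is made up and the spark session is assumed from the
snippet above; pass to_utc_timestamp instead if the offset needs to move the
other way (it re-interprets the stored wall-clock as local time in the given
zone):

from pyspark.sql.functions import from_utc_timestamp
from pyspark.sql.types import TimestampType

def convert_timestamp_columns(df, tz='Pacific/Auckland',
                              convert=from_utc_timestamp):
    # Rewrite every TimestampType column with the chosen conversion.
    for field in df.schema.fields:
        if isinstance(field.dataType, TimestampType):
            df = df.withColumn(field.name, convert(df[field.name], tz))
    return df

df_with_local_timezone = convert_timestamp_columns(
    spark.read.parquet('parquet_file_path'))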
> On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com> wrote:
>
>> Hi,
>>
>> I've got a number of tables that I'm loading in from a SQL server. The
>> timestamps in SQL server are stored like 2003-11-24T09:02:32. I get
>> these as parquet files in our raw storage location and pick them up in
>> Databricks.
>>
>> When I load the data in Databricks, the dataframe/Spark assumes UTC
>> (+0000) on the timestamp, like 2003-11-24T09:02:32+0000. The time and
>> date are the same as what's in SQL server, but the offset is incorrect.
>>
>> I've tried various methods, like the code below, to set the JVM timezone
>> to my local timezone, but when viewing the data it seems to just
>> subtract the offset from the timestamp and add it to the offset part,
>> like 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ is +13
>> during daylight saving).
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder \
>>     .appName('test') \
>>     .master('local') \
>>     .config('spark.driver.extraJavaOptions',
>>             '-Duser.timezone=Pacific/Auckland') \
>>     .config('spark.executor.extraJavaOptions',
>>             '-Duser.timezone=Pacific/Auckland') \
>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>     .getOrCreate()
>>
>> I understand that in Parquet these are stored as UNIX time and aren't
>> timezone aware. However, are there any settings I can set in Spark that
>> would implicitly convert/assume the timestamp from
>> 2003-11-24T09:02:32+0000 to 2003-11-24T09:02:32+1300? I know this can be
>> done with transformations, but I'm trying to avoid doing transformations
>> for every timestamp on hundreds of tables.
>>
>> Any help much appreciated, thanks,
>>
>> Jack
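Since the JVM-timezone attempt above only changes how the underlying instant
is rendered, here is a minimal sketch of the other direction, using the
created column from the metadata dump above and a placeholder path:
to_utc_timestamp re-interprets the stored wall-clock as Pacific/Auckland
time and moves the instant itself, which should give the desired
2003-11-24T09:02:32+1300 reading.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_utc_timestamp

spark = SparkSession.builder.appName('tz-check').master('local').getOrCreate()
# The session timezone controls only how timestamps are rendered.
spark.conf.set('spark.sql.session.timeZone', 'Pacific/Auckland')

df = spark.read.parquet('raw/created_table')  # placeholder path
# Treat the stored wall-clock (loaded as +0000) as NZ local time: the
# instant moves from 2003-11-24T09:02:32Z to 2003-11-23T20:02:32Z, which
# renders as 2003-11-24 09:02:32 in this +13:00 session.
fixed = df.withColumn('created',
                      to_utc_timestamp(col('created'), 'Pacific/Auckland'))
fixed.select('created').show(truncate=False)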