Thanks Mich, sorry, I might have been a bit unclear in my original email.
The timestamps are getting loaded as, for example, 2003-11-24T09:02:32+0000,
but I want them loaded as 2003-11-24T09:02:32+1300. I know how to do this
with various transformations; however, I'm wondering if there are any Spark
or JVM settings I can change so that it assumes +1300 on load instead of
+0000 (as the time in the column is relative to NZ local time, not UTC). I
inspected the Parquet column holding my created date with pyarrow, with the
results below.
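
For clarity, the transformation I already know works is something along
these lines, reinterpreting the stored wall clock as Auckland time
('created' is just an example column name):

from pyspark.sql.functions import to_utc_timestamp

# Treat the naive wall-clock value as Pacific/Auckland local time, so
# 2003-11-24T09:02:32 becomes the instant 2003-11-24T09:02:32+13:00.
df = df.withColumn('created', to_utc_timestamp(df['created'], 'Pacific/Auckland'))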

I had a look in
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md and it
looks like I need isAdjustedToUTC=false (maybe?), but I'm at a loss on how
to set it.
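
From the pyarrow side, it seems the flag is derived from whether the Arrow
type carries a tz, and it only applies to the INT64 TIMESTAMP logical type,
not to the legacy INT96 my files use, so I'm not sure it helps here. A rough
sketch of what I mean (path is made up):

import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime

# A tz-naive Arrow timestamp writes TIMESTAMP(isAdjustedToUTC=false);
# a tz-aware one (e.g. tz='UTC') writes isAdjustedToUTC=true.
table = pa.table({'created': pa.array([datetime(2003, 11, 24, 9, 2, 32)],
                                      pa.timestamp('us'))})
pq.write_table(table, '/tmp/naive.parquet')
print(pq.ParquetFile('/tmp/naive.parquet').schema)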

<pyarrow._parquet.ColumnChunkMetaData object at 0xdeadbeef>
  file_offset: 6019
  file_path:
  physical_type: INT96
  num_values: 4
  path_in_schema: created
  is_stats_set: False
  statistics:
    None
  compression: SNAPPY
  encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
  has_dictionary_page: False
  dictionary_page_offset: None
  data_page_offset: 6019
  total_compressed_size: 90
  total_uncompressed_size: 103

On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Jack,
>
> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>
> from pyspark.sql.functions import from_utc_timestamp
>
> You can read your Parquet file into a DataFrame:
>
> df = spark.read.parquet('parquet_file_path')
>
> # Convert timestamps (assuming your column is named 'timestamp') from
> # UTC to the Pacific/Auckland timezone
>
> df_with_local_timezone = df.withColumn(
>     'timestamp',
>     from_utc_timestamp(df['timestamp'], 'Pacific/Auckland')
> )
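>
> If you need this across many columns and tables, a rough sketch (assuming
> every TimestampType column needs the same conversion) could be:
>
> from pyspark.sql.types import TimestampType
>
> # Apply the same conversion to every timestamp column in the dataframe.
> for field in df.schema.fields:
>     if isinstance(field.dataType, TimestampType):
>         df = df.withColumn(field.name,
>                            from_utc_timestamp(df[field.name], 'Pacific/Auckland'))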
>
> HTH
>
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Wed, 6 Sept 2023 at 04:19, Jack Goodson <jackagood...@gmail.com> wrote:
>
>> Hi,
>>
>> I've got a number of tables that I'm loading in from a SQL Server. The
>> timestamps in SQL Server are stored like 2003-11-24T09:02:32. I get these
>> as Parquet files in our raw storage location and pick them up in
>> Databricks. When I load the data in Databricks, the dataframe/Spark
>> assumes UTC, i.e. +0000, on the timestamp, like 2003-11-24T09:02:32+0000.
>> The time and date are the same as what's in SQL Server, but the offset is
>> incorrect.
>>
>> I've tried various methods, like the code below, to set the JVM timezone
>> to my local timezone, but when viewing the data it seems to just subtract
>> the offset from the timestamp and add it to the offset part, like
>> 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ is +13 during
>> daylight saving).
>>
>> import pyspark
>>
>> spark = pyspark.sql.SparkSession \
>>     .builder \
>>     .appName('test') \
>>     .master('local') \
>>     .config('spark.driver.extraJavaOptions',
>>             '-Duser.timezone=Pacific/Auckland') \
>>     .config('spark.executor.extraJavaOptions',
>>             '-Duser.timezone=Pacific/Auckland') \
>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>     .getOrCreate()
>>
>>
>>
>> I understand that in Parquet these are stored as UNIX time and aren't
>> timezone-aware. However, are there any settings I can set in Spark that
>> would implicitly convert/assume the timestamp, from
>> 2003-11-24T09:02:32+0000 to 2003-11-24T09:02:32+1300? I know this can be
>> done with transformations, but I'm trying to avoid writing a
>> transformation for every timestamp column across hundreds of tables.
>>
>> Any help much appreciated, thanks,
>>
>> Jack
>>
>>
>>
>>
