[GitHub] [drill] cgivre commented on issue #2746: [DISCUSSION] Use INT96 as default timestamp format in Parquet tables

via GitHub Wed, 01 Feb 2023 05:57:52 -0800


cgivre commented on issue #2746:
URL: https://github.com/apache/drill/issues/2746#issuecomment-1412101491


   I'll weigh in here.  Since this is user-configurable, it seems sensible to 
make INT96 the default and fix the UDFs.  We're about to release 1.21, which 
has a lot of major improvements, so IMHO this would be a good time to do so.
   
   Vova, would you mind explaining how this will break UDFs?
   Best,
   -- C
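
For reference, a minimal sketch of how the option named in the quoted message could be toggled. The `ALTER SYSTEM SET` form follows Drill's documented option syntax; the helper names, the local URL and port, and the use of the REST query endpoint are assumptions for illustration, and nothing is sent over the network here.

```python
import json

# Option name taken from the discussion below.
OPTION = "store.parquet.reader.int96_as_timestamp"
# Assumption: a local Drillbit with the default web port.
DRILL_QUERY_URL = "http://localhost:8047/query.json"


def set_option_statement(option: str, value: bool, scope: str = "SYSTEM") -> str:
    """Build an ALTER SYSTEM/SESSION SET statement for a boolean option."""
    if scope not in ("SYSTEM", "SESSION"):
        raise ValueError("scope must be SYSTEM or SESSION")
    literal = "true" if value else "false"
    return f"ALTER {scope} SET `{option}` = {literal}"


def rest_payload(sql: str) -> str:
    """Wrap a SQL statement in the JSON body used by Drill's query endpoint."""
    return json.dumps({"queryType": "SQL", "query": sql})


statement = set_option_statement(OPTION, True)
payload = rest_payload(statement)
print(statement)
print(f"POST {DRILL_QUERY_URL} with body {payload}")
```

Changing the shipped default would make this step unnecessary for Spark-written tables, which is the point of the proposal.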
   
   
   
   > On Feb 1, 2023, at 7:54 AM, Christian Pfarr ***@***.***> wrote:
   > 
   > 
   > Hi everyone,
   > 
   > I want to raise a discussion about the current behavior in Drill regarding 
Parquet timestamps.
   > 
   > Drill uses INT64 for timestamps, and you can switch to INT96 by setting 
store.parquet.reader.int96_as_timestamp to true. With that, it's not a big 
problem to work with both types of Parquet timestamps, but since Spark uses 
INT96 by default, you have to change this setting in almost all situations, 
especially when working with new lakehouse architectures like Delta Lake and 
Iceberg.
   > 
   > For Spark, it's clearly documented that they use INT96 in all scenarios:
   > 
   > here for reading -> 
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html
   > 
   > Some Parquet-producing systems, in particular Impala and Hive, store 
Timestamp into INT96. This flag tells Spark SQL to interpret INT96 data as a 
timestamp to provide compatibility with these systems.
   > 
   > here for writing -> https://spark.apache.org/docs/latest/configuration.html
   > 
   > Sets which Parquet timestamp type to use when Spark writes data to Parquet 
files. INT96 is a non-standard but commonly used timestamp type in Parquet. 
TIMESTAMP_MICROS is a standard timestamp type in Parquet, which stores number 
of microseconds from the Unix epoch. TIMESTAMP_MILLIS is also standard, but 
with millisecond precision, which means Spark has to truncate the microsecond 
portion of its timestamp value.
   > 
   > Of course we could advise every Drill user to write their Spark jobs with 
spark.sql.parquet.outputTimestampType set to TIMESTAMP_MICROS or 
TIMESTAMP_MILLIS, or to always toggle this Drill option after startup, but 
it's still an additional step.
   > 
   > @vvysotskyi <https://github.com/vvysotskyi> mentioned that if we 
switched this default now, we would have issues with some UDFs, so I think 
it could be a topic for the upcoming Drill 2.0.0 as a breaking change.
   > 
   > What do you think?
   > 
   
   
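To make the difference between the timestamp encodings in the quoted message concrete, here is a rough sketch using only the Python standard library. It assumes the commonly used INT96 layout (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, little-endian) and UTC values; the function names are made up for this example and are not Drill or Spark APIs.

```python
import struct
from datetime import datetime, timedelta, timezone

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_int96(ts: datetime) -> bytes:
    """Pack a UTC datetime as 8 bytes nanos-of-day + 4 bytes Julian day."""
    delta = ts - EPOCH
    nanos_of_day = (delta.seconds * 1_000_000 + delta.microseconds) * 1_000
    julian_day = delta.days + JULIAN_DAY_OF_UNIX_EPOCH
    return struct.pack("<qi", nanos_of_day, julian_day)


def decode_int96(raw: bytes) -> datetime:
    """Unpack a 12-byte INT96 value into a UTC datetime."""
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # datetime resolves microseconds only, so finer digits are dropped here.
    return EPOCH + timedelta(days=days, microseconds=nanos_of_day // 1_000)


def to_timestamp_micros(ts: datetime) -> int:
    """Microseconds since the Unix epoch, as stored by TIMESTAMP_MICROS."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def to_timestamp_millis(ts: datetime) -> int:
    """Milliseconds since the Unix epoch; the microsecond part is truncated."""
    return to_timestamp_micros(ts) // 1_000


ts = datetime(2023, 2, 1, 13, 57, 52, 123456, tzinfo=timezone.utc)
raw = encode_int96(ts)
lost_micros = to_timestamp_micros(ts) - to_timestamp_millis(ts) * 1_000
print(len(raw), decode_int96(raw) == ts, lost_micros)
```

The INT96 value round-trips the microsecond timestamp, while the millisecond form drops the sub-millisecond digits, which is the truncation the Spark documentation mentions.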


