I think the `to_timestamp_*` sounds like a good idea to me On Sat, Apr 17, 2021 at 3:00 PM Evan Chan <e...@urbanlogiq.com> wrote:
> Andrew and others, > > Thanks for your input. > > > > > 3. For functions (e.g. date_trunc(...)), I think the infrastructure for > > multiple type signatures exists, we just need implementations for > different > > resolutions > > Sure, we can add more type support in date_trunc() and other functions. > > > 4. For converting back and forth from different resolutions in SQL Server > > > > For 3, given there is seemingly no SQL standard way to do this, I suggest > > adding a new, non-postgres function like `convert_resolution` that > performs > > the desired conversion. This function could be implemented today as a > user > > defined function (UDF), and/or included as a built in function in > > DataFusion in the future. > > What do folks think about adding functions like the following to Arrow? > > * to_timestamp_millis(STRING) -> converts string timestamps to timestamps > with millis resolution > * to_timestamp_millis(Timestamp(xxx)) -> converts or truncates other > timestamp column types to millis resolution > * to_timestamp_micros(…) > * to_timestamp_seconds(…) > > -Evan > > > > > Andrew > > > > [1] https://github.com/apache/arrow/pull/10005#discussion_r612551640 < > https://github.com/apache/arrow/pull/10005#discussion_r612551640> > > > > > > > > On Thu, Apr 15, 2021 at 4:41 PM Evan Chan <e...@urbanlogiq.com <mailto: > e...@urbanlogiq.com>> wrote: > > > >> Hi folks, > >> > >> So currently Arrow Rust/DataFusion supports four types of Timestamp > >> arrays, with Nano, Micro, Millisecond and Second resolution. However, > the > >> best supported by far are Nanos. For example, in DataFusion, the > following > >> only works for Nanos and not the other resolutions: > >> * CAST(x as TIMESTAMP) -> Nanos only > >> * date_trunc() -> nanos only > >> * filtering a timestamp array, eg my column > > >> to_timestamp(‘2020-06-30T12:00Z’) > >> > >> In the broader SQL world, in general there seems to be only a single > >> Timestamp type in most databases, though in many cases there is a > variable > >> resolution. > >> PostGres’s TIMESTAMP type is microsecond based: > >> https://www.postgresql.org/docs/9.1/datatype-datetime.html < > >> https://www.postgresql.org/docs/9.1/datatype-datetime.html < > https://www.postgresql.org/docs/9.1/datatype-datetime.html>> > >> > >> For UrbanLogiq, in some cases we would like to standardize on > millisecond > >> resolution. Nanoseconds yields only ~300 years of span for i64, which > >> isn’t enough for some applications. At minimum, we’d like to get the > >> following working: > >> * Either something like CAST(x AS TIMESTAMP(Milliseconds)) or > >> date_trunc(‘milliseconds’, x) , which can cast different types of > timestamp > >> arrays to Timestamp(Milliseconds, None) > >> * filtering that can compare, ideally, different types of timestamp > >> columns to a to_timestamp(….) > >> > >> The last problem is easy to solve as the coercion logic can be fixed to > >> address that, but solving the first problem is not as straightforward. > >> - There isn’t any universal SQL standard for a type that supports > >> different timestamp resolutions > >> - The most non-intrusive way I can think of is: > >> - CAST(x AS TIMESTAMP) -> Nanos > >> - CAST(x AS TIMESTAMP(Micros/Millis/Seconds)) ->. > >> Micros/Millis/Seconds arrays > >> - Functions are designed/work best with a single output type, and > >> date_trunc() is designed to output nanos only. Fixing this would not be > >> trivial, it would probably require changing the signature of > return_type() > >> so that return types can be determined from argument values, not just > >> argument types > >> > >> Basically the larger question for the Arrow/DataFusion community is how > do > >> we want to deal with supporting different timestamp types. The ideal > would > >> be that different functions work on all the timestamp types, but it’ll > take > >> a long time to get there I fear. Some possible directions: > >> > >> - Continue current support for nanos as the “first class” citizen, but > add > >> support for casting to different timestamp resolutions, and coercion to > >> nanos to work with different functions like date_trunc(). > >> - This means Arrow would not be usable in some cases. Converting > from > >> micros to nanos loses year range (for 64-bits anyways) > >> - There is a performance penalty for the coercion > >> - Switch to different basis for the “base” timestamp type, like PostGres > >> did with micros > >> - Use more than 64 bits to represent nanos > >> - Add support for functions to work with different timestamp types > >> - For example, cast() produces different timestamp resolutions > >> - Other date functions like date_trunc() can input from different > >> resolutions > >> - Signatures and return type calculation would be more complex > >> - Switch to a universal timestamp type which supports different > >> resolutions, as some SQL databases support > >> > >> Thanks for your input, > >> Evan > >