On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <weston.p...@gmail.com> wrote:

> Arrow's "Timestamp with Timezone" can have fields extracted
> from it.
>

Sure, one *can* extract fields from timestamp+tz. But I don't feel
timestamp+tz is *designed* for extracting fields:

   - Extracting fields from int64+tz is inefficient, because it bundles two
   steps: 1) convert to datetime struct; and 2) return one field from the
   datetime struct. (If I want to extract Year, Month, Day, is that three
   function calls that *each* convert to datetime struct?)
   - Extracting fields from int64+tz is awkward, because it's not obvious
   which timezone is being used. (To extract fields in a custom timezone, must
   I 1) clone the column with a new timezone; and 2) call the function?)
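To illustrate the first bullet, here is a minimal Python sketch (standard library only, not Arrow's actual kernels). The helper names `to_local`, `extract_year`, `extract_month` and `extract_day` are hypothetical, and a fixed UTC-4 offset stands in for a real timezone. It only counts how often the conversion step runs:

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-4 offset stands in for a real tz-database zone.
TZ = timezone(timedelta(hours=-4))

# Counts how many timestamp => datetime conversions have run.
calls = {"conversions": 0}


def to_local(ts_seconds):
    """Hypothetical helper: the 'convert to datetime struct' step."""
    calls["conversions"] += 1
    return datetime.fromtimestamp(ts_seconds, TZ)


# Field-at-a-time functions: every extraction repeats the conversion.
def extract_year(ts_seconds):
    return to_local(ts_seconds).year


def extract_month(ts_seconds):
    return to_local(ts_seconds).month


def extract_day(ts_seconds):
    return to_local(ts_seconds).day


ts = 1_623_777_540  # 2021-06-15 17:19:00 UTC

per_field = (extract_year(ts), extract_month(ts), extract_day(ts))
per_field_conversions = calls["conversions"]

# Converting once and reading three attributes does the same work in one pass.
calls["conversions"] = 0
local = to_local(ts)
one_pass = (local.year, local.month, local.day)
one_pass_conversions = calls["conversions"]
```

Both approaches return the same fields; the difference is only in how many times the conversion runs.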

My understanding of "best practice" for extracting multiple fields using
Arrow's timestamp columns is:

1. Convert from timestamp column to date32 and/or time32/time64 columns in
one pass (one of three operations, perhaps: timestamp=>date32,
timestamp=>time64, or timestamp=>struct{date32,time64})
2. Extract fields from those date32 and time64 columns.

Only step 1 needs a timezone. In C, the analogue is localtime().
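The two steps can be sketched in plain Python, as an illustration only. The function names `split_timestamp` and `date32_fields` are hypothetical, the timestamp is assumed to be microseconds since the UNIX epoch, time64 is assumed to use microsecond units, and a fixed UTC-4 offset stands in for a real timezone:

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_timestamp(ts_us, tz):
    """Step 1: timestamp => (date32, time64). The only step that needs tz.

    Returns (days since epoch, microseconds since local midnight).
    """
    local = (EPOCH_UTC + timedelta(microseconds=ts_us)).astimezone(tz)
    days = (local.date() - EPOCH_DATE).days
    seconds = (local.hour * 60 + local.minute) * 60 + local.second
    return days, seconds * 1_000_000 + local.microsecond


def date32_fields(days):
    """Step 2: extract (year, month, day) from a date32. No timezone needed."""
    d = EPOCH_DATE + timedelta(days=days)
    return d.year, d.month, d.day


# Assumption: fixed UTC-4 offset instead of a tz-database zone.
tz = timezone(timedelta(hours=-4))

# 2021-06-15 17:19:00 UTC, which is 13:19:00 local time.
days, micros = split_timestamp(1_623_777_540_000_000, tz)
fields = date32_fields(days)

# 2021-06-15 02:00:00 UTC is still June 14 in local time.
early_days, _ = split_timestamp(1_623_722_400_000_000, tz)
early_fields = date32_fields(early_days)
```

The second example shows why step 1 needs the timezone: the same UTC day can land on a different local date.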

We do step 1 at Workbench -- see converttimestamptodate
<https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py>
for our implementation. We haven't had much demand for step 2, so we'll get to
it later.

I think of this "best practice" as a compromise:

   - date32+time64 aren't as time-efficient as C's struct tm, but together
   they use 12 bytes whereas the C struct costs 50-100 bytes.
   - date32+time64 take 50% more space than int64 (12 bytes versus 8), but
   they're intuitive and they save time.
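The space arithmetic for date32+time64 versus int64 can be checked with a short sketch. It assumes the usual fixed widths (32-bit date32, 64-bit time64, 64-bit timestamp) and uses Python's `struct` module only to compute them; the variable names are illustrative:

```python
import struct

DATE32_BYTES = struct.calcsize("<i")  # 32-bit days since epoch
TIME64_BYTES = struct.calcsize("<q")  # 64-bit time of day
INT64_BYTES = struct.calcsize("<q")   # 64-bit timestamp

pair_bytes = DATE32_BYTES + TIME64_BYTES     # date32 + time64 per value
extra_space = pair_bytes / INT64_BYTES - 1   # fraction beyond a bare int64
```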

A small benchmark in Python to support that "save time" assertion (timeit
reports total seconds for its default 1,000,000 calls):

>>> import datetime, os, time, timeit
>>> os.environ['TZ'] = 'America/Montreal'
>>> time.tzset()
>>> timestamp = time.time()
>>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
0.2955563920113491
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline: timeit overhead + tuple construction
0.2509278700017603

Most of the measured time is overhead, but the timestamp=>date conversion
clearly has a cost, and it's sane to try to minimize it.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
