On Tue, Jun 15, 2021 at 1:19 PM Weston Pace <weston.p...@gmail.com> wrote:
> Arrow's "Timestamp with Timezone" can have fields extracted
> from it.

Sure, one *can* extract fields from timestamp+tz. But I don't feel timestamp+tz is *designed* for extracting fields:

- Extracting fields from int64+tz is inefficient, because it bundles two steps: 1) convert to datetime struct; and 2) return one field from the datetime struct. (If I want to extract Year, Month, Day, is that three function calls that *each* convert to datetime struct?)
- Extracting fields from int64+tz is awkward, because it's not obvious which timezone is being used. (To extract fields in a custom timezone, must I 1) clone the column with a new timezone; and 2) call the function?)

My understanding of "best practice" for extracting multiple fields using Arrow's timestamp columns is:

1. Convert from timestamp column to date32 and/or time32/time64 columns in one pass (one of three operations, perhaps: timestamp=>date32, timestamp=>time64, or timestamp=>struct{date32,time64})
2. Extract fields from those date32 and time64 columns.

Only step 1 needs a timezone. In C, the analogue is localtime().

We do step 1 at Workbench -- see converttimestamptodate <https://github.com/CJWorkbench/converttimestamptodate/blob/main/converttimestamptodate.py> for our implementation. We haven't had much demand for step 2, so we'll get to it later.

I think of this "best practice" as a compromise:

- date32+time64 aren't as time-efficient as C's struct tm, but together they use 12 bytes whereas the C struct costs 50-100 bytes.
- date32+time64 use 50% more space than int64 (12 bytes vs. 8), but they're intuitive and they save time.
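To make the two-step idea concrete, here is a rough pure-Python sketch of what I mean. It is not Arrow's API and not the Workbench implementation: the helper names (split_timestamp, date32_fields, time64_fields) are made up for illustration, the timestamp is assumed to be int64 microseconds since the UTC epoch, and I use a fixed UTC-4 offset instead of a real timezone database to keep it self-contained.

```python
import datetime

EPOCH_DATE = datetime.date(1970, 1, 1)
EPOCH_UTC = datetime.datetime(1970, 1, 1, tzinfo=datetime.timezone.utc)


def split_timestamp(ts_us, tz):
    """Step 1 (hypothetical helper): the only step that needs a timezone.

    Takes microseconds since the UTC epoch and returns
    (days since 1970-01-01, microseconds since local midnight),
    i.e. date32-like and time64-like values in local time.
    """
    local = (EPOCH_UTC + datetime.timedelta(microseconds=ts_us)).astimezone(tz)
    days = (local.date() - EPOCH_DATE).days
    micros = (
        (local.hour * 60 + local.minute) * 60 + local.second
    ) * 1_000_000 + local.microsecond
    return days, micros


def date32_fields(days):
    """Step 2a (hypothetical helper): year, month, day; no timezone involved."""
    d = EPOCH_DATE + datetime.timedelta(days=days)
    return d.year, d.month, d.day


def time64_fields(micros):
    """Step 2b (hypothetical helper): hour, minute, second, microsecond."""
    seconds, microsecond = divmod(micros, 1_000_000)
    minutes, second = divmod(seconds, 60)
    hour, minute = divmod(minutes, 60)
    return hour, minute, second, microsecond


# Assumption: a fixed UTC-4 offset stands in for a real timezone.
tz = datetime.timezone(datetime.timedelta(hours=-4))

ts_us = 1_623_777_540_000_000  # 2021-06-15 17:19:00 UTC
days, micros = split_timestamp(ts_us, tz)  # one conversion...
print(date32_fields(days))    # ...then cheap extraction: (2021, 6, 15)
print(time64_fields(micros))  # (13, 19, 0, 0)
```

The point is that the timezone shows up exactly once, in split_timestamp; everything after that is plain integer arithmetic on the date32 and time64 values.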
A small benchmark to prove that "save time" assertion in Python:

>>> import datetime, os, time, timeit
>>> os.environ['TZ'] = 'America/Montreal'
>>> time.tzset()
>>> timestamp = time.time()
>>> timeit.timeit(lambda: datetime.date.fromtimestamp(timestamp).year)
0.2955563920113491
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year)  # baseline: timeit overhead + tuple construction
0.2509278700017603

Most of the test is overhead; but certainly the timestamp=>date conversion takes time, and it's sane to try and minimize that overhead.

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com