Re: C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

2021-07-21 Thread Adam Hooper
but it's not ideal for streaming because it's high-RAM and high time-to-first-byte. Thank you again for your advice. You've been more than helpful. Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

2021-07-20 Thread Adam Hooper
nested values. Does the C++ parquet reader support reading a batch of values and their validity bitmap? Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-22 Thread Adam Hooper
correct. Applications can choose what they feel makes sense to them (as long as they don't start automatically tacking on timezones to naive timestamps). My interpretation of the specification has been di

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Adam Hooper
.com/document/d/1QDwX4ypfNvESc2ywcT1ygaf2Y1R8SmkpifMV7gpJdBI/edit#> you set up! But to answer your question here: my understanding is we're debating how to store an Instant in Arrow. Or conversely, how to interpret a timestamp that has no timezone field. -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-17 Thread Adam Hooper
a different byte structure.) Perhaps we can make a spreadsheet and look comprehensively at how many > use cases would be disenfranchised by requiring UTC normalization > always. Hear, hear! Can we also poll people to find out how they're storing Instants today? Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Adam Hooper
>>> timeit.timeit(lambda: datetime.date(2021, 6, 15).year) # baseline: timeit overhead + tuple construction
0.2509278700017603
Most of the test is overhead; but certainly the timestamp=>date conversion takes time, and it's sane to try and minimize that overhead. Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com
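A rough way to reproduce this kind of comparison is sketched below. It is only an illustration, not the benchmark from the thread: the helper names `year_from_timestamp` and `measure`, the sample timestamp `TS`, and the iteration count are all assumptions, and absolute timings will vary by machine.

```python
import datetime
import timeit

# Hypothetical sample value: 2021-06-15T00:00:00Z as seconds since the epoch.
TS = 1623715200


def year_from_timestamp(ts: int) -> int:
    """Convert an integer UTC timestamp to a calendar year."""
    return datetime.datetime.fromtimestamp(ts, tz=datetime.timezone.utc).year


def measure(number: int = 100_000) -> dict:
    """Time the baseline (date construction only) against the conversion."""
    baseline = timeit.timeit(
        lambda: datetime.date(2021, 6, 15).year, number=number
    )
    converted = timeit.timeit(lambda: year_from_timestamp(TS), number=number)
    return {"baseline": baseline, "converted": converted}


if __name__ == "__main__":
    print(measure())
```

Subtracting the baseline from the converted figure gives a crude estimate of the conversion cost alone.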

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-15 Thread Adam Hooper
> >> DateTime - full date and time with time-zone
> >> * LocalDateTime - date-time without a time-zone
> >>
> >> ...
> >>
> >> I recommend that Arrow supports all three. Choose clear, distinct
> >> names for all three, consistent with names used elsewhere in the
> >> industry.
> >
> > It seems to me that we are discussing whether our "timestamp without
> > timezone" should be interpreted as a LocalDateTime or as an Instant
> > (since interpreting it as UTC makes it an Instant, I think). Is that a
> > correct / helpful framing?
>
> That is correct, IMHO.
>
-- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Adam Hooper
l ... what *is* the meaning of the timezone field? (In my opinion, there shouldn't be a field at all.) Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format][Important] Needed clarification of timezone-less timestamps

2021-06-14 Thread Adam Hooper
zone.) I'm a smart person. I keep making these embarrassing -- and costly -- mistakes. I've never been tripped up by java.time.Instant. It's no wonder Java embraced it. I hope Arrow empowers its community to make tools that make me feel not-stupid. Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format] Timestamp timezone semantics?

2021-06-03 Thread Adam Hooper
On Thu, Jun 3, 2021 at 2:02 PM Adam Hooper wrote: > I understand isAdjustedToUTC=true to mean "timestamp", and > isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some > docs because > https://github.com/apache/parquet-format/blob/master/Logi

Re: [Format] Timestamp timezone semantics?

2021-06-03 Thread Adam Hooper
timestamp", and isAdjustedToUTC=false to mean, "int64 and I hope somebody attached some docs because https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#local-semantics-timestamps-not-normalized-to-utc lists a whole slew of potential meanings and without extra metadata I'll never be able to figure out what this column means." Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Format] Timestamp timezone semantics?

2021-06-02 Thread Adam Hooper
imezones are yet to be decreed by politicians). Don't follow in C or SQL's footsteps. Store timestamps as integer UTC timestamps. Store the timezone somewhere else; use it to convert to the local timezone when formatting and to convert to a calendar for calendar math. -- Adam Hooper +1-514-882-9694 http://adamhooper.com
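The approach described here can be sketched in a few lines. This is a minimal illustration under stated assumptions, not code from the thread: the function names `format_local` and `local_date` are hypothetical, and a fixed UTC-4 offset stands in for a separately stored timezone (a real application would more likely store an IANA zone name and resolve it with `zoneinfo`).

```python
from datetime import date, datetime, timedelta, timezone, tzinfo

# The stored value: an integer count of seconds since the epoch, in UTC.
EPOCH_SECONDS = 1623715200  # 2021-06-15T00:00:00Z

# Stored separately from the timestamp. A fixed offset is an assumption made
# to keep the sketch dependency-free.
DISPLAY_TZ = timezone(timedelta(hours=-4))


def format_local(epoch_seconds: int, tz: tzinfo) -> str:
    """Render a UTC instant in the given timezone, for display only."""
    instant = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return instant.astimezone(tz).isoformat()


def local_date(epoch_seconds: int, tz: tzinfo) -> date:
    """Convert a UTC instant to a calendar date for calendar math."""
    return datetime.fromtimestamp(epoch_seconds, tz=tz).date()


if __name__ == "__main__":
    print(format_local(EPOCH_SECONDS, DISPLAY_TZ))
    print(local_date(EPOCH_SECONDS, DISPLAY_TZ))
```

The stored integer never changes when the display timezone does; only the formatting and calendar steps consult the timezone.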

Re: Long title on github page

2021-05-18 Thread Adam Hooper
; github.com/apache/beam - Apache Beam is a unified programming model for Batch and Streaming

Re: [Discuss] Storing metadata about the "sortedness" of data

2021-05-11 Thread Adam Hooper
esql.org/wiki/Collations> to version collations in v13/v14. I'm a Postgres user who experienced index corruption between collation versions. To me, Postgres' effort seems both cutting-edge and essential. Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [Rust][DataFusion] Inconsistent array ordering with "GROUP BY" SQL

2021-02-21 Thread Adam Hooper
can memorize the pattern. With that information she can detect, in a given amount of time, how many times Bob ran the same query. Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

2020-09-16 Thread Adam Hooper
ltitude of weaknesses are most urgent. The RDBMS provides few knobs and much documentation. The team selects compromises. I think a good bullet for your list of requirements is: "simple enough to explain to a non-programmer." A disaster *will* happen; and someone will need to explain

Re: Where can I report a security-related issue?

2019-12-19 Thread Adam Hooper
e. > Thank you for clarifying. This is all music to my ears. I feel Arrow's careful design gives me all the tools I need to confidently repel malicious input. <https://github.com/cyb70289/utf8> Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

Re: Where can I report a security-related issue?

2019-12-18 Thread Adam Hooper
; > Regards > > Antoine. > > > On 18/12/2019 at 17:42, Adam Hooper wrote: > > My project parses Arrow files produced by untrusted code. > > > > It looks to me like the "validate" function should help me avoid undefined > > behavior given an

[jira] [Created] (ARROW-7435) Security issue: ValidateOffsets() does not prevent buffer over-read

2019-12-18 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7435: -- Summary: Security issue: ValidateOffsets() does not prevent buffer over-read Key: ARROW-7435 URL: https://issues.apache.org/jira/browse/ARROW-7435 Project: Apache Arrow

Where can I report a security-related issue?

2019-12-18 Thread Adam Hooper
s security a goal of the Arrow project/format? If so, how shall I report this bug without endangering other users in my situation? Enjoy life, Adam -- Adam Hooper +1-514-882-9694 http://adamhooper.com

[jira] [Created] (ARROW-7281) AdaptiveIntBuilder::length() does not consider pending_pos_.

2019-11-29 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7281: -- Summary: AdaptiveIntBuilder::length() does not consider pending_pos_. Key: ARROW-7281 URL: https://issues.apache.org/jira/browse/ARROW-7281 Project: Apache Arrow

[jira] [Created] (ARROW-7266) dictionary_encode() of a slice gives wrong result

2019-11-26 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-7266: -- Summary: dictionary_encode() of a slice gives wrong result Key: ARROW-7266 URL: https://issues.apache.org/jira/browse/ARROW-7266 Project: Apache Arrow Issue

[jira] [Created] (ARROW-6895) parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()`

2019-10-15 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-6895: -- Summary: parquet::arrow::ColumnReader: ByteArrayDictionaryRecordReader repeats returned values when calling `NextBatch()` Key: ARROW-6895 URL: https://issues.apache.org/jira/browse

[jira] [Created] (ARROW-6861) With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize

2019-10-11 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-6861: -- Summary: With arrow-0.14.1-output Parquet dictionary column: Failure reading column: IOError: Arrow error: Invalid: Resize cannot downsize Key: ARROW-6861 URL: https

[jira] [Created] (ARROW-6568) pyarrow.parquet crash writing zero-chunk dictionary-type column

2019-09-15 Thread Adam Hooper (Jira)
Adam Hooper created ARROW-6568: -- Summary: pyarrow.parquet crash writing zero-chunk dictionary-type column Key: ARROW-6568 URL: https://issues.apache.org/jira/browse/ARROW-6568 Project: Apache Arrow