Re: Support for TIMESTAMP_NANOS in parquet-cpp
Hi,

that sounds like the task might not be ideally suited for someone new to the implementations of both Arrow and Parquet, especially since all of those compatibility issues need to be handled correctly. I think it does not make sense for me to continue with this implementation unless there is some further specification of how it should be implemented.

Roman

From: Wes McKinney
Sent: Monday, November 12, 2018 16:50
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

hi Roman,

For nanosecond Arrow timestamps, the relevant code path is here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607

You'll also have to modify some code in parquet/types.*, parquet/schema.*, and parquet/arrow/schema.cc to handle the additional metadata. If you aren't dealing with Arrow at all, then it should be sufficient just to modify the handling of the logical types metadata in parquet/types.*.

So there is a significant complication that I hadn't thought about yet: we aren't handling the new logical types union in parquet-cpp yet, so there's quite a lot of work beyond just dealing with the nanosecond metadata. I am also not sure what the implications are for backwards compatibility, and I haven't had time to look in detail at what needs to be done since the new metadata structure was added to the Thrift definition.

- Wes

On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter wrote:
>
> I've had the chance to look into this.
> There is one issue that came up which I don't know how to handle. Previously,
> int96 seems to have been used for nanosecond precision, but this is somewhat
> deprecated, as far as I understand it.
> So, how should we handle nanoseconds and int96 vs int64 when a) reading from
> and b) writing to parquet?
> There seem to be some writer settings, all related to timestamp precision
> properties. Is there any advice any of you can give me in that regard?
>
> Thanks,
> Roman
>
> From: Roman Karlstetter
> Sent: Friday, November 9, 2018 08:38
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I would be willing to implement that. I'll probably need some advice on my
> patch though, as I'm fairly new to the parquet code.
>
> Roman
>
> From: Wes McKinney
> Sent: Thursday, November 8, 2018 23:22
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I opened an issue here
> https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> welcome
>
> On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote:
> >
> > hi Roman,
> >
> > We would welcome adding such a document to the Arrow wiki
> > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > questions, it really depends on whether there is a member of the
> > Parquet community who will do the work. Patches that implement any
> > released functionality in the Parquet format specification are
> > welcome.
> >
> > Thanks
> > Wes
> >
> > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> > >
> > > Hi everyone,
> > > in parquet-format, there is now support for TIMESTAMP_NANOS:
> > > https://github.com/apache/parquet-format/pull/102
> > > For parquet-cpp, this is not yet supported. I have a few questions now:
> > > • is there an overview of what release of parquet-format is currently
> > > fully supported in parquet-cpp (something like a feature support matrix)?
> > > • how fast are new features in parquet-format adopted?
> > > I think having a document describing the current completeness of
> > > implementation of the spec would be very helpful for users of the
> > > parquet-cpp library.
> > > Thanks,
> > > Roman
Re: Support for TIMESTAMP_NANOS in parquet-cpp
I've had the chance to look into this.

There is one issue that came up which I don't know how to handle. Previously, int96 seems to have been used for nanosecond precision, but this is somewhat deprecated, as far as I understand it. So, how should we handle nanoseconds and int96 vs int64 when a) reading from and b) writing to parquet? There seem to be some writer settings, all related to timestamp precision properties. Is there any advice any of you can give me in that regard?

Thanks,
Roman

From: Roman Karlstetter
Sent: Friday, November 9, 2018 08:38
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I would be willing to implement that. I'll probably need some advice on my patch though, as I'm fairly new to the parquet code.

Roman

From: Wes McKinney
Sent: Thursday, November 8, 2018 23:22
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I opened an issue here https://issues.apache.org/jira/browse/ARROW-3729. Patches would be welcome

On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
>
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS:
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of what release of parquet-format is currently fully
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of
> > implementation of the spec would be very helpful for users of the
> > parquet-cpp library.
> > Thanks,
> > Roman
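
On the int96 vs. int64 question in the message above, a small sketch may help make the difference concrete. It assumes the INT96 timestamp layout commonly written by Impala and Hive (8 bytes of nanoseconds within the day followed by a 4-byte Julian day number, both little-endian) and compares it with an int64 count of nanoseconds since the Unix epoch. The struct and function names are hypothetical and chosen for illustration; this is not the parquet-cpp API, and it does not settle how the reader or writer should behave.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a decoded 12-byte INT96 timestamp value.
// Assumed on-disk layout: bytes 0-7 = nanoseconds within the day,
// bytes 8-11 = Julian day number, both little-endian.
struct Int96Timestamp {
  std::uint64_t nanos_of_day;
  std::uint32_t julian_day;
};

// Julian day number of 1970-01-01, the Unix epoch.
constexpr std::int64_t kJulianDayOfUnixEpoch = 2440588;
constexpr std::int64_t kNanosPerDay = 86400LL * 1000000000LL;

// INT96 -> int64 nanoseconds since the Unix epoch.
// Note: an int64 nanosecond count only covers roughly the years 1677 to 2262,
// so days far outside that range would overflow here; a real reader would
// have to decide how to handle them.
std::int64_t Int96ToUnixNanos(const Int96Timestamp& ts) {
  // Sketch-level sanity check on the input.
  assert(ts.nanos_of_day < static_cast<std::uint64_t>(kNanosPerDay));
  const std::int64_t days =
      static_cast<std::int64_t>(ts.julian_day) - kJulianDayOfUnixEpoch;
  return days * kNanosPerDay + static_cast<std::int64_t>(ts.nanos_of_day);
}

// int64 nanoseconds since the Unix epoch -> INT96.
// Uses floor division so that pre-epoch values still get a non-negative
// nanos_of_day.
Int96Timestamp UnixNanosToInt96(std::int64_t nanos) {
  std::int64_t days = nanos / kNanosPerDay;
  std::int64_t rem = nanos % kNanosPerDay;
  if (rem < 0) {
    rem += kNanosPerDay;
    days -= 1;
  }
  return Int96Timestamp{static_cast<std::uint64_t>(rem),
                        static_cast<std::uint32_t>(days + kJulianDayOfUnixEpoch)};
}
```

Under these assumptions, reading a legacy INT96 column into nanosecond timestamps is a plain arithmetic conversion, while the reverse direction is only needed if a writer setting asks for the deprecated representation. The range limit noted in the comments is one of the compatibility points an implementation would need to decide on.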
Re: Support for TIMESTAMP_NANOS in parquet-cpp
I would be willing to implement that. I'll probably need some advice on my patch though, as I'm fairly new to the parquet code.

Roman

From: Wes McKinney
Sent: Thursday, November 8, 2018 23:22
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I opened an issue here https://issues.apache.org/jira/browse/ARROW-3729. Patches would be welcome

On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
>
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS:
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of what release of parquet-format is currently fully
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of
> > implementation of the spec would be very helpful for users of the
> > parquet-cpp library.
> > Thanks,
> > Roman