Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-13 Thread Roman Karlstetter
Hi,

That sounds like the task might not be ideally suited for someone new to the 
implementations of both Arrow and Parquet, especially since all those 
compatibility issues need to be handled correctly.
I don't think it makes sense for me to continue with this implementation 
unless there is some further specification of how it should be implemented.

Roman

From: Wes McKinney
Sent: Monday, 12 November 2018 16:50
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

hi Roman,

For nanosecond Arrow timestamps, the relevant code path for this is here:

https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L607

You'll also have to modify some code in parquet/types.*,
parquet/schema.*, parquet/arrow/schema.cc to handle the additional
metadata. If you aren't dealing with Arrow at all, then it should be
sufficient just to modify the handling of the logical types metadata
in parquet/types.*.

There is a significant complication that I hadn't thought about before:
we aren't handling the new logical types union in parquet-cpp yet, so
there's quite a lot of work beyond just dealing with the nanosecond
metadata. I am also not sure what the implications are for backwards
compatibility, and I haven't had time to look in detail at what needs to
be done since the new metadata structure was added to the Thrift
definition.
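
To make the backwards-compatibility question a bit more concrete, here is a
minimal sketch of how a writer might decide whether a timestamp described by
the new logical type can also be expressed with a legacy converted type. All
type and function names below are hypothetical and are not the parquet-cpp
API. The sketch assumes that only UTC-adjusted millisecond and microsecond
timestamps have a legacy equivalent and that nanoseconds have none; it
conservatively reports no equivalent for non-UTC timestamps, so the format
specification should be checked for how those ought to be annotated.

```cpp
#include <optional>

// Hypothetical names for illustration only; not the parquet-cpp API.
enum class TimeUnit { kMillis, kMicros, kNanos };
enum class LegacyConvertedType { kTimestampMillis, kTimestampMicros };

// Mirrors the two fields carried by the timestamp entry of the new
// logical types union.
struct TimestampLogicalType {
  bool is_adjusted_to_utc;
  TimeUnit unit;
};

// Returns the legacy converted type that older readers would understand,
// or std::nullopt when this sketch assumes there is no equivalent.
inline std::optional<LegacyConvertedType> ToLegacyConvertedType(
    const TimestampLogicalType& type) {
  if (!type.is_adjusted_to_utc) {
    return std::nullopt;  // assumption of this sketch, see the note above
  }
  switch (type.unit) {
    case TimeUnit::kMillis:
      return LegacyConvertedType::kTimestampMillis;
    case TimeUnit::kMicros:
      return LegacyConvertedType::kTimestampMicros;
    case TimeUnit::kNanos:
      return std::nullopt;  // nanoseconds have no legacy annotation
  }
  return std::nullopt;
}
```

A writer built along these lines would emit only the new metadata for
nanosecond columns, which is why older readers may not recognize such columns
as timestamps.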

- Wes
On Mon, Nov 12, 2018 at 4:31 AM Roman Karlstetter wrote:
>
> I've had the chance to look into this.
> There is one issue that came up which I don't know how to handle. Previously, 
> int96 seems to have been used for nanosecond precision, but this is somewhat 
> deprecated, as far as I understand it.
> So, how should we handle nanoseconds and int96 vs int64 when 1) reading from 
> and 2) writing to parquet?
> There seem to be some writer settings, all related to timestamp precision 
> properties. Is there any advice any of you can give me in that regard?
>
> Thanks,
> Roman
>
> From: Roman Karlstetter
> Sent: Friday, 9 November 2018 08:38
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I would be willing to implement that. I’ll probably need some advice on my 
> patch though, as I’m fairly new to the parquet code.
>
> Roman
>
> From: Wes McKinney
> Sent: Thursday, 8 November 2018 23:22
> To: dev@arrow.apache.org
> Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp
>
> I opened an issue here
> https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
> welcome
> On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
> >
> > hi Roman,
> >
> > We would welcome adding such a document to the Arrow wiki
> > https://cwiki.apache.org/confluence/display/ARROW. As to your other
> > questions, it really depends on whether there is a member of the
> > Parquet community who will do the work. Patches that implement any
> > released functionality in the Parquet format specification are
> > welcome.
> >
> > Thanks
> > Wes
> > On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> > >
> > > Hi everyone,
> > > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > > https://github.com/apache/parquet-format/pull/102
> > > For parquet-cpp, this is not yet supported. I have a few questions now:
> > > • is there an overview of which release of parquet-format is currently 
> > > fully supported in parquet-cpp (something like a feature support matrix)?
> > > • how fast are new features in parquet-format adopted?
> > > I think having a document describing the current completeness of 
> > > implementation of the spec would be very helpful for users of the 
> > > parquet-cpp library.
> > > Thanks,
> > > Roman
> > >
> > >
>
>



Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-12 Thread Roman Karlstetter
I've had the chance to look into this.
There is one issue that came up which I don't know how to handle. Previously, 
int96 seems to have been used for nanosecond precision, but this is somewhat 
deprecated, as far as I understand it.
So, how should we handle nanoseconds and int96 vs int64 when 1) reading from 
and 2) writing to parquet?
There seem to be some writer settings, all related to timestamp precision 
properties. Is there any advice any of you can give me in that regard?
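
For reference, here is a minimal sketch of the relationship between the two
encodings in question. It assumes the commonly used legacy int96 layout (8
little-endian bytes of nanoseconds within the day, followed by a 4-byte
little-endian Julian day number) and converts it to and from an int64 count of
nanoseconds since the Unix epoch. The helper names are hypothetical and not
part of parquet-cpp, and range checks for dates that do not fit into int64
nanoseconds are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper names for illustration; not the parquet-cpp API.
constexpr std::int64_t kJulianDayOfUnixEpoch = 2440588;  // 1970-01-01
constexpr std::int64_t kNanosPerDay = 86400LL * 1000 * 1000 * 1000;

// Decode a 12-byte legacy int96 timestamp into int64 nanoseconds since the
// Unix epoch. Bytes are read one at a time, so host endianness does not matter.
inline std::int64_t Int96ToUnixNanos(const std::uint8_t bytes[12]) {
  std::uint64_t nanos_of_day = 0;
  for (int i = 7; i >= 0; --i) nanos_of_day = (nanos_of_day << 8) | bytes[i];
  std::uint32_t julian_day_bits = 0;
  for (int i = 11; i >= 8; --i) julian_day_bits = (julian_day_bits << 8) | bytes[i];
  assert(nanos_of_day < static_cast<std::uint64_t>(kNanosPerDay));
  const std::int64_t julian_day = static_cast<std::int32_t>(julian_day_bits);
  return (julian_day - kJulianDayOfUnixEpoch) * kNanosPerDay +
         static_cast<std::int64_t>(nanos_of_day);
}

// Encode int64 nanoseconds since the Unix epoch as a 12-byte int96 value.
inline void UnixNanosToInt96(std::int64_t unix_nanos, std::uint8_t out[12]) {
  std::int64_t days = unix_nanos / kNanosPerDay;
  std::int64_t rem = unix_nanos % kNanosPerDay;
  if (rem < 0) {  // floor division for timestamps before the epoch
    rem += kNanosPerDay;
    --days;
  }
  const std::uint64_t nanos_of_day = static_cast<std::uint64_t>(rem);
  const std::uint32_t julian_day =
      static_cast<std::uint32_t>(days + kJulianDayOfUnixEpoch);
  for (int i = 0; i < 8; ++i)
    out[i] = static_cast<std::uint8_t>(nanos_of_day >> (8 * i));
  for (int i = 0; i < 4; ++i)
    out[8 + i] = static_cast<std::uint8_t>(julian_day >> (8 * i));
}
```

With int64 the value is a single integer whose unit comes from the column
metadata, whereas int96 carries the day and the time of day separately, so a
reader that wants int64 nanoseconds has to do the arithmetic above.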

Thanks,
Roman

From: Roman Karlstetter
Sent: Friday, 9 November 2018 08:38
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I would be willing to implement that. I’ll probably need some advice on my 
patch though, as I’m fairly new to the parquet code.

Roman

From: Wes McKinney
Sent: Thursday, 8 November 2018 23:22
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I opened an issue here
https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
welcome
On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of which release of parquet-format is currently fully 
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of 
> > implementation of the spec would be very helpful for users of the 
> > parquet-cpp library.
> > Thanks,
> > Roman
> >
> >




Re: Support for TIMESTAMP_NANOS in parquet-cpp

2018-11-08 Thread Roman Karlstetter
I would be willing to implement that. I’ll probably need some advice on my 
patch though, as I’m fairly new to the parquet code.

Roman

From: Wes McKinney
Sent: Thursday, 8 November 2018 23:22
To: dev@arrow.apache.org
Subject: Re: Support for TIMESTAMP_NANOS in parquet-cpp

I opened an issue here
https://issues.apache.org/jira/browse/ARROW-3729. Patches would be
welcome
On Sat, Oct 20, 2018 at 12:55 PM Wes McKinney  wrote:
>
> hi Roman,
>
> We would welcome adding such a document to the Arrow wiki
> https://cwiki.apache.org/confluence/display/ARROW. As to your other
> questions, it really depends on whether there is a member of the
> Parquet community who will do the work. Patches that implement any
> released functionality in the Parquet format specification are
> welcome.
>
> Thanks
> Wes
> On Thu, Oct 18, 2018 at 10:59 AM Roman Karlstetter wrote:
> >
> > Hi everyone,
> > in parquet-format, there is now support for TIMESTAMP_NANOS: 
> > https://github.com/apache/parquet-format/pull/102
> > For parquet-cpp, this is not yet supported. I have a few questions now:
> > • is there an overview of which release of parquet-format is currently fully 
> > supported in parquet-cpp (something like a feature support matrix)?
> > • how fast are new features in parquet-format adopted?
> > I think having a document describing the current completeness of 
> > implementation of the spec would be very helpful for users of the 
> > parquet-cpp library.
> > Thanks,
> > Roman
> >
> >