[VOTE] Release Apache Parquet Format 2.4.0 RC1

2017-10-16 Thread Ryan Blue
Hi everyone,

I propose the following RC to be released as the official Apache Parquet
Format 2.4.0 release.

The commit id is 3fb6b391db7f369de8b4114ae071f5725db7247c

   - This corresponds to the tag: apache-parquet-format-2.4.0
   - https://github.com/apache/parquet-format/tree/3fb6b39
   - https://git-wip-us.apache.org/repos/asf/projects/repo?p=parquet-format.git=commit=3fb6b39

The release tarball, signature, and checksums are here:

   - https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-format-2.4.0-rc1/

You can find the KEYS file here:

   - https://dist.apache.org/repos/dist/dev/parquet/KEYS

Binary artifacts are staged in Nexus here:

   - https://repository.apache.org/content/groups/staging/org/apache/parquet/parquet-format/2.4.0/

This release includes:

   - Definitions for indexing columns
   - Support for new compression codecs: Brotli, Zstd, LZ4
   - New representation for logical types
   - Other bug fixes and improvements

Please download, verify, and test.
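
For anyone checking the tarball by hand, the checksum step can be sketched in Python. This is only an illustration: the function name is made up, it assumes a SHA-512 digest has been copied out of the published checksum file, and it does not replace verifying the GPG signature against the KEYS file.

```python
import hashlib

def sha512_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the SHA-512 digest of the file equals expected_hex.

    expected_hex is assumed to be the hex digest taken from the checksum
    file; surrounding whitespace and letter case are ignored.
    """
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read in chunks so a large tarball is never held in memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```

A caller would pass the local path of the downloaded tarball and the digest string, and treat a False result as a reason to vote -1.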

This vote will close on Friday morning, 20 October.

[ ] +1 - Release this as Apache Parquet Format 2.4.0
[ ] +0
[ ] -1 - Do not release this because…
-- 
Ryan Blue


[jira] [Updated] (PARQUET-1032) Change link in Encodings.md for variable length encoding

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1032:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Change link in Encodings.md for variable length encoding
> 
>
> Key: PARQUET-1032
> URL: https://issues.apache.org/jira/browse/PARQUET-1032
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Lars Volker
>Assignee: Konstantin Shaposhnikov
>Priority: Trivial
> Fix For: format-2.4.0
>
>
> There's a PR for this already: 
> [#30|https://github.com/apache/parquet-format/pull/30]
> {quote}
> The spec says that varint-encode() is ULEB-128 encoding but links to VLQ 
> algorithm that is slightly different from ULEB-128
> {quote}
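
To make the distinction in the quoted report concrete, here is a minimal sketch of both encodings. It assumes the "VLQ algorithm" referred to is the MIDI-style variable-length quantity, which emits the most significant 7-bit group first, whereas ULEB-128 emits the least significant group first. The function names are illustrative and not taken from any Parquet code.

```python
def uleb128_encode(n):
    """Encode a non-negative int as ULEB-128 (least significant 7-bit group first)."""
    if n < 0:
        raise ValueError("ULEB-128 encodes non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more groups follow
        else:
            out.append(byte)         # last group has the high bit clear
            return bytes(out)

def vlq_encode(n):
    """Encode a non-negative int as a MIDI-style VLQ (most significant group first)."""
    if n < 0:
        raise ValueError("VLQ encodes non-negative integers only")
    groups = [n & 0x7F]              # final byte, high bit clear
    n >>= 7
    while n:
        groups.append((n & 0x7F) | 0x80)
        n >>= 7
    return bytes(reversed(groups))

# 300 = 0b10_0101100: the same two 7-bit groups, written in opposite orders.
assert uleb128_encode(300) == b"\xac\x02"
assert vlq_encode(300) == b"\x82\x2c"
```

Values below 128 encode identically under both schemes, which is probably why the wrong link went unnoticed.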



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PARQUET-1091) Wrong and broken links in README

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1091:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Wrong and broken links in README
> 
>
> Key: PARQUET-1091
> URL: https://issues.apache.org/jira/browse/PARQUET-1091
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: format-2.4.0
>
>
> Multiple links in README.md still point to the old {{Parquet/parquet-format}} 
> repository, which is now removed.





[jira] [Updated] (PARQUET-1049) Make thrift version a property in pom.xml

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1049:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Make thrift version a property in pom.xml
> -
>
> Key: PARQUET-1049
> URL: https://issues.apache.org/jira/browse/PARQUET-1049
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>Priority: Minor
> Fix For: format-2.4.0
>
>
> In parquet-mr, the version of the thrift dependency is controlled by a 
> property and can be overridden from the command line, for example `mvn clean 
> verify -Dthrift.version=0.9.0`.
> In parquet-format, however, the version number of thrift is hard-coded in the 
> pom.xml and a different version can only be used by modifying pom.xml itself.
> The thrift version should be a property in parquet-format as well.





[jira] [Updated] (PARQUET-1050) The comment of Parquet Format Thrift definition file error

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1050:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> The comment of Parquet Format Thrift definition file error
> --
>
> Key: PARQUET-1050
> URL: https://issues.apache.org/jira/browse/PARQUET-1050
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: lynn
>Assignee: lynn
>Priority: Minor
> Fix For: format-2.4.0
>
> Attachments: comments are inverse.png
>
>
> The comments in the parquet.thrift file are inverted!
> !comments are inverse.png!





[jira] [Updated] (PARQUET-1076) [Format] Switch to long key ids in KEYs file

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1076:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> [Format] Switch to long key ids in KEYs file
> 
>
> Key: PARQUET-1076
> URL: https://issues.apache.org/jira/browse/PARQUET-1076
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Lars Volker
>Assignee: Lars Volker
> Fix For: format-2.4.0
>
>
> PGP key IDs should be longer than 32 bits, as outlined on https://evil32.com/. We
> should fix the KEYS file in parquet-format. I will push a PR shortly.





[jira] [Updated] (PARQUET-609) Add Brotli compression to Parquet format

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-609:
--
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Add Brotli compression to Parquet format
> 
>
> Key: PARQUET-609
> URL: https://issues.apache.org/jira/browse/PARQUET-609
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Affects Versions: format-2.3.1
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: format-2.4.0
>
>
> To use Brotli with Parquet, we need to add it to the format's compression 
> codec enum.





[jira] [Updated] (PARQUET-322) Document ENUM as a logical type

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-322:
--
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Document ENUM as a logical type
> ---
>
> Key: PARQUET-322
> URL: https://issues.apache.org/jira/browse/PARQUET-322
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Jakub Kukul
> Fix For: format-2.4.0
>
>
> {{ENUM}} is used to annotate enum type in Thrift, Avro, and ProtoBuf, but 
> it's not documented anywhere in {{parquet-format}}.
> According to current (1.8-SNAPSHOT) code base, {{ENUM}} is only used to 
> annotate {{BINARY}}. For data models which lack a native enum type, {{BINARY 
> (ENUM)}} should be interpreted as a UTF-8 string.





[jira] [Updated] (PARQUET-255) Typo in decimal type specification

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-255:
--
Fix Version/s: (was: format-2.3.2)

> Typo in decimal type specification
> --
>
> Key: PARQUET-255
> URL: https://issues.apache.org/jira/browse/PARQUET-255
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Minor
> Fix For: format-2.4.0
>
>
> The original document says:
> - {{int32}}: for 1 <= precision <= 9
> - {{int64}}: for 1 <= precision <= 18; precision <= 10 will produce a warning
> - ...
> For {{int64}}, the warning should be produced when precision < 10 (rather 
> than <= 10).
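
The corrected rule can be written down as a small check. This is a sketch under assumptions: the function name and the string type tags are hypothetical, and only the int32 and int64 cases quoted above are covered.

```python
import warnings

# Maximum DECIMAL precision per physical type, as quoted in the issue.
MAX_PRECISION = {"int32": 9, "int64": 18}

def check_decimal_precision(physical_type, precision):
    """Validate a DECIMAL precision against an int32 or int64 physical type."""
    if physical_type not in MAX_PRECISION:
        raise ValueError("unsupported physical type: %r" % physical_type)
    if not 1 <= precision <= MAX_PRECISION[physical_type]:
        raise ValueError("precision %d out of range for %s" % (precision, physical_type))
    # Corrected boundary: warn for precision < 10, not <= 10.
    # Precision 10 no longer fits the int32 range above, so int64 is justified there.
    if physical_type == "int64" and precision < 10:
        warnings.warn("precision %d would also fit in int32" % precision)
```

With the original "<= 10" wording, an int64 column of precision 10 would trigger a warning even though int32 cannot hold it.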





[jira] [Updated] (PARQUET-1125) Add UUID logical type

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1125:
---
Fix Version/s: format-2.4.0

> Add UUID logical type
> -
>
> Key: PARQUET-1125
> URL: https://issues.apache.org/jira/browse/PARQUET-1125
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: format-2.4.0
>
>
> I think we should add a UUID logical type that is stored in a 16-byte fixed. 
> The common string representation is 36 bytes instead of the 16 required. 
> UUIDs are commonly used as unique identifiers, so it makes sense to have good
> support. A binary representation will reduce memory when writing or
> building bloom filters and will reduce cycles needed to compare values.
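
The size difference mentioned above is easy to confirm with Python's standard uuid module. The UUID value below is an arbitrary example, not one taken from Parquet.

```python
import uuid

u = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")

text_form = str(u).encode("ascii")  # canonical hyphenated string form
binary_form = u.bytes               # raw 16-byte value, suitable for a 16-byte fixed

assert len(text_form) == 36
assert len(binary_form) == 16
# The binary form round-trips without loss.
assert uuid.UUID(bytes=binary_form) == u
```

Storing the 16-byte form also means comparisons work on fixed-width binary values instead of 36-character strings.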





[jira] [Updated] (PARQUET-1102) Travis CI builds are failing for parquet-format PRs

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1102:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Travis CI builds are failing for parquet-format PRs
> ---
>
> Key: PARQUET-1102
> URL: https://issues.apache.org/jira/browse/PARQUET-1102
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-format
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Blocker
> Fix For: format-2.4.0
>
>
> Travis CI builds are failing for parquet-format PRs, probably due to the 
> migration from Ubuntu precise to trusty on Sep 1 according to [this Travis 
> official blog 
> post|https://blog.travis-ci.com/2017-08-31-trusty-as-default-status].





[jira] [Updated] (PARQUET-1124) Add new compression codecs to the Parquet spec

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1124:
---
Fix Version/s: (was: format-2.3.2)
   format-2.4.0

> Add new compression codecs to the Parquet spec
> --
>
> Key: PARQUET-1124
> URL: https://issues.apache.org/jira/browse/PARQUET-1124
> Project: Parquet
>  Issue Type: Task
>  Components: parquet-format
>Reporter: Ryan Blue
>Assignee: Ryan Blue
> Fix For: format-2.4.0
>
>
> After [recent 
> tests|https://lists.apache.org/thread.html/2fc572ac5fd4ac414c39047b1e6e81c36c38fc0f92e85b9aa4e5493a@%3Cdev.parquet.apache.org%3E],
>  I think we should add Zstd to the spec.
> I'm also proposing we add LZ4 because it is widely available and outperforms
> Snappy. As a successor that offers fast compression, though not necessarily good
> compression ratios, I think it makes sense to have it.
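
For readers, new codecs mostly matter when a file names a codec id the reader does not know. The sketch below mirrors the codec enum as a Python IntEnum. The numeric ids are an assumption on my part about what parquet.thrift assigns and should be checked against that file; codec_name is a hypothetical helper.

```python
from enum import IntEnum

class CompressionCodec(IntEnum):
    # Numeric ids are assumed to match parquet.thrift; verify before relying on them.
    UNCOMPRESSED = 0
    SNAPPY = 1
    GZIP = 2
    LZO = 3
    BROTLI = 4
    LZ4 = 5
    ZSTD = 6

def codec_name(codec_id):
    """Map a codec id from file metadata to a name, tolerating unknown ids."""
    try:
        return CompressionCodec(codec_id).name
    except ValueError:
        # A file written against a newer format version may use an id we lack.
        return "UNKNOWN(%d)" % codec_id
```

A reader built before Zstd and LZ4 were added would take the fallback path for those ids and could report a clear error instead of failing during decompression.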





[jira] [Resolved] (PARQUET-922) Add index pages to the format to support efficient page skipping

2017-10-16 Thread Ryan Blue (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue resolved PARQUET-922.
---
   Resolution: Fixed
Fix Version/s: format-2.4.0

Merged format PR #72. Thanks for getting this pushed through [~lv]!

> Add index pages to the format to support efficient page skipping
> 
>
> Key: PARQUET-922
> URL: https://issues.apache.org/jira/browse/PARQUET-922
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-format
>Reporter: Julien Le Dem
>Assignee: Marcel Kornacker
> Fix For: format-2.4.0
>
>
> When a Parquet file is sorted we can define an index consisting of the 
> boundary values for the pages of the columns sorted on as well as the offsets 
> and length of said pages in the file.
> The goal is to optimize lookup and range scan type queries, using this to 
> read only the pages containing data matching the filter.
> We'd require the pages to be aligned across columns.
> [~marcelk] will add a link to the google doc to discuss the spec
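
The page-skipping idea in the description can be sketched as a lookup over per-page boundary values. This is not the format's actual index layout: the two parallel lists and the function name are assumptions for illustration, and the sketch relies on the column being sorted so that both lists are non-decreasing.

```python
from bisect import bisect_left, bisect_right

def pages_to_read(page_mins, page_maxs, lo, hi):
    """Return indices of pages whose [min, max] range overlaps [lo, hi].

    page_mins[i] and page_maxs[i] are the boundary values of page i.
    Both lists must be non-decreasing (sorted column).
    """
    first = bisect_left(page_maxs, lo)   # first page with max >= lo
    last = bisect_right(page_mins, hi)   # one past the last page with min <= hi
    return list(range(first, last))

mins = [0, 10, 20, 30]
maxs = [9, 19, 29, 39]
assert pages_to_read(mins, maxs, 12, 25) == [1, 2]
assert pages_to_read(mins, maxs, 40, 50) == []
```

With the stored offsets and lengths of the selected pages, a reader would then fetch only those byte ranges instead of scanning the whole column chunk.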





[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Deepak Majeti (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206424#comment-16206424
 ] 

Deepak Majeti commented on PARQUET-1065:


If we treat Int96 as a primitive data type, then we must compare
Int96 (little-endian) values in reverse byte order. That way we check the most
significant bits first, correct?

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed.
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)
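
The signed versus unsigned disagreement, and the reverse-byte-order comparison raised in the comment, can be shown on raw 12-byte values. This sketch treats INT96 purely as a little-endian integer and says nothing about timestamp semantics. The helper names are illustrative.

```python
def int96_value(raw, signed):
    """Interpret 12 little-endian bytes as a signed or unsigned integer."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    return int.from_bytes(raw, byteorder="little", signed=signed)

def unsigned_sort_key(raw):
    """Reverse little-endian bytes so the most significant byte comes first.

    Plain bytes comparison on the result then matches unsigned numeric order.
    """
    return raw[::-1]

small = (1).to_bytes(12, byteorder="little")
high_bit_set = bytes([0xFF] * 12)

# Unsigned: 1 < 2**96 - 1. Signed: the all-ones pattern is -1, so the order flips.
unsigned_order = int96_value(small, signed=False) < int96_value(high_bit_set, signed=False)
signed_order = int96_value(small, signed=True) < int96_value(high_bit_set, signed=True)

assert unsigned_order and not signed_order
assert unsigned_sort_key(small) < unsigned_sort_key(high_bit_set)
```

Min/max statistics written under one interpretation can therefore be wrong for a reader assuming the other, which is the inconsistency the issue describes.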





[jira] [Resolved] (PARQUET-1138) [C++] Fix compilation with Arrow 0.7.1

2017-10-16 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn resolved PARQUET-1138.
--
Resolution: Fixed

Issue resolved by pull request 410
[https://github.com/apache/parquet-cpp/pull/410]

> [C++] Fix compilation with Arrow 0.7.1
> --
>
> Key: PARQUET-1138
> URL: https://issues.apache.org/jira/browse/PARQUET-1138
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Wes McKinney
> Fix For: cpp-1.3.1
>
>






[jira] [Commented] (PARQUET-1139) Add license to cmake_modules/parquet-cppConfig.cmake.in

2017-10-16 Thread Lars Volker (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206350#comment-16206350
 ] 

Lars Volker commented on PARQUET-1139:
--

https://github.com/apache/parquet-cpp/pull/411

> Add license to cmake_modules/parquet-cppConfig.cmake.in
> ---
>
> Key: PARQUET-1139
> URL: https://issues.apache.org/jira/browse/PARQUET-1139
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Lars Volker
>Assignee: Lars Volker
>
> The file is missing a license header, and RAT complains about it. I'll push a PR
> shortly.
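
As a rough illustration of what such a check looks for, the sketch below tests whether the standard ASF header marker appears near the top of a file's text. It is not how RAT works internally: RAT recognises many license families and file types, and the 30-line window and function name are assumptions.

```python
ASF_MARKER = "Licensed to the Apache Software Foundation (ASF)"

def has_asf_license_header(text, max_lines=30):
    """Return True if the ASF license marker appears in the first max_lines lines."""
    head = "\n".join(text.splitlines()[:max_lines])
    return ASF_MARKER in head

with_header = "# Licensed to the Apache Software Foundation (ASF) under one\n# ...\nset(X 1)\n"
without_header = "set(X 1)\n"
assert has_asf_license_header(with_header)
assert not has_asf_license_header(without_header)
```

A quick local scan of this kind can flag a file such as a CMake template before a release candidate is cut, but the RAT report remains the authoritative check.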





[jira] [Updated] (PARQUET-1139) Add license to cmake_modules/parquet-cppConfig.cmake.in

2017-10-16 Thread Lars Volker (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lars Volker updated PARQUET-1139:
-
Affects Version/s: cpp-1.3.0

> Add license to cmake_modules/parquet-cppConfig.cmake.in
> ---
>
> Key: PARQUET-1139
> URL: https://issues.apache.org/jira/browse/PARQUET-1139
> Project: Parquet
>  Issue Type: Bug
>  Components: parquet-cpp
>Affects Versions: cpp-1.3.0
>Reporter: Lars Volker
>Assignee: Lars Volker
>
> The file is missing a license header, and RAT complains about it. I'll push a PR
> shortly.





[jira] [Updated] (PARQUET-1140) [C++] Fail on RAT errors in CI

2017-10-16 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1140:
-
Fix Version/s: (was: cpp-1.4.0)
   cpp-1.3.1

> [C++] Fail on RAT errors in CI
> --
>
> Key: PARQUET-1140
> URL: https://issues.apache.org/jira/browse/PARQUET-1140
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.3.1
>
>
> See relevant bits in CI scripts for Apache Arrow or Apache Kudu



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PARQUET-1140) [C++] Fail on RAT errors in CI

2017-10-16 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn reassigned PARQUET-1140:


Assignee: Uwe L. Korn

> [C++] Fail on RAT errors in CI
> --
>
> Key: PARQUET-1140
> URL: https://issues.apache.org/jira/browse/PARQUET-1140
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
>Assignee: Uwe L. Korn
> Fix For: cpp-1.3.1
>
>
> See relevant bits in CI scripts for Apache Arrow or Apache Kudu





[jira] [Updated] (PARQUET-1140) [C++] Fail on RAT errors in CI

2017-10-16 Thread Uwe L. Korn (JIRA)

 [ 
https://issues.apache.org/jira/browse/PARQUET-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Uwe L. Korn updated PARQUET-1140:
-
Summary: [C++] Fail on RAT errors in CI  (was: [C++] Run RAT checks in CI)

> [C++] Fail on RAT errors in CI
> --
>
> Key: PARQUET-1140
> URL: https://issues.apache.org/jira/browse/PARQUET-1140
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.3.1
>
>
> See relevant bits in CI scripts for Apache Arrow or Apache Kudu





Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Uwe L. Korn
I will cancel the release and post a new one once the PRs are merged.

I'm a bit confused, though, about why the CI did not catch this; we have an
automatic RAT check in our chain:
https://github.com/apache/parquet-cpp/blob/master/ci/travis_script_cpp.sh#L20


On Mon, Oct 16, 2017, at 07:23 PM, Ryan Blue wrote:
> I agree that we should have a new RC. All files that can have license
> headers should have them.
> 
> On Mon, Oct 16, 2017 at 10:16 AM, Lars Volker  wrote:
> 
> > I think it'd be good to create a new RC, but I don't feel strongly about it
> > and my vote is non-binding. Maybe someone with more experience in the
> > strictness that's expected from ASF projects can weigh in.
> >
> > On Mon, Oct 16, 2017 at 10:08 AM, Wes McKinney wrote:
> >
> > > Thanks Lars for catching that. I also created PARQUET-1140 so we
> > > can be more vigilant about RAT issues.
> > >
> > > Do we need an RC1?
> > >
> > > On Mon, Oct 16, 2017 at 12:54 PM, Lars Volker  wrote:
> > > > 0 (non-binding)
> > > >
> > > > * Verified the sha512 sum
> > > > * Verified the .asc and that it matches Uwe's key
> > > > * Built and ran the unittests manually on macOS Sierra
> > > > * Ran the rat tool over the tarball. All warnings looked expected to me
> > > > except for apache-parquet-cpp-1.3.1/cmake_modules/parquet-cppConfig.cmake.in,
> > > > which I think needs a license header.
> > > >
> > > > Would +1 after a license has been added to that file. I created
> > > > https://github.com/apache/parquet-cpp/pull/411 to fix this.
> > > >
> > > > Thank you Uwe for preparing the release!
> > > >
> > > > On Mon, Oct 16, 2017 at 12:47 AM, Uwe L. Korn wrote:
> > > >
> > > >> +1
> > > >>
> > > >> * Ran verify-release-candidate on Ubuntu 16.04
> > > >> * Ran verify-release-candidate on macOS Sierra
> > > >>
> > > >> --
> > > >>   Uwe L. Korn
> > > >>   uw...@xhochy.com
> > > >>
> > > >> On Mon, Oct 16, 2017, at 02:16 AM, Wes McKinney wrote:
> > > >> > +1
> > > >> >
> > > >> > * Ran verify-release-candidate on Ubuntu 14.04
> > > >> >
> > > >> > In trying to verify the release candidate on MSVC / Visual Studio
> > > >> > 2015, I found that arrow-reader-writer-test.cc does not compile
> > > >> > against Arrow 0.7.1. The version in ThirdpartyToolchain.cmake is
> > > >> > post-0.7.1
> > > >> >
> > > >> > I posted this fix, which also tests 0.7.1 on all the platforms:
> > > >> > https://github.com/apache/parquet-cpp/pull/410. So I don't think it's
> > > >> > necessary to cancel the RC over this
> > > >> >
> > > >> > Thanks
> > > >> > Wes
> > > >> >
> > > >> > On Fri, Oct 13, 2017 at 8:24 AM, Uwe L. Korn wrote:
> > > >> > > All,
> > > >> > >
> > > >> > > I propose that we accept the following release candidate as the official
> > > >> > > Apache Parquet C++ 1.3.1 release.
> > > >> > >
> > > >> > > Parquet C++ 1.3.1-rc0 includes the following:
> > > >> > > ---
> > > >> > > The CHANGELOG for the release is available at:
> > > >> > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.3.1-rc0
> > > >> > >
> > > >> > > The tag used to create the release candidate is:
> > > >> > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.3.1-rc0
> > > >> > >
> > > >> > > The release candidate is available at:
> > > >> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz
> > > >> > >
> > > >> > > The MD5 checksum of the release candidate can be found at:
> > > >> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.md5
> > > >> > >
> > > >> > > The signature of the release candidate can be found at:
> > > >> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.asc
> > > >> > >
> > > >> > > The GPG key used to sign the release is available at:
> > > >> > > https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > > >> > >
> > > >> > > The release is based on the commit hash
> > > >> > > a1c950d889a22b267ecddaa3436d3494fcca3ae7.
> > > >> > >
> > > >> > > Please download, verify, and test.
> > > >> > >
> > > >> > > The vote will close on Mon 16 Oct 15:06:05 CEST 2017
> > > >> > >
> > > >> > > [ ] +1 Release this as Apache Parquet C++ 1.3.1
> > > >> > > [ ] +0
> > > >> > > [ ] -1 Do not release this as Apache Parquet C++ 1.3.1 because...
> > > >>
> > >
> >
> 
> 
> 
> -- 
> Ryan Blue
> Software Engineer
> Netflix


Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Ryan Blue
I agree that we should have a new RC. All files that can have license
headers should have them.



-- 
Ryan Blue
Software Engineer
Netflix


Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Lars Volker
I think it'd be good to create a new RC, but I don't feel strongly about it
and my vote is non-binding. Maybe someone with more experience in the
strictness that's expected from ASF projects can weigh in.



[jira] [Commented] (PARQUET-1140) [C++] Run RAT checks in CI

2017-10-16 Thread Lars Volker (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206226#comment-16206226
 ] 

Lars Volker commented on PARQUET-1140:
--

parquet-format also runs RAT; see there for another example.

> [C++] Run RAT checks in CI
> --
>
> Key: PARQUET-1140
> URL: https://issues.apache.org/jira/browse/PARQUET-1140
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Wes McKinney
> Fix For: cpp-1.4.0
>
>
> See relevant bits in CI scripts for Apache Arrow or Apache Kudu





Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Wes McKinney
Thanks Lars for catching that. I also created PARQUET-1140 so we
can be more vigilant about RAT issues.

Do we need an RC1?

On Mon, Oct 16, 2017 at 12:54 PM, Lars Volker  wrote:
> 0 (non-binding)
>
> * Verified the sha512 sum
> * Verified the .asc and that it matches Uwe's key
> * Built and ran the unit tests manually on macOS Sierra
> * Ran the rat tool over the tarball. All warnings looked expected to me
> except for apache-parquet-cpp-1.3.1/cmake_modules/parquet-cppConfig.cmake.in,
> which I think needs a license header.
>
> Would +1 after a license has been added to that file. I created
> https://github.com/apache/parquet-cpp/pull/411 to fix this.
>
> Thank you Uwe for preparing the release!
>
> On Mon, Oct 16, 2017 at 12:47 AM, Uwe L. Korn  wrote:
>
>> +1
>>
>> * Ran verify-release-candidate on Ubuntu 16.04
>> * Ran verify-release-candidate on macOS Sierra
>>
>> --
>>   Uwe L. Korn
>>   uw...@xhochy.com
>>
>> On Mon, Oct 16, 2017, at 02:16 AM, Wes McKinney wrote:
>> > +1
>> >
>> > * Ran verify-release-candidate on Ubuntu 14.04
>> >
>> > In trying to verify the release candidate on MSVC / Visual Studio
>> > 2015, I found that arrow-reader-writer-test.cc does not compile
>> > against Arrow 0.7.1. The version in ThirdpartyToolchain.cmake is
>> > post-0.7.1
>> >
>> > I posted this fix, which also tests 0.7.1 on all the platforms:
>> > https://github.com/apache/parquet-cpp/pull/410. So I don't think it's
>> > necessary to cancel the RC over this
>> >
>> > Thanks
>> > Wes
>> >
>> > On Fri, Oct 13, 2017 at 8:24 AM, Uwe L. Korn  wrote:
>> > > All,
>> > >
>> > > I propose that we accept the following release candidate as the official
>> > > Apache Parquet C++ 1.3.1 release.
>> > >
>> > > Parquet C++ 1.3.1-rc0 includes the following:
>> > > ---
>> > > The CHANGELOG for the release is available at:
>> > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.3.1-rc0
>> > >
>> > > The tag used to create the release candidate is:
>> > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.3.1-rc0
>> > >
>> > > The release candidate is available at:
>> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz
>> > >
>> > > The MD5 checksum of the release candidate can be found at:
>> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.md5
>> > >
>> > > The signature of the release candidate can be found at:
>> > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.asc
>> > >
>> > > The GPG keys used to sign the release are available at:
>> > > https://dist.apache.org/repos/dist/dev/parquet/KEYS
>> > >
>> > > The release is based on the commit hash
>> > > a1c950d889a22b267ecddaa3436d3494fcca3ae7.
>> > >
>> > > Please download, verify, and test.
>> > >
>> > > The vote will close on Mon 16 Oct 15:06:05 CEST 2017
>> > >
>> > > [ ] +1 Release this as Apache Parquet C++ 1.3.1
>> > > [ ] +0
>> > > [ ] -1 Do not release this as Apache Parquet C++ 1.3.1 because...
>>


[jira] [Created] (PARQUET-1140) [C++] Run RAT checks in CI

2017-10-16 Thread Wes McKinney (JIRA)
Wes McKinney created PARQUET-1140:
-

 Summary: [C++] Run RAT checks in CI
 Key: PARQUET-1140
 URL: https://issues.apache.org/jira/browse/PARQUET-1140
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp
Reporter: Wes McKinney
 Fix For: cpp-1.4.0


See relevant bits in CI scripts for Apache Arrow or Apache Kudu





Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Lars Volker
0 (non-binding)

* Verified the sha512 sum
* Verified the .asc and that it matches Uwe's key
* Built and ran the unit tests manually on macOS Sierra
* Ran the rat tool over the tarball. All warnings looked expected to me
except for apache-parquet-cpp-1.3.1/cmake_modules/parquet-cppConfig.cmake.in,
which I think needs a license header.
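
For readers who want to repeat the checksum step, here is a minimal sketch in Python, assuming the expected SHA-512 digest is already at hand as a hex string. The helper names `sha512_of` and `matches_checksum` are made up for illustration; this is not the project's verify-release-candidate script, and it does not replace checking the .asc signature with GPG.

```python
import hashlib


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        # Read fixed-size chunks so a large tarball is never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_hex):
    """Compare the file's digest with an expected hex string.

    Surrounding whitespace and letter case in `expected_hex` are ignored.
    """
    return sha512_of(path) == expected_hex.strip().lower()
```

Usage would look like `matches_checksum("apache-parquet-cpp-1.3.1.tar.gz", expected)`, where `expected` is the digest string taken from wherever the checksum is published.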

Would +1 after a license has been added to that file. I created
https://github.com/apache/parquet-cpp/pull/411 to fix this.

Thank you Uwe for preparing the release!

On Mon, Oct 16, 2017 at 12:47 AM, Uwe L. Korn  wrote:

> +1
>
> * Ran verify-release-candidate on Ubuntu 16.04
> * Ran verify-release-candidate on macOS Sierra
>
> --
>   Uwe L. Korn
>   uw...@xhochy.com
>
> On Mon, Oct 16, 2017, at 02:16 AM, Wes McKinney wrote:
> > +1
> >
> > * Ran verify-release-candidate on Ubuntu 14.04
> >
> > In trying to verify the release candidate on MSVC / Visual Studio
> > 2015, I found that arrow-reader-writer-test.cc does not compile
> > against Arrow 0.7.1. The version in ThirdpartyToolchain.cmake is
> > post-0.7.1
> >
> > I posted this fix, which also tests 0.7.1 on all the platforms:
> > https://github.com/apache/parquet-cpp/pull/410. So I don't think it's
> > necessary to cancel the RC over this
> >
> > Thanks
> > Wes
> >
> > On Fri, Oct 13, 2017 at 8:24 AM, Uwe L. Korn  wrote:
> > > All,
> > >
> > > > I propose that we accept the following release candidate as the official
> > > > Apache Parquet C++ 1.3.1 release.
> > >
> > > Parquet C++ 1.3.1-rc0 includes the following:
> > > ---
> > > The CHANGELOG for the release is available at:
> > > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.3.1-rc0
> > > >
> > > > The tag used to create the release candidate is:
> > > > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.3.1-rc0
> > > >
> > > > The release candidate is available at:
> > > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz
> > > >
> > > > The MD5 checksum of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.md5
> > > >
> > > > The signature of the release candidate can be found at:
> > > > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.asc
> > >
> > > > The GPG keys used to sign the release are available at:
> > > https://dist.apache.org/repos/dist/dev/parquet/KEYS
> > >
> > > The release is based on the commit hash
> > > a1c950d889a22b267ecddaa3436d3494fcca3ae7.
> > >
> > > Please download, verify, and test.
> > >
> > > > The vote will close on Mon 16 Oct 15:06:05 CEST 2017
> > >
> > > [ ] +1 Release this as Apache Parquet C++ 1.3.1
> > > [ ] +0
> > > [ ] -1 Do not release this as Apache Parquet C++ 1.3.1 because...
>


[jira] [Created] (PARQUET-1139) Add license to cmake_modules/parquet-cppConfig.cmake.in

2017-10-16 Thread Lars Volker (JIRA)
Lars Volker created PARQUET-1139:


 Summary: Add license to cmake_modules/parquet-cppConfig.cmake.in
 Key: PARQUET-1139
 URL: https://issues.apache.org/jira/browse/PARQUET-1139
 Project: Parquet
  Issue Type: Bug
  Components: parquet-cpp
Reporter: Lars Volker
Assignee: Lars Volker


The file is missing a license header, and RAT complains about it. I'll push a 
PR shortly.
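
As a rough illustration of the kind of check involved, the sketch below looks for the opening phrase of the standard ASF source header near the top of a file's text. It is a simplified stand-in, not how Apache RAT itself works; the function name `has_license_header` and the 20-line window are assumptions made for this example.

```python
# Opening phrase of the standard ASF source header.
APACHE_MARKER = "Licensed to the Apache Software Foundation"


def has_license_header(text, max_lines=20):
    """Return True if the ASF marker appears in the first `max_lines` lines.

    Only the top of the file is inspected, since a license header is
    expected before any real content.
    """
    head = "\n".join(text.splitlines()[:max_lines])
    return APACHE_MARKER in head
```

A CI job could run such a check over each file in the tarball and fail on the first file that returns False, which is roughly the role PARQUET-1140 asks RAT to play.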





[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:59 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will vary 
wildly, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

Edit: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.
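
The signed versus unsigned discrepancy can be reproduced with a small Python sketch. It assumes the common INT96 timestamp layout of 8 little-endian bytes of nanoseconds within the day followed by 4 little-endian bytes of Julian day number; the helper names and the day number are hypothetical and chosen only for illustration.

```python
import struct


def int96_timestamp(nanos_of_day, julian_day):
    """Pack a timestamp into 12 bytes (assumed layout, see lead-in)."""
    # "<qi" = little-endian 8-byte integer followed by 4-byte integer.
    return struct.pack("<qi", nanos_of_day, julian_day)


def cmp_unsigned(a, b):
    """Bytewise comparison treating every byte as 0..255."""
    return (a > b) - (a < b)


def cmp_signed(a, b):
    """Bytewise comparison treating every byte as -128..127."""
    sa = [x - 256 if x > 127 else x for x in a]
    sb = [x - 256 if x > 127 else x for x in b]
    return (sa > sb) - (sa < sb)


day = 2458043  # arbitrary Julian day number, same for both values
early = int96_timestamp(0x01, day)  # first raw byte is 0x01
late = int96_timestamp(0xFF, day)   # first raw byte is 0xFF

# The two raw-byte comparisons disagree on which value is smaller.
print(cmp_unsigned(early, late), cmp_signed(early, late))
```

Both values differ only in their least significant byte, which is the first byte on disk, so the byte 0xFF sorts last when unsigned and first when signed.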


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)





[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:57 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (e.g., sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is utilized to its full 
extent, which may be a negligible fraction of all use cases. Still, the 
possibility is there.


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is used to its full extent, 
which may be a negligible fraction of all use cases. Still, the possibility is 
there.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)





[jira] [Comment Edited] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi edited comment on PARQUET-1065 at 10/16/17 3:56 PM:
--

Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

EDIT: Actually, although INT96 timestamps have nanosecond precision, most of 
the time they are used to store less precise timestamps (sec, millisec or 
microsec). In these cases, the least significant byte is always 0x00, so 
comparison signedness does not affect the order. The problem I described above 
is only present when the INT96 timestamp precision is used to its full extent, 
which may be a negligible fraction of all use cases. Still, the possibility is 
there.


was (Author: zi):
Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)





[jira] [Commented] (PARQUET-1065) Deprecate type-defined sort ordering for INT96 type

2017-10-16 Thread Zoltan Ivanfi (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16205986#comment-16205986
 ] 

Zoltan Ivanfi commented on PARQUET-1065:


Unfortunately, since INT96 timestamps are stored in little endian order, the 
first byte will store the least significant byte of the timestamp and not the 
most significant one. For this reason, the value of the first byte will wildly 
vary, spanning the whole range between 0x00 and 0xFF. As a result, when 
comparing the raw bytes, signed and unsigned comparison can lead to different 
results.

> Deprecate type-defined sort ordering for INT96 type
> ---
>
> Key: PARQUET-1065
> URL: https://issues.apache.org/jira/browse/PARQUET-1065
> Project: Parquet
>  Issue Type: Bug
>Reporter: Zoltan Ivanfi
>Assignee: Zoltan Ivanfi
>
> [parquet.thrift in 
> parquet-format|https://github.com/apache/parquet-format/blob/041708da1af52e7cb9288c331b542aa25b68a2b6/src/main/thrift/parquet.thrift#L37]
>  defines the sort order for INT96 to be signed. 
> [ParquetMetadataConverter.java in 
> parquet-mr|https://github.com/apache/parquet-mr/blob/352b906996f392030bfd53b93e3cf4adb78d1a55/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L422]
>  uses unsigned ordering instead. In practice, INT96 is only used for 
> timestamps and neither signed nor unsigned ordering of the numeric values is 
> correct for this purpose. For this reason, the INT96 sort order should be 
> specified as undefined.
> (As a special case, min == max signifies that all values are the same, and 
> can be considered valid even for undefined orderings.)





Re: [VOTE] Release Apache Parquet C++ 1.3.1 RC0

2017-10-16 Thread Uwe L. Korn
+1 

* Ran verify-release-candidate on Ubuntu 16.04
* Ran verify-release-candidate on macOS Sierra

-- 
  Uwe L. Korn
  uw...@xhochy.com

On Mon, Oct 16, 2017, at 02:16 AM, Wes McKinney wrote:
> +1
> 
> * Ran verify-release-candidate on Ubuntu 14.04
> 
> In trying to verify the release candidate on MSVC / Visual Studio
> 2015, I found that arrow-reader-writer-test.cc does not compile
> against Arrow 0.7.1. The version in ThirdpartyToolchain.cmake is
> post-0.7.1
> 
> I posted this fix, which also tests 0.7.1 on all the platforms:
> https://github.com/apache/parquet-cpp/pull/410. So I don't think it's
> necessary to cancel the RC over this
> 
> Thanks
> Wes
> 
> On Fri, Oct 13, 2017 at 8:24 AM, Uwe L. Korn  wrote:
> > All,
> >
> > I propose that we accept the following release candidate as the official
> > Apache Parquet C++ 1.3.1 release.
> >
> > Parquet C++ 1.3.1-rc0 includes the following:
> > ---
> > The CHANGELOG for the release is available at:
> > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git=CHANGELOG=apache-parquet-cpp-1.3.1-rc0
> >
> > The tag used to create the release candidate is:
> > https://git-wip-us.apache.org/repos/asf?p=parquet-cpp.git;a=shortlog;h=refs/tags/apache-parquet-cpp-1.3.1-rc0
> >
> > The release candidate is available at:
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz
> >
> > The MD5 checksum of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.md5
> >
> > The signature of the release candidate can be found at:
> > https://dist.apache.org/repos/dist/dev/parquet/apache-parquet-cpp-1.3.1-rc0/apache-parquet-cpp-1.3.1.tar.gz.asc
> >
> > The GPG keys used to sign the release are available at:
> > https://dist.apache.org/repos/dist/dev/parquet/KEYS
> >
> > The release is based on the commit hash
> > a1c950d889a22b267ecddaa3436d3494fcca3ae7.
> >
> > Please download, verify, and test.
> >
> > The vote will close on Mon 16 Oct 15:06:05 CEST 2017
> >
> > [ ] +1 Release this as Apache Parquet C++ 1.3.1
> > [ ] +0
> > [ ] -1 Do not release this as Apache Parquet C++ 1.3.1 because...