svn commit: r17889 - in /dev/parquet/apache-parquet-1.8.2-rc1: ./ apache-parquet-1.8.2.tar.gz apache-parquet-1.8.2.tar.gz.asc apache-parquet-1.8.2.tar.gz.md5 apache-parquet-1.8.2.tar.gz.sha

2017-01-18 Thread blue
Author: blue Date: Thu Jan 19 03:04:40 2017 New Revision: 17889 Log: Apache Parquet MR $version RC${rc} Added: dev/parquet/apache-parquet-1.8.2-rc1/ dev/parquet/apache-parquet-1.8.2-rc1/apache-parquet-1.8.2.tar.gz (with props) dev/parquet/apache-parquet-1.8.2-rc1/apache-parquet-1.8

[42/50] [abbrv] parquet-mr git commit: PARQUET-751: Add setRequestedSchema to ParquetFileReader.

2017-01-18 Thread blue
PARQUET-751: Add setRequestedSchema to ParquetFileReader. This fixes a bug introduced by dictionary filters, which reused an existing file reader to avoid opening multiple input streams. Before that commit, a new file reader was opened and passed the projection columns from the read context. The f

[03/50] [abbrv] parquet-mr git commit: PARQUET-432: Complete a todo for method ColumnDescriptor.compareTo()

2017-01-18 Thread blue
PARQUET-432: Complete a todo for method ColumnDescriptor.compareTo() The ticket proposes handling the case *path.length < o.path.length* in method ColumnDescriptor.compareTo(). Author: proflin Closes #314 from proflin/PARQUET-432 and squashes the following commits: 80ba94b [proflin] A

[48/50] [abbrv] parquet-mr git commit: PARQUET-801: Allow UserDefinedPredicates in DictionaryFilter

2017-01-18 Thread blue
PARQUET-801: Allow UserDefinedPredicates in DictionaryFilter Author: Patrick Woody Closes #394 from pwoody/pw/dictionaryUdp and squashes the following commits: d8499a0 [Patrick Woody] short circuiting and style changes 4cb9f0c [Patrick Woody] more missing imports 1ec0d39

[49/50] [abbrv] parquet-mr git commit: PARQUET-791: Add missing column support for UserDefinedPredicate

2017-01-18 Thread blue
PARQUET-791: Add missing column support for UserDefinedPredicate This extends the fix from #354 to UserDefinedPredicate. Author: Liang-Chi Hsieh Closes #389 from viirya/PARQUET-791 and squashes the following commits: d6be37d [Liang-Chi Hsieh] Address comment. 7e929c3 [Liang-Chi Hsieh] PARQUET-79

[34/50] [abbrv] parquet-mr git commit: PARQUET-372: Do not write stats larger than 4k.

2017-01-18 Thread blue
PARQUET-372: Do not write stats larger than 4k. This updates the stats conversion to check whether the min and max values for page stats are larger than 4k. If so, no statistics for a page are written. Author: Ryan Blue Closes #275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashe
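
A minimal sketch of that size guard, assuming only what the summary above says (a 4k threshold on min and max); the constant and method names here are made up:

```java
// Hypothetical sketch: skip writing page statistics when either bound
// exceeds the 4k threshold described in the commit message.
public class StatsSizeGuard {
  private static final int MAX_STATS_SIZE = 4096; // assumed 4k threshold

  static boolean shouldWriteStats(byte[] min, byte[] max) {
    return min != null && max != null
        && min.length <= MAX_STATS_SIZE
        && max.length <= MAX_STATS_SIZE;
  }
}
```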

[07/50] [abbrv] parquet-mr git commit: PARQUET-548: Add EncodingStats.

2017-01-18 Thread blue
PARQUET-548: Add EncodingStats. This adds `EncodingStats`, which tracks the number of pages for each encoding, separated into dictionary and data pages. It also adds convenience functions that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and `hasNonDictionaryEncodedPage
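
A hedged usage sketch based only on the method names quoted above (the second predicate's exact name is assumed, since the summary is truncated); the local interface stands in for the real `EncodingStats`:

```java
// Sketch only: a chunk is fully dictionary-encoded when it has
// dictionary-encoded pages and no pages with any other encoding,
// which is the condition dictionary filtering relies on.
interface EncodingStatsView {
  boolean hasDictionaryEncodedPages();
  boolean hasNonDictionaryEncodedPages();
}

class DictionaryFilterCheck {
  static boolean allPagesDictionaryEncoded(EncodingStatsView stats) {
    return stats.hasDictionaryEncodedPages()
        && !stats.hasNonDictionaryEncodedPages();
  }
}
```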

[09/50] [abbrv] parquet-mr git commit: PARQUET-484: Warn when Decimal is stored as INT64 while could be stored as INT32

2017-01-18 Thread blue
PARQUET-484: Warn when Decimal is stored as INT64 while could be stored as INT32 Below is documented in [LogicalTypes.md](https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal): > int32: for 1 <= precision <= 9 > int64: for 1 <= precision <= 18; precision < 10 will produc

[25/50] [abbrv] parquet-mr git commit: PARQUET-560: Synchronize writes to the finishCalled variable

2017-01-18 Thread blue
PARQUET-560: Synchronize writes to the finishCalled variable Reads of the `finishCalled` variable are properly synchronized, but writes are not, so the synchronization is inconsistent. This PR fixes that. /cc @rdblue can you please take a look? Author: Nezih Yigitbasi Cl
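
A minimal sketch of the consistent-synchronization pattern such a fix applies (hypothetical class; the real flag lives in the writer):

```java
// Sketch: both the write and the read of the flag synchronize on the
// same monitor, so one thread's finish is visible to all other threads.
class FinishFlag {
  private boolean finishCalled = false;

  synchronized void markFinished() {
    finishCalled = true;
  }

  synchronized boolean isFinished() {
    return finishCalled;
  }
}
```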

[26/50] [abbrv] parquet-mr git commit: PARQUET-660: Ignore extension fields in protobuf messages.

2017-01-18 Thread blue
PARQUET-660: Ignore extension fields in protobuf messages. Currently, converting protobuf messages with extension can result in an uninformative error or a data corruption. A more detailed explanation in the corresponding [jira](https://issues.apache.org/jira/browse/PARQUET-660). This patch sim

[02/50] [abbrv] parquet-mr git commit: PARQUET-393: Update to parquet-format 2.3.1.

2017-01-18 Thread blue
PARQUET-393: Update to parquet-format 2.3.1. Author: Ryan Blue Closes #303 from rdblue/PARQUET-393-update-parquet-format-version and squashes the following commits: 0e4c798 [Ryan Blue] PARQUET-393: Add TIME_MICROS and TIMESTAMP_MICROS. ca4a741 [Ryan Blue] PARQUET-393: Update to parquet-format

[10/50] [abbrv] parquet-mr git commit: PARQUET-431: Make ParquetOutputFormat.memoryManager volatile

2017-01-18 Thread blue
PARQUET-431: Make ParquetOutputFormat.memoryManager volatile Currently ParquetOutputFormat.getRecordWriter() contains an unsynchronized lazy initialization of the non-volatile static field *memoryManager*. Because the compiler or processor may reorder instructions, threads are not guaranteed to
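
A sketch of safe lazy initialization with a volatile field, the pattern the commit title describes; everything except the field name `memoryManager` is hypothetical:

```java
// Sketch: double-checked locking is only correct when the field is
// volatile, which forbids the instruction reordering described above.
class OutputFormatSketch {
  private static volatile Object memoryManager;

  static Object getMemoryManager() {
    Object result = memoryManager;        // single cheap volatile read
    if (result == null) {
      synchronized (OutputFormatSketch.class) {
        result = memoryManager;           // re-check under the lock
        if (result == null) {
          memoryManager = result = new Object();
        }
      }
    }
    return result;
  }
}
```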

[31/50] [abbrv] parquet-mr git commit: PARQUET-358: Add support for Avro's logical types API.

2017-01-18 Thread blue
PARQUET-358: Add support for Avro's logical types API. This adds support for Avro's logical types API to parquet-avro. * The logical types API was introduced in Avro 1.8.0, so this bumps the Avro dependency version to 1.8.0. * Types supported are: decimal, date, time-millis, time-micros, timest

[44/50] [abbrv] parquet-mr git commit: PARQUET-400: Replace CompatibilityUtil with SeekableInputStream.

2017-01-18 Thread blue
PARQUET-400: Replace CompatibilityUtil with SeekableInputStream. This fixes PARQUET-400 by replacing `CompatibilityUtil` with `SeekableInputStream` that's implemented for hadoop-1 and hadoop-2. The benefit of this approach is that `SeekableInputStream` can be used for non-Hadoop file systems in

[46/50] [abbrv] parquet-mr git commit: PARQUET-220: Unnecessary warning in ParquetRecordReader.initialize

2017-01-18 Thread blue
PARQUET-220: Unnecessary warning in ParquetRecordReader.initialize Rather than querying the COUNTER_METHOD up front, the counter method is resolved per object. This allows us to use the 'getCounter' method on any TaskAttemptContext with the correct signature (ignoring versions where TaskAttemptC

[22/50] [abbrv] parquet-mr git commit: PARQUET-423: Replace old Log class with SLF4J Logging

2017-01-18 Thread blue
PARQUET-423: Replace old Log class with SLF4J Logging And make writing files less noisy Author: Niels Basjes Closes #369 from nielsbasjes/PARQUET-423-2 and squashes the following commits: b31e30f [Niels Basjes] Merge branch 'master' of github.com:apache/parquet-mr into PARQUET-423-2 2d4db4b [

[40/50] [abbrv] parquet-mr git commit: PARQUET-511: Integer overflow when counting values in column.

2017-01-18 Thread blue
PARQUET-511: Integer overflow when counting values in column. This commit fixes an issue when the number of entries in a column page overflows an int. No exception is thrown directly, but the def level is set incorrectly, leading to a null value being returned during read.
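
A worked example of the wrap-around itself, independent of Parquet:

```java
// int arithmetic silently wraps past Integer.MAX_VALUE; long does not.
public class OverflowDemo {
  public static void main(String[] args) {
    int intCount = Integer.MAX_VALUE;
    intCount += 1;
    System.out.println(intCount);   // -2147483648: wrapped to negative
    long longCount = Integer.MAX_VALUE;
    longCount += 1;
    System.out.println(longCount);  // 2147483648: still correct
  }
}
```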

[13/50] [abbrv] parquet-mr git commit: PARQUET-580: Switch int[] initialization in IntList to be lazy

2017-01-18 Thread blue
PARQUET-580: Switch int[] initialization in IntList to be lazy For a dataset we were trying to import that had a lot of unused columns (a few thousand), we ended up allocating a lot of unnecessary int arrays (each 64K in size). Heap footprint for all those int[]s
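
A simplified sketch of lazy allocation under the assumption described above (single slab, growth omitted; the real logic lives in IntList):

```java
// Sketch: the 64K backing array is created on the first add(), so a
// column that never receives a value never pays for the allocation.
class LazyIntList {
  private static final int SLAB_SIZE = 64 * 1024;
  private int[] slab;   // null until the first value arrives
  private int count;

  void add(int value) {
    if (slab == null) {
      slab = new int[SLAB_SIZE];
    }
    slab[count++] = value;
  }
}
```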

parquet-mr git commit: [maven-release-plugin] prepare for next development iteration

2017-01-18 Thread blue
Repository: parquet-mr Updated Branches: refs/heads/parquet-1.8.x c65227886 -> 4297134dc [maven-release-plugin] prepare for next development iteration Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/4297134d Tree

[24/50] [abbrv] parquet-mr git commit: PARQUET-623: Fix DeltaByteArrayReader#skip.

2017-01-18 Thread blue
PARQUET-623: Fix DeltaByteArrayReader#skip. Previously, this passed the skip to the underlying readers, but would not update previous and would corrupt values or cause exceptions. Author: Ryan Blue Closes #366 from rdblue/PARQUET-623-fix-delta-byte-array-skip and squashes the following commits

[21/50] [abbrv] parquet-mr git commit: PARQUET-423: Replace old Log class with SLF4J Logging

2017-01-18 Thread blue
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-column/src/main/java/org/apache/parquet/io/RecordConsumerLoggingWrapper.java -- diff --git a/parquet-column/src/main/java/org/apache/parquet/io/RecordConsumer

[47/50] [abbrv] parquet-mr git commit: PARQUET-783: Close the underlying stream when an H2SeekableInputStream is closed

2017-01-18 Thread blue
PARQUET-783: Close the underlying stream when an H2SeekableInputStream is closed This PR addresses https://issues.apache.org/jira/browse/PARQUET-783. `ParquetFileReader` opens a `SeekableInputStream` to read a footer. In the process, it opens a new `FSDataInputStream` and wraps it. However, `H2
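
A sketch of the delegating close such a fix adds: a wrapper stream must forward close() to the stream it wraps (class and field names here are hypothetical):

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch: without the close() override, the wrapped stream (and its
// underlying file handle) is never released.
class DelegatingSeekableStream extends InputStream {
  private final InputStream wrapped;

  DelegatingSeekableStream(InputStream wrapped) {
    this.wrapped = wrapped;
  }

  @Override
  public int read() throws IOException {
    return wrapped.read();
  }

  @Override
  public void close() throws IOException {
    wrapped.close();
  }
}
```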

[14/50] [abbrv] parquet-mr git commit: PARQUET-384: Add dictionary filtering.

2017-01-18 Thread blue
PARQUET-384: Add dictionary filtering. This builds on #286 from @danielcweeks and cleans up some of the interfaces. It introduces `DictionaryPageReadStore` to expose dictionary pages to the filters and cleans up some internal calls by passing `ParquetFileReader`. When committed, this closes #28

[50/50] [abbrv] parquet-mr git commit: [maven-release-plugin] prepare release apache-parquet-1.8.2

2017-01-18 Thread blue
[maven-release-plugin] prepare release apache-parquet-1.8.2 Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/c6522788 Tree: http://git-wip-us.apache.org/repos/asf/parquet-mr/tree/c6522788 Diff: http://git-wip-us.apach

[parquet-mr] Git Push Summary

2017-01-18 Thread blue
Repository: parquet-mr Updated Tags: refs/tags/apache-parquet-1.8.2 [created] beaf00345

[43/50] [abbrv] parquet-mr git commit: PARQUET-669: allow reading footers from provided file listing and streams

2017-01-18 Thread blue
PARQUET-669: allow reading footers from provided file listing and streams The use case is that I want to reuse an existing listing of files and avoid doing it again when opening streams. This is for cases where filesystem.open is expensive but you have other means of obtaining an input stream for a file

[05/50] [abbrv] parquet-mr git commit: PARQUET-415: Fix ByteBuffer Binary serialization.

2017-01-18 Thread blue
PARQUET-415: Fix ByteBuffer Binary serialization. This also adds a test to validate that serialization works for all Binary objects that are already test cases. Author: Ryan Blue Closes #305 from rdblue/PARQUET-415-fix-bytebuffer-binary-serialization and squashes the following commits: 4e75d5

[06/50] [abbrv] parquet-mr git commit: PARQUET-585: Slowly ramp up sizes of int[]s in IntList to keep sizes small when data sets are small

2017-01-18 Thread blue
PARQUET-585: Slowly ramp up sizes of int[]s in IntList to keep sizes small when data sets are small One of the follow up items from PR - https://github.com/apache/parquet-mr/pull/339 was to slowly ramp up the size of the int[] created in IntList to ensure we don't allocate 64K arrays right off
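
A sketch of the ramp-up idea; only the strategy of growing toward the 64K cap is taken from the summary, the starting size and doubling step are assumptions:

```java
// Sketch: each new slab doubles in size until it hits the 64K cap, so
// small data sets allocate small arrays while large ones converge on
// the original behavior.
class SlabSizer {
  private static final int MAX_SLAB_SIZE = 64 * 1024;
  private int nextSlabSize = 4 * 1024;  // assumed starting size

  int[] newSlab() {
    int[] slab = new int[nextSlabSize];
    nextSlabSize = Math.min(nextSlabSize * 2, MAX_SLAB_SIZE);
    return slab;
  }
}
```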

[11/50] [abbrv] parquet-mr git commit: PARQUET-430: Change to use Locale parameterized version of String.toUpperCase()/toLowerCase

2017-01-18 Thread blue
PARQUET-430: Change to use Locale parameterized version of String.toUpperCase()/toLowerCase A String is being converted to upper or lowercase using the platform's default locale. This may result in improper conversions when used with international characters. For instance, "TITLE".toLowerCa
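
The classic demonstration of the problem is the Turkish dotted/dotless I; passing an explicit locale makes the conversion deterministic:

```java
import java.util.Locale;

public class CaseConversionDemo {
  public static void main(String[] args) {
    // Under a Turkish locale, 'I' lowercases to dotless 'ı':
    System.out.println("TITLE".toLowerCase(new Locale("tr"))); // tıtle
    // Locale.ROOT gives the locale-independent result:
    System.out.println("TITLE".toLowerCase(Locale.ROOT));      // title
  }
}
```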

[41/50] [abbrv] parquet-mr git commit: PARQUET-686: Do not return min/max for the wrong order.

2017-01-18 Thread blue
PARQUET-686: Do not return min/max for the wrong order. Min and max are currently calculated using the default Java ordering that uses signed comparison for all values. This is not correct for binary types like strings and decimals or for unsigned numeric types. This commit prevents statistics acc
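
A worked example of why signed byte comparison mis-orders binary values:

```java
public class SignedOrderDemo {
  public static void main(String[] args) {
    byte a = 0x61;        // 'a'
    byte b = (byte) 0xC3; // first byte of a UTF-8 multi-byte character
    // Signed comparison says b < a, because 0xC3 is negative as a byte:
    System.out.println(b < a);  // true: the wrong order for binary data
    // Unsigned comparison gives the lexicographic order stats need:
    System.out.println(
        Byte.toUnsignedInt(b) < Byte.toUnsignedInt(a)); // false
  }
}
```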

[18/50] [abbrv] parquet-mr git commit: PARQUET-571: Fix potential leak in ParquetFileReader.close()

2017-01-18 Thread blue
PARQUET-571: Fix potential leak in ParquetFileReader.close() If an exception occurs when closing the input stream `f`, the codecs will not be released. This may cause native memory leaks for some codecs. /cc @rdblue Author: Nezih Yigitbasi Closes #338 from nezihyigitbasi/leak-fix and squashes
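
A sketch of the try/finally shape of such a fix; `codecs` is a hypothetical stand-in for the reader's codec bookkeeping that must always be released:

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;

// Sketch: the finally block guarantees codec release even when
// closing the input stream throws.
class ReaderCloseSketch implements Closeable {
  private final InputStream f;
  private final Closeable codecs;

  ReaderCloseSketch(InputStream f, Closeable codecs) {
    this.f = f;
    this.codecs = codecs;
  }

  @Override
  public void close() throws IOException {
    try {
      f.close();
    } finally {
      codecs.close();
    }
  }
}
```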

[45/50] [abbrv] parquet-mr git commit: PARQUET-753: Fixed GroupType.union() to handle original type

2017-01-18 Thread blue
PARQUET-753: Fixed GroupType.union() to handle original type Also fixed GroupType.equals() to compare the original type, and fixed 2 unit tests that weren't setting the original type properly on the expected results. Author: adeneche Closes #380 from adeneche/fix-grouptype-union and

[38/50] [abbrv] parquet-mr git commit: PARQUET-612: Add compression codec to FileEncodingsIT.

2017-01-18 Thread blue
PARQUET-612: Add compression codec to FileEncodingsIT. Author: Ryan Blue Closes #343 from rdblue/PARQUET-612-test-compression and squashes the following commits: a5b7dbb [Ryan Blue] PARQUET-612: Add compression codec to FileEncodingsIT. Project: http://git-wip-us.apache.org/repos/asf/parquet

[37/50] [abbrv] parquet-mr git commit: PARQUET-743: Fix DictionaryFilter when compressed dictionaries are reused.

2017-01-18 Thread blue
PARQUET-743: Fix DictionaryFilter when compressed dictionaries are reused. BytesInput is not supposed to be held and reused, but decompressed dictionary pages do this. Reusing the dictionary will cause a failure, so the cleanest option is to keep the bytes around once the underlying stream has bee

[27/50] [abbrv] parquet-mr git commit: PARQUET-389: Support predicate push down on missing columns.

2017-01-18 Thread blue
PARQUET-389: Support predicate push down on missing columns. Predicate push-down will complain when predicates reference columns that aren't in a file's schema. This makes it difficult to implement predicate push-down in engines where schemas evolve because each task needs to process the predica

[17/50] [abbrv] parquet-mr git commit: PARQUET-581: Fix two instances of the conflation of the min and max row

2017-01-18 Thread blue
PARQUET-581: Fix two instances of the conflation of the min and max row count for page size check in ParquetOutputFormat.java Author: Michael Allman Closes #340 from mallman/fix_minmax_conflation and squashes the following commits: 79331a5 [Michael Allman] PARQUET-581: Fix two instances of th

[33/50] [abbrv] parquet-mr git commit: PARQUET-544: Add closed flag to allow for closeable contract adherence

2017-01-18 Thread blue
PARQUET-544: Add closed flag to allow for closeable contract adherence The Closeable interface states: > Closes this stream and releases any system resources associated with it. If > the stream is already closed then invoking this method has no effect. As InternalParquetRecordWriter implements t
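
A minimal sketch of the idempotent-close pattern the contract requires (hypothetical class; the real one is InternalParquetRecordWriter):

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch: a closed flag makes the second and later close() calls
// no-ops, as the Closeable contract quoted above requires.
class IdempotentWriter implements Closeable {
  private boolean closed = false;

  @Override
  public void close() throws IOException {
    if (closed) {
      return;
    }
    closed = true;
    // ... flush and release resources exactly once here ...
  }
}
```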

[19/50] [abbrv] parquet-mr git commit: PARQUET-423: Replace old Log class with SLF4J Logging

2017-01-18 Thread blue
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-thrift/src/main/java/org/apache/parquet/thrift/ParquetWriteProtocol.java -- diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ParquetWritePr

[39/50] [abbrv] parquet-mr git commit: PARQUET-685 - Deprecated ParquetInputSplit constructor passes paramet…

2017-01-18 Thread blue
PARQUET-685 - Deprecated ParquetInputSplit constructor passes paramet… The problem was not discovered because the test was buggy. Updated both sides. Author: Gabor Szadovszky Closes #372 from gszadovszky/PARQUET-685 and squashes the following commits: 9cbeee2 [Gabor Szadovszky] PARQUET-685

[15/50] [abbrv] parquet-mr git commit: PARQUET-528: Fix flush() for RecordConsumer and implementations

2017-01-18 Thread blue
PARQUET-528: Fix flush() for RecordConsumer and implementations `flush()` was added in `RecordConsumer` and `MessageColumnIO` to help implement nulls caching. However, other `RecordConsumer` implementations should also implement `flush()` properly. For instance, `RecordConsumerLoggingWrappe

[20/50] [abbrv] parquet-mr git commit: PARQUET-423: Replace old Log class with SLF4J Logging

2017-01-18 Thread blue
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java -- diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter

[12/50] [abbrv] parquet-mr git commit: PARQUET-529: Avoid evoking job.toString() in ParquetLoader

2017-01-18 Thread blue
PARQUET-529: Avoid evoking job.toString() in ParquetLoader When run in a hadoop2 environment with the log level set to `DEBUG`, ParquetLoader would invoke `job.toString()` in several methods, which might cause the whole application to stop due to: ``` java.lang.IllegalStateException: Job in sta

[29/50] [abbrv] parquet-mr git commit: PARQUET-674: Add InputFile abstraction for openable files.

2017-01-18 Thread blue
PARQUET-674: Add InputFile abstraction for openable files. Author: Ryan Blue Closes #368 from rdblue/PARQUET-674-add-data-source and squashes the following commits: 8c689e9 [Ryan Blue] PARQUET-674: Implement review comments. 4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for opena

[36/50] [abbrv] parquet-mr git commit: PARQUET-645: Fix null handling in DictionaryFilter.

2017-01-18 Thread blue
PARQUET-645: Fix null handling in DictionaryFilter. This fixes how null is handled by `DictionaryFilter` for equals predicates. Null is never in the dictionary and is encoded by the definition level, so the `DictionaryFilter` would never find the value in the dictionary and would incorrectly fi
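
A hedged sketch of the corrected decision: for an equals-null predicate, consult the null count rather than the dictionary. All names here are hypothetical, and this assumes the dictionary covers every non-null value in the block:

```java
import java.util.Set;

// Sketch: nulls never appear in the dictionary, so eq(col, null) must
// be answered from the null count, not from dictionary membership.
class NullAwareDictionaryCheck {
  static boolean canDropBlock(Object value, Set<Object> dictionary,
      long nullCount) {
    if (value == null) {
      return nullCount == 0;  // no nulls -> eq(null) matches nothing
    }
    // value absent from the dictionary -> predicate matches nothing
    return !dictionary.contains(value);
  }
}
```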

[08/50] [abbrv] parquet-mr git commit: PARQUET-569: Separate metadata filtering for ranges and offsets.

2017-01-18 Thread blue
PARQUET-569: Separate metadata filtering for ranges and offsets. Range filtering should use the row group midpoint and offset filtering should use the start offset. Author: Ryan Blue Closes #337 from rdblue/PARQUET-569-fix-metadata-filter and squashes the following commits: 6171af4 [Ryan Blue

[32/50] [abbrv] parquet-mr git commit: PARQUET-654: Add option to disable record-level filtering.

2017-01-18 Thread blue
PARQUET-654: Add option to disable record-level filtering. This can be used by frameworks that use codegen for filtering to avoid running filters within Parquet. Author: Ryan Blue Closes #353 from rdblue/PARQUET-654-add-record-level-filter-option and squashes the following commits: b497e7f [R

[28/50] [abbrv] parquet-mr git commit: PARQUET-642: Improve performance of ByteBuffer based read / write paths

2017-01-18 Thread blue
PARQUET-642: Improve performance of ByteBuffer based read / write paths While trying out the newest Parquet version, we noticed that the changes to start using ByteBuffers: https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8 and https://github.com/apache/parque

[16/50] [abbrv] parquet-mr git commit: PARQUET-382: Add methods to append encoded data to files.

2017-01-18 Thread blue
PARQUET-382: Add methods to append encoded data to files. This allows appending encoded data blocks to open ParquetFileWriters, which makes it possible to merge multiple Parquet files without re-encoding all of the records. This works by finding the column chunk for each column in the file schema

[35/50] [abbrv] parquet-mr git commit: PARQUET-651: Improve Avro's isElementType check.

2017-01-18 Thread blue
PARQUET-651: Improve Avro's isElementType check. The Avro implementation needs to check whether the read schema that is passed by the user (or automatically converted from the file schema) expects an extra 1-field layer to be returned, which matches the previous behavior of Avro when reading a 3-l

[23/50] [abbrv] parquet-mr git commit: PARQUET-726: Increase max difference of testMemoryManagerUpperLimit to 10%

2017-01-18 Thread blue
PARQUET-726: Increase max difference of testMemoryManagerUpperLimit to 10% Author: Niels Basjes Closes #370 from nielsbasjes/PARQUET-726 and squashes the following commits: f385ede [Niels Basjes] PARQUET-726: Increase max difference of testMemoryManagerUpperLimit to 10% Project: http://git-w

[30/50] [abbrv] parquet-mr git commit: PARQUET-358: Add support for Avro's logical types API.

2017-01-18 Thread blue
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/36e14294/parquet-avro/src/test/java/org/apache/parquet/avro/TestCircularReferences.java -- diff --git a/parquet-avro/src/test/java/org/apache/parquet/avro/TestCircularReferenc

[01/50] [abbrv] parquet-mr git commit: PARQUET-422: Fix a potential bug in MessageTypeParser where we ignore…

2017-01-18 Thread blue
Repository: parquet-mr Updated Branches: refs/heads/parquet-1.8.x [created] c65227886 PARQUET-422: Fix a potential bug in MessageTypeParser where we ignore… … and overwrite the initial value of a method parameter In org.apache.parquet.schema.MessageTypeParser, for addGroupType() and addP

[04/50] [abbrv] parquet-mr git commit: PARQUET-495: Fix mismatches in Types class comments

2017-01-18 Thread blue
PARQUET-495: Fix mismatches in Types class comments To produce > required group User { required int64 id; **optional** binary email (UTF8); } we should do: > Types.requiredGroup() .required(INT64).named("id") .~~**required** (BINARY).as(UTF8).named("email")~~ .**optiona