Author: blue
Date: Thu Jan 19 03:04:40 2017
New Revision: 17889
Log:
Apache Parquet MR 1.8.2 RC1
Added:
dev/parquet/apache-parquet-1.8.2-rc1/
dev/parquet/apache-parquet-1.8.2-rc1/apache-parquet-1.8.2.tar.gz (with
props)
dev/parquet/apache-parquet-1.8.2-rc1/apache-parquet-1.8
PARQUET-751: Add setRequestedSchema to ParquetFileReader.
This fixes a bug introduced by dictionary filters, which reused an
existing file reader to avoid opening multiple input streams. Before
that commit, a new file reader was opened and passed the projection
columns from the read context. The f
PARQUET-432: Complete a todo for method ColumnDescriptor.compareTo()
The ticket proposes handling the case *path.length < o.path.length* in the
method ColumnDescriptor.compareTo().
Author: proflin
Closes #314 from proflin/PARQUET-432 and squashes the following commits:
80ba94b [proflin] A
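The comparison described above can be sketched as a lexicographic walk over the two path arrays, where a shorter path that is a prefix of the other sorts first. This is an illustrative sketch, not the actual ColumnDescriptor code; the class and method names are assumptions:

```java
// Hypothetical sketch of a path comparison that also handles
// path.length < o.path.length, as the ticket proposes.
public class PathCompare {
    // Compare element-wise; when one path is a prefix of the other,
    // the shorter path sorts first.
    public static int compare(String[] path, String[] other) {
        int n = Math.min(path.length, other.length);
        for (int i = 0; i < n; i++) {
            int c = path[i].compareTo(other[i]);
            if (c != 0) return c;
        }
        return Integer.compare(path.length, other.length);
    }

    public static void main(String[] args) {
        assert compare(new String[]{"a"}, new String[]{"a", "b"}) < 0;
        assert compare(new String[]{"a", "b"}, new String[]{"a", "b"}) == 0;
        assert compare(new String[]{"b"}, new String[]{"a", "z"}) > 0;
        System.out.println("ok");
    }
}
```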
PARQUET-801: Allow UserDefinedPredicates in DictionaryFilter
Author: Patrick Woody
Closes #394 from pwoody/pw/dictionaryUdp and squashes the following commits:
d8499a0 [Patrick Woody] short circuiting and style changes
4cb9f0c [Patrick Woody] more missing imports
1ec0d39
PARQUET-791: Add missing column support for UserDefinedPredicate
This extends the fix from #354 to UserDefinedPredicate.
Author: Liang-Chi Hsieh
Closes #389 from viirya/PARQUET-791 and squashes the following commits:
d6be37d [Liang-Chi Hsieh] Address comment.
7e929c3 [Liang-Chi Hsieh] PARQUET-79
PARQUET-372: Do not write stats larger than 4k.
This updates the stats conversion to check whether the min and max
values for page stats are larger than 4k. If so, no statistics for a
page are written.
Author: Ryan Blue
Closes #275 from rdblue/PARQUET-372-fix-min-max-for-long-values and squashe
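The size check described above amounts to a simple guard before writing page statistics. A minimal sketch, assuming a 4k cutoff constant and illustrative names (not the actual stats-conversion code):

```java
// Sketch: drop page statistics entirely when min or max is too large,
// rather than writing oversized stats into the footer.
public class StatsGuard {
    // 4k cutoff from the description above; the constant name is assumed.
    static final int MAX_STATS_SIZE = 4096;

    // Returns null when either value exceeds the cutoff, meaning the
    // page is written without statistics.
    public static byte[][] statsToWrite(byte[] min, byte[] max) {
        if (min.length > MAX_STATS_SIZE || max.length > MAX_STATS_SIZE) {
            return null;
        }
        return new byte[][] { min, max };
    }

    public static void main(String[] args) {
        assert statsToWrite(new byte[10], new byte[10]) != null;
        assert statsToWrite(new byte[10], new byte[5000]) == null;
        System.out.println("ok");
    }
}
```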
PARQUET-548: Add EncodingStats.
This adds `EncodingStats`, which tracks the number of pages for each encoding,
separated into dictionary and data pages. It also adds convenience functions
that are useful for dictionary filtering, like `hasDictionaryEncodedPages` and
`hasNonDictionaryEncodedPage
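The idea behind such a tracker can be sketched as a pair of per-encoding page counters plus the convenience queries dictionary filtering needs. This is a hypothetical stand-in, not the real EncodingStats API; encoding names are represented as strings for brevity:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of an EncodingStats-style tracker: counts dictionary
// and data pages per encoding and answers dictionary-filter queries.
public class PageEncodingStats {
    private final Map<String, Integer> dataPages = new HashMap<>();
    private final Map<String, Integer> dictPages = new HashMap<>();

    public void addDataPage(String encoding) { dataPages.merge(encoding, 1, Integer::sum); }
    public void addDictPage(String encoding) { dictPages.merge(encoding, 1, Integer::sum); }

    private static boolean isDictEncoding(String e) {
        return e.equals("PLAIN_DICTIONARY") || e.equals("RLE_DICTIONARY");
    }

    public boolean hasDictionaryEncodedPages() {
        return dataPages.keySet().stream().anyMatch(PageEncodingStats::isDictEncoding);
    }

    public boolean hasNonDictionaryEncodedPages() {
        return dataPages.keySet().stream().anyMatch(e -> !isDictEncoding(e));
    }

    public static void main(String[] args) {
        PageEncodingStats s = new PageEncodingStats();
        s.addDictPage("PLAIN_DICTIONARY");
        s.addDataPage("PLAIN_DICTIONARY");
        assert s.hasDictionaryEncodedPages();
        assert !s.hasNonDictionaryEncodedPages();
        s.addDataPage("PLAIN");
        assert s.hasNonDictionaryEncodedPages();
        System.out.println("ok");
    }
}
```

A dictionary filter can skip a row group only when every data page of the column is dictionary-encoded, which is exactly what these two queries let a caller decide.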
PARQUET-484: Warn when Decimal is stored as INT64 while could be stored as INT32
Below is documented in
[LogicalTypes.md](https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#decimal):
> int32: for 1 <= precision <= 9
> int64: for 1 <= precision <= 18; precision < 10 will produc
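The precision ranges quoted from LogicalTypes.md translate directly into a choice of physical type. A minimal sketch under those rules (the method name is an illustration, not parquet-mr API):

```java
// Sketch: smallest physical type for a DECIMAL of the given precision,
// per the LogicalTypes.md ranges quoted above.
public class DecimalStorage {
    public static String smallestPhysicalType(int precision) {
        if (precision >= 1 && precision <= 9) return "INT32";
        if (precision <= 18) return "INT64";
        return "FIXED_LEN_BYTE_ARRAY"; // larger precisions need a byte array
    }

    public static void main(String[] args) {
        assert smallestPhysicalType(9).equals("INT32");
        assert smallestPhysicalType(10).equals("INT64");
        assert smallestPhysicalType(19).equals("FIXED_LEN_BYTE_ARRAY");
        System.out.println("ok");
    }
}
```

Storing a precision-9-or-less decimal as INT64 is legal but wasteful, which is why the warning described in this ticket is useful.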
PARQUET-560: Synchronize writes to the finishCalled variable
Reads of the `finishCalled` variable are properly synchronized, but writes are
not, so the synchronization is inconsistent. This PR fixes that.
/cc @rdblue can you please take a look?
Author: Nezih Yigitbasi
Cl
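The pattern at issue is simple: if reads of a flag are guarded by a lock, its writes must be guarded by the same lock, or readers may observe a stale value. A minimal sketch with illustrative names (not the actual writer code):

```java
// Sketch: both the read and the write of the flag synchronize on the
// same monitor, so every reader sees the latest write.
public class FinishFlag {
    private boolean finishCalled;

    public synchronized boolean isFinished() {
        return finishCalled;
    }

    public synchronized void markFinished() {
        finishCalled = true;
    }

    public static void main(String[] args) {
        FinishFlag f = new FinishFlag();
        assert !f.isFinished();
        f.markFinished();
        assert f.isFinished();
        System.out.println("ok");
    }
}
```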
PARQUET-660: Ignore extension fields in protobuf messages.
Currently, converting protobuf messages with extensions can result in an
uninformative error or data corruption. A more detailed explanation is in the
corresponding [jira](https://issues.apache.org/jira/browse/PARQUET-660).
This patch sim
PARQUET-393: Update to parquet-format 2.3.1.
Author: Ryan Blue
Closes #303 from rdblue/PARQUET-393-update-parquet-format-version and squashes
the following commits:
0e4c798 [Ryan Blue] PARQUET-393: Add TIME_MICROS and TIMESTAMP_MICROS.
ca4a741 [Ryan Blue] PARQUET-393: Update to parquet-format
PARQUET-431: Make ParquetOutputFormat.memoryManager volatile
Currently ParquetOutputFormat.getRecordWriter() contains an unsynchronized lazy
initialization of the non-volatile static field *memoryManager*.
Because the compiler or processor may reorder instructions, threads are not
guaranteed to
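The standard remedy for this reordering hazard is double-checked locking with a `volatile` field, which establishes the happens-before edge that the unsynchronized version lacks. A generic sketch (the field name follows the description above; the rest is illustrative):

```java
// Sketch: safe lazy initialization. The volatile write prevents the
// compiler/CPU from publishing a partially constructed instance.
public class LazyInit {
    private static volatile Object memoryManager;

    static Object getMemoryManager() {
        Object result = memoryManager;          // one volatile read on the fast path
        if (result == null) {
            synchronized (LazyInit.class) {
                result = memoryManager;          // re-check under the lock
                if (result == null) {
                    result = new Object();       // stand-in for the real MemoryManager
                    memoryManager = result;      // volatile write publishes safely
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same instance is returned on every call.
        assert getMemoryManager() == getMemoryManager();
        System.out.println("ok");
    }
}
```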
PARQUET-358: Add support for Avro's logical types API.
This adds support for Avro's logical types API to parquet-avro.
* The logical types API was introduced in Avro 1.8.0, so this bumps the Avro
dependency version to 1.8.0.
* Types supported are: decimal, date, time-millis, time-micros,
timest
PARQUET-400: Replace CompatibilityUtil with SeekableInputStream.
This fixes PARQUET-400 by replacing `CompatibilityUtil` with
`SeekableInputStream` that's implemented for hadoop-1 and hadoop-2. The benefit
of this approach is that `SeekableInputStream` can be used for non-Hadoop file
systems in
PARQUET-220: Unnecessary warning in ParquetRecordReader.initialize
Rather than querying the COUNTER_METHOD up front, the counter method is
resolved per object. This allows us to use the
'getCounter' method on any TaskAttemptContext with the correct signature
(ignoring versions where TaskAttemptC
PARQUET-423: Replace old Log class with SLF4J Logging
And make writing files less noisy
Author: Niels Basjes
Closes #369 from nielsbasjes/PARQUET-423-2 and squashes the following commits:
b31e30f [Niels Basjes] Merge branch 'master' of github.com:apache/parquet-mr
into PARQUET-423-2
2d4db4b [
PARQUET-511: Integer overflow when counting values in column.
This commit fixes an issue when the number of entries in a column page
exceeds the range of an int. No exception is thrown directly, but the
def level is set incorrectly, leading to a null value being returned during
read.
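The underlying fix pattern is to accumulate counts in a `long` so the total cannot silently wrap. A small sketch (illustrative names, not the actual reader code):

```java
// Sketch: widen per-page counts to long before summing, so totals
// past Integer.MAX_VALUE do not overflow.
public class ValueCount {
    public static long totalValues(int[] pageValueCounts) {
        long total = 0;
        for (int c : pageValueCounts) {
            total += c; // int is widened to long before the addition
        }
        return total;
    }

    public static void main(String[] args) {
        long expected = (long) Integer.MAX_VALUE + 10;
        assert totalValues(new int[]{Integer.MAX_VALUE, 10}) == expected;
        System.out.println("ok");
    }
}
```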
PARQUET-580: Switch int[] initialization in IntList to be lazy
Noticed that for a dataset we were trying to import with a lot of
columns (a few thousand) that weren't being used, we ended up allocating a lot of
unnecessary int arrays (each 64K in size). Heap footprint for all those int[]s
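The change amounts to deferring the array allocation until the first value is written. A hypothetical sketch of the idea (growth handling omitted; names are illustrative, not the real IntList):

```java
// Sketch: allocate the backing int[] only on first add, so columns
// that are never written never pay the 64K-array cost.
public class LazyIntList {
    private static final int CHUNK = 64 * 1024;
    private int[] data; // stays null until the first value arrives
    private int size;

    public void add(int value) {
        if (data == null) {
            data = new int[CHUNK]; // first write triggers allocation
        }
        data[size++] = value;      // chunk growth omitted in this sketch
    }

    public boolean isAllocated() { return data != null; }

    public static void main(String[] args) {
        LazyIntList list = new LazyIntList();
        assert !list.isAllocated(); // unused column: no array yet
        list.add(42);
        assert list.isAllocated();
        System.out.println("ok");
    }
}
```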
Repository: parquet-mr
Updated Branches:
refs/heads/parquet-1.8.x c65227886 -> 4297134dc
[maven-release-plugin] prepare for next development iteration
Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/4297134d
Tree
PARQUET-623: Fix DeltaByteArrayReader#skip.
Previously, this passed the skip to the underlying readers, but would
not update previous and would corrupt values or cause exceptions.
Author: Ryan Blue
Closes #366 from rdblue/PARQUET-623-fix-delta-byte-array-skip and squashes the
following commits
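The bug is inherent to prefix/delta encodings: each value depends on the previous one, so skip() must decode and discard rather than jump ahead, or the `previous` state goes stale. A toy model of the idea, not the actual DeltaByteArrayReader (values stored as prefix-length/suffix pairs):

```java
// Toy prefix-encoded reader: each value = previous[0..prefixLen) + suffix.
// The fixed skip() decodes and discards, keeping `previous` consistent.
public class PrefixReader {
    private final int[] prefixLens;
    private final String[] suffixes;
    private int pos;
    private String previous = "";

    PrefixReader(int[] prefixLens, String[] suffixes) {
        this.prefixLens = prefixLens;
        this.suffixes = suffixes;
    }

    public String read() {
        previous = previous.substring(0, prefixLens[pos]) + suffixes[pos];
        pos++;
        return previous;
    }

    // Skipping past the underlying stream without updating `previous`
    // (the old behavior) would corrupt every later value.
    public void skip() {
        read(); // decode and discard
    }

    public static void main(String[] args) {
        PrefixReader r = new PrefixReader(new int[]{0, 3, 3},
                                          new String[]{"car", "t", "go"});
        r.read();                        // "car"
        r.skip();                        // "cart" decoded and discarded
        assert r.read().equals("cargo"); // correct only because skip updated previous
        System.out.println("ok");
    }
}
```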
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-column/src/main/java/org/apache/parquet/io/RecordConsumerLoggingWrapper.java
diff --git a/parquet-column/src/main/java/org/apache/parquet/io/RecordConsumer
PARQUET-783: Close the underlying stream when an H2SeekableInputStream is closed
This PR addresses https://issues.apache.org/jira/browse/PARQUET-783.
`ParquetFileReader` opens a `SeekableInputStream` to read a footer. In the
process, it opens a new `FSDataInputStream` and wraps it. However,
`H2
PARQUET-384: Add dictionary filtering.
This builds on #286 from @danielcweeks and cleans up some of the interfaces. It
introduces `DictionaryPageReadStore` to expose dictionary pages to the filters
and cleans up some internal calls by passing `ParquetFileReader`.
When committed, this closes #28
[maven-release-plugin] prepare release apache-parquet-1.8.2
Project: http://git-wip-us.apache.org/repos/asf/parquet-mr/repo
Commit: http://git-wip-us.apache.org/repos/asf/parquet-mr/commit/c6522788
Tree: http://git-wip-us.apache.org/repos/asf/parquet-mr/tree/c6522788
Diff: http://git-wip-us.apach
Repository: parquet-mr
Updated Tags: refs/tags/apache-parquet-1.8.2 [created] beaf00345
PARQUET-669: allow reading footers from provided file listing and streams
The use case is that I want to reuse existing listing of files and avoid doing
it again when opening streams. This is for cases where filesystem.open is
expensive but you have other means of obtaining an input stream for a file
PARQUET-415: Fix ByteBuffer Binary serialization.
This also adds a test to validate that serialization works for all
Binary objects that are already test cases.
Author: Ryan Blue
Closes #305 from rdblue/PARQUET-415-fix-bytebuffer-binary-serialization and
squashes the following commits:
4e75d5
PARQUET-585: Slowly ramp up sizes of int[]s in IntList to keep sizes small when
data sets are small
One of the follow up items from PR -
https://github.com/apache/parquet-mr/pull/339 was to slowly ramp up the size of
the int[] created in IntList to ensure we don't allocate 64K arrays right off
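The ramp-up can be sketched as a start-small, double-up-to-a-cap growth policy. This is a hypothetical illustration of the strategy, not the actual IntList code; the chunk sizes are assumptions:

```java
import java.util.Arrays;

// Sketch: grow the backing array from a small initial chunk, doubling
// up to a 64K cap, so small data sets allocate small arrays.
public class RampedIntList {
    private static final int INITIAL = 4 * 1024;
    private static final int MAX = 64 * 1024;
    private int[] data = new int[0];
    private int size;

    public void add(int value) {
        if (size == data.length) {
            int next = data.length == 0 ? INITIAL : Math.min(data.length * 2, MAX);
            data = Arrays.copyOf(data, next);
        }
        data[size++] = value;
    }

    public int capacity() { return data.length; }

    public static void main(String[] args) {
        RampedIntList list = new RampedIntList();
        list.add(1);
        assert list.capacity() == 4 * 1024;   // small start, not 64K
        for (int i = 0; i < 5000; i++) list.add(i);
        assert list.capacity() == 8 * 1024;   // doubled once past 4096 entries
        System.out.println("ok");
    }
}
```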
PARQUET-430: Change to use Locale parameterized version of
String.toUpperCase()/toLowerCase
A String is being converted to upper or lowercase using the platform's default
locale. This may result in improper conversions when used with international
characters.
For instance, "TITLE".toLowerCa
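The classic demonstration is the Turkish locale, where lowercasing `I` yields the dotless `ı`, so locale-sensitive case conversion breaks identifier comparisons. Passing an explicit `Locale.ROOT` gives stable results:

```java
import java.util.Locale;

// Demonstrates why String.toLowerCase() needs an explicit Locale:
// under tr-TR, "TITLE" does not lowercase to "title".
public class CaseDemo {
    public static void main(String[] args) {
        String turkish = "TITLE".toLowerCase(new Locale("tr", "TR"));
        String stable = "TITLE".toLowerCase(Locale.ROOT);
        assert !turkish.equals("title"); // I -> dotless ı in Turkish
        assert stable.equals("title");   // Locale.ROOT is locale-independent
        System.out.println(stable);
    }
}
```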
PARQUET-686: Do not return min/max for the wrong order.
Min and max are currently calculated using the default Java ordering
that uses signed comparison for all values. This is not correct for
binary types like strings and decimals or for unsigned numeric types.
This commit prevents statistics acc
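Why signed comparison gives wrong binary min/max can be shown with one byte: treated as a signed value, `0x90` sorts below `0x10`, but lexicographically as unsigned bytes it sorts above. A sketch of an unsigned comparator (illustrative, not the actual statistics code):

```java
// Sketch: compare byte arrays lexicographically as UNSIGNED bytes,
// the correct order for binary types like strings and decimals.
public class UnsignedCompare {
    public static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] lo = { 0x10 };
        byte[] hi = { (byte) 0x90 };        // negative as a signed byte
        assert hi[0] < lo[0];               // signed order: wrong
        assert compareUnsigned(lo, hi) < 0; // unsigned order: right
        System.out.println("ok");
    }
}
```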
PARQUET-571: Fix potential leak in ParquetFileReader.close()
If an exception occurs when closing the input stream `f`, the codecs
will not be released. This may cause native memory leaks for some codecs. /cc
@rdblue
Author: Nezih Yigitbasi
Closes #338 from nezihyigitbasi/leak-fix and squashes
PARQUET-753: Fixed GroupType.union() to handle original type
also fixed GroupType.equals() to compare the original type, and fixed 2 unit
tests that weren't setting the original type properly on the expected results
Author: adeneche
Closes #380 from adeneche/fix-grouptype-union and
PARQUET-612: Add compression codec to FileEncodingsIT.
Author: Ryan Blue
Closes #343 from rdblue/PARQUET-612-test-compression and squashes the following
commits:
a5b7dbb [Ryan Blue] PARQUET-612: Add compression codec to FileEncodingsIT.
Project: http://git-wip-us.apache.org/repos/asf/parquet
PARQUET-743: Fix DictionaryFilter when compressed dictionaries are reused.
BytesInput is not supposed to be held and reused, but decompressed
dictionary pages do this. Reusing the dictionary will cause a failure,
so the cleanest option is to keep the bytes around once the underlying
stream has bee
PARQUET-389: Support predicate push down on missing columns.
Predicate push-down will complain when predicates reference columns that aren't
in a file's schema. This makes it difficult to implement predicate push-down in
engines where schemas evolve because each task needs to process the predica
PARQUET-581: Fix two instances of the conflation of the min and max row
count for page size check in ParquetOutputFormat.java
Author: Michael Allman
Closes #340 from mallman/fix_minmax_conflation and squashes the following
commits:
79331a5 [Michael Allman] PARQUET-581: Fix two instances of th
PARQUET-544: Add closed flag to allow for closeable contract adherence
The closeable interface states:
> Closes this stream and releases any system resources associated with it. If
> the stream is already closed then invoking this method has no effect.
As InternalParquetRecordWriter implements t
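The quoted contract reduces to a guard flag that makes a second close() a no-op. A minimal sketch with illustrative names (not the actual InternalParquetRecordWriter):

```java
import java.io.Closeable;
import java.io.IOException;

// Sketch: a closed flag enforces the Closeable contract quoted above --
// a second close() has no effect, so resources are released exactly once.
public class IdempotentWriter implements Closeable {
    private boolean closed;
    private int flushCount;

    @Override
    public void close() throws IOException {
        if (closed) return; // already closed: no effect
        closed = true;
        flushCount++;       // stand-in for flushing buffered records
    }

    public int flushCount() { return flushCount; }

    public static void main(String[] args) throws IOException {
        IdempotentWriter w = new IdempotentWriter();
        w.close();
        w.close(); // second call is a no-op
        assert w.flushCount() == 1;
        System.out.println("ok");
    }
}
```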
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-thrift/src/main/java/org/apache/parquet/thrift/ParquetWriteProtocol.java
diff --git a/parquet-thrift/src/main/java/org/apache/parquet/thrift/ParquetWritePr
PARQUET-685 - Deprecated ParquetInputSplit constructor passes paramet…
The problem was not discovered because the test was buggy. Updated both sides.
Author: Gabor Szadovszky
Closes #372 from gszadovszky/PARQUET-685 and squashes the following commits:
9cbeee2 [Gabor Szadovszky] PARQUET-685
PARQUET-528: Fix flush() for RecordConsumer and implementations
`flush()` was added in `RecordConsumer` and `MessageColumnIO` to help
implementing nulls caching.
However, other `RecordConsumer` implementations should also implement
`flush()` properly. For instance, `RecordConsumerLoggingWrappe
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/8e2009b8/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
diff --git a/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter
PARQUET-529: Avoid evoking job.toString() in ParquetLoader
When run in a hadoop2 environment with the log level set to `DEBUG`,
ParquetLoader would invoke `job.toString()` in several methods, which might
cause the whole application to stop due to:
```
java.lang.IllegalStateException: Job in sta
```
PARQUET-674: Add InputFile abstraction for openable files.
Author: Ryan Blue
Closes #368 from rdblue/PARQUET-674-add-data-source and squashes the following
commits:
8c689e9 [Ryan Blue] PARQUET-674: Implement review comments.
4a7c327 [Ryan Blue] PARQUET-674: Add DataSource abstraction for opena
PARQUET-645: Fix null handling in DictionaryFilter.
This fixes how null is handled by `DictionaryFilter` for equals predicates.
Null is never in the dictionary and is encoded by the definition level, so the
`DictionaryFilter` would never find the value in the dictionary and would
incorrectly fi
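The key fact is that a dictionary only contains non-null values (nulls are encoded in definition levels), so an equals-null predicate must never be decided from the dictionary alone. A hypothetical sketch of the guard (illustrative names, not the real DictionaryFilter):

```java
import java.util.Set;

// Sketch: a row group may be dropped for eq(v) only when v is non-null
// and provably absent from the dictionary; eq(null) cannot be decided
// here because nulls never appear in the dictionary.
public class DictFilterSketch {
    public static boolean canDrop(Object eqValue, Set<Object> dictionary) {
        if (eqValue == null) {
            return false; // nulls live in definition levels; keep the row group
        }
        return !dictionary.contains(eqValue);
    }

    public static void main(String[] args) {
        Set<Object> dict = Set.of("a");
        assert !canDrop(null, dict); // the bug: old code would drop this
        assert canDrop("b", dict);   // value provably absent: safe to drop
        assert !canDrop("a", dict);  // value present: must read
        System.out.println("ok");
    }
}
```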
PARQUET-569: Separate metadata filtering for ranges and offsets.
Range filtering should use the row group midpoint and offset filtering
should use the start offset.
Author: Ryan Blue
Closes #337 from rdblue/PARQUET-569-fix-metadata-filter and squashes the
following commits:
6171af4 [Ryan Blue
PARQUET-654: Add option to disable record-level filtering.
This can be used by frameworks that use codegen for filtering to avoid
running filters within Parquet.
Author: Ryan Blue
Closes #353 from rdblue/PARQUET-654-add-record-level-filter-option and squashes
the following commits:
b497e7f [R
PARQUET-642: Improve performance of ByteBuffer based read / write paths
While trying out the newest Parquet version, we noticed that the changes to
start using ByteBuffers:
https://github.com/apache/parquet-mr/commit/6b605a4ea05b66e1a6bf843353abcb4834a4ced8
and
https://github.com/apache/parque
PARQUET-382: Add methods to append encoded data to files.
This allows appending encoded data blocks to open ParquetFileWriters,
which makes it possible to merge multiple Parquet files without
re-encoding all of the records.
This works by finding the column chunk for each column in the file
schema
PARQUET-651: Improve Avro's isElementType check.
The Avro implementation needs to check whether the read schema that is
passed by the user (or automatically converted from the file schema)
expects an extra 1-field layer to be returned, which matches the
previous behavior of Avro when reading a 3-l
PARQUET-726: Increase max difference of testMemoryManagerUpperLimit to 10%
Author: Niels Basjes
Closes #370 from nielsbasjes/PARQUET-726 and squashes the following commits:
f385ede [Niels Basjes] PARQUET-726: Increase max difference of
testMemoryManagerUpperLimit to 10%
Project: http://git-w
http://git-wip-us.apache.org/repos/asf/parquet-mr/blob/36e14294/parquet-avro/src/test/java/org/apache/parquet/avro/TestCircularReferences.java
diff --git a/parquet-avro/src/test/java/org/apache/parquet/avro/TestCircularReferenc
Repository: parquet-mr
Updated Branches:
refs/heads/parquet-1.8.x [created] c65227886
PARQUET-422: Fix a potential bug in MessageTypeParser where we ignore…
… and overwrite the initial value of a method parameter
In org.apache.parquet.schema.MessageTypeParser, for addGroupType() and
addP
PARQUET-495: Fix mismatches in Types class comments
To produce
> required group User {
required int64 id;
**optional** binary email (UTF8);
}
we should do:
>
Types.requiredGroup()
.required(INT64).named("id")
.~~**required** (BINARY).as(UTF8).named("email")~~
.**optiona