[ https://issues.apache.org/jira/browse/ORC-426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669373#comment-16669373 ]
ASF GitHub Bot commented on ORC-426:
------------------------------------

omalley closed pull request #329: ORC-426: Fix errors in ORC specification.
URL: https://github.com/apache/orc/pull/329

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 32ce14a151..b4fea4e81b 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -725,7 +725,7 @@ DIRECT | PRESENT | Yes | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding | Stream Kind | Optional | Contents
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index fb90c8353c..5dbd3d027f 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -581,8 +581,6 @@ the index values and the additional value bits.
   bit is set, the entire value is negated.
 * Data values (W * L bits padded to the byte) - A sequence of W bit positive
   values that are added to the base value.
-* Data values (W * L bits padded to the byte) - A sequence of W bit positive
-  values that are added to the base value.
 * Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
   that didn't fit within W bits. Each entry in the list consists of a
   gap, which is the number of elements skipped from the previous
@@ -899,7 +897,7 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding | Stream Kind | Optional | Contents
@@ -978,7 +976,7 @@ group (default to 10,000 rows) in a column. Only the row groups that
 satisfy min/max row index evaluation will be evaluated against the
 bloom filter index.
 
-Each BloomFilterEntry stores the number of hash functions ('k') used
+Each bloom filter entry stores the number of hash functions ('k') used
 and the bitset backing the bloom filter. The original encoding (pre
 ORC-101) of bloom filters used the bitset field encoded as a repeating
 sequence of longs in the bitset field with a little endian encoding
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 76ee571f0e..d91139c0fe 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -601,8 +601,6 @@ the index values and the additional value bits.
   bit is set, the entire value is negated.
 * Data values (W * L bits padded to the byte) - A sequence of W bit positive
   values that are added to the base value.
-* Data values (W * L bits padded to the byte) - A sequence of W bit positive
-  values that are added to the base value.
 * Patch list (PLL * (PGW + PW) bytes) - A list of patches for values
   that didn't fit within W bits. Each entry in the list consists of a
   gap, which is the number of elements skipped from the previous
@@ -916,7 +914,7 @@ DIRECT_V2 | PRESENT | Yes | Boolean RLE
 ## Map Columns
 
 Maps are encoded as the PRESENT stream and a length stream with number
-of items in each list. They have a child column for the key and
+of items in each map. They have a child column for the key and
 another child column for the value.
 
 Encoding | Stream Kind | Optional | Contents
@@ -995,7 +993,7 @@ group (default to 10,000 rows) in a column. Only the row groups that
 satisfy min/max row index evaluation will be evaluated against the
 bloom filter index.
 
-Each BloomFilterEntry stores the number of hash functions ('k') used
+Each bloom filter entry stores the number of hash functions ('k') used
 and the bitset backing the bloom filter. The original encoding (pre
 ORC-101) of bloom filters used the bitset field encoded as a repeating
 sequence of longs in the bitset field with a little endian encoding

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Errors in ORC Specification
> ---------------------------
>
>                 Key: ORC-426
>                 URL: https://issues.apache.org/jira/browse/ORC-426
>             Project: ORC
>          Issue Type: Bug
>          Components: documentation
>            Reporter: Fang Zheng
>            Priority: Minor
>              Labels: documentation
>
> There are some errors in the ORC format specifications:
> 1. In specification/ORCv1.md and specification/ORCv2.md, the following
> sentence appears twice in the description of “Patched Base”:
> Data values (W * L bits padded to the byte) - A sequence of W bit positive
> values that are added to the base value.
> 2. In specification/ORCv0.md, specification/ORCv1.md, and
> specification/ORCv2.md, there is an error in the description of “Map Columns”:
> Maps are encoded as the PRESENT stream and a length stream with number
> of items in each list. --> The last word “list” should be changed to “map”
> 3. In specification/ORCv1.md and specification/ORCv2.md, the word
> “BloomFilterEntry” should be changed to “bloom filter entry”, as
> “BloomFilterEntry” does not exist in the source code or ProtocolBuffer
> definition.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)