ORC-2 Create a website for the project (omalley)
Project: http://git-wip-us.apache.org/repos/asf/orc/repo
Commit: http://git-wip-us.apache.org/repos/asf/orc/commit/ad182311
Tree: http://git-wip-us.apache.org/repos/asf/orc/tree/ad182311
Diff: http://git-wip-us.apache.org/repos/asf/orc/diff/ad182311

Branch: refs/heads/master
Commit: ad182311ac6fab3c9ca6dbbd72aae440d84cbe9a
Parents: 30e8e73
Author: Owen O'Malley <[email protected]>
Authored: Tue May 12 20:40:31 2015 -0700
Committer: Owen O'Malley <[email protected]>
Committed: Tue May 12 20:40:31 2015 -0700

----------------------------------------------------------------------
 site/_config.yml                           |   2 +
 site/_data/docs.yml                        |   5 +-
 site/_docs/acid.md                         | 117 ++++++++-
 site/_docs/compression.md                  |  27 +-
 site/_docs/encodings.md                    | 203 +++++++++++++++
 site/_docs/file-tail.md                    | 313 +++++++++++++++++++++++-
 site/_docs/hive-config.md                  | 177 ++++++++++++--
 site/_docs/hive-ddl.md                     |   2 +-
 site/_docs/index.md                        |  34 ++-
 site/_docs/indexes.md                      |  27 +-
 site/_docs/run-length.md                   | 230 ++++++++++++++++-
 site/_docs/self-contained.md               |   7 -
 site/_docs/spec-index.md                   | 111 +++++++++
 site/_docs/spec-intro.md                   |  21 +-
 site/_docs/stripes.md                      |  82 ++++++-
 site/_docs/types.md                        |  58 ++++-
 site/_includes/docs_option.html            |   2 +-
 site/_posts/2015-04-22-orc-as-tlp.markdown |  10 -
 site/_posts/2015-04-22-orc-as-tlp.md       |  42 ++++
 site/_posts/2015-05-11-new-committers.md   |  18 ++
 site/_sass/_style.scss                     |   4 +
 site/develop/index.md                      |  21 +-
 site/doap_orc.rdf                          |  55 +++++
 site/downloads/index.md                    |   7 +
 site/img/BloomFilter.png                   | Bin 0 -> 61887 bytes
 site/img/CompressionStream.png             | Bin 0 -> 91623 bytes
 site/img/OrcFileLayout.png                 | Bin 0 -> 127908 bytes
 site/img/TreeWriters.png                   | Bin 0 -> 134465 bytes
 site/news/index.html                       |   4 +-
 site/talks/index.md                        |   9 +
 30 files changed, 1523 insertions(+), 65 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_config.yml
----------------------------------------------------------------------
diff
--git a/site/_config.yml b/site/_config.yml index ecba4e8..933b27e 100644 --- a/site/_config.yml +++ b/site/_config.yml @@ -5,6 +5,8 @@ excerpt_separator: "" repository: https://github.com/apache/orc destination: target +exclude: [README.md,Gemfile*] +keep_files: [.git] collections: docs: http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_data/docs.yml ---------------------------------------------------------------------- diff --git a/site/_data/docs.yml b/site/_data/docs.yml index 9c1e3d6..5e2dfe0 100644 --- a/site/_data/docs.yml +++ b/site/_data/docs.yml @@ -1,9 +1,8 @@ - title: Overview docs: - index - - self-contained - - indexes - types + - indexes - acid - title: Hive Usage @@ -18,3 +17,5 @@ - compression - run-length - stripes + - encodings + - spec-index \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/acid.md ---------------------------------------------------------------------- diff --git a/site/_docs/acid.md b/site/_docs/acid.md index 202398d..a79d38a 100644 --- a/site/_docs/acid.md +++ b/site/_docs/acid.md @@ -4,4 +4,119 @@ title: ACID support permalink: /docs/acid.html --- -ACID support body \ No newline at end of file +Historically, the only way to atomically add data to a table in Hive +was to add a new partition. Updating or deleting data in a partition +required removing the old partition and adding it back with the new +data, and this could not be done atomically. + +However, users' data is continually changing and as Hive matured, +users required reliability guarantees despite the churning data +lake. Thus, we needed to implement ACID transactions that guarantee +atomicity, consistency, isolation, and durability. Although we support +ACID transactions, they are not designed to support OLTP requirements. +They can support millions of rows updated per transaction, but they +cannot support millions of transactions an hour.
+ +Additionally, we wanted to support streaming ingest into Hive tables, where +streaming applications like Flume or Storm could write data into Hive and +have transactions commit once a minute, and queries would see either all of +a transaction or none of it. + +HDFS is a write-once file system and ORC is a write-once file format, so edits +were implemented using base files and delta files where insert, update, and +delete operations are recorded. + +In Hive tables without ACID enabled, each partition in HDFS looks like: + +Filename | Contents +:------- | :-------- +00000_0 | Bucket 0 +00001_0 | Bucket 1 + +With ACID enabled, the system will add delta directories: + +Filename | Contents +:------- | :-------- +00000_0 | Bucket 0 base +00001_0 | Bucket 1 base +delta_0000005_0000005/bucket_00000 | Transaction 5 to 5, bucket 0 delta +delta_0000005_0000005/bucket_00001 | Transaction 5 to 5, bucket 1 delta +When too many deltas have been created, a minor compaction will automatically +run and merge a set of transactions into a single delta: + +Filename | Contents +:------- | :-------- +00000_0 | Bucket 0 base +00001_0 | Bucket 1 base +delta_0000005_0000010/bucket_00000 | Transaction 5 to 10, bucket 0 delta +delta_0000005_0000010/bucket_00001 | Transaction 5 to 10, bucket 1 delta + +When the deltas get large enough, major compaction will re-write the base +to incorporate the deltas. + +Filename | Contents +:------- | :-------- +base_0000010/bucket_00000 | Transactions up to 10, bucket 0 base +base_0000010/bucket_00001 | Transactions up to 10, bucket 1 base + +Reads and compactions do not require locks, and thus compactions +cannot destructively modify their inputs; instead they write new +directories. + +All rows are given an automatically assigned row id: the triple of +original transaction id, bucket, and row id, which is guaranteed to be unique. +All update and delete operations refer to that triple. + +The ORC files in an ACID table are extended with several columns.
They +are the operation (insert, update, or delete), the triple that +uniquely identifies the row (originalTransaction, bucket, rowId), and +the current transaction. + +```struct< + operation: int, + originalTransaction: bigInt, + bucket: int, + rowId: bigInt, + currentTransaction: bigInt, + row: struct<...> +> +``` + +The serialization for the operation codes is: + +Operation | Serialization +:-------- | :------------ +INSERT | 0 +UPDATE | 1 +DELETE | 2 + +When an application or query reads the ACID table, the reader provides +the list of committed transactions to include. This list is produced +by the Hive metastore when a query starts. The task does a merge +sort. Each of the files is sorted by (originalTransaction ascending, +bucket ascending, rowId ascending, and currentTransaction +descending). Only the first record with a currentTransaction that is +in the list of transactions to read is returned, which corresponds to +the last visible update to a row. + +To support streaming ingest, we add two additional features. ORC files +may have additional footers written into their body, each parsable +as a complete ORC file that includes only the records already +written. As the file is later extended, the preliminary file footer +becomes dead space within the file. Secondly, a small side file named +"*_flush_length" contains a set of 8-byte +values. The last complete 8-byte value is the end of the last +preliminary footer. + +Two properties are added to the metadata for ORC files to speed up the +processing of the ACID tables. In particular, when a task is reading +part of the base file for a bucket, it will use the first and last +rowIds to find the corresponding spots in the delta files. The +hive.acid.key.index lets the reader skip over stripes in the delta +file that don't need to be read in this task.
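
The read-side merge sort described above can be sketched as follows. This is a hypothetical illustration, not the actual Hive/ORC reader code; the tuple layout and function name are invented for the sketch.

```python
# Sketch of the ACID read rule: rows arrive already merge-sorted by
# (originalTransaction asc, bucket asc, rowId asc, currentTransaction desc),
# and only the first visible version of each row is returned.
DELETE = 2  # operation code from the table above

def merge_acid_rows(rows, committed):
    """rows: iterable of (originalTransaction, bucket, rowId,
    currentTransaction, operation, data) tuples, already merge-sorted.
    committed: set of transaction ids visible to this query."""
    last_key = None
    for orig_txn, bucket, row_id, cur_txn, operation, data in rows:
        key = (orig_txn, bucket, row_id)
        if key == last_key:
            continue  # a newer visible version of this row was already handled
        if cur_txn not in committed:
            continue  # written by a transaction this query cannot see
        last_key = key
        if operation != DELETE:  # deletes suppress the row entirely
            yield data

rows = [
    (5, 0, 0, 7, 1, "updated"),   # update of row (5, 0, 0) in transaction 7
    (5, 0, 0, 5, 0, "original"),  # original insert in transaction 5
]
print(list(merge_acid_rows(rows, committed={5, 7})))  # ['updated']
print(list(merge_acid_rows(rows, committed={5})))     # ['original']
```

A reader that has not yet seen transaction 7 committed still observes the row as inserted by transaction 5, which is exactly the snapshot-isolation behavior the merge order is designed to provide.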
+ +Key | Meaning +:------------------ | :----------- +hive.acid.stats | Comma-separated counts of inserts, updates, and deletes +hive.acid.key.index | The last originalTransaction, bucket, rowId for each stripe \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/compression.md ---------------------------------------------------------------------- diff --git a/site/_docs/compression.md b/site/_docs/compression.md index 5756973..62cc199 100644 --- a/site/_docs/compression.md +++ b/site/_docs/compression.md @@ -4,4 +4,29 @@ title: Compression permalink: /docs/compression.html --- -Compression body \ No newline at end of file +If the ORC file writer selects a generic compression codec (zlib or +snappy), every part of the ORC file except for the Postscript is +compressed with that codec. However, one of the requirements for ORC +is that the reader be able to skip over compressed bytes without +decompressing the entire stream. To manage this, ORC writes compressed +streams in chunks with headers as in the figure below. +To handle incompressible data, if the compressed data is larger than +the original, the original is stored and the isOriginal flag is +set. Each header is 3 bytes long with (compressedLength * 2 + +isOriginal) stored as a little-endian value. For example, the header +for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d, +0x03]. The header for 5 bytes that did not compress would be [0x0b, +0x00, 0x00]. Each compression chunk is compressed independently so +that as long as a decompressor starts at the top of a header, it can +start decompressing without the previous bytes. + + + +The default compression chunk size is 256K, but writers can choose +their own value less than 2^23 (8,388,608 bytes). Larger chunks lead to better +compression, but require more memory. The chunk size is recorded in +the Postscript so that readers can allocate appropriately sized +buffers.
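
The header arithmetic above is easy to check with a short sketch (the helper function is hypothetical, not part of any ORC library):

```python
def chunk_header(compressed_length, is_original):
    """Build the 3-byte ORC compression chunk header:
    (compressedLength * 2 + isOriginal) stored little-endian."""
    value = compressed_length * 2 + (1 if is_original else 0)
    return list(value.to_bytes(3, "little"))

# A chunk that compressed to 100,000 bytes:
print(chunk_header(100_000, False))  # [64, 13, 3] == [0x40, 0x0d, 0x03]
# 5 bytes stored uncompressed (isOriginal set):
print(chunk_header(5, True))         # [11, 0, 0] == [0x0b, 0x00, 0x00]
```

Note that the low bit of the first byte carries the isOriginal flag, which is why the length is doubled before the flag is added.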
+ +ORC files without generic compression write each stream directly +with no headers. http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/encodings.md ---------------------------------------------------------------------- diff --git a/site/_docs/encodings.md b/site/_docs/encodings.md new file mode 100644 index 0000000..09c7ca3 --- /dev/null +++ b/site/_docs/encodings.md @@ -0,0 +1,203 @@ +--- +layout: docs +title: Column Encodings +permalink: /docs/encodings.html +--- + +## SmallInt, Int, and BigInt Columns + +All of the 16-, 32-, and 64-bit integer column types use the same set of +potential encodings, which differ only in whether they use RLE v1 or +v2. If the PRESENT stream is not included, all of the values are +present. For values that have false bits in the present stream, no +values are included in the data stream. + +Encoding | Stream Kind | Optional | Contents +:-------- | :---------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v2 + +## Float and Double Columns + +Floating point types are stored using IEEE 754 floating point bit +layout. Float columns use 4 bytes per value and double columns use 8 +bytes. + +Encoding | Stream Kind | Optional | Contents +:-------- | :---------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | IEEE 754 floating point representation + +## String, Char, and VarChar Columns + +String columns are adaptively encoded based on whether the first +10,000 values are sufficiently distinct. In all of the encodings, the +PRESENT stream encodes whether the value is null. + +For direct encoding, the UTF-8 bytes are saved in the DATA stream and +the length of each value is written into the LENGTH stream. In direct +encoding, if the values were ["Nevada", "California"], the DATA +would be "NevadaCalifornia" and the LENGTH would be [6, 10].
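
The direct string encoding can be sketched in a few lines; this is an illustration of the stream contents only (before integer RLE is applied to the lengths), and the function name is invented for the sketch:

```python
def direct_encode(values):
    """DATA holds the concatenated UTF-8 bytes of all values;
    LENGTH holds each value's byte length."""
    data = b"".join(v.encode("utf-8") for v in values)
    lengths = [len(v.encode("utf-8")) for v in values]
    return data, lengths

data, lengths = direct_encode(["Nevada", "California"])
print(data)     # b'NevadaCalifornia'
print(lengths)  # [6, 10]
```

Because the lengths are stored separately, a reader can reconstruct value boundaries without scanning the DATA bytes for delimiters.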
+ +For dictionary encodings, the dictionary is sorted and the UTF-8 bytes of +each unique value are placed into DICTIONARY_DATA. The length of each +item in the dictionary is put into the LENGTH stream. The DATA stream +consists of the sequence of references to the dictionary elements. + +In dictionary encoding, if the values were ["Nevada", +"California", "Nevada", "California", "Florida"], the +DICTIONARY_DATA would be "CaliforniaFloridaNevada" and LENGTH would +be [10, 7, 6]. The DATA would be [2, 0, 2, 0, 1]. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v1 +DICTIONARY | PRESENT | Yes | Boolean RLE + | DATA | No | Unsigned Integer RLE v1 + | DICTIONARY_DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v2 +DICTIONARY_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | Unsigned Integer RLE v2 + | DICTIONARY_DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v2 + +## Boolean Columns + +Boolean columns are rare, but have a simple encoding. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Boolean RLE + +## TinyInt Columns + +TinyInt (byte) columns use byte run length encoding. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Byte RLE + +## Binary Columns + +Binary data is encoded with a PRESENT stream, a DATA stream that records +the contents, and a LENGTH stream that records the number of bytes per +value.
+ +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | String contents + | LENGTH | No | Unsigned Integer RLE v2 + +## Decimal Columns + +Decimal was introduced in Hive 0.11 with infinite precision (the total +number of digits). In Hive 0.13, the definition was changed to limit +the precision to a maximum of 38 digits, which conveniently uses 127 +bits plus a sign bit. The current encoding of decimal columns stores +the integer representation of the value as an unbounded-length, +zigzag-encoded base 128 varint. The scale is stored in the SECONDARY stream +as an unsigned integer. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Unbounded base 128 varints + | SECONDARY | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | Unbounded base 128 varints + | SECONDARY | No | Unsigned Integer RLE v2 + +## Date Columns + +Date data is encoded with a PRESENT stream and a DATA stream that records +the number of days after January 1, 1970 in UTC. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v2 + +## Timestamp Columns + +Timestamp records times down to nanoseconds as a PRESENT stream that +records non-null values, a DATA stream that records the number of +seconds after 1 January 2015, and a SECONDARY stream that records the +number of nanoseconds.
+ +Because the number of nanoseconds often has a large number of trailing +zeros, the number has trailing decimal zero digits removed and the +last three bits are used to record how many zeros were removed. Thus +1000 nanoseconds would be serialized as 0x0a and 100000 would be +serialized as 0x0c. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v1 + | SECONDARY | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | DATA | No | Signed Integer RLE v2 + | SECONDARY | No | Unsigned Integer RLE v2 + +## Struct Columns + +Structs have no data themselves and delegate everything to their child +columns except for their PRESENT stream. They have a child column +for each of the fields. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + +## List Columns + +Lists are encoded as the PRESENT stream and a LENGTH stream with the +number of items in each list. They have a single child column for the +element values. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | LENGTH | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | LENGTH | No | Unsigned Integer RLE v2 + +## Map Columns + +Maps are encoded as the PRESENT stream and a LENGTH stream with the number +of items in each map. They have a child column for the key and +another child column for the value.
+ +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | LENGTH | No | Unsigned Integer RLE v1 +DIRECT_V2 | PRESENT | Yes | Boolean RLE + | LENGTH | No | Unsigned Integer RLE v2 + +## Union Columns + +Unions are encoded as the PRESENT stream and a tag stream that controls which +potential variant is used. They have a child column for each variant of the +union. Currently ORC union types are limited to 256 variants, which matches +the Hive type model. + +Encoding | Stream Kind | Optional | Contents +:------------ | :-------------- | :------- | :------- +DIRECT | PRESENT | Yes | Boolean RLE + | DATA | No | Byte RLE http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/file-tail.md ---------------------------------------------------------------------- diff --git a/site/_docs/file-tail.md b/site/_docs/file-tail.md index 75a4ea0..f636a3a 100644 --- a/site/_docs/file-tail.md +++ b/site/_docs/file-tail.md @@ -4,4 +4,315 @@ title: File Tail permalink: /docs/file-tail.html --- -File tail body \ No newline at end of file +Since HDFS does not support changing the data in a file after it is +written, ORC stores the top-level index at the end of the file. The +overall structure of the file is given in the figure above. The +file's tail consists of three parts: the file metadata, file footer, and +postscript. + +The metadata for ORC is stored using +[Protocol Buffers](http://s.apache.org/protobuf_encoding), which provides +the ability to add new fields without breaking readers.
This document +incorporates the Protobuf definition from the +[ORC source code](http://s.apache.org/orc_proto), and the +reader is encouraged to review the Protobuf encoding if they need to +understand the byte-level encoding. + +# Postscript + +The Postscript section provides the necessary information to interpret +the rest of the file, including the length of the file's Footer and +Metadata sections, the version of the file, and the kind of general +compression used (e.g. none, zlib, or snappy). The Postscript is never +compressed and ends one byte before the end of the file. The version +stored in the Postscript is the lowest version of Hive that is +guaranteed to be able to read the file, and it is stored as a sequence of +the major and minor versions. There are currently two versions that are +used: [0,11] for Hive 0.11, and [0,12] for Hive 0.12 or later. + +The process of reading an ORC file works backwards through the +file. Rather than making multiple short reads, the ORC reader reads +the last 16k bytes of the file with the hope that it will contain both +the Footer and Postscript sections. The final byte of the file +contains the serialized length of the Postscript, which must be less +than 256 bytes. Once the Postscript is parsed, the compressed +serialized length of the Footer is known and it can be decompressed +and parsed.
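
The backwards read described above can be sketched as follows. This is a hypothetical illustration only (not the real ORC reader, and protobuf parsing is elided); it shows how the one-byte Postscript length at the very end of the file bootstraps everything else:

```python
import os

def read_tail(path, guess=16 * 1024):
    """Read the last `guess` bytes and locate the raw Postscript bytes."""
    with open(path, "rb") as f:
        size = os.path.getsize(path)
        f.seek(max(0, size - guess))
        tail = f.read()
    ps_len = tail[-1]                 # final byte: Postscript length (< 256)
    postscript = tail[-1 - ps_len:-1]  # the Postscript is never compressed
    # Parsing `postscript` as the protobuf PostScript message would yield
    # footerLength, compression kind, and compressionBlockSize, which in
    # turn locate and decode the Footer.
    return ps_len, postscript
```

If the Footer happens to extend beyond the initial 16K guess, a real reader would issue one more read of exactly the missing range; the sketch omits that case.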
+ +```message PostScript { + // the length of the footer section in bytes + optional uint64 footerLength = 1; + // the kind of generic compression used + optional CompressionKind compression = 2; + // the maximum size of each compression chunk + optional uint64 compressionBlockSize = 3; + // the version of the writer + repeated uint32 version = 4 [packed = true]; + // the length of the metadata section in bytes + optional uint64 metadataLength = 5; + // the fixed string "ORC" + optional string magic = 8000; +} +``` + +```enum CompressionKind { + NONE = 0; + ZLIB = 1; + SNAPPY = 2; + LZO = 3; +} +``` + +# Footer + +The Footer section contains the layout of the body of the file, the +type schema information, the number of rows, and the statistics about +each of the columns. + +The file is broken into three parts: the Header, Body, and Tail. The +Header consists of the bytes "ORC" to support tools that want to +scan the front of the file to determine the type of the file. The Body +contains the rows and indexes, and the Tail gives the file-level +information as described in this section. + +```message Footer { + // the length of the file header in bytes (always 3) + optional uint64 headerLength = 1; + // the length of the file header and body in bytes + optional uint64 contentLength = 2; + // the information about the stripes + repeated StripeInformation stripes = 3; + // the schema information + repeated Type types = 4; + // the user metadata that was added + repeated UserMetadataItem metadata = 5; + // the total number of rows in the file + optional uint64 numberOfRows = 6; + // the statistics of each column across the file + repeated ColumnStatistics statistics = 7; + // the maximum number of rows in each index entry + optional uint32 rowIndexStride = 8; +} +``` + +## Stripe Information + +The body of the file is divided into stripes. Each stripe is self-contained +and may be read using only its own bytes combined with the +file's Footer and Postscript.
Each stripe contains only entire rows so +that rows never straddle stripe boundaries. Stripes have three +sections: a set of indexes for the rows within the stripe, the data +itself, and a stripe footer. Both the indexes and the data sections +are divided by columns so that only the data for the required columns +needs to be read. + +```message StripeInformation { + // the start of the stripe within the file + optional uint64 offset = 1; + // the length of the indexes in bytes + optional uint64 indexLength = 2; + // the length of the data in bytes + optional uint64 dataLength = 3; + // the length of the footer in bytes + optional uint64 footerLength = 4; + // the number of rows in the stripe + optional uint64 numberOfRows = 5; +} +``` + +## Type Information + +All of the rows in an ORC file must have the same schema. Logically +the schema is expressed as a tree as in the figure below, where +the compound types have subcolumns under them. + + + +The equivalent Hive DDL would be: + +```create table Foobar ( + myInt int, + myMap map<string, + struct<myString : string, + myDouble: double>>, + myTime timestamp +); +``` + +The type tree is flattened into a list via a pre-order traversal +where each type is assigned the next id. Clearly, the root of the type +tree is always type id 0. Compound types have a field named subtypes +that contains the list of their children's type ids.
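
The pre-order id assignment can be illustrated with a small sketch applied to the Foobar schema above. The tuple representation and function name here are invented for the illustration; they are not the ORC API.

```python
def assign_ids(node, ids=None, next_id=None):
    """node: (kind, children) tuple. Walk the type tree pre-order,
    handing out the next id at each node; returns {id: kind}."""
    if ids is None:
        ids, next_id = {}, [0]
    kind, children = node
    ids[next_id[0]] = kind
    next_id[0] += 1
    for child in children:
        assign_ids(child, ids, next_id)
    return ids

foobar = ("struct", [
    ("int", []),                                  # myInt
    ("map", [("string", []),                      # map key
             ("struct", [("string", []),          # myString
                         ("double", [])])]),      # myDouble
    ("timestamp", []),                            # myTime
])
print(assign_ids(foobar))
# {0: 'struct', 1: 'int', 2: 'map', 3: 'string', 4: 'struct',
#  5: 'string', 6: 'double', 7: 'timestamp'}
```

Note how the root struct takes id 0 and the map's entire subtree (ids 3 through 6) is numbered before the traversal moves on to myTime at id 7.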
+ +```message Type { + enum Kind { + BOOLEAN = 0; + BYTE = 1; + SHORT = 2; + INT = 3; + LONG = 4; + FLOAT = 5; + DOUBLE = 6; + STRING = 7; + BINARY = 8; + TIMESTAMP = 9; + LIST = 10; + MAP = 11; + STRUCT = 12; + UNION = 13; + DECIMAL = 14; + DATE = 15; + VARCHAR = 16; + CHAR = 17; + } + // the kind of this type + required Kind kind = 1; + // the type ids of any subcolumns for list, map, struct, or union + repeated uint32 subtypes = 2 [packed=true]; + // the list of field names for struct + repeated string fieldNames = 3; + // the maximum length of the type for varchar or char + optional uint32 maximumLength = 4; + // the precision and scale for decimal + optional uint32 precision = 5; + optional uint32 scale = 6; +} +``` + +## Column Statistics + +For each column, the writer records the count of values and, depending +on the type, other useful fields. For +most of the primitive types, it records the minimum and maximum +values, and for numeric types it additionally stores the sum. +From Hive 1.1.0 onwards, the column statistics will also record if +there are any null values within the row group by setting the hasNull flag. +The hasNull flag is used by ORC's predicate pushdown to better answer +'IS NULL' queries. + +```message ColumnStatistics { + // the number of values + optional uint64 numberOfValues = 1; + // At most one of these has a value for any column + optional IntegerStatistics intStatistics = 2; + optional DoubleStatistics doubleStatistics = 3; + optional StringStatistics stringStatistics = 4; + optional BucketStatistics bucketStatistics = 5; + optional DecimalStatistics decimalStatistics = 6; + optional DateStatistics dateStatistics = 7; + optional BinaryStatistics binaryStatistics = 8; + optional TimestampStatistics timestampStatistics = 9; + optional bool hasNull = 10; +} +``` + +For integer types (tinyint, smallint, int, bigint), the column +statistics include the minimum, maximum, and sum.
If the sum +overflows long at any point during the calculation, no sum is +recorded. + +```message IntegerStatistics { + optional sint64 minimum = 1; + optional sint64 maximum = 2; + optional sint64 sum = 3; +} +``` + +For floating point types (float, double), the column statistics +include the minimum, maximum, and sum. If the sum overflows a double, +no sum is recorded. + +```message DoubleStatistics { + optional double minimum = 1; + optional double maximum = 2; + optional double sum = 3; +} +``` + +For strings, the minimum value, maximum value, and the sum of the +lengths of the values are recorded. + +```message StringStatistics { + optional string minimum = 1; + optional string maximum = 2; + // sum will store the total length of all strings + optional sint64 sum = 3; +} +``` + +For booleans, the statistics include the count of false and true values. + +```message BucketStatistics { + repeated uint64 count = 1 [packed=true]; +} +``` + +For decimals, the minimum, maximum, and sum are stored. + +```message DecimalStatistics { + optional string minimum = 1; + optional string maximum = 2; + optional string sum = 3; +} +``` + +Date columns record the minimum and maximum values as the number of +days since the epoch (1/1/1970). + +```message DateStatistics { + // min,max values saved as days since epoch + optional sint32 minimum = 1; + optional sint32 maximum = 2; +} +``` + +Timestamp columns record the minimum and maximum values as the number of +milliseconds since the epoch (1/1/1970). + +```message TimestampStatistics { + // min,max values saved as milliseconds since epoch + optional sint64 minimum = 1; + optional sint64 maximum = 2; +} +``` + +Binary columns store the aggregate number of bytes across all of the values. + +```message BinaryStatistics { + // sum will store the total binary blob length + optional sint64 sum = 1; +} +``` + +## User Metadata + +The user can add arbitrary key/value pairs to an ORC file as it is +written.
The contents of the keys and values are completely +application-defined, but the key is a string and the value is +binary. Applications should take care to keep their +keys unique, in general by prefixing them with an organization +code. + +```message UserMetadataItem { + // the user defined key + required string name = 1; + // the user defined binary value + required bytes value = 2; +} +``` + +## File Metadata + +The file Metadata section contains column statistics at stripe-level +granularity. These statistics enable input split elimination +based on the predicate push-down evaluated per stripe. + +```message StripeStatistics { + repeated ColumnStatistics colStats = 1; +} +``` + +```message Metadata { + repeated StripeStatistics stripeStats = 1; +} +``` \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/hive-config.md ---------------------------------------------------------------------- diff --git a/site/_docs/hive-config.md b/site/_docs/hive-config.md index 7dc4cf0..8ff2e96 100644 --- a/site/_docs/hive-config.md +++ b/site/_docs/hive-config.md @@ -12,11 +12,11 @@ with the same options. Key | Default | Notes :----------------------- | :---------- | :------------------------ -orc.compress | ZLIB | compression {NONE, ZLIB, SNAPPY} +orc.compress | ZLIB | high level compression = {NONE, ZLIB, SNAPPY} orc.compress.size | 262,144 | compression chunk size -orc.stripe.size | 268,435,456 | memory buffer size for writing +orc.stripe.size | 268,435,456 | memory buffer size in bytes for writing orc.row.index.stride | 10,000 | number of rows between index entries -orc.create.index | true | should indexes be created +orc.create.index | true | create indexes?
orc.bloom.filter.columns | "" | comma separated list of column names orc.bloom.filter.fpp | 0.05 | bloom filter false positive rate @@ -32,26 +32,151 @@ For example, to create an ORC table without high level compression: There are many Hive configuration properties related to ORC files: -Key | Default | Notes -:--- | :------ | :----------- -hive.default.fileformat | TextFile | This is the default file format for new tables. If it is set to ORC, new tables will default to ORC. -hive.stats.gather.num.threads | 10 | Number of threads used by partialscan/noscan analyze command for partitioned tables. This is applicable only for file formats that implement the StatsProvidingRecordReader interface (like ORC). -hive.exec.orc.memory.pool | 0.5 | Maximum fraction of heap that can be used by ORC file writers. -hive.exec.orc.write.format | NULL | Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the latest version. -hive.exec.orc.default.stripe.size | 67,108,864 | Define the default size of ORC writer buffers in bytes. -hive.exec.orc.default.block.size | 268,435,456 | Define the default file system block size for ORC files. -hive.exec.orc.dictionary.key.size.threshold | 0.8 | If the number of keys in a dictionary is greater than this fraction of the total number of non-null rows, turn off dictionary encoding. Use 1.0 to always use dictionary encoding. -hive.exec.orc.default.row.index.stride | 10,000 | Define the default number of rows between row index entries. -hive.exec.orc.default.buffer.size | 262,144 | Define the default ORC buffer size, in bytes. -hive.exec.orc.default.block.padding | true | Should ORC file writers pad stripes to minimize stripes that cross HDFS block boundaries. -hive.exec.orc.block.padding.tolerance | 0.05 | Define the tolerance for block padding as a decimal fraction of stripe size (for example, the default value 0.05 is 5% of the stripe size). 
For the defaults of 64Mb ORC stripe and 256Mb HDFS blocks, a maximum of 3.2Mb will be reserved for padding within the 256Mb block with the default hive.exec.orc.block.padding.tolerance. In that case, if the available size within the block is more than 3.2Mb, a new smaller stripe will be inserted to fit within that space. This will make sure that no stripe written will cross block boundaries and cause remote reads within a node local task. -hive.exec.orc.default.compress | ZLIB | Define the default compression codec for ORC file. -hive.exec.orc.encoding.strategy | SPEED | Define the encoding strategy to use while writing data. Changing this will only affect the light weight encoding for integers. This flag will not change the compression level of higher level compression codec (like ZLIB). Possible options are SPEED and COMPRESSION. -hive.orc.splits.include.file.footer | false | If turned on, splits generated by ORC will include metadata about the stripes in the file. This data is read remotely (from the client or HiveServer2 machine) and sent to all the tasks. -hive.orc.cache.stripe.details.size | 10,000 | Cache size for keeping meta information about ORC splits cached in the client. -hive.orc.compute.splits.num.threads | 10 | How many threads ORC should use to create splits in parallel. -hive.exec.orc.skip.corrupt.data | false | If ORC reader encounters corrupt data, this value will be used to determine whether to skip the corrupt data or throw an exception. The default behavior is to throw an exception. -hive.exec.orc.zerocopy | false | Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.) -hive.merge.orcfile.stripe.level | true | When hive.merge.mapfiles, hive.merge.mapredfiles or hive.merge.tezfiles is enabled while writing a table with ORC file format, enabling this configuration property will do stripe-level fast merge for small ORC files. 
Note that enabling this configuration property will not honor the padding tolerance configuration (hive.exec.orc.block.padding.tolerance). -hive.orc.row.index.stride.dictionary.check | true | If enabled dictionary check will happen after first row index stride (default 10000 rows) else dictionary check will happen before writing first stripe. In both cases, the decision to use dictionary or not will be retained thereafter. -hive.exec.orc.compression.strategy | SPEED | Define the compression strategy to use while writing data. This changes the compression level of higher level compression codec. Value can be SPEED or COMPRESSION. +<table class="configtable"> +<tr> + <th>Key</th> + <th>Default</th> + <th>Notes</th> +</tr> +<tr> + <td>hive.default.fileformat</td> + <td>TextFile</td> + <td>This is the default file format for new tables. If it is set to ORC, + new tables will default to ORC.</td> +</tr> +<tr> + <td>hive.stats.gather.num.threads</td> + <td>10</td> + <td>Number of threads used by partialscan/noscan analyze command for + partitioned tables. This is applicable only for file formats that + implement the StatsProvidingRecordReader interface (like ORC).</td> +</tr> +<tr> + <td>hive.exec.orc.memory.pool</td> + <td>0.5</td> + <td>Maximum fraction of heap that can be used by ORC file writers.</td> +</tr> +<tr> + <td>hive.exec.orc.write.format</td> + <td>NULL</td> + <td>Define the version of the file to write. Possible values are 0.11 and + 0.12. 
If this parameter is not defined, ORC will use the latest + version.</td> +</tr> +<tr> + <td>hive.exec.orc.default.stripe.size</td> + <td>67,108,864</td> + <td>Define the default size of ORC writer buffers in bytes.</td> +</tr> +<tr> + <td>hive.exec.orc.default.block.size</td> + <td>268,435,456</td> + <td>Define the default file system block size for ORC files.</td> +</tr> +<tr> + <td>hive.exec.orc.dictionary.key.size.threshold</td> + <td>0.8</td> + <td>If the number of keys in a dictionary is greater than this + fraction of the total number of non-null rows, turn off + dictionary encoding. Use 1.0 to always use dictionary encoding.</td> +</tr> +<tr> + <td>hive.exec.orc.default.row.index.stride</td> + <td>10,000</td> + <td>Define the default number of rows between row index entries.</td> +</tr> +<tr> + <td>hive.exec.orc.default.buffer.size</td> + <td>262,144</td> + <td>Define the default ORC buffer size, in bytes.</td> +</tr> +<tr> + <td>hive.exec.orc.default.block.padding</td> + <td>true</td> + <td>Should ORC file writers pad stripes to minimize stripes that cross HDFS + block boundaries.</td> +</tr> +<tr> + <td>hive.exec.orc.block.padding.tolerance</td> + <td>0.05</td> + <td>Define the tolerance for block padding as a decimal fraction of + stripe size (for example, the default value 0.05 is 5% of the + stripe size). For the defaults of 64Mb ORC stripe and 256Mb HDFS + blocks, a maximum of 3.2Mb will be reserved for padding within + the 256Mb block with the default + hive.exec.orc.block.padding.tolerance. In that case, if the + available size within the block is more than 3.2Mb, a new + smaller stripe will be inserted to fit within that space. 
This + will make sure that no stripe written will cross block + boundaries and cause remote reads within a node local task.</td> +</tr> +<tr> + <td>hive.exec.orc.default.compress</td> + <td>ZLIB</td> + <td>Define the default compression codec for ORC files.</td> +</tr> +<tr> + <td>hive.exec.orc.encoding.strategy</td> + <td>SPEED</td> + <td>Define the encoding strategy to use while writing data. Changing + this will only affect the light weight encoding for + integers. This flag will not change the compression level of + higher level compression codecs (like ZLIB). Possible options are + SPEED and COMPRESSION.</td> +</tr> +<tr> + <td>hive.orc.splits.include.file.footer</td> + <td>false</td> + <td>If turned on, splits generated by ORC will include metadata + about the stripes in the file. This data is read remotely (from + the client or HiveServer2 machine) and sent to all the tasks.</td> +</tr> +<tr> + <td>hive.orc.cache.stripe.details.size</td> + <td>10,000</td> + <td>Cache size for keeping meta information about ORC splits cached in the + client.</td> +</tr> +<tr> + <td>hive.orc.compute.splits.num.threads</td> + <td>10</td> + <td>How many threads ORC should use to create splits in parallel.</td> +</tr> +<tr> + <td>hive.exec.orc.skip.corrupt.data</td> + <td>false</td> + <td>If the ORC reader encounters corrupt data, this value will be used + to determine whether to skip the corrupt data or throw an + exception. The default behavior is to throw an exception.</td> +</tr> +<tr> + <td>hive.exec.orc.zerocopy</td> + <td>false</td> + <td>Use zerocopy reads with ORC. (This requires Hadoop 2.3 or later.)</td> +</tr> +<tr> + <td>hive.merge.orcfile.stripe.level</td> + <td>true</td> + <td>When hive.merge.mapfiles, hive.merge.mapredfiles or + hive.merge.tezfiles is enabled while writing a table with ORC + file format, enabling this configuration property will do + stripe-level fast merge for small ORC files.
Note that enabling + this configuration property will not honor the padding tolerance + configuration (hive.exec.orc.block.padding.tolerance).</td> +</tr> +<tr> + <td>hive.orc.row.index.stride.dictionary.check</td> + <td>true</td> + <td>If enabled, the dictionary check will happen after the first row index + stride (default 10,000 rows); otherwise the dictionary check will happen + before writing the first stripe. In both cases, the decision whether to + use a dictionary will be retained thereafter.</td> +</tr> +<tr> + <td>hive.exec.orc.compression.strategy</td> + <td>SPEED</td> + <td>Define the compression strategy to use while writing data. This changes + the compression level of the higher level compression codec. Value can be + SPEED or COMPRESSION.</td> +</tr> +</table> http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/hive-ddl.md ---------------------------------------------------------------------- diff --git a/site/_docs/hive-ddl.md b/site/_docs/hive-ddl.md index 3f0c910..e3c6288 100644 --- a/site/_docs/hive-ddl.md +++ b/site/_docs/hive-ddl.md @@ -26,7 +26,7 @@ files will be merged at the stripe level without reserialization. ```ALTER TABLE istari [PARTITION partition_spec] CONCATENATE; ``` -To get information about an ORC file, there is the orcfiledump command. +To get information about an ORC file, use the orcfiledump command.
```% hive --orcfiledump <path_to_file> ``` http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/index.md ---------------------------------------------------------------------- diff --git a/site/_docs/index.md b/site/_docs/index.md index 6bd6d8a..78e5cc3 100644 --- a/site/_docs/index.md +++ b/site/_docs/index.md @@ -1,7 +1,37 @@ --- layout: docs -title: Getting Started +title: Background permalink: /docs/index.html --- -Getting started body \ No newline at end of file +Back in January 2013, we created ORC files as part of the initiative +to massively speed up Apache Hive and improve the storage efficiency +of data stored in Apache Hadoop. The focus was on enabling high speed +processing and reducing file sizes. + +ORC is a self-describing type-aware columnar file format designed for +Hadoop workloads. It is optimized for large streaming reads, but with +integrated support for finding required rows quickly. Storing data in +a columnar format lets the reader read, decompress, and process only +the values that are required for the current query. Because ORC files +are type-aware, the writer chooses the most appropriate encoding for +the type and builds an internal index as the file is written. + +Predicate pushdown uses those indexes to determine which stripes in a +file need to be read for a particular query and the row indexes can +narrow the search to a particular set of 10,000 rows. ORC supports the +complete set of types in Hive, including the complex types: structs, +lists, maps, and unions. + +Many large Hadoop users have adopted ORC. For instance, Facebook uses +ORC to [save tens of petabytes](http://s.apache.org/fb-scaling-300-pb) +in their data warehouse and demonstrated that ORC is [significantly +faster](http://s.apache.org/presto-orc) than RC File or Parquet. Yahoo +uses ORC to store their production data and has released some of their +[benchmark results](http://s.apache.org/yahoo-orc). 
+ +ORC files are divided into *stripes* that are roughly 64MB by +default. The stripes in a file are independent of each other and form +the natural unit of distributed work. Within each stripe, the columns +are separated from each other so the reader can read just the columns +that are required. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/indexes.md ---------------------------------------------------------------------- diff --git a/site/_docs/indexes.md b/site/_docs/indexes.md index 202c19a..f6c1a8c 100644 --- a/site/_docs/indexes.md +++ b/site/_docs/indexes.md @@ -4,4 +4,29 @@ title: Indexes permalink: /docs/indexes.html --- -Indexes body \ No newline at end of file +ORC provides three levels of indexes within each file: + +* file level - statistics about the values in each column across the entire + file +* stripe level - statistics about the values in each column for each stripe +* row level - statistics about the values in each column for each set of + 10,000 rows within a stripe + +The file and stripe level column statistics are in the file footer so +that they are easy to access to determine if the rest of the file +needs to be read at all. Row level indexes include both the column +statistics for each row group and the position for seeking to the +start of the row group. + +Column statistics always contain the count of values and whether there +are null values present. Most other primitive types include the +minimum and maximum values and, for numeric types, the sum. As of Hive +1.2, the indexes can include bloom filters, which provide a much more +selective filter. + +The indexes at all levels are used by the reader via Search +ARGuments, or SARGs, which are simplified expressions that restrict the +rows that are of interest.
For example, if a query was looking for +people older than 100 years old, the SARG would be "age > 100" and +only files, stripes, or row groups that had people over 100 years old +would be read. \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/run-length.md ---------------------------------------------------------------------- diff --git a/site/_docs/run-length.md b/site/_docs/run-length.md index 28521f3..0566feb 100644 --- a/site/_docs/run-length.md +++ b/site/_docs/run-length.md @@ -4,4 +4,232 @@ title: Run Length Encoding permalink: /docs/run-length.html --- -Run Length body \ No newline at end of file +# Base 128 Varint + +Variable width integer encodings take advantage of the fact that most +numbers are small and that having smaller encodings for small numbers +shrinks the overall size of the data. ORC uses the varint format from +Protocol Buffers, which writes data in little endian format using the +low 7 bits of each byte. The high bit in each byte is set if the +number continues into the next byte. + +Unsigned Original | Serialized +:---------------- | :--------- +0 | 0x00 +1 | 0x01 +127 | 0x7f +128 | 0x80, 0x01 +129 | 0x81, 0x01 +16,383 | 0xff, 0x7f +16,384 | 0x80, 0x80, 0x01 +16,385 | 0x81, 0x80, 0x01 + +For signed integer types, the number is converted into an unsigned +number using a zigzag encoding. Zigzag encoding moves the sign bit to +the least significant bit using the expression (val << 1) ^ (val >> +63) and derives its name from the fact that positive and negative +numbers alternate once encoded. The unsigned number is then serialized +as above. + +Signed Original | Unsigned +:-------------- | :------- +0 | 0 +-1 | 1 +1 | 2 +-2 | 3 +2 | 4 + +# Byte Run Length Encoding + +For byte streams, ORC uses a very light weight encoding of identical +values. 
+ +* Run - a sequence of at least 3 identical values +* Literals - a sequence of non-identical values + +The first byte of each group of values is a header that determines +whether it is a run (value between 0 and 127) or literal list (value +between -128 and -1). For runs, the control byte is the length of the +run minus the length of the minimal run (3) and the control byte for +literal lists is the negative length of the list. For example, a +hundred 0's is encoded as [0x61, 0x00] and the sequence 0x44, 0x45 +would be encoded as [0xfe, 0x44, 0x45]. The next group can choose +either of the encodings. + +# Boolean Run Length Encoding + +For encoding boolean types, the bits are put in the bytes from most +significant to least significant. The bytes are encoded using byte run +length encoding as described in the previous section. For example, +the byte sequence [0xff, 0x80] would be one true followed by +seven false values. + +# Integer Run Length Encoding, version 1 + +In Hive 0.11 ORC files used Run Length Encoding version 1 (RLEv1), +which provides a lightweight compression of signed or unsigned integer +sequences. RLEv1 has two sub-encodings: + +* Run - a sequence of values that differ by a small fixed delta +* Literals - a sequence of varint encoded values + +Runs start with an initial byte of 0x00 to 0x7f, which encodes the +length of the run - 3. A second byte provides the fixed delta in the +range of -128 to 127. Finally, the first value of the run is encoded +as a base 128 varint. + +For example, if the sequence is 100 instances of 7 the encoding would +start with 100 - 3, followed by a delta of 0, and a varint of 7 for +an encoding of [0x61, 0x00, 0x07]. To encode the sequence of numbers +running from 100 to 1, the first byte is 100 - 3, the delta is -1, +and the varint is 100 for an encoding of [0x61, 0xff, 0x64]. + +Literals start with an initial byte of 0x80 to 0xff, which corresponds +to the negative of the number of literals in the sequence.
Following the +header byte, the list of N varints is encoded. Thus, if there are +no runs, the overhead is 1 byte for each 128 integers. The first 5 +prime numbers [2, 3, 5, 7, 11] would be encoded as [0xfb, 0x02, 0x03, +0x05, 0x07, 0x0b]. + +# Integer Run Length Encoding, version 2 + +In Hive 0.12, ORC introduced Run Length Encoding version 2 (RLEv2), +which has improved compression and fixed bit width encodings for +faster expansion. RLEv2 uses four sub-encodings based on the data: + +* Short Repeat - used for short sequences with repeated values +* Direct - used for random sequences with a fixed bit width +* Patched Base - used for random sequences with a variable bit width +* Delta - used for monotonically increasing or decreasing sequences + +## Short Repeat + +The short repeat encoding is used for short repeating integer +sequences with the goal of minimizing the overhead of the header. All +of the bits listed in the header are from the first byte to the last +and from most significant bit to least significant bit. If the type is +signed, the value is zigzag encoded. + +* 1 byte header + * 2 bits for encoding type (0) + * 3 bits for width (W) of repeating value (1 to 8 bytes) + * 3 bits for repeat count (3 to 10 values) +* W bytes in big endian format, which are zigzag encoded if the type + is signed + +The unsigned sequence of [10000, 10000, 10000, 10000, 10000] would be +serialized with short repeat encoding (0), a width of 2 bytes (1), and +repeat count of 5 (2) as [0x0a, 0x27, 0x10]. + +## Direct + +The direct encoding is used for integer sequences whose values have a +relatively constant bit width. It encodes the values directly using a +fixed width big endian encoding. The width of the values is encoded +using the table below.
+ +The 5 bit width encoding table for RLEv2: + +Width in Bits | Encoded Value | Notes +:------------ | :------------ | :---- +0 | 0 | for delta encoding +1 | 0 | for non-delta encoding +2 | 1 +4 | 3 +8 | 7 +16 | 15 +24 | 23 +32 | 27 +40 | 28 +48 | 29 +56 | 30 +64 | 31 +3 | 2 | deprecated +5 <= x <= 7 | x - 1 | deprecated +9 <= x <= 15 | x - 1 | deprecated +17 <= x <= 21 | x - 1 | deprecated +26 | 24 | deprecated +28 | 25 | deprecated +30 | 26 | deprecated + +* 2 bytes header + * 2 bits for encoding type (1) + * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit + width encoding table + * 9 bits for length (L) (1 to 512 values) +* W * L bits (padded to the next byte) encoded in big endian format, which are + zigzag encoded if the type is signed + +The unsigned sequence of [23713, 43806, 57005, 48879] would be +serialized with direct encoding (1), a width of 16 bits (15), and +length of 4 (3) as [0x5e, 0x03, 0x5c, 0xa1, 0xab, 0x1e, 0xde, 0xad, +0xbe, 0xef]. + +## Patched Base + +The patched base encoding is used for integer sequences whose bit +width varies significantly. The minimum signed value of the sequence is found +and subtracted from the other values. The bit width of those adjusted +values is analyzed and the 90th percentile of the bit width is chosen +as W. The 10% of values larger than W use patches from a patch list +to set the additional bits. Patches are encoded as a list of gaps in +the index values and the additional value bits.
+ +* 4 bytes header + * 2 bits for encoding type (2) + * 5 bits for encoded width (W) of values (1 to 64 bits) using the 5 bit + width encoding table + * 9 bits for length (L) (1 to 512 values) + * 3 bits for base value width (BW) (1 to 8 bytes) + * 5 bits for patch width (PW) (1 to 64 bits) using the 5 bit width + encoding table + * 3 bits for patch gap width (PGW) (1 to 8 bits) + * 5 bits for patch list length (PLL) (0 to 31 patches) +* Base value (BW bytes) - The base value is stored as a big endian value + with negative values marked by the most significant bit set. If that + bit is set, the entire value is negated. +* Data values (W * L bits padded to the byte) - A sequence of W bit positive + values that are added to the base value. +* Patch list (PLL * (PGW + PW) bits padded to the byte) - A list of patches + for values + that didn't fit within W bits. Each entry in the list consists of a + gap, which is the number of elements skipped from the previous + patch, and a patch value. Patches are applied by logically or'ing + the data values with the relevant patch shifted W bits left. If a + patch is 0, it was introduced to skip over more than 255 items. The + combined length of each patch (PGW + PW) must be less than or equal to + 64. + +The unsigned sequence of [2030, 2000, 2020, 1000000, 2040, 2050, 2060, +2070, 2080, 2090] has a minimum of 2000, which makes the adjusted +sequence [30, 0, 20, 998000, 40, 50, 60, 70, 80, 90]. It has an +encoding of patched base (2), a bit width of 8 (7), a length of 10 +(9), a base value width of 2 bytes (1), a patch width of 12 bits (11), +patch gap width of 2 bits (1), and a patch list length of 1 (1).
The +base value is 2000 and the combined result is [0x8e, 0x09, 0x2b, 0x21, +0x07, 0xd0, 0x1e, 0x00, 0x14, 0x70, 0x28, 0x32, 0x3c, 0x46, 0x50, +0x5a, 0xfc, 0xe8]. + +## Delta + +The Delta encoding is used for monotonically increasing or decreasing +sequences. The first two numbers in the sequence cannot be identical, +because the encoding uses the sign of the first delta to determine +if the series is increasing or decreasing. + +* 2 bytes header + * 2 bits for encoding type (3) + * 5 bits for encoded width (W) of deltas (0 to 64 bits) using the 5 bit + width encoding table + * 9 bits for run length (L) (1 to 512 values) +* Base value - encoded as (signed or unsigned) varint +* Delta base - encoded as signed varint +* Delta values (W * (L - 2) bits padded to the next byte) - encode each delta + after the first + one. If the delta base is positive, the sequence is increasing and if it is + negative the sequence is decreasing. + +The unsigned sequence of [2, 3, 5, 7, 11, 13, 17, 19, 23, 29] would be +serialized with delta encoding (3), a width of 4 bits (3), length of +10 (9), a base of 2 (2), and first delta of 1 (2). The resulting +sequence is [0xc6, 0x09, 0x02, 0x02, 0x22, 0x42, 0x42, 0x46].
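The varint, zigzag, and RLEv1 run rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ORC writer's actual code; the function names (`zigzag`, `varint`, `rle_v1_run`) are invented for this example.

```python
def zigzag(v):
    # Map a signed 64-bit value to unsigned: (v << 1) ^ (v >> 63),
    # so that small magnitudes (positive or negative) stay small.
    return ((v << 1) ^ (v >> 63)) & 0xFFFFFFFFFFFFFFFF

def varint(n):
    # Base 128 varint, little endian: low 7 bits per byte,
    # high bit set while the number continues into the next byte.
    out = []
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return out

def rle_v1_run(value, length, delta=0):
    # RLEv1 run: control byte (length - 3), one signed delta byte,
    # then the first value of the run as a base 128 varint.
    assert 3 <= length <= 130 and -128 <= delta <= 127
    return [length - 3, delta & 0xFF] + varint(value)
```

Running this against the examples in the text reproduces `[0x61, 0x00, 0x07]` for one hundred 7s and `[0x61, 0xff, 0x64]` for the sequence counting down from 100 to 1.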
http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/self-contained.md ---------------------------------------------------------------------- diff --git a/site/_docs/self-contained.md b/site/_docs/self-contained.md deleted file mode 100644 index 26d2e83..0000000 --- a/site/_docs/self-contained.md +++ /dev/null @@ -1,7 +0,0 @@ ---- -layout: docs -title: Metadata -permalink: /docs/self-contained.html ---- - -Self-Contained body \ No newline at end of file http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/spec-index.md ---------------------------------------------------------------------- diff --git a/site/_docs/spec-index.md b/site/_docs/spec-index.md new file mode 100644 index 0000000..009df59 --- /dev/null +++ b/site/_docs/spec-index.md @@ -0,0 +1,111 @@ +--- +layout: docs +title: Indexes +permalink: /docs/spec-index.html +--- + +# Row Group Index + +The row group indexes consist of a ROW_INDEX stream for each primitive +column that has an entry for each row group. Row groups are controlled +by the writer and default to 10,000 rows. Each RowIndexEntry gives the +position of each stream for the column and the statistics for that row +group. + +The index streams are placed at the front of the stripe, because in +the default case of streaming they do not need to be read. They are +only loaded when either predicate push down is being used or the +reader seeks to a particular row. + +```message RowIndexEntry { + repeated uint64 positions = 1 [packed=true]; + optional ColumnStatistics statistics = 2; +} +``` + +```message RowIndex { + repeated RowIndexEntry entry = 1; +} +``` + +To record positions, each stream needs a sequence of numbers. For +uncompressed streams, the position is the byte offset of the RLE run's +start location followed by the number of values that need to be +consumed from the run. 
In compressed streams, the first number is the +start of the compression chunk in the stream, followed by the number +of decompressed bytes that need to be consumed, and finally the number +of values consumed in the RLE. + +For columns with multiple streams, the sequences of positions in each +stream are concatenated. That was an unfortunate decision on my part +that we should fix at some point, because it makes code that uses the +indexes error-prone. + +Because dictionaries are accessed randomly, there is not a position to +record for the dictionary and the entire dictionary must be read even +if only part of a stripe is being read. + +# Bloom Filter Index + +Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards. +Predicate pushdown can make use of bloom filters to better prune +the row groups that do not satisfy the filter condition. +The bloom filter indexes consist of a BLOOM_FILTER stream for each +column specified through 'orc.bloom.filter.columns' table properties. +A BLOOM_FILTER stream records a bloom filter entry for each row +group (default to 10,000 rows) in a column. Only the row groups that +satisfy min/max row index evaluation will be evaluated against the +bloom filter index. + +Each BloomFilterEntry stores the number of hash functions ('k') used and +the bitset backing the bloom filter. The bitset is serialized as repeated +longs from which the number of bits ('m') for the bloom filter can be derived. +m = bitset.length * 64. + +```message BloomFilter { + optional uint32 numHashFunctions = 1; + repeated fixed64 bitset = 2; +} +``` + +```message BloomFilterIndex { + repeated BloomFilter bloomFilter = 1; +} +``` + +Bloom filter internally uses two different hash functions to map a key +to a position in the bit set. For tinyint, smallint, int, bigint, float +and double types, Thomas Wang's 64-bit integer hash function is used. +Floats are converted to IEEE-754 32 bit representation +(using Java's Float.floatToIntBits(float)). 
Similarly, Doubles are +converted to IEEE-754 64 bit representation (using Java's +Double.doubleToLongBits(double)). All these primitive types +are cast to the long base type before being passed on to the hash function. +For strings and binary types, the Murmur3 64 bit hash algorithm is used. +The 64 bit variant of Murmur3 considers only the most significant +8 bytes of the Murmur3 128-bit algorithm. The 64 bit hashcode generated +from the above algorithms is used as a base to derive 'k' different +hash functions. We use the idea mentioned in the paper "Less Hashing, +Same Performance: Building a Better Bloom Filter" by Kirsch et al. to +quickly compute the k hashcodes. + +The algorithm for computing k hashcodes and setting the bit position +in a bloom filter is as follows: + +1. Get the 64 bit base hash code from Murmur3 or Thomas Wang's hash algorithm. +2. Split the above hashcode into two 32-bit hashcodes (say hash1 and hash2). +3. The k'th hashcode is obtained by (where k > 0): + * combinedHash = hash1 + (k * hash2) +4. If combinedHash is negative flip all the bits: + * combinedHash = ~combinedHash +5. The bit set position is obtained by performing modulo with m: + * position = combinedHash % m +6. Set the position in the bit set. The upper bits (position >>> 6) select + the long within the bitset, and the least significant 6 bits select the + bit position within that long in little endian order. + * bitset[position >>> 6] \|= (1L << position); + +Bloom filter streams are interlaced with row group indexes. This placement +makes it convenient to read the bloom filter stream and row index stream +together in a single read operation.
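The steps above can be sketched in Python. This is a sketch for illustration, not the ORC implementation: the function name is invented, and it assumes the combined hash uses 32-bit signed (Java int style) arithmetic with hash1 taken from the low 32 bits and hash2 from the high 32 bits of the base hash.

```python
def bloom_positions(hash64, k, m):
    """Derive k bit positions in an m-bit bloom filter from one
    64-bit base hash, following steps 2-5 from the text.

    Assumptions (not stated in the text): combinedHash overflows like
    a 32-bit signed integer, hash1 is the low half of the base hash,
    and hash2 is the high half.
    """
    def to_i32(x):
        # Emulate Java int: wrap to a signed 32-bit value.
        x &= 0xFFFFFFFF
        return x - (1 << 32) if x & 0x80000000 else x

    hash1 = to_i32(hash64)          # low 32 bits
    hash2 = to_i32(hash64 >> 32)    # high 32 bits
    positions = []
    for i in range(1, k + 1):
        combined = to_i32(hash1 + i * hash2)  # step 3
        if combined < 0:
            combined = ~combined              # step 4: flip all bits
        positions.append(combined % m)        # step 5: modulo with m
    return positions
```

Setting each returned position then touches long index `position >>> 6` of the bitset, with the low 6 bits of the position selecting the bit within that long.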
+ + http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/spec-intro.md ---------------------------------------------------------------------- diff --git a/site/_docs/spec-intro.md b/site/_docs/spec-intro.md index 223bed5..b92cf37 100644 --- a/site/_docs/spec-intro.md +++ b/site/_docs/spec-intro.md @@ -4,4 +4,23 @@ title: Introduction permalink: /docs/spec-intro.html --- -Spec intro body \ No newline at end of file +Hive's RCFile was the standard format for storing tabular data in +Hadoop for several years. However, RCFile has limitations because it +treats each column as a binary blob without semantics. In Hive 0.11 we +added a new file format named Optimized Row Columnar (ORC) file that +uses and retains the type information from the table definition. ORC +uses type specific readers and writers that provide light weight +compression techniques such as dictionary encoding, bit packing, delta +encoding, and run length encoding -- resulting in dramatically smaller +files. Additionally, ORC can apply generic compression using zlib, or +Snappy on top of the lightweight compression for even smaller +files. However, storage savings are only part of the gain. ORC +supports projection, which selects subsets of the columns for reading, +so that queries reading only one column read only the required +bytes. Furthermore, ORC files include light weight indexes that +include the minimum and maximum values for each column in each set of +10,000 rows and the entire file. Using pushdown filters from Hive, the +file reader can skip entire sets of rows that aren't important for +this query. 
+ + http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/stripes.md ---------------------------------------------------------------------- diff --git a/site/_docs/stripes.md b/site/_docs/stripes.md index 3c52111..d53f709 100644 --- a/site/_docs/stripes.md +++ b/site/_docs/stripes.md @@ -4,4 +4,84 @@ title: Stripes permalink: /docs/stripes.html --- -Stripes body \ No newline at end of file +The body of ORC files consists of a series of stripes. Stripes are +large (typically ~200MB) and independent of each other and are often +processed by different tasks. The defining characteristic of columnar +storage formats is that the data for each column is stored separately +and that reading data out of the file should be proportional to the +number of columns read. + +In ORC files, each column is stored in several streams that are stored +next to each other in the file. For example, an integer column is +represented as two streams: PRESENT, which uses one bit per +value to record whether the value is non-null, and DATA, which records the +non-null values. If all of a column's values in a stripe are non-null, +the PRESENT stream is omitted from the stripe. For binary data, ORC +uses three streams: PRESENT, DATA, and LENGTH, which stores the length +of each value. The details of each type will be presented in the +following subsections. + +# Stripe Footer + +The stripe footer contains the encoding of each column and the +directory of the streams including their location. + +```message StripeFooter { + // the location of each stream + repeated Stream streams = 1; + // the encoding of each column + repeated ColumnEncoding columns = 2; +} +``` + +To describe each stream, ORC stores the kind of stream, the column id, +and the stream's size in bytes. The details of what is stored in each stream +depend on the type and encoding of the column.
+ +```message Stream { + enum Kind { + // boolean stream of whether the next value is non-null + PRESENT = 0; + // the primary data stream + DATA = 1; + // the length of each value for variable length data + LENGTH = 2; + // the dictionary blob + DICTIONARY_DATA = 3; + // deprecated prior to Hive 0.11 + // It was used to store the number of instances of each value in the + // dictionary + DICTIONARY_COUNT = 4; + // a secondary data stream + SECONDARY = 5; + // the index for seeking to particular row groups + ROW_INDEX = 6; + } + required Kind kind = 1; + // the column id + optional uint32 column = 2; + // the number of bytes in the file + optional uint64 length = 3; +} +``` + +Depending on their type, several options for encoding are possible. The +encodings are divided into direct or dictionary-based categories and +further refined as to whether they use RLE v1 or v2. + +```message ColumnEncoding { + enum Kind { + // the encoding is mapped directly to the stream using RLE v1 + DIRECT = 0; + // the encoding uses a dictionary of unique values using RLE v1 + DICTIONARY = 1; + // the encoding is direct using RLE v2 + DIRECT_V2 = 2; + // the encoding is dictionary-based using RLE v2 + DICTIONARY_V2 = 3; + } + required Kind kind = 1; + // for dictionary encodings, record the size of the dictionary + optional uint32 dictionarySize = 2; +} +``` http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_docs/types.md ---------------------------------------------------------------------- diff --git a/site/_docs/types.md b/site/_docs/types.md index 5fb565c..c58580e 100644 --- a/site/_docs/types.md +++ b/site/_docs/types.md @@ -4,4 +4,60 @@ title: Types permalink: /docs/types.html --- -types body \ No newline at end of file +ORC files are completely self-describing and do not depend on the Hive +Metastore or any other external metadata. The file includes all of the +type and encoding information for the objects stored in the file.
Because the +file is self-contained, it does not depend on the user's environment to +correctly interpret the file's contents. + +ORC provides a rich set of scalar and compound types: + +* Integer + * boolean (1 bit) + * tinyint (8 bit) + * smallint (16 bit) + * int (32 bit) + * bigint (64 bit) +* Floating point + * float + * double +* String types + * string + * char + * varchar +* Binary blobs + * binary +* Date/time + * timestamp + * date +* Compound types + * struct + * list + * map + * union + +All ORC files are logically sequences of identically typed objects. Hive +always uses a struct with a field for each of the top-level columns as +the root object type, but that is not required. All types in ORC can take +null values, including the compound types. + +Compound types have children columns that hold the values for their +sub-elements. For example, a struct column has one child column for +each field of the struct. Lists always have a single child column for +the element values and maps always have two child columns. Union +columns have one child column for each of the variants. + +Given the following definition of the table Foobar, the columns in the +file would form the given tree.
+
+```create table Foobar (
+ myInt int,
+ myMap map<string,
+  struct<myString : string,
+  myDouble: double>>,
+ myTime timestamp
+);
+```
+
+

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_includes/docs_option.html
----------------------------------------------------------------------
diff --git a/site/_includes/docs_option.html b/site/_includes/docs_option.html
index c259397..0913a59 100644
--- a/site/_includes/docs_option.html
+++ b/site/_includes/docs_option.html
@@ -1,7 +1,7 @@
 {% assign items = include.items %}
 {% for item in items %}
-  {% assign item_url = item | prepend:"/docs/" | append:"/" %}
+  {% assign item_url = item | prepend:"/docs/" | append:".html" %}
 {% for p in site.docs %}
 {% if p.url == item_url %}

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_posts/2015-04-22-orc-as-tlp.markdown
----------------------------------------------------------------------
diff --git a/site/_posts/2015-04-22-orc-as-tlp.markdown b/site/_posts/2015-04-22-orc-as-tlp.markdown
deleted file mode 100644
index 2e1d6b1..0000000
--- a/site/_posts/2015-04-22-orc-as-tlp.markdown
+++ /dev/null
@@ -1,10 +0,0 @@
----
-layout: news_item
-title: "ORC becomes an Apache Top Level Project"
-date: "2015-04-22 20:00:00 -0800"
-author: omalley
----
-
-Today Apache ORC became a top level project at the Apache Software
-Foundation. This step represents a major step forward for the project,
-and is representative of its momentum.
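To make the column tree for the Foobar table above concrete, here is a small hypothetical Python sketch (invented names, not part of this commit or the ORC sources, which are Java) that numbers the nodes of that type tree in pre-order — the order in which ORC assigns column ids, so the root struct is column 0 and myTime ends up as column 7:

```python
# Hypothetical sketch: assign ORC-style pre-order column ids to a type
# tree. TypeNode and assign_ids are invented names for illustration.

class TypeNode:
    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = children or []
        self.column_id = None

def assign_ids(node, next_id=0):
    # Pre-order walk: number this node, then each subtree left to right.
    node.column_id = next_id
    next_id += 1
    for child in node.children:
        next_id = assign_ids(child, next_id)
    return next_id  # total number of columns

# The Foobar table as a tree: struct<myInt:int,
#   myMap:map<string, struct<myString:string, myDouble:double>>,
#   myTime:timestamp>
foobar = TypeNode("struct", [
    TypeNode("int"),                       # myInt
    TypeNode("map", [
        TypeNode("string"),                # map key
        TypeNode("struct", [
            TypeNode("string"),            # myString
            TypeNode("double"),            # myDouble
        ]),
    ]),
    TypeNode("timestamp"),                 # myTime
])

total = assign_ids(foobar)  # 8 columns: ids 0 (root) through 7 (myTime)
```

Note how the map contributes three columns of its own (the map node plus its key and value children), which is why the value struct's fields land at ids 5 and 6.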
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_posts/2015-04-22-orc-as-tlp.md
----------------------------------------------------------------------
diff --git a/site/_posts/2015-04-22-orc-as-tlp.md b/site/_posts/2015-04-22-orc-as-tlp.md
new file mode 100644
index 0000000..0ed295a
--- /dev/null
+++ b/site/_posts/2015-04-22-orc-as-tlp.md
@@ -0,0 +1,42 @@
+---
+layout: news_item
+title: "ORC becomes an Apache Top Level Project"
+date: "2015-04-22 20:00:00 -0800"
+author: omalley
+categories: [project]
+---
+
+Today Apache ORC became a top level project at the Apache Software
+Foundation. This milestone is a major step forward for the project
+and reflects its momentum.
+
+Back in January 2013, we created ORC files as part of the initiative
+to massively speed up Apache Hive and improve the storage efficiency
+of data stored in Apache Hadoop. We added it as a feature of Hive for
+two reasons:
+
+1. To ensure that it would be well integrated with Hive
+2. To ensure that storing data in ORC format would be as simple as
+   adding "stored as ORC" to your table definition.
+
+In the last two years, many of the features that we've added to Hive,
+such as vectorization, ACID, predicate pushdown, and LLAP, support ORC
+first and follow up with other storage formats later.
+
+The growing use and acceptance of ORC has encouraged additional Hadoop
+execution engines, such as Apache Pig, Map-Reduce, Cascading, and
+Apache Spark, to support reading and writing ORC. However, there are
+concerns that depending on the large Hive jar that contains ORC pulls
+in many of Hive's other dependencies. To better support
+these non-Hive users, we decided to split off from Hive and become a
+separate project. This will not only allow us to support Hive, but
+also provide a much more streamlined jar, documentation, and help for
+users outside of Hive.
+
+Although Hadoop and its ecosystem are largely written in Java, there
+are a lot of applications in other languages that would like to
+natively access ORC files in HDFS. Hortonworks, HP, and Microsoft are
+developing a pure C++ ORC reader and writer that enables C++
+applications to read and write ORC files efficiently without
+Java. That code will also be moved into Apache ORC and released
+together with the Java implementation.

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_posts/2015-05-11-new-committers.md
----------------------------------------------------------------------
diff --git a/site/_posts/2015-05-11-new-committers.md b/site/_posts/2015-05-11-new-committers.md
new file mode 100644
index 0000000..3d16a7e
--- /dev/null
+++ b/site/_posts/2015-05-11-new-committers.md
@@ -0,0 +1,18 @@
+---
+layout: news_item
+title: "ORC adds 7 committers"
+date: "2015-05-11 17:23:00 -0800"
+author: omalley
+categories: [team]
+---
+
+The ORC project management committee today added seven new committers
+for their work on ORC. Welcome all!
+
+* Gunther Hagleitner
+* Aliaksei Sandryhaila
+* Sergey Shelukhin
+* Gopal Vijayaraghavan
+* Stephen Walkauskas
+* Kevin Wilfong
+* Xuefu Zhang

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/_sass/_style.scss
----------------------------------------------------------------------
diff --git a/site/_sass/_style.scss b/site/_sass/_style.scss
index 64c5a8d..bf58227 100644
--- a/site/_sass/_style.scss
+++ b/site/_sass/_style.scss
@@ -818,6 +818,10 @@ code.output {
   .note { margin-left: -50px; }
 }
+.configtable {
+  font-size: 14px;
+}
+
 .note {
   &:before {
     content: "";

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/develop/index.md
----------------------------------------------------------------------
diff --git a/site/develop/index.md b/site/develop/index.md
index aebe6bc..59fd32b 100644
--- a/site/develop/index.md
+++ b/site/develop/index.md
@@ -7,13 +7,20 @@ Want to help add a feature or fix a bug?

 ## Project Members

-Name                  | Apache Id | Role
-:-------------------- | :-------- | :---
-Chris Douglas         | cdouglas  | PMC
-Alan Gates            | gates     | PMC
-Prasanth Jayachandran | prasanthj | PMC
-Lefty Leverenz        | leftyl    | PMC
-Owen O'Malley         | omalley   | PMC
+Name                  | Apache Id    | Role
+:-------------------- | :----------- | :---
+Chris Douglas         | cdouglas     | PMC
+Alan Gates            | gates        | PMC
+Gunther Hagleitner    | gunther      | Committer
+Prasanth Jayachandran | prasanthj    | PMC
+Lefty Leverenz        | leftyl       | PMC
+Owen O'Malley         | omalley      | PMC
+Aliaksei Sandryhaila  | ?            | Committer
+Sergey Shelukhin      | sershe       | Committer
+Gopal Vijayaraghavan  | gopalv       | Committer
+Stephen Walkauskas    | ?            | Committer
+Kevin Wilfong         | kevinwilfong | Committer
+Xuefu Zhang           | xuefu        | Committer

 ## Mailing Lists

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/doap_orc.rdf
----------------------------------------------------------------------
diff --git a/site/doap_orc.rdf b/site/doap_orc.rdf
new file mode 100644
index 0000000..26817ee
--- /dev/null
+++ b/site/doap_orc.rdf
@@ -0,0 +1,55 @@
+<?xml version="1.0"?>
+<?xml-stylesheet type="text/xsl"?>
+<rdf:RDF xml:lang="en"
+         xmlns="http://usefulinc.com/ns/doap#"
+         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
+         xmlns:asfext="http://projects.apache.org/ns/asfext#"
+         xmlns:foaf="http://xmlns.com/foaf/0.1/">
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements. See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+  <Project rdf:about="https://orc.apache.org">
+    <created>2015-05-12</created>
+    <license rdf:resource="http://spdx.org/licenses/Apache-2.0" />
+    <name>Apache Orc</name>
+    <homepage rdf:resource="https://orc.apache.org" />
+    <asfext:pmc rdf:resource="https://orc.apache.org" />
+    <shortdesc>A high-performance columnar file format for Hadoop workloads.</shortdesc>
+    <description>ORC is a self-describing type-aware columnar file format designed for
+Hadoop workloads. It is optimized for large streaming reads, but with
+integrated support for finding required rows quickly. Storing data in
+a columnar format lets the reader read, decompress, and process only
+the values that are required for the current query.</description>
+    <bug-database rdf:resource="https://issues.apache.org/jira/browse/orc" />
+    <mailing-list rdf:resource="https://orc.apache.org/develop" />
+    <download-page rdf:resource="https://orc.apache.org/downloads" />
+    <programming-language>Java</programming-language>
+    <programming-language>C++</programming-language>
+    <category rdf:resource="http://projects.apache.org/category/big-data" />
+    <repository>
+      <GitRepository>
+        <location rdf:resource="https://git-wip-us.apache.org/repos/asf/orc.git"/>
+        <browse rdf:resource="https://github.com/apache/orc"/>
+      </GitRepository>
+    </repository>
+    <maintainer>
+      <foaf:Person>
+        <foaf:name>Owen O'Malley</foaf:name>
+        <foaf:mbox rdf:resource="mailto:[email protected]"/>
+      </foaf:Person>
+    </maintainer>
+  </Project>
+</rdf:RDF>

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/downloads/index.md
----------------------------------------------------------------------
diff --git a/site/downloads/index.md b/site/downloads/index.md
new file mode 100644
index 0000000..297127f
--- /dev/null
+++ b/site/downloads/index.md
@@ -0,0 +1,7 @@
+---
+layout: page
+title: Downloads
+---
+
+We haven't made any releases as a separate project yet. Please download
+the Hive 1.1 release and use the hive-exec.jar.
\ No newline at end of file

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/img/BloomFilter.png
----------------------------------------------------------------------
diff --git a/site/img/BloomFilter.png b/site/img/BloomFilter.png
new file mode 100644
index 0000000..702fdf0
Binary files /dev/null and b/site/img/BloomFilter.png differ

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/img/CompressionStream.png
----------------------------------------------------------------------
diff --git a/site/img/CompressionStream.png b/site/img/CompressionStream.png
new file mode 100644
index 0000000..85b88bc
Binary files /dev/null and b/site/img/CompressionStream.png differ

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/img/OrcFileLayout.png
----------------------------------------------------------------------
diff --git a/site/img/OrcFileLayout.png b/site/img/OrcFileLayout.png
new file mode 100644
index 0000000..ca0d456
Binary files /dev/null and b/site/img/OrcFileLayout.png differ

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/img/TreeWriters.png
----------------------------------------------------------------------
diff --git a/site/img/TreeWriters.png b/site/img/TreeWriters.png
new file mode 100644
index 0000000..395e99d
Binary files /dev/null and b/site/img/TreeWriters.png differ

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/news/index.html
----------------------------------------------------------------------
diff --git a/site/news/index.html b/site/news/index.html
index 8a3c3a0..3317496 100644
--- a/site/news/index.html
+++ b/site/news/index.html
@@ -14,4 +14,6 @@ author: all
 <h2>ORC Twitter</h2>

 <p>The official <a href="https://twitter.com/apacheorc">@ApacheOrc</a>
-Twitter account pushes announcements about ORC.</p>
+Twitter account pushes announcements about ORC. If you give a talk about
+ORC, let us know and we'll tweet it out and add it to the news section
+of the website.</p>

http://git-wip-us.apache.org/repos/asf/orc/blob/ad182311/site/talks/index.md
----------------------------------------------------------------------
diff --git a/site/talks/index.md b/site/talks/index.md
index 1bdb607..95ae75c 100644
--- a/site/talks/index.md
+++ b/site/talks/index.md
@@ -27,6 +27,15 @@ Watch some presentations and read through some slide decks about ORC.
   src="https://www.youtube.com/embed/Jd4lFEK8gCM" frameborder="0"
   allowfullscreen></iframe>

+* Adding ACID Updates to Hive
+  [slides](http://www.slideshare.net/alanfgates/hive-acidupdatessummitsjc2014)
+
+  Owen O'Malley and Alan Gates at Hadoop Summit: San Jose 2014
+
+  <iframe width="560" height="315"
+    src="https://www.youtube.com/embed/_A-GMkIygtU" frameborder="0"
+    allowfullscreen></iframe>
+
 ## Slides

 * ORC Files
