[jira] [Resolved] (THRIFT-111) TRecordStream: a robust transport for writing records with (optional) CRCs/Compression and ability to skip over corrupted data

Roger Meier (Resolved) (JIRA) Mon, 09 Apr 2012 11:25:43 -0700

     [ 
https://issues.apache.org/jira/browse/THRIFT-111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Roger Meier resolved THRIFT-111.
--------------------------------

    Resolution: Won't Fix

This issue is too old. Please reopen it, or create a new issue with a patch, if
you need this.
See http://thrift.apache.org/docs/HowToContribute/
                
> TRecordStream: a robust transport for writing records with (optional) 
> CRCs/Compression and ability to skip over corrupted data
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: THRIFT-111
>                 URL: https://issues.apache.org/jira/browse/THRIFT-111
>             Project: Thrift
>          Issue Type: New Feature
>            Reporter: Pete Wyckoff
>            Priority: Minor
>
> Design Document for TRecordStream (this is basically the design doc 
> circulated on the public thrift lists under the name TRobustOfflineStream in 
> May 08 with the addition of the requirement of handling small synchronous 
> writes)
> TRecordStream is a Thrift transport that encodes data in a format
> suitable for storage in a file (not synchronous communication).
> TRecordStream achieves the following design goals:
> - Be self-describing and extensible.  A file containing a TRecordStream
>   must contain enough metadata for an application to read it with no other
>   context.  It should be possible to add new features without breaking
>   backwards and forwards compatibility.  It should be possible to completely
>   change the format without confusing old or new programs.
> - Be robust against disk corruption.  All data and metadata must (optionally)
>   be checksummed.  It must be possible to recover and continue reading
>   uncorrupted data after corruption is encountered.
> - Be (optionally) human-readable.  TRecordStream will also be used for
>   plain-text, line-oriented, human-readable data.  Allowing a plain-text,
>   line-oriented, human-readable header format will be advantageous for this
>   use case.
> - Support asynchronous file I/O.  This feature will not be implemented in the
>   first version of TRecordStream, but the implementation must support
>   the eventual inclusion of this feature.
> - Be performant.  No significant sacrifice of speed should be made in order to
>   achieve any of the other design goals.
> - Support small synchronous writes
> TRecordStream will not do any I/O itself, but will instead focus on
> preparing the data format and depend on an underlying transport (TFDTransport,
> for example) to write the data to a file.
> TRecordStream will have two distinct formats: binary and plain text.
> Binary-format streams shall begin with a format version number, encoded as a
> 32-bit big-endian integer.  The version number must not exceed 2^24-1, so the
> first byte of a TRecordStream will always be 0. The version number
> shall be repeated once to guard against corruption.  If the two copies of the
> version number do not match, the stream must be considered corrupt, and
> recovery should proceed as described below (TODO).
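The binary version prefix described above can be pictured with a short Python sketch. The helper names are hypothetical, and treating an out-of-range version as corruption is an assumption; the recovery procedure itself is still marked TODO in the design.

```python
import struct

MAX_VERSION = 2**24 - 1  # upper bound stated in the design document


def encode_version_prefix(version: int) -> bytes:
    """Return the version as two identical 32-bit big-endian integers."""
    if not 0 <= version <= MAX_VERSION:
        raise ValueError("version must fit in 24 bits")
    return struct.pack(">II", version, version)


def decode_version_prefix(data: bytes) -> int:
    """Read the doubled version number; raise ValueError if it looks corrupt."""
    if len(data) < 8:
        raise ValueError("truncated version prefix")
    first, second = struct.unpack(">II", data[:8])
    # Assumption: a value above the 24-bit limit is handled like a mismatch.
    if first != second or first > MAX_VERSION:
        raise ValueError("corrupt version prefix")
    return first
```

Because the version never exceeds 2^24-1, the first encoded byte is always zero, which is the property the text relies on.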
> Plain-text streams shall begin with the ASCII string "TROS: " (that is a space
> after the colon), followed by the decimal form of the version number
> (ASCII-encoded), followed by a linefeed (ASCII 0x0a) character.  The full
> version line shall be repeated.
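A minimal sketch of the plain-text version lines, assuming the repeated line immediately follows the first one. The function names are hypothetical and not part of the design.

```python
PREFIX = b"TROS: "


def encode_text_version(version: int) -> bytes:
    """Return the version line twice, each terminated by a linefeed."""
    line = PREFIX + str(version).encode("ascii") + b"\n"
    return line + line


def decode_text_version(data: bytes) -> int:
    """Parse the two version lines; raise ValueError if they disagree."""
    lines = data.split(b"\n", 2)
    # Two complete lines leave three pieces: line 1, line 2, and the rest.
    if len(lines) < 3 or lines[0] != lines[1]:
        raise ValueError("corrupt or truncated version lines")
    digits = lines[0][len(PREFIX):]
    if not lines[0].startswith(PREFIX) or not digits.isdigit():
        raise ValueError("malformed version line")
    return int(digits.decode("ascii"))
```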
> This document describes version 1 of the format.  Version 1 streams are
> composed of a series of chunks.  Variable-length chunks are supported, but their
> use is discouraged because they make recovering from corrupt chunk headers
> difficult.  Each chunk begins with the redundant version identifiers described
> above.
> Following the version numbers, a binary-format stream shall contain the
> following fields, in order and with no padding:
> - The (32-bit) CRC-32 of the header length + header data.
> - The 32-bit big-endian header length.
> - A variable-length header, which is a TBinaryProtocol-serialized Thrift
>   structure (whose exact structure is defined in
>   robust_offline_stream.thrift).
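The binary header framing listed above might look like the following sketch. It treats the serialized header as opaque bytes, and it assumes that "CRC-32" means the zlib/IEEE CRC-32 and that the checksum is stored big-endian like the length; the design does not state either point. The function names are hypothetical.

```python
import struct
import zlib
from typing import Tuple


def frame_header(header: bytes) -> bytes:
    """Prefix an already-serialized header with its CRC-32 and length."""
    length = struct.pack(">I", len(header))
    # The checksum covers the length field followed by the header bytes.
    crc = zlib.crc32(length + header) & 0xFFFFFFFF
    return struct.pack(">I", crc) + length + header


def unframe_header(data: bytes) -> Tuple[bytes, int]:
    """Return (header bytes, bytes consumed); raise ValueError on corruption."""
    if len(data) < 8:
        raise ValueError("truncated header frame")
    crc, length = struct.unpack(">II", data[:8])
    header = data[8:8 + length]
    if len(header) != length:
        raise ValueError("header length exceeds available data")
    if zlib.crc32(data[4:8] + header) & 0xFFFFFFFF != crc:
        raise ValueError("header checksum mismatch")
    return header, 8 + length
```

Covering the length field with the checksum means a corrupted length is detected by the same check as corrupted header data.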
> A plain-text stream should follow the versions with:
> - The string "Header-Checksum: "
> - The eight-character (leading-zero-padded) hexadecimal encoding of the
>   unsigned CRC-32 of the header (which does *not* include the CRC-32).
> - A linefeed (0x0a).
> - A header consisting of zero or more entries, where each entry consists of
>   - An entry name, which is an ASCII string consisting of alphanumeric
>     characters, dashes ("-"), underscores, and periods (full-stops).
>   - A colon followed by a space.
>   - An entry value, which is a printable ASCII string not including any
>     linefeeds.
>   - A linefeed.
> - A linefeed.
> Header entry names may be repeated.  The handling of repeated names is
> dependent on the particular name.  Unless otherwise specified, all entries
> with a given name other than the last are ignored.
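The plain-text header layout and the last-entry-wins rule can be sketched as below. One assumption is labeled in the code: the checksum is taken over the entry lines including their linefeeds but excluding the terminating blank line, since the design only says the checksum does not cover itself. The function names are hypothetical.

```python
import re
import zlib
from typing import Dict, List, Tuple

CHECKSUM_PREFIX = b"Header-Checksum: "
# Name: alphanumerics, dashes, underscores, periods. Value: printable ASCII.
ENTRY_RE = re.compile(rb"([A-Za-z0-9._-]+): ([\x20-\x7e]*)")
HEX8_RE = re.compile(rb"[0-9a-fA-F]{8}")


def format_text_header(entries: List[Tuple[str, str]]) -> bytes:
    """Serialize entries with a leading checksum line and a closing linefeed."""
    body = b"".join(f"{n}: {v}\n".encode("ascii") for n, v in entries)
    crc = zlib.crc32(body) & 0xFFFFFFFF  # assumption: covers entry lines only
    return CHECKSUM_PREFIX + f"{crc:08x}".encode("ascii") + b"\n" + body + b"\n"


def parse_text_header(data: bytes) -> Dict[str, str]:
    """Verify the checksum and return entries; later duplicates win."""
    first, sep, rest = data.partition(b"\n")
    hex_crc = first[len(CHECKSUM_PREFIX):]
    if not sep or not first.startswith(CHECKSUM_PREFIX):
        raise ValueError("missing Header-Checksum line")
    if HEX8_RE.fullmatch(hex_crc) is None:
        raise ValueError("malformed checksum")
    if rest.startswith(b"\n"):
        body = b""  # zero entries: the blank line follows immediately
    else:
        end = rest.find(b"\n\n")
        if end < 0:
            raise ValueError("unterminated header")
        body = rest[:end + 1]
    if zlib.crc32(body) & 0xFFFFFFFF != int(hex_crc.decode("ascii"), 16):
        raise ValueError("header checksum mismatch")
    result: Dict[str, str] = {}
    for line in body.split(b"\n")[:-1]:
        match = ENTRY_RE.fullmatch(line)
        if match is None:
            raise ValueError("malformed header entry")
        # Default rule: a repeated name overwrites the earlier value.
        result[match.group(1).decode("ascii")] = match.group(2).decode("ascii")
    return result
```

A fuller parser would need to special-case names whose repeats are all significant, such as the "Sub-Chunk" entries, and keep those in a list instead.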
> The actual data will be stored in sub-chunks, which may optionally be
> compressed.  (The chunk header will define the compression format used.)  The
> chunk header will specify the following fields for each sub-chunk:
>  - (optional) Offset within the chunk.  If omitted, it should be assumed to
>    immediately follow the previous sub-chunk.
>  - (required) Length of the (optionally) compressed sub-chunk.  This is the
>    physical number of bytes in the stream taken up by the sub-chunk.
>  - (optional) Uncompressed length of the sub-chunk.  Used as an optimization
>    hint.
>  - (optional) CRC-32 of the (optionally compressed) sub-chunk.
>  - (optional) CRC-32 of the uncompressed sub-chunk.
> If no compression format is specified, the sub-chunks should be assumed to be
> in "raw" format.
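To illustrate how the sub-chunk fields fit together, here is a sketch that packs payloads into a byte region and reads them back. The `SubChunk` class and both functions are hypothetical. It assumes offsets are measured from the start of the region that holds the sub-chunks, which the design does not pin down, and it uses Python's `zlib.compress`, which produces an RFC 1950 zlib stream.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class SubChunk:
    length: int                                   # required: stored byte count
    offset: Optional[int] = None                  # absent: follows previous
    checksum: Optional[int] = None                # CRC-32 of stored bytes
    uncompressed_length: Optional[int] = None     # optimization hint
    uncompressed_checksum: Optional[int] = None   # CRC-32 of original bytes


def pack_sub_chunks(payloads: List[bytes],
                    use_zlib: bool) -> Tuple[List[SubChunk], bytes]:
    """Store payloads back-to-back and describe each one."""
    metas, parts = [], []
    for payload in payloads:
        stored = zlib.compress(payload) if use_zlib else payload
        meta = SubChunk(length=len(stored),
                        checksum=zlib.crc32(stored) & 0xFFFFFFFF)
        if use_zlib:
            meta.uncompressed_length = len(payload)
            meta.uncompressed_checksum = zlib.crc32(payload) & 0xFFFFFFFF
        metas.append(meta)
        parts.append(stored)
    return metas, b"".join(parts)


def unpack_sub_chunks(metas: List[SubChunk], body: bytes,
                      use_zlib: bool) -> List[bytes]:
    """Recover the payloads, verifying whichever checksums are present."""
    out, cursor = [], 0
    for meta in metas:
        start = cursor if meta.offset is None else meta.offset
        stored = body[start:start + meta.length]
        if len(stored) != meta.length:
            raise ValueError("sub-chunk extends past the available data")
        if (meta.checksum is not None
                and zlib.crc32(stored) & 0xFFFFFFFF != meta.checksum):
            raise ValueError("sub-chunk checksum mismatch")
        payload = zlib.decompress(stored) if use_zlib else stored
        if (use_zlib and meta.uncompressed_checksum is not None
                and zlib.crc32(payload) & 0xFFFFFFFF
                != meta.uncompressed_checksum):
            raise ValueError("uncompressed checksum mismatch")
        out.append(payload)
        cursor = start + meta.length
    return out
```

Checking the stored-bytes checksum before decompressing lets a reader reject a damaged sub-chunk without feeding corrupt input to the decompressor.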
> {code:title=TRecordStream.thrift|borderStyle=solid}
> namespace cpp    facebook.thrift.transport.record_stream
> namespace java   com.facebook.thrift.transport.record_stream
> namespace python thrift.transport.record_stream
> /*
>  * enums in plain-text headers should be represented as strings, not numbers.
>  * Each enum value should specify the string used in plain text.
>  */
> enum CompressionType {
>   /**
>    * "raw": No compression.
>    *
>    * The data written to the TRecordStream object appears byte-for-byte
>    * in the stream.  Raw format streams ignore the uncompressed length and
>    * uncompressed checksum of the sub-chunks.  It is strongly advised to use
>    * checksums when writing raw sub-chunks.
>    */
>   COMPRESSION_RAW = 0,
>   /**
>    * "zlib": zlib compression.
>    *
>    * The compressed data is a zlib stream compressed with the "deflate"
>    * algorithm.  This format is specified by RFCs 1950 and 1951, and is
>    * produced by zlib's "compress" or "deflate" functions.  Note that this is
>    * *not* a raw "deflate" stream nor a gzip file.
>    */
>   COMPRESSION_ZLIB = 1,
> }
> enum RecordType {
>   /**
>    * (Absent in plain text.) Unspecified record type.
>    */
>   RECORD_UNKNOWN = 0,
>   /**
>    * "struct": Thrift structures, serialized back-to-back.
>    */
>   RECORD_STRUCT = 1,
>   /**
>    * "call": Thrift method calls, produced by send_method();
>    */
>   RECORD_CALL = 2,
>   /**
>    * "lines": Line-oriented text data.
>    */
>   RECORD_LINES = 3,
> }
> enum ProtocolType {
>   /** (Absent in plain text.) */
>   PROTOCOL_UNKNOWN     = 0;
>   /** "binary" */
>   PROTOCOL_BINARY      = 1;
>   /** "dense" */
>   PROTOCOL_DENSE       = 2;
>   /** "json" */
>   PROTOCOL_JSON        = 3;
>   /** "simple_json" */
>   PROTOCOL_SIMPLE_JSON = 4;
>   /** "csv" */
>   PROTOCOL_CSV         = 5;
> }
> /**
>  * The structure used to represent metadata about a sub-chunk.
>  * In plain text, this structure is included as the value of a "Sub-Chunk"
>  * header entry.  Each of these fields should be included, represented
>  * according to the comment for ChunkHeader.  Fields should be in order and
>  * separated by a single space.  Absent fields should be included as a single
>  * dash ("-").
>  */
> struct SubChunkHeader {
>   1: optional i32 offset;
>   2: required i32 length;
>   3: optional i32 checksum;
>   4: optional i32 uncompressed_length;
>   5: optional i32 uncompressed_checksum;
> }
> /**
>  * This is the top-level structure encoded as the chunk header.
>  * Unless otherwise specified, fields will be represented in plain text by
>  * uppercasing each word in the field name and replacing underscores with
>  * hyphens, producing the entry name.  Integers should be ASCII-encoded
>  * decimal, except for checksums which should be ASCII-encoded hexadecimal
>  * unsigned.
>  */
> struct ChunkHeader {
>   /**
>    * Number of bytes per chunk.
>    * Recommended to be a power of 2.
>    */
>   1: required i32 chunk_size;
>   /**
>    * Type of compression used for sub-chunks.
>    * Assumed to be RAW if absent.
>    */
>   3: optional CompressionType compression_type = COMPRESSION_RAW;
>   /**
>    * Type of records encoded in the sub-chunks.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   4: optional RecordType record_type = RECORD_UNKNOWN;
>   /**
>    * Protocol used for serializing records.
>    * This information is made accessible to applications,
>    * but is otherwise uninterpreted by the transport.
>    */
>   5: optional ProtocolType protocol_type = PROTOCOL_UNKNOWN;
>   /**
>    * The metadata for the individual sub-chunks,
>    * in the order they should be read.
>    *
>    * In the plain-text format, each of these is written as a separate
>    * "Sub-Chunk" header entry, in order.
>    */
>   2: required list<SubChunkHeader> sub_chunk_headers;
> }
> {code}
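The plain-text conventions in the IDL comments can be sketched as follows. Several points are assumptions rather than statements from the design: "uppercasing each word" is read as capitalizing the first letter, to match the "Sub-Chunk" and "Header-Checksum" spellings used in the text; sub-chunk checksums are zero-padded to eight hex digits like the header checksum; and signed i32 checksums are converted to unsigned by masking. The function names are hypothetical.

```python
from typing import Optional


def entry_name(field_name: str) -> str:
    """Map a Thrift field name to a plain-text header entry name."""
    # Assumption: "uppercasing each word" means capitalizing it.
    return "-".join(word.capitalize() for word in field_name.split("_"))


def _decimal(value: Optional[int]) -> str:
    """Render an integer as decimal, or a dash when the field is absent."""
    return "-" if value is None else str(value)


def _hex_checksum(value: Optional[int]) -> str:
    """Render a checksum as unsigned hexadecimal, or a dash when absent."""
    return "-" if value is None else f"{value & 0xFFFFFFFF:08x}"


def sub_chunk_entry(length: int,
                    offset: Optional[int] = None,
                    checksum: Optional[int] = None,
                    uncompressed_length: Optional[int] = None,
                    uncompressed_checksum: Optional[int] = None) -> str:
    """Format one "Sub-Chunk" header entry with fields in declaration order."""
    fields = [
        _decimal(offset),
        str(length),
        _hex_checksum(checksum),
        _decimal(uncompressed_length),
        _hex_checksum(uncompressed_checksum),
    ]
    return "Sub-Chunk: " + " ".join(fields)
```

Note that the general naming rule would turn `sub_chunk_headers` into "Sub-Chunk-Headers", so the "Sub-Chunk" entry name is one of the "otherwise specified" cases.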

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
