[jira] [Comment Edited] (LOG4J2-1305) Binary Layout

Remko Popma (JIRA) Thu, 03 Mar 2016 05:04:32 -0800

    [ 
https://issues.apache.org/jira/browse/LOG4J2-1305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15177777#comment-15177777
 ]


Remko Popma edited comment on LOG4J2-1305 at 3/3/16 1:03 PM:
-------------------------------------------------------------

The ThreadContext information is actually variable-length since there can be 
any number of key/value pairs. (The example layout allows access to 
fixed-length fields within the record by dead reckoning. I agree this is a 
desirable property.) That said, I don't mind moving the ThreadContext fields to 
precede Message. 


was (Author: [email protected]):
The ThreadContext information is actually variable-length since there can be 
any number of key/value pairs. (The example layout allows access to 
fixed-length fields within the record by dead reckoning. I agree this is a 
desirable property.)

> Binary Layout
> -------------
>
>                 Key: LOG4J2-1305
>                 URL: https://issues.apache.org/jira/browse/LOG4J2-1305
>             Project: Log4j 2
>          Issue Type: New Feature
>          Components: Layouts
>            Reporter: Remko Popma
>              Labels: binary
>
> Logging in a binary format instead of in text can give large performance 
> improvements. 
> Logging text means going from a LogEvent object to formatted text, and then 
> converting this text to bytes. Performance investigations with text-based 
> logging formats like PatternLayout (see LOG4J2-930), and encoding Strings to 
> bytes (LOG4J2-935, LOG4J2-1151) suggest that formatting and encoding text is 
> expensive and imposes limits on the performance that can be achieved. 
> A different approach would be to convert the LogEvent to a binary 
> representation directly without creating a text representation first. This 
> would result in extremely compact log files that are fast to write. The 
> trade-off is that a binary log cannot easily be read in a general-purpose 
> editor like VI or Notepad. A specialized tool would be necessary to either 
> display or convert to human-readable form. 
> This ticket proposes a simple BinaryLayout, where each LogEvent is logged in 
> a binary format.
> *Example BinaryLayout log event record format*
> ||Offset||Type||Log Event Record Field Description||
> |0|long|TimeMillis|
> |8|long|NanoTime|
> |16|int|Level|
> |20|int|Logger name index - string value in separate file|
> |24|int|Thread name index - string value in separate file|
> |28|long|Thread ID|
> |36|int|Thread priority|
> |40|int|Marker index - value & hierarchy in separate file|
> |44|int|Message length|
> |48|int|Message encoder FQCN index|
> |52|byte[]|Message data - below offset assumes 18 bytes of message data|
> |70|int|Throwable data length|
> |74|byte[]|Throwable data - below offset assumes 26 bytes of Throwable data|
> |100|int|ThreadContext key/value pair count|
> |104|int|ThreadContext key index - string value in separate file|
> |108|int|ThreadContext value index - string value in separate file|
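
To make the offsets in the example table concrete, here is a minimal Java sketch that writes one record in that field order. The class name, method signature and placeholder values are hypothetical and not part of Log4j. Big-endian byte order is assumed only because the ticket leaves byte order TBD, and the ThreadContext key/value indexes are passed as one flattened int array.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

/** Sketch only: lays out one record in the field order of the example table. */
final class BinaryRecordSketch {
    // Offset of the first variable-length field; everything before it is fixed-length.
    static final int MESSAGE_DATA_OFFSET = 52;

    static ByteBuffer encode(long timeMillis, long nanoTime, int level,
                             int loggerNameIndex, int threadNameIndex,
                             long threadId, int threadPriority, int markerIndex,
                             int messageEncoderIndex, byte[] message,
                             byte[] throwable, int[] contextIndexes) {
        int size = MESSAGE_DATA_OFFSET + message.length
                + 4 + throwable.length            // Throwable length + data
                + 4 + 4 * contextIndexes.length;  // pair count + key/value indexes
        // Assumption: big-endian. The ticket does not fix the byte order.
        ByteBuffer buf = ByteBuffer.allocate(size).order(ByteOrder.BIG_ENDIAN);
        buf.putLong(timeMillis);          // offset 0
        buf.putLong(nanoTime);            // offset 8
        buf.putInt(level);                // offset 16
        buf.putInt(loggerNameIndex);      // offset 20
        buf.putInt(threadNameIndex);      // offset 24
        buf.putLong(threadId);            // offset 28
        buf.putInt(threadPriority);       // offset 36
        buf.putInt(markerIndex);          // offset 40
        buf.putInt(message.length);       // offset 44
        buf.putInt(messageEncoderIndex);  // offset 48
        buf.put(message);                 // offset 52
        // From here on, offsets depend on the lengths written above.
        buf.putInt(throwable.length);
        buf.put(throwable);
        buf.putInt(contextIndexes.length / 2);
        for (int index : contextIndexes) {
            buf.putInt(index);
        }
        buf.flip();
        return buf;
    }

    public static void main(String[] args) {
        byte[] message = "Hello, binary log!".getBytes(StandardCharsets.UTF_8);
        // All numeric arguments are placeholder values for illustration.
        ByteBuffer record = encode(System.currentTimeMillis(), System.nanoTime(),
                400, 0, 1, 1L, Thread.NORM_PRIORITY, -1, 2,
                message, new byte[0], new int[] {3, 4});
        System.out.println("record size: " + record.limit());
        // Fixed-length fields can be read by absolute offset without parsing the rest.
        System.out.println("message length at offset 44: " + record.getInt(44));
    }
}
```

With an 18-byte message and 26 bytes of Throwable data, the sketch puts the Throwable length at offset 70 and the pair count at offset 100, matching the table.
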
> *Repeating String Data*
> Repeating String data like thread names, logger names, marker names and 
> ThreadContextMap keys and values should be logged only once, after which they 
> can be referenced by their index.
> One way to do this is to save string data to a separate file. The main log 
> file contains an index (the line number, zero-based) into the string-data 
> file instead of the full string. Index -1 means the String value was 
> {{null}}. The format of the string-data file can simply be: each unique 
> string on a separate line (separated by '\n' (0x0A) character). Any '\n' 
> characters embedded in the string value are Unicode escaped and written as 
> "\u000A".
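
The separate-file approach described above could be sketched as follows. This is an illustration under assumptions rather than a proposed implementation: the class and method names are hypothetical, the table is kept in memory instead of being appended to a file, and no thread safety is attempted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: assigns zero-based line numbers to unique strings. */
final class StringTableSketch {
    private final Map<String, Integer> indexByValue = new HashMap<>();
    private final List<String> lines = new ArrayList<>();

    /** Returns the index to store in the main log file; -1 stands for null. */
    int indexOf(String value) {
        if (value == null) {
            return -1;
        }
        Integer existing = indexByValue.get(value);
        if (existing != null) {
            return existing; // already logged once, reuse its line number
        }
        int index = lines.size();
        indexByValue.put(value, index);
        lines.add(escape(value));
        return index;
    }

    /** Replaces each embedded line feed with a six-character Unicode escape. */
    static String escape(String value) {
        return value.replace("\n", "\\u000A");
    }

    /** Contents of the string-data file: one escaped string per line. */
    String fileContents() {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StringTableSketch table = new StringTableSketch();
        System.out.println(table.indexOf("main"));
        System.out.println(table.indexOf("com.example.Foo"));
        System.out.println(table.indexOf("main"));
        System.out.print(table.fileContents());
    }
}
```

The sketch does not escape backslashes themselves, so a value that already contains the literal escape sequence would be ambiguous when read back. The description above does not say how that case should be handled.
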
> An alternative to separate files is interspersing "string-data" records with 
> "log event" records. Records could be prefixed with a single byte indicating 
> their record type (e.g. '#' (0x23)=header, '\n' (0x0A)=log event, '$' 
> (0x24)=string data).
> String-data record format:
> ||Offset||Type||String-Data Record Field Description||
> |0|int|index of the string (each unique String has a unique index)|
> |4|byte[]|the String value, encoded in the standard Java [modified 
> UTF-8|https://docs.oracle.com/javase/8/docs/api/java/io/DataInput.html#modified-utf-8]
>  format used by 
> [DataOutput.writeUTF(String)|https://docs.oracle.com/javase/8/docs/api/java/io/DataOutput.html#writeUTF-java.lang.String-]|
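
For the interspersed alternative, a string-data record could be written roughly like this. The class name is hypothetical, and the sketch assumes the one-byte record type precedes the fields listed in the table, so the table's offsets are relative to the byte after the type marker. Note that DataOutput.writeUTF itself emits a two-byte length before the modified UTF-8 bytes, which limits one string to 65535 encoded bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch only: encodes one string-data record with a record-type prefix. */
final class StringDataRecordSketch {
    static final byte STRING_DATA = '$'; // 0x24, record type from the description

    static byte[] encode(int index, String value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(STRING_DATA); // record type marker
            out.writeInt(index);        // index of the string (DataOutput is big-endian)
            out.writeUTF(value);        // 2-byte length, then modified UTF-8 bytes
        } catch (IOException e) {
            // Not expected for an in-memory stream unless the string is too long.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] record = encode(7, "main");
        System.out.println("record length: " + record.length);
    }
}
```

A reader can dispatch on the first byte and then use DataInput.readInt and DataInput.readUTF to recover the index and the value.
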
> *Custom Messages*
> Note: custom Messages that implement the {{Encoder}} interface (introduced 
> with LOG4J2-1274) can be written in binary form directly without first being 
> converted to text (LOG4J2-506). Any specialized tool for reading binary log 
> files should handle messages of type "text" out of the box, but could have 
> some plugin mechanism for decoding custom messages.
> A more flexible and less intrusive variation of this is to have a registry of 
> Encoders that map Classes to the associated Encoder. That would allow not 
> only custom Messages, but also the content of any ObjectMessage to be encoded 
> in custom binary format. Domain classes then no longer need to implement the 
> Message interface.
> *Markers*
> TBD: as Matt points out in the comments, Markers are special since they are 
> hierarchic. One way to deal with this is to manage a separate file to save 
> the Marker hierarchy. Another way is to do something similar to 
> PatternLayout: treat it as a String value, where the string includes 
> hierarchy information. I like the simplicity of the latter approach.
> *Versioning*
> The binary file must start with a header, indicating version information and 
> perhaps schema information providing metadata on the log record. Schema 
> information may make it possible to include/exclude fields. For version 1.0, 
> the schema can either be fixed like the above example, or it could be a 
> simple bitmask for the fields mentioned above.
> *Byte Order*
> TBD: Are multi-byte values like ints and longs written in big-endian or 
> little-endian order? This could be specified in the header, or we could fix 
> it to either one. Exchange protocols like ITCH tend to select a fixed byte 
> order (ITCH uses big-endian, i.e. network byte order). I like the simplicity 
> of this approach.
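
To illustrate what the byte-order choice means on disk, this small Java sketch encodes the same int both ways using java.nio.ByteBuffer. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

/** Sketch only: shows how one int is laid out under each byte order. */
final class ByteOrderSketch {
    static byte[] encodeInt(int value, ByteOrder order) {
        return ByteBuffer.allocate(4).order(order).putInt(value).array();
    }

    public static void main(String[] args) {
        // prints [1, 2, 3, 4]
        System.out.println(Arrays.toString(encodeInt(0x01020304, ByteOrder.BIG_ENDIAN)));
        // prints [4, 3, 2, 1]
        System.out.println(Arrays.toString(encodeInt(0x01020304, ByteOrder.LITTLE_ENDIAN)));
    }
}
```

ByteBuffer defaults to big-endian, as do DataOutput and DataInput, so fixing the format to big-endian would need no extra configuration on the Java side.
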



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
