[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204549#comment-15204549 ]
Renaud Delbru commented on LUCENE-6966:
---------------------------------------

Karl, the patch will not include a ready-to-use FSDirectory implementation,
but the doc values format is based on an encrypted index input and output
implementation, which can easily be reused in an implementation of
FSDirectory.

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
>         Key: LUCENE-6966
>         URL: https://issues.apache.org/jira/browse/LUCENE-6966
>     Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>    Reporter: Renaud Delbru
>      Labels: codec, contrib
>
> We would like to contribute a codec, developed as part of an engagement with
> a customer, that enables the encryption of sensitive data in the index. We
> think that this could be of interest to the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at the codec
> level enables finer-grained control over which blocks of data are encrypted.
> This is more efficient since less data has to be encrypted. It also gives
> more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which fields to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of their encryption key. Multiple key versions should co-exist
> in one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for creating a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is
> somewhat resilient to IV reuse. The only "leak" of information that this
> could lead to is being able to know that two encrypted blocks of data start
> with the same prefix. However, it is unlikely that two data blocks in an
> index segment will start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4kb) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely to have two dictionary data blocks sharing the
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely to have two data blocks sharing the same prefix
> within the same segment (each one encodes a list of values associated with a
> field).
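For illustration, the key management and IV behaviour described above can be sketched in a few lines of plain Java. This is not the code from the patch: the names `CipherFactory`, `InMemoryCipherFactory`, `newCipher` and `sameFirstBlock` are hypothetical, the keys are toy values held in memory, and an all-zero IV stands in for whatever IV a real codec would pick per format and per segment. The sketch only assumes the standard `javax.crypto` API with the `AES/CBC/PKCS5Padding` transformation.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Illustrative sketch only; all names here are hypothetical, not the patch's API.
final class VersionedCipherSketch {

    /** Hypothetical cipher factory: builds a cipher for a given key version. */
    interface CipherFactory {
        Cipher newCipher(long keyVersion, int mode) throws GeneralSecurityException;
    }

    /** Toy factory holding keys in memory; a real one would use a key store. */
    static final class InMemoryCipherFactory implements CipherFactory {
        private final Map<Long, byte[]> keys = new HashMap<>();
        private final byte[] iv; // a single IV, reused for every block

        InMemoryCipherFactory(byte[] iv) {
            this.iv = iv.clone();
        }

        void addKey(long version, byte[] key) {
            keys.put(version, key.clone());
        }

        @Override
        public Cipher newCipher(long keyVersion, int mode) throws GeneralSecurityException {
            byte[] key = keys.get(keyVersion);
            if (key == null) {
                throw new GeneralSecurityException("unknown key version " + keyVersion);
            }
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher;
        }
    }

    static byte[] encrypt(CipherFactory factory, long keyVersion, byte[] plain) {
        return run(factory, keyVersion, Cipher.ENCRYPT_MODE, plain);
    }

    static byte[] decrypt(CipherFactory factory, long keyVersion, byte[] encrypted) {
        return run(factory, keyVersion, Cipher.DECRYPT_MODE, encrypted);
    }

    private static byte[] run(CipherFactory factory, long keyVersion, int mode, byte[] data) {
        try {
            return factory.newCipher(keyVersion, mode).doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** True if both ciphertexts share their first 16-byte AES block. */
    static boolean sameFirstBlock(byte[] a, byte[] b) {
        return Arrays.equals(Arrays.copyOf(a, 16), Arrays.copyOf(b, 16));
    }

    public static void main(String[] args) {
        InMemoryCipherFactory factory = new InMemoryCipherFactory(new byte[16]);
        factory.addKey(1L, "key-version-0001".getBytes(StandardCharsets.UTF_8));
        factory.addKey(2L, "key-version-0002".getBytes(StandardCharsets.UTF_8));

        // Two data blocks that share their first 16 bytes and differ afterwards.
        byte[] blockA = "0123456789abcdef-first".getBytes(StandardCharsets.UTF_8);
        byte[] blockB = "0123456789abcdef-second".getBytes(StandardCharsets.UTF_8);

        byte[] a1 = encrypt(factory, 1L, blockA);
        byte[] b1 = encrypt(factory, 1L, blockB);
        byte[] a2 = encrypt(factory, 2L, blockA);

        System.out.println("same key, same IV, shared prefix: " + sameFirstBlock(a1, b1));
        System.out.println("different key versions:           " + sameFirstBlock(a1, a2));
        System.out.println("round trip: " + Arrays.equals(blockA, decrypt(factory, 1L, a1)));
    }
}
```

With the same key version and the same IV, the two blocks that share a 16-byte plaintext prefix also share their first ciphertext block, which is the "leak" discussed above. The rest of the ciphertext differs, and the same block encrypted under another key version looks unrelated. In a codec, the key version passed to the factory would be the one read from the segment info rather than a literal.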
> To the best of our knowledge, this model should be safe. However, it would
> be good if someone with security expertise in the community could review and
> validate it.
> h1. Performance
> We report here a performance benchmark we ran on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fields (id, title, body, date) were encrypted. Only the block tree terms
> and compressed stored fields formats were tested at that time.
> h2. Indexing
> The indexing throughput decreased and is roughly 15% lower than with the
> base Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the
> following queries: term query, boolean query, phrase query, numeric range
> query.
> We observed the following performance impact for queries that need to scan a
> larger portion of the term dictionary:
> - prefix query: decrease of ~25%
> - wildcard query (e.g., “fu*r”): decrease of ~60%
> - fuzzy query (distance 1): decrease of ~40%
> - fuzzy query (distance 2): decrease of ~80%
> We can see that the decrease in performance is relative to the size of the
> dictionary scan.
> h2. Document Retrieval
> We observed a decrease in performance that is relative to the size of the
> set of documents to be retrieved:
> - ~20% when retrieving a medium set of documents (100)
> - ~30-40% when retrieving a large set of documents (1000)
> h1. Known Limitations
> - compressed stored fields do not keep the order of fields, since
> non-encrypted and encrypted fields are stored in separate blocks.
> - the current implementation of the cipher factory does not enforce the use
> of AES/CBC. We are planning to add this to the final version of the patch.
> - the current implementation does not change the IV per segment. We are
> planning to add this to the final version of the patch.
> - the current implementation of compressed stored fields decrypts a full
> compressed block even if only a small portion is decompressed (high impact
> when storing very small documents). We are planning to add this optimisation
> to the final version of the patch. The overall document retrieval
> performance might improve with this optimisation.
> The codec has been implemented as a contrib. Given that most of the classes
> were final, we had to copy most of the original code from the extended
> formats. At a later stage, we could consider opening some of these classes
> so that they can be extended properly, which would reduce code duplication
> and simplify code maintenance.