[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204549#comment-15204549
 ] 

Renaud Delbru commented on LUCENE-6966:
---------------------------------------

Karl, the patch will not include a ready-to-use FSDirectory implementation, but
the doc values format is based on an encrypted index input and output
implementation that can easily be reused in an FSDirectory implementation.

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
>                 Key: LUCENE-6966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6966
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/other
>            Reporter: Renaud Delbru
>              Labels: codec, contrib
>
> We would like to contribute a codec, developed as part of an engagement with
> a customer, that enables the encryption of sensitive data in the index. We
> think that this could be of interest to the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at the codec
> level gives finer-grained control over which blocks of data are encrypted.
> This is more efficient since less data has to be encrypted. It also gives
> more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of their encryption key. Multiple key versions should co-exist
> in one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for creating a cipher instance based on a given key version.
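
As an illustration of the key management described above, here is a minimal
sketch of a cipher factory that resolves a key version to a cipher. It is not
the class from the patch: the name VersionedCipherFactory, its methods, and the
in-memory key map are assumptions made for this example, and only standard
javax.crypto calls are used.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch, not the class from the patch: resolves a key version
// (as it would be read from the segment info) to an initialised cipher.
final class VersionedCipherFactory {
  private final Map<Long, byte[]> keysByVersion = new HashMap<>();

  // Registers the raw AES key (16, 24 or 32 bytes) for a key version.
  void addKey(long version, byte[] key) {
    keysByVersion.put(version, key.clone());
  }

  // Creates an AES/CBC cipher for the given key version, mode and IV.
  Cipher newCipher(long version, int mode, byte[] iv) {
    byte[] key = keysByVersion.get(version);
    if (key == null) {
      throw new IllegalArgumentException("Unknown key version: " + version);
    }
    try {
      Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
      cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
      return cipher;
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  // Runs the cipher over a whole buffer, wrapping checked exceptions.
  static byte[] apply(Cipher cipher, byte[] data) {
    try {
      return cipher.doFinal(data);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    VersionedCipherFactory factory = new VersionedCipherFactory();
    byte[] keyV1 = new byte[16]; // demo keys only, never use fixed keys
    byte[] keyV2 = new byte[16];
    Arrays.fill(keyV2, (byte) 1);
    factory.addKey(1L, keyV1);
    factory.addKey(2L, keyV2);
    byte[] iv = new byte[16];
    byte[] block = "segment data".getBytes(StandardCharsets.UTF_8);
    // A segment written with key version 1 is read back with the same version.
    byte[] encrypted = apply(factory.newCipher(1L, Cipher.ENCRYPT_MODE, iv), block);
    byte[] decrypted = apply(factory.newCipher(1L, Cipher.DECRYPT_MODE, iv), encrypted);
    System.out.println(Arrays.equals(block, decrypted));
  }
}
```

Because each segment records its own key version, segments written under older
keys stay readable as long as the factory still knows those versions.
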
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat
> resilient to it. The only "leak" of information that this could lead to is
> being able to know that two encrypted blocks of data start with the same
> prefix. However, it is unlikely that two data blocks in an index segment will
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely to have two data blocks sharing the same prefix within
> the same segment (each one encodes a list of values associated with a
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
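
The prefix leak discussed above can be reproduced with a short, self-contained
sketch. The class CbcIvReuseDemo and its helper names are hypothetical and are
not part of the patch; the example only uses the standard javax.crypto API. It
encrypts two plaintexts that share their first 16 bytes under the same key and
IV, then compares ciphertext blocks.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical demo class: shows what AES/CBC reveals when an IV is reused.
final class CbcIvReuseDemo {
  static final int BLOCK = 16; // AES block size in bytes

  // Encrypts the plaintext with AES/CBC and PKCS5 padding.
  static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) {
    try {
      Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
          new IvParameterSpec(iv));
      return cipher.doFinal(plaintext);
    } catch (GeneralSecurityException e) {
      throw new IllegalStateException(e);
    }
  }

  // Compares the ciphertext block at blockIndex in both buffers.
  static boolean sameBlock(byte[] a, byte[] b, int blockIndex) {
    int from = blockIndex * BLOCK;
    return Arrays.equals(Arrays.copyOfRange(a, from, from + BLOCK),
        Arrays.copyOfRange(b, from, from + BLOCK));
  }

  // Builds a plaintext made of a 16-byte shared prefix plus a suffix.
  static byte[] withSharedPrefix(String suffix) {
    byte[] tail = suffix.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[BLOCK + tail.length];
    Arrays.fill(out, 0, BLOCK, (byte) 'A');
    System.arraycopy(tail, 0, out, BLOCK, tail.length);
    return out;
  }

  public static void main(String[] args) {
    byte[] key = new byte[16]; // demo key only
    byte[] iv = new byte[16];  // the reused IV
    byte[] c1 = encrypt(key, iv, withSharedPrefix("first block tail"));
    byte[] c2 = encrypt(key, iv, withSharedPrefix("other block tail"));
    // Same key, same IV, same first plaintext block: first block matches.
    System.out.println("block 0 equal: " + sameBlock(c1, c2, 0));
    // The plaintexts diverge in the second block, so the ciphertexts do too.
    System.out.println("block 1 equal: " + sameBlock(c1, c2, 1));
  }
}
```

The shared first block is visible in the ciphertext, and nothing after the
point of divergence matches. This is why the argument above rests on data
blocks within a segment rarely sharing a prefix.
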
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput decreased by roughly 15% compared with base Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the 
> following queries: Term query, boolean query, phrase query, numeric range 
> query. 
> We observed the following performance impact for queries that need to scan a
> larger portion of the term dictionary:
> - prefix query: decrease of ~25%
> - wildcard query (e.g., “fu*r”): decrease of ~60%
> - fuzzy query (distance 1): decrease of ~40%
> - fuzzy query (distance 2): decrease of ~80%
> We can see that the decrease in performance grows with the size of the
> dictionary scan.
> h2. Document Retrieval
> We observed a decrease in performance that grows with the size of the set of
> documents to be retrieved:
> - ~20% when retrieving a medium set of documents (100)
> - ~30-40% when retrieving a large set of documents (1000)
> h1. Known Limitations
> - compressed stored fields do not preserve the order of fields since
> non-encrypted and encrypted fields are stored in separate blocks.
> - the current implementation of the cipher factory does not enforce the use 
> of AES/CBC. We are planning to add this to the final version of the patch.
> - the current implementation does not change the IV per segment. We are 
> planning to add this to the final version of the patch.
> - the current implementation of compressed stored fields decrypts a full
> compressed block even if only a small portion is decompressed (high impact
> when storing very small documents). We are planning to optimise this in the
> final version of the patch. The overall document retrieval performance might
> improve with this optimisation.
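
For the limitation about enforcing AES/CBC, one possible check is sketched
below. It is only an assumption about how such a guard could look, not what the
final patch does: the class CipherModeGuard and its methods are hypothetical.
The check relies on Cipher.getAlgorithm() returning the transformation string
that was passed to Cipher.getInstance().

```java
import java.security.GeneralSecurityException;
import java.util.Locale;
import javax.crypto.Cipher;

// Hypothetical guard: rejects ciphers that are not AES/CBC with padding.
final class CipherModeGuard {

  // Creates a cipher for a transformation, wrapping checked exceptions.
  static Cipher create(String transformation) {
    try {
      return Cipher.getInstance(transformation);
    } catch (GeneralSecurityException e) {
      throw new IllegalArgumentException(e);
    }
  }

  // Returns the cipher unchanged if it is AES/CBC with a padding scheme.
  static Cipher requireAesCbc(Cipher cipher) {
    String algorithm = cipher.getAlgorithm().toUpperCase(Locale.ROOT);
    boolean aesCbc = algorithm.startsWith("AES/CBC/");
    boolean padded = !algorithm.endsWith("/NOPADDING");
    if (!aesCbc || !padded) {
      throw new IllegalArgumentException(
          "Expected AES/CBC with padding but got: " + cipher.getAlgorithm());
    }
    return cipher;
  }

  public static void main(String[] args) {
    // Accepted: explicit AES/CBC transformation with padding.
    requireAesCbc(create("AES/CBC/PKCS5Padding"));
    try {
      // Rejected: ECB mode does not match the encryption model.
      requireAesCbc(create("AES/ECB/PKCS5Padding"));
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

A guard like this also rejects a bare "AES" transformation, since the provider
default mode is not stated in the name and cannot be verified from it.
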
> The codec has been implemented as a contrib. Given that most of the classes 
> were final, we had to copy most of the original code from the extended 
> formats. At a later stage, we could think of opening some of these classes to 
> extend them properly in order to reduce code duplication and simplify code 
> maintenance.


