[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Delbru updated LUCENE-6966:
----------------------------------
Attachment: Encryption Codec Documentation.pdf

An initial technical documentation.

Contribution: Codec for index-level encryption

Key: LUCENE-6966
URL: https://issues.apache.org/jira/browse/LUCENE-6966
Project: Lucene - Core
Issue Type: New Feature
Components: modules/other
Reporter: Renaud Delbru
Labels: codec, contrib
Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch

We would like to contribute a codec that enables the encryption of sensitive data in the index. It was developed as part of an engagement with a customer, and we think it could be of interest to the community. Below is a description of the project.

h1. Introduction

In comparison with approaches where all data is encrypted (e.g., file system encryption, index output / directory encryption), encryption at the codec level enables more fine-grained control over which blocks of data are encrypted. This is more efficient, since less data has to be encrypted, and it gives more flexibility, such as the ability to select which fields to encrypt.

Some of the requirements for this project were:
* The performance impact of the encryption should be reasonable.
* The user can choose which fields to encrypt.
* Key management: during the life cycle of the index, the user can provide a new version of their encryption key. Multiple key versions should co-exist in one index.

h1. What is supported?

- Block tree terms index and dictionary
- Compressed stored fields format
- Compressed term vectors format
- Doc values format (prototype based on an encrypted index output) - this will be submitted as a separate patch
- Index upgrader: command to upgrade all the index segments with the latest key version available.

h1. How is it implemented?

h2. Key Management

One index segment is encrypted with a single key version. An index can have multiple segments, each one encrypted with a different key version. The key version for a segment is stored in the segment info.

The provided codec is abstract; a subclass is responsible for providing an implementation of the cipher factory. The cipher factory is responsible for creating a cipher instance based on a given key version.

h2. Encryption Model

The encryption model is based on AES/CBC with padding. The initialisation vector (IV) is reused for performance reasons, but only on a per-format and per-segment basis.

While IV reuse is usually considered bad practice, the CBC mode is somewhat resilient to IV reuse. The only "leak" of information this could lead to is the ability to tell that two encrypted blocks of data start with the same prefix. However, it is unlikely that two data blocks in an index segment will start with the same data:

- Stored Fields Format: each encrypted data block is a compressed block (~4kb) of one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
- Term Vectors: each encrypted data block is a compressed block (~4kb) of terms and payloads from one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
- Term Dictionary Index: the term dictionary index is encoded and encrypted in one single data block.
- Term Dictionary Data: each data block of the term dictionary encodes a set of suffixes. It is unlikely that two dictionary data blocks share the same prefix within the same segment.
- DocValues: a DocValues file is composed of multiple encrypted data blocks. It is unlikely that two data blocks share the same prefix within the same segment (each one encodes a list of values associated with a field).
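The prefix "leak" described above can be illustrated with a small, self-contained example (illustrative only, not part of the patch): under AES/CBC with a reused key and IV, two plaintexts that share a 16-byte-aligned prefix produce ciphertexts whose first blocks are identical, while every block from the first difference onward diverges.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.Arrays;

// Demonstrates what an observer learns from IV reuse under CBC:
// only that two blocks share a common (block-aligned) prefix.
public class CbcIvReuseDemo {
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) throws Exception {
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(plaintext);
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16]; // toy all-zero key, for illustration only
        byte[] iv = new byte[16];  // the same IV is reused for both encryptions
        // Both plaintexts share the same first 16 bytes (one AES block).
        byte[] c1 = encrypt(key, iv, "same-prefix-0123-AAAA".getBytes());
        byte[] c2 = encrypt(key, iv, "same-prefix-0123-BBBB".getBytes());
        // First ciphertext blocks are identical: the shared prefix is visible...
        System.out.println(Arrays.equals(Arrays.copyOfRange(c1, 0, 16),
                                         Arrays.copyOfRange(c2, 0, 16))); // prints "true"
        // ...but from the first differing block on, the ciphertexts diverge.
        System.out.println(Arrays.equals(Arrays.copyOfRange(c1, 16, 32),
                                         Arrays.copyOfRange(c2, 16, 32))); // prints "false"
    }
}
```

This is why the argument above rests on data blocks being unlikely to share prefixes: CBC chaining confines the leak to the length of the common prefix.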
To the best of our knowledge, this model should be safe. However, it would be good if someone with security expertise in the community could review and validate it.

h1. Performance

We report here a performance benchmark run on an early prototype based on Lucene 4.x. The benchmark was performed on the Wikipedia dataset, with all fields (id, title, body, date) encrypted. Only the block tree terms and compressed stored fields formats were tested at that time.

h2. Indexing

The indexing throughput decreased by roughly 15% compared to base Lucene. The merge time increased by 35%. There was no significant difference in terms of index size.

h2. Query Throughput

With respect to query throughput, we observed no significant impact on the following
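The key-management contract described above (an abstract codec delegating cipher creation to a per-key-version factory) could be sketched as follows. This is a minimal illustration under assumptions: the class `VersionedCipherFactory` and its method names are hypothetical, not the patch's actual `CipherFactory` API.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: maps key versions (as stored in the segment info)
// to AES keys, and builds an AES/CBC cipher for a requested version.
public class VersionedCipherFactory {
    private final Map<Integer, byte[]> keysByVersion = new HashMap<>();

    /** Registers a key for a given version; multiple versions may co-exist. */
    public void registerKey(int version, byte[] key) {
        keysByVersion.put(version, key.clone());
    }

    /** Creates a cipher for the key version recorded in a segment's info. */
    public Cipher newCipher(int mode, int keyVersion, byte[] iv) throws Exception {
        byte[] key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher;
    }
}
```

Because each segment records its own key version, old segments remain readable with old keys while newly written or upgraded segments use the latest registered version.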
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Delbru updated LUCENE-6966:
----------------------------------
Attachment: LUCENE-6966-2-docvalues.patch

Here is a separate patch (to apply on top of LUCENE-6966-2) for the doc values format. It is a prototype based on an encrypted index input/output. The encrypted index output writes encrypted data blocks of fixed size. Each data block has its own initialization vector.
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Delbru updated LUCENE-6966:
----------------------------------
Attachment: LUCENE-6966-2.patch

This patch includes changes so that every encrypted data block uses a new IV. The IV is encoded in the header of the data block. The CipherFactory has been extended so that people can decide how to instantiate a cipher and how to generate new IVs. The performance impact of storing and using a unique IV per block is minimal. The results of the benchmark below (performed on the full Wikipedia dataset) show that there is no significant difference in QPS:
{noformat}
Task                QPS 6966-before  StdDev  QPS 6966-after  StdDev  Pct diff
Respell                       20.56 (11.2%)          19.18   (7.9%)     -6.7% ( -23% - 13%)
Fuzzy2                        33.98 (11.7%)          32.76  (11.0%)     -3.6% ( -23% - 21%)
Fuzzy1                        31.13 (11.2%)          30.05   (8.2%)     -3.5% ( -20% - 17%)
PKLookup                     125.62 (13.0%)         121.38   (8.8%)     -3.4% ( -22% - 21%)
Wildcard                      35.10 (11.7%)          34.36   (8.2%)     -2.1% ( -19% - 20%)
OrNotHighMed                  25.90 (11.4%)          25.86  (10.5%)     -0.2% ( -19% - 24%)
OrNotHighHigh                 15.26 (12.1%)          15.28  (10.8%)      0.2% ( -20% - 26%)
OrHighNotHigh                  9.80 (12.4%)           9.82  (12.0%)      0.2% ( -21% - 28%)
OrHighNotMed                  13.01 (13.4%)          13.06  (13.0%)      0.4% ( -22% - 30%)
LowTerm                      252.64 (12.5%)         253.90   (8.7%)      0.5% ( -18% - 24%)
OrHighNotLow                  35.63 (13.5%)          35.83  (13.4%)      0.6% ( -23% - 31%)
Prefix3                       21.70 (13.3%)          21.86   (9.7%)      0.7% ( -19% - 27%)
MedTerm                       83.04 (11.7%)          83.73   (8.0%)      0.8% ( -16% - 23%)
AndHighHigh                   15.41 (10.6%)          15.61   (7.9%)      1.3% ( -15% - 22%)
LowSloppyPhrase               68.89 (12.5%)          69.90   (9.0%)      1.5% ( -17% - 26%)
AndHighLow                   294.02 (11.6%)         299.04   (8.3%)      1.7% ( -16% - 24%)
OrHighMed                     10.92 (14.4%)          11.13  (10.8%)      1.9% ( -20% - 31%)
OrHighHigh                     9.45 (14.6%)           9.63  (10.9%)      1.9% ( -20% - 32%)
MedSpanNear                   69.01 (11.9%)          70.39   (8.4%)      2.0% ( -16% - 25%)
AndHighMed                    45.16 (12.4%)          46.17   (9.1%)      2.2% ( -17% - 27%)
HighTerm                      16.61 (13.3%)          16.99   (9.5%)      2.3% ( -18% - 28%)
LowPhrase                      3.03 (11.1%)           3.10   (9.2%)      2.3% ( -16% - 25%)
HighPhrase                    11.82 (13.0%)          12.10   (9.6%)      2.4% ( -17% - 28%)
MedPhrase                      7.49 (12.1%)           7.67   (9.1%)      2.4% ( -16% - 26%)
OrNotHighLow                 424.80 (11.1%)         434.97   (8.2%)      2.4% ( -15% - 24%)
OrHighLow                     25.08 (12.0%)          25.70  (11.7%)      2.5% ( -18% - 29%)
HighSloppyPhrase               4.01 (13.7%)           4.11   (9.7%)      2.5% ( -18% - 30%)
MedSloppyPhrase                6.61 (12.9%)           6.78   (9.2%)      2.5% ( -17% - 28%)
LowSpanNear                   15.52 (11.8%)          15.91   (8.6%)      2.5% ( -16% - 26%)
IntNRQ                         3.76 (16.4%)           3.86  (13.1%)      2.7% ( -23% - 38%)
HighSpanNear                   4.40 (12.8%)           4.52   (9.1%)      2.8% ( -16% - 28%)
{noformat}
I took the occasion to run another benchmark comparing this patch against Lucene's master. We can see that queries on low-frequency terms (probably because the dictionary lookup becomes more costly than reading the postings list) and queries that need to scan a large portion of the dictionary are the most impacted:
{noformat}
Task                QPS master       StdDev  QPS 6966        StdDev  Pct diff
Fuzzy1                        55.08 (15.5%)          35.89   (8.2%)    -34.8% ( -50% - -13%)
Respell                       39.31 (16.9%)          28.47   (8.2%)    -27.6% ( -45% -  -3%)
Fuzzy2                        35.33 (16.8%)          28.21   (8.8%)    -20.1% ( -39% -   6%)
Wildcard                      11.13 (18.9%)           9.95   (7.9%)    -10.6% ( -31% -  19%)
AndHighLow                   304.79 (17.7%)         277.30  (10.4%)     -9.0% ( -31% -  23%)
OrNotHighLow                 240.56 (16.8%)         226.64  (10.2%)     -5.8% ( -28% -  25%)
PKLookup                     129.54 (20.1%)         122.47   (8.3%)     -5.5% ( -28% -  28%)
{noformat}
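The per-block IV scheme this patch introduces can be sketched as follows. This is illustrative code under assumptions, not the patch itself: each encrypted block is prefixed with a freshly generated IV in its header, followed by the ciphertext, so no IV is ever reused across blocks.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.io.ByteArrayOutputStream;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical sketch of a per-block IV layout: [16-byte IV | ciphertext].
public class IvPerBlock {
    private static final int IV_LENGTH = 16; // AES block size

    public static byte[] encryptBlock(byte[] key, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_LENGTH];
        new SecureRandom().nextBytes(iv); // fresh IV for every block
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(iv);                        // IV stored in the block header
        out.write(cipher.doFinal(plaintext)); // followed by the ciphertext
        return out.toByteArray();
    }

    public static byte[] decryptBlock(byte[] key, byte[] block) throws Exception {
        byte[] iv = Arrays.copyOfRange(block, 0, IV_LENGTH); // read IV from header
        byte[] ct = Arrays.copyOfRange(block, IV_LENGTH, block.length);
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher.doFinal(ct);
    }
}
```

The cost of this scheme is 16 extra bytes per block plus one random draw, which is consistent with the minimal overhead reported in the benchmark above.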
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Renaud Delbru updated LUCENE-6966:
----------------------------------
Attachment: LUCENE-6966-1.patch

This patch contains the current state of the codec for index-level encryption. It is up to date with the latest version of the lucene-solr master branch. This patch does not yet include the ability for users to choose which cipher to use; I'll submit a new patch that tackles this in the coming week.
The full Lucene test suite has been executed against this codec using the command:
{code}
ant -Dtests.codec=EncryptedLucene60 test
{code}
Only one test fails, TestSizeBoundedForceMerge#testByteSizeLimit, which is expected: this test is incompatible with the codec.
The doc values format (prototype based on an encrypted index output) is not included in this patch and will be submitted as a separate patch in the coming days.