[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068178#comment-16068178 ]

Jan Høydahl commented on LUCENE-6966:
-------------------------------------

[~rendel] I believe the reason you are not seeing more traction on this is not that the work lacks quality or usefulness, but rather that
1) only a tiny percentage of Lucene users need this level of security, and
2) the patch is huge and complex, so most committers won't have the bandwidth (or expertise) to QA it.

There is obviously also a concern about future maintenance load if this needs to be touched for each version, and for each new index feature, with the risk of introducing a bug that breaks security.

I'm sure that if a couple of developers with in-depth knowledge of the feature and security expertise were willing to contribute long-term on this, you would probably be nominated as committers and the feature would have a safer future.

Have you considered starting by maintaining the project on GitHub and producing releases (and Maven artifacts), along with Lucene and Solr usage instructions? This would bring more focus, attract PRs, and I would expect it to be a popular project very soon. Of course, if there are lucene-core changes that are needed for the plugins to work, those would need to be committed first.

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
>                 Key: LUCENE-6966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6966
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/other
>            Reporter: Renaud Delbru
>              Labels: codec, contrib
>         Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch,
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive
> data in the index that has been developed as part of an engagement with a
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at a codec level
> enables more fine-grained control over which blocks of data are encrypted.
> This is more efficient since less data has to be encrypted. It also gives
> more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of his encryption key. Multiple key versions should co-exist in
> one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat
> resilient to IV reuse. The only "leak" of information that this could lead to
> is being able to know that two encrypted blocks of data start with the same
> prefix. However, it is unlikely that two data blocks in an index segment will
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4 KB) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4 KB) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely to have two dictionary data blocks sharing the
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685728#comment-15685728 ]

Otis Gospodnetic commented on LUCENE-6966:
------------------------------------------

Uh, silence. :( I have not looked into the implementation and have only skimmed comments here in the past. My general feeling though is that until/unless this gets committed most people won't bother looking (I think we saw similar behaviour with Solr CDCR which was WIP in JIRA for a while and was labeled as such for a long time but now that it's in I hear more and more people using it http://search-lucene.com/?q=cdcr ) and once it's in it may get worked on by more interested parties.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667213#comment-15667213 ]

Renaud Delbru commented on LUCENE-6966:
---------------------------------------

Is there still interest from the community in considering this patch as a contribution? Even if there are limitations, and therefore this will not cover all possible scenarios, we think this provides an initial set of core features and a good starting point for future work. We have received multiple personal requests for this patch, which shows there is a certain interest in such a feature.

I am also attaching initial technical documentation that explains how to use the codec and clarifies its current known limitations.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291095#comment-15291095 ]

Joel Bernstein commented on LUCENE-6966:
----------------------------------------

I have a couple of issues with the design from a security standpoint:

1) The security tradeoffs of leaving the posting list in the clear are unknown.

2) Encryption at the codec level makes encryption part of the schema design. This leaves open opportunities for users to design insecure schemas that they believe are secure. For example, data can leak from encrypted fields to unencrypted fields as it's copied around to support sorting, faceting, suggestion, multi-language search, etc.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284788#comment-15284788 ]

Renaud Delbru commented on LUCENE-6966:
---------------------------------------

I think the latest patch is ready for commit. Any objections?

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
>                 Key: LUCENE-6966
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6966
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/other
>            Reporter: Renaud Delbru
>              Labels: codec, contrib
>         Attachments: LUCENE-6966-1.patch, LUCENE-6966-2-docvalues.patch,
> LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive
> data in the index that has been developed as part of an engagement with a
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at a codec level
> enables more fine-grained control over which blocks of data are encrypted.
> This is more efficient since less data has to be encrypted. It also gives
> more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of his encryption key. Multiple key versions should co-exist in
> one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing
> an implementation of the cipher factory. The cipher factory is responsible
> for the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation
> vector (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat
> resilient to IV reuse. The only "leak" of information that this could lead to
> is being able to know that two encrypted blocks of data start with the same
> prefix. However, it is unlikely that two data blocks in an index segment will
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4 KB) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4 KB) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely to have two dictionary data blocks sharing the
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely to have two data blocks sharing the same prefix within
> the same segment (each one encodes a list of values associated with a
> field).
> To the best of our knowledge, this model should be safe. However, it would be
> good if someone with security expertise in the community could review and
> validate it.
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fields (id, title, body, date) were encrypted. Only the block tree terms
> and compressed stored fields format were tested at that time.
> h2. Indexing
> The indexing throughput decreased and is roughly 15% less than with the base
> Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the
> following queries:
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
    [ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262520#comment-15262520 ]

Renaud Delbru commented on LUCENE-6966:
---------------------------------------

Hi [~joel.bernstein],

{quote}
1) With the latest patch do you feel the major concerns have been addressed?
{quote}

Yes, the latest patch does not reuse IVs anymore but instead uses a different IV for each data block. It also introduces an API so that one can control how IVs are generated and how the cipher is instantiated.

{quote}
2) From my initial reading of the patch it seemed like everything in the patch was pluggable. Does this need to be committed to be usable? Or can it be hosted on another project?
3) Because it's such a large patch and codecs change over time, does it present a burden to maintain with the core Lucene project? Along these lines is it more appropriate from a maintenance standpoint to be maintained by people who are really motivated to have this feature. Alfresco engineers would likely participate in an outside project if one existed.
{quote}

The patch follows the standard rules of Lucene codecs, so yes, it is fully pluggable. As with other codecs, however, the burden of maintaining it will be low. It is a set of Lucene *Format classes that are loosely coupled with other parts of the Lucene code. It will likely require maintenance only when the high-level Lucene Codec and Format APIs change.

The patch is large because we had to make a copy of some of the original Lucene *Format classes, as those classes were final and not extensible. If one wants to update them with the latest improvements made in the original classes, this might require a bit more effort, but in my personal experience it has so far been straightforward.
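The per-block IV scheme and the key-version lookup mentioned in the comment above can be sketched with plain JDK crypto. This is only an illustration under stated assumptions, not the API of the patch: the class name `VersionedBlockCipher`, its methods, and the choice to store the IV in front of each encrypted block are all hypothetical.

```java
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical helper, NOT part of the LUCENE-6966 patch.
class VersionedBlockCipher {
    private static final int IV_LENGTH = 16; // AES block size in bytes

    // Key material per key version; a segment would record which version it used.
    private final Map<Long, byte[]> keysByVersion = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    void addKey(long version, byte[] key) {
        keysByVersion.put(version, key.clone());
    }

    // Encrypts one data block under a fresh random IV and returns IV || ciphertext.
    byte[] encryptBlock(long keyVersion, byte[] block) {
        byte[] iv = new byte[IV_LENGTH];
        random.nextBytes(iv);
        byte[] ciphertext = run(Cipher.ENCRYPT_MODE, keyVersion, iv, block, 0, block.length);
        byte[] stored = new byte[IV_LENGTH + ciphertext.length];
        System.arraycopy(iv, 0, stored, 0, IV_LENGTH);
        System.arraycopy(ciphertext, 0, stored, IV_LENGTH, ciphertext.length);
        return stored;
    }

    // Reads the IV back from the front of the stored block, then decrypts the rest.
    byte[] decryptBlock(long keyVersion, byte[] stored) {
        byte[] iv = Arrays.copyOf(stored, IV_LENGTH);
        return run(Cipher.DECRYPT_MODE, keyVersion, iv, stored, IV_LENGTH, stored.length - IV_LENGTH);
    }

    private byte[] run(int mode, long keyVersion, byte[] iv, byte[] data, int offset, int length) {
        byte[] key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("unknown key version " + keyVersion);
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(data, offset, length);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VersionedBlockCipher cipher = new VersionedBlockCipher();
        cipher.addKey(1L, new byte[16]); // all-zero demo key, never use in practice
        byte[] block = "same block".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        byte[] first = cipher.encryptBlock(1L, block);
        byte[] second = cipher.encryptBlock(1L, block);
        // The same plaintext encrypts differently because each call draws a new IV.
        System.out.println("ciphertexts equal: " + Arrays.equals(first, second));
        System.out.println("round trip ok: " + Arrays.equals(block, cipher.decryptBlock(1L, first)));
    }
}
```

Storing the IV next to the block costs 16 bytes per block in this sketch; whether the patch stores or derives its IVs is not stated in this thread.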
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256800#comment-15256800 ]

Joel Bernstein commented on LUCENE-6966:

Hi [~rendel],

Thanks for your work on this. I've read through the patch and it's quite a large piece of work. I mentioned earlier in the ticket that Alfresco is interested in this, so I wanted to ask some questions and see if I could understand it better.

1) With the latest patch, do you feel the major concerns have been addressed? I'll copy a few of them below:

bq. On the other hand, from your description: reusing IVs per segment and so on, that is no CBC mode, sorry its essentially ECB mode: this is just not secure.

bq. I am not sure where some of these ideas like "postings lists don't need to be encrypted" came from, but most of the design presented on this issue is completely insecure. Please, if you want to do this stuff in lucene, it needs to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs already computed. You can be 100% sure that if "crypto is invented here" that I'm gonna make comments on the issue, because it is the right thing to do.

2) From my initial reading of the patch, it seemed like everything in it could be pluggable. Does this need to be committed to be usable, or can it be hosted in another project?

3) Because it's such a large patch and codecs change over time, does it present a burden to maintain within the core Lucene project? Along these lines, is it more appropriate from a maintenance standpoint for it to be maintained by people who are really motivated to have this feature? Alfresco engineers would likely participate in an outside project if one existed.

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/other
> Reporter: Renaud Delbru
> Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2.patch
>
> We would like to contribute a codec that enables the encryption of sensitive data in the index that has been developed as part of an engagement with a customer. We think that this could be of interest for the community.
> Below is a description of the project.
>
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system encryption, index output / directory encryption), encryption at a codec level enables more fine-grained control over which blocks of data are encrypted. This is more efficient since less data has to be encrypted. It also gives more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a new version of his encryption key. Multiple key versions should co-exist in one index.
>
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest key version available.
>
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have multiple segments, each one encrypted using a different key version. The key version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an implementation of the cipher factory. The cipher factory is responsible for creating a cipher instance based on a given key version.
>
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation vector (IV) is reused for performance reasons, but only on a per-format and per-segment basis.
> While IV reuse is usually considered a bad practice, CBC mode is somewhat resilient to IV reuse. The only "leak" of information that this could lead to is being able to know that two encrypted blocks of data start with the same prefix. However, it is unlikely that two data blocks in an index segment will start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block (~4kb) of one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of terms and payloads from one or more documents. It is unlikely that two compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set of suffixes. It is unlikely to have two dictionary data blocks sharing the same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data blocks. It is unlikely to have two data blocks sharing the same prefix within the same segment (each one encodes a list of values associated with a field).
> To the best of our knowledge, this model should be safe. However, it would be good if someone with security expertise in the community could review and validate it.
>
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all the fields (id, title, body, date) were encrypted. Only the block tree terms and compressed stored fields formats were tested at that time.
> h2. Indexing
> Indexing throughput decreased by roughly 15% compared with base Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the follo
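The "leak" described in the Encryption Model section of the issue can be reproduced with plain JCE calls. The following is a minimal sketch, not code from the patch: the class name `CbcIvReuseDemo`, the helper names, and the all-zero demo key and IV are assumptions chosen purely for illustration. It encrypts two plaintexts that share a 32-byte prefix and counts how many leading ciphertext blocks come out identical.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Illustrative only: shows what a reused IV reveals under AES/CBC. */
final class CbcIvReuseDemo {

    /** Encrypts plaintext with AES/CBC/PKCS5Padding under the given key and IV. */
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the leading 16-byte ciphertext blocks that are identical in a and b. */
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int blocks = Math.min(a.length, b.length) / 16;
        int n = 0;
        while (n < blocks
                && Arrays.equals(Arrays.copyOfRange(a, n * 16, n * 16 + 16),
                                 Arrays.copyOfRange(b, n * 16, n * 16 + 16))) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];      // demo key (all zeros); never use a fixed key in practice
        byte[] sharedIv = new byte[16]; // the same IV for both blocks, as in the described model
        byte[] otherIv = new byte[16];
        otherIv[0] = 1;                 // a different IV for comparison

        // Both plaintexts start with the same 32 bytes and differ afterwards.
        byte[] p1 = "0123456789abcdef0123456789abcdefAAAA".getBytes(StandardCharsets.US_ASCII);
        byte[] p2 = "0123456789abcdef0123456789abcdefBBBB".getBytes(StandardCharsets.US_ASCII);

        // Same IV: the two shared plaintext blocks produce identical ciphertext blocks.
        System.out.println(sharedLeadingBlocks(encrypt(key, sharedIv, p1), encrypt(key, sharedIv, p2))); // prints 2
        // Different IVs: the ciphertexts diverge from the first block on.
        System.out.println(sharedLeadingBlocks(encrypt(key, sharedIv, p1), encrypt(key, otherIv, p2)));  // prints 0
    }
}
```

With a shared IV, an observer can tell how long the common prefix of two blocks is (in 16-byte units) without knowing the key, which is the property the issue description argues is unlikely to matter for compressed blocks.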
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204549#comment-15204549 ]

Renaud Delbru commented on LUCENE-6966:

Karl, the patch will not include a ready to use FSDirectory implementation, but the doc value format is based on an encrypted index input and output implementation which can easily be reused in an implementation of FSDirectory.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197431#comment-15197431 ]

Renaud Delbru commented on LUCENE-6966:

Thanks for all of the feedback. Based on everyone's comments, it seems like different encryption algorithms might be better depending on the situation. Rather than implement a one-size-fits-all solution then, perhaps it would be better not to enforce any one cipher and instead leave some flexibility for users to choose the cipher they find more appropriate. If everyone is okay with this approach, I will update the code appropriately.
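One way to picture a cipher factory that is keyed by key version while leaving the cipher choice to the user is sketched below. This is a hedged illustration, not the API of the patch: the interface `VersionedCipherFactory`, the class `ConfigurableCipherFactory`, and all method names are hypothetical, and only standard `javax.crypto` calls are used.

```java
import java.security.GeneralSecurityException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical interface; the cipher factory in the actual patch may look different. */
interface VersionedCipherFactory {
    Cipher newCipher(int mode, long keyVersion, byte[] iv);
}

/** Looks up the key by version and builds a cipher for a user-chosen transformation. */
final class ConfigurableCipherFactory implements VersionedCipherFactory {
    private final String transformation; // e.g. "AES/CBC/PKCS5Padding" or "AES/CTR/NoPadding"
    private final Map<Long, SecretKey> keysByVersion = new ConcurrentHashMap<>();

    ConfigurableCipherFactory(String transformation) {
        this.transformation = transformation;
    }

    /** Registers the raw AES key to use for segments written with this key version. */
    void registerKey(long keyVersion, byte[] rawKey) {
        keysByVersion.put(keyVersion, new SecretKeySpec(rawKey, "AES"));
    }

    @Override
    public Cipher newCipher(int mode, long keyVersion, byte[] iv) {
        SecretKey key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        try {
            Cipher cipher = Cipher.getInstance(transformation);
            cipher.init(mode, key, new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Convenience helper: encrypts or decrypts a whole buffer in one call. */
    byte[] transform(int mode, long keyVersion, byte[] iv, byte[] data) {
        try {
            return newCipher(mode, keyVersion, iv).doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

A reader of an old segment would pass the key version recorded for that segment, so segments written under different key versions can coexist until an upgrade rewrites them with the latest key.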
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202629#comment-15202629 ]

Karl Sanders commented on LUCENE-6966:

{quote}Rather than implement a one-size-fits-all solution then, perhaps it would be better not to enforce any one cipher and instead leave some flexibility for users to choose the cipher they find more appropriate.{quote}

I think this is extremely reasonable. I would like to ask if this patch will also provide "FSDirectory-level encryption" like [LUCENE-2228|https://issues.apache.org/jira/browse/LUCENE-2228].
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192849#comment-15192849 ]

Karl Sanders commented on LUCENE-6966:

There's an apparently abandoned project that might be of interest: https://code.google.com/archive/p/lucenetransform/
It appears to be implementing compression and encryption for Lucene indexes.

I also found a couple of related links.
- Some considerations about how it's being used in another project: https://github.com/muzima/documentation/wiki/Security-regarding-stored-data-by-using-Lucene
- A discussion about ensuring that indexes aren't tampered with: http://permalink.gmane.org/gmane.comp.jakarta.lucene.user/50495
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191059#comment-15191059 ]

Thomas Mueller commented on LUCENE-6966:

The approach taken in LUCENE-2228 sounds sensible to me: "AESDirectory extends FSDirectory". That said, the patch would need to be improved: nowadays XTS should be used.
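Both this comment and the review quoted earlier in the thread point towards standardized schemes (XTS, ESSIV) in which every block gets its own IV or tweak. As a rough illustration of the ESSIV idea only, built from standard JCE primitives, the sketch below derives the IV by encrypting the block index under a hash of the data key. It is not taken from either patch: the class `EssivSketch`, its method names, the choice of SHA-256, and the use of a block index in place of a disk sector number are all assumptions made for this example.

```java
import java.nio.ByteBuffer;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical ESSIV-style IV derivation; not part of LUCENE-6966 or LUCENE-2228. */
final class EssivSketch {

    /** IV = AES-ECB under SHA-256(dataKey) of the block index, so each block has its own IV. */
    static byte[] deriveIv(byte[] dataKey, long blockIndex) {
        try {
            // The "salt" key is a hash of the data key (32 bytes, i.e. an AES-256 key).
            byte[] salt = MessageDigest.getInstance("SHA-256").digest(dataKey);
            Cipher ecb = Cipher.getInstance("AES/ECB/NoPadding");
            ecb.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(salt, "AES"));
            // Block index in the first 8 bytes (big-endian), zero-padded to one AES block.
            byte[] indexBlock = ByteBuffer.allocate(16).putLong(blockIndex).array();
            return ecb.doFinal(indexBlock);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts one data block with AES/CBC, using the IV derived for its index. */
    static byte[] encryptBlock(byte[] dataKey, long blockIndex, byte[] plaintext) {
        try {
            Cipher cbc = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cbc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(dataKey, "AES"),
                     new IvParameterSpec(deriveIv(dataKey, blockIndex)));
            return cbc.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // demo key only
        byte[] block = new byte[32];
        // The same plaintext stored at two block positions yields different ciphertexts.
        System.out.println(java.util.Arrays.equals(
                encryptBlock(key, 0L, block), encryptBlock(key, 1L, block)));
    }
}
```

Because the IV is a deterministic function of the key and the block position, nothing extra has to be stored per block, yet identical plaintext at different positions no longer produces identical leading ciphertext. Whether this, XTS, or another scheme is appropriate for an index format is exactly the kind of question the thread asks security reviewers to settle.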
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189351#comment-15189351 ]

Thomas Mueller commented on LUCENE-6966:

> More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team

You could use [homomorphic encryption|https://en.wikipedia.org/wiki/Homomorphic_encryption] (just joking). The best you can realistically do is overwrite the plain text encryption password in memory as soon as you have hashed it, but even that is [a challenge because the JVM garbage collector could copy the password|http://security.stackexchange.com/questions/6753/deleting-a-java-object-securely].

I agree and understand it would be good to have encryption / decryption done in Lucene itself. With [filesystem-level encryption|https://en.wikipedia.org/wiki/Filesystem-level_encryption] you wouldn't need any changes in Lucene, but it is a bit challenging for other reasons (backup and restore, for example).

There are still [challenges with key management|http://stackoverflow.com/questions/2664099/how-can-we-store-password-other-than-plain-text] of course. One option is to read the encryption password from a config file at startup, and delete that file after use. To start the application, use a script that copies the config file (with the password) using user "x", but run the application as user "y" (with lower privileges). That way, the (plain text) password is not there during normal operation. Or ask the operator to type in the password at startup, or ask for the password on a web page... But it's more flexible than with filesystem-level encryption.
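The "overwrite the password as soon as you have hashed it" advice could look roughly like the sketch below. It is an illustration under stated assumptions, not a recommendation from the thread: the class and method names are hypothetical, and PBKDF2 with HMAC-SHA256, 100,000 iterations, and a 256-bit key are choices made for this example. As the comment notes, this only wipes the arrays the code controls; copies made elsewhere by the JVM are out of reach.

```java
import java.security.GeneralSecurityException;
import java.util.Arrays;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

/** Illustrative helper: derive a key from a password, then wipe the password. */
final class PasswordKeyDerivation {

    /**
     * Derives a 256-bit key with PBKDF2 and overwrites the caller's password array.
     * The password is a char[] (not a String) so that it can be zeroed afterwards.
     */
    static byte[] deriveKeyAndWipe(char[] password, byte[] salt) {
        PBEKeySpec spec = new PBEKeySpec(password, salt, 100_000, 256);
        try {
            return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA256")
                    .generateSecret(spec)
                    .getEncoded();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        } finally {
            spec.clearPassword();        // wipes the copy held inside the key spec
            Arrays.fill(password, '\0'); // wipes the caller's array
        }
    }

    public static void main(String[] args) {
        char[] password = "correct horse".toCharArray(); // in practice read from a console or file
        byte[] salt = new byte[16];                       // demo salt; use a random, stored salt in practice
        byte[] key = deriveKeyAndWipe(password, salt);
        System.out.println("key bytes: " + key.length);
    }
}
```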
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189316#comment-15189316 ]

Karl Sanders commented on LUCENE-6966:
--------------------------------------

I would like to let you know that I made the H2 and HSQLDB communities aware of this interest in providing a secure backend for Lucene. H2's author, Thomas Mueller, has already been so kind as to provide his feedback.

Both of these databases provide encryption at the file level. In both cases I mentioned the possibility of using the database as encrypted storage for Lucene data, or even going as far as adding (or, in the case of H2, improving) Lucene integration.

Some of the companies interested in adding this capability to Lucene might want to reach out to them too, showing that there is a concrete possibility of a partnership or simply of some contract work.

I sincerely hope that Lucene file-level encryption can become a reality.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189211#comment-15189211 ]

Thomas Mueller commented on LUCENE-6966:
----------------------------------------

Your proposal doesn't sound very secure. I would recommend renaming this feature "scrambling" and not using the term "encryption".

> you are correct that ECB mode is not secure if blocks are not unique.
> However, in our use case we can ensure that all blocks are unique and our
> security team would argue that this makes it equivalent to using CTR or CBC
> mode, making it secure.

Even though I'm not an expert, I'm almost 100% sure that no, this is not secure.

> it would not be possible to choose which field to encrypt or not

How important is this? Why can't you use two indexes, one encrypted (properly, with XTS) and the other not?

As for XTS, it is fairly simple to implement. If you like, I can contribute [the XTS code I have written for the H2 database|https://github.com/h2database/h2database/blob/master/h2/src/main/org/h2/store/fs/FilePathEncrypt.java].
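As an illustration of the ECB concern raised in Thomas Mueller's comment, the following is a minimal sketch, not code from the patch. The class name `EcbBlockLeak`, the demo key, and the sample plaintext are all assumptions made up for this example. It shows that under AES/ECB two identical 16-byte plaintext blocks always produce identical ciphertext blocks, so the scheme is only as strong as the guarantee that blocks never repeat.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical demo class; not part of the LUCENE-6966 patch.
public class EcbBlockLeak {

    /** Encrypts with AES/ECB/NoPadding; the input length must be a multiple of 16 bytes. */
    public static byte[] encryptEcb(byte[] key, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the 16-byte blocks at indexes i and j of data are equal. */
    public static boolean sameBlock(byte[] data, int i, int j) {
        return Arrays.equals(
                Arrays.copyOfRange(data, i * 16, (i + 1) * 16),
                Arrays.copyOfRange(data, j * 16, (j + 1) * 16));
    }

    public static void main(String[] args) {
        // Demo key only; a real key would come from a key management system.
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII);
        // Blocks 0 and 2 hold the same 16 bytes; block 1 differs.
        byte[] plaintext = ("AAAAAAAAAAAAAAAA" + "BBBBBBBBBBBBBBBB" + "AAAAAAAAAAAAAAAA")
                .getBytes(StandardCharsets.US_ASCII);
        byte[] ciphertext = encryptEcb(key, plaintext);
        System.out.println("blocks 0 and 2 equal: " + sameBlock(ciphertext, 0, 2)); // true
        System.out.println("blocks 0 and 1 equal: " + sameBlock(ciphertext, 0, 1)); // false
    }
}
```

The sketch only demonstrates the repetition leak; it says nothing about the other weaknesses that reviewers on this issue may have in mind.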
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181311#comment-15181311 ]

Joel Bernstein commented on LUCENE-6966:
----------------------------------------

Alfresco is also interested in this ticket. I'd like to see if there is a way to reach consensus on an approach for moving this forward. This will likely mean making changes in the patch to address security concerns.

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/other
> Reporter: Renaud Delbru
> Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive
> data in the index that has been developed as part of an engagement with a
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at a codec level
> enables more fine-grained control over which blocks of data are encrypted. This
> is more efficient since less data has to be encrypted. This also gives more
> flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of his encryption key. Multiple key versions should co-exist in
> one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an
> implementation of the cipher factory. The cipher factory is responsible for
> creating a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation vector
> (IV) is reused for performance reasons, but only on a per-format and
> per-segment basis.
> While IV reuse is usually considered bad practice, CBC mode is somewhat
> resilient to IV reuse. The only "leak" of information that this could lead to
> is being able to know that two encrypted blocks of data start with the same
> prefix. However, it is unlikely that two data blocks in an index segment will
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4kb) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely that two dictionary data blocks share the
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely that two data blocks share the same prefix within
> the same segment (each one encodes a list of values associated with a
> field).
> To the best of our knowledge, this model should be safe. However, it would be
> good if someone with security expertise in the community could review and
> validate it.
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fields (id, title, body, date) were encrypted. Only the block tree terms
> and compressed stored fields format were tested at that time.
> h2. Indexing
> The indexing throughput decreased by roughly 15% compared with
> the base Lucene.
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we o
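The shared-prefix leak that the Encryption Model section describes can be reproduced with the standard JCA API. This is a minimal sketch under stated assumptions: the class name `CbcSharedPrefixLeak`, the key, the IVs and the sample plaintexts are invented for the demonstration, and `PKCS5Padding` is assumed as the padding because the description only says "with padding". It is not code from the patch.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical demo class; not part of the LUCENE-6966 patch.
public class CbcSharedPrefixLeak {
    static final int BLOCK = 16; // AES block size in bytes

    /** Encrypts with AES/CBC/PKCS5Padding (padding choice is an assumption). */
    public static byte[] encryptCbc(byte[] key, byte[] iv, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts how many leading 16-byte blocks two ciphertexts have in common. */
    public static int commonLeadingBlocks(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length) / BLOCK;
        int blocks = 0;
        while (blocks < limit && Arrays.equals(
                Arrays.copyOfRange(a, blocks * BLOCK, (blocks + 1) * BLOCK),
                Arrays.copyOfRange(b, blocks * BLOCK, (blocks + 1) * BLOCK))) {
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] key = "0123456789abcdef".getBytes(StandardCharsets.US_ASCII); // demo key only
        byte[] reusedIv = new byte[BLOCK];  // one IV shared by every block of a segment
        byte[] otherIv = new byte[BLOCK];
        otherIv[0] = 1;                     // any IV different from reusedIv

        // Both plaintexts start with the same 16-byte prefix and then diverge.
        byte[] first = "segment-header-0document A payload".getBytes(StandardCharsets.US_ASCII);
        byte[] second = "segment-header-0document B payload".getBytes(StandardCharsets.US_ASCII);

        byte[] c1 = encryptCbc(key, reusedIv, first);
        byte[] c2 = encryptCbc(key, reusedIv, second);
        byte[] c3 = encryptCbc(key, otherIv, second);

        // With the reused IV, the shared plaintext prefix is visible in the ciphertext.
        System.out.println("same IV, common leading blocks: " + commonLeadingBlocks(c1, c2));      // 1
        // With a different IV, even the first ciphertext block differs.
        System.out.println("different IV, common leading blocks: " + commonLeadingBlocks(c1, c3)); // 0
    }
}
```

The sketch covers only the leak that the description itself names. Whether that is the only consequence of IV reuse is exactly what the description asks a security reviewer to confirm.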
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180675#comment-15180675 ]

Adam Williams commented on LUCENE-6966:
---------------------------------------

Iron Mountain is also interested in this solution. We have been following this ticket and hoping that it will become a reality. We have over 90,000 customers with varying security rules, ranging from banks to healthcare and government. Ultimately, our responsibility is to ensure that we meet the needs of our customers.

We are SolrCloud-based, with 4.8 billion indexed documents on over 120 virtual machines and 26 clouds. Many of our large customers have encryption requirements, while some of our other customers have none. For us, disk-based encryption on shared storage is not ideal: the cost for TBs of data in tier-1 storage is high. This approach also allows us to turn encryption on for those who need it and off for those who do not.

By allowing flexibility in the key generation and crypto provider, we can provide a solution to meet the security needs of many customers.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176749#comment-15176749 ]

Gennadiy Geyfman commented on LUCENE-6966:
------------------------------------------

To provide a little more background, there are several reasons why we thought a file-system-based encryption scheme would not work for our case.

One of the core challenges we are facing is the management of a large number of user keys across our infrastructure. We have a complex security environment with key management procedures driven by customers with different security requirements. As a result, different indices have to be encrypted with different keys and with different crypto providers, all of which we need to manage.

Second, we also need to control the index life cycle, which includes applying new encryption keys, transferring indexes over the network, and managing index backups that are stored in the cloud. For this reason, it will be much less complicated to have control over the key providers and data encryption in one central point (Lucene).

Given these requirements, our evaluation showed that file-system encryption would require a larger effort and investment in order to keep secure control across our infrastructure.
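The Key Management section of the issue description has a cipher factory create a cipher instance for a given key version, with each segment recording the version it was written with. The sketch below illustrates that idea under stated assumptions. `VersionedCipherFactory`, its method names and the in-memory key map are hypothetical and are not the API of the attached patch; the demo keys and the all-zero IV are placeholders.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.TreeMap;

// Hypothetical sketch of a key-version-aware cipher factory; not the patch's API.
public class VersionedCipherFactory {
    // Key material per key version. A real subclass would fetch keys from a key store.
    private final TreeMap<Integer, byte[]> keysByVersion = new TreeMap<>();

    public void addKey(int version, byte[] key) {
        keysByVersion.put(version, key.clone());
    }

    /** The version an index upgrader would re-encrypt old segments with. */
    public int latestVersion() {
        return keysByVersion.lastKey();
    }

    /** Creates an initialised AES/CBC cipher for the key version recorded for a segment. */
    public Cipher newCipher(int keyVersion, int mode, byte[] iv) {
        byte[] key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("unknown key version " + keyVersion);
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Encrypts or decrypts one data block with the given key version. */
    public byte[] transform(int keyVersion, int mode, byte[] iv, byte[] data) {
        try {
            return newCipher(keyVersion, mode, iv).doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        VersionedCipherFactory factory = new VersionedCipherFactory();
        factory.addKey(1, "key-version-0001".getBytes(StandardCharsets.US_ASCII)); // demo keys only
        factory.addKey(2, "key-version-0002".getBytes(StandardCharsets.US_ASCII));

        byte[] iv = new byte[16];   // placeholder IV for the demo
        int segmentKeyVersion = 1;  // would be read from the segment info
        byte[] block = "stored fields block".getBytes(StandardCharsets.UTF_8);

        byte[] encrypted = factory.transform(segmentKeyVersion, Cipher.ENCRYPT_MODE, iv, block);
        byte[] decrypted = factory.transform(segmentKeyVersion, Cipher.DECRYPT_MODE, iv, encrypted);
        System.out.println(new String(decrypted, StandardCharsets.UTF_8));
        System.out.println("latest key version: " + factory.latestVersion());
    }
}
```

Keeping the lookup behind one method is what lets segments written with key version 1 stay readable after version 2 is introduced, until an upgrade rewrites them with the latest version.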
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144305#comment-15144305 ]

Robert Muir commented on LUCENE-6966:
-------------------------------------

{quote}
More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team and, therefore, a non-starter for us.
{quote}

This speaks volumes. You should fire your security team! You are wasting your time worrying about this: if you are using Lucene, your data will be in memory, in plaintext, in ways you cannot control, and there is nothing you can do about that!

Trying to guarantee anything better than "at rest" is serious business; it sounds like your team is in over their heads.
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134867#comment-15134867 ]

Gennadiy Geyfman commented on LUCENE-6966:
------------------------------------------

Thanks for your feedback, Robert. Just to give you some background, this feature is being developed for Salesforce, where we have customers that not only require robust encryption, but that also have strong compliance requirements for key management. We are not currently users of Solr Cloud, which I mention just to assure you that none of the differences between Solr and Solr Cloud have had any impact on this design.

As far as the solution itself is concerned, you are correct that ECB mode is not secure if blocks are not unique. However, in our use case we can ensure that all blocks are unique, and our security team would argue that this makes it equivalent to using CTR or CBC mode, making it secure. Nonetheless, our case would be stronger if the uniqueness property were actually guaranteed rather than suggested, so we will seek to refine the design in this way.

Also, with regard to your other concerns, we evaluated encryption at the filesystem level, but our conclusion was that this would bring even more complexity than index-level encryption, especially when one considers typical compliance requirements for key management. Encryption at the filesystem level would also require thoughtful planning for our backup / restore operations to ensure that backups are encrypted as well. More importantly, with file-level encryption, data would reside in an unencrypted form in memory which is not acceptable to our security team and, therefore, a non-starter for us.

Hopefully this gives you a better idea of the thinking that went into this proposed design. We agree that for other users of Solr, encryption by the OS could easily make more sense. Moreover, nothing in our proposal would stop anyone from pursuing that path. But for our use case and for others needing enterprise search functionality, this solution gives more granular control of the keys as well as the end-to-end encryption process.

If you have any additional questions or concerns, please let me know and I'll try to answer them as best I can. Thank you, again, for taking the time to work with us on this much-needed feature.

Thanks
Salesforce Search Team
Gennadiy Geyfman
Sr. Director of Engineering
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089614#comment-15089614 ]

Robert Muir commented on LUCENE-6966:

https://www.nccgroup.trust/us/about-us/newsroom-and-events/blog/2009/july/if-youre-typing-the-letters-a-e-s-into-your-code-youre-doing-it-wrong/

I am not sure where some of these ideas like "postings lists don't need to be encrypted" came from, but most of the design presented on this issue is completely insecure.

Please, if you want to do this stuff in Lucene, it needs to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs already computed. You can be 100% sure that if "crypto is invented here", I'm going to make comments on the issue, because it is the right thing to do.

The many justifications for doing it in a complicated way at the codec level seem to revolve around limitations in SolrCloud rather than good design. You really can put different indexes in different directories and let the operating system do it for "multitenancy". Lucene has things like ParallelReader, so different fields can be in different indexes if you really need that, and so on. There are alternatives everywhere that would still allow you to "let the OS do it", be secure, and have a working filesystem cache (be fast).
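For readers unfamiliar with the standardized schemes named in Robert Muir's comment: ESSIV derives the IV of each block by encrypting the block number under a key that is itself a hash of the data key, so IVs are distinct per block and cannot be predicted without the key. The following is only a minimal sketch of that derivation, not code from any patch on this issue. The class and method names are hypothetical, the little-endian encoding of the block number is an illustrative choice, and how a "block number" would map onto Lucene data blocks is an assumption left open here.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;

// Hypothetical illustration of ESSIV-style IV derivation; not a Lucene API.
final class EssivSketch {

    /** Returns a 16-byte IV for the given block number, derived from the data key. */
    static byte[] deriveIv(byte[] dataKey, long blockNumber) {
        try {
            // Salt = SHA-256(data key); the 32-byte digest is used as an AES-256 key.
            byte[] salt = MessageDigest.getInstance("SHA-256").digest(dataKey);

            // Block number as a 64-bit value, zero-padded to one AES block.
            byte[] counter = ByteBuffer.allocate(16)
                    .order(ByteOrder.LITTLE_ENDIAN)
                    .putLong(blockNumber)
                    .array();

            // A single raw AES block encryption of the counter under the salt key.
            Cipher aes = Cipher.getInstance("AES/ECB/NoPadding");
            aes.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(salt, "AES"));
            return aes.doFinal(counter);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because AES under a fixed key is a permutation, two different block numbers always yield two different IVs, which is the property the per-segment IV reuse discussed on this issue lacks.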
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089007#comment-15089007 ]

Renaud Delbru commented on LUCENE-6966:

I agree with you that if we add encryption to Lucene, it should always be secure. That's why I opened up the discussion with the community in order to review and agree on which approach to adopt.

With respect to IV reuse with CBC mode, a potential leak of information occurs when two messages share a common prefix, as it will reveal the presence and length of that prefix. Now if we look at each format separately and at what type of message is encrypted in each one, we can assess the risk:
- Term Dictionary Index: the entire term dictionary index in a segment will be encrypted as one single message - no risk.
- Term Dictionary Data: each suffix bytes blob is encrypted as one message - I would assume that the probability of having two suffix bytes blobs sharing the same prefix or being identical is pretty low. But I might be wrong.
- Stored Fields Format: each compressed doc chunk is encrypted as one message - two doc chunks can contain the exact same data (e.g., if multiple documents contain the exact same fields and values). This is more likely to happen, but it sounds more like an edge case.
- Term Vectors: each compressed terms and payloads bytes blob of a doc chunk is encrypted as one message - same issue as with the Stored Fields Format.

The risk of reusing the IV seems to reside in Stored Fields / Term Vectors. If that risk is not acceptable, one solution is to add a randomly generated header to each compressed doc chunk that will serve as a unique IV. What do you think?
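The prefix leak described in this comment can be reproduced with the JDK's own AES/CBC implementation. Below is a minimal sketch, assuming a fixed 128-bit key and made-up chunk contents; the class and helper names are hypothetical and are not taken from the patch. It encrypts two messages that share a 32-byte prefix, first with the same IV and then with a fresh random IV per message, and counts how many leading ciphertext blocks are identical.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene or of the proposed codec.
final class CbcIvReuseDemo {
    static final int BLOCK = 16; // AES block size in bytes

    /** AES/CBC/PKCS5Padding encryption with a caller-supplied IV. */
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Number of leading 16-byte ciphertext blocks that are identical in a and b. */
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int blocks = Math.min(a.length, b.length) / BLOCK;
        int i = 0;
        while (i < blocks && Arrays.equals(
                Arrays.copyOfRange(a, i * BLOCK, (i + 1) * BLOCK),
                Arrays.copyOfRange(b, i * BLOCK, (i + 1) * BLOCK))) {
            i++;
        }
        return i;
    }

    static byte[] randomIv() {
        byte[] iv = new byte[BLOCK];
        new SecureRandom().nextBytes(iv);
        return iv;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16]; // all-zero demo key, for illustration only
        String prefix = "0123456789abcdef0123456789abcdef"; // 32 bytes = 2 AES blocks
        byte[] m1 = (prefix + "chunk A").getBytes(StandardCharsets.UTF_8);
        byte[] m2 = (prefix + "chunk B").getBytes(StandardCharsets.UTF_8);

        byte[] reusedIv = new byte[BLOCK];
        // Same IV: the two shared plaintext blocks produce two identical ciphertext blocks.
        System.out.println(sharedLeadingBlocks(encrypt(key, reusedIv, m1), encrypt(key, reusedIv, m2))); // prints 2

        // Fresh random IV per message: no shared blocks unless the two random IVs collide.
        System.out.println(sharedLeadingBlocks(encrypt(key, randomIv(), m1), encrypt(key, randomIv(), m2)));
    }
}
```

With a per-message random IV, as proposed at the end of the comment, the IV has to be stored alongside each chunk (for example as the header mentioned there) so that the chunk can be decrypted later.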
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087165#comment-15087165 ]

Robert Muir commented on LUCENE-6966:

{quote}
It is true that if the filesystem caches unencrypted pages, then with a warm cache you will likely get better performance. However, this also means that most of the index data will reside in memory in an unencrypted form. If the server is compromised, then this will make life easier for the attacker.
{quote}

These are the correct tradeoffs to make though. It is fast and makes it work "at rest". On the other hand, from your description (reusing IVs per segment and so on), that is not CBC mode; sorry, it's essentially ECB mode. This is just not secure. If we are going to add encryption to Lucene, it should actually be secure!

> Contribution: Codec for index-level encryption
> ----------------------------------------------
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/other
> Reporter: Renaud Delbru
> Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive
> data in the index that has been developed as part of an engagement with a
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
>
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system
> encryption, index output / directory encryption), encryption at a codec level
> enables more fine-grained control on which block of data is encrypted. This
> is more efficient since less data has to be encrypted. This also gives more
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a
> new version of his encryption key. Multiple key versions should co-exist in
> one index.
>
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest
> key version available.
>
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have
> multiple segments, each one encrypted using a different key version. The key
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing an
> implementation of the cipher factory. The cipher factory is responsible for
> the creation of a cipher instance based on a given key version.
>
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation vector
> (IV) is reused for performance reasons, but only on a per format and per
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow
> resilient to IV reuse. The only "leak" of information that this could lead to
> is being able to know that two encrypted blocks of data start with the same
> prefix. However, it is unlikely that two data blocks in an index segment will
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block
> (~4kb) of one or more documents. It is unlikely that two compressed blocks
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of
> terms and payloads from one or more documents. It is unlikely that two
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set
> of suffixes. It is unlikely to have two dictionary data blocks sharing the
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data
> blocks. It is unlikely to have two data blocks sharing the same prefix within
> the same segment (each one encodes a list of values associated with a
> field).
> To the best of our knowledge, this model should be safe. However, it would be
> good if someone with security expertise in the community could review and
> validate it.
>
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all
> the fi
[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption
[ https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087144#comment-15087144 ]

Renaud Delbru commented on LUCENE-6966:

Discussion copied from the following [dev thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/201601.mbox/%3C568D2289.5080408@siren.solutions%3E]

{quote}
I would strongly recommend against "invent your own mode", and instead using standardized schemes/modes (e.g. XTS). Separate from that, I don't understand the reasoning to do it at the codec level. seems quite a bit more messy and complicated than the alternatives, such as block device level (e.g. dm-crypt), or filesystem level (e.g. ext4 filesystem encryption), which have the advantage of the filesystem cache actually working.
{quote}

[~rcmuir], Yes, you are right. This approach is more complex than plain fs-level encryption, but it enables more fine-grained control over what is encrypted. For example, with fs-level encryption it would not be possible to choose which fields to encrypt. Also, with fs-level encryption, all the data is encrypted regardless of whether it is sensitive or not. For example, in such a scenario, the full posting lists will be encrypted, which is unnecessary, and you'll pay the cost of encrypting the posting lists.

It is true that if the filesystem caches unencrypted pages, then with a warm cache you will likely get better performance. However, this also means that most of the index data will reside in memory in an unencrypted form. If the server is compromised, then this will make life easier for the attacker. There is also the (small) issue of swap, which can end up holding a large portion of the index unencrypted. This can be solved by using an encrypted swap, but this means that the data is now encrypted using a single key and not a per-user key. Also, this adds complexity to the management of the system. Highly sensitive installations can make the trade-off between performance and security.

There are some applications for Solr that are not served by the other approaches. This codec was developed in the context of a large multi-tenant architecture, where each user has their own index / collection. Each user has their own key, and can update that key at any time. While it seems it would be possible with ext4 to handle a per-user key (e.g., one key per directory), it makes the key and index management more complex (especially in SolrCloud). This is not adequate for some environments. Also, it does not allow the management of multiple key versions in one index. If the user changes their key, we have to re-encrypt the full directory, which is not acceptable with respect to performance for some environments.

The codec-level encryption approach is better suited to some environments than the fs-level encryption approach. Also, it is to be noted that this codec does not affect the rest of Lucene/Solr. Users will be able to choose which approach is more adequate for their environment. This gives more options to Lucene/Solr users.
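The key-versioning requirement discussed in this thread (one key version per segment, several versions co-existing in one index, a cipher factory that creates a cipher for a given key version) can be sketched as follows. This is a hedged illustration only: `VersionedCipherFactory` and its methods are hypothetical names, not the API of the attached patch, and the use of AES/CBC/PKCS5Padding with a caller-supplied IV is an assumption made to keep the example small.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.util.TreeMap;

// Hypothetical sketch of a cipher factory keyed by key version; not the patch's API.
final class VersionedCipherFactory {
    private final TreeMap<Integer, SecretKeySpec> keys = new TreeMap<>();

    /** Registers a new key version; older versions stay available for old segments. */
    void addKey(int version, byte[] rawKey) {
        keys.put(version, new SecretKeySpec(rawKey, "AES"));
    }

    /** Version that newly written segments would use. */
    int latestVersion() {
        return keys.lastKey();
    }

    /** Creates a cipher for the key version recorded for a segment. */
    Cipher newCipher(int keyVersion, int opmode, byte[] iv) {
        SecretKeySpec key = keys.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("unknown key version " + keyVersion);
        }
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(opmode, key, new IvParameterSpec(iv));
            return cipher;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    byte[] encrypt(int keyVersion, byte[] iv, byte[] data) {
        return run(newCipher(keyVersion, Cipher.ENCRYPT_MODE, iv), data);
    }

    byte[] decrypt(int keyVersion, byte[] iv, byte[] data) {
        return run(newCipher(keyVersion, Cipher.DECRYPT_MODE, iv), data);
    }

    private static byte[] run(Cipher cipher, byte[] data) {
        try {
            return cipher.doFinal(data);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, a segment written under version 1 remains readable after version 2 is registered, and an upgrade step would amount to decrypting with the recorded version and re-encrypting with `latestVersion()`, one segment at a time rather than the whole directory at once.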