[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2017-06-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068178#comment-16068178
 ] 

Jan Høydahl commented on LUCENE-6966:
-

[~rendel] I believe the reason you are not seeing more traction on this is not 
because it is not quality work or useful, but rather that 1) It is only a tiny 
percentage of Lucene users who need this level of security 2) The patch is huge 
and complex, so most committers won't have bandwidth (or expertise) to QA it.

There is obviously also a concern about future maintenance load if this needs 
to be touched for each version, and for each new index feature, with the risk 
of introducing a bug that breaks security. I'm sure that if a couple of 
developers with in-depth knowledge of the feature and security expertise were 
willing to contribute long-term on this you would probably be nominated as 
committers and the feature would have a safer future.

Have you considered starting by maintaining the project on GitHub, and produce 
releases (and maven artifacts), along with Lucene and Solr usage instructions? 
This would bring more focus, attract PRs, and I would expect it to be a popular 
project very soon. Of course, if there are lucene-core changes that are needed 
for the plungins to work, those would need to be committed first.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> 

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-21 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15685728#comment-15685728
 ] 

Otis Gospodnetic commented on LUCENE-6966:
--

Uh, silence. :( I have not looked into the implementation and have only skimmed 
comments here in the past.  My general feeling though is that until/unless this 
gets committed most people won't bother looking (I think we saw similar 
behaviour with Solr CDCR which was WIP in JIRA for a while and was labeled as 
such for a long time but now that it's in I hear more and more people using 
it http://search-lucene.com/?q=cdcr ) and once it's in it may get worked on 
by more interested parties.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> t

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-15 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15667213#comment-15667213
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Is there still interest from the community in considering this patch as a 
contribution ? Even if there are limitations and therefore this will not cover 
all possible scenarios, we think this provides an initial set of core features 
and a good starting point for future work. We received multiples personal 
request for this patch which shows there is a certain interest for such a 
feature. I am attaching also an initial technical documentation that explains 
how to use the codec and clarifies its current known limitations.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia da

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-05-19 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15291095#comment-15291095
 ] 

Joel Bernstein commented on LUCENE-6966:


I have a couple of issues with the design from a security standpoint:

1) The security tradeoffs of leaving the posting list in the clear is unknown.

2) Encryption at the codec level makes encryption part of the schema design. 
This leaves open opportunities for users to design insecure schemas that they 
believe are secure. For example data can leak from encrypted fields to 
un-encrypted fields as it's copied around to support sorting, faceting, 
suggestion, multi-language search etc...

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2-docvalues.patch, 
> LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. 

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-05-16 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15284788#comment-15284788
 ] 

Renaud Delbru commented on LUCENE-6966:
---

I think the latest patch is ready for commit, any objections ?

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2-docvalues.patch, 
> LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less than with 
> the base Lucene. 
> The merge time slightly increased by 35%.
> There was no significant difference in term of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the 
> following queries:

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-04-28 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262520#comment-15262520
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Hi [~joel.bernstein],

{quote}
1) With the latest patch do you feel the major concerns have been addressed.
{quote}

Yes, the latest patch does not reuse IVs anymore but instead use a different IV 
for each data block. It also introduces an API so that one can have control on 
how IVs are generated and how the cipher is instantiated.

{quote}
2) From my initial reading of the patch it seemed like everything in the patch 
was pluggable. Does this need to be committed to be usable? Or can it be hosted 
on another project?

3) Because it's such a large patch and codecs change over time, does it present 
a burden to maintain with the core Lucene project? Along these lines is it more 
appropriate from a maintenance standpoint to be maintained by people who are 
really motivated to have this feature. Alfresco engineers would likely 
participate in an outside project if one existed.
{quote}

The patch follows the standard rules of Lucene codecs, so yes, it is fully 
pluggable. Similar to other codecs, however, the burden to maintain it will be 
low. It is a set of Lucene's *Format classes that are loosely coupled with 
other part of the Lucene code. It will likely require maintenance only when the 
high-level Lucene's Codec and Format API changes.
 
The patch is large because we had to make a copy of some of the original lucene 
*Format classes, as those classes were final and not extensible. If one wants 
to update them with the latest improvements made in the original classes, this 
might require a bit more effort, but from my personal experience it was so far 
straightforward.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms 

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-04-25 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15256800#comment-15256800
 ] 

Joel Bernstein commented on LUCENE-6966:


Hi [~rendel],

Thanks for your work on this. I've read through patch and it's quite a large 
piece of work. I mentioned earlier in the ticket that Alfresco is interested in 
this, so I wanted to ask some questions and see if I could understand it 
better. 

1) With the latest patch do you feel the major concerns have been addressed. 
I'll copy a few of them below:

bq. On the other hand, from your description: reusing IVs per segment and so 
on, that is no CBC mode, sorry its essentially ECB mode: this is just not 
secure.

bq. I am not sure where some of these ideas like "postings lists don't need to 
be encrypted" came from, but most of the design presented on this issue is 
completely insecure. Please, if you want to do this stuff in lucene, it needs 
to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs 
already computed. You can be 100% sure that if "crypto is invented here" that 
I'm gonna make comments on the issue, because it is the right thing to do.

2) From my initial reading of the patch it seemed like everything in the patch 
could be pluggable. Does this need to be committed to be usable? Or can be 
hosted on another project?

3) Because it's such a large patch and codecs change over time, does it present 
a burden to maintain with the core Lucene project? Along these lines is it more 
appropriate from a maintenance standpoint to be maintained by people who are 
really motivated to have this feature. Alfresco engineers would likely 
participate in an outside project if one existed.



> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlik

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-21 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204549#comment-15204549
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Karl, the patch will not include a ready to use FSDirectory implementation, but 
the doc value format is based on an encrypted index input and output 
implementation which can easily be reused in an implementation of FSDirectory.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less than with 
> the base Lucene. 
> The merge time slightly increased by 35%.
> There was no significant difference in term of index size.
> h2. Query Throughput
> With respect to query through

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-19 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15197431#comment-15197431
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Thanks for all of the feedback. Based on everyone's comments, it seems like 
different encryption algorithms might be better depending on the situation. 
Rather than implement a one-size-fits-all solution then, perhaps it would be 
better not to enforce any one cipher and instead leave some flexibility for 
users to choose the cipher they find more appropriate. 

If everyone is okay with this approach, I will update the code appropriately.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased 

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-19 Thread Karl Sanders (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202629#comment-15202629
 ] 

Karl Sanders commented on LUCENE-6966:
--

{quote}Rather than implement a one-size-fits-all solution then, perhaps it 
would be better not to enforce any one cipher and instead leave some 
flexibility for users to choose the cipher they find more appropriate.{quote}

I think this is extremely reasonable.

I would like to ask if this patch will also provide "FSDirectory-level 
encryption" like 
[LUCENE-2228|https://issues.apache.org/jira/browse/LUCENE-2228].

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less tha

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-14 Thread Karl Sanders (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192849#comment-15192849
 ] 

Karl Sanders commented on LUCENE-6966:
--

There's an apparently abandoned project that might be of interest:
https://code.google.com/archive/p/lucenetransform/

It appears to be implementing compression and encryption for Lucene indexes.
I also found a couple of related links.

- Some considerations about how it's being used in another project:
https://github.com/muzima/documentation/wiki/Security-regarding-stored-data-by-using-Lucene

- A discussion about ensuring that indexes aren't tampered with:
http://permalink.gmane.org/gmane.comp.jakarta.lucene.user/50495

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were t

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-11 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15191059#comment-15191059
 ] 

Thomas Mueller commented on LUCENE-6966:


The approach taken in LUCENE-2228 sounds sensible to me: "AESDirectory extends 
FSDirectory". Even thought the patch would need to be improved: nowadays XTS 
should be used.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less than with 
> the base Lucene. 
> The merge time slightly increased by 35%.
> There was no significant difference in term of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the 
> follo

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-10 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189351#comment-15189351
 ] 

Thomas Mueller commented on LUCENE-6966:


> More importantly, with file-level encryption, data would reside in an 
> unencrypted form in memory which is not acceptable to our security team

You could use [homomorphic 
encryption|https://en.wikipedia.org/wiki/Homomorphic_encryption] (just joking). 
The best you can realistically do is overwrite the plain text encryption 
password in memory as soon as you have hashed it, but even that is [a challenge 
because the JVM garbage collector could copy the 
password|http://security.stackexchange.com/questions/6753/deleting-a-java-object-securely].

I agree and understand it would be good to have encryption / decryption done in 
Lucene itself. With [filesystem-level 
encryption|https://en.wikipedia.org/wiki/Filesystem-level_encryption] you 
wouldn't need any changes in Lucene, but is a bit challenging for other reasons 
(backup and restore for example). There are still [challenges with key 
management|http://stackoverflow.com/questions/2664099/how-can-we-store-password-other-than-plain-text]
 of course. One option is to read the encryption password from a config file at 
startup, and delete that file after use. To start the application, use a script 
that copies the config file (with the password) using user "x", but run the 
application as user "y" (with lower privileges). That way, the (plain text) 
password is not there during normal operation. Or ask the operator to type in 
the password at startup, or ask for the password on a web page... But it's more 
flexible than with filesystem-level encryption.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictio

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-10 Thread Karl Sanders (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189316#comment-15189316
 ] 

Karl Sanders commented on LUCENE-6966:
--

I would like to let you know that I made the H2 and HSQLDB communities aware 
about this interest in providing a secure backend for Lucene.
H2's author, Thomas Mueller, has already been so kind as to provide his 
feedback.
Both these databases provide encryption at the file level. In both cases I 
mentioned the possibility to use the database as an encrypted storage for 
Lucene data.
Or even to go as far to add (or improve, in the case of H2) Lucene integration.

Maybe some of the companies interested in adding this capability to Lucene 
might want to reach out to them too, showing that there's the concrete 
possibility for a partnership or to simply do some contract work.

I sincerely hope that Lucene file-level encryption can become a reality.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-10 Thread Thomas Mueller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189211#comment-15189211
 ] 

Thomas Mueller commented on LUCENE-6966:


Your proposal doesn't sound very secure. I would recommend to rename this 
feature as "scrambling", and not use the term "encryption".

> you are correct that ECB mode is not secure if blocks are not unique. 
> However, in our use case we can ensure that all blocks are unique and our 
> security team would argue that this make it equivalent to using CTR or CBC 
> mode, making it secure.

Even thought I'm not an expert, I'm almost 100% sure that no, this is not 
secure.

>  it would not be possible to choose which field to encrypt or not

How important is this? Why can't you use two indexes, one encrypted (properly, 
with XTS) and the other one is not?

As for XTS, it is fairly simple to implement. If you like, I can contribute 
[the XTS code I have written for the H2 
database|https://github.com/h2database/h2database/blob/master/h2/src/main/org/h2/store/fs/FilePathEncrypt.java].



> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-04 Thread Joel Bernstein (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15181311#comment-15181311
 ] 

Joel Bernstein commented on LUCENE-6966:


Alfresco is also interested in this ticket. I'd like to see if there is a way 
to reach consensus on an approach for moving this forward. This will likely 
mean making changes in the patch to address security concerns.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput slightly decreased and is roughly 15% less than with 
> the base Lucene. 
> The merge time slightly increased by 35%.
> There was no significant difference in term of index size.
> h2. Query Throughput
> With respect to query throughput, we o

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-04 Thread Adam Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15180675#comment-15180675
 ] 

Adam Williams commented on LUCENE-6966:
---

Iron Mountain is also interested in this solution.  We have been following this 
ticket and hoping that it will become a reality. We have over 90,000 customers 
with varying security rules ranging from banks to healthcare and government.  

Ultimately, our responsibility is to ensure that we meet the needs of our 
customers.  We are solr cloud based with 4.8 billion indexed documents on over 
120 virtual machines and 26 clouds.  Many of our large customers have 
encryption requirements, while some of our other customers have none.  For us 
to do disk based encryption on shared storage is not ideal.  The cost for TBs 
of data in teir-1 storage is high.  Also, this approach allows us to set 
encryption for those who need it and turn it off for those who do not.  By 
allowing flexibility in the key generation and crypto provider, we can provide 
a solution to meet the security needs of many customers.  

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> go

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-02 Thread Gennadiy Geyfman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15176749#comment-15176749
 ] 

Gennadiy Geyfman commented on LUCENE-6966:
--

To provide a little more background, there are several reasons why we thought a 
file system based encryption scheme would not work for our case.  One of the 
core challenges we are facing is the management of a large number of user keys 
across our infrastructure. We have a complex security environment with key 
management procedures driven by customers with different security requirements. 
As a result, different indices have to be encrypted with different keys and 
with different crypto providers - all of which we need to manage. 

Second, we also need to control the index life cycle which includes applying of 
a new encryption keys, transfer index over the network and manage index backups 
that are stored in the cloud.  For this reason, it will be much less 
complicated to have control over the key providers and data encryption in one 
central point (Lucene). 
 
Given these requirements, our evaluation showed that file-system encryption 
would lead us to larger effort and investment in order to keep secure control 
across the our infrastructure.



> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same 

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-02-12 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15144305#comment-15144305
 ] 

Robert Muir commented on LUCENE-6966:
-

{quote}
More importantly, with file-level encryption, data would reside in an 
unencrypted form in memory which is not acceptable to our security team and, 
therefore, a non-starter for us. 
{quote}

This speaks volumes. You should fire your security team! You are wasting your 
time worrying about this: if you are using lucene, your data will be in memory, 
in plaintext, in ways you cannot control, and there is nothing you can do about 
that!

Trying to guarantee anything better than "at rest" is serious business, sounds 
like your team is over their head.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and comp

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-02-05 Thread Gennadiy Geyfman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134867#comment-15134867
 ] 

Gennadiy Geyfman commented on LUCENE-6966:
--


Thanks for your feedback, Robert. 

Just to give you some background, this feature is being developed for 
Salesforce, where we have customers that not only require robust encryption, 
but that also have strong compliance requirements for key management.  We are 
not currently users of Solr Cloud, which I mention if just to assure you that 
none of the differences between Solr vs. Solr Cloud have had any impact on this 
design.

As far as the solution itself is concerned, you are correct that ECB mode is 
not secure if blocks are not unique.  However, in our use case we can ensure 
that all blocks are unique and our security team would argue that this make it 
equivalent to using CTR or CBC mode, making it secure.  Nonetheless, our case 
would be stronger if the uniqueness property was actually guaranteed rather 
than suggested, so we will seek to refine the design in this way.

Also, with regard to your other concerns, we evaluated encryption at the 
filesystem level, but our conclusion was that this will bring even more 
complexity than index-level encryption, especially when one considers typical 
compliance requirements for key management.  Encryption at the filesystem level 
would also require thoughtful planning for our backup / restore operations to 
ensure that backups are encrypted as well. More importantly, with file-level 
encryption, data would reside in an unencrypted form in memory which is not 
acceptable to our security team and, therefore, a non-starter for us. 

Hopefully this gives you a better idea of the thinking that went into this 
proposed design.   We agree that for other users of Solr, encryption by the OS 
could easily make more sense.  Moreover, nothing in our proposal would stop 
anyone from pursuing that path.  But for our use case and for others needing 
enterprise search functionality, this solution gives  more granular control of 
the keys as well as the end-to-end encryption process.

If you have any additional questions or concerns, please let me know and I'll 
try to answer them as best I can. Thank you, again, for taking the time to work 
with us on this much-needed feature.


Thanks
Salesforce Search Team
Gennadiy Geyfman
Sr. Director of Engineering

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, t

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-08 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089614#comment-15089614
 ] 

Robert Muir commented on LUCENE-6966:
-

https://www.nccgroup.trust/us/about-us/newsroom-and-events/blog/2009/july/if-youre-typing-the-letters-a-e-s-into-your-code-youre-doing-it-wrong/

I am not sure where some of these ideas like "postings lists don't need to be 
encrypted" came from, but most of the design presented on this issue is 
completely insecure. Please, if you want to do this stuff in lucene, it needs 
to be a standardized scheme (like XTS or ESSIV) with all the known tradeoffs 
already computed. You can be 100% sure that if "crypto is invented here" that 
I'm gonna make comments on the issue, because it is the right thing to do.

The many justifications for doing it in a complicated way in the codec level 
seems to revolve around limitations in solrcloud, rather than from good design. 
Because you really can put different indexes in different directories and let 
the operating system do it for "multitenancy". Because Lucene has stuff like 
ParallelReader and different fields can be in different indexes if you really 
need that, etc, etc. Alternative everywhere which would allow you to still "let 
the OS do it", be secure, and have a working filesystem cache (be fast).



> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks.

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-08 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15089007#comment-15089007
 ] 

Renaud Delbru commented on LUCENE-6966:
---

I agree with you that if we add encryption to Lucene, it should always be 
secure. That's why I opened up the discussion with the commnunity in order to 
review and agree on which approach to adopt. 
With respect to IV reuse with CBC mode, a potential leak of information occurs 
when two messages share a common prefix, as it will reveal the presence and 
length of that prefix.
Now if we look at each format separately and at what type of messages is 
encrypted in each one, we can assess the risk:
- Term Dictionary Index: the entire term dictionary index in a segment will be 
encrypted as one single message - risk is null
- Term Dictionary Data: each suffixes bytes blob is encrypted as one message - 
I would assume that the probability of having two suffixes bytes blobs sharing 
the same prefix or being identical is pretty low. But I might be wrong.
- Stored Fields Format: each compressed doc chunk is encrypted as one message - 
a doc chunk can contain the exact same data (e.g., if multiple documents 
contain the same exact fields and values). This is more likely to happen but it 
sounds like more an edge case.
- Terms Vector: each compress terms and payloads bytes blob of doc chunk is 
encrypted as one message - same issue than with Stored Fields Format

The risk of reusing IV seems to reside in Stored Fields / Terms Vector is not 
acceptable, one solution is to add a random generated header to each compressed 
doc chunk that will serve as a unique IV. What do you think ?

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary inde

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-07 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087165#comment-15087165
 ] 

Robert Muir commented on LUCENE-6966:
-

{quote}
It is true that if the filesystem caches unencrypted pages, then with a warm 
cache you will likely get better performance. However, this also means that 
most of the index data will reside in memory in an unencrypted form. If the 
server is compromised, then this will make life easier for the attacker.
{quote}

These are the correct tradeoffs to make though. It is fast and makes it work 
"at rest". 

On the other hand, from your description: reusing IVs per segment and so on, 
that is no CBC mode, sorry its essentially ECB mode: this is just not secure.

If we are going to add encryption to lucene, it should actually be secure!

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible in providing an 
> implementation of the cipher factory. The cipher factory is responsible of 
> the creation of a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. Initialisation vector 
> (IV) is reused for performance reason, but only on a per format and per 
> segment basis.
> While IV reuse is usually considered a bad practice, the CBC mode is somehow 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data starts with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one will encodes a list of values associated to a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fi

[jira] [Commented] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-01-07 Thread Renaud Delbru (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087144#comment-15087144
 ] 

Renaud Delbru commented on LUCENE-6966:
---

Discussion copied from the following [dev 
thread|http://mail-archives.apache.org/mod_mbox/lucene-dev/201601.mbox/%3C568D2289.5080408@siren.solutions%3E]
 

{quote}
I would strongly recommend against "invent your own mode", and instead
using standardized schemes/modes (e.g. XTS).

Separate from that, I don't understand the reasoning to do it at the
codec level. seems quite a bit more messy and complicated than the
alternatives, such as block device level (e.g. dm-crypt), or
filesystem level (e.g. ext4 filesystem encryption), which have the
advantage of the filesystem cache actually working.
{quote}

[~rcmuir], 

Yes, you are right. This approach is more complex than plain fs level 
encryption, but this enables more fine-grained control on what is encrypted. 
For example, it would not be possible to choose which field to encrypt or not. 
Also, with fs level encryption, all the data is encrypted regardless if it is 
sensitive or not. For example, in such a scenario, the full posting lists will 
be encrypted which is unnecessary, and you'll pay the cost of encrypting the 
posting lists.
It is true that if the filesystem caches unencrypted pages, then with a warm 
cache you will likely get better performance. However, this also means that 
most of the index data will reside in memory in an unencrypted form. If the 
server is compromised, then this will make life easier for the attacker. You 
have also the (small) issue with the swap which can end up with a large portion 
of the index unencrypted. This can be solved by using an encrypted swap, but 
this means that the data is now encrypted using a unique key and not a per-user 
key. Also, this adds complexity in the management of the system.
Highly sensitive installations can make the trade-off between performance and 
security. There are some applications for Solr that are not served by the other 
approaches.

This codec was developed in the context of a large multi-tenant architecture, 
where each user has its own index / collection. Each user has its own key, and 
can update his key at any time.
While it seems it would be possible with ext4 to handle a per-user key (e.g., 
one key per directory), it makes the key and index management more complex 
(especially in SolrCloud). This is not adequate for some environments.
Also, it does not allow the management of multiple key versions in one index. 
If the user changes his key, we have to re-encrypt the full directory which is 
not acceptable wrt performance for some environments.

The codec level encryption approach is more adequate for some environments than 
the fs level encryption approach. Also, it is to be noted that this codec does 
not affect the rest of Lucene/Solr. Users will be able to choose which approach 
is more adequate for their environment. This gives more options to Lucene/Solr 
users.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
>
> We would like to contribute a codec that enables the encryption of sensitive 
> data in the index that has been developed as part of an engagement with a 
> customer. We think that this could be of interest for the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at a codec level 
> enables more fine-grained control on which block of data is encrypted. This 
> is more efficient since less data has to be encrypted. This also gives more 
> flexibility such as the ability to select which field to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which field to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of his encryption key. Multiple key versions should co-exist in 
> one index.
> h1. What is supported ?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separated patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How it is implemented ?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can h