[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-11-15 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: Encryption Codec Documentation.pdf

An initial technical documentation.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: Encryption Codec Documentation.pdf, LUCENE-6966-1.patch, 
> LUCENE-6966-2-docvalues.patch, LUCENE-6966-2.patch
>
>
> We would like to contribute a codec, developed as part of an engagement with 
> a customer, that enables the encryption of sensitive data in the index. We 
> think that this could be of interest to the community.
> Below is a description of the project.
> h1. Introduction
> In comparison with approaches where all data is encrypted (e.g., file system 
> encryption, index output / directory encryption), encryption at the codec 
> level gives finer-grained control over which blocks of data are encrypted. 
> This is more efficient since less data has to be encrypted. It also gives 
> more flexibility, such as the ability to select which fields to encrypt.
> Some of the requirements for this project were:
> * The performance impact of the encryption should be reasonable.
> * The user can choose which fields to encrypt.
> * Key management: During the life cycle of the index, the user can provide a 
> new version of the encryption key. Multiple key versions should be able to 
> co-exist in one index.
> h1. What is supported?
> - Block tree terms index and dictionary
> - Compressed stored fields format
> - Compressed term vectors format
> - Doc values format (prototype based on an encrypted index output) - this 
> will be submitted as a separate patch
> - Index upgrader: command to upgrade all the index segments with the latest 
> key version available.
> h1. How is it implemented?
> h2. Key Management
> One index segment is encrypted with a single key version. An index can have 
> multiple segments, each one encrypted using a different key version. The key 
> version for a segment is stored in the segment info.
> The provided codec is abstract, and a subclass is responsible for providing 
> an implementation of the cipher factory. The cipher factory is responsible 
> for creating a cipher instance based on a given key version.
> h2. Encryption Model
> The encryption model is based on AES/CBC with padding. The initialisation 
> vector (IV) is reused for performance reasons, but only on a per-format and 
> per-segment basis.
> While IV reuse is usually considered bad practice, the CBC mode is somewhat 
> resilient to IV reuse. The only "leak" of information that this could lead to 
> is being able to know that two encrypted blocks of data start with the same 
> prefix. However, it is unlikely that two data blocks in an index segment will 
> start with the same data:
> - Stored Fields Format: Each encrypted data block is a compressed block 
> (~4kb) of one or more documents. It is unlikely that two compressed blocks 
> start with the same data prefix.
> - Term Vectors: Each encrypted data block is a compressed block (~4kb) of 
> terms and payloads from one or more documents. It is unlikely that two 
> compressed blocks start with the same data prefix.
> - Term Dictionary Index: The term dictionary index is encoded and encrypted 
> in one single data block.
> - Term Dictionary Data: Each data block of the term dictionary encodes a set 
> of suffixes. It is unlikely to have two dictionary data blocks sharing the 
> same prefix within the same segment.
> - DocValues: A DocValues file will be composed of multiple encrypted data 
> blocks. It is unlikely to have two data blocks sharing the same prefix within 
> the same segment (each one encodes a list of values associated with a 
> field).
> To the best of our knowledge, this model should be safe. However, it would be 
> good if someone with security expertise in the community could review and 
> validate it. 
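
The prefix "leak" described above can be reproduced with the standard JCA API. The following is a minimal sketch, not code from the patch: the class name `CbcIvReuseDemo`, the all-zero demo key and the sample plaintexts are assumptions made for illustration. It encrypts two plaintexts that share their first 16 bytes and counts how many leading ciphertext blocks are identical.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.util.Arrays;

// Hypothetical demo class; not part of the LUCENE-6966 patch.
final class CbcIvReuseDemo {
    static final int AES_BLOCK = 16;

    /** Encrypts with AES/CBC and PKCS#5 padding using the given key and IV. */
    static byte[] encrypt(byte[] key, byte[] iv, byte[] plaintext) {
        try {
            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"),
                    new IvParameterSpec(iv));
            return cipher.doFinal(plaintext);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the leading 16-byte blocks that are identical in both ciphertexts. */
    static int sharedLeadingBlocks(byte[] a, byte[] b) {
        int limit = Math.min(a.length, b.length) / AES_BLOCK;
        int blocks = 0;
        while (blocks < limit) {
            int from = blocks * AES_BLOCK;
            byte[] x = Arrays.copyOfRange(a, from, from + AES_BLOCK);
            byte[] y = Arrays.copyOfRange(b, from, from + AES_BLOCK);
            if (!Arrays.equals(x, y)) {
                break;
            }
            blocks++;
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] key = new byte[16];      // all-zero key, for demonstration only
        byte[] reusedIv = new byte[16]; // the same IV used for both blocks
        byte[] otherIv = new byte[16];
        otherIv[0] = 1;                 // a different IV for comparison

        // Both plaintexts start with the same 16 bytes, then diverge.
        byte[] p1 = "0123456789abcdefAAAAAAAAAAAAAAAA".getBytes(StandardCharsets.US_ASCII);
        byte[] p2 = "0123456789abcdefBBBBBBBBBBBBBBBB".getBytes(StandardCharsets.US_ASCII);

        // Same key and IV: the shared plaintext prefix is visible in the ciphertext.
        System.out.println(sharedLeadingBlocks(
                encrypt(key, reusedIv, p1), encrypt(key, reusedIv, p2))); // prints 1
        // Different IVs: no ciphertext block is shared.
        System.out.println(sharedLeadingBlocks(
                encrypt(key, reusedIv, p1), encrypt(key, otherIv, p2)));  // prints 0
    }
}
```

With a reused IV, an observer learns exactly how many leading 16-byte blocks two data blocks have in common, and nothing beyond the first differing block.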
> h1. Performance
> We report here a performance benchmark we did on an early prototype based on 
> Lucene 4.x. The benchmark was performed on the Wikipedia dataset where all 
> the fields (id, title, body, date) were encrypted. Only the block tree terms 
> and compressed stored fields format were tested at that time. 
> h2. Indexing
> The indexing throughput decreased by roughly 15% compared with the base 
> Lucene. 
> The merge time increased by 35%.
> There was no significant difference in terms of index size.
> h2. Query Throughput
> With respect to query throughput, we observed no significant impact on the 
> following 

[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-05-06 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-2-docvalues.patch

Here is a separate patch (to apply on top of LUCENE-6966-2) for the doc values 
format. It is a prototype based on an encrypted index input/output. The 
encrypted index output writes encrypted data blocks of fixed size. Each data 
block has its own initialization vector.
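
To illustrate what fixed-size blocks with a per-block IV imply for random access, here is a small arithmetic sketch. It is not taken from the patch: the class `EncryptedBlockLayout`, the assumption that the fixed size refers to the plaintext of a block, and the on-disk layout (a 16-byte IV followed by the AES/CBC ciphertext with PKCS#5 padding) are all hypothetical choices made for the example.

```java
// Hypothetical layout helper; the real patch may organise its blocks differently.
final class EncryptedBlockLayout {
    static final int IV_LENGTH = 16; // one AES block, stored in front of each data block
    static final int AES_BLOCK = 16;

    final int plainBlockSize;        // assumed fixed plaintext size of a full block

    EncryptedBlockLayout(int plainBlockSize) {
        if (plainBlockSize <= 0) {
            throw new IllegalArgumentException("plainBlockSize must be positive");
        }
        this.plainBlockSize = plainBlockSize;
    }

    /** On-disk size of a full block: IV plus PKCS#5-padded ciphertext. */
    int physicalBlockSize() {
        // PKCS#5 padding always adds between 1 and 16 bytes.
        int cipherLength = (plainBlockSize / AES_BLOCK + 1) * AES_BLOCK;
        return IV_LENGTH + cipherLength;
    }

    /** Index of the block that holds the given plaintext offset. */
    long blockIndex(long logicalOffset) {
        return logicalOffset / plainBlockSize;
    }

    /** Position of the plaintext offset inside its decrypted block. */
    int offsetInBlock(long logicalOffset) {
        return (int) (logicalOffset % plainBlockSize);
    }

    /** File position at which the block holding the offset starts (at its IV). */
    long physicalBlockStart(long logicalOffset) {
        return blockIndex(logicalOffset) * physicalBlockSize();
    }

    public static void main(String[] args) {
        EncryptedBlockLayout layout = new EncryptedBlockLayout(4096);
        System.out.println(layout.physicalBlockSize());       // 4128
        System.out.println(layout.blockIndex(10000));         // 2
        System.out.println(layout.offsetInBlock(10000));      // 1808
        System.out.println(layout.physicalBlockStart(10000)); // 8256
    }
}
```

Under these assumptions, a seek only needs to read and decrypt the single block that contains the target offset, because each block carries everything needed to decrypt it.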

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch, LUCENE-6966-2-docvalues.patch, 
> LUCENE-6966-2.patch

[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-04-05 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-2.patch

This patch includes changes so that every encrypted data block uses a new IV. 
The IV is encoded in the header of the data block. The CipherFactory has been 
extended so that people can decide how to instantiate a cipher and how to 
generate new IVs.
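
As a rough illustration of the scheme (a fresh IV per block, stored in the block header, with the key selected by version), here is a minimal sketch built on the standard JCA API. It does not reproduce the CipherFactory from the patch: the class `BlockCipherSketch`, its method names, the key map and the header layout (IV bytes first, ciphertext after) are assumptions made for the example.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;
import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the CipherFactory API of the LUCENE-6966 patch.
final class BlockCipherSketch {
    static final int IV_LENGTH = 16;

    private final Map<Integer, byte[]> keysByVersion; // key version -> raw AES key
    private final SecureRandom random = new SecureRandom();

    BlockCipherSketch(Map<Integer, byte[]> keysByVersion) {
        this.keysByVersion = keysByVersion;
    }

    private Cipher newCipher(int mode, int keyVersion, byte[] iv)
            throws GeneralSecurityException {
        byte[] key = keysByVersion.get(keyVersion);
        if (key == null) {
            throw new IllegalArgumentException("Unknown key version: " + keyVersion);
        }
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(mode, new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return cipher;
    }

    /** Returns IV || ciphertext, with a freshly generated random IV. */
    byte[] encryptBlock(int keyVersion, byte[] plain) {
        try {
            byte[] iv = new byte[IV_LENGTH];
            random.nextBytes(iv);
            byte[] body = newCipher(Cipher.ENCRYPT_MODE, keyVersion, iv).doFinal(plain);
            byte[] out = Arrays.copyOf(iv, IV_LENGTH + body.length); // header = IV
            System.arraycopy(body, 0, out, IV_LENGTH, body.length);
            return out;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Reads the IV from the block header and decrypts the remainder. */
    byte[] decryptBlock(int keyVersion, byte[] block) {
        if (block.length < IV_LENGTH) {
            throw new IllegalArgumentException("Block is shorter than its IV header");
        }
        try {
            byte[] iv = Arrays.copyOfRange(block, 0, IV_LENGTH);
            return newCipher(Cipher.DECRYPT_MODE, keyVersion, iv)
                    .doFinal(block, IV_LENGTH, block.length - IV_LENGTH);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Map<Integer, byte[]> keys = new HashMap<>();
        keys.put(1, new byte[16]); // all-zero demo key for key version 1
        BlockCipherSketch sketch = new BlockCipherSketch(keys);

        byte[] plain = "stored field block".getBytes();
        byte[] first = sketch.encryptBlock(1, plain);
        byte[] second = sketch.encryptBlock(1, plain);
        System.out.println(Arrays.equals(first, second));                        // false
        System.out.println(Arrays.equals(plain, sketch.decryptBlock(1, first))); // true
    }
}
```

Because the IV travels with the block, a reader only needs the key version recorded for the segment to decrypt any block, and identical plaintext blocks no longer produce identical ciphertext.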

The performance impact of storing and using a unique IV per block is minimal. 
The results of the benchmark below (performed on the full Wikipedia dataset) 
show that there is no significant difference in QPS:

{noformat}
            Task  QPS 6966-before    StdDev  QPS 6966-after    StdDev  Pct diff
         Respell            20.56   (11.2%)           19.18    (7.9%)     -6.7%  ( -23% -   13%)
          Fuzzy2            33.98   (11.7%)           32.76   (11.0%)     -3.6%  ( -23% -   21%)
          Fuzzy1            31.13   (11.2%)           30.05    (8.2%)     -3.5%  ( -20% -   17%)
        PKLookup           125.62   (13.0%)          121.38    (8.8%)     -3.4%  ( -22% -   21%)
        Wildcard            35.10   (11.7%)           34.36    (8.2%)     -2.1%  ( -19% -   20%)
    OrNotHighMed            25.90   (11.4%)           25.86   (10.5%)     -0.2%  ( -19% -   24%)
   OrNotHighHigh            15.26   (12.1%)           15.28   (10.8%)      0.2%  ( -20% -   26%)
   OrHighNotHigh             9.80   (12.4%)            9.82   (12.0%)      0.2%  ( -21% -   28%)
    OrHighNotMed            13.01   (13.4%)           13.06   (13.0%)      0.4%  ( -22% -   30%)
         LowTerm           252.64   (12.5%)          253.90    (8.7%)      0.5%  ( -18% -   24%)
    OrHighNotLow            35.63   (13.5%)           35.83   (13.4%)      0.6%  ( -23% -   31%)
         Prefix3            21.70   (13.3%)           21.86    (9.7%)      0.7%  ( -19% -   27%)
         MedTerm            83.04   (11.7%)           83.73    (8.0%)      0.8%  ( -16% -   23%)
     AndHighHigh            15.41   (10.6%)           15.61    (7.9%)      1.3%  ( -15% -   22%)
 LowSloppyPhrase            68.89   (12.5%)           69.90    (9.0%)      1.5%  ( -17% -   26%)
      AndHighLow           294.02   (11.6%)          299.04    (8.3%)      1.7%  ( -16% -   24%)
       OrHighMed            10.92   (14.4%)           11.13   (10.8%)      1.9%  ( -20% -   31%)
      OrHighHigh             9.45   (14.6%)            9.63   (10.9%)      1.9%  ( -20% -   32%)
     MedSpanNear            69.01   (11.9%)           70.39    (8.4%)      2.0%  ( -16% -   25%)
      AndHighMed            45.16   (12.4%)           46.17    (9.1%)      2.2%  ( -17% -   27%)
        HighTerm            16.61   (13.3%)           16.99    (9.5%)      2.3%  ( -18% -   28%)
       LowPhrase             3.03   (11.1%)            3.10    (9.2%)      2.3%  ( -16% -   25%)
      HighPhrase            11.82   (13.0%)           12.10    (9.6%)      2.4%  ( -17% -   28%)
       MedPhrase             7.49   (12.1%)            7.67    (9.1%)      2.4%  ( -16% -   26%)
    OrNotHighLow           424.80   (11.1%)          434.97    (8.2%)      2.4%  ( -15% -   24%)
       OrHighLow            25.08   (12.0%)           25.70   (11.7%)      2.5%  ( -18% -   29%)
HighSloppyPhrase             4.01   (13.7%)            4.11    (9.7%)      2.5%  ( -18% -   30%)
 MedSloppyPhrase             6.61   (12.9%)            6.78    (9.2%)      2.5%  ( -17% -   28%)
     LowSpanNear            15.52   (11.8%)           15.91    (8.6%)      2.5%  ( -16% -   26%)
          IntNRQ             3.76   (16.4%)            3.86   (13.1%)      2.7%  ( -23% -   38%)
    HighSpanNear             4.40   (12.8%)            4.52    (9.1%)      2.8%  ( -16% -   28%)
{noformat}

I took the opportunity to run another benchmark comparing this patch against 
Lucene's master. We can see that queries on low-frequency terms (probably 
because the dictionary lookup becomes more costly than reading the posting 
list) and queries that need to scan a large portion of the dictionary are the 
most impacted.

{noformat}
            Task  QPS master    StdDev  QPS 6966    StdDev  Pct diff
          Fuzzy1       55.08   (15.5%)     35.89    (8.2%)    -34.8%  ( -50% -  -13%)
         Respell       39.31   (16.9%)     28.47    (8.2%)    -27.6%  ( -45% -   -3%)
          Fuzzy2       35.33   (16.8%)     28.21    (8.8%)    -20.1%  ( -39% -    6%)
        Wildcard       11.13   (18.9%)      9.95    (7.9%)    -10.6%  ( -31% -   19%)
      AndHighLow      304.79   (17.7%)    277.30   (10.4%)     -9.0%  ( -31% -   23%)
    OrNotHighLow      240.56   (16.8%)    226.64   (10.2%)     -5.8%  ( -28% -   25%)
        PKLookup      129.54   (20.1%)    122.47    (8.3%)     -5.5%  ( -28% -   28%)
 

[jira] [Updated] (LUCENE-6966) Contribution: Codec for index-level encryption

2016-03-24 Thread Renaud Delbru (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-6966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Renaud Delbru updated LUCENE-6966:
--
Attachment: LUCENE-6966-1.patch

This patch contains the current state of the codec for index-level encryption. 
It is up to date with the latest version of the lucene-solr master branch. This 
patch does not yet include the ability for users to choose which cipher to 
use. I'll submit a new patch that tackles this issue in the coming week.
The full Lucene test suite has been executed against this codec using the 
command:
{code}
ant -Dtests.codec=EncryptedLucene60 test
{code}
Only one test fails, TestSizeBoundedForceMerge#testByteSizeLimit, which is 
expected. This test is incompatible with the codec.

The doc values format (prototype based on an encrypted index output) is not 
included in this patch, and will be submitted as a separate patch in the 
coming days.

> Contribution: Codec for index-level encryption
> --
>
> Key: LUCENE-6966
> URL: https://issues.apache.org/jira/browse/LUCENE-6966
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: modules/other
>Reporter: Renaud Delbru
>  Labels: codec, contrib
> Attachments: LUCENE-6966-1.patch