Attendees / Agenda:
* Gidon (IBM): Parquet encryption. Uber, Vertica, Amazon
* Anna, Gabor, Nandor (Cloudera): Review for column indexing
* Junjie (Tencent): Bloom filter
* Lars (Cloudera Impala)
* Jim (Cloudera): Bloom filter
* Deepak (Vertica): Encryption
* Qinghui, Benoit (Criteo): parquet protobuf

Parquet encryption:
* Deepak will look at the code this week.
* Gidon update:
  * Multi-key encryption (one for keys and one for footer)
  * Implementation available.
  * Working on performance evaluation
    * Starting in Java 9, encryption is hardware accelerated and much better. Very little overhead.
    * Java 8 encryption has more overhead.
      * If using gzip, overhead is small
      * If using snappy, overhead is high
  * Added a second encryption implementation for Java 8 that is faster but less secure
    * Advantage of 2 algorithms: makes us think about formalizing the algorithm in the metadata too.
  * Use case: use encryption without the API, passing the info through Hadoop config.
  * Modified design document
* Discussion on metadata.
  * Column indexes do not replace the statistics in the footer but replace the statistics in the page header.

Column indexing:
* Parquet-mr/pr/481

* Encryption
  * [Some things covered already before these notes started]
  * Hardware support for encryption? Yes on POWER. Not sure about ARM. Definitely on x86-64.
* Bloom filters: C++ needs review, but also doing performance tests
  * Guava Bloom filter: not sure if compatible between versions. Impala BFs might be much faster.
  * Java vs. C++ compat: there will be tests
* Column indexing
  * parquet-mr 481 https://github.com/apache/parquet-mr/pull/481
  * Right now doing it in a separate branch for compat reasons. Not sure the write path will work.
  * That branch has 3 or more commits
  * Column indexes will be stored just before the filter. Will the statistics (before the footer) still be useful with column indexing, or can we just leave them out?
  * Filter is for row-groups, column indexing is for pages?
  * Do we store the maximum value in a page, or a value that is greater than or equal to the largest value in the page? Impala does the latter; PR#481 does that for some pages, but not all (?)
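
On the last question (exact maximum vs. a value greater than or equal to the largest value), here is a minimal Python sketch of how page skipping behaves under each choice. It is not parquet-mr or Impala code: `PageBounds` and `pages_to_read` are hypothetical names, and the integer pages are made-up sample data. The assumption is that a page is skipped only when its stored bounds cannot overlap the predicate range, so a looser bound may cause extra pages to be read but should never drop a matching page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageBounds:
    """Hypothetical per-page entry of a column index."""
    lower: int  # <= every value in the page (not necessarily the true min)
    upper: int  # >= every value in the page (not necessarily the true max)


def pages_to_read(index: List[PageBounds], lo: int, hi: int) -> List[int]:
    """Return ordinals of pages that may hold a value v with lo <= v <= hi."""
    return [
        i for i, b in enumerate(index)
        if not (b.upper < lo or b.lower > hi)
    ]


# Made-up sample data: three pages of one integer column.
pages = [[1, 5, 9], [10, 12, 19], [20, 25, 31]]

# Exact bounds: the true min and max of each page.
exact = [PageBounds(min(p), max(p)) for p in pages]

# Loose upper bounds: still >= the largest value, but not equal to it.
loose = [PageBounds(min(p), max(p) + 5) for p in pages]

print(pages_to_read(exact, 10, 15))  # [1]
print(pages_to_read(loose, 10, 15))  # [0, 1]
```

With exact bounds only the second page is read. With loose bounds the first page is read as well, which is wasted I/O but not a wrong result. Under this assumption a reader can treat both kinds of writers the same way.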