[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302507#comment-14302507 ]

Owen O'Malley commented on HIVE-9188:
-------------------------------------

Suggestions:
* Pick m to always be a multiple of 64 (since you are using longs as the representation).
* Change the representation of BloomFilter in orc_proto to record the number of hash functions and not the size or fpp.
* Use fixed64 for the bit field.
* You'll also need to update the specification in the wiki with the change to the format (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification).
* Revert the spurious change to CliDriver.java.
* Revert the spurious change to .gitignore.
* It seems suboptimal to convert long values to bytes before hashing.

BloomFilter in ORC row group index
----------------------------------

Key: HIVE-9188
URL: https://issues.apache.org/jira/browse/HIVE-9188
Project: Hive
Issue Type: New Feature
Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
Labels: orcfile
Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch

Bloom filters are a well-known probabilistic data structure for set membership checking. We can use bloom filters in the ORC index for better row group pruning. Currently, the ORC row group index uses min/max statistics to eliminate row groups (and stripes as well) that do not satisfy the predicate condition specified in the query. But in some cases min/max based elimination is not effective (for example, unsorted columns with a wide range of entries). Bloom filters can be an effective and efficient alternative for row group/split elimination for point queries or queries with an IN clause.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
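The first two suggestions in the comment above (m as a multiple of 64, and recording the number of hash functions) can be illustrated with a minimal sketch. It uses the textbook Bloom filter sizing formulas, not the code from the attached patches; the function name `bloom_parameters` and the example inputs are assumptions made for illustration.

```python
import math


def bloom_parameters(expected_entries, fpp):
    """Return (num_bits, num_hashes), with num_bits a multiple of 64.

    Hypothetical helper: it applies the standard formulas
    m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2.
    """
    raw_bits = math.ceil(-expected_entries * math.log(fpp) / (math.log(2) ** 2))
    # Round up so the bit set fills a whole number of 64-bit words (longs).
    num_bits = ((raw_bits + 63) // 64) * 64
    # With m fixed by the array length, only k has to be recorded.
    num_hashes = max(1, round(num_bits / expected_entries * math.log(2)))
    return num_bits, num_hashes


# Default row index stride (10,000 rows) and fpp of 0.05, as in the thread.
num_bits, num_hashes = bloom_parameters(10_000, 0.05)
num_words = num_bits // 64
print(num_bits, num_words, num_hashes)
```

Because the bit count is always a whole number of words, a reader can recover m from the length of the stored long array, which is presumably why only the hash count needs to be recorded.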
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298628#comment-14298628 ]

Hive QA commented on HIVE-9188:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695438/HIVE-9188.6.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 7435 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto_mult_tables
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2584/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2584/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2584/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695438 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14298021#comment-14298021 ]

Gunther Hagleitner commented on HIVE-9188:
------------------------------------------

Should look at the test failures (probably unrelated). Otherwise: +1
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296359#comment-14296359 ]

Hive QA commented on HIVE-9188:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695124/HIVE-9188.5.patch

{color:red}ERROR:{color} -1 due to 56 failed/errored test(s), 7430 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_create
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_split_elimination
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join38
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_in
org.apache.hadoop.hive.ql.io.orc.TestColumnStatistics.testHasNull
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testMROutput
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testSplitElimination
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testSplitEliminationNullStats
org.apache.hive.hcatalog.mapreduce.TestHCatDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatExternalDynamicCustomLocation[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalPartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutablePartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatPartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.pig.TestE2EScenarios.testReadOrcAndRCFromPig
org.apache.hive.hcatalog.pig.TestHCatLoader.testProjectionsBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadPartitionedBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testMapNullKey[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testMapWithComplexData[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testSyntheticComplexSchema[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testTupleInBagInTupleInBag[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testBagNStruct[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDateCharTypes[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDynamicPartitioningMultiPartColsInDataNoSpec[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDynamicPartitioningMultiPartColsInDataPartialSpec[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testMultiPartColsInData[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testPartColsInData[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreFuncAllSimpleTypes[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreFuncSimple[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreInPartiitonedTbl[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreMultiTables[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreWithNoCtorArgs[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreWithNoSchema[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteChar[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate2[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate3[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimalXY[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimalX[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimal[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteSmallint[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteTimestamp[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteTinyint[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteVarchar[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStoreBasicTable[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStorePartitionedTable[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStoreTableMulti[3]
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results:
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275997#comment-14275997 ]

Owen O'Malley commented on HIVE-9188:
-------------------------------------

[~prasanth_j] Please remove the upper two levels of bloom filters. They are utterly useless. Their false positive rate will be far above 99%. They absolutely should not be stored in the column statistics. That will hurt the common ppd case and not help.
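The claim that the upper-level filters would have a false positive rate above 99% can be checked with the standard approximation fpp = (1 - e^(-kn/m))^k. The sketch below assumes a filter of 62,400 bits with 4 hash functions (roughly what n = 10,000 and p = 0.05 give) and a hypothetical stripe holding 100,000 distinct values; none of these figures come from the patch itself.

```python
import math


def expected_fpp(num_bits, num_hashes, inserted):
    """Approximate false positive probability after `inserted` distinct values."""
    return (1.0 - math.exp(-num_hashes * inserted / num_bits)) ** num_hashes


NUM_BITS = 62_400   # assumed: sized for 10,000 entries at fpp 0.05
NUM_HASHES = 4      # assumed: optimal k for that size

row_group_fpp = expected_fpp(NUM_BITS, NUM_HASHES, 10_000)
stripe_fpp = expected_fpp(NUM_BITS, NUM_HASHES, 100_000)  # hypothetical stripe
print(f"row group: {row_group_fpp:.3f}, stripe: {stripe_fpp:.4f}")
```

Under these assumptions the row-group filter stays near 0.05, while the same-sized filter holding ten times as many values answers "maybe present" for almost every probe.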
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269892#comment-14269892 ]

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

[~gopalv] Thanks for the review comments! I will address all the concerns in the next patch after rebasing with HIVE-4639.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268907#comment-14268907 ]

Gopal V commented on HIVE-9188:
-------------------------------

Left some comments, particularly about the encoding of the bloom filter itself. The List<Long> is a bad idea, as the 2nd long in the list is actually a double containing the fpp value.

Otherwise the patch looks good. I've added it to the build right now and will ETL in a bunch of the NYC taxi data with this and run some point-scan queries.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268573#comment-14268573 ]

Owen O'Malley commented on HIVE-9188:
-------------------------------------

[~prasanth_j] Ok, I thought that you said that you were going to have bloom filters at row group, stripe, and file level. I agree completely that ORC should only have bloom filters at the row group level.

Having the bloom filter as a separate stream means the reader does *far* less IO. It will still go through the code that merges adjacent ranges together into a single read. So if you need all of the indexes and bloom filters for all of the columns, the reader should read them in a single IO operation. On the other hand, if it doesn't need any bloom filter, it shouldn't have to load the extra MB of data it doesn't need.
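The point about adjacent ranges being merged into a single read can be pictured with a small sketch. This is not the ORC reader's code: the function `merge_ranges`, the half-open `(start, end)` byte ranges, and the stream offsets are all hypothetical, chosen only to show why placing index and bloom filter streams next to each other need not cost an extra IO.

```python
def merge_ranges(ranges, max_gap=0):
    """Coalesce half-open byte ranges that touch or lie within max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap:
            # Extend the previous read instead of issuing a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


# Hypothetical layout: a row index stream, a bloom filter stream placed
# right after it, another index stream, and an unrelated distant stream.
wanted = [(0, 100), (100, 8_000), (8_000, 8_100), (20_000, 20_100)]
reads = merge_ranges(wanted)
print(reads)
```

With this layout, the three contiguous streams collapse into one read, and a query that skips the bloom filter simply leaves its range out of `wanted`.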
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268632#comment-14268632 ]

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

[~owen.omalley] The current patch has bloom filters at all 3 levels. The size is kept constant for all 3 levels. But fpp for stripe will be 0.05 (assuming 10k unique items) and for file it will be much worse. With this we will get good row group elimination and considerably good stripe elimination. I can drop the file level bloom filter, which we don't use for any purpose.

The merging of disk ranges happens after we pick the row groups that satisfy the SARG (readPartialDataStreams() happens after pickRowGroups()). But we need the bloom filter before that for eliminating row groups.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268176#comment-14268176 ]

Owen O'Malley commented on HIVE-9188:
-------------------------------------

[~gopalv] I don't understand your concern. The indexes are already stored in ROW_INDEX streams. I'm just saying that the bloom filters, which are much larger than the rest of the ROW_INDEX, should be split into a BLOOM_FILTER stream instead of bundled in with the ROW_INDEX stream. That would let you load just the ROW_INDEX if you don't need the bloom filter.

The size of the bloom filter needs to be changed relative to the number of items. You've sized them for the default row group size (n = 10,000, p = 0.05) - 7.8kb. To use them at the file level, you'd need to make the bloom filters much much much larger. For a file with 100 million values in a column, you'd need a 74mb bloom filter.

I'd propose that you only do the bloom filters at the row group level and scale them to match the row index stride rather than just use the default 10k.
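The two sizes quoted in the comment above follow from the standard sizing formula m = -n ln(p) / (ln 2)^2. The sketch below reproduces them; the helper name `bloom_size_bytes` is hypothetical, and reading "7.8kb" as about 7,800 bytes and "74mb" as about 74 MiB is an interpretation, since the comment does not state its units precisely.

```python
import math


def bloom_size_bytes(num_entries, fpp):
    """Optimal Bloom filter size in bytes for num_entries values at the given fpp."""
    bits = -num_entries * math.log(fpp) / (math.log(2) ** 2)
    return bits / 8


row_group_bytes = bloom_size_bytes(10_000, 0.05)        # default row index stride
file_level_bytes = bloom_size_bytes(100_000_000, 0.05)  # 100 million values

print(f"row group: {row_group_bytes:,.0f} bytes")
print(f"file level: {file_level_bytes / 2**20:.1f} MiB")

# Scaling to the configured row index stride, as proposed in the comment.
for stride in (5_000, 10_000, 50_000):
    print(stride, round(bloom_size_bytes(stride, 0.05)))
```

Since the size grows linearly with the number of entries, sizing from the actual row index stride keeps the fpp target without over-allocating for small strides.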
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268215#comment-14268215 ]

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

The 0.05 fpp is guaranteed only at the row index stride level, which is 10k rows by default. Merging the bloom filters to higher levels (stripe, file) will increase the fpp while keeping the size constant. We will get a worse fpp if we exceed the expected number of insertions at the stripe level. We don't really need the file level bloom filter, as it's not useful considering we have stripe level statistics.

If we have the bloom filter in the row index, we can read it in a single IO per stripe. But we will end up reading the bloom filters of columns that do not participate in the predicate. On the other hand, if we have the bloom filter as a separate stream, we will end up with an extra IO op per stripe to read the bloom filter. Also, having it as a separate stream has additional costs (a boolean flag in the row index to know if we have a bloom filter for that column, position information).
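Merging filters to a higher level while keeping the size constant amounts to OR-ing the bit sets, which is why the fpp can only get worse. The toy class below is a sketch under stated assumptions, not the patch's BloomFilter: `TinyBloom`, its SHA-256 based double hashing, and the 1,024-bit size are all invented for illustration.

```python
import hashlib


class TinyBloom:
    """Toy Bloom filter backed by a Python int used as a bit set."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Double hashing: derive k positions from two 64-bit halves of a digest.
        digest = hashlib.sha256(str(value).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))

    def merge(self, other):
        # Only filters with identical size and hash count can be OR-ed.
        assert (self.num_bits, self.num_hashes) == (other.num_bits, other.num_hashes)
        merged = TinyBloom(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged

    def fill_ratio(self):
        return bin(self.bits).count("1") / self.num_bits


group_a = TinyBloom(1024, 4)
group_b = TinyBloom(1024, 4)
for v in range(0, 100):
    group_a.add(v)
for v in range(100, 200):
    group_b.add(v)

stripe = group_a.merge(group_b)  # same size, but now holds both row groups
print(group_a.fill_ratio(), group_b.fill_ratio(), stripe.fill_ratio())
```

The merged filter never loses a member, but its fill ratio is at least that of either input, so each additional row group folded in pushes the false positive rate upward.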
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267993#comment-14267993 ]

Owen O'Malley commented on HIVE-9188:
-------------------------------------

I'm concerned about the size of the bloom filters and making them an integrated part of the column statistics. I think we'd do much better to make a BLOOM_FILTER stream kind and place them in a completely separate stream. That would allow the predicate push down to only load the bloom filters for the columns that it needs.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268002#comment-14268002 ]

Gopal V commented on HIVE-9188:
-------------------------------

[~owen.omalley]: the stream has the issue that it's read after the disk ranges are computed ( read). So we don't get the IO savings with the stream approach.

The row-group stats is the only bit of data that is read ahead of the actual HDFS IO ops, which lets us skip the reads off the disk.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267032#comment-14267032 ]

Prasanth Jayachandran commented on HIVE-9188:
---------------------------------------------

This patch needs to be rebased after HIVE-4639, as both patches touch the same set of files.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264436#comment-14264436 ]

Hive QA commented on HIVE-9188:
-------------------------------

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12690039/HIVE-9188.4.patch

{color:green}SUCCESS:{color} +1 6748 tests passed

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2250/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2250/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2250/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12690039 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263450#comment-14263450 ]

Hive QA commented on HIVE-9188:
-------------------------------

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689910/HIVE-9188.3.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 6748 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2240/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2240/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2240/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689910 - PreCommit-HIVE-TRUNK-Build
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263430#comment-14263430 ]

Hive QA commented on HIVE-9188:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689909/HIVE-9188.2.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 6748 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2239/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2239/console
Test logs: http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2239/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689909 - PreCommit-HIVE-TRUNK-Build

BloomFilter in ORC row group index
--
Key: HIVE-9188
URL: https://issues.apache.org/jira/browse/HIVE-9188
Project: Hive
Issue Type: New Feature
Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
Labels: orcfile
Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263455#comment-14263455 ]

Lefty Leverenz commented on HIVE-9188:
--

Great, thanks [~prasanth_j].

BloomFilter in ORC row group index
--
Key: HIVE-9188
URL: https://issues.apache.org/jira/browse/HIVE-9188
Project: Hive
Issue Type: New Feature
Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
Labels: orcfile
Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255455#comment-14255455 ]

Hive QA commented on HIVE-9188:
---

{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12688602/HIVE-9188.1.patch

{color:red}ERROR:{color} -1 due to 52 failed/errored test(s), 6742 tests executed

*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_project
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_whole_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_acid_dynamic_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_nonacid_from_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_update_delete
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_dynamic_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nonmr_fetch
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_after_multiple_inserts
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_all_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_all_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_two_cols
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_virtual_column
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_whole_partition
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_update_delete
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_values_dynamic_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_values_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_after_multiple_inserts
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_all_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_all_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_two_cols
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_partitioned
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_acid_overwrite
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testCombinationInputFormatWithAcid
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.testStatsAfterCompactionPartTbl
org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
{noformat}

Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2160/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2160/console
Test logs: