[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-02-02 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14302507#comment-14302507
 ] 

Owen O'Malley commented on HIVE-9188:
-

Suggestions:
* Pick m to always be a multiple of 64 (since you are using longs are the 
representation)
* change the representation of BloomFilter in orc_proto to record the number of 
hash functions and not the size or fpp.
* use fixed64 for the bit field
* you'll also need to update the specification in the wiki with the change to 
the format 
(https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification)
* revert the spurious change to CliDriver.java
* revert the spurious change to .gitignore
* it seems suboptimal to convert long values to bytes before hashing


 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-30 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298628#comment-14298628
 ] 

Hive QA commented on HIVE-9188:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695438/HIVE-9188.6.patch

{color:red}ERROR:{color} -1 due to 2 failed/errored test(s), 7435 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_index_auto_mult_tables
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2584/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2584/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2584/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 2 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12695438 - PreCommit-HIVE-TRUNK-Build

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-29 Thread Gunther Hagleitner (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14298021#comment-14298021
 ] 

Gunther Hagleitner commented on HIVE-9188:
--

should look at test failures (probably unrelated). Otherwise: +1

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch, HIVE-9188.5.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-28 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14296359#comment-14296359
 ] 

Hive QA commented on HIVE-9188:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12695124/HIVE-9188.5.patch

{color:red}ERROR:{color} -1 due to 56 failed/errored test(s), 7430 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_create
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_split_elimination
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_join38
org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_subquery_in
org.apache.hadoop.hive.ql.io.orc.TestColumnStatistics.testHasNull
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testMROutput
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testSplitElimination
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testSplitEliminationNullStats
org.apache.hive.hcatalog.mapreduce.TestHCatDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalDynamicPartitioned.testHCatExternalDynamicCustomLocation[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatExternalPartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableDynamicPartitioned.testHCatDynamicPartitionedTableMultipleTask[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableDynamicPartitioned.testHCatDynamicPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutableNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatMutablePartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatNonPartitioned.testHCatNonPartitionedTable[3]
org.apache.hive.hcatalog.mapreduce.TestHCatPartitioned.testHCatPartitionedTable[3]
org.apache.hive.hcatalog.pig.TestE2EScenarios.testReadOrcAndRCFromPig
org.apache.hive.hcatalog.pig.TestHCatLoader.testProjectionsBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadDataBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoader.testReadPartitionedBasic[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testMapNullKey[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testMapWithComplexData[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testSyntheticComplexSchema[3]
org.apache.hive.hcatalog.pig.TestHCatLoaderComplexSchema.testTupleInBagInTupleInBag[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testBagNStruct[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDateCharTypes[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDynamicPartitioningMultiPartColsInDataNoSpec[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testDynamicPartitioningMultiPartColsInDataPartialSpec[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testMultiPartColsInData[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testPartColsInData[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreFuncAllSimpleTypes[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreFuncSimple[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreInPartiitonedTbl[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreMultiTables[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreWithNoCtorArgs[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testStoreWithNoSchema[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteChar[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate2[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate3[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDate[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimalXY[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimalX[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteDecimal[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteSmallint[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteTimestamp[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteTinyint[3]
org.apache.hive.hcatalog.pig.TestHCatStorer.testWriteVarchar[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStoreBasicTable[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStorePartitionedTable[3]
org.apache.hive.hcatalog.pig.TestHCatStorerMulti.testStoreTableMulti[3]
org.apache.hive.hcatalog.templeton.TestWebHCatE2e.getHiveVersion
{noformat}

Test results: 

[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-13 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275997#comment-14275997
 ] 

Owen O'Malley commented on HIVE-9188:
-

[~prasanth_j] Please remove the upper two levels of bloom filters. They are 
utterly useless. Their false positive rate will be far above 99%.

They absolutely should not be stored in the column statistics. That will hurt 
the common ppd case and not help.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-08 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14269892#comment-14269892
 ] 

Prasanth Jayachandran commented on HIVE-9188:
-

[~gopalv] Thanks for the review comments! I will address all the concerns in 
the next patch after rebasing with HIVE-4639.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268907#comment-14268907
 ] 

Gopal V commented on HIVE-9188:
---

Left some comments, particularly about the encoding of the bloom filter itself.

The ListLong is a bad idea as the 2nd long in the list is actually a double 
containing the fpp value.

Otherwise the patch looks good. I've added it to the build right now, will ETL 
in a bunch of the NYC taxi data with this and run some point-scan queries.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268573#comment-14268573
 ] 

Owen O'Malley commented on HIVE-9188:
-

[~prasanth_j] Ok, I thought that you said that you were going to have bloom 
filters at row group, stripe, and file level. I agree completely that ORC 
should only have bloom filters at the row group level.

Having the bloom filter as a separate stream means the reader does *far* less 
IO. It will still go through the code that merges adjacent ranges together into 
a single read. So if you need all of the indexes and bloom filters for all of 
the columns the reader should read them in a single IO operation. On the other 
hand, if it doesn't need any bloom filter it shouldn't have to load the extra 
mb of data it doesn't need.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268632#comment-14268632
 ] 

Prasanth Jayachandran commented on HIVE-9188:
-

[~owen.omalley] Current patch has bloom filters at all 3 levels. The size is 
kept constant for all 3 levels. But fpp for stripe will be 0.05 (assuming 10k 
unique items) and for file it will be much worse. With this we will get good 
row group elimination and considerably good stripe elimination. I can drop the 
file level bloom filter which we don't use for any purpose.

The merging of disk ranges happens after we pick the row groups that satisfy 
the SARG (readPartialDataStreams() happens after pickRowGroups()). But we need 
bloom filter before that for eliminating row groups.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268176#comment-14268176
 ] 

Owen O'Malley commented on HIVE-9188:
-

[~gopalv] I don't understand your concern. The indexes are already stored in 
ROW_INDEX streams. I'm just saying that the bloom filters, which are much 
larger than the rest of the ROW_INDEX be split into a BLOOM_FILTER stream 
instead of bundled in with the ROW_INDEX stream. That would let you load just 
the ROW_INDEX if you don't need the bloom filter.

The size of the bloom filter needs to be changed relative to the number of 
items. You've sized them for the default row group size (n = 10,000, p=0.05) - 
7.8kb. To use them at the file level, you'd need to make the bloom filters much 
much much larger. For a file with 100 million values in a column, you'd need a 
74mb bloom filter. I'd propose that you only do the bloom filters at the row 
group level and scale them to match the row index stride rather than just use 
the default 10k.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268215#comment-14268215
 ] 

Prasanth Jayachandran commented on HIVE-9188:
-

The 0.05 fpp is guaranteed only at row index stride level that 10k rows by 
default. Merging the bloom filter to higher levels (stripe,file) will increase 
the fpp keeping the size constant. We will get worse fpp if we exceed the 
number of insertions in stripe level. We don't really need the file level bloom 
filter as its not useful considering we have stripe level statistics. 

If we have the bloom filter in row index we can read it in single IO per 
stripe. But we will end up reading the bloom filters of columns that does not 
participate in bloom filter. On the other hand if we have bloom filter as a 
separate stream we will end up with an extra IO op per stripe to read the bloom 
filter. Also having it as separate stream has additional costs (boolean flag in 
row index to know if we bloom filter for that column, position information). 

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267993#comment-14267993
 ] 

Owen O'Malley commented on HIVE-9188:
-

I'm concerned about the size of the bloom filters and making them an integrated 
part of the column statistics. I think we'd do much better to make a 
BLOOM_FILTER stream kind and place them in a completely separate stream. That 
would allow the predicate push down to only load the bloom filters for the 
columns that it needs.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-07 Thread Gopal V (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14268002#comment-14268002
 ] 

Gopal V commented on HIVE-9188:
---

[~owen.omalley]: the stream has the issue that it's read after the disk ranges 
are computed ( read). So we don't get the IO savings with the stream approach.

The row-group stats is the only bit of data that is read ahead of the actual 
HDFS IO ops, which lets us skip the reads off the disk.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-06 Thread Prasanth Jayachandran (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14267032#comment-14267032
 ] 

Prasanth Jayachandran commented on HIVE-9188:
-

This patch needs to be rebased after HIVE-4639 as both patches touches the same 
set of files.

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-05 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14264436#comment-14264436
 ] 

Hive QA commented on HIVE-9188:
---



{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12690039/HIVE-9188.4.patch

{color:green}SUCCESS:{color} +1 6748 tests passed

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2250/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2250/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2250/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12690039 - PreCommit-HIVE-TRUNK-Build

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, 
 HIVE-9188.4.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-02 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263450#comment-14263450
 ] 

Hive QA commented on HIVE-9188:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689910/HIVE-9188.3.patch

{color:red}ERROR:{color} -1 due to 4 failed/errored test(s), 6748 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_optimize_nullscan
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2240/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2240/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2240/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 4 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689910 - PreCommit-HIVE-TRUNK-Build

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-02 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263430#comment-14263430
 ] 

Hive QA commented on HIVE-9188:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12689909/HIVE-9188.2.patch

{color:red}ERROR:{color} -1 due to 3 failed/errored test(s), 6748 tests executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2239/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2239/console
Test logs: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-TRUNK-Build-2239/

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 3 tests failed
{noformat}

This message is automatically generated.

ATTACHMENT ID: 12689909 - PreCommit-HIVE-TRUNK-Build

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2015-01-02 Thread Lefty Leverenz (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14263455#comment-14263455
 ] 

Lefty Leverenz commented on HIVE-9188:
--

Great, thanks [~prasanth_j].

 BloomFilter in ORC row group index
 --

 Key: HIVE-9188
 URL: https://issues.apache.org/jira/browse/HIVE-9188
 Project: Hive
  Issue Type: New Feature
  Components: File Formats
Affects Versions: 0.15.0
Reporter: Prasanth Jayachandran
Assignee: Prasanth Jayachandran
  Labels: orcfile
 Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch


 BloomFilters are well known probabilistic data structure for set membership 
 checking. We can use bloom filters in ORC index for better row group pruning. 
 Currently, ORC row group index uses min/max statistics to eliminate row 
 groups (stripes as well) that do not satisfy predicate condition specified in 
 the query. But in some cases, the efficiency of min/max based elimination is 
 not optimal (unsorted columns with wide range of entries). Bloom filters can 
 be an effective and efficient alternative for row group/split elimination for 
 point queries or queries with IN clause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index

2014-12-21 Thread Hive QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14255455#comment-14255455
 ] 

Hive QA commented on HIVE-9188:
---



{color:red}Overall{color}: -1 at least one tests failed

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12688602/HIVE-9188.1.patch

{color:red}ERROR:{color} -1 due to 52 failed/errored test(s), 6742 tests 
executed
*Failed tests:*
{noformat}
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_join
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_vectorization_project
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_all_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_where_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_delete_whole_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_dynpart_sort_optimization_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_acid_dynamic_partition
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_nonacid_from_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_update_delete
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_dynamic_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_insert_values_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nonmr_fetch
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_transform_acid
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_after_multiple_inserts
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_all_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_all_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_tmp_table
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_two_cols
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_non_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_update_where_partitioned
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_virtual_column
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_all_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_no_match
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_where_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_delete_whole_partition
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_update_delete
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_values_dynamic_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_insert_values_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_after_multiple_inserts
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_all_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_all_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_tmp_table
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_two_cols
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_no_match
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_non_partitioned
org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_update_where_partitioned
org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_acid_overwrite
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDictionaryThreshold
org.apache.hadoop.hive.ql.io.orc.TestFileDump.testDump
org.apache.hadoop.hive.ql.io.orc.TestInputOutputFormat.testCombinationInputFormatWithAcid
org.apache.hadoop.hive.ql.txn.compactor.TestCompactor.testStatsAfterCompactionPartTbl
org.apache.hive.hcatalog.streaming.TestStreaming.testEndpointConnection
{noformat}

Test results: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2160/testReport
Console output: 
http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/2160/console
Test logs: