[jira] [Updated] (HIVE-4359) Remove old versions of the javadoc
[ https://issues.apache.org/jira/browse/HIVE-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4359: Attachment: h-4359.patch Combined with {code} % svn rm publish/docs/r0.{3,4,5,6,7,8}.0 {code} > Remove old versions of the javadoc > -- > > Key: HIVE-4359 > URL: https://issues.apache.org/jira/browse/HIVE-4359 > Project: Hive > Issue Type: Task > Components: Website >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: h-4359.patch > > > Delete the old versions of the javadoc. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (HIVE-4359) Remove old versions of the javadoc
[ https://issues.apache.org/jira/browse/HIVE-4359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley resolved HIVE-4359. - Resolution: Fixed I just committed this. > Remove old versions of the javadoc > -- > > Key: HIVE-4359 > URL: https://issues.apache.org/jira/browse/HIVE-4359 > Project: Hive > Issue Type: Task > Components: Website >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Attachments: h-4359.patch > > > Delete the old versions of the javadoc.
[jira] [Commented] (HIVE-4189) ORC fails with String column that ends in lots of nulls
[ https://issues.apache.org/jira/browse/HIVE-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634583#comment-13634583 ] Owen O'Malley commented on HIVE-4189: - +1 looks good. > ORC fails with String column that ends in lots of nulls > --- > > Key: HIVE-4189 > URL: https://issues.apache.org/jira/browse/HIVE-4189 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Affects Versions: 0.11.0 >Reporter: Kevin Wilfong >Assignee: Kevin Wilfong > Attachments: HIVE-4189.1.patch.txt, HIVE-4189.2.patch.txt > > > When ORC attempts to write out a string column that ends in enough nulls to > span an index stride, StringTreeWriter's writeStripe method will get an > exception from TreeWriter's writeStripe method > Column has wrong number of index entries found: x expected: y > This is caused by rowIndexValueCount having multiple entries equal to the > number of non-null rows in the column, combined with the fact that > StringTreeWriter has special logic for constructing its index.
[jira] [Commented] (HIVE-4305) Use a single system for dependency resolution
[ https://issues.apache.org/jira/browse/HIVE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13634624#comment-13634624 ] Owen O'Malley commented on HIVE-4305: - Carl, I fully acknowledge that ant vs maven is a religious discussion. However, to back up my five points:
* IDE support is much better. From http://www.jetbrains.com/idea/features/ant_maven.html : Maven integration reads the files and builds the modules and dependencies between them. Ant integration executes ant targets. Eclipse is similar. For Maven projects, you don't need to maintain a set of helper files that set up the project in the IDE; the IDE can build it automatically. Even with our eclipse helper scripts, users give up on building Hive in an IDE.
* Offline support is much better. Try turning off the internet and building Hive: it is relatively difficult. Maven will just work if you have the required jars in your cache.
* You can download a Maven project and build it without reading the build file. This is obviously true from the fundamentals of each system. Ant provides a wide-open playing field: you can build "tar" in one project and "package" in another, and there are no rules. In Maven, I know what "package" will build.
* Publishing to Maven central is much easier. Ivy can't publish to Maven central, so you end up using ant's maven tasks to publish. This requires that you have two different descriptions of the project's dependencies: one for ivy and one for ant's maven tasks. Furthermore, based on my experience as the release manager for Hadoop, ant's maven tasks are much more error-prone, and they don't support features like storing your password encrypted.
* Profiles work much better in Maven. Ok, this one is debatable. In my opinion, Maven profiles are cleaner and better designed.
Finally, I fully support Brock's point:
* Maven is used by the other Hadoop ecosystem projects.
Hadoop in particular was using ant, ivy, and maven ant tasks for a long time and traded them in for Maven. There is significant value in using similar tools. > Use a single system for dependency resolution > - > > Key: HIVE-4305 > URL: https://issues.apache.org/jira/browse/HIVE-4305 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure, HCatalog >Reporter: Travis Crawford > > Both Hive and HCatalog use ant as their build tool. However, Hive uses ivy > for dependency resolution while HCatalog uses maven-ant-tasks. With the > project merge we should converge on a single tool for dependency resolution.
[jira] [Commented] (HIVE-4305) Use a single system for dependency resolution
[ https://issues.apache.org/jira/browse/HIVE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635822#comment-13635822 ] Owen O'Malley commented on HIVE-4305: - bq. I have good news and bad news. I have better news: Maven handles offline by just passing in "-o". > Use a single system for dependency resolution > - > > Key: HIVE-4305 > URL: https://issues.apache.org/jira/browse/HIVE-4305 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure, HCatalog >Reporter: Travis Crawford >Assignee: Carl Steinbach > > Both Hive and HCatalog use ant as their build tool. However, Hive uses ivy > for dependency resolution while HCatalog uses maven-ant-tasks. With the > project merge we should converge on a single tool for dependency resolution.
[jira] [Commented] (HIVE-4305) Use a single system for dependency resolution
[ https://issues.apache.org/jira/browse/HIVE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13635867#comment-13635867 ] Owen O'Malley commented on HIVE-4305: - Carl, the critical point is that you are having to fix the ant build file to make offline work. In Maven, it is built in and thus we have less to maintain. > Use a single system for dependency resolution > - > > Key: HIVE-4305 > URL: https://issues.apache.org/jira/browse/HIVE-4305 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure, HCatalog >Reporter: Travis Crawford >Assignee: Carl Steinbach > > Both Hive and HCatalog use ant as their build tool. However, Hive uses ivy > for dependency resolution while HCatalog uses maven-ant-tasks. With the > project merge we should converge on a single tool for dependency resolution.
[jira] [Updated] (HIVE-4178) ORC fails with files with different numbers of columns
[ https://issues.apache.org/jira/browse/HIVE-4178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4178: Resolution: Fixed Fix Version/s: 0.11.0 Status: Resolved (was: Patch Available) I just committed this to trunk and branch-11. Thanks, Kevin! > ORC fails with files with different numbers of columns > -- > > Key: HIVE-4178 > URL: https://issues.apache.org/jira/browse/HIVE-4178 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Affects Versions: 0.11.0 >Reporter: Kevin Wilfong >Assignee: Kevin Wilfong > Fix For: 0.11.0 > > Attachments: HIVE-4178.1.patch.txt > > > When CombineHiveInputFormat is used, it's possible that two files with > different numbers of columns can be included in the same split, in which case > Hive will fail at one of several points with an > ArrayIndexOutOfBoundsException. > This can happen when a partition contains empty files or two partitions are > read with different numbers of columns.
[jira] [Commented] (HIVE-4305) Use a single system for dependency resolution
[ https://issues.apache.org/jira/browse/HIVE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13636534#comment-13636534 ] Owen O'Malley commented on HIVE-4305: - Carl, Rather than debate it theoretically or compare it to Hadoop, which has a *LOT* more complexity in its build, I propose that we have Travis make a Maven build file for the combined Hive and HCat systems. Then we can debate the value and issues in the particular patch and how to move the project forward. The current state is painful with extremely long builds. We need to move forward and enable the project to evolve quickly so that Hive can compete with its many commercial competitors. > Use a single system for dependency resolution > - > > Key: HIVE-4305 > URL: https://issues.apache.org/jira/browse/HIVE-4305 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure, HCatalog >Reporter: Travis Crawford >Assignee: Carl Steinbach > > Both Hive and HCatalog use ant as their build tool. However, Hive uses ivy > for dependency resolution while HCatalog uses maven-ant-tasks. With the > project merge we should converge on a single tool for dependency resolution.
[jira] [Updated] (HIVE-4189) ORC fails with String column that ends in lots of nulls
[ https://issues.apache.org/jira/browse/HIVE-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4189: Resolution: Fixed Fix Version/s: 0.11.0 Status: Resolved (was: Patch Available) I just committed this to trunk and branch-0.11. Thanks, Kevin! > ORC fails with String column that ends in lots of nulls > --- > > Key: HIVE-4189 > URL: https://issues.apache.org/jira/browse/HIVE-4189 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Affects Versions: 0.11.0 >Reporter: Kevin Wilfong >Assignee: Kevin Wilfong > Fix For: 0.11.0 > > Attachments: HIVE-4189.1.patch.txt, HIVE-4189.2.patch.txt > > > When ORC attempts to write out a string column that ends in enough nulls to > span an index stride, StringTreeWriter's writeStripe method will get an > exception from TreeWriter's writeStripe method > Column has wrong number of index entries found: x expected: y > This is caused by rowIndexValueCount having multiple entries equal to the > number of non-null rows in the column, combined with the fact that > StringTreeWriter has special logic for constructing its index.
[jira] [Commented] (HIVE-4305) Use a single system for dependency resolution
[ https://issues.apache.org/jira/browse/HIVE-4305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13637405#comment-13637405 ] Owen O'Malley commented on HIVE-4305: - {quote} Owen, please give some concrete examples of things that make Hadoop's build more complex than Hive's. {quote} * It contains native executables. * It contains native libraries. * It contains jni libraries. {quote} I think it would be more pragmatic to spend time improving the build that we currently have {quote} Moving to Maven would be making it better in the opinion of the majority of the development community. The current Hive build is a complex mess, and the combination of Ivy and maven ant tasks is really hard to debug. Certainly, I believe it is possible to make things worse with Maven. I'm not a fan of how the Hadoop mavenization was done and I deeply regret not taking the time to make it better as it went in, but it was still better than the ant + ivy + maven ant tasks that we had. If it hadn't been, it would have been rejected. That said, in my experience most projects are better off with Maven builds than ant + ivy + maven ant tasks. > Use a single system for dependency resolution > - > > Key: HIVE-4305 > URL: https://issues.apache.org/jira/browse/HIVE-4305 > Project: Hive > Issue Type: Improvement > Components: Build Infrastructure, HCatalog >Reporter: Travis Crawford >Assignee: Carl Steinbach > > Both Hive and HCatalog use ant as their build tool. However, Hive uses ivy > for dependency resolution while HCatalog uses maven-ant-tasks. With the > project merge we should converge on a single tool for dependency resolution.
[jira] [Created] (HIVE-4421) Improve memory usage by ORC dictionaries
Owen O'Malley created HIVE-4421: --- Summary: Improve memory usage by ORC dictionaries Key: HIVE-4421 URL: https://issues.apache.org/jira/browse/HIVE-4421 Project: Hive Issue Type: Bug Components: Serializers/Deserializers Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, for tables with many string columns, it is possible to significantly underestimate the memory used by the ORC dictionaries and cause the query to run out of memory in the task.
[jira] [Updated] (HIVE-4421) Improve memory usage by ORC dictionaries
[ https://issues.apache.org/jira/browse/HIVE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-4421: Fix Version/s: 0.11.0 Status: Patch Available (was: Open) This patch does three things: * Improves the memory usage while writing ORC dictionaries by removing the counts and just storing offsets instead of offsets and lengths. * Improves the tracking of how much memory is used by the dictionaries by tracking the allocation rather than the usage. * Reduces some of the allocation sizes of the integer arrays. > Improve memory usage by ORC dictionaries > > > Key: HIVE-4421 > URL: https://issues.apache.org/jira/browse/HIVE-4421 > Project: Hive > Issue Type: Bug > Components: Serializers/Deserializers >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Fix For: 0.11.0 > > Attachments: HIVE-4421.D10545.1.patch > > > Currently, for tables with many string columns, it is possible to > significantly underestimate the memory used by the ORC dictionaries and cause > the query to run out of memory in the task.
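The offsets-only scheme described in the first bullet above can be sketched as follows. This is an illustrative model, not ORC's actual implementation; the class and method names are hypothetical.

```python
# Sketch: store a string dictionary as one concatenated byte buffer plus an
# offsets array. Entry i's length is offsets[i+1] - offsets[i], so per-entry
# lengths (and counts) never need to be stored separately.
class DictionaryBuffer:
    def __init__(self):
        self.data = bytearray()  # all entries, concatenated
        self.offsets = [0]       # offsets[i] = start of entry i in self.data

    def add(self, value: bytes) -> int:
        """Append a value and return its dictionary id."""
        self.data.extend(value)
        self.offsets.append(len(self.data))
        return len(self.offsets) - 2

    def get(self, i: int) -> bytes:
        """Recover entry i using only the offsets array."""
        return bytes(self.data[self.offsets[i]:self.offsets[i + 1]])

    def memory_used(self) -> int:
        # Per the second bullet, track what was allocated rather than what is
        # in use (here approximated as buffer bytes plus 4 bytes per offset).
        return len(self.data) + 4 * len(self.offsets)
```

For example, after `add(b"hive")` and `add(b"orc")`, `get(1)` reconstructs `b"orc"` from offsets 4 and 7 alone.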
[jira] [Created] (HIVE-4464) Hive's JDBC module doesn't compile under openjdk 7
Owen O'Malley created HIVE-4464: --- Summary: Hive's JDBC module doesn't compile under openjdk 7 Key: HIVE-4464 URL: https://issues.apache.org/jira/browse/HIVE-4464 Project: Hive Issue Type: Task Reporter: Owen O'Malley Assignee: Owen O'Malley Hive currently fails to compile when compiled with openjdk 7.
[jira] [Commented] (HIVE-8880) non-synchronized access to split list in OrcInputFormat
[ https://issues.apache.org/jira/browse/HIVE-8880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236177#comment-14236177 ] Owen O'Malley commented on HIVE-8880: - +1, this is good. > non-synchronized access to split list in OrcInputFormat > --- > > Key: HIVE-8880 > URL: https://issues.apache.org/jira/browse/HIVE-8880 > Project: Hive > Issue Type: Bug >Affects Versions: 0.14.0 >Reporter: Alan Gates >Assignee: Alan Gates > Fix For: 0.14.1 > > Attachments: HIVE-8880.patch > > > When adding delta files to the list of orc splits access to the list is not > synchronized though it is shared across threads. All other additions to the > list are synchronized. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14240415#comment-14240415 ] Owen O'Malley commented on HIVE-8966: - Alan, your patch looks good. +1 > Delta files created by hive hcatalog streaming cannot be compacted > -- > > Key: HIVE-8966 > URL: https://issues.apache.org/jira/browse/HIVE-8966 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 0.14.0 > Environment: hive >Reporter: Jihong Liu >Assignee: Alan Gates >Priority: Critical > Fix For: 0.14.1 > > Attachments: HIVE-8966.2.patch, HIVE-8966.patch > > > hive hcatalog streaming will also create a file like bucket_n_flush_length in > each delta directory, where "n" is the bucket number. But > compactor.CompactorMR thinks this file also needs to be compacted. However this > file of course cannot be compacted, so compactor.CompactorMR will not > continue to do the compaction. > In a test, after removing the bucket_n_flush_length file, the "alter > table partition compact" finished successfully. If that file isn't deleted, > nothing is compacted. > This is probably a very high severity bug. Both 0.13 and 0.14 have this issue
[jira] [Commented] (HIVE-9166) Place an upper bound for SARG CNF conversion
[ https://issues.apache.org/jira/browse/HIVE-9166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14252406#comment-14252406 ] Owen O'Malley commented on HIVE-9166: - +1, LGTM. You probably should add a test case where there is something other than the large CNF, something like (and leaf-1 (or ...)). You should end up with leaf-1 as your final expression. > Place an upper bound for SARG CNF conversion > > > Key: HIVE-9166 > URL: https://issues.apache.org/jira/browse/HIVE-9166 > Project: Hive > Issue Type: Bug >Affects Versions: 0.14.0, 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9166.1.patch, HIVE-9166.2.patch > > > SARG creation in ORC applies several optimizations to the expression tree. Among > them, CNF conversion is an exponential algorithm, as it finds all combinations > of expressions when converting from OR of AND form to AND of OR form (CNF). > We need an upper bound for this algorithm to prevent it from running for a long > time and generating a huge combinations list.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267993#comment-14267993 ] Owen O'Malley commented on HIVE-9188: - I'm concerned about the size of the bloom filters and making them an integrated part of the column statistics. I think we'd do much better to make a BLOOM_FILTER stream kind and place them in a completely separate stream. That would allow the predicate push down to only load the bloom filters for the columns that it needs. > BloomFilter in ORC row group index > -- > > Key: HIVE-9188 > URL: https://issues.apache.org/jira/browse/HIVE-9188 > Project: Hive > Issue Type: New Feature > Components: File Formats >Affects Versions: 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, > HIVE-9188.4.patch > > > BloomFilters are well known probabilistic data structure for set membership > checking. We can use bloom filters in ORC index for better row group pruning. > Currently, ORC row group index uses min/max statistics to eliminate row > groups (stripes as well) that do not satisfy predicate condition specified in > the query. But in some cases, the efficiency of min/max based elimination is > not optimal (unsorted columns with wide range of entries). Bloom filters can > be an effective and efficient alternative for row group/split elimination for > point queries or queries with IN clause.
[jira] [Commented] (HIVE-4639) Add has null flag to ORC internal index
[ https://issues.apache.org/jira/browse/HIVE-4639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268053#comment-14268053 ] Owen O'Malley commented on HIVE-4639: - You should encode four values: no_values, all_nulls, some_nulls, no_nulls. This will allow you to support a richer set of sargs. > Add has null flag to ORC internal index > --- > > Key: HIVE-4639 > URL: https://issues.apache.org/jira/browse/HIVE-4639 > Project: Hive > Issue Type: Improvement > Components: File Formats >Reporter: Owen O'Malley >Assignee: Prasanth Jayachandran > Attachments: HIVE-4639.1.patch > > > It would enable more predicate pushdown if we added a flag to the index entry > recording if there were any null values in the column for the 10k rows.
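The four-valued flag suggested in the comment above can drive richer SARG evaluation along these lines. This is a hypothetical sketch, not ORC's actual API; the names `NullState`, `can_skip_is_null`, and `can_skip_is_not_null` are illustrative.

```python
from enum import Enum

class NullState(Enum):
    NO_VALUES = 0   # row group has no values at all
    ALL_NULLS = 1   # every value in the row group is null
    SOME_NULLS = 2  # a mix of null and non-null values
    NO_NULLS = 3    # no nulls in the row group

def can_skip_is_null(state: NullState) -> bool:
    """For a `col IS NULL` predicate, skip row groups that cannot match."""
    return state in (NullState.NO_VALUES, NullState.NO_NULLS)

def can_skip_is_not_null(state: NullState) -> bool:
    """For a `col IS NOT NULL` predicate, skip row groups that cannot match."""
    return state in (NullState.NO_VALUES, NullState.ALL_NULLS)
```

A single boolean "has nulls" flag could not distinguish ALL_NULLS from SOME_NULLS, which is exactly the distinction an `IS NOT NULL` predicate needs.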
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268176#comment-14268176 ] Owen O'Malley commented on HIVE-9188: - [~gopalv] I don't understand your concern. The indexes are already stored in ROW_INDEX streams. I'm just saying that the bloom filters, which are much larger than the rest of the ROW_INDEX be split into a BLOOM_FILTER stream instead of bundled in with the ROW_INDEX stream. That would let you load just the ROW_INDEX if you don't need the bloom filter. The size of the bloom filter needs to be changed relative to the number of items. You've sized them for the default row group size (n = 10,000, p=0.05) -> 7.8kb. To use them at the file level, you'd need to make the bloom filters much much much larger. For a file with 100 million values in a column, you'd need a 74mb bloom filter. I'd propose that you only do the bloom filters at the row group level and scale them to match the row index stride rather than just use the default 10k. > BloomFilter in ORC row group index > -- > > Key: HIVE-9188 > URL: https://issues.apache.org/jira/browse/HIVE-9188 > Project: Hive > Issue Type: New Feature > Components: File Formats >Affects Versions: 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, > HIVE-9188.4.patch > > > BloomFilters are well known probabilistic data structure for set membership > checking. We can use bloom filters in ORC index for better row group pruning. > Currently, ORC row group index uses min/max statistics to eliminate row > groups (stripes as well) that do not satisfy predicate condition specified in > the query. But in some cases, the efficiency of min/max based elimination is > not optimal (unsorted columns with wide range of entries). 
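The sizes quoted in the comment above follow from the standard bloom filter sizing formula m = -n ln(p) / (ln 2)^2. A quick check (figures are approximate; the comment's numbers round the same way):

```python
import math

def bloom_filter_bytes(n: int, p: float) -> float:
    """Optimal bloom filter size in bytes for n items at false-positive rate p."""
    bits = -n * math.log(p) / (math.log(2) ** 2)
    return bits / 8

# Default row group: n = 10,000 and p = 0.05 gives ~7.8 kB per filter.
row_group = bloom_filter_bytes(10_000, 0.05)

# File level: 100 million values in a column needs roughly 74 MiB.
file_level = bloom_filter_bytes(100_000_000, 0.05)
```

This is why scaling the filter to the configured row index stride, rather than assuming the default 10k, matters: the required size grows linearly with the number of distinct insertions.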
Bloom filters can > be an effective and efficient alternative for row group/split elimination for > point queries or queries with IN clause.
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14268573#comment-14268573 ] Owen O'Malley commented on HIVE-9188: - [~prasanth_j] Ok, I thought that you said that you were going to have bloom filters at row group, stripe, and file level. I agree completely that ORC should only have bloom filters at the row group level. Having the bloom filter as a separate stream means the reader does *far* less IO. It will still go through the code that merges adjacent ranges together into a single read. So if you need all of the indexes and bloom filters for all of the columns the reader should read them in a single IO operation. On the other hand, if it doesn't need any bloom filter it shouldn't have to load the extra mb of data it doesn't need. > BloomFilter in ORC row group index > -- > > Key: HIVE-9188 > URL: https://issues.apache.org/jira/browse/HIVE-9188 > Project: Hive > Issue Type: New Feature > Components: File Formats >Affects Versions: 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, > HIVE-9188.4.patch > > > BloomFilters are well known probabilistic data structure for set membership > checking. We can use bloom filters in ORC index for better row group pruning. > Currently, ORC row group index uses min/max statistics to eliminate row > groups (stripes as well) that do not satisfy predicate condition specified in > the query. But in some cases, the efficiency of min/max based elimination is > not optimal (unsorted columns with wide range of entries). Bloom filters can > be an effective and efficient alternative for row group/split elimination for > point queries or queries with IN clause.
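The merging of adjacent byte ranges into a single read, mentioned in the comment above, can be sketched like this. It is a generic illustration of the technique, not Hive's actual reader code.

```python
def merge_ranges(ranges):
    """Coalesce (offset, length) ranges that touch or overlap into single reads."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            # This range touches or overlaps the previous one: grow the
            # previous range instead of issuing a second read.
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, max(prev_len, offset + length - prev_off))
        else:
            merged.append((offset, length))
    return merged
```

So a ROW_INDEX stream immediately followed by a BLOOM_FILTER stream collapses into one IO operation, while a reader that skips the bloom filters simply never adds those ranges to the list.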
[jira] [Created] (HIVE-9317) move Microsoft copyright to NOTICE file
Owen O'Malley created HIVE-9317: --- Summary: move Microsoft copyright to NOTICE file Key: HIVE-9317 URL: https://issues.apache.org/jira/browse/HIVE-9317 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Fix For: 0.15.0 There are a set of files that still have the Microsoft copyright notices. Those notices need to be moved into NOTICES and replaced with the standard Apache headers. {code} ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java {code}
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14275997#comment-14275997 ] Owen O'Malley commented on HIVE-9188: - [~prasanth_j] Please remove the upper two levels of bloom filters. They are utterly useless. Their false positive rate will be far above 99%. They absolutely should not be stored in the column statistics. That will hurt the common ppd case and not help. > BloomFilter in ORC row group index > -- > > Key: HIVE-9188 > URL: https://issues.apache.org/jira/browse/HIVE-9188 > Project: Hive > Issue Type: New Feature > Components: File Formats >Affects Versions: 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, > HIVE-9188.4.patch > > > BloomFilters are well known probabilistic data structure for set membership > checking. We can use bloom filters in ORC index for better row group pruning. > Currently, ORC row group index uses min/max statistics to eliminate row > groups (stripes as well) that do not satisfy predicate condition specified in > the query. But in some cases, the efficiency of min/max based elimination is > not optimal (unsorted columns with wide range of entries). Bloom filters can > be an effective and efficient alternative for row group/split elimination for > point queries or queries with IN clause.
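The ">99% false positive" claim in the comment above can be checked with the standard formula p = (1 - e^(-kn/m))^k: a filter sized for one 10,000-row group, reused to cover millions of file-level values, saturates completely. The parameter values below (m = 62,352 bits, k = 4, matching n = 10,000 at p = 0.05) are illustrative, not taken from the patch.

```python
import math

def false_positive_rate(n_items: int, m_bits: int, k_hashes: int) -> float:
    """Expected false-positive rate of a bloom filter with m bits and k hashes
    after n items have been inserted."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

m, k = 62_352, 4  # sized for 10,000 items at p = 0.05

# At its design point the filter behaves as intended (~5% false positives),
# but fed 10 million file-level values nearly every bit is set, so almost
# every membership probe answers "maybe".
assert false_positive_rate(10_000, m, k) < 0.06
assert false_positive_rate(10_000_000, m, k) > 0.99
```

A filter that says "maybe" for essentially every key eliminates nothing, while still costing IO and metadata space, which is the argument for dropping the stripe- and file-level filters.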
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284927#comment-14284927 ] Owen O'Malley commented on HIVE-8966: - This looks good, Alan. +1 One minor nit is that the class javadoc for ValidReadTxnList has "And" instead of the intended "An". > Delta files created by hive hcatalog streaming cannot be compacted > -- > > Key: HIVE-8966 > URL: https://issues.apache.org/jira/browse/HIVE-8966 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 0.14.0 > Environment: hive >Reporter: Jihong Liu >Assignee: Alan Gates >Priority: Critical > Fix For: 0.14.1 > > Attachments: HIVE-8966.2.patch, HIVE-8966.3.patch, HIVE-8966.4.patch, > HIVE-8966.5.patch, HIVE-8966.patch > > > hive hcatalog streaming will also create a file like bucket_n_flush_length in > each delta directory, where "n" is the bucket number. But > compactor.CompactorMR thinks this file also needs to be compacted. However this > file of course cannot be compacted, so compactor.CompactorMR will not > continue to do the compaction. > In a test, after removing the bucket_n_flush_length file, the "alter > table partition compact" finished successfully. If that file isn't deleted, > nothing is compacted. > This is probably a very high severity bug. Both 0.13 and 0.14 have this issue
[jira] [Commented] (HIVE-8966) Delta files created by hive hcatalog streaming cannot be compacted
[ https://issues.apache.org/jira/browse/HIVE-8966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14284935#comment-14284935 ] Owen O'Malley commented on HIVE-8966: - After a little more thought, I'm worried that someone will accidentally create a ValidCompactorTxnList and get confused by the different behavior. I think it would make sense to move it into the compactor package to minimize the chance that someone accidentally uses it by mistake. > Delta files created by hive hcatalog streaming cannot be compacted > -- > > Key: HIVE-8966 > URL: https://issues.apache.org/jira/browse/HIVE-8966 > Project: Hive > Issue Type: Bug > Components: HCatalog >Affects Versions: 0.14.0 > Environment: hive >Reporter: Jihong Liu >Assignee: Alan Gates >Priority: Critical > Fix For: 0.14.1 > > Attachments: HIVE-8966.2.patch, HIVE-8966.3.patch, HIVE-8966.4.patch, > HIVE-8966.5.patch, HIVE-8966.patch > > > hive hcatalog streaming will also create a file like bucket_n_flush_length in > each delta directory, where "n" is the bucket number. But > compactor.CompactorMR thinks this file also needs to be compacted. However this > file of course cannot be compacted, so compactor.CompactorMR will not > continue to do the compaction. > In a test, after removing the bucket_n_flush_length file, the "alter > table partition compact" finished successfully. If that file isn't deleted, > nothing is compacted. > This is probably a very high severity bug. Both 0.13 and 0.14 have this issue
[jira] [Created] (HIVE-9451) Add max size of column dictionaries to ORC metadata
Owen O'Malley created HIVE-9451: --- Summary: Add max size of column dictionaries to ORC metadata Key: HIVE-9451 URL: https://issues.apache.org/jira/browse/HIVE-9451 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley To predict the amount of memory required to read an ORC file we need to know the size of the dictionaries for the columns that we are reading. I propose adding the number of bytes for each column's dictionary to the stripe's column statistics. The file's column statistics would have the maximum dictionary size for each column. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-9467) ORC - sort dictionary streams to the end of the stripe
Owen O'Malley created HIVE-9467: --- Summary: ORC - sort dictionary streams to the end of the stripe Key: HIVE-9467 URL: https://issues.apache.org/jira/browse/HIVE-9467 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley When reading ORC files, it would be convenient to group the dictionary streams at the end of the stripe. This would allow the reader to use fewer read operations if they want to load the dictionaries before they load the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Attachment: hive-9327.txt This patch changes no code, just puts the required Apache header on the source files and moves Microsoft's copyright notice to the NOTICE file. > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley > Fix For: 0.15.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Priority: Blocker (was: Major) > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley >Priority: Blocker > Fix For: 0.15.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Status: Patch Available (was: Open) > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley >Priority: Blocker > Fix For: 0.15.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley reassigned HIVE-9317: --- Assignee: Owen O'Malley > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley > Fix For: 0.15.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9317: Resolution: Fixed Fix Version/s: 1.0.0 Hadoop Flags: Reviewed Status: Resolved (was: Patch Available) I committed this. Thanks for the review, Alan. > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley >Priority: Blocker > Fix For: 0.15.0, 1.0.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9451) Add max size of column dictionaries to ORC metadata
[ https://issues.apache.org/jira/browse/HIVE-9451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297178#comment-14297178 ] Owen O'Malley commented on HIVE-9451: - We should also record the stripe size that was used as the file was written. That gives a strict upper bound on the size of memory in the writer. > Add max size of column dictionaries to ORC metadata > --- > > Key: HIVE-9451 > URL: https://issues.apache.org/jira/browse/HIVE-9451 > Project: Hive > Issue Type: Improvement >Reporter: Owen O'Malley > > To predict the amount of memory required to read an ORC file we need to know > the size of the dictionaries for the columns that we are reading. I propose > adding the number of bytes for each column's dictionary to the stripe's > column statistics. The file's column statistics would have the maximum > dictionary size for each column. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9317) move Microsoft copyright to NOTICE file
[ https://issues.apache.org/jira/browse/HIVE-9317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14297319#comment-14297319 ] Owen O'Malley commented on HIVE-9317: - +1 to not rolling a new RC specifically for this one. I just want to make sure it goes into any new RCs. > move Microsoft copyright to NOTICE file > --- > > Key: HIVE-9317 > URL: https://issues.apache.org/jira/browse/HIVE-9317 > Project: Hive > Issue Type: Bug >Reporter: Owen O'Malley >Assignee: Owen O'Malley >Priority: Blocker > Fix For: 0.15.0, 1.0.0 > > Attachments: hive-9327.txt > > > There are a set of files that still have the Microsoft copyright notices. > Those notices need to be moved into NOTICES and replaced with the standard > Apache headers. > {code} > ./common/src/java/org/apache/hadoop/hive/common/type/Decimal128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SignedInt128.java > ./common/src/java/org/apache/hadoop/hive/common/type/SqlMathUtil.java > ./common/src/java/org/apache/hadoop/hive/common/type/UnsignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestDecimal128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSignedInt128.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestSqlMathUtil.java > ./common/src/test/org/apache/hadoop/hive/common/type/TestUnsignedInt128.java > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (HIVE-9188) BloomFilter in ORC row group index
[ https://issues.apache.org/jira/browse/HIVE-9188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14302507#comment-14302507 ] Owen O'Malley commented on HIVE-9188: - Suggestions: * Pick m to always be a multiple of 64 (since you are using longs as the representation) * change the representation of BloomFilter in orc_proto to record the number of hash functions and not the size or fpp. * use fixed64 for the bit field * you'll also need to update the specification in the wiki with the change to the format (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC#LanguageManualORC-orc-specORCFormatSpecification) * revert the spurious change to CliDriver.java * revert the spurious change to .gitignore * it seems suboptimal to convert long values to bytes before hashing > BloomFilter in ORC row group index > -- > > Key: HIVE-9188 > URL: https://issues.apache.org/jira/browse/HIVE-9188 > Project: Hive > Issue Type: New Feature > Components: File Formats >Affects Versions: 0.15.0 >Reporter: Prasanth Jayachandran >Assignee: Prasanth Jayachandran > Labels: orcfile > Attachments: HIVE-9188.1.patch, HIVE-9188.2.patch, HIVE-9188.3.patch, > HIVE-9188.4.patch, HIVE-9188.5.patch, HIVE-9188.6.patch > > > BloomFilters are a well-known probabilistic data structure for set membership > checking. We can use bloom filters in the ORC index for better row group pruning. > Currently, ORC row group index uses min/max statistics to eliminate row > groups (stripes as well) that do not satisfy the predicate condition specified in > the query. But in some cases, the efficiency of min/max based elimination is > not optimal (unsorted columns with a wide range of entries). Bloom filters can > be an effective and efficient alternative for row group/split elimination for > point queries or queries with an IN clause. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
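The first suggestion can be sketched with the standard bloom filter sizing formulas (hypothetical helper names, not Hive's actual BloomFilter code): round the bit count m up to a multiple of 64 so it packs exactly into long words, then derive the hash count k from the final m.

```java
// Sketch only: standard bloom filter sizing, with the bit count rounded up
// to a multiple of 64 so the bit set packs exactly into an array of longs.
public class BloomFilterSizing {
    // optimal bit count for n entries at false-positive probability p,
    // rounded up to a whole number of 64-bit words
    public static int optimalNumBits(long n, double p) {
        int bits = (int) Math.ceil(-n * Math.log(p) / (Math.log(2) * Math.log(2)));
        return ((bits + 63) / 64) * 64;
    }

    // optimal number of hash functions for n entries in m bits
    public static int optimalNumHashes(long n, int m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        int m = optimalNumBits(10000, 0.05);
        int k = optimalNumHashes(10000, m);
        System.out.println(m + " bits, " + k + " hash functions");
    }
}
```

Recording only k (and the raw long words) in orc_proto, as suggested, is then sufficient: m is implied by the length of the fixed64 list.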
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Status: Patch Available (was: Open) > ORC Reader should ignore unknown metadata streams > -- > > Key: HIVE-9593 > URL: https://issues.apache.org/jira/browse/HIVE-9593 > Project: Hive > Issue Type: Bug > Components: File Formats >Affects Versions: 0.13.1, 0.12.0, 0.11.0, 1.0.0, 1.2.0, 1.1.0 >Reporter: Gopal V >Assignee: Owen O'Malley > Attachments: hive-9593.patch > > > ORC readers should ignore metadata streams which are non-essential additions > to the main data streams. > This will include additional indices, histograms or anything we add as an > optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Attachment: hive-9593.patch This patch changes all of the required fields to be optional. I've gone through the current code to ensure that null pointers from getKind() won't cause an NPE. > ORC Reader should ignore unknown metadata streams > -- > > Key: HIVE-9593 > URL: https://issues.apache.org/jira/browse/HIVE-9593 > Project: Hive > Issue Type: Bug > Components: File Formats >Affects Versions: 0.11.0, 0.12.0, 0.13.1, 1.0.0, 1.2.0, 1.1.0 >Reporter: Gopal V >Assignee: Owen O'Malley > Attachments: hive-9593.patch > > > ORC readers should ignore metadata streams which are non-essential additions > to the main data streams. > This will include additional indices, histograms or anything we add as an > optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
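The effect of making the fields optional can be illustrated with a toy reader (simplified stand-ins, not ORC's generated protobuf classes): a stream kind the reader does not recognize parses to null and is skipped, instead of failing the whole read.

```java
// Illustrative only: a reader that tolerates stream kinds it doesn't know.
import java.util.Arrays;
import java.util.List;

public class SkipUnknownStreams {
    enum Kind { DATA, LENGTH, DICTIONARY_DATA }   // kinds this reader understands

    // Returns null for an unrecognized kind, mirroring how an optional proto
    // field lets an old reader see "no value" instead of a parse failure.
    static Kind parseKind(String wire) {
        try {
            return Kind.valueOf(wire);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static int countReadable(List<String> streams) {
        int readable = 0;
        for (String s : streams) {
            if (parseKind(s) == null) {
                continue;                          // unknown stream: ignore, don't fail
            }
            readable++;
        }
        return readable;
    }

    public static void main(String[] args) {
        // an older reader encountering a newer file with a HISTOGRAM stream
        List<String> streams = Arrays.asList("DATA", "HISTOGRAM", "LENGTH");
        System.out.println(countReadable(streams));  // the unknown HISTOGRAM is skipped
    }
}
```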
[jira] [Updated] (HIVE-9593) ORC Reader should ignore unknown metadata streams
[ https://issues.apache.org/jira/browse/HIVE-9593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Owen O'Malley updated HIVE-9593: Resolution: Fixed Fix Version/s: 1.1.0 1.0.1 Status: Resolved (was: Patch Available) I committed this. Thanks for the review, Gopal! > ORC Reader should ignore unknown metadata streams > -- > > Key: HIVE-9593 > URL: https://issues.apache.org/jira/browse/HIVE-9593 > Project: Hive > Issue Type: Bug > Components: File Formats >Affects Versions: 0.11.0, 0.12.0, 0.13.1, 1.0.0, 1.2.0, 1.1.0 >Reporter: Gopal V >Assignee: Owen O'Malley > Fix For: 1.0.1, 1.1.0 > > Attachments: HIVE-9593.no-autogen.patch, hive-9593.patch > > > ORC readers should ignore metadata streams which are non-essential additions > to the main data streams. > This will include additional indices, histograms or anything we add as an > optional stream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15375) Port ORC-115 to storage-api
Owen O'Malley created HIVE-15375: Summary: Port ORC-115 to storage-api Key: HIVE-15375 URL: https://issues.apache.org/jira/browse/HIVE-15375 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, VectorizedRowBatch.toString() assumes that all BytesColumnVector's use the internal buffer for all of the values. This leads to incorrect strings in many common cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15419) Separate out storage-api to be released independently
Owen O'Malley created HIVE-15419: Summary: Separate out storage-api to be released independently Key: HIVE-15419 URL: https://issues.apache.org/jira/browse/HIVE-15419 Project: Hive Issue Type: Task Components: storage-api Reporter: Owen O'Malley Currently, the Hive project produces a single monolithic release, which makes reading directly into Hive's vectorized row batches a circular dependency for file formats. Storage-api is a small module containing the vectorized row batches and SearchArgument classes that are necessary for efficient vectorized reads and writes. By releasing storage-api independently, we can provide an interface that the file formats can read from and write to. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15643) remove use of default charset in FastHiveDecimal
Owen O'Malley created HIVE-15643: Summary: remove use of default charset in FastHiveDecimal Key: HIVE-15643 URL: https://issues.apache.org/jira/browse/HIVE-15643 Project: Hive Issue Type: Bug Reporter: Owen O'Malley HIVE-15335 introduced some new uses of String.getBytes(), which uses the default charset. These need to be replaced with the version that always uses UTF-8. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
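A minimal demonstration of the difference: String.getBytes() encodes with the JVM's default charset, so non-ASCII characters can produce different bytes on different platforms, while the StandardCharsets.UTF_8 overload is deterministic everywhere.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    public static void main(String[] args) {
        String s = "décimal";
        // Platform-dependent: the result varies with the JVM's default charset.
        byte[] platform = s.getBytes();
        // Deterministic: always the UTF-8 encoding, regardless of platform.
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);  // 8 on every platform ("é" is two bytes in UTF-8)
    }
}
```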
[jira] [Created] (HIVE-15841) Upgrade Hive to ORC 1.3.2
Owen O'Malley created HIVE-15841: Summary: Upgrade Hive to ORC 1.3.2 Key: HIVE-15841 URL: https://issues.apache.org/jira/browse/HIVE-15841 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Hive needs ORC-141 and ORC-135, so we should upgrade to ORC 1.3.2 once it is released. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15922) SchemaEvolution must guarantee that getFileIncluded is not null
Owen O'Malley created HIVE-15922: Summary: SchemaEvolution must guarantee that getFileIncluded is not null Key: HIVE-15922 URL: https://issues.apache.org/jira/browse/HIVE-15922 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.1 Reporter: Owen O'Malley Fix For: 2.1.2 This only impacts branch-2.1, because it is already fixed in master by HIVE-14007. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-15929) Fix HiveDecimalWritable
Owen O'Malley created HIVE-15929: Summary: Fix HiveDecimalWritable Key: HIVE-15929 URL: https://issues.apache.org/jira/browse/HIVE-15929 Project: Hive Issue Type: Bug Reporter: Owen O'Malley HIVE-15335 broke compatibility with Hive 2.1 by making HiveDecimalWritable.getInternalStorage() throw an exception when called on an unset value. It is easy to instead return an empty array, which will allow the old code to allocate a new array. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
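A pared-down sketch of the proposed behavior (hypothetical class and fields, not the real HiveDecimalWritable): an unset writable returns an empty array instead of throwing, so older callers can still use the result to size a new allocation.

```java
// Hypothetical sketch of the fix, not Hive's actual HiveDecimalWritable.
public class DecimalWritableSketch {
    private long[] storage;           // null until set(...) is called
    private boolean isSet = false;

    public void set(long[] words) {
        storage = words.clone();
        isSet = true;
    }

    public long[] getInternalStorage() {
        if (!isSet) {
            // Broken behavior: throw on an unset value.
            // Compatible behavior: return an empty array the caller can grow from.
            return new long[0];
        }
        return storage;
    }

    public static void main(String[] args) {
        DecimalWritableSketch w = new DecimalWritableSketch();
        System.out.println(w.getInternalStorage().length);  // 0, no exception
    }
}
```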
[jira] [Created] (HIVE-16549) Fix an incompatible change in PredicateLeafImpl from HIVE-15269
Owen O'Malley created HIVE-16549: Summary: Fix an incompatible change in PredicateLeafImpl from HIVE-15269 Key: HIVE-16549 URL: https://issues.apache.org/jira/browse/HIVE-16549 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley HIVE-15269 added a parameter to the constructor for PredicateLeafImpl for a configuration object. The configuration object is only used for the new LiteralDelegates. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16683) ORC WriterVersion gets ArrayIndexOutOfBoundsException on newer ORC files
Owen O'Malley created HIVE-16683: Summary: ORC WriterVersion gets ArrayIndexOutOfBoundsException on newer ORC files Key: HIVE-16683 URL: https://issues.apache.org/jira/browse/HIVE-16683 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.1, 2.2.0 Reporter: Owen O'Malley Assignee: Owen O'Malley This only impacts branch-2.1 and branch-2.2, because it has been fixed in the ORC project's code base via ORC-125. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-16787) Fix itests in branch-2.2
Owen O'Malley created HIVE-16787: Summary: Fix itests in branch-2.2 Key: HIVE-16787 URL: https://issues.apache.org/jira/browse/HIVE-16787 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 The itests are broken in branch 2.2 and need to be fixed before release. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HIVE-17118) Clean up of HIVE-14309 to move the orc source code to org.apache.hive.orc
Owen O'Malley created HIVE-17118: Summary: Clean up of HIVE-14309 to move the orc source code to org.apache.hive.orc Key: HIVE-17118 URL: https://issues.apache.org/jira/browse/HIVE-17118 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 Just for branch-2.2. HIVE-14309 shaded the hive-orc jar to use a unique package org.apache.hive.orc package. This patch moves the source files over to the right directory and removes the shading. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17154) fix rat problems in branch-2.2
Owen O'Malley created HIVE-17154: Summary: fix rat problems in branch-2.2 Key: HIVE-17154 URL: https://issues.apache.org/jira/browse/HIVE-17154 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Fix rat problems in the branch-2.2. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17171) Remove old javadoc versions
Owen O'Malley created HIVE-17171: Summary: Remove old javadoc versions Key: HIVE-17171 URL: https://issues.apache.org/jira/browse/HIVE-17171 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley We currently have a lot of old javadoc versions. I'd propose that we keep the following versions: * r1.2.2 * r2.1.1 * r2.2.0 (Note that 2.3.0 was not checked in to the site.) In particular, I'd suggest we remove: * hcat-r0.5.0 * r0.10.0 * r0.11.0 * r0.12.0 * r0.13.1 * r1.0.1 * r1.1.1 * r2.0.1 Any concerns? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17173) Add some convenience redirects to the Hive site
Owen O'Malley created HIVE-17173: Summary: Add some convenience redirects to the Hive site Key: HIVE-17173 URL: https://issues.apache.org/jira/browse/HIVE-17173 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley I'd propose that we add the following redirects to our site's .htaccess: * http://hive.apache.org/bugs -> https://issues.apache.org/jira/browse/hive * http://hive.apache.org/downloads -> https://www.apache.org/dyn/closer.cgi/hive/ * http://hive.apache.org/releases -> https://hive.apache.org/docs/downloads.html * http://hive.apache.org/src -> https://github.com/apache/hive * http://hive.apache.org/web-src -> https://svn.apache.org/repos/asf/hive/cms/trunk Thoughts? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17924) Restore SerDe by reverting HIVE-15167 to unbreak API compatibility
Owen O'Malley created HIVE-17924: Summary: Restore SerDe by reverting HIVE-15167 to unbreak API compatibility Key: HIVE-17924 URL: https://issues.apache.org/jira/browse/HIVE-17924 Project: Hive Issue Type: Bug Affects Versions: 2.3.0, 2.3.1 Reporter: Owen O'Malley Assignee: Owen O'Malley HIVE-15167 broke compatibility badly for very little gain and caused a lot of pain for our users. We should revert it and restore the SerDe interface. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-17925) Fix TestHooks so that it avoids ClassNotFound on teardown
Owen O'Malley created HIVE-17925: Summary: Fix TestHooks so that it avoids ClassNotFound on teardown Key: HIVE-17925 URL: https://issues.apache.org/jira/browse/HIVE-17925 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley TestHooks gets a ClassNotFound exception during teardown, which messes up some following tests. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (HIVE-19013) Fix some minor build issues in storage-api
Owen O'Malley created HIVE-19013: Summary: Fix some minor build issues in storage-api Key: HIVE-19013 URL: https://issues.apache.org/jira/browse/HIVE-19013 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the storage-api tests complain that there isn't a log4j2.xml and the javadoc fails. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-20135) Fix incompatible change in TimestampColumnVector to default to UTC
Owen O'Malley created HIVE-20135: Summary: Fix incompatible change in TimestampColumnVector to default to UTC Key: HIVE-20135 URL: https://issues.apache.org/jira/browse/HIVE-20135 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Jesus Camacho Rodriguez HIVE-20007 changed the default for TimestampColumnVector to be to use UTC, which breaks the API compatibility with storage-api 2.6. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (HIVE-12638) Hive should not create empty files in partitions
Owen O'Malley created HIVE-12638: Summary: Hive should not create empty files in partitions Key: HIVE-12638 URL: https://issues.apache.org/jira/browse/HIVE-12638 Project: Hive Issue Type: Bug Components: File Formats Reporter: Owen O'Malley Currently Hive creates empty files for buckets with no rows in a directory. I believe this was originally because the SMB and bucket join require files to be present to get InputSplits. There are customers where this behavior leads to the creation of more than 200,000 empty ORC files per hour on a cluster (with peaks of more than 725,000 per hour). We've also seen instances where a single DataNode is involved in 5600 of these empty ORC files within a 2-minute period. This causes significant stress on HDFS at both the NameNode and DataNode and is completely unnecessary. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12838) Add methods for getting and storing serialized ORC file tails
Owen O'Malley created HIVE-12838: Summary: Add methods for getting and storing serialized ORC file tails Key: HIVE-12838 URL: https://issues.apache.org/jira/browse/HIVE-12838 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Provide a pair of routines for getting and restoring from a serialized file footer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13232) Aggressively drop compression buffers in ORC OutStreams
Owen O'Malley created HIVE-13232: Summary: Aggressively drop compression buffers in ORC OutStreams Key: HIVE-13232 URL: https://issues.apache.org/jira/browse/HIVE-13232 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley In Hive 0.11, when ORC's OutStreams were flushed, they dropped all of their buffers. In the patch for HIVE-4342, we inadvertently changed that behavior so that one of the buffers is held on to. For queries with a lot of writers, and thus under significant memory pressure, this can have a significant impact on memory usage. Note that "hive.optimize.sort.dynamic.partition" avoids this problem by sorting on the dynamic partition key so that only a single ORC writer is open at once. This will use memory more effectively and avoid creating ORC files with very small stripes, which will produce better downstream performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13464) Backport changes to storage-api into branch 2 for release into 2.0.1
Owen O'Malley created HIVE-13464: Summary: Backport changes to storage-api into branch 2 for release into 2.0.1 Key: HIVE-13464 URL: https://issues.apache.org/jira/browse/HIVE-13464 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.0.1 To release ORC as a separate project, backporting the safe changes for storage-api to 2.0.1 will minimize the disruption. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13763) Update smart-apply-patch.sh with ability to use patches from git
Owen O'Malley created HIVE-13763: Summary: Update smart-apply-patch.sh with ability to use patches from git Key: HIVE-13763 URL: https://issues.apache.org/jira/browse/HIVE-13763 Project: Hive Issue Type: Improvement Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the smart-apply-patch.sh doesn't understand git patches. It is relatively easy to make it understand patches generated by: {code} % git format-patch apache/master --stdout > HIVE-999.patch {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-13906) Remove guava dependence from storage-api module
Owen O'Malley created HIVE-13906: Summary: Remove guava dependence from storage-api module Key: HIVE-13906 URL: https://issues.apache.org/jira/browse/HIVE-13906 Project: Hive Issue Type: Bug Components: storage-api Reporter: Owen O'Malley Assignee: Owen O'Malley Guava is a very problematic library to depend on because of version incompatibilities, and its use in the storage-api module causes it to leak into everything that depends on storage-api. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14007) Replace ORC module with ORC release
Owen O'Malley created HIVE-14007: Summary: Replace ORC module with ORC release Key: HIVE-14007 URL: https://issues.apache.org/jira/browse/HIVE-14007 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.2.0 Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.2.0 This completes moving the core ORC reader & writer to the ORC project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14166) Minor updates to the website.
Owen O'Malley created HIVE-14166: Summary: Minor updates to the website. Key: HIVE-14166 URL: https://issues.apache.org/jira/browse/HIVE-14166 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Minor updates to the website & documentation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14220) Protect users from Reader.rows(Options) modifying the Options object
Owen O'Malley created HIVE-14220: Summary: Protect users from Reader.rows(Options) modifying the Options object Key: HIVE-14220 URL: https://issues.apache.org/jira/browse/HIVE-14220 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley This is a matching fix to HIVE-14004, where ACID was getting into trouble because it was reusing the Reader.Options argument between files and Reader.rows was modifying it. HIVE-14004 just fixed the Hive case, but we need a corresponding fix over here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
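The defensive-copy pattern being described can be sketched as follows (toy classes with hypothetical names, not ORC's actual Reader API): rows() clones the caller's Options before adjusting it, so the same Options instance can safely be reused across files.

```java
// Toy sketch of the fix: never mutate the caller's Options argument.
public class ReaderSketch {
    static class Options implements Cloneable {
        long offset = 0;
        long length = Long.MAX_VALUE;   // "read the whole file" by default

        @Override
        public Options clone() {
            try {
                return (Options) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    private final long fileLength;

    ReaderSketch(long fileLength) {
        this.fileLength = fileLength;
    }

    Options rows(Options caller) {
        Options local = caller.clone();                  // defensive copy
        local.length = Math.min(local.length, fileLength); // adjust the copy only
        return local;
    }

    public static void main(String[] args) {
        Options shared = new Options();
        new ReaderSketch(100).rows(shared);              // reuse across files is now safe
        System.out.println(shared.length == Long.MAX_VALUE);  // caller's object untouched
    }
}
```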
[jira] [Created] (HIVE-14242) Backport ORC-53 to Hive
Owen O'Malley created HIVE-14242: Summary: Backport ORC-53 to Hive Key: HIVE-14242 URL: https://issues.apache.org/jira/browse/HIVE-14242 Project: Hive Issue Type: Bug Components: ORC Reporter: Owen O'Malley Assignee: Owen O'Malley ORC-53 was mostly about the mapreduce shims for ORC, but it fixed a problem in TypeDescription that should be backported to Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-14309) Fix naming of classes in orc module to not conflict with standalone orc
Owen O'Malley created HIVE-14309: Summary: Fix naming of classes in orc module to not conflict with standalone orc Key: HIVE-14309 URL: https://issues.apache.org/jira/browse/HIVE-14309 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley The current Hive 2.0 and 2.1 releases have classes in the org.apache.orc namespace that clash with the ORC project's classes. From Hive 2.2 onward, the classes will only be in ORC, but we'll reduce the classpath problems if we rename the classes to org.apache.hive.orc. I've looked at a set of projects (pig, spark, oozie, flume, & storm) and can't find any uses of Hive's versions of the org.apache.orc classes, so I believe this is a safe change that will reduce the integration problems downstream. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-15124) Fix OrcInputFormat to use reader's schema for include boolean array
Owen O'Malley created HIVE-15124: Summary: Fix OrcInputFormat to use reader's schema for include boolean array Key: HIVE-15124 URL: https://issues.apache.org/jira/browse/HIVE-15124 Project: Hive Issue Type: Bug Components: ORC Affects Versions: 2.1.0 Reporter: Owen O'Malley Assignee: Owen O'Malley Currently, the OrcInputFormat uses the file's schema rather than the reader's schema. This means that SchemaEvolution fails with an ArrayIndexOutOfBoundsException if a partition has a different schema than the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10171) Create a storage-api module
Owen O'Malley created HIVE-10171: Summary: Create a storage-api module Key: HIVE-10171 URL: https://issues.apache.org/jira/browse/HIVE-10171 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley To support high performance file formats, I'd like to propose that we move the minimal set of classes that are required to integrate with Hive into a new module named "storage-api". This module will include VectorizedRowBatch, the various ColumnVector classes, and the SARG classes. It will form the start of an API that high performance storage formats can use to integrate with Hive. Both ORC and Parquet can use the new API to support vectorization and SARGs without performance-destroying shims. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
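The idea behind VectorizedRowBatch can be sketched in miniature: columnar batches hand the engine ~1024 rows at a time as primitive arrays instead of one Object per row. The `MiniRowBatch` and `LongColumn` names below are illustrative stand-ins, not the storage-api classes:

```java
public class MiniBatchDemo {
  // A column of longs with a parallel null mask, as in ColumnVector.
  static class LongColumn {
    long[] vector;
    boolean[] isNull;
    LongColumn(int size) {
      vector = new long[size];
      isNull = new boolean[size];
    }
  }

  // A batch of rows stored column-by-column, as in VectorizedRowBatch.
  static class MiniRowBatch {
    static final int DEFAULT_SIZE = 1024;
    int size;                 // rows actually filled in this batch
    LongColumn[] cols;
    MiniRowBatch(int numCols) {
      cols = new LongColumn[numCols];
      for (int i = 0; i < numCols; i++) {
        cols[i] = new LongColumn(DEFAULT_SIZE);
      }
    }
  }

  // A consumer sums a column with a tight loop over primitive arrays,
  // with no per-row object allocation or virtual calls.
  static long sumColumn(MiniRowBatch batch, int col) {
    long total = 0;
    for (int r = 0; r < batch.size; r++) {
      if (!batch.cols[col].isNull[r]) {
        total += batch.cols[col].vector[r];
      }
    }
    return total;
  }

  public static void main(String[] args) {
    MiniRowBatch batch = new MiniRowBatch(1);
    batch.size = 3;
    batch.cols[0].vector[0] = 1;
    batch.cols[0].vector[1] = 2;
    batch.cols[0].vector[2] = 39;
    System.out.println(sumColumn(batch, 0)); // 42
  }
}
```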
[jira] [Created] (HIVE-10305) TestOrcFile has a mistake that makes metadata test ineffective
Owen O'Malley created HIVE-10305: Summary: TestOrcFile has a mistake that makes metadata test ineffective Key: HIVE-10305 URL: https://issues.apache.org/jira/browse/HIVE-10305 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Two of the values being stored as user metadata in TestOrcFile.metaData weren't flipped and thus were empty buffers. The test passes because they are compared to empty buffers. We should fix the test so it performs the intended comparison. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
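The unflipped-buffer mistake is easy to reproduce with plain java.nio: after writing, a buffer's position sits at the end of the data, so without `flip()` there is nothing left to read and a comparison against an empty buffer vacuously passes. A small sketch (method names are illustrative):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class FlipDemo {
  // The bug: write bytes but forget to flip, so remaining() is 0 and the
  // buffer reads back as empty.
  static ByteBuffer fillWithoutFlip(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(bytes.length);
    buf.put(bytes);
    return buf;              // position == limit: looks empty to a reader
  }

  // The fix: flip() resets position to 0 and sets limit to the data end.
  static ByteBuffer fillAndFlip(String s) {
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
    ByteBuffer buf = ByteBuffer.allocate(bytes.length);
    buf.put(bytes);
    buf.flip();              // position = 0, limit = bytes written
    return buf;
  }

  public static void main(String[] args) {
    System.out.println(fillWithoutFlip("meta").remaining()); // 0
    System.out.println(fillAndFlip("meta").remaining());     // 4
  }
}
```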
[jira] [Created] (HIVE-10407) separate out the timestamp ranges for testing purposes
Owen O'Malley created HIVE-10407: Summary: separate out the timestamp ranges for testing purposes Key: HIVE-10407 URL: https://issues.apache.org/jira/browse/HIVE-10407 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley Some platforms have limits for date ranges, so separate out the test cases that are outside of the range 1970 to 2038. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10794) Remove the dependence from ErrorMsg to HiveUtils
Owen O'Malley created HIVE-10794: Summary: Remove the dependence from ErrorMsg to HiveUtils Key: HIVE-10794 URL: https://issues.apache.org/jira/browse/HIVE-10794 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley HiveUtils has a large set of dependencies and ErrorMsg only needs the new line constant. Breaking the dependency will significantly reduce ErrorMsg's dependency footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10795) Remove use of PerfLogger from Orc
Owen O'Malley created HIVE-10795: Summary: Remove use of PerfLogger from Orc Key: HIVE-10795 URL: https://issues.apache.org/jira/browse/HIVE-10795 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley PerfLogger is yet another class with a huge dependency set that Orc doesn't need. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10796) Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel
Owen O'Malley created HIVE-10796: Summary: Remove dependencies on NumericHistogram and NumDistinctValueEstimator from JavaDataModel Key: HIVE-10796 URL: https://issues.apache.org/jira/browse/HIVE-10796 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The JavaDataModel class is used in a lot of places and the non-general calculations are better done in the other classes. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10797) Simplify the test for vectorized input
Owen O'Malley created HIVE-10797: Summary: Simplify the test for vectorized input Key: HIVE-10797 URL: https://issues.apache.org/jira/browse/HIVE-10797 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The call to Utilities.isVectorMode should be simplified for the readers. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10798) Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader
Owen O'Malley created HIVE-10798: Summary: Remove dependence on VectorizedBatchUtil from VectorizedOrcAcidRowReader Key: HIVE-10798 URL: https://issues.apache.org/jira/browse/HIVE-10798 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley VectorizedBatchUtil has a lot of dependencies that Orc should avoid, so the code should be refactored. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-10799) Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc
Owen O'Malley created HIVE-10799: Summary: Refactor the SearchArgumentFactory to remove the dependence on ExprNodeGenericFuncDesc Key: HIVE-10799 URL: https://issues.apache.org/jira/browse/HIVE-10799 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley SearchArgumentFactory and SearchArgumentImpl are high level and shouldn't depend on the internals of Hive's AST model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11080) Modify VectorizedRowBatch.toString() to not depend on VectorExpressionWriter
Owen O'Malley created HIVE-11080: Summary: Modify VectorizedRowBatch.toString() to not depend on VectorExpressionWriter Key: HIVE-11080 URL: https://issues.apache.org/jira/browse/HIVE-11080 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the VectorizedRowBatch.toString method uses the VectorExpressionWriter to convert the row batch to a string. Since the string is only used for printing error messages, I'd propose making the toString use the types of the vector batch instead of the object inspector. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11086) Remove use of ErrorMsg in Orc's RunLengthIntegerReaderV2
Owen O'Malley created HIVE-11086: Summary: Remove use of ErrorMsg in Orc's RunLengthIntegerReaderV2 Key: HIVE-11086 URL: https://issues.apache.org/jira/browse/HIVE-11086 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley ORC's rle v2 reader uses a string literal from ErrorMsg, which forces a large dependency on the rle v2 reader. Pulling the string literal in directly doesn't change the behavior and fixes the linkage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11115) Remove dependence from ORC's WriterImpl to OrcInputFormat
Owen O'Malley created HIVE-11115: Summary: Remove dependence from ORC's WriterImpl to OrcInputFormat Key: HIVE-11115 URL: https://issues.apache.org/jira/browse/HIVE-11115 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently there is a link from WriterImpl to OrcInputFormat that should be removed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11124) Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory
Owen O'Malley created HIVE-11124: Summary: Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory Key: HIVE-11124 URL: https://issues.apache.org/jira/browse/HIVE-11124 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Move OrcRecordUpdater.getAcidEventFields to RecordReaderFactory to avoid the extra dependence. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11137) In DateWritable remove the use of LazyBinaryUtils
Owen O'Malley created HIVE-11137: Summary: In DateWritable remove the use of LazyBinaryUtils Key: HIVE-11137 URL: https://issues.apache.org/jira/browse/HIVE-11137 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the DateWritable class uses LazyBinaryUtils, which has a lot of dependencies. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11144) Replace row by row reader and writer with shims to vectorized path.
Owen O'Malley created HIVE-11144: Summary: Replace row by row reader and writer with shims to vectorized path. Key: HIVE-11144 URL: https://issues.apache.org/jira/browse/HIVE-11144 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley The core ORC reader and writer will be better served if the vectorized read and write paths are the primary API and the row by row reader and writer and their corresponding object inspectors become Hive-specific shims. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11209) Clean up dependencies in HiveDecimalWritable
Owen O'Malley created HIVE-11209: Summary: Clean up dependencies in HiveDecimalWritable Key: HIVE-11209 URL: https://issues.apache.org/jira/browse/HIVE-11209 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently HiveDecimalWritable depends on: * org.apache.hadoop.hive.serde2.ByteStream * org.apache.hadoop.hive.serde2.lazybinary.LazyBinaryUtils * org.apache.hadoop.hive.serde2.typeinfo.HiveDecimalUtils Since we need HiveDecimalWritable for the decimal VectorizedColumnBatch, breaking these dependencies will improve things. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11210) Remove dependency on HiveConf from Orc reader & writer
Owen O'Malley created HIVE-11210: Summary: Remove dependency on HiveConf from Orc reader & writer Key: HIVE-11210 URL: https://issues.apache.org/jira/browse/HIVE-11210 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley Currently the ORC reader and writer get their default values from HiveConf. I propose that we make the reader and writer have their own programmatic defaults and the OrcInputFormat and OrcOutputFormat can use the version in HiveConf. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
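The "own programmatic defaults" pattern might look like the sketch below: each knob carries its default value, and any Properties-like source (HiveConf or otherwise) can override it. `OrcConfDemo`, the property names, and the default values are hypothetical, not the actual Hive configuration:

```java
import java.util.Properties;

public class OrcConfDemo {
  enum Knob {
    // Each option pairs a property name with a built-in default, so the
    // reader/writer works even without any Hive configuration present.
    STRIPE_SIZE("orc.stripe.size", 64L * 1024 * 1024),
    BUFFER_SIZE("orc.buffer.size", 256 * 1024);

    final String property;
    final long defaultValue;

    Knob(String property, long defaultValue) {
      this.property = property;
      this.defaultValue = defaultValue;
    }

    // Core code calls this; a null conf simply yields the default, while
    // OrcInputFormat/OrcOutputFormat can pass the Hive-level settings.
    long getLong(Properties conf) {
      if (conf != null) {
        String value = conf.getProperty(property);
        if (value != null) {
          return Long.parseLong(value);
        }
      }
      return defaultValue;
    }
  }

  public static void main(String[] args) {
    System.out.println(Knob.BUFFER_SIZE.getLong(null)); // built-in default
  }
}
```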
[jira] [Created] (HIVE-11212) Create vectorized types for complex types
Owen O'Malley created HIVE-11212: Summary: Create vectorized types for complex types Key: HIVE-11212 URL: https://issues.apache.org/jira/browse/HIVE-11212 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We need vectorized types for structs, maps, lists, and unions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11245) Fix the LLAP to ORC APIs
Owen O'Malley created HIVE-11245: Summary: Fix the LLAP to ORC APIs Key: HIVE-11245 URL: https://issues.apache.org/jira/browse/HIVE-11245 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Priority: Blocker Fix For: llap Currently the LLAP branch has refactored the ORC code to have different code paths depending on whether the data is coming from the cache or a FileSystem. We need to introduce a concept of a DataSource that is responsible for getting the necessary bytes regardless of whether they are coming from a FileSystem, in memory cache, or both. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11253) Move SearchArgument and VectorizedRowBatch classes to storage-api.
Owen O'Malley created HIVE-11253: Summary: Move SearchArgument and VectorizedRowBatch classes to storage-api. Key: HIVE-11253 URL: https://issues.apache.org/jira/browse/HIVE-11253 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11307) Remove getWritableObject from ColumnVectorBatch
Owen O'Malley created HIVE-11307: Summary: Remove getWritableObject from ColumnVectorBatch Key: HIVE-11307 URL: https://issues.apache.org/jira/browse/HIVE-11307 Project: Hive Issue Type: Sub-task Components: Vectorization Reporter: Owen O'Malley Assignee: Owen O'Malley Fix For: 2.0.0 ColumnVectorBatch.getWritableObject is only used in a few tests and is really problematic when adding the complex types to vectorization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11321) Move OrcFile.OrcTableProperties from OrcFile into OrcConf.
Owen O'Malley created HIVE-11321: Summary: Move OrcFile.OrcTableProperties from OrcFile into OrcConf. Key: HIVE-11321 URL: https://issues.apache.org/jira/browse/HIVE-11321 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We should pull all of the configuration/table property knobs into a single list. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11370) Extend SARGs to support binary type
Owen O'Malley created HIVE-11370: Summary: Extend SARGs to support binary type Key: HIVE-11370 URL: https://issues.apache.org/jira/browse/HIVE-11370 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Currently the sargs only apply to string, boolean, integer, decimal, floating, date, and timestamp columns. It would be good to support binary blobs also. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11417) Create ObjectInspectors for VectorizedRowBatch
Owen O'Malley created HIVE-11417: Summary: Create ObjectInspectors for VectorizedRowBatch Key: HIVE-11417 URL: https://issues.apache.org/jira/browse/HIVE-11417 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley I'd like to make the default path for reading and writing ORC files to be vectorized. To ensure that Hive can still read row by row, I'll make ObjectInspectors that are backed by the VectorizedRowBatch. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11618) Correct the SARG api to reunify the PredicateLeaf.Type INTEGER and LONG
Owen O'Malley created HIVE-11618: Summary: Correct the SARG api to reunify the PredicateLeaf.Type INTEGER and LONG Key: HIVE-11618 URL: https://issues.apache.org/jira/browse/HIVE-11618 Project: Hive Issue Type: Bug Components: Types Reporter: Owen O'Malley The Parquet binding leaked implementation details into the generic SARG api. Rather than make all users of the SARG api deal with each of the specific types, reunify the INTEGER and LONG types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11704) Create errata.txt file
Owen O'Malley created HIVE-11704: Summary: Create errata.txt file Key: HIVE-11704 URL: https://issues.apache.org/jira/browse/HIVE-11704 Project: Hive Issue Type: Bug Components: Documentation Reporter: Owen O'Malley Assignee: Owen O'Malley As discussed on the email list, we should have a file documenting known problems in the commit messages. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-11807) Set ORC buffer size in relation to set stripe size
Owen O'Malley created HIVE-11807: Summary: Set ORC buffer size in relation to set stripe size Key: HIVE-11807 URL: https://issues.apache.org/jira/browse/HIVE-11807 Project: Hive Issue Type: Improvement Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley A customer produced ORC files with very small stripe sizes (10k rows/stripe) by setting a small 64MB stripe size and 256K buffer size for a 54 column table. At that size, each of the streams gets only a buffer or two before the stripe size is reached. The current code uses the available memory instead of the stripe size and thus doesn't shrink the buffer size when the JVM has much more memory than the stripe size. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
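To make the arithmetic concrete: a 64MB stripe split across 54 columns with roughly 3 streams per column leaves about 400KB per stream, so 256KB buffers allow only one or two per stream. A sketch of a heuristic that sizes buffers from the stripe instead of from JVM memory; `clampBufferSize` and the 3-streams-per-column estimate are assumptions, not the actual Hive fix:

```java
public class BufferSizeDemo {
  // Shrink the requested buffer size until several buffers fit in each
  // stream's share of the stripe, with a 4KB floor.
  static int clampBufferSize(long stripeSize, int numColumns,
                             int streamsPerColumn, int requestedBufferSize) {
    long perStream = stripeSize / ((long) numColumns * streamsPerColumn);
    int size = requestedBufferSize;
    while (size > perStream && size > 4 * 1024) {
      size /= 2;
    }
    return size;
  }

  public static void main(String[] args) {
    // The case above: 64MB stripe, 54 columns, ~3 streams per column
    // leaves ~414KB per stream, so a 256KB request is kept as-is...
    System.out.println(clampBufferSize(64L << 20, 54, 3, 256 * 1024));
    // ...but a narrow 1MB stripe shrinks the buffers sharply.
    System.out.println(clampBufferSize(1L << 20, 54, 3, 256 * 1024));
  }
}
```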
[jira] [Created] (HIVE-11808) In ORC removing the dynamic dispatch for StringTreeReader improves read by 10%
Owen O'Malley created HIVE-11808: Summary: In ORC removing the dynamic dispatch for StringTreeReader improves read by 10% Key: HIVE-11808 URL: https://issues.apache.org/jira/browse/HIVE-11808 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Assignee: Owen O'Malley When we introduced the dictionary/direct encodings for ORC, we made subclasses of StringTreeReader named StringDirectTreeReader and StringDictionaryTreeReader and introduced an additional dynamic dispatch in the inner loop. For tables with a lot of string columns, removing that extra dispatch improves performance by 10%. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
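The speedup comes from hoisting the encoding decision out of the per-value loop. A schematic comparison of the two shapes; the class and method names are illustrative, not the ORC reader code:

```java
public class DispatchDemo {
  // Shape 1: a virtual call per value (the slower pattern).
  interface ValueReader {
    long read(long raw);
  }

  static long sumVirtual(ValueReader reader, long[] raw) {
    long total = 0;
    for (long v : raw) {
      total += reader.read(v);   // dynamic dispatch inside the inner loop
    }
    return total;
  }

  // Shape 2: branch once per batch, then run a tight monomorphic loop.
  static long sumHoisted(boolean dictionary, long[] dict, long[] raw) {
    long total = 0;
    if (dictionary) {
      for (long v : raw) {
        total += dict[(int) v];  // dictionary encoding: raw values are ids
      }
    } else {
      for (long v : raw) {
        total += v;              // direct encoding: raw values are the data
      }
    }
    return total;
  }

  public static void main(String[] args) {
    long[] dict = {10, 20, 30};
    long[] raw = {0, 1, 2};
    // Both shapes produce the same answer; only the dispatch cost differs.
    System.out.println(sumHoisted(true, dict, raw));
    System.out.println(sumHoisted(false, null, raw));
  }
}
```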
[jira] [Created] (HIVE-11890) Create ORC module
Owen O'Malley created HIVE-11890: Summary: Create ORC module Key: HIVE-11890 URL: https://issues.apache.org/jira/browse/HIVE-11890 Project: Hive Issue Type: Bug Reporter: Owen O'Malley Start moving classes over to the ORC module. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12054) Create vectorized write method
Owen O'Malley created HIVE-12054: Summary: Create vectorized write method Key: HIVE-12054 URL: https://issues.apache.org/jira/browse/HIVE-12054 Project: Hive Issue Type: Sub-task Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley We need to add writer methods that can write VectorizedRowBatch to an ORC file. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12055) Create row-by-row shims for the write path
Owen O'Malley created HIVE-12055: Summary: Create row-by-row shims for the write path Key: HIVE-12055 URL: https://issues.apache.org/jira/browse/HIVE-12055 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley As part of removing the row-by-row writer, we'll need to shim out the higher level API (OrcSerde and OrcOutputFormat) so that we maintain backwards compatibility. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12066) Add javadoc for methods added to public APIs
Owen O'Malley created HIVE-12066: Summary: Add javadoc for methods added to public APIs Key: HIVE-12066 URL: https://issues.apache.org/jira/browse/HIVE-12066 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Sergey Shelukhin Looking through the changes for ORC, there are methods being added without documentation:
{code}
--- ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
+++ ql/src/java/org/apache/hadoop/hive/ql/io/orc/Reader.java
@@ -360,8 +353,18 @@ RecordReader rows(long offset, long length,
   MetadataReader metadata() throws IOException;
+  List getVersionList();
+
+  int getMetadataSize();
+
+  List getOrcProtoStripeStatistics();
+
+  List getStripeStatistics();
+
+  List getOrcProtoFileStatistics();
+
+  DataReader createDefaultDataReader(boolean useZeroCopy);
+
{code}
You really need to look through all of the interfaces and fix them before merging into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12159) Create vectorized readers for the complex types
Owen O'Malley created HIVE-12159: Summary: Create vectorized readers for the complex types Key: HIVE-12159 URL: https://issues.apache.org/jira/browse/HIVE-12159 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Assignee: Owen O'Malley We need vectorized readers for the complex types. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HIVE-12286) Add option to ORC vectorized reader to not trim spaces from char columns.
Owen O'Malley created HIVE-12286: Summary: Add option to ORC vectorized reader to not trim spaces from char columns. Key: HIVE-12286 URL: https://issues.apache.org/jira/browse/HIVE-12286 Project: Hive Issue Type: Sub-task Reporter: Owen O'Malley Currently the ORC reader in nextBatch always strips spaces from char columns. Non-Hive applications may find it more natural for the reader not to trim the results, so I propose adding a switch to ReaderOptions. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
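The trim-vs-preserve choice can be illustrated with plain strings: CHAR(n) values are space-padded to length n, and the proposed switch decides whether a reader sees the padded or the trimmed form. `padTo` and `readChar` below are hypothetical helpers, not the ORC reader API:

```java
public class CharTrimDemo {
  // Pad a value out to the declared CHAR(n) length, as stored on disk.
  static String padTo(String value, int length) {
    StringBuilder sb = new StringBuilder(value);
    while (sb.length() < length) {
      sb.append(' ');
    }
    return sb.toString();
  }

  // Return the stored value either trimmed (Hive's current behavior) or
  // with its CHAR padding preserved (for non-Hive consumers).
  static String readChar(String stored, boolean trimOnRead) {
    if (trimOnRead) {
      int end = stored.length();
      while (end > 0 && stored.charAt(end - 1) == ' ') {
        end--;
      }
      return stored.substring(0, end);
    }
    return stored;
  }

  public static void main(String[] args) {
    String stored = padTo("ab", 5);
    System.out.println("[" + readChar(stored, true) + "]");  // trimmed
    System.out.println("[" + readChar(stored, false) + "]"); // padded
  }
}
```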