Top-K optimization
Hi All,

I'm a developer at Qubole (http://www.qubole.com) looking at Hadoop and Hive. In my past life, I was on the optimizer team of Greenplum Parallel Database. I'm a newbie to the Hive mailing list, so apologies for any missteps. I've done some searching in the Hive mailing list and JIRA and have not found any discussions around this topic - please feel free to redirect me to any old discussions I might've missed.

A class of queries we're interested in optimizing is top-k queries, i.e. queries of the form:

(1) SELECT x, y from T order by z limit 10

You can imagine a similar query with aggregates:

(2) SELECT x, y, count(*) as c from T group by x, y order by c desc limit 10

I'll continue my discussion with example (1) for simplicity. The way such a query is executed today, every mapper sorts all of its rows from T and writes them to local files. Reducers (in this example, a single one) read these files and merge them. The merged rows are fed to the limit operator, which stops after 10 rows.

The change I'm proposing is a combination of Hive and Hadoop changes which will greatly improve the performance of such queries:

Hadoop change:
- New parameter map.sort.limitrecords which determines how many records each mapper in a job will send to every reducer
- When writing out local files after sorting, the map task stops after map.sort.limitrecords records for each reducer
- Effectively, each mapper sends out its top-K records

Hive change:
- Determining when the Top-K optimization is applicable and setting K in ReduceSinkDesc
- Passing the K value along to MapredWork
- ExecDriver sets map.sort.limitrecords before executing the job corresponding to the MapredWork

This change will reduce the amount of I/O that happens on the map side (writing only 10 rows per reducer as opposed to the entire table) and can have a big effect on performance. Furthermore, it is possible to make the sort on the mapper side a top-k sort, which can further improve performance - but the deep pocket is really the I/O savings.

In my experiments, I see a 5x performance improvement for such queries. Please let me know if this is of general interest - I'll be happy to contribute this back to the community. I'll also be mailing the Hadoop mailing list about this.

Thanks
Siva
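The reason the per-mapper cutoff is safe for example (1) can be shown with a small simulation: any row in the global top K is also in the top K of the mapper that produced it, so dropping everything after the first K sorted records per mapper does not change the final result. The sketch below is plain Java, not Hadoop or Hive code; the class and method names are made up for illustration, rows are reduced to their integer sort key, and a single reducer is assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative simulation only; none of these names exist in Hadoop or Hive.
final class MapSideLimitSketch {

    // What one map task would write for the single reducer: its sorted output,
    // cut off after `limit` records (the role proposed for map.sort.limitrecords).
    static List<Integer> mapOutput(List<Integer> split, int limit) {
        List<Integer> sorted = new ArrayList<>(split);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }

    // The single reducer: merge all map outputs, then apply LIMIT k.
    static List<Integer> reduce(List<List<Integer>> mapOutputs, int k) {
        List<Integer> merged = new ArrayList<>();
        for (List<Integer> out : mapOutputs) {
            merged.addAll(out);
        }
        Collections.sort(merged);
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }

    // Runs the whole "job": one map output per input split, then the reducer.
    static List<Integer> topK(List<List<Integer>> splits, int k, int perMapperLimit) {
        List<List<Integer>> outputs = new ArrayList<>();
        for (List<Integer> split : splits) {
            outputs.add(mapOutput(split, perMapperLimit));
        }
        return reduce(outputs, k);
    }
}
```

Calling topK with perMapperLimit set to k and with perMapperLimit set to Integer.MAX_VALUE should give the same list; only the number of records written by the simulated mappers differs.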
Re: Top-K optimization
Hi Siva,

Take a look at https://issues.apache.org/jira/browse/HIVE-3562. It is on my todo list, but I have not been able to review it yet. I think it addresses a very similar problem. If so, can you also review that patch?

Thanks,
-namit

On 11/19/12 3:10 PM, Sivaramakrishnan Narayanan tarb...@gmail.com wrote:
[...]
[jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
[ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500118#comment-13500118 ]

Sivaramakrishnan Narayanan commented on HIVE-3562:

I'm interested in this particular optimization. Let's say the table src has N rows and we're interested in top-K. If the rows in src are in almost descending order and we're interested in ascending top-K (this is very likely when ordering by timestamps), then nearly every row lands at the front of the sorted key array and shifts the other entries, so the number of memcopies will be N * K. See code fragment:

{code}
+public boolean isTopN(byte[] key) {
+  int index = Arrays.binarySearch(keys, key, C);
+  index = index < 0 ? -index - 1 : index;
+  if (index >= keys.length - 1) {
+    return false;
+  }
+  System.arraycopy(keys, index, keys, index + 1, keys.length - index - 1);
+  keys[index] = Arrays.copyOf(key, key.length);
+  return true;
+}
+ }
{code}

You could use a linked list, but binary search is not an option in that case.

An alternate approach to the problem is to use a combination of Hive and Hadoop changes.

Hadoop change:
* New parameter map.sort.limitrecords which determines how many records each mapper in a job will send to every reducer
* When writing out local files after sorting, the map task stops after map.sort.limitrecords records for each reducer
* Effectively, each mapper sends out its top-K records

Hive change:
* Determining when the Top-K optimization is applicable and setting K in ReduceSinkDesc
* Passing the K value along to MapredWork
* ExecDriver sets map.sort.limitrecords before executing the job corresponding to the MapredWork

This change will reduce the amount of I/O that happens on the map side (writing only 10 rows per reducer as opposed to the entire table) and can have a big effect on performance. Furthermore, it is possible to make the sort on the mapper side a top-k sort, which can further improve performance - but the deep pocket is really the I/O savings. In my experiments, I see a 5x performance improvement for such queries.
Some limit can be pushed down to map stage

Key: HIVE-3562
URL: https://issues.apache.org/jira/browse/HIVE-3562
Project: Hive
Issue Type: Bug
Reporter: Navis
Assignee: Navis
Priority: Trivial
Attachments: HIVE-3562.D5967.1.patch

Queries with a limit clause (with a reasonable number), for example
{noformat}
select * from src order by key limit 10;
{noformat}
produce the operator tree TS-SEL-RS-EXT-LIMIT-FS.

But LIMIT can be partially calculated in RS, reducing the size of the shuffle:

TS-SEL-RS(TOP-N)-EXT-LIMIT-FS

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
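To make the N * K estimate in the comment above concrete, here is a small self-contained simulation of the sorted-array approach. It is only a sketch under simplifying assumptions: it uses long keys in natural order and a buffer of exactly K slots, where the patch uses byte[] keys with a comparator C, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper, not part of the HIVE-3562 patch.
final class ShiftCostSketch {

    // Feeds every key through a sorted buffer holding the k smallest keys seen
    // so far and returns how many array elements System.arraycopy moved in total.
    static long countMoves(long[] input, int k) {
        long[] keys = new long[k];
        Arrays.fill(keys, Long.MAX_VALUE); // sentinel marks an empty slot
        long moved = 0;
        for (long key : input) {
            int index = Arrays.binarySearch(keys, key);
            index = index < 0 ? -index - 1 : index; // insertion point
            if (index >= k) {
                continue; // key is larger than everything kept, skip it
            }
            int tail = k - index - 1; // entries shifted right by one slot
            System.arraycopy(keys, index, keys, index + 1, tail);
            keys[index] = key;
            moved += tail;
        }
        return moved;
    }
}
```

With 1000 keys in descending order and K = 10, every key is inserted at index 0 and the count comes to 1000 * 9 = 9000 moves. With the same keys in ascending order only the first 10 are inserted, for 45 moves in total.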
[jira] [Updated] (HIVE-3633) sort-merge join does not work with sub-queries
[ https://issues.apache.org/jira/browse/HIVE-3633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Namit Jain updated HIVE-3633:

Attachment: hive.3633.5.patch

sort-merge join does not work with sub-queries

Key: HIVE-3633
URL: https://issues.apache.org/jira/browse/HIVE-3633
Project: Hive
Issue Type: Improvement
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Attachments: hive.3633.1.patch, hive.3633.2.patch, hive.3633.3.patch, hive.3633.4.patch, hive.3633.5.patch

Consider the following query:

create table smb_bucket_1(key int, value string)
  CLUSTERED BY (key) SORTED BY (key) INTO 6 BUCKETS STORED AS TEXTFILE;
create table smb_bucket_2(key int, value string)
  CLUSTERED BY (key) SORTED BY (key) INTO 6 BUCKETS STORED AS TEXTFILE;

-- load the above tables

set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

explain
select count(*) from
(
  select /*+mapjoin(a)*/ a.key as key1, b.key as key2, a.value as value1, b.value as value2
  from smb_bucket_1 a join smb_bucket_2 b on a.key = b.key
) subq;

The above query does not use sort-merge join. This would be very useful as we automatically convert the queries to use sorting and bucketing properties for join.
[jira] [Commented] (HIVE-3562) Some limit can be pushed down to map stage
[ https://issues.apache.org/jira/browse/HIVE-3562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500165#comment-13500165 ]

Sivaramakrishnan Narayanan commented on HIVE-3562:

Apologies, you can use a heap to maintain a top-k as opposed to an array or a linked list. You may also want to consider the case where the top-k do not fit in memory. One possibility would be to employ this optimization only if K is less than some threshold.

This approach has the advantage that it is a Hive-only change and does not depend on a Hadoop change. That is a pretty big plus.
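A minimal sketch of the heap suggestion, in plain Java rather than Hive code: a bounded max-heap keeps the K smallest rows seen so far, so each accepted row costs O(log K) instead of shifting up to K array entries. The class name BoundedTopK, its methods, and the smallEnough threshold check are hypothetical and only illustrate the idea; the patch itself works on serialized byte[] keys.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper illustrating a heap-based top-k; not taken from the patch.
final class BoundedTopK<T> {
    private final int k;
    private final Comparator<T> order;
    // Max-heap with respect to `order`: the root is the worst row currently kept.
    private final PriorityQueue<T> heap;

    BoundedTopK(int k, Comparator<T> order) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        this.k = k;
        this.order = order;
        this.heap = new PriorityQueue<>(k, order.reversed());
    }

    // Assumed guard: only use the in-memory top-k when K is below a threshold.
    static boolean smallEnough(int k, int threshold) {
        return k > 0 && k <= threshold;
    }

    // Returns true if the row is kept, false if it cannot be in the top k.
    boolean offer(T row) {
        if (heap.size() < k) {
            heap.add(row);
            return true;
        }
        if (order.compare(row, heap.peek()) >= 0) {
            return false; // not better than the worst row we already hold
        }
        heap.poll(); // evict the current worst row
        heap.add(row);
        return true;
    }

    // The kept rows in ascending order.
    List<T> sorted() {
        List<T> out = new ArrayList<>(heap);
        out.sort(order);
        return out;
    }
}
```

A row for which offer returns false never needs to be forwarded to the shuffle. Rows that were accepted and later evicted may already have been emitted, which is harmless because the reducer still applies the final LIMIT.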
Build failed in Jenkins: Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false #203
See https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/ -- [...truncated 10343 lines...] compile-test: [echo] Project: serde [javac] Compiling 26 source files to https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/serde/test/classes [javac] Note: Some input files use or override a deprecated API. [javac] Note: Recompile with -Xlint:deprecation for details. [javac] Note: Some input files use unchecked or unsafe operations. [javac] Note: Recompile with -Xlint:unchecked for details. create-dirs: [echo] Project: service [copy] Warning: https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/service/src/test/resources does not exist. init: [echo] Project: service ivy-init-settings: [echo] Project: service ivy-resolve: [echo] Project: service [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/ivy/ivysettings.xml [ivy:report] Processing https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/ivy/resolution-cache/org.apache.hive-hive-service-default.xml to https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/ivy/report/org.apache.hive-hive-service-default.html ivy-retrieve: [echo] Project: service compile: [echo] Project: service ivy-resolve-test: [echo] Project: service ivy-retrieve-test: [echo] Project: service compile-test: [echo] Project: service [javac] Compiling 2 source files to https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/service/test/classes test: [echo] Project: hive test-shims: [echo] Project: hive test-conditions: [echo] Project: shims gen-test: [echo] Project: shims create-dirs: [echo] Project: shims [copy] Warning: https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/shims/src/test/resources does not exist. 
init: [echo] Project: shims ivy-init-settings: [echo] Project: shims ivy-resolve: [echo] Project: shims [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/ivy/ivysettings.xml [ivy:report] Processing https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/ivy/resolution-cache/org.apache.hive-hive-shims-default.xml to https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/ivy/report/org.apache.hive-hive-shims-default.html ivy-retrieve: [echo] Project: shims compile: [echo] Project: shims [echo] Building shims 0.20 build_shims: [echo] Project: shims [echo] Compiling https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/shims/src/common/java;/home/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/hive/shims/src/0.20/java against hadoop 0.20.2 (https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/hadoopcore/hadoop-0.20.2) ivy-init-settings: [echo] Project: shims ivy-resolve-hadoop-shim: [echo] Project: shims [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/ivy/ivysettings.xml ivy-retrieve-hadoop-shim: [echo] Project: shims [echo] Building shims 0.20S build_shims: [echo] Project: shims [echo] Compiling https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/shims/src/common/java;/home/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/hive/shims/src/common-secure/java;/home/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/hive/shims/src/0.20S/java against hadoop 1.0.0 (https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/hadoopcore/hadoop-1.0.0) ivy-init-settings: [echo] Project: shims ivy-resolve-hadoop-shim: [echo] Project: shims 
[ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/ivy/ivysettings.xml ivy-retrieve-hadoop-shim: [echo] Project: shims [echo] Building shims 0.23 build_shims: [echo] Project: shims [echo] Compiling https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/ws/hive/shims/src/common/java;/home/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/hive/shims/src/common-secure/java;/home/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/hive/shims/src/0.23/java against hadoop 0.23.3 (https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21-keepgoing=false/203/artifact/hive/build/hadoopcore/hadoop-0.23.3)
[jira] [Commented] (HIVE-3705) Adding authorization capability to the metastore
[ https://issues.apache.org/jira/browse/HIVE-3705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500375#comment-13500375 ]

Rob Weltman commented on HIVE-3705:

A new JIRA has been opened for the larger issues around the desired semantics of Hive authorization and ensuring they are enforced: https://issues.apache.org/jira/browse/HIVE-3720

Adding authorization capability to the metastore

Key: HIVE-3705
URL: https://issues.apache.org/jira/browse/HIVE-3705
Project: Hive
Issue Type: New Feature
Components: Authorization, Metastore
Reporter: Sushanth Sowmyan
Assignee: Sushanth Sowmyan
Attachments: HIVE-3705.D6681.1.patch, HIVE-3705.D6681.2.patch, hive-backend-auth.git.patch, hivesec_investigation.pdf

In an environment where multiple clients access a single metastore, and we want to evolve hive security to a point where it's no longer simply preventing users from shooting their own foot, we need to be able to authorize metastore calls as well, instead of simply performing every metastore api call that's made.
[jira] [Assigned] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pamela Vagata reassigned HIVE-3718:

Assignee: Pamela Vagata

Add check to determine whether partition can be dropped at Semantic Analysis time

Key: HIVE-3718
URL: https://issues.apache.org/jira/browse/HIVE-3718
Project: Hive
Issue Type: Task
Components: CLI
Reporter: Pamela Vagata
Assignee: Pamela Vagata
Priority: Minor
[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500469#comment-13500469 ]

Carl Steinbach commented on HIVE-2206:

@Yin: The correlation optimizer is only enabled for a small set of new CliDriver tests. If I enable the correlation optimizer by default, which of the existing CliDriver tests are expected to fail?

add a new optimizer for query correlation discovery and optimization

Key: HIVE-2206
URL: https://issues.apache.org/jira/browse/HIVE-2206
Project: Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch

This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence by sentence fashion, for every operation which may need to shuffle the data (e.g. join and aggregation operations), Hive will generate a MapReduce job for that operation. However, operations that need to shuffle the data may involve the correlations explained below and can then be executed in a single MR job.
# Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
# Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
# Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.

The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
# There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
# All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
# No self join is involved in those correlated MR jobs.

The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs. The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.

There are several pieces of work that can be done in the future to improve this optimizer. Here are three examples.
# Support queries that only involve TC;
# Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
# Optimize queries involving self join.

References: Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc
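The three correlation definitions in the issue description can be read as simple predicates over pairs of jobs. A minimal sketch in Java, assuming a hypothetical Job type that records only the input table names and the partition key columns; none of these names come from the HIVE-2206 patch, which works on Hive's operator tree instead.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical model of the IC / TC / JFC definitions; not Hive code.
final class CorrelationSketch {

    // Stand-in for an MR job: its input relations and its partition (shuffle) key.
    static final class Job {
        final Set<String> inputs;
        final List<String> partitionKey;

        Job(Set<String> inputs, List<String> partitionKey) {
            this.inputs = inputs;
            this.partitionKey = partitionKey;
        }
    }

    // Input correlation (IC): the input relation sets are not disjoint.
    static boolean inputCorrelated(Job a, Job b) {
        return !Collections.disjoint(a.inputs, b.inputs);
    }

    // Transit correlation (TC): input correlation plus the same partition key.
    static boolean transitCorrelated(Job a, Job b) {
        return inputCorrelated(a, b) && a.partitionKey.equals(b.partitionKey);
    }

    // Job flow correlation (JFC): a job and one of its child nodes share the partition key.
    static boolean jobFlowCorrelated(Job job, Job child) {
        return job.partitionKey.equals(child.partitionKey);
    }
}
```

For example, two jobs that both read table t2 and both shuffle on the same key column would satisfy inputCorrelated and transitCorrelated, while a job that reads a shared table but shuffles on a different column satisfies only inputCorrelated.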
[jira] [Updated] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pamela Vagata updated HIVE-3718:

Attachment: (was: HIVE-3718.1.patch.txt)
[jira] [Updated] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pamela Vagata updated HIVE-3718:

Attachment: HIVE-3718.1.patch.txt
[jira] [Updated] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pamela Vagata updated HIVE-3718:

Status: Patch Available (was: Open)
[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500474#comment-13500474 ]

David Inbar commented on HIVE-2206:

I will be on vacation through Friday Nov 23rd, but will be checking email and voicemail periodically. For all time-critical items, please call my mobile phone.

Many thanks,
David
[jira] [Commented] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500493#comment-13500493 ]

Kevin Wilfong commented on HIVE-3718:

+1
[jira] [Updated] (HIVE-3647) map-side groupby wrongly due to HIVE-3432
[ https://issues.apache.org/jira/browse/HIVE-3647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kevin Wilfong updated HIVE-3647:

Resolution: Fixed
Status: Resolved (was: Patch Available)

Committed, thanks Namit.

map-side groupby wrongly due to HIVE-3432

Key: HIVE-3647
URL: https://issues.apache.org/jira/browse/HIVE-3647
Project: Hive
Issue Type: Bug
Components: Query Processor
Reporter: Namit Jain
Assignee: Namit Jain
Attachments: hive.3647.1.patch, hive.3647.2.patch, hive.3647.3.patch, hive.3647.4.patch, hive.3647.5.patch, hive.3647.6.patch, hive.3647.7.patch, hive.3647.8.patch

There seems to be a bug due to HIVE-3432. We are converting the group by to a map side group by after only looking at sorting columns. This can give wrong results if the data is sorted and bucketed by different columns. Add some tests for that scenario, verify and fix any issues.
[jira] [Commented] (HIVE-3678) Add metastore upgrade scripts for column stats schema changes
[ https://issues.apache.org/jira/browse/HIVE-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500500#comment-13500500 ]

Carl Steinbach commented on HIVE-3678:

The upgrade scripts look good to me.

As for HIVE-3712 which is included in this patch, I have started to wonder if it would be better for the metastore DB to store the column stats values (e.g. min/max value, num trues/falses, min/max/avg length, etc) as a JSON text blob. This approach would make the code more portable by eliminating dependencies on specific DBs and will also make it easier to add new fields in the future. The big downside of this approach is that we won't be able to push down column stats filters on these fields, but I'm not convinced that this is a practical use case in the first place.

Add metastore upgrade scripts for column stats schema changes

Key: HIVE-3678
URL: https://issues.apache.org/jira/browse/HIVE-3678
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
Fix For: 0.10.0
Attachments: HIVE-3678.1.patch.txt

Add upgrade script for column statistics schema changes for Postgres/MySQL/Oracle/Derby
[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500499#comment-13500499 ]

Yin Huai commented on HIVE-2206:

[~cwsteinbach] If the optimizer is enabled by default then, based on my last tests, only auto_join26.q is expected to fail, because it will be optimized by the correlation optimizer. Apart from the query plan, however, the query result of auto_join26.q is correct. Also, once I finish HIVE-3671 (I am working on it right now), the failure of auto_join26.q should be eliminated.

add a new optimizer for query correlation discovery and optimization
Key: HIVE-2206
URL: https://issues.apache.org/jira/browse/HIVE-2206
Project: Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch

This issue proposes a new logical optimizer called Correlation Optimizer, which is used to merge correlated MapReduce jobs (MR jobs) into a single MR job. The idea is based on YSmart (http://ysmart.cse.ohio-state.edu/). The paper and slides of YSmart are linked at the bottom.

Since Hive translates queries in a sentence-by-sentence fashion, it generates a MapReduce job for every operation that may need to shuffle the data (e.g. join and aggregation operations). However, such operations may involve the correlations explained below, in which case they can be executed in a single MR job.
# Input Correlation: Multiple MR jobs have input correlation (IC) if their input relation sets are not disjoint;
# Transit Correlation: Multiple MR jobs have transit correlation (TC) if they have not only input correlation, but also the same partition key;
# Job Flow Correlation: An MR job has job flow correlation (JFC) with one of its child nodes if it has the same partition key as that child node.

The current implementation of the correlation optimizer only detects correlations among MR jobs for reduce-side join operators and reduce-side aggregation operators (not map-only aggregation). A query will be optimized if it satisfies the following conditions.
# There exists an MR job for a reduce-side join operator or reduce-side aggregation operator which has JFC with all of its parent MR jobs (TCs will also be exploited if JFC exists);
# All input tables of those correlated MR jobs are original input tables (not intermediate tables generated by sub-queries); and
# No self join is involved in those correlated MR jobs.

The correlation optimizer is implemented as a logical optimizer. The main reasons are that it only needs to manipulate the query plan tree and that it can leverage the existing component for generating MR jobs. The current implementation can serve as a framework for correlation-related optimizations. I think that it is better than adding individual optimizers.

Several pieces of work can be done in the future to improve this optimizer. Here are three examples.
# Support queries that only involve TC;
# Support queries in which the input tables of correlated MR jobs involve intermediate tables; and
# Optimize queries involving self join.

References: Paper and presentation of YSmart.
Paper: http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
Slides: http://sdrv.ms/UpwJJc

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators. For more information on JIRA, see: http://www.atlassian.com/software/jira
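
The three correlation definitions in the HIVE-2206 description can be read as simple predicates over pairs of jobs. The following is only a minimal sketch of that reading, not code from the patch: the MRJob class, its fields, and the table and column names are all hypothetical, and the real optimizer works on Hive operator trees rather than on a job model like this one.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class MRJob:
    """Hypothetical stand-in for a MapReduce job in a query plan."""
    name: str
    inputs: FrozenSet[str]          # names of the input relations
    partition_key: Tuple[str, ...]  # columns the shuffle partitions on


def has_input_correlation(a: MRJob, b: MRJob) -> bool:
    # IC: the input relation sets are not disjoint.
    return not a.inputs.isdisjoint(b.inputs)


def has_transit_correlation(a: MRJob, b: MRJob) -> bool:
    # TC: input correlation plus the same partition key.
    return has_input_correlation(a, b) and a.partition_key == b.partition_key


def has_job_flow_correlation(job: MRJob, child: MRJob) -> bool:
    # JFC: the job has the same partition key as its child node.
    return job.partition_key == child.partition_key


# Illustrative jobs; every name here is made up for the example.
agg_a = MRJob("agg_a", frozenset({"orders"}), ("customer_id",))
agg_b = MRJob("agg_b", frozenset({"orders", "customers"}), ("customer_id",))
other = MRJob("other", frozenset({"orders"}), ("order_date",))
join = MRJob("join", frozenset({"agg_a.out", "agg_b.out"}), ("customer_id",))
```

In this toy plan agg_a and agg_b share an input table and a partition key, so they have both IC and TC, and each has JFC with join. The job named other shares an input with agg_a but partitions on a different key, so it has IC only.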
[jira] [Updated] (HIVE-3718) Add check to determine whether partition can be dropped at Semantic Analysis time
[ https://issues.apache.org/jira/browse/HIVE-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pamela Vagata updated HIVE-3718:
Attachment: (was: HIVE-3718.1.patch.txt)

Add check to determine whether partition can be dropped at Semantic Analysis time
Key: HIVE-3718
URL: https://issues.apache.org/jira/browse/HIVE-3718
Project: Hive
Issue Type: Task
Components: CLI
Reporter: Pamela Vagata
Assignee: Pamela Vagata
Priority: Minor
Attachments: HIVE-3718.1.patch.txt
[jira] [Updated] (HIVE-3719) Improve HiveServer to support username/password authentication
[ https://issues.apache.org/jira/browse/HIVE-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ashutosh Chauhan updated HIVE-3719:
Assignee: Yu Gao

Improve HiveServer to support username/password authentication
Key: HIVE-3719
URL: https://issues.apache.org/jira/browse/HIVE-3719
Project: Hive
Issue Type: Improvement
Components: Authentication, JDBC
Affects Versions: 0.9.0
Reporter: Yu Gao
Assignee: Yu Gao
Labels: security

The current HiveServer implementation (call it HiveServer version 1 to distinguish it from HiveServer2, which is currently under development) has no authentication mechanism for connecting clients. This means anyone can access it, e.g. through the Hive JDBC driver, without any security control. The user and password properties are simply ignored by the Hive JDBC driver and never reach HiveServer1.

It would be good to introduce authentication infrastructure to HiveServer1 and to improve the JDBC driver implementation to support it. Together with the existing authorization infrastructure, connections and operations would then be under security control for applications that access HiveServer1 via the JDBC driver.

Although HiveServer2 has been under implementation for a while, this improvement for HiveServer1 is necessary to close a big security hole, and it would greatly benefit applications that use HiveServer1.
[jira] [Commented] (HIVE-3678) Add metastore upgrade scripts for column stats schema changes
[ https://issues.apache.org/jira/browse/HIVE-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500526#comment-13500526 ]

Shreepadma Venugopalan commented on HIVE-3678:

With the changes from HIVE-3712, the column schema has *no* dependency on any specific DB. The column schema, with the changes from HIVE-3712, uses simple data types, which are supported across DBs. The primary motivation for changing the schema in HIVE-3712 was to avoid storing column statistics fields as a BLOB. The problems with using a BLOB are:
a) BLOBs are designed to store large volumes of data, on the order of GBs, and are hence stored outside the row. A consequence of this design is that BLOBs don't perform well for storing small amounts of data. While some DBs such as Oracle inline small BLOBs, not all DBs do. While BLOBs are the only practical choice for storing data whose size is not known in advance, they are overkill for storing around 100 bytes of data; and
b) there is no uniform support across DB vendors and versions.
Hence I don't really see the value in storing this as a JSON BLOB.

Add metastore upgrade scripts for column stats schema changes
Key: HIVE-3678
URL: https://issues.apache.org/jira/browse/HIVE-3678
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
Fix For: 0.10.0
Attachments: HIVE-3678.1.patch.txt

Add upgrade script for column statistics schema changes for Postgres/MySQL/Oracle/Derby
[jira] [Updated] (HIVE-3709) Stop storing default ConfVars in temp file
[ https://issues.apache.org/jira/browse/HIVE-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Carl Steinbach updated HIVE-3709:
Status: Open (was: Patch Available)

@Kevin: I still see errors in TestHiveServerSessions when I run the test individually:

% ant clean package test -Dtestcase=TestHiveServerSessions
test:
[echo] Project: service
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/Users/carl/.local/java/ant/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:file:/Users/carl/Work/repos/hive-test/build/ivy/lib/hadoop0.20.shim/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.hadoop.hive.service.TestHiveServerSessions
[junit] Hive history file=/Users/carl/Work/repos/hive-test/build/service/tmp/hive_job_log_carl_201211191056_789001489.txt
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 8.439 sec
[junit] Hive history file=/Users/carl/Work/repos/hive-test/build/service/tmp/hive_job_log_carl_201211191056_788616740.txt
[junit] [Fatal Error] :1:1: Content is not allowed in prolog.
[junit] [Fatal Error] :92:58: The element type name must be terminated by the matching end-tag /name.
[junit] Test org.apache.hadoop.hive.service.TestHiveServerSessions FAILED
[for] /Users/carl/Work/repos/hive-test/service/build.xml: The following error occurred while executing this line:
[for] /Users/carl/Work/repos/hive-test/build.xml:325: The following error occurred while executing this line:
[for] /Users/carl/Work/repos/hive-test/build-common.xml:455: Tests failed!
BUILD FAILED
/Users/carl/Work/repos/hive-test/build.xml:320: Keepgoing execution: 1 of 12 iterations failed.

Stop storing default ConfVars in temp file
Key: HIVE-3709
URL: https://issues.apache.org/jira/browse/HIVE-3709
Project: Hive
Issue Type: Improvement
Components: Configuration
Affects Versions: 0.10.0
Reporter: Kevin Wilfong
Assignee: Kevin Wilfong
Attachments: HIVE-3709.1.patch.txt, HIVE-3709.2.patch.txt

To work around issues with Hadoop's Configuration object, specifically its addResource(InputStream), default configurations are written to a temp file (I think HIVE-2362 introduced this). This, however, introduces the problem that once that file is deleted from /tmp the client crashes. This is particularly problematic for long-running services like the metastore server. Writing a custom InputStream to deal with the problems in the Configuration object should provide a workaround that does not introduce a time bomb into Hive.
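
The failure mode described in HIVE-3709 can be illustrated outside Hive. The sketch below is a Python stand-in under stated assumptions, not Hive or Hadoop code: DEFAULTS_XML, the property name, and the helper functions are hypothetical, and the in-memory stream only models the idea of serving defaults without a file on disk. It does not model the Configuration quirks that the custom InputStream would have to handle.

```python
import io
import os
import tempfile

# Hypothetical serialized defaults; the property name is made up.
DEFAULTS_XML = (
    b"<configuration><property><name>example.key</name>"
    b"<value>example-value</value></property></configuration>"
)


def load_from_temp_file(path: str) -> bytes:
    # Re-reads the defaults from disk every time they are needed.
    with open(path, "rb") as handle:
        return handle.read()


def load_from_memory() -> bytes:
    # Serves the same bytes from memory, so nothing on disk can vanish.
    return io.BytesIO(DEFAULTS_XML).read()


def demo() -> tuple:
    fd, path = tempfile.mkstemp(suffix=".xml")
    with os.fdopen(fd, "wb") as handle:
        handle.write(DEFAULTS_XML)

    first_read_ok = load_from_temp_file(path) == DEFAULTS_XML
    os.remove(path)  # simulates a cleaner deleting the file from /tmp

    try:
        load_from_temp_file(path)
        reload_ok = True
    except FileNotFoundError:
        reload_ok = False

    memory_ok = load_from_memory() == DEFAULTS_XML
    return first_read_ok, reload_ok, memory_ok
```

Calling demo() reads the file once successfully, fails to re-read it after deletion, and still succeeds from memory, which is the property a long-running service such as the metastore server needs.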
[jira] [Commented] (HIVE-3678) Add metastore upgrade scripts for column stats schema changes
[ https://issues.apache.org/jira/browse/HIVE-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500570#comment-13500570 ]

Carl Steinbach commented on HIVE-3678:

Sorry for the confusion. When I wrote blob I was trying to convey only that the field will be opaque to the DB (since it's a JSON struct), not that it will actually be stored in a BLOB column. If we store the JSON struct in a VARCHAR we have at least 4000 bytes to work with.

Add metastore upgrade scripts for column stats schema changes
Key: HIVE-3678
URL: https://issues.apache.org/jira/browse/HIVE-3678
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
Fix For: 0.10.0
Attachments: HIVE-3678.1.patch.txt

Add upgrade script for column statistics schema changes for Postgres/MySQL/Oracle/Derby
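
To make the VARCHAR suggestion concrete, here is a minimal sketch of encoding column statistics as a JSON struct and checking that it fits the 4000-byte figure mentioned in the comment. It is an illustration only: the field names, the example values, and both helper functions are assumptions, not the metastore schema or any Hive API.

```python
import json

# Size budget taken from the comment above ("at least 4000 bytes").
VARCHAR_LIMIT_BYTES = 4000


def encode_column_stats(stats: dict) -> str:
    """Serialize stats to compact JSON and enforce the size budget."""
    payload = json.dumps(stats, separators=(",", ":"), sort_keys=True)
    size = len(payload.encode("utf-8"))
    if size > VARCHAR_LIMIT_BYTES:
        raise ValueError(
            f"encoded stats are {size} bytes, limit is {VARCHAR_LIMIT_BYTES}"
        )
    return payload


def decode_column_stats(payload: str) -> dict:
    """Parse the JSON struct back into a dictionary."""
    return json.loads(payload)


# Hypothetical statistics for one column; field names are made up.
example_stats = {
    "columnType": "bigint",
    "lowValue": 1,
    "highValue": 100,
    "numNulls": 0,
    "numDistinctValues": 57,
}
encoded = encode_column_stats(example_stats)
```

A struct like this stays opaque to the database, and adding a field for a new type only changes the JSON, not the table definition. The trade-off raised earlier in the thread still applies: the database cannot query or validate individual fields.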
Re: Review Request: HIVE-2206: add a new optimizer for query correlation discovery and optimization
This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/7126/

(Updated Nov. 19, 2012, 7:51 p.m.)

Review request for hive.

Changes
The correlation optimizer will guess which join operators at the bottom of the plan (those whose input tables are not intermediate tables) will be optimized by auto join conversion, and it will ignore those join operators during its own optimization.

Description
This optimizer exploits intra-query correlations and merges multiple correlated MapReduce jobs into one job. Opened a new request since I have been working on hive-git.

This addresses bug HIVE-2206.
https://issues.apache.org/jira/browse/HIVE-2206

Diffs (updated)
common/src/java/org/apache/hadoop/hive/conf/HiveConf.java 9fa9525
conf/hive-default.xml.template f332f3a
ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/OperatorType.java 7c4c413
ql/src/java/org/apache/hadoop/hive/ql/exec/BaseReduceSinkOperator.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationCompositeOperator.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationLocalSimulativeReduceSinkOperator.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/CorrelationReducerDispatchOperator.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/exec/ExecReducer.java 18a9bd2
ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java 46daeb2
ql/src/java/org/apache/hadoop/hive/ql/exec/Operator.java 68302f8
ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java 0c22141
ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java 919a140
ql/src/java/org/apache/hadoop/hive/ql/exec/TableScanOperator.java 1469325
ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizer.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/optimizer/CorrelationOptimizerUtils.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/optimizer/GenMapRedUtils.java edde378
ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java d1555e2
ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java 2bf284d
ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java 330aa52
ql/src/java/org/apache/hadoop/hive/ql/plan/BaseReduceSinkDesc.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationCompositeDesc.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationLocalSimulativeReduceSinkDesc.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/plan/CorrelationReducerDispatchDesc.java PRE-CREATION
ql/src/java/org/apache/hadoop/hive/ql/plan/MapredWork.java 5a9f064
ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java b33d616
ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 9a95efd
ql/src/test/org/apache/hadoop/hive/ql/exec/TestExecDriver.java 6f8bc47
ql/src/test/queries/clientpositive/correlationoptimizer1.q PRE-CREATION
ql/src/test/queries/clientpositive/correlationoptimizer2.q PRE-CREATION
ql/src/test/queries/clientpositive/correlationoptimizer3.q PRE-CREATION
ql/src/test/queries/clientpositive/correlationoptimizer4.q PRE-CREATION
ql/src/test/queries/clientpositive/correlationoptimizer5.q PRE-CREATION
ql/src/test/results/clientpositive/correlationoptimizer1.q.out PRE-CREATION
ql/src/test/results/clientpositive/correlationoptimizer2.q.out PRE-CREATION
ql/src/test/results/clientpositive/correlationoptimizer3.q.out PRE-CREATION
ql/src/test/results/clientpositive/correlationoptimizer4.q.out PRE-CREATION
ql/src/test/results/clientpositive/correlationoptimizer5.q.out PRE-CREATION
ql/src/test/results/compiler/plan/groupby1.q.xml cd0d6e4
ql/src/test/results/compiler/plan/groupby2.q.xml 7b07f02
ql/src/test/results/compiler/plan/groupby3.q.xml a6a1986
ql/src/test/results/compiler/plan/groupby5.q.xml 25e3583

Diff: https://reviews.apache.org/r/7126/diff/

Testing
All tests pass.

Thanks,
Yin Huai
[jira] [Updated] (HIVE-3648) HiveMetaStoreFsImpl is not compatible with hadoop viewfs
[ https://issues.apache.org/jira/browse/HIVE-3648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arup Malakar updated HIVE-3648:
Attachment: HIVE_3648_branch_0.patch, HIVE_3648_trunk_1.patch

Patch available for branch. Added one missing abstract method in HadoopShimsSecure class.
Updated trunk review: https://reviews.facebook.net/D6759
Branch review: https://reviews.facebook.net/D6801

Thanks,
Arup

HiveMetaStoreFsImpl is not compatible with hadoop viewfs
Key: HIVE-3648
URL: https://issues.apache.org/jira/browse/HIVE-3648
Project: Hive
Issue Type: Bug
Components: Metastore
Affects Versions: 0.9.0, 0.10.0
Reporter: Kihwal Lee
Attachments: HIVE_3648_branch_0.patch, HIVE-3648-trunk-0.patch, HIVE_3648_trunk_1.patch

HiveMetaStoreFsImpl#deleteDir() method calls Trash#moveToTrash(). This may not work when viewfs is used. It needs to call Trash#moveToAppropriateTrash() instead. Please note that this method is not available in hadoop versions earlier than 0.23.
[jira] [Created] (HIVE-3721) ALTER TABLE ADD PARTS should check for valid partition spec and throw a SemanticException if part spec is not valid
Pamela Vagata created HIVE-3721:
Summary: ALTER TABLE ADD PARTS should check for valid partition spec and throw a SemanticException if part spec is not valid
Key: HIVE-3721
URL: https://issues.apache.org/jira/browse/HIVE-3721
Project: Hive
Issue Type: Task
Reporter: Pamela Vagata
Assignee: Pamela Vagata
Priority: Minor
[jira] [Updated] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yin Huai updated HIVE-2206:
Attachment: HIVE-2206.19-r1410581.patch.txt

I just integrated HIVE-3671 into this patch. At the beginning of the correlation optimizer, it predicts whether a join operator will be converted by CommonJoinResolver. If so, the correlation optimizer annotates this join operator and ignores it in the subsequent optimization. The prediction can only be made for join operators whose input tables are not intermediate tables. The prediction method is ported from CommonJoinResolver. Also, a test has been added to correlationoptimizer1.q.

[~namit] Please take a look at this patch. Let me know if you have any comments.

add a new optimizer for query correlation discovery and optimization
Key: HIVE-2206
URL: https://issues.apache.org/jira/browse/HIVE-2206
Project: Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
[jira] [Commented] (HIVE-3678) Add metastore upgrade scripts for column stats schema changes
[ https://issues.apache.org/jira/browse/HIVE-3678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500599#comment-13500599 ]

Ashutosh Chauhan commented on HIVE-3678:

I agree with Carl: making the schema easier to evolve, so that it's independent of the exact type, will be a win. We already have one such use case with BigDecimal support being added over on HIVE-2693.

Also, the following looks like an unintentional change.
{code}
-- Constraints for table PARTITION_KEYS
-ALTER TABLE PARTITION_KEYS ADD CONSTRAINT PARTITION_KEYS_FK1 FOREIGN KEY (TBL_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;
+ALTER TABLE PARTITION_KEYS ADD CONSTRAINT PARTITION_KEYS_FK1 FOREIGN KEY (TBTB_ID) REFERENCES TBLS (TBL_ID) INITIALLY DEFERRED ;
{code}

Add metastore upgrade scripts for column stats schema changes
Key: HIVE-3678
URL: https://issues.apache.org/jira/browse/HIVE-3678
Project: Hive
Issue Type: Bug
Components: Metastore
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
Fix For: 0.10.0
Attachments: HIVE-3678.1.patch.txt

Add upgrade script for column statistics schema changes for Postgres/MySQL/Oracle/Derby
[jira] [Commented] (HIVE-2206) add a new optimizer for query correlation discovery and optimization
[ https://issues.apache.org/jira/browse/HIVE-2206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500626#comment-13500626 ]

Carl Steinbach commented on HIVE-2206:

I'm surprised that auto_join26 is the only test that fails due to different EXPLAIN output. Is that because this optimization doesn't affect the queries in most tests, or because we don't consistently call EXPLAIN in the tests? What is preventing us from enabling this by default right now?

add a new optimizer for query correlation discovery and optimization
Key: HIVE-2206
URL: https://issues.apache.org/jira/browse/HIVE-2206
Project: Hive
Issue Type: New Feature
Components: Query Processor
Affects Versions: 0.10.0
Reporter: He Yongqiang
Assignee: Yin Huai
Attachments: HIVE-2206.10-r1384442.patch.txt, HIVE-2206.11-r1385084.patch.txt, HIVE-2206.12-r1386996.patch.txt, HIVE-2206.13-r1389072.patch.txt, HIVE-2206.14-r1389704.patch.txt, HIVE-2206.15-r1392491.patch.txt, HIVE-2206.16-r1399936.patch.txt, HIVE-2206.17-r1404933.patch.txt, HIVE-2206.18-r1407720.patch.txt, HIVE-2206.19-r1410581.patch.txt, HIVE-2206.1.patch.txt, HIVE-2206.2.patch.txt, HIVE-2206.3.patch.txt, HIVE-2206.4.patch.txt, HIVE-2206.5-1.patch.txt, HIVE-2206.5.patch.txt, HIVE-2206.6.patch.txt, HIVE-2206.7.patch.txt, HIVE-2206.8.r1224646.patch.txt, HIVE-2206.8-r1237253.patch.txt, testQueries.2.q, YSmartPatchForHive.patch
Build failed in Jenkins: Hive-0.9.1-SNAPSHOT-h0.21 #203
See https://builds.apache.org/job/Hive-0.9.1-SNAPSHOT-h0.21/203/ -- [...truncated 36981 lines...] [junit] POSTHOOK: Input: default@testhivedrivertable [junit] POSTHOOK: Output: file:/tmp/jenkins/hive_2012-11-19_12-44-29_760_9041461297608391868/-mr-1 [junit] OK [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] PREHOOK: Input: default@testhivedrivertable [junit] PREHOOK: Output: default@testhivedrivertable [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Input: default@testhivedrivertable [junit] POSTHOOK: Output: default@testhivedrivertable [junit] OK [junit] Hive history file=/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/build/service/tmp/hive_job_log_jenkins_201211191244_402762564.txt [junit] Copying file: file:/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/data/files/kv1.txt [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] OK [junit] PREHOOK: query: create table testhivedrivertable (num int) [junit] PREHOOK: type: DROPTABLE [junit] POSTHOOK: query: create table testhivedrivertable (num int) [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Output: default@testhivedrivertable [junit] OK [junit] PREHOOK: query: load data local inpath '/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/data/files/kv1.txt' into table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] PREHOOK: Output: default@testhivedrivertable [junit] Copying data from file:/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/data/files/kv1.txt [junit] Loading data to table default.testhivedrivertable [junit] POSTHOOK: query: load data local inpath '/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/data/files/kv1.txt' into table testhivedrivertable [junit] 
POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Output: default@testhivedrivertable [junit] OK [junit] PREHOOK: query: select * from testhivedrivertable limit 10 [junit] PREHOOK: type: DROPTABLE [junit] PREHOOK: Input: default@testhivedrivertable [junit] PREHOOK: Output: file:/tmp/jenkins/hive_2012-11-19_12-44-33_658_2399849414089401271/-mr-1 [junit] POSTHOOK: query: select * from testhivedrivertable limit 10 [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Input: default@testhivedrivertable [junit] POSTHOOK: Output: file:/tmp/jenkins/hive_2012-11-19_12-44-33_658_2399849414089401271/-mr-1 [junit] OK [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] PREHOOK: Input: default@testhivedrivertable [junit] PREHOOK: Output: default@testhivedrivertable [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Input: default@testhivedrivertable [junit] POSTHOOK: Output: default@testhivedrivertable [junit] OK [junit] Hive history file=/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/build/service/tmp/hive_job_log_jenkins_201211191244_1902789586.txt [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] OK [junit] PREHOOK: query: create table testhivedrivertable (num int) [junit] PREHOOK: type: DROPTABLE [junit] POSTHOOK: query: create table testhivedrivertable (num int) [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Output: default@testhivedrivertable [junit] OK [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] PREHOOK: Input: default@testhivedrivertable [junit] PREHOOK: Output: default@testhivedrivertable [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] POSTHOOK: Input: default@testhivedrivertable [junit] POSTHOOK: Output: 
default@testhivedrivertable [junit] OK [junit] Hive history file=/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/build/service/tmp/hive_job_log_jenkins_201211191244_994263279.txt [junit] Hive history file=/x1/jenkins/jenkins-slave/workspace/Hive-0.9.1-SNAPSHOT-h0.21/hive/build/service/tmp/hive_job_log_jenkins_201211191244_1983954224.txt [junit] PREHOOK: query: drop table testhivedrivertable [junit] PREHOOK: type: DROPTABLE [junit] POSTHOOK: query: drop table testhivedrivertable [junit] POSTHOOK: type: DROPTABLE [junit] OK [junit] PREHOOK: query: create table testhivedrivertable (key int, value string)
Re: hive 0.10 release
Another quick update. I have created a hive-0.10 branch. At this point, HIVE-3678 is a blocker for a 0.10 release. There are a few other nice-to-haves, which were listed in my previous email. I will be happy to merge new patches between now and the RC if folks request them and they are low risk.

Thanks,
Ashutosh

On Thu, Nov 15, 2012 at 2:29 PM, Ashutosh Chauhan hashut...@apache.org wrote:
Good progress. Looks like folks are on board. I propose to cut the branch in the next couple of days. There are a few jiras which are patch ready and which I want to get into the hive-0.10 release, including HIVE-3255, HIVE-2517, HIVE-3400 and HIVE-3678. Ed has already made a request for HIVE-3083. If folks have other patches they want to see in 0.10, please chime in. Also, a request to other committers to help review patches. There are quite a few in Patch Available state.

Thanks,
Ashutosh

On Thu, Nov 8, 2012 at 3:22 PM, Owen O'Malley omal...@apache.org wrote:
+1

On Thu, Nov 8, 2012 at 3:18 PM, Carl Steinbach c...@cloudera.com wrote:
+1

On Wed, Nov 7, 2012 at 11:23 PM, Alexander Lorenz wget.n...@gmail.com wrote:
+1, good karma

On Nov 8, 2012, at 4:58 AM, Namit Jain nj...@fb.com wrote:
+1 to the idea

On 11/8/12 6:33 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
That sounds good. I think this issue needs to be solved, as well as anything else that produces a bogus query result.
https://issues.apache.org/jira/browse/HIVE-3083

Edward

On Wed, Nov 7, 2012 at 7:50 PM, Ashutosh Chauhan hashut...@apache.org wrote:
Hi,
It's been a while since we released 0.9, more than six months ago. All this while, a lot of action has happened, with various cool features landing in trunk. Additionally, I am looking forward to HiveServer2 landing in trunk. So, I propose that we cut the branch for 0.10 soon afterwards and then release it. Thoughts?

Thanks,
Ashutosh

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
Re: hive 0.10 release
There are a couple of enhancements that I have been working on, mainly related to the Hive/HBase integration. It would be awesome if they could be included in this release. None of them should be high risk. I have submitted patches for a few of them and will try to get the others submitted in the next couple of days. Is there a specific deadline I should be aiming for? [1] https://issues.apache.org/jira/browse/HIVE-2599 (Patch Available) [2] https://issues.apache.org/jira/browse/HIVE-3553 (Patch Available) [3] https://issues.apache.org/jira/browse/HIVE-3211 [4] https://issues.apache.org/jira/browse/HIVE-3555 [5] https://issues.apache.org/jira/browse/HIVE-3725 -- Swarnim
Hive-trunk-h0.21 - Build # 1805 - Still Failing
Changes for Build #1764 [kevinwilfong] HIVE-3610. Add a command Explain dependency ... (Sambavi Muthukrishnan via kevinwilfong) Changes for Build #1765 Changes for Build #1766 [hashutosh] HIVE-3441 : testcases escape1,escape2 fail on windows (Thejas Nair via Ashutosh Chauhan) [kevinwilfong] HIVE-3499. add tests to use bucketing metadata for partitions. (njain via kevinwilfong) Changes for Build #1767 [kevinwilfong] HIVE-3276. optimize union sub-queries. (njain via kevinwilfong) Changes for Build #1768 Changes for Build #1769 Changes for Build #1770 [namit] HIVE-3570 Add/fix facility to collect operator specific statisticsin hive + add hash-in/hash-out counter for GroupBy Optr (Satadru Pan via namit) [namit] HIVE-3554 Hive List Bucketing - Query logic (Gang Tim Liu via namit) [cws] HIVE-3563. Drop database cascade fails when there are indexes on any tables (Prasad Mujumdar via cws) Changes for Build #1771 [kevinwilfong] HIVE-3640. Reducer allocation is incorrect if enforce bucketing and mapred.reduce.tasks are both set. (Vighnesh Avadhani via kevinwilfong) Changes for Build #1772 Changes for Build #1773 Changes for Build #1774 Changes for Build #1775 [namit] HIVE-3673 Sort merge join not used when join columns have different names (Kevin Wilfong via namit) Changes for Build #1776 [kevinwilfong] HIVE-3627. eclipse misses library: javolution-@javolution-version@.jar. (Gang Tim Liu via kevinwilfong) Changes for Build #1777 [kevinwilfong] HIVE-3524. Storing certain Exception objects thrown in HiveMetaStore.java in MetaStoreEndFunctionContext. (Maheshwaran Srinivasan via kevinwilfong) [cws] HIVE-1977. DESCRIBE TABLE syntax doesn't support specifying a database qualified table name (Zhenxiao Luo via cws) [cws] HIVE-3674. Test case TestParse broken after recent checkin (Sambavi Muthukrishnan via cws) Changes for Build #1778 [cws] HIVE-1362. 
Column level scalar valued statistics on Tables and Partitions (Shreepadma Venugopalan via cws) Changes for Build #1779 Changes for Build #1780 [kevinwilfong] HIVE-3686. Fix compile errors introduced by the interaction of HIVE-1362 and HIVE-3524. (Shreepadma Venugopalan via kevinwilfong) Changes for Build #1781 [namit] HIVE-3687 smb_mapjoin_13.q is nondeterministic (Kevin Wilfong via namit) Changes for Build #1782 [hashutosh] HIVE-2715: Upgrade Thrift dependency to 0.9.0 (Ashutosh Chauhan) Changes for Build #1783 [kevinwilfong] HIVE-3654. block relative path access in hive. (njain via kevinwilfong) [hashutosh] HIVE-3658 : Unable to generate the Hbase related unit tests using velocity templates on Windows (Kanna Karanam via Ashutosh Chauhan) [hashutosh] HIVE-3661 : Remove the Windows specific = related swizzle path changes from Proxy FileSystems (Kanna Karanam via Ashutosh Chauhan) [hashutosh] HIVE-3480 : Resource leak: Fix the file handle leaks in Symbolic Symlink related input formats. (Kanna Karanam via Ashutosh Chauhan) Changes for Build #1784 [kevinwilfong] HIVE-3675. NaN does not work correctly for round(n). (njain via kevinwilfong) [cws] HIVE-3651. bucketmapjoin?.q tests fail with hadoop 0.23 (Prasad Mujumdar via cws) Changes for Build #1785 [namit] HIVE-3613 Implement grouping_id function (Ian Gorbachev via namit) [namit] HIVE-3692 Update parallel test documentation (Ivan Gorbachev via namit) [namit] HIVE-3649 Hive List Bucketing - enhance DDL to specify list bucketing table (Gang Tim Liu via namit) Changes for Build #1786 [namit] HIVE-3696 Revert HIVE-3483 which causes performance regression (Gang Tim Liu via namit) Changes for Build #1787 [kevinwilfong] HIVE-3621. Make prompt in Hive CLI configurable. (Jingwei Lu via kevinwilfong) [kevinwilfong] HIVE-3695. TestParse breaks due to HIVE-3675. (njain via kevinwilfong) Changes for Build #1788 [kevinwilfong] HIVE-3557. Access to external URLs in hivetest.py. 
(Ivan Gorbachev via kevinwilfong) Changes for Build #1789 [hashutosh] HIVE-3662 : TestHiveServer: testScratchDirShouldClearWhileStartup is failing on Windows (Kanna Karanam via Ashutosh Chauhan) [hashutosh] HIVE-3659 : TestHiveHistory::testQueryloglocParentDirNotExist Test fails on Windows because of some resource leaks in ZK (Kanna Karanam via Ashutosh Chauhan) [hashutosh] HIVE-3663 Unable to display the MR Job file path on Windows in case of MR job failures. (Kanna Karanam via Ashutosh Chauhan) Changes for Build #1790 Changes for Build #1791 Changes for Build #1792 Changes for Build #1793 [hashutosh] HIVE-3704 : name of some metastore scripts are not per convention (Ashutosh Chauhan) Changes for Build #1794 [hashutosh] HIVE-3243 : ignore white space between entries of hive/hbase table mapping (Shengsheng Huang via Ashutosh Chauhan) [hashutosh] HIVE-3215 : JobDebugger should use RunningJob.getTrackingURL (Bhushan Mandhani via Ashutosh Chauhan) Changes for Build #1795 [cws] HIVE-3437. 0.23 compatibility: fix unit tests when building against 0.23 (Chris Drome via cws)
[jira] [Commented] (HIVE-3722) Create index fails on CLI using remote metastore
[ https://issues.apache.org/jira/browse/HIVE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500828#comment-13500828 ] Namit Jain commented on HIVE-3722: +1 Create index fails on CLI using remote metastore Key: HIVE-3722 URL: https://issues.apache.org/jira/browse/HIVE-3722 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.10.0 Reporter: Kevin Wilfong Assignee: Kevin Wilfong Attachments: HIVE-3722.1.patch.txt If the CLI uses a remote metastore and the user attempts to create an index without a comment, it will fail with a NPE. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-3722) Create index fails on CLI using remote metastore
[ https://issues.apache.org/jira/browse/HIVE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500834#comment-13500834 ] Ashutosh Chauhan commented on HIVE-3722: Kevin, I am not sure if you have looked at the discussion on HIVE-2800 Adding a null-check may just be masking an underlying issue. I think it might be worthwhile to uncover it, since this thrift nuisance (of null handling) may bite us again in future.
[jira] [Commented] (HIVE-3722) Create index fails on CLI using remote metastore
[ https://issues.apache.org/jira/browse/HIVE-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500847#comment-13500847 ] Kevin Wilfong commented on HIVE-3722: Ashutosh, I missed that JIRA. But based on THRIFT-1625 it sounds like we have to add a check to our code.
[jira] [Commented] (HIVE-3589) describe/show partition/show tblproperties command should accept database name
[ https://issues.apache.org/jira/browse/HIVE-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500857#comment-13500857 ] Phabricator commented on HIVE-3589: navis has commented on the revision "HIVE-3589 [jira] describe/show partition/show tblproperties command should accept database name". INLINE COMMENTS ql/src/java/org/apache/hadoop/hive/ql/exec/DDLTask.java:1802 Fixed. ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java:1407 I just split the original method into two. The exception seemed to be for handling thrift errors and should be re-thrown to the user. ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java:1472 Agreed. I'll do it. ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java:1474 I always thought splitting with a regex pattern for this kind of simple string is a bit too much. But if it's cleaner, I'll do it. ql/src/java/org/apache/hadoop/hive/ql/plan/DescTableDesc.java:38 OK. ql/src/java/org/apache/hadoop/hive/ql/plan/DescTableDesc.java:112 I'll check on that. ql/src/java/org/apache/hadoop/hive/ql/plan/ShowPartitionsDesc.java:64 OK. ql/src/java/org/apache/hadoop/hive/ql/plan/ShowTblPropertiesDesc.java:34 OK. ql/src/test/queries/clientpositive/describe_table.q:5 Yes, it was HIVE-3676. I'll add the test. REVISION DETAIL https://reviews.facebook.net/D6075 BRANCH DPAL-1916 To: JIRA, cwsteinbach, navis describe/show partition/show tblproperties command should accept database name Key: HIVE-3589 URL: https://issues.apache.org/jira/browse/HIVE-3589 Project: Hive Issue Type: Bug Components: Metastore, Query Processor Affects Versions: 0.8.1 Reporter: Sujesh Chirackkal Assignee: Navis Priority: Minor Attachments: HIVE-3589.D6075.1.patch The describe command does not return the table details when called as describe dbname.tablename; it throws the error "Table dbname not found". Ex: hive -e "describe masterdb.table1" throws the error "Table masterdb not found".
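For readers following the HIVE-3589 review, the name-splitting step discussed in the inline comments can be pictured with a minimal sketch. This is not the code from the patch: the class QualifiedTableName and its fields are hypothetical names chosen for illustration, and the sketch assumes the input is either tablename or dbname.tablename, ignoring the table.column form that DESCRIBE also has to handle. It uses indexOf rather than a regex, in line with the remark that a regex may be heavy for such a simple string.

```java
/**
 * Hypothetical helper (not part of Hive) that splits "dbname.tablename"
 * into its two parts without using a regular expression.
 */
public final class QualifiedTableName {
    /** Database part, or null when the name was not qualified. */
    public final String database;
    /** Table part, never null. */
    public final String table;

    private QualifiedTableName(String database, String table) {
        this.database = database;
        this.table = table;
    }

    /**
     * Accepts "table" or "db.table". Anything else (empty parts, more than
     * one dot) is rejected in this simplified sketch.
     */
    public static QualifiedTableName parse(String name) {
        if (name == null || name.isEmpty()) {
            throw new IllegalArgumentException("empty table name");
        }
        int dot = name.indexOf('.');
        if (dot < 0) {
            // Unqualified name: the caller would fall back to the current database.
            return new QualifiedTableName(null, name);
        }
        boolean emptyPart = dot == 0 || dot == name.length() - 1;
        boolean extraDot = name.indexOf('.', dot + 1) >= 0;
        if (emptyPart || extraDot) {
            throw new IllegalArgumentException("malformed table name: " + name);
        }
        return new QualifiedTableName(name.substring(0, dot), name.substring(dot + 1));
    }
}
```

With this sketch, QualifiedTableName.parse("masterdb.table1") yields database "masterdb" and table "table1", which is the interpretation the bug report asks DESCRIBE to use instead of treating "masterdb" as the table.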
[jira] [Updated] (HIVE-3635) allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type
[ https://issues.apache.org/jira/browse/HIVE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Alten-Lorenz updated HIVE-3635: - Attachment: (was: HIVE-3635.patch) allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type --- Key: HIVE-3635 URL: https://issues.apache.org/jira/browse/HIVE-3635 Project: Hive Issue Type: Improvement Components: CLI Affects Versions: 0.9.0 Reporter: Alexander Alten-Lorenz Assignee: Alexander Alten-Lorenz Fix For: 0.10.0 Attachments: HIVE-3635.patch interpret t as true and f as false for boolean types. PostgreSQL exports represent it that way.
[jira] [Updated] (HIVE-3635) allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type
[ https://issues.apache.org/jira/browse/HIVE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Alten-Lorenz updated HIVE-3635: - Status: Patch Available (was: Open)
[jira] [Updated] (HIVE-3635) allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type
[ https://issues.apache.org/jira/browse/HIVE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Alten-Lorenz updated HIVE-3635: - Attachment: HIVE-3635.patch
[jira] [Commented] (HIVE-3635) allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type
[ https://issues.apache.org/jira/browse/HIVE-3635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13500878#comment-13500878 ] Alexander Alten-Lorenz commented on HIVE-3635: Replaced available patch here with the newer one.
Re: Review Request: allow 't', 'T', '1', 'f', 'F', and '0' to be allowable true/false values for the boolean hive type
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/7759/ --- (Updated Nov. 20, 2012, 7:11 a.m.) Review request for hive. Changes --- indentation fixed Description --- interpret t as true and f as false for boolean types. PostgreSQL exports represent it that way This addresses bug HIVE-3635. https://issues.apache.org/jira/browse/HIVE-3635 Diffs (updated) - serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyBoolean.java c741c3a Diff: https://reviews.apache.org/r/7759/diff/ Testing --- Thanks, Alexander Alten-Lorenz
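As a rough illustration of the behaviour HIVE-3635 asks for, here is a minimal sketch of a text-to-boolean parser that accepts the single-character forms alongside the spelled-out ones. It is not the LazyBoolean.java change under review: the class BooleanTextParser is a hypothetical name, and returning null for unrecognized input is an assumption made here to stand in for a SQL NULL.

```java
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical parser (not the actual Hive patch) for boolean column text.
 * Accepts 't', 'T', '1', 'f', 'F', '0' plus case-insensitive "true"/"false".
 */
public final class BooleanTextParser {
    private BooleanTextParser() {}

    /** Returns TRUE or FALSE for a recognized literal, or null otherwise. */
    public static Boolean parse(byte[] bytes, int start, int length) {
        if (length == 1) {
            // Single-character forms, as produced by PostgreSQL exports.
            switch (bytes[start]) {
                case 't': case 'T': case '1': return Boolean.TRUE;
                case 'f': case 'F': case '0': return Boolean.FALSE;
                default: return null;
            }
        }
        String text = new String(bytes, start, length, StandardCharsets.US_ASCII);
        if (text.equalsIgnoreCase("true")) {
            return Boolean.TRUE;
        }
        if (text.equalsIgnoreCase("false")) {
            return Boolean.FALSE;
        }
        return null; // assumption: unrecognized text is treated as NULL
    }

    /** Convenience overload for whole strings. */
    public static Boolean parse(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return parse(bytes, 0, bytes.length);
    }
}
```

Working on a byte range rather than a String mirrors how a lazy deserializer typically sees a field inside a larger row buffer; the String overload is only there to make the sketch easy to try.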
[jira] [Updated] (HIVE-3073) Hive List Bucketing - DML support
[ https://issues.apache.org/jira/browse/HIVE-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3073: --- Status: Patch Available (was: Open) Another patch. thanks Hive List Bucketing - DML support -- Key: HIVE-3073 URL: https://issues.apache.org/jira/browse/HIVE-3073 Project: Hive Issue Type: New Feature Components: SQL Affects Versions: 0.10.0 Reporter: Gang Tim Liu Assignee: Gang Tim Liu Attachments: HIVE-3073.patch.12, HIVE-3073.patch.13, HIVE-3073.patch.15 If a hive table column has skewed keys, query performance on non-skewed key is always impacted. Hive List Bucketing feature will address it: https://cwiki.apache.org/Hive/listbucketing.html This jira issue will track DML change for the feature: 1. single skewed column 2. manual load data
[jira] [Updated] (HIVE-3073) Hive List Bucketing - DML support
[ https://issues.apache.org/jira/browse/HIVE-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gang Tim Liu updated HIVE-3073: --- Attachment: HIVE-3073.patch.15