[jira] [Created] (HIVE-10434) Cancel connection to HS2 when remote Spark driver process has failed [Spark Branch]
Chao Sun created HIVE-10434: --- Summary: Cancel connection to HS2 when remote Spark driver process has failed [Spark Branch] Key: HIVE-10434 URL: https://issues.apache.org/jira/browse/HIVE-10434 Project: Hive Issue Type: Improvement Components: Spark Affects Versions: 1.2.0 Reporter: Chao Sun Assignee: Chao Sun Currently in HoS, SparkClientImpl first launches a remote driver process, and then waits for it to connect back to HS2. However, in certain situations (for instance, a permission issue), the remote process may fail and exit with an error code. In this situation, the HS2 process will still wait for the process to connect, and will wait for a full timeout period before it throws the exception. What makes it worse, the user may need to wait for two timeout periods: one for SparkSetReducerParallelism, and another for the actual Spark job. This could be very annoying. We should cancel the timeout task as soon as we find out that the process has failed, and set the promise as failed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
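The mechanism this JIRA asks for — cancel the pending timeout task and fail the connection promise the moment the launched driver is known to have exited with an error — can be sketched with plain java.util.concurrent types. This is an illustrative sketch only; the names (`DriverWatchdog`, `watch`, `connectPromise`) are hypothetical and not the actual SparkClientImpl/RpcServer API:

```java
import java.util.concurrent.*;

// Illustrative sketch of the proposed fix; all names here are hypothetical,
// not Hive's actual SparkClientImpl/RpcServer API.
public class DriverWatchdog {

    /** Fail the connect promise early if the driver dies before connecting. */
    public static void watch(CompletableFuture<Integer> exitCode,
                             CompletableFuture<String> connectPromise,
                             ScheduledFuture<?> timeoutTask) {
        exitCode.thenAccept(code -> {
            if (code != 0 && !connectPromise.isDone()) {
                timeoutTask.cancel(false);  // stop waiting out the full timeout
                connectPromise.completeExceptionally(
                    new IllegalStateException("driver exited with code " + code));
            }
        });
    }

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService sched = Executors.newSingleThreadScheduledExecutor();
        CompletableFuture<String> connect = new CompletableFuture<>();
        // The long timeout period the user would otherwise have to sit through.
        ScheduledFuture<?> timeout = sched.schedule(
            () -> connect.completeExceptionally(new TimeoutException("no connection")),
            90, TimeUnit.SECONDS);
        CompletableFuture<Integer> exitCode = new CompletableFuture<>();
        watch(exitCode, connect, timeout);

        exitCode.complete(1);  // simulate the driver failing right after launch
        try {
            connect.get();
        } catch (ExecutionException e) {
            System.out.println("failed fast: " + e.getCause().getMessage());
        }
        sched.shutdownNow();
    }
}
```

The key point is that the promise is failed by whichever event happens first — child exit or timeout — so a fast-failing driver no longer costs two full timeout periods.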
Re: Review Request 33422: HIVE-10434 - Cancel connection when remote Spark driver process has failed [Spark Branch]
On April 22, 2015, 12:38 a.m., Marcelo Vanzin wrote: spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java, line 172 https://reviews.apache.org/r/33422/diff/1/?file=938965#file938965line172 This will throw an exception if the child process exits with a non-zero status after the RSC connects back to HS2. I don't think you want that. Oh yes. I forgot that case. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33422/#review81103 --- On April 22, 2015, 12:30 a.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33422/ --- (Updated April 22, 2015, 12:30 a.m.) Review request for hive and Marcelo Vanzin. Bugs: HIVE-10434 https://issues.apache.org/jira/browse/HIVE-10434 Repository: hive-git Description --- This patch cancels the connection from HS2 to the remote process once the latter has failed and exited with an error code, to avoid a potentially long timeout. It adds a new public method cancelClient to the RpcServer class - not sure whether there's an easier way to do this. Diffs - spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 71e432d spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java 32d4c46 Diff: https://reviews.apache.org/r/33422/diff/ Testing --- Tested on my own cluster, and it worked. Thanks, Chao Sun
Re: Review Request 33422: HIVE-10434 - Cancel connection when remote Spark driver process has failed [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33422/ --- (Updated April 22, 2015, 1:25 a.m.) Review request for hive and Marcelo Vanzin. Bugs: HIVE-10434 https://issues.apache.org/jira/browse/HIVE-10434 Repository: hive-git Description --- This patch cancels the connection from HS2 to the remote process once the latter has failed and exited with an error code, to avoid a potentially long timeout. It adds a new public method cancelClient to the RpcServer class - not sure whether there's an easier way to do this. Diffs (updated) - spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 71e432d spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java 32d4c46 Diff: https://reviews.apache.org/r/33422/diff/ Testing --- Tested on my own cluster, and it worked. Thanks, Chao Sun
[jira] [Created] (HIVE-10433) Cancel connection when remote driver process exited with error code [Spark Branch]
Chao Sun created HIVE-10433: --- Summary: Cancel connection when remote driver process exited with error code [Spark Branch] Key: HIVE-10433 URL: https://issues.apache.org/jira/browse/HIVE-10433 Project: Hive Issue Type: Bug Components: spark-branch Reporter: Chao Sun Currently in HoS, after starting a remote process in SparkClientImpl, it will wait for the process to connect back. However, there are cases where the process may fail and exit with an error code, and thus no connection is attempted. In this situation, the HS2 process will still wait for the connection and eventually time itself out. What makes it worse, the user may need to wait for two timeout periods, one for SparkSetReducerParallelism, and another for the actual Spark job. We should cancel the timeout task and mark the promise as failed once we know that the process has failed.
Review Request 33422: HIVE-10434 - Cancel connection when remote Spark driver process has failed [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/33422/ --- Review request for hive and Marcelo Vanzin. Bugs: HIVE-10434 https://issues.apache.org/jira/browse/HIVE-10434 Repository: hive-git Description --- This patch cancels the connection from HS2 to the remote process once the latter has failed and exited with an error code, to avoid a potentially long timeout. It adds a new public method cancelClient to the RpcServer class - not sure whether there's an easier way to do this. Diffs - spark-client/src/main/java/org/apache/hive/spark/client/SparkClientImpl.java 71e432d spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java 32d4c46 Diff: https://reviews.apache.org/r/33422/diff/ Testing --- Tested on my own cluster, and it worked. Thanks, Chao Sun
Re: VOTE: move to git
+1. Looking forward to seeing it get implemented. On Thu, Apr 16, 2015 at 12:11 AM, Owen O'Malley owen.omal...@gmail.com wrote: +1 Thanks for taking the initiative and starting this. .. Owen On Apr 15, 2015, at 23:46, Sergey Shelukhin ser...@apache.org wrote: Hi. We’ve been discussing this some time ago; this time I'd like to start an official vote about moving the Hive project to git from svn. I volunteer to facilitate the move; that seems to be just filing an INFRA jira, and following instructions such as verifying that the new repo is sane. Please vote: +1 move to git 0 don’t care -1 stay on svn +1. -- Best, Chao
Re: [ANNOUNCE] New Hive Committer - Mithun Radhakrishnan
Congrats Mithun! On Tue, Apr 14, 2015 at 3:29 PM, Chris Drome cdr...@yahoo-inc.com.invalid wrote: Congratulations Mithun! On Tuesday, April 14, 2015 2:57 PM, Carl Steinbach c...@apache.org wrote: The Apache Hive PMC has voted to make Mithun Radhakrishnan a committer on the Apache Hive Project. Please join me in congratulating Mithun. Thanks. - Carl -- Best, Chao
Re: Dataset for Hive
Hi Xiaohe, You can try TPC-DS from https://github.com/hortonworks/hive-testbench. It contains a large number of queries with complex joins. Chao On Wed, Apr 1, 2015 at 9:30 PM, xiaohe lan zombiexco...@gmail.com wrote: Hi All, I am new to Hive. Just set up a 5-node Hadoop environment and want to have a try at HiveQL. Is there any dataset I can download to play with HiveQL? The dataset should have several tables so I can write some complex joins. About 100G should be fine. Thanks, Xiaohe
Re: Request for feedback on work intent for non-equijoin support
Hey Lefty, You need to use the ftp protocol, not http. After clicking the link, you'll need to remove "http://" from the address bar. Best, Chao On Wed, Apr 1, 2015 at 9:41 PM, Lefty Leverenz leftylever...@gmail.com wrote: Andrés, I followed that link and got the dreaded 404 Not Found: The requested URI /pub/torres/Hiperfuse/extended_hiperfuse.pdf was not found on this server. -- Lefty On Wed, Apr 1, 2015 at 7:23 PM, andres.qui...@parc.com wrote: Dear Lefty, Thank you very much for pointing that out and for your initial pointers. Here is the missing link: ftp.parc.com/pub/torres/Hiperfuse/extended_hiperfuse.pdf Regards, Andrés -Original Message- From: Lefty Leverenz [mailto:leftylever...@gmail.com] Sent: Wednesday, April 01, 2015 12:48 AM To: dev@hive.apache.org Subject: Re: Request for feedback on work intent for non-equijoin support Hello Andres, the link to your paper is missing: In our preliminary work, which you can find here (pointer to the paper) ... You can find general information about contributing to Hive in the wiki: Resources for Contributors https://cwiki.apache.org/confluence/display/Hive/Home#Home-ResourcesforContributors , How to Contribute https://cwiki.apache.org/confluence/display/Hive/HowToContribute. -- Lefty On Tue, Mar 31, 2015 at 10:42 PM, andres.qui...@parc.com wrote: Dear Hive development community members, I am interested in learning more about the current support for non-equijoins in Hive and/or other Hadoop SQL engines, and in getting feedback about community interest in more extensive support for such a feature. I intend to work on this challenge, assuming people find it compelling, and I intend to contribute results to the community. Where possible, it would be great to receive feedback and engage in collaborations along the way (for a bit more context, see the postscript of this message). 
My initial goal is to support query conditions such as the following: A.x < B.y, A.x in_range [B.y, B.z], distance(A.x, B.y) < D, where A and B are distinct tables/files. It is my understanding that current support for performing non-equijoins like those above is quite limited, and where some forms are supported (like in Cloudera's Impala), this support is based on doing a potentially expensive cross product join. Depending on the data types involved, I believe that joins with these conditions can be made to be tractable (at least on the average) with join algorithms that exploit properties of the data types, possibly with some pre-scanning of the data. I am asking for feedback on the interest/need in the community for this work, as well as any pointers to similar work. In particular, I would appreciate any answers people could give on the following questions: - Is my understanding of the state of the art in Hive and similar tools accurate? Are there groups currently working on similar or related issues, or tools that already accomplish some or all of what I have proposed? - Is there significant value to the community in the support of such a feature? In other words, are the manual workarounds necessary because of the absence of non-equijoins such as these enough of a pain to justify the work I propose? - Being aware that the potential pre-scanning adds to the cost of the join, and that data could still blow up in the worst case, am I missing any other important considerations and tradeoffs for this problem? - What would be a good avenue to contribute this feature to the community (e.g. as a standalone tool on top of Hadoop, or as a Hive extension or plugin)? - What is the best way to get started in working with the community? Thanks for your attention and any info you can provide! Andres Quiroz P.S. If you are interested in some context, and why/how I am proposing to do this work, please read on. 
I am part of a small project team at PARC working on the general problems of data integration and automated ETL. We have proposed a tool called HiperFuse that is designed to accept declarative, high-level queries in order to produce joined (fused) data sets from multiple heterogeneous raw data sources. In our preliminary work, which you can find here (pointer to the paper), we designed the architecture of the tool and obtained some results separately on the problems of automated data cleansing, data type inference, and query planning. One of the planned prototype implementations of HiperFuse relies on Hadoop MR, and because the declarative language we proposed was closely related to SQL, we thought that we could exploit the existing work in Hive and/or other open-source tools for handling the SQL part and layer our work on top of
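As a concrete illustration of the claim above that a range predicate like A.x in_range [B.y, B.z] need not force a cross product: one side can be sorted once (a pre-scan) and each interval then probed with binary search. A minimal, hypothetical sketch — not HiperFuse or Hive code, just the algorithmic idea:

```java
import java.util.*;

// Hypothetical sketch of a range join A.x IN [B.y, B.z] that avoids a full
// cross product: sort the probe values once, binary-search each interval.
public class RangeJoin {

    /** Returns pairs {x, i} such that lo[i] <= x <= hi[i]. */
    public static List<long[]> join(long[] xs, long[] lo, long[] hi) {
        long[] sorted = xs.clone();
        Arrays.sort(sorted);                       // one O(n log n) pre-scan
        List<long[]> out = new ArrayList<>();
        for (int i = 0; i < lo.length; i++) {
            int from = lowerBound(sorted, lo[i]);  // first x >= lo[i]
            for (int j = from; j < sorted.length && sorted[j] <= hi[i]; j++) {
                out.add(new long[]{sorted[j], i});
            }
        }
        return out;
    }

    /** Index of the first element >= key in a sorted array. */
    private static int lowerBound(long[] a, long key) {
        int lo = 0, hi = a.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (a[mid] < key) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    public static void main(String[] args) {
        long[] xs = {1, 5, 7, 12};
        long[] lo = {4, 10};
        long[] hi = {8, 20};
        for (long[] pair : join(xs, lo, hi)) {
            System.out.println(pair[0] + " matches interval " + pair[1]);
        }
    }
}
```

The cost is sort-plus-output rather than |A| × |B|, though as the proposal notes the output itself can still blow up in the worst case when every interval covers most values.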
Re: Review Request 32692: HIVE-10083 SMBJoin fails in case one table is uninitialized
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32692/#review78380 --- Ship it! Ship It! - Chao Sun On March 31, 2015, 5:01 p.m., Na Yang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/32692/ --- (Updated March 31, 2015, 5:01 p.m.) Review request for hive, Brock Noland, Chao Sun, and Xuefu Zhang. Bugs: 10083 https://issues.apache.org/jira/browse/10083 Repository: hive-git Description --- When one table is uninitialized, smallTblFileNames is an empty list, which causes the IndexOutOfBoundsException when smallTblFileNames.get(toAddSmallIndex) is called. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/AbstractBucketJoinProc.java 70c23a6 Diff: https://reviews.apache.org/r/32692/diff/ Testing --- Thanks, Na Yang
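The bug described above is an unguarded index into a possibly empty file list. A minimal illustration of the guard (names are illustrative, not the actual AbstractBucketJoinProc code):

```java
import java.util.*;

// Minimal illustration of the HIVE-10083 failure mode: indexing into an
// empty small-table file list. Names are illustrative, not Hive's.
public class SmallTableGuard {

    /** Returns the chosen file name, or null when the table is uninitialized. */
    public static String pick(List<String> smallTblFileNames, int toAddSmallIndex) {
        // Guard: an uninitialized table yields an empty file list, so
        // get(toAddSmallIndex) would throw IndexOutOfBoundsException.
        if (smallTblFileNames.isEmpty()) {
            return null;  // caller can fall back to the non-bucketed join path
        }
        return smallTblFileNames.get(toAddSmallIndex);
    }

    public static void main(String[] args) {
        System.out.println(pick(Collections.emptyList(), 0));
        System.out.println(pick(Arrays.asList("part-0", "part-1"), 1));
    }
}
```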
Re: [ANNOUNCE] New Hive Committers - Jimmy Xiang, Matt McCline, and Sergio Pena
Congrats everyone! On Mon, Mar 23, 2015 at 11:33 AM, Alexander Pivovarov apivova...@gmail.com wrote: Congrats to Matt, Jimmy and Sergio! On Mon, Mar 23, 2015 at 11:30 AM, Chaoyu Tang ct...@cloudera.com wrote: Congratulations to Jimmy and Sergio! On Mon, Mar 23, 2015 at 2:08 PM, Carl Steinbach c...@apache.org wrote: The Apache Hive PMC has voted to make Jimmy Xiang, Matt McCline, and Sergio Pena committers on the Apache Hive Project. Please join me in congratulating Jimmy, Matt, and Sergio. Thanks. - Carl -- Best, Chao
Re: Review Request 31942: HIVE-9930 fix QueryPlan.makeQueryId time format
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31942/#review76102 --- Ship it! Ship It! - Chao Sun On March 11, 2015, 5:48 p.m., Alexander Pivovarov wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/31942/ --- (Updated March 11, 2015, 5:48 p.m.) Review request for hive, Jason Dere, Thejas Nair, and Xuefu Zhang. Bugs: HIVE-9930 https://issues.apache.org/jira/browse/HIVE-9930 Repository: hive-git Description --- HIVE-9930 fix QueryPlan.makeQueryId time format Diffs - ql/src/java/org/apache/hadoop/hive/ql/QueryPlan.java 8e1e6e2b4f29a2499845df1f565dbb6859b262a8 Diff: https://reviews.apache.org/r/31942/diff/ Testing --- Thanks, Alexander Pivovarov
Re: [VOTE] Apache Hive 1.1.0 Release Candidate 3
+1 1. Build src with hadoop-1 and hadoop-2, tested the generated bin with some DDL/DML queries. 2. Tested the bin with some DDL/DML queries. 3. Verified signature for bin and src, both asc and md5. Chao On Thu, Feb 19, 2015 at 1:55 AM, Szehon Ho sze...@cloudera.com wrote: +1 1. Verified signature for bin and src 2. Built src with hadoop2 3. Ran few queries from beeline with src 4. Ran few queries from beeline with bin 5. Verified no SNAPSHOT deps Thanks Szehon On Wed, Feb 18, 2015 at 10:03 PM, Xuefu Zhang xzh...@cloudera.com wrote: +1 1. downloaded the src tarball and built w/ -Phadoop-1/2 2. verified no binary (jars) in the src tarball On Wed, Feb 18, 2015 at 8:56 PM, Brock Noland br...@cloudera.com wrote: +1 verified sigs, hashes, created tables, ran MR on YARN jobs On Wed, Feb 18, 2015 at 8:54 PM, Brock Noland br...@cloudera.com wrote: Apache Hive 1.1.0 Release Candidate 3 is available here: http://people.apache.org/~brock/apache-hive-1.1.0-rc3/ Maven artifacts are available here: https://repository.apache.org/content/repositories/orgapachehive-1026/ Source tag for RC3 is at: http://svn.apache.org/repos/asf/hive/tags/release-1.1.0-rc3/ My key is located here: https://people.apache.org/keys/group/hive.asc Voting will conclude in 72 hours -- Best, Chao
Re: [VOTE] Apache Hive 1.1.0 Release Candidate 2
I tested apache-hive.1.1.0-bin and I also got the same error as Szehon reported. On Wed, Feb 18, 2015 at 3:48 PM, Brock Noland br...@cloudera.com wrote: Hi, On Wed, Feb 18, 2015 at 2:21 PM, Gopal Vijayaraghavan gop...@apache.org wrote: Hi, From the release branch, I noticed that the hive-exec.jar now contains a copy of guava-14 without any relocations. The hive spark-client pom.xml adds guava as a lib jar instead of shading it in. https://github.com/apache/hive/blob/branch-1.1/spark-client/pom.xml#L111 That seems to be a great approach for guava compat issues across execution engines. Spark itself relocates guava-14 for compatibility with Hive-on-Spark(??). https://issues.apache.org/jira/browse/SPARK-2848 Does any of the same compatibility issues occur when using a hive-exec.jar containing guava-14 on MRv2 (which has guava-11 in the classpath)? Not that I am aware of. I've tested it on top of MRv2 a number of times and I think the unit tests also excercise these code paths. Cheers, Gopal On 2/17/15, 3:14 PM, Brock Noland br...@cloudera.com wrote: Apache Hive 1.1.0 Release Candidate 2 is available here: http://people.apache.org/~brock/apache-hive-1.1.0-rc2/ Maven artifacts are available here: https://repository.apache.org/content/repositories/orgapachehive-1025/ Source tag for RC1 is at: http://svn.apache.org/repos/asf/hive/tags/release-1.1.0-rc2/ My key is located here: https://people.apache.org/keys/group/hive.asc Voting will conclude in 72 hours -- Best, Chao
Re: [VOTE] Apache Hive 1.1.0 Release Candidate 1
- Tried to build the src for both hadoop-1 and hadoop-2, and some simple DDL/DML queries from generated bin. They worked fine. - Tried to run some simple DDL/DML queries from the bin, and worked fine. - Verified PGP signature and MD5 sum for both src and bin. They are OK. +1 On Mon, Feb 16, 2015 at 9:08 PM, Brock Noland br...@cloudera.com wrote: Apache Hive 1.1.0 Release Candidate 0 is available here: http://people.apache.org/~brock/apache-hive-1.1.0-rc1/ Maven artifacts are available here: https://repository.apache.org/content/repositories/orgapachehive-1024/ Source tag for RC1 is at: http://svn.apache.org/repos/asf/hive/tags/release-1.1.0-rc1/ My key is located here: https://people.apache.org/keys/group/hive.asc Voting will conclude in 72 hours -- Best, Chao
Re: Hive 1.0 patch 9481
No, there's no such way. You need to rebuild the project from source after applying the patch. Please check out https://cwiki.apache.org/confluence/display/Hive/HowToContribute for more details. Chao On Thu, Feb 12, 2015 at 4:05 AM, Srinivas Thunga srinivas.thu...@gmail.com wrote: Hi Team, Is there any way that we can apply the patch directly on Hive instead of the source? I am using the hive-0.14 binary, so I need to apply the patch directly in Hive to get the support for selecting columns for insert. *Thanks Regards,* *Srinivas T* -- Best, Chao
Re: Review Request 30388: HIVE-9103 - Support backup task for join related optimization [Spark Branch]
On Jan. 29, 2015, 4:20 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 295 https://reviews.apache.org/r/30388/diff/1/?file=839499#file839499line295 childrenBackupTasks or backChildrenTasks? I suggest more consistent variable/method names. Since the noun is task, I suggest child. Good point. Will change. On Jan. 29, 2015, 4:20 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java, line 110 https://reviews.apache.org/r/30388/diff/1/?file=839504#file839504line110 In Spark branch - For Spark Will change. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30388/#review70150 --- On Jan. 29, 2015, 1:05 a.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30388/ --- (Updated Jan. 29, 2015, 1:05 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-9103 https://issues.apache.org/jira/browse/HIVE-9103 Repository: hive-git Description --- This patch adds a backup task to the map join task. The backup task, which uses common join, will be triggered in case the map join task fails. Note that, no matter how many map joins there are in the SparkTask, we will only generate one backup task. This means that if the original task fails at the very last map join, the whole task will be re-executed. The handling of the backup task is a little different from what MR does, mostly because we convert JOIN to MAPJOIN during the operator plan optimization phase, at which time no task/work exists yet. In the patch, we clone the whole operator tree before the JOIN operator is converted. The cloned operator tree is then processed to generate a separate work tree for a separate backup SparkTask. 
Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 69004dc ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/StageIDsRearranger.java 79c3e02 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java d57ceff ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 9ff47c7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinFactory.java 6e0ac38 ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java b838bff ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 773cfbd ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 3a7477a ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 0e85990 ql/src/test/results/clientpositive/spark/auto_join25.q.out ab01b8a Diff: https://reviews.apache.org/r/30388/diff/ Testing --- auto_join25.q Thanks, Chao Sun
Re: Review Request 30388: HIVE-9103 - Support backup task for join related optimization [Spark Branch]
/auto_sortmerge_join_13.q.out 7eadcd0 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_14.q.out 984db20 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_15.q.out 2acc323 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_2.q.out f05b0cc ql/src/test/results/clientpositive/spark/auto_sortmerge_join_3.q.out c7d23f8 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_4.q.out f5dc2f7 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_5.q.out 26e7957 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_7.q.out a5c0562 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_8.q.out ef13a40 ql/src/test/results/clientpositive/spark/auto_sortmerge_join_9.q.out a2b98fc ql/src/test/results/clientpositive/spark/bucket_map_join_spark1.q.out 6230bef ql/src/test/results/clientpositive/spark/bucket_map_join_spark2.q.out 1a33625 ql/src/test/results/clientpositive/spark/bucket_map_join_spark3.q.out fed923c ql/src/test/results/clientpositive/spark/bucket_map_join_spark4.q.out 8b5e8d4 ql/src/test/results/clientpositive/spark/bucket_map_join_tez1.q.out 1c81d1b ql/src/test/results/clientpositive/spark/bucket_map_join_tez2.q.out 04a934f ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_2.q.out 365306e ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_4.q.out 3846de7 ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_6.q.out 5b559c4 ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_7.q.out cefc6aa ql/src/test/results/clientpositive/spark/bucketsortoptimize_insert_8.q.out ca44d7c ql/src/test/results/clientpositive/spark/cross_product_check_2.q.out dda6c38 ql/src/test/results/clientpositive/spark/identity_project_remove_skip.q.out 7238009 ql/src/test/results/clientpositive/spark/infer_bucket_sort_convert_join.q.out 3d4eb18 ql/src/test/results/clientpositive/spark/join28.q.out f23f662 ql/src/test/results/clientpositive/spark/join29.q.out 0b4284c 
ql/src/test/results/clientpositive/spark/join31.q.out a52a8b6 ql/src/test/results/clientpositive/spark/join32.q.out a9d50b4 ql/src/test/results/clientpositive/spark/join32_lessSize.q.out dac9610 ql/src/test/results/clientpositive/spark/join33.q.out a9d50b4 ql/src/test/results/clientpositive/spark/join_reorder4.q.out 5cc30f7 ql/src/test/results/clientpositive/spark/join_star.q.out 69c2fd7 ql/src/test/results/clientpositive/spark/mapjoin_decimal.q.out b681e5f ql/src/test/results/clientpositive/spark/mapjoin_filter_on_outerjoin.q.out 0271f97 ql/src/test/results/clientpositive/spark/mapjoin_hook.q.out 7aa8ce9 ql/src/test/results/clientpositive/spark/mapjoin_mapjoin.q.out 65a7d06 ql/src/test/results/clientpositive/spark/mapjoin_memcheck.q.out 14f316c ql/src/test/results/clientpositive/spark/mapjoin_subquery.q.out 2d1e7a7 ql/src/test/results/clientpositive/spark/mapjoin_subquery2.q.out a757d0b ql/src/test/results/clientpositive/spark/mapjoin_test_outer.q.out 7143348 ql/src/test/results/clientpositive/spark/multi_join_union.q.out bda569d ql/src/test/results/clientpositive/spark/parquet_join.q.out 390aeb1 ql/src/test/results/clientpositive/spark/reduce_deduplicate_exclude_join.q.out 19ab4c8 ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out bd3a6a1 ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out cb811ed ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.java1.7.out 92a8595 ql/src/test/results/clientpositive/spark/vector_decimal_mapjoin.q.out 5ec95c2 ql/src/test/results/clientpositive/spark/vector_left_outer_join.q.out ca8918a ql/src/test/results/clientpositive/spark/vector_mapjoin_reduce.q.out 02c1fc6 ql/src/test/results/clientpositive/spark/vectorized_mapjoin.q.out 237df98 ql/src/test/results/clientpositive/spark/vectorized_nested_mapjoin.q.out f8e8ba7 Diff: https://reviews.apache.org/r/30388/diff/ Testing --- auto_join25.q Thanks, Chao Sun
Re: Review Request 30388: HIVE-9103 - Support backup task for join related optimization [Spark Branch]
ql/src/test/results/clientpositive/spark/smb_mapjoin_17.q.out bd3a6a1 ql/src/test/results/clientpositive/spark/smb_mapjoin_25.q.out cb811ed ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.java1.7.out 92a8595 ql/src/test/results/clientpositive/spark/vector_decimal_mapjoin.q.out 5ec95c2 ql/src/test/results/clientpositive/spark/vector_left_outer_join.q.out ca8918a ql/src/test/results/clientpositive/spark/vector_mapjoin_reduce.q.out 02c1fc6 ql/src/test/results/clientpositive/spark/vectorized_mapjoin.q.out 237df98 ql/src/test/results/clientpositive/spark/vectorized_nested_mapjoin.q.out f8e8ba7 ql/src/test/results/clientpositive/vector_mapjoin_reduce.q.out 6f11b8c Diff: https://reviews.apache.org/r/30388/diff/ Testing --- auto_join25.q Thanks, Chao Sun
Re: [ANNOUNCE] New Hive PMC Members - Szehon Ho, Vikram Dixit, Jason Dere, Owen O'Malley and Prasanth Jayachandran
Congrats!!! On Wed, Jan 28, 2015 at 1:21 PM, Vaibhav Gumashta vgumas...@hortonworks.com wrote: Congratulations e’one! —Vaibhav On Jan 28, 2015, at 1:20 PM, Xuefu Zhang xzh...@cloudera.commailto: xzh...@cloudera.com wrote: Congratulations to all! --Xuefu On Wed, Jan 28, 2015 at 1:15 PM, Carl Steinbach c...@apache.orgmailto: c...@apache.org wrote: I am pleased to announce that Szehon Ho, Vikram Dixit, Jason Dere, Owen O'Malley and Prasanth Jayachandran have been elected to the Hive Project Management Committee. Please join me in congratulating these new PMC members! Thanks. - Carl -- Best, Chao
Review Request 30388: HIVE-9103 - Support backup task for join related optimization [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/30388/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-9103 https://issues.apache.org/jira/browse/HIVE-9103 Repository: hive-git Description --- This patch adds a backup task to the map join task. The backup task, which uses common join, will be triggered in case the map join task fails. Note that, no matter how many map joins there are in the SparkTask, we will only generate one backup task. This means that if the original task fails at the very last map join, the whole task will be re-executed. The handling of the backup task is a little different from what MR does, mostly because we convert JOIN to MAPJOIN during the operator plan optimization phase, at which time no task/work exists yet. In the patch, we clone the whole operator tree before the JOIN operator is converted. The cloned operator tree is then processed to generate a separate work tree for a separate backup SparkTask. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 69004dc ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/StageIDsRearranger.java 79c3e02 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkJoinOptimizer.java d57ceff ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 9ff47c7 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkSortMergeJoinFactory.java 6e0ac38 ql/src/java/org/apache/hadoop/hive/ql/parse/ParseContext.java b838bff ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java 773cfbd ql/src/java/org/apache/hadoop/hive/ql/parse/spark/OptimizeSparkProcContext.java f7586a4 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 3a7477a ql/src/java/org/apache/hadoop/hive/ql/plan/TableScanDesc.java 0e85990 ql/src/test/results/clientpositive/spark/auto_join25.q.out ab01b8a Diff: https://reviews.apache.org/r/30388/diff/ Testing --- auto_join25.q Thanks, Chao Sun
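The control flow described in this patch — run the optimized map-join task, and on failure re-execute a single common-join backup from the start — can be sketched generically. This is illustrative only; the real wiring lives in SparkMapJoinResolver, and the names below are hypothetical:

```java
import java.util.concurrent.Callable;

// Hedged sketch of the backup-task pattern described in the review request;
// names are illustrative, not Hive's actual task classes.
public class BackupTask {

    /** Run the optimized task; if it throws, re-execute the backup task. */
    public static <T> T runWithBackup(Callable<T> mapJoinTask,
                                      Callable<T> commonJoinBackup) throws Exception {
        try {
            return mapJoinTask.call();
        } catch (Exception primaryFailure) {
            // Only one backup exists per SparkTask, so even a failure at the
            // very last map join re-runs the whole task from the beginning.
            return commonJoinBackup.call();
        }
    }

    public static void main(String[] args) throws Exception {
        String result = runWithBackup(
            () -> { throw new IllegalStateException("map join failed"); },
            () -> "rows from common join");
        System.out.println(result);
    }
}
```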
Re: [VOTE] Apache Hive 1.0 Release Candidate 1
- Tried to build the src for both hadoop-1 and hadoop-2, and some simple queries from generated bin. They worked fine. - Tried to run some simple queries from the bin, and worked fine. - Checked RELEASE_NOTES, NOTICE, README.txt. The copyright in the NOTICE file needs to be updated to 2008-2015. In README.txt it mentions @VERSION@, shouldn't that be a concrete number? - Verified PGP signature and MD5 sum for both src and bin. One minor thing is, for the PGP signature I kept getting a warning saying it's not certified with a trusted signature. Maybe the public key is not updated? Best, Chao On Tue, Jan 27, 2015 at 3:36 PM, Lefty Leverenz leftylever...@gmail.com wrote: Can webhcat-default.xml be updated? Besides 0.11.0 in the defaults for templeton.hive.path and templeton.pig.path (HIVE-8807 https://issues.apache.org/jira/browse/HIVE-8807) there are 0.14.0-SNAPSHOT values for templeton.hive.home and templeton.hcat.home. -- Lefty On Tue, Jan 27, 2015 at 2:28 PM, Vikram Dixit K vikram.di...@gmail.com wrote: Apache Hive 1.0 Release Candidate 1 is available here: http://people.apache.org/~vikram/hive/apache-hive-1.0-rc1/ Maven artifacts are available here: https://repository.apache.org/content/repositories/orgapachehive-1020/ Source tag for RC1 is at: http://svn.apache.org/repos/asf/hive/branches/branch-1.0/ Voting will conclude in 72 hours. Hive PMC Members: Please test and vote. Thanks Vikram. -- Nothing better than when appreciated for hard work. -Mark -- Best, Chao
Review Request 29111: HIVE-9041 - Generate better plan for queries containing both union and multi-insert [Spark Branch]
/results/clientpositive/spark/union33.q.out ql/src/test/results/clientpositive/spark/union4.q.out ql/src/test/results/clientpositive/spark/union5.q.out ql/src/test/results/clientpositive/spark/union6.q.out ql/src/test/results/clientpositive/spark/union7.q.out ql/src/test/results/clientpositive/spark/union8.q.out ql/src/test/results/clientpositive/spark/union9.q.out ql/src/test/results/clientpositive/spark/union_ppr.q.out ql/src/test/results/clientpositive/spark/union_remove_1.q.out ql/src/test/results/clientpositive/spark/union_remove_10.q.out ql/src/test/results/clientpositive/spark/union_remove_11.q.out ql/src/test/results/clientpositive/spark/union_remove_15.q.out ql/src/test/results/clientpositive/spark/union_remove_16.q.out ql/src/test/results/clientpositive/spark/union_remove_17.q.out ql/src/test/results/clientpositive/spark/union_remove_18.q.out ql/src/test/results/clientpositive/spark/union_remove_19.q.out ql/src/test/results/clientpositive/spark/union_remove_2.q.out ql/src/test/results/clientpositive/spark/union_remove_20.q.out ql/src/test/results/clientpositive/spark/union_remove_21.q.out ql/src/test/results/clientpositive/spark/union_remove_24.q.out ql/src/test/results/clientpositive/spark/union_remove_25.q.out ql/src/test/results/clientpositive/spark/union_remove_3.q.out ql/src/test/results/clientpositive/spark/union_remove_4.q.out ql/src/test/results/clientpositive/spark/union_remove_5.q.out ql/src/test/results/clientpositive/spark/union_remove_6.q.out ql/src/test/results/clientpositive/spark/union_remove_7.q.out ql/src/test/results/clientpositive/spark/union_remove_8.q.out ql/src/test/results/clientpositive/spark/union_remove_9.q.out Thanks, Chao Sun
Re: Review Request 29111: HIVE-9041 - Generate better plan for queries containing both union and multi-insert [Spark Branch]
/clientpositive/spark/union30.q.out ql/src/test/results/clientpositive/spark/union33.q.out ql/src/test/results/clientpositive/spark/union4.q.out ql/src/test/results/clientpositive/spark/union5.q.out ql/src/test/results/clientpositive/spark/union6.q.out ql/src/test/results/clientpositive/spark/union7.q.out ql/src/test/results/clientpositive/spark/union8.q.out ql/src/test/results/clientpositive/spark/union9.q.out ql/src/test/results/clientpositive/spark/union_ppr.q.out ql/src/test/results/clientpositive/spark/union_remove_1.q.out ql/src/test/results/clientpositive/spark/union_remove_10.q.out ql/src/test/results/clientpositive/spark/union_remove_11.q.out ql/src/test/results/clientpositive/spark/union_remove_15.q.out ql/src/test/results/clientpositive/spark/union_remove_16.q.out ql/src/test/results/clientpositive/spark/union_remove_17.q.out ql/src/test/results/clientpositive/spark/union_remove_18.q.out ql/src/test/results/clientpositive/spark/union_remove_19.q.out ql/src/test/results/clientpositive/spark/union_remove_2.q.out ql/src/test/results/clientpositive/spark/union_remove_20.q.out ql/src/test/results/clientpositive/spark/union_remove_21.q.out ql/src/test/results/clientpositive/spark/union_remove_24.q.out ql/src/test/results/clientpositive/spark/union_remove_25.q.out ql/src/test/results/clientpositive/spark/union_remove_3.q.out ql/src/test/results/clientpositive/spark/union_remove_4.q.out ql/src/test/results/clientpositive/spark/union_remove_5.q.out ql/src/test/results/clientpositive/spark/union_remove_6.q.out ql/src/test/results/clientpositive/spark/union_remove_7.q.out ql/src/test/results/clientpositive/spark/union_remove_8.q.out ql/src/test/results/clientpositive/spark/union_remove_9.q.out Thanks, Chao Sun
Re: Review Request 29111: HIVE-9041 - Generate better plan for queries containing both union and multi-insert [Spark Branch]
On Dec. 17, 2014, midnight, Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java, line 207 https://reviews.apache.org/r/29111/diff/1/?file=793109#file793109line207 should we remove this variable completely? Yes, I'll remove it completely. On Dec. 17, 2014, midnight, Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java, line 93 https://reviews.apache.org/r/29111/diff/1/?file=793110#file793110line93 Original name seems more meaningful. OK, will fix. On Dec. 17, 2014, midnight, Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java, line 98 https://reviews.apache.org/r/29111/diff/1/?file=793112#file793112line98 Should we keep it? You're right - this was a mistake. On Dec. 17, 2014, midnight, Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java, line 212 https://reviews.apache.org/r/29111/diff/1/?file=793108#file793108line212 An assert here would be good. OK, will add. On Dec. 17, 2014, midnight, Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java, line 133 https://reviews.apache.org/r/29111/diff/1/?file=793108#file793108line133 An assert here would be good. OK, will add. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29111/#review65257 --- On Dec. 16, 2014, 7:02 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/29111/ --- (Updated Dec. 16, 2014, 7:02 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-9041 https://issues.apache.org/jira/browse/HIVE-9041 Repository: hive-git Description --- This JIRA removes UnionWork from Spark plan. UnionWork right now is just a dummy work - in execution, it is translated to IdentityTran, which does nothing. 
The actual union operation is implemented with rdd.union, which happens when a BaseWork has multiple parent BaseWorks. For instance:

MW_1   MW_2
   \   /
    \ /
   RW_1

In this case, MW_1 and MW_2 translate to RDD_1 and RDD_2 respectively, and then we create another RDD_3 which is the result of rdd.union(RDD_1, RDD_2). We then create RDD_4 for RW_1, whose parent is RDD_3. *Changes on GenSparkWork* To remove the UnionWork, most changes are in GenSparkWork. I got rid of a chunk of code that creates UnionWork and links the work with parent works. But I still kept `currentUnionOperators` and `workWithUnionOperators`, since they are needed for removing union operators later. I also changed how `followingWork` is handled. This happens when we have the following operator tree:

TS_0   TS_1
   \   /
  UNION_2
     |
   RS_3
     |
   FS_4

(You can see that I ignored quite a few operators here. They are not required to illustrate the problem.) In this plan, we will reach `RS_3` via two different paths: `TS_0` and `TS_1`. The first time we get to `RS_3`, say via `TS_0`, we break `RS_3` from its child and create a work for the path `TS_0 - UNION_2 - RS_3`. Let's say the work is `MW_1`. We then proceed to `FS_4`, create another ReduceWork `RW_2` for it, and link `RW_2` with `MW_1`. We will then visit `RS_3` a second time, from `TS_1`, and create another work for the path `TS_1 - UNION_2 - RS_3`, say `MW_3`. But the problem is that `RS_3` is already disconnected from `FS_4`. In order to link `MW_3` with `RW_2`, we need to save that information somewhere. This is why we need `leafOpToChildWorkInfo`; it is changed from `leafOpToFollowingWork`. But I found that we also need to save the edge property between `RS_3` and its child in order to connect the two works. I also encountered a case where two BaseWorks may be connected twice; I've explained that in the comments in the source code. *Changes on SparkPlanGenerator* Without UnionWork, SparkPlanGenerator can be a bit cleaner. 
The changes on this class are mostly refactoring. I got rid of some redundant code in `generate(SparkWork)` method, and combined `generate(MapWork)` and `generate(ReduceWork)` into one. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/IdentityTran.java eb758e09888d7864acc9d88c7186ae2de48bc8f7 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 438efabb062112da8fefc1bed9d8bd90ade26c67 ql/src/java/org/apache/hadoop/hive/ql/optimizer
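The multi-parent union translation described in the review above can be sketched abstractly. The function and graph encoding below are invented for illustration (Hive's actual translation works on BaseWork/SparkTran objects); the point is only that a work with several parent works gets the union of their outputs as its input, with no dummy UnionWork in between.

```python
# Hypothetical model of the plan translation: each work produces an
# "RDD" (here, just the list of leaf works whose rows feed it). A work
# with multiple parents takes the union of its parents' outputs, which
# is what rdd.union provides once UnionWork is removed.

def generate_plan(works, parents):
    """works: work names in topological order; parents: name -> parent names."""
    rdds = {}
    for w in works:
        inputs = [rdds[p] for p in parents.get(w, [])]
        if not inputs:                       # a MapWork scanning a table
            rdds[w] = [w]
        elif len(inputs) == 1:               # ordinary parent/child edge
            rdds[w] = list(inputs[0])
        else:                                # union point: no dummy UnionWork needed
            rdds[w] = [src for rdd in inputs for src in rdd]
    return rdds

# The MW_1/MW_2 -> RW_1 example from the description:
plan = generate_plan(["MW_1", "MW_2", "RW_1"], {"RW_1": ["MW_1", "MW_2"]})
```

`plan["RW_1"]` then carries rows from both map works, mirroring `rdd.union(RDD_1, RDD_2)` feeding the reduce work's RDD.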
Re: Review Request 29111: HIVE-9041 - Generate better plan for queries containing both union and multi-insert [Spark Branch]
/union29.q.out ql/src/test/results/clientpositive/spark/union3.q.out ql/src/test/results/clientpositive/spark/union30.q.out ql/src/test/results/clientpositive/spark/union33.q.out ql/src/test/results/clientpositive/spark/union4.q.out ql/src/test/results/clientpositive/spark/union5.q.out ql/src/test/results/clientpositive/spark/union6.q.out ql/src/test/results/clientpositive/spark/union7.q.out ql/src/test/results/clientpositive/spark/union8.q.out ql/src/test/results/clientpositive/spark/union9.q.out ql/src/test/results/clientpositive/spark/union_ppr.q.out ql/src/test/results/clientpositive/spark/union_remove_1.q.out ql/src/test/results/clientpositive/spark/union_remove_10.q.out ql/src/test/results/clientpositive/spark/union_remove_11.q.out ql/src/test/results/clientpositive/spark/union_remove_15.q.out ql/src/test/results/clientpositive/spark/union_remove_16.q.out ql/src/test/results/clientpositive/spark/union_remove_17.q.out ql/src/test/results/clientpositive/spark/union_remove_18.q.out ql/src/test/results/clientpositive/spark/union_remove_19.q.out ql/src/test/results/clientpositive/spark/union_remove_2.q.out ql/src/test/results/clientpositive/spark/union_remove_20.q.out ql/src/test/results/clientpositive/spark/union_remove_21.q.out ql/src/test/results/clientpositive/spark/union_remove_24.q.out ql/src/test/results/clientpositive/spark/union_remove_25.q.out ql/src/test/results/clientpositive/spark/union_remove_3.q.out ql/src/test/results/clientpositive/spark/union_remove_4.q.out ql/src/test/results/clientpositive/spark/union_remove_5.q.out ql/src/test/results/clientpositive/spark/union_remove_6.q.out ql/src/test/results/clientpositive/spark/union_remove_7.q.out ql/src/test/results/clientpositive/spark/union_remove_8.q.out ql/src/test/results/clientpositive/spark/union_remove_9.q.out Thanks, Chao Sun
Re: Review Request 28889: HIVE-8911 - Enable mapjoin hints [Spark Branch]
On Dec. 12, 2014, 7:45 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkMapJoinProcessor.java, line 78 https://reviews.apache.org/r/28889/diff/2/?file=789801#file789801line78 nit: grandParentOps.get(0) is repeated in the next line. nice to have a var for it. Sure. Will fix. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28889/#review64959 --- On Dec. 11, 2014, 10:36 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28889/ --- (Updated Dec. 11, 2014, 10:36 p.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8911 https://issues.apache.org/jira/browse/HIVE-8911 Repository: hive-git Description --- Basically the idea is to reuse as much code as possible from MR. The issue is that in MR's MapJoinProcessor, after the join op is converted to a mapjoin op, all the parent ReduceSinkOperators are removed. However, for our Spark branch, we need to preserve those, because they serve as boundaries between BaseWorks, and SparkReduceSinkMapJoinProc triggers upon them. Initially I tried to move this part of the logic to SparkMapJoinOptimizer, which happens at a later stage. Although this works, I'm worried it may have too much effect on SMB join w/ hints, because we would then have to move that part of the logic to SparkMapJoinOptimizer too. In general, I want to minimize the effect on the code path. This patch makes changes to MapJoinProcessor: I created a separate method, convertMapJoinForSpark, which doesn't remove the ReduceSinkOperators for the small tables. Then, the transform method decides which method to call based on the execution engine. I also had to disable several tests related to SMB join w/ hints. They can be activated once HIVE-8640 is resolved. 
Diffs - data/conf/spark/hive-site.xml 44eac86 itests/src/test/resources/testconfiguration.properties 2348e06 ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java 773c827 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java a8a3d86 ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkMapJoinProcessor.java PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_1.q.out f24ae73 ql/src/test/results/clientpositive/spark/bucket_map_join_2.q.out 33e9e8b ql/src/test/results/clientpositive/spark/bucketmapjoin1.q.out aaa0151 ql/src/test/results/clientpositive/spark/bucketmapjoin10.q.out 9954b77 ql/src/test/results/clientpositive/spark/bucketmapjoin11.q.out ad8f0a5 ql/src/test/results/clientpositive/spark/bucketmapjoin12.q.out aa3e2b6 ql/src/test/results/clientpositive/spark/bucketmapjoin13.q.out 44233f6 ql/src/test/results/clientpositive/spark/bucketmapjoin2.q.out c4702ef ql/src/test/results/clientpositive/spark/bucketmapjoin3.q.out 7c31e05 ql/src/test/results/clientpositive/spark/bucketmapjoin4.q.out a8e892e ql/src/test/results/clientpositive/spark/bucketmapjoin5.q.out 041ba12 ql/src/test/results/clientpositive/spark/bucketmapjoin7.q.out 54c4be3 ql/src/test/results/clientpositive/spark/bucketmapjoin8.q.out da9fe1c ql/src/test/results/clientpositive/spark/bucketmapjoin9.q.out 5a5e3f6 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative.q.out 5ac3f4c ql/src/test/results/clientpositive/spark/bucketmapjoin_negative2.q.out e4ff965 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative3.q.out fce5566 ql/src/test/results/clientpositive/spark/join25.q.out 284c97d ql/src/test/results/clientpositive/spark/join26.q.out e271184 ql/src/test/results/clientpositive/spark/join27.q.out d31f29e ql/src/test/results/clientpositive/spark/join30.q.out 7fbbcfa ql/src/test/results/clientpositive/spark/join36.q.out f1317ea ql/src/test/results/clientpositive/spark/join37.q.out 448e983 ql/src/test/results/clientpositive/spark/join38.q.out 
735d7ea ql/src/test/results/clientpositive/spark/join39.q.out 0734d4b ql/src/test/results/clientpositive/spark/join40.q.out 60ef13d ql/src/test/results/clientpositive/spark/join_map_ppr.q.out 59fdb99 ql/src/test/results/clientpositive/spark/mapjoin1.q.out 80e38b9 ql/src/test/results/clientpositive/spark/mapjoin_distinct.q.out dc7241c ql/src/test/results/clientpositive/spark/mapjoin_filter_on_outerjoin.q.out 3b80437 ql/src/test/results/clientpositive/spark/mapjoin_test_outer.q.out fdf8f24 ql/src/test/results/clientpositive/spark/semijoin.q.out 2b8e04b ql/src/test/results/clientpositive/spark/skewjoin.q.out 56b78be Diff: https://reviews.apache.org/r/28889
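The engine-based dispatch described in the review above can be illustrated with a toy model. The structures and names below are invented for the sketch (Hive's MapJoinProcessor operates on real operator trees); the point is only the branch: the MR conversion drops the small-table ReduceSinks after converting a join, while the Spark conversion keeps them because they mark BaseWork boundaries.

```python
# Toy operators: (kind, is_big_table). "RS" stands in for a
# ReduceSinkOperator parent of the join being converted.

def convert_join(parents, engine):
    """Return the parent list the converted mapjoin keeps, per engine."""
    if engine == "spark":
        # Analogue of convertMapJoinForSpark: keep every ReduceSink so a
        # later rule (SparkReduceSinkMapJoinProc) can still trigger on it.
        return list(parents)
    # MR path: small-table ReduceSinks are removed during conversion.
    return [(kind, big) for kind, big in parents if kind != "RS" or big]

parents = [("RS", True), ("RS", False), ("RS", False)]
spark_parents = convert_join(parents, "spark")  # all three survive
mr_parents = convert_join(parents, "mr")        # only the big-table RS remains
```

The real patch makes the same choice inside `transform`, picking the conversion method from the configured execution engine.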
Re: Review Request 28889: HIVE-8911 - Enable mapjoin hints [Spark Branch]
On Dec. 12, 2014, 7:45 p.m., Xuefu Zhang wrote: Patch looks good. One suggestion: we should be able to change the static methods to non-static, which would further simplify the code. I agree. Let me change it. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28889/#review64959 --- On Dec. 11, 2014, 10:36 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28889/ --- (Updated Dec. 11, 2014, 10:36 p.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8911 https://issues.apache.org/jira/browse/HIVE-8911 Repository: hive-git Description --- Basically the idea is to reuse as much code as possible from MR. The issue is that in MR's MapJoinProcessor, after the join op is converted to a mapjoin op, all the parent ReduceSinkOperators are removed. However, for our Spark branch, we need to preserve those, because they serve as boundaries between BaseWorks, and SparkReduceSinkMapJoinProc triggers upon them. Initially I tried to move this part of the logic to SparkMapJoinOptimizer, which happens at a later stage. Although this works, I'm worried it may have too much effect on SMB join w/ hints, because we would then have to move that part of the logic to SparkMapJoinOptimizer too. In general, I want to minimize the effect on the code path. This patch makes changes to MapJoinProcessor: I created a separate method, convertMapJoinForSpark, which doesn't remove the ReduceSinkOperators for the small tables. Then, the transform method decides which method to call based on the execution engine. I also had to disable several tests related to SMB join w/ hints. They can be activated once HIVE-8640 is resolved. 
Diffs - data/conf/spark/hive-site.xml 44eac86 itests/src/test/resources/testconfiguration.properties 2348e06 ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java 773c827 ql/src/java/org/apache/hadoop/hive/ql/optimizer/Optimizer.java a8a3d86 ql/src/java/org/apache/hadoop/hive/ql/optimizer/SparkMapJoinProcessor.java PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_1.q.out f24ae73 ql/src/test/results/clientpositive/spark/bucket_map_join_2.q.out 33e9e8b ql/src/test/results/clientpositive/spark/bucketmapjoin1.q.out aaa0151 ql/src/test/results/clientpositive/spark/bucketmapjoin10.q.out 9954b77 ql/src/test/results/clientpositive/spark/bucketmapjoin11.q.out ad8f0a5 ql/src/test/results/clientpositive/spark/bucketmapjoin12.q.out aa3e2b6 ql/src/test/results/clientpositive/spark/bucketmapjoin13.q.out 44233f6 ql/src/test/results/clientpositive/spark/bucketmapjoin2.q.out c4702ef ql/src/test/results/clientpositive/spark/bucketmapjoin3.q.out 7c31e05 ql/src/test/results/clientpositive/spark/bucketmapjoin4.q.out a8e892e ql/src/test/results/clientpositive/spark/bucketmapjoin5.q.out 041ba12 ql/src/test/results/clientpositive/spark/bucketmapjoin7.q.out 54c4be3 ql/src/test/results/clientpositive/spark/bucketmapjoin8.q.out da9fe1c ql/src/test/results/clientpositive/spark/bucketmapjoin9.q.out 5a5e3f6 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative.q.out 5ac3f4c ql/src/test/results/clientpositive/spark/bucketmapjoin_negative2.q.out e4ff965 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative3.q.out fce5566 ql/src/test/results/clientpositive/spark/join25.q.out 284c97d ql/src/test/results/clientpositive/spark/join26.q.out e271184 ql/src/test/results/clientpositive/spark/join27.q.out d31f29e ql/src/test/results/clientpositive/spark/join30.q.out 7fbbcfa ql/src/test/results/clientpositive/spark/join36.q.out f1317ea ql/src/test/results/clientpositive/spark/join37.q.out 448e983 ql/src/test/results/clientpositive/spark/join38.q.out 
735d7ea ql/src/test/results/clientpositive/spark/join39.q.out 0734d4b ql/src/test/results/clientpositive/spark/join40.q.out 60ef13d ql/src/test/results/clientpositive/spark/join_map_ppr.q.out 59fdb99 ql/src/test/results/clientpositive/spark/mapjoin1.q.out 80e38b9 ql/src/test/results/clientpositive/spark/mapjoin_distinct.q.out dc7241c ql/src/test/results/clientpositive/spark/mapjoin_filter_on_outerjoin.q.out 3b80437 ql/src/test/results/clientpositive/spark/mapjoin_test_outer.q.out fdf8f24 ql/src/test/results/clientpositive/spark/semijoin.q.out 2b8e04b ql/src/test/results/clientpositive/spark/skewjoin.q.out 56b78be Diff: https://reviews.apache.org/r/28889/diff/ Testing --- bucket_map_join_1.q bucket_map_join_2.q bucketmapjoin1.q bucketmapjoin10.q
Re: Review Request 28889: HIVE-8911 - Enable mapjoin hints [Spark Branch]
mapjoin_hook.q mapjoin_tester.q semijoin.q skewjoin.q table_access_keys_stats.q Thanks, Chao Sun
Re: Review Request 28889: HIVE-8911 - Enable mapjoin hints [Spark Branch]
join36.q join37.q join38.q join39.q join40.q join_empty.q join_filters_overlap.q join_map_ppr.q mapjoin1.q mapjoin_distinct.q mapjoin_filter_on_outerjoin.q mapjoin_hook.q mapjoin_tester.q semijoin.q skewjoin.q table_access_keys_stats.q Thanks, Chao Sun
Re: Review Request 28791: HIVE-9025 join38.q (without map join) produces incorrect result when testing with multiple reducers
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/#review64582 --- http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java https://reviews.apache.org/r/28791/#comment107348 trailing whitespace. - Chao Sun On Dec. 10, 2014, 6:09 p.m., Ted Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/ --- (Updated Dec. 10, 2014, 6:09 p.m.) Review request for hive, Ashutosh Chauhan and Chao Sun. Bugs: HIVE-9025 https://issues.apache.org/jira/browse/HIVE-9025 Repository: hive Description --- HIVE-5771 introduced a bug that when all partition columns are constants, the partition is transformed to be a random dispatch, which is not expected. This patch adds a constant column in the above case to avoid random partitioning. Diffs - http://svn.apache.org/repos/asf/hive/trunk/itests/src/test/resources/testconfiguration.properties 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/constprog_partitioner.q PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/cluster.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog_partitioner.q.out PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_clusterby.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_join4.q.out 1644497 
http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_outer_join5.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/quotedid_basic.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/smb_mapjoin_25.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning_2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vector_decimal_mapjoin.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vectorized_dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/union27.q.out 1644497 Diff: https://reviews.apache.org/r/28791/diff/ Testing --- TestCliDriver passed. Thanks, Ted Xu
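The HIVE-9025 bug described above can be shown with a toy partitioner. Everything here is a stand-in for Hive's actual shuffle machinery: when constant folding removes every partition column, the shuffle key is empty and rows scatter randomly across reducers; keeping one constant column (the fix) restores deterministic placement, so identical rows land together.

```python
import random

def reducer_for(row, part_cols, n_reducers):
    """Pick a reducer the way a hash partitioner would."""
    key = tuple(row[c] for c in part_cols)
    if not key:
        # All partition columns were constant-folded away: random dispatch,
        # so identical rows can land on different reducers (the bug).
        return random.randrange(n_reducers)
    return hash(key) % n_reducers

rows = [{"c": 7} for _ in range(100)]

random.seed(0)  # seeded only so the sketch is repeatable
buggy_targets = {reducer_for(r, [], 8) for r in rows}    # empty key: pre-fix behavior

# With the fix, one constant column is kept, so every row maps to the
# same reducer and the grouped result stays correct:
fixed_targets = {reducer_for(r, ["c"], 8) for r in rows}
```

This is why join38.q produced incorrect results only with multiple reducers: with one reducer the random dispatch is harmless.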
Re: Review Request 28791: HIVE-9025 join38.q (without map join) produces incorrect result when testing with multiple reducers
On Dec. 10, 2014, 6:16 p.m., Chao Sun wrote: I don't think optimize_nullscan and vector_decimal_aggregate are related. Ashutosh can correct me if I'm wrong. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/#review64582 --- On Dec. 10, 2014, 6:09 p.m., Ted Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/ --- (Updated Dec. 10, 2014, 6:09 p.m.) Review request for hive, Ashutosh Chauhan and Chao Sun. Bugs: HIVE-9025 https://issues.apache.org/jira/browse/HIVE-9025 Repository: hive Description --- HIVE-5771 introduced a bug that when all partition columns are constants, the partition is transformed to be a random dispatch, which is not expected. This patch adds a constant column in the above case to avoid random partitioning. Diffs - http://svn.apache.org/repos/asf/hive/trunk/itests/src/test/resources/testconfiguration.properties 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/constprog_partitioner.q PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/cluster.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog_partitioner.q.out PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_clusterby.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_join4.q.out 1644497 
http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_outer_join5.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/quotedid_basic.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/smb_mapjoin_25.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning_2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vector_decimal_mapjoin.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vectorized_dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/union27.q.out 1644497 Diff: https://reviews.apache.org/r/28791/diff/ Testing --- TestCliDriver passed. Thanks, Ted Xu
Re: Review Request 28791: HIVE-9025 join38.q (without map join) produces incorrect result when testing with multiple reducers
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/#review64666 --- Ship it! Ship It! - Chao Sun On Dec. 11, 2014, 1:36 a.m., Ted Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/ --- (Updated Dec. 11, 2014, 1:36 a.m.) Review request for hive, Ashutosh Chauhan and Chao Sun. Bugs: HIVE-9025 https://issues.apache.org/jira/browse/HIVE-9025 Repository: hive Description --- HIVE-5771 introduced a bug that when all partition columns are constants, the partition is transformed to be a random dispatch, which is not expected. This patch adds a constant column in the above case to avoid random partitioning. Diffs - http://svn.apache.org/repos/asf/hive/trunk/itests/src/test/resources/testconfiguration.properties 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/queries/clientpositive/constprog_partitioner.q PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/cluster.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog_partitioner.q.out PRE-CREATION http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_clusterby.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_join4.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_outer_join5.q.out 1644497 
http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/quotedid_basic.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/smb_mapjoin_25.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/dynamic_partition_pruning_2.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/join_nullsafe.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vector_decimal_mapjoin.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/tez/vectorized_dynamic_partition_pruning.q.out 1644497 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/union27.q.out 1644497 Diff: https://reviews.apache.org/r/28791/diff/ Testing --- TestCliDriver passed. Thanks, Ted Xu
Review Request 28889: HIVE-8911 - Enable mapjoin hints [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28889/ --- Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8911 https://issues.apache.org/jira/browse/HIVE-8911 Repository: hive-git Description --- Basically the idea is to reuse as much code as possible from MR. The issue is that in MR's MapJoinProcessor, after the join op is converted to a mapjoin op, all the parent ReduceSinkOperators are removed. However, for our Spark branch, we need to preserve those, because they serve as boundaries between BaseWorks, and SparkReduceSinkMapJoinProc triggers upon them. Initially I tried to move this part of the logic to SparkMapJoinOptimizer, which happens at a later stage. Although this works, I'm worried it may have too much effect on SMB join w/ hints, because we would then have to move that part of the logic to SparkMapJoinOptimizer too. In general, I want to minimize the effect on the code path. This patch makes changes to MapJoinProcessor: I created a separate method, convertMapJoinForSpark, which doesn't remove the ReduceSinkOperators for the small tables. Then, the transform method decides which method to call based on the execution engine. I also had to disable several tests related to SMB join w/ hints. They can be activated once HIVE-8640 is resolved. 
Diffs - data/conf/spark/hive-site.xml 44eac86 itests/src/test/resources/testconfiguration.properties d6f8267 ql/src/java/org/apache/hadoop/hive/ql/optimizer/MapJoinProcessor.java 773c827 ql/src/test/results/clientpositive/spark/bucket_map_join_1.q.out f24ae73 ql/src/test/results/clientpositive/spark/bucket_map_join_2.q.out 33e9e8b ql/src/test/results/clientpositive/spark/bucketmapjoin1.q.out aaa0151 ql/src/test/results/clientpositive/spark/bucketmapjoin10.q.out 9954b77 ql/src/test/results/clientpositive/spark/bucketmapjoin11.q.out ad8f0a5 ql/src/test/results/clientpositive/spark/bucketmapjoin12.q.out aa3e2b6 ql/src/test/results/clientpositive/spark/bucketmapjoin13.q.out 44233f6 ql/src/test/results/clientpositive/spark/bucketmapjoin2.q.out c4702ef ql/src/test/results/clientpositive/spark/bucketmapjoin3.q.out 7c31e05 ql/src/test/results/clientpositive/spark/bucketmapjoin4.q.out a8e892e ql/src/test/results/clientpositive/spark/bucketmapjoin5.q.out 041ba12 ql/src/test/results/clientpositive/spark/bucketmapjoin7.q.out 54c4be3 ql/src/test/results/clientpositive/spark/bucketmapjoin8.q.out da9fe1c ql/src/test/results/clientpositive/spark/bucketmapjoin9.q.out 5a5e3f6 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative.q.out 5ac3f4c ql/src/test/results/clientpositive/spark/bucketmapjoin_negative2.q.out e4ff965 ql/src/test/results/clientpositive/spark/bucketmapjoin_negative3.q.out fce5566 ql/src/test/results/clientpositive/spark/join25.q.out 284c97d ql/src/test/results/clientpositive/spark/join26.q.out e271184 ql/src/test/results/clientpositive/spark/join27.q.out d31f29e ql/src/test/results/clientpositive/spark/join30.q.out 7fbbcfa ql/src/test/results/clientpositive/spark/join36.q.out f1317ea ql/src/test/results/clientpositive/spark/join37.q.out 448e983 ql/src/test/results/clientpositive/spark/join38.q.out 735d7ea ql/src/test/results/clientpositive/spark/join39.q.out 0734d4b ql/src/test/results/clientpositive/spark/join40.q.out 60ef13d 
ql/src/test/results/clientpositive/spark/join_map_ppr.q.out 59fdb99 ql/src/test/results/clientpositive/spark/mapjoin1.q.out 80e38b9 ql/src/test/results/clientpositive/spark/mapjoin_distinct.q.out dc7241c ql/src/test/results/clientpositive/spark/mapjoin_filter_on_outerjoin.q.out 3b80437 ql/src/test/results/clientpositive/spark/mapjoin_test_outer.q.out fdf8f24 ql/src/test/results/clientpositive/spark/semijoin.q.out 2b8e04b ql/src/test/results/clientpositive/spark/skewjoin.q.out 56b78be Diff: https://reviews.apache.org/r/28889/diff/ Testing --- bucket_map_join_1.q bucket_map_join_2.q bucketmapjoin1.q bucketmapjoin10.q bucketmapjoin11.q bucketmapjoin12.q bucketmapjoin13.q bucketmapjoin2.q bucketmapjoin3.q bucketmapjoin4.q bucketmapjoin5.q bucketmapjoin7.q bucketmapjoin8.q bucketmapjoin9.q bucketmapjoin_negative.q bucketmapjoin_negative2.q column_access_stats.q join25.q join26.q join27.q join30.q join36.q join37.q join38.q join39.q join40.q join_empty.q join_filters_overlap.q join_map_ppr.q mapjoin1.q mapjoin_distinct.q mapjoin_filter_on_outerjoin.q mapjoin_hook.q mapjoin_tester.q semijoin.q skewjoin.q table_access_keys_stats.q Thanks, Chao Sun
Re: Review Request 28791: HIVE-9025 join38.q (without map join) produces incorrect result when testing with multiple reducers
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/#review64242 --- Hi, [~tedxu], thanks for the quick work! I just have one minor question: do you think it would be good to have a new test case for this? Maybe one just like join38.q, but using a common join, with the number of reducers set to a value greater than one? - Chao Sun On Dec. 7, 2014, 9:30 a.m., Ted Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/ --- (Updated Dec. 7, 2014, 9:30 a.m.) Review request for hive, Ashutosh Chauhan and Chao Sun. Bugs: HIVE-9025 https://issues.apache.org/jira/browse/HIVE-9025 Repository: hive Description --- HIVE-5771 introduced a bug: when all partition columns are constants, the partitioning is turned into a random dispatch, which is not expected. This patch adds a constant column in the above case to avoid random partitioning. Diffs - http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/cluster.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog2.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/join_nullsafe.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_clusterby.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_join4.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_outer_join5.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/quotedid_basic.q.out 1643530 
http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/smb_mapjoin_25.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/union27.q.out 1643530 Diff: https://reviews.apache.org/r/28791/diff/ Testing --- TestCliDriver passed. Thanks, Ted Xu
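The fix described above can be sketched as follows. This is only an illustrative sketch with hypothetical names, not the actual ConstantPropagateProcFactory logic: if constant folding would remove every partition column of a reduce sink, rows end up dispatched to random reducers, so one constant column is kept back to make the partitioning deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the HIVE-9025 idea; names are illustrative.
public class ConstantPartitionSketch {

  static List<String> foldPartitionCols(List<String> partCols, Set<String> constants) {
    List<String> remaining = new ArrayList<>();
    for (String col : partCols) {
      if (!constants.contains(col)) {
        remaining.add(col); // non-constant partition columns are kept
      }
    }
    if (remaining.isEmpty() && !partCols.isEmpty()) {
      // All partition columns were constants: keep one constant column so
      // every row hashes to the same reducer instead of a random one.
      remaining.add(partCols.get(0));
    }
    return remaining;
  }

  public static void main(String[] args) {
    List<String> folded = foldPartitionCols(
        Arrays.asList("key", "ds"), new HashSet<>(Arrays.asList("key", "ds")));
    System.out.println(folded); // prints "[key]"
  }
}
```

With more than one reducer, random dispatch of identical keys is what made join38.q without map join produce incorrect results, which is why the suggested regression test forces the reducer count above one.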
Re: Review Request 28791: HIVE-9025 join38.q (without map join) produces incorrect result when testing with multiple reducers
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/#review64244 --- Also, some golden files for tez branch need to be updated. - Chao Sun On Dec. 7, 2014, 9:30 a.m., Ted Xu wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28791/ --- (Updated Dec. 7, 2014, 9:30 a.m.) Review request for hive, Ashutosh Chauhan and Chao Sun. Bugs: HIVE-9025 https://issues.apache.org/jira/browse/HIVE-9025 Repository: hive Description --- HIVE-5771 introduced a bug that when all partition columns are constants, the partition is transformed to be a random dispatch, which is not expected. This patch adds a constant column in the above case to avoid random partitioning. Diffs - http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ConstantPropagateProcFactory.java 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/cluster.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/constprog2.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/join_nullsafe.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd2.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_clusterby.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_join4.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/ppd_outer_join5.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/quotedid_basic.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/smb_mapjoin_25.q.out 1643530 http://svn.apache.org/repos/asf/hive/trunk/ql/src/test/results/clientpositive/union27.q.out 1643530 Diff: https://reviews.apache.org/r/28791/diff/ Testing --- 
TestCliDriver passed. Thanks, Ted Xu
Re: Review Request 28727: HIVE-8638 Implement bucket map join optimization [Spark Branch]
On Dec. 5, 2014, 2:27 a.m., Chao Sun wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java, line 111 https://reviews.apache.org/r/28727/diff/1/?file=782895#file782895line111 why check twice here? Jimmy Xiang wrote: estimatedBuckets could = 0 too. Sorry, you are right. My mistake. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/#review63952 --- On Dec. 4, 2014, 11:38 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/ --- (Updated Dec. 4, 2014, 11:38 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-8638 https://issues.apache.org/jira/browse/HIVE-8638 Repository: hive-git Description --- Patch v3 that works when bucket number matches Diffs - itests/src/test/resources/testconfiguration.properties 09c667e ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java cfc1501 ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java 2f9e55a ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 4054173 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkBucketJoinProcCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 8b78123 ql/src/test/queries/clientpositive/bucket_map_join_spark1.q PRE-CREATION ql/src/test/queries/clientpositive/bucket_map_join_spark2.q PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark2.q.out PRE-CREATION Diff: https://reviews.apache.org/r/28727/diff/ Testing --- Thanks, Jimmy Xiang
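The point Jimmy makes above can be captured in a small sketch. This is a hypothetical guard, not the actual SparkMapJoinOptimizer code: both the declared bucket count and the estimated bucket count can independently be non-positive, so both must be checked before converting to a bucket map join.

```java
// Hypothetical guard mirroring the review exchange; not the real optimizer code.
public class BucketJoinGuardSketch {

  static boolean canConvert(int numBuckets, int estimatedBuckets) {
    // Either count can legitimately be 0 (or negative when unknown),
    // which is why the optimizer appears to "check twice".
    return numBuckets > 0 && estimatedBuckets > 0;
  }

  public static void main(String[] args) {
    System.out.println(canConvert(4, 0)); // prints "false"
    System.out.println(canConvert(4, 4)); // prints "true"
  }
}
```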
Re: Review Request 28727: HIVE-8638 Implement bucket map join optimization [Spark Branch]
On Dec. 5, 2014, 2:27 a.m., Chao Sun wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 96 https://reviews.apache.org/r/28727/diff/1/?file=782893#file782893line96 I'm wondering if we can get rid of containsOp, and replace with this one. Jimmy Xiang wrote: containsOp is used in many places. It's better to keep it. I changed getOp a little so that getOp and containsOp share the same logic. Sounds good. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/#review63952 --- On Dec. 4, 2014, 11:38 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/ --- (Updated Dec. 4, 2014, 11:38 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-8638 https://issues.apache.org/jira/browse/HIVE-8638 Repository: hive-git Description --- Patch v3 that works when bucket number matches Diffs - itests/src/test/resources/testconfiguration.properties 09c667e ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java cfc1501 ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java 2f9e55a ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 4054173 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkBucketJoinProcCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 8b78123 ql/src/test/queries/clientpositive/bucket_map_join_spark1.q PRE-CREATION ql/src/test/queries/clientpositive/bucket_map_join_spark2.q PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark2.q.out PRE-CREATION Diff: https://reviews.apache.org/r/28727/diff/ Testing --- 
Thanks, Jimmy Xiang
Re: Review Request 28727: HIVE-8638 Implement bucket map join optimization [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/#review63952 --- ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java https://reviews.apache.org/r/28727/#comment106303 I'm wondering if we can get rid of containsOp, and replace with this one. ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java https://reviews.apache.org/r/28727/#comment106304 trailing whitespace. ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java https://reviews.apache.org/r/28727/#comment106305 should have space between and parentOp. ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java https://reviews.apache.org/r/28727/#comment106306 why check twice here? - Chao Sun On Dec. 4, 2014, 11:38 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28727/ --- (Updated Dec. 4, 2014, 11:38 p.m.) Review request for hive and Xuefu Zhang. 
Bugs: HIVE-8638 https://issues.apache.org/jira/browse/HIVE-8638 Repository: hive-git Description --- Patch v3 that works when bucket number matches Diffs - itests/src/test/resources/testconfiguration.properties 09c667e ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java cfc1501 ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java 2f9e55a ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 4054173 ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkBucketJoinProcCtx.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 8b78123 ql/src/test/queries/clientpositive/bucket_map_join_spark1.q PRE-CREATION ql/src/test/queries/clientpositive/bucket_map_join_spark2.q PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/bucket_map_join_spark2.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark1.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/bucket_map_join_spark2.q.out PRE-CREATION Diff: https://reviews.apache.org/r/28727/diff/ Testing --- Thanks, Jimmy Xiang
Review Request 28464: HIVE-8934 - Investigate test failure on bucketmapjoin10.q and bucketmapjoin11.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28464/ --- Review request for hive, Jimmy Xiang, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8934 https://issues.apache.org/jira/browse/HIVE-8934 Repository: hive-git Description --- With MapJoin enabled, these two tests will generate incorrect results. This seems to be related to the HiveInputFormat that these two are using. We need to investigate the issue. Diffs - itests/src/test/resources/testconfiguration.properties 38380fb ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinEagerRowContainer.java 65bb1b7 ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinTableContainerSerDe.java eddbf18 ql/src/test/results/clientpositive/spark/bucketmapjoin10.q.out 4188ad8 ql/src/test/results/clientpositive/spark/bucketmapjoin11.q.out e4a98ba Diff: https://reviews.apache.org/r/28464/diff/ Testing --- bucketmapjoin10.q and bucketmapjoin11.q now return correct results. Thanks, Chao Sun
Review Request 28299: HIVE-8921 - Investigate test failure on auto_join2.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28299/ --- Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8921 https://issues.apache.org/jira/browse/HIVE-8921 Repository: hive-git Description --- Running this test, sometimes it produces the correct result, sometimes it just produces NULL. Looks like there's some concurrency issue. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 96481f1 ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 6fbdcd2 Diff: https://reviews.apache.org/r/28299/diff/ Testing --- Thanks, Chao Sun
Review Request 28307: HIVE-8908 - Investigate test failure on join34.q [Spark Branch]
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.dest_j1 Local Work: Map Reduce Local Work Union 2 Vertex: Union 2 Stage: Stage-2 Dependency Collection Stage: Stage-0 Move Operator tables: replace: true table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe name: default.dest_j1 Stage: Stage-3 Stats-Aggr Operator Stage: Stage-5 Spark DagName: chao_20141118150101_a47a2d7b-e750-4764-be66-5ba95ebbe433:5 Vertices: Map 4 Map Operator Tree: TableScan alias: x Statistics: Num rows: 1 Data size: 216 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 1 Data size: 216 Basic stats: COMPLETE Column stats: NONE Spark HashTable Sink Operator condition expressions: 0 {_col1} 1 {value} keys: 0 _col0 (type: string) 1 key (type: string) Reduce Output Operator key expressions: key (type: string) sort order: + Map-reduce partition columns: key (type: string) Statistics: Num rows: 1 Data size: 216 Basic stats: COMPLETE Column stats: NONE value expressions: value (type: string) Local Work: Map Reduce Local Work Time taken: 0.127 seconds, Fetched: 156 row(s) Note that Stage-4 and Stage-5 are identical. Also, in Stage-4 there's a parallel RS operator with the HTS operator, which is strange. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java 4bfc26f Diff: https://reviews.apache.org/r/28307/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 28299: HIVE-8921 - Investigate test failure on auto_join2.q [Spark Branch]
On Nov. 20, 2014, 9:56 p.m., Jimmy Xiang wrote: ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java, line 67 https://reviews.apache.org/r/28299/diff/1/?file=771588#file771588line67 Should be the other way around, i.e., the default constructor should call this one: this(new LinkedHashMap...) OK, will do it this way. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28299/#review62437 --- On Nov. 20, 2014, 9:43 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28299/ --- (Updated Nov. 20, 2014, 9:43 p.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8921 https://issues.apache.org/jira/browse/HIVE-8921 Repository: hive-git Description --- Running this test, sometimes it produces the correct result, sometimes it just produces NULL. Looks like there's some concurrency issue. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 96481f1 ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 6fbdcd2 Diff: https://reviews.apache.org/r/28299/diff/ Testing --- Thanks, Chao Sun
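The constructor-chaining pattern suggested in the review above can be sketched as follows. Field names here are illustrative, not the real MapredLocalWork members: the default constructor delegates to the parameterized one, so initialization logic lives in exactly one place.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of the suggested pattern; not the actual MapredLocalWork.
public class LocalWorkSketch {
  private final Map<String, String> aliasToWork;

  public LocalWorkSketch() {
    // Delegate to the parameterized constructor instead of
    // duplicating its initialization logic.
    this(new LinkedHashMap<String, String>());
  }

  public LocalWorkSketch(Map<String, String> aliasToWork) {
    this.aliasToWork = aliasToWork;
  }

  public Map<String, String> getAliasToWork() {
    return aliasToWork;
  }

  public static void main(String[] args) {
    System.out.println(new LocalWorkSketch().getAliasToWork().isEmpty()); // prints "true"
  }
}
```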
Re: Review Request 28299: HIVE-8921 - Investigate test failure on auto_join2.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28299/ --- (Updated Nov. 21, 2014, 1:28 a.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8921 https://issues.apache.org/jira/browse/HIVE-8921 Repository: hive-git Description --- Running this test, sometimes it produces the correct result, sometimes it just produces NULL. Looks like there's some concurrency issue. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 96481f1 ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 6fbdcd2 Diff: https://reviews.apache.org/r/28299/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 28145: HIVE-8883 - Investigate test failures on auto_join30.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/ --- (Updated Nov. 19, 2014, 11:35 p.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8883 https://issues.apache.org/jira/browse/HIVE-8883 Repository: hive-git Description --- This test fails with the following stack trace: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-11-14 17:05:09,206 ERROR [Executor 
task launch worker-4]: spark.SparkReduceRecordHandler (SparkReduceRecordHandler.java:processRow(285)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{reducesinkkey0:val_0},value:{_col0:0}} at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:328) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) ... 17 more auto_join27.q and auto_join31.q seem to fail with the same error. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 2895d80 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 141ae6f Diff: https://reviews.apache.org/r/28145/diff/ Testing --- Tested with auto_join30.q, auto_join31.q, and auto_join27.q. They now generate correct results. Thanks, Chao Sun
Re: Review Request 28145: HIVE-8883 - Investigate test failures on auto_join30.q [Spark Branch]
On Nov. 19, 2014, 11:50 p.m., Jimmy Xiang wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java, line 74 https://reviews.apache.org/r/28145/diff/3/?file=770558#file770558line74 We don't need this any more? I was thinking about cleaning it and then restoring the code in the non-staged map join JIRA. But, after talking with Szehon, I decided to keep it anyway. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/#review62285 --- On Nov. 19, 2014, 11:35 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/ --- (Updated Nov. 19, 2014, 11:35 p.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8883 https://issues.apache.org/jira/browse/HIVE-8883 Repository: hive-git Description --- This test fails with the following stack trace: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-11-14 17:05:09,206 ERROR [Executor task launch worker-4]: spark.SparkReduceRecordHandler (SparkReduceRecordHandler.java:processRow(285)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{reducesinkkey0:val_0},value:{_col0:0}} at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:328) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84
Re: Review Request 28145: HIVE-8883 - Investigate test failures on auto_join30.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/ --- (Updated Nov. 19, 2014, 11:57 p.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8883 https://issues.apache.org/jira/browse/HIVE-8883 Repository: hive-git Description --- This test fails with the following stack trace: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-11-14 17:05:09,206 ERROR [Executor 
task launch worker-4]: spark.SparkReduceRecordHandler (SparkReduceRecordHandler.java:processRow(285)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{reducesinkkey0:val_0},value:{_col0:0}} at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:328) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) ... 17 more auto_join27.q and auto_join31.q seem to fail with the same error. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 2895d80 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 141ae6f Diff: https://reviews.apache.org/r/28145/diff/ Testing --- Tested with auto_join30.q, auto_join31.q, and auto_join27.q. They now generate correct results. Thanks, Chao Sun
Review Request 28145: HIVE-8883 - Investigate test failures on auto_join30.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/ --- Review request for hive, Jimmy Xiang and Szehon Ho. Bugs: HIVE-8883 https://issues.apache.org/jira/browse/HIVE-8883 Repository: hive-git Description --- This test fails with the following stack trace: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-11-14 17:05:09,206 ERROR [Executor task launch worker-4]: 
spark.SparkReduceRecordHandler (SparkReduceRecordHandler.java:processRow(285)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{reducesinkkey0:val_0},value:{_col0:0}} at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:328) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at 
org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) ... 17 more auto_join27.q and auto_join31.q seem to fail with the same error. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 2895d80 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 141ae6f Diff: https://reviews.apache.org/r/28145/diff/ Testing --- Tested with auto_join30.q, auto_join31.q, and auto_join27.q. They now generate correct results. Thanks, Chao Sun
Re: Review Request 28145: HIVE-8883 - Investigate test failures on auto_join30.q [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28145/ --- (Updated Nov. 18, 2014, 2:51 a.m.) Review request for hive, Jimmy Xiang and Szehon Ho. Changes --- Last patch failed because of upstream change on HashTableLoader#load(). Now fixed. Bugs: HIVE-8883 https://issues.apache.org/jira/browse/HIVE-8883 Repository: hive-git Description --- This test fails with the following stack trace: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) 2014-11-14 17:05:09,206 ERROR [Executor task launch worker-4]: spark.SparkReduceRecordHandler (SparkReduceRecordHandler.java:processRow(285)) - org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {key:{reducesinkkey0:val_0},value:{_col0:0}} at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:328) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processRow(SparkReduceRecordHandler.java:276) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:48) at org.apache.hadoop.hive.ql.exec.spark.HiveReduceFunctionResultList.processNextRecord(HiveReduceFunctionResultList.java:28) at org.apache.hadoop.hive.ql.exec.spark.HiveBaseFunctionResultList$ResultIterator.hasNext(HiveBaseFunctionResultList.java:96) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:214) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:65) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:56) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:186) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unexpected exception: null at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:318) at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:815) 
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:84) at org.apache.hadoop.hive.ql.exec.spark.SparkReduceRecordHandler.processKeyValues(SparkReduceRecordHandler.java:319) ... 14 more Caused by: java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.MapJoinOperator.processOp(MapJoinOperator.java:257) ... 17 more auto_join27.q and auto_join31.q seem to fail with the same error. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java 2895d80 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 141ae6f Diff: https://reviews.apache.org/r/28145/diff/ Testing --- Tested with auto_join30.q, auto_join31.q, and auto_join27.q. They now generate correct results. Thanks, Chao Sun
Review Request 28045: HIVE-8865 - Needs to set hashTableMemoryUsage for MapJoinDesc [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28045/ --- Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8865 https://issues.apache.org/jira/browse/HIVE-8865 Repository: hive-git Description --- If this part is not done, hashTableMemoryUsage is always 0.0, which will cause MapJoinMemoryExhaustionException. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java 83d54bd Diff: https://reviews.apache.org/r/28045/diff/ Testing --- Thanks, Chao Sun
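A minimal, self-contained sketch of why an unset hashTableMemoryUsage of 0.0 always trips the exhaustion check: the check compares the fraction of heap in use against the configured limit, and a limit of 0.0 is exceeded by any nonzero usage. This is an illustrative stand-in, not Hive's actual MapJoinMemoryExhaustionHandler; the class and method names below are hypothetical.

```java
// Illustrative stand-in for the memory-exhaustion check described above.
// Hive's real handler differs; names here are hypothetical.
public class MemoryUsageSketch {
    // Fail when the used fraction of the heap exceeds the configured limit.
    public static boolean exceedsLimit(long usedBytes, long maxBytes, double maxFraction) {
        return (double) usedBytes / maxBytes > maxFraction;
    }

    public static void main(String[] args) {
        long max = 1_073_741_824L; // 1 GB heap
        // Unset limit (0.0): even one byte in use counts as exhaustion.
        System.out.println(exceedsLimit(1, max, 0.0)); // true
        // A properly populated limit (e.g. 0.9) behaves sensibly.
        System.out.println(exceedsLimit(1, max, 0.9)); // false
    }
}
```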
Review Request 28051: HIVE-8860 - Populate ExecMapperContext in SparkReduceRecordHandler [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28051/ --- Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8860 https://issues.apache.org/jira/browse/HIVE-8860 Repository: hive-git Description --- Currently, only SparkMapRecordHandler populates this information. However, since in the Spark branch a HashTableSinkOperator could also appear in a ReduceWork, and it needs an ExecMapperContext to get a MapredLocalWork, we need to do the same thing in SparkReduceRecordHandler as well. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkReduceRecordHandler.java 21ac7ab Diff: https://reviews.apache.org/r/28051/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 28064: HIVE-8844 Choose a persisent policy for RDD caching [Spark Branch]
On Nov. 15, 2014, 2:34 a.m., Szehon Ho wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java, line 39 https://reviews.apache.org/r/28064/diff/1/?file=764642#file764642line39 OK, does Spark handle that as a no-op if we pass NONE in? If that's the case, then it's maybe cleaner for our code. I'm a bit confused about what NONE means. If we don't want to pass NONE due to side effects, can we just change the HadoopRDD call to: storageHandler.equals(StorageHandler.NONE) ? hadoopRdd : ... Then the logic is centralized there. Jimmy Xiang wrote: Sure. Will fix it as suggested. Thanks. persist() also registers the RDD for GC cleanup, but there seems to be no extra cost besides that. Either way is fine with me. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28064/#review61616 --- On Nov. 15, 2014, 12:32 a.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28064/ --- (Updated Nov. 15, 2014, 12:32 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-8844 https://issues.apache.org/jira/browse/HIVE-8844 Repository: hive-git Description --- Changed the Spark cache policy to be configurable, with memory+disk as the default. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 79baea7 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java 8565ba0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 11f4236 Diff: https://reviews.apache.org/r/28064/diff/ Testing --- Thanks, Jimmy Xiang
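The centralized-NONE idea discussed above can be sketched without Spark: skip persist() entirely when no caching is requested, so callers never pay persist()'s bookkeeping. The Rdd and StorageLevel types below are minimal hypothetical stand-ins, not Spark's actual API.

```java
// Sketch of centralizing the NONE check (hypothetical types, not Spark's API).
public class CacheSketch {
    public enum StorageLevel { NONE, MEMORY_ONLY, MEMORY_AND_DISK }

    // Minimal stand-in for an RDD that records whether it was persisted.
    public static final class Rdd {
        public StorageLevel persistedAt = StorageLevel.NONE;
        public Rdd persist(StorageLevel level) { persistedAt = level; return this; }
    }

    // Centralized check: NONE is a pure no-op, so persist() (and any side
    // effects such as registration for GC cleanup) is never invoked.
    public static Rdd cache(Rdd rdd, StorageLevel level) {
        return level == StorageLevel.NONE ? rdd : rdd.persist(level);
    }

    public static void main(String[] args) {
        Rdd rdd = new Rdd();
        System.out.println(cache(rdd, StorageLevel.NONE) == rdd);       // true: untouched
        System.out.println(cache(rdd, StorageLevel.MEMORY_AND_DISK).persistedAt);
    }
}
```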
Review Request 28017: HIVE-8776 - Generate MapredLocalWork in SparkMapJoinResolver [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28017/ --- Review request for hive, Jimmy Xiang, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8776 https://issues.apache.org/jira/browse/HIVE-8776 Repository: hive-git Description --- In SparkMapJoinResolver, we need to populate MapredLocalWork for all MapWorks with MapJoinOperator. It is needed later in HashTableLoader, for example, to retrieve small hash tables and direct fetch tables. We need to set up information, such as aliasToWork, aliasToFetchWork, directFetchOp, inputFileChangeSensitive, tmpPath, etc., for the new local works. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java d30ae51 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 4b9a6cb ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 785e4a0 Diff: https://reviews.apache.org/r/28017/diff/ Testing --- Thanks, Chao Sun
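The bookkeeping described above can be sketched with plain collections. The class below is a hypothetical stand-in, not Hive's MapredLocalWork: the real local work holds operator trees and FetchWork objects, while plain strings keep this sketch self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the per-MapWork local-work bookkeeping described
// above: each small-table alias maps to the work to run and the fetch source.
public class LocalWorkSketch {
    public final Map<String, String> aliasToWork = new LinkedHashMap<>();
    public final Map<String, String> aliasToFetchWork = new LinkedHashMap<>();

    // Register one small-table alias with its operator tree and fetch source.
    public void addSmallTable(String alias, String work, String fetchWork) {
        aliasToWork.put(alias, work);
        aliasToFetchWork.put(alias, fetchWork);
    }

    public static void main(String[] args) {
        LocalWorkSketch lw = new LocalWorkSketch();
        lw.addSmallTable("src2", "HashTableSink tree", "scan of src2");
        // A loader-side consumer would later look up both maps by alias.
        System.out.println(lw.aliasToWork.keySet()); // [src2]
    }
}
```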
Re: Review Request 28017: HIVE-8776 - Generate MapredLocalWork in SparkMapJoinResolver [Spark Branch]
On Nov. 14, 2014, 1:53 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 138 https://reviews.apache.org/r/28017/diff/1/?file=763012#file763012line138 currentTask seems to be the container for sparkWork. Do we need to pass in both of them? BTW, currentTask seems to be a misleading variable name. Good point. I always forget this. Changed the name to originalTask. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28017/#review61373 --- On Nov. 14, 2014, 12:03 a.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28017/ --- (Updated Nov. 14, 2014, 12:03 a.m.) Review request for hive, Jimmy Xiang, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8776 https://issues.apache.org/jira/browse/HIVE-8776 Repository: hive-git Description --- In SparkMapJoinResolver, we need to populate MapredLocalWork for all MapWorks with MapJoinOperator. It is needed later in HashTableLoader, for example, to retrieve small hash tables and direct fetch tables. We need to set up information, such as aliasToWork, aliasToFetchWork, directFetchOp, inputFileChangeSensitive, tmpPath, etc., for the new local works. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java d30ae51 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 4b9a6cb ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 785e4a0 Diff: https://reviews.apache.org/r/28017/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 28017: HIVE-8776 - Generate MapredLocalWork in SparkMapJoinResolver [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/28017/ --- (Updated Nov. 14, 2014, 2:43 a.m.) Review request for hive, Jimmy Xiang, Szehon Ho, and Xuefu Zhang. Changes --- Thanks Xuefu for the comments! Bugs: HIVE-8776 https://issues.apache.org/jira/browse/HIVE-8776 Repository: hive-git Description --- In SparkMapJoinResolver, we need to populate MapredLocalWork for all MapWorks with MapJoinOperator. It is needed later in HashTableLoader, for example, to retrieve small hash tables and direct fetch tables. We need to set up information, such as aliasToWork, aliasToFetchWork, directFetchOp, inputFileChangeSensitive, tmpPath, etc., for the new local works. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HashTableLoader.java d30ae51 ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java 9ce1a18 ql/src/java/org/apache/hadoop/hive/ql/plan/MapredLocalWork.java 785e4a0 Diff: https://reviews.apache.org/r/28017/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27933: HIVE-8810 Make HashTableSinkOperator works for Spark Branch [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27933/#review61150 --- ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java https://reviews.apache.org/r/27933/#comment102640 Don't need this check anymore. ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java https://reviews.apache.org/r/27933/#comment102639 Can we use SPARKHASHTABLESINK, or something similar? - Chao Sun On Nov. 12, 2014, 11:58 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27933/ --- (Updated Nov. 12, 2014, 11:58 p.m.) Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8810 https://issues.apache.org/jira/browse/HIVE-8810 Repository: hive-git Description --- Fixed the Spark HashTableSinkOperator Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java 78d9012 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java f1c3564 ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java a58a6c5 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkHashTableSinkDesc.java PRE-CREATION Diff: https://reviews.apache.org/r/27933/diff/ Testing --- Thanks, Jimmy Xiang
Re: Review Request 27933: HIVE-8810 Make HashTableSinkOperator works for Spark Branch [Spark Branch]
On Nov. 13, 2014, 12:34 a.m., Chao Sun wrote: ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java, line 326 https://reviews.apache.org/r/27933/diff/2/?file=760734#file760734line326 Can we use SPARKHASHTABLESINK, or something similar? Jimmy Xiang wrote: Does this need to match the Operator type? I think these two are not related. Somebody can correct me if I'm wrong. One potential issue with using the same name is that RuleRegExp may become harder to define. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27933/#review61150 --- On Nov. 12, 2014, 11:58 p.m., Jimmy Xiang wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27933/ --- (Updated Nov. 12, 2014, 11:58 p.m.) Review request for hive, Chao Sun, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8810 https://issues.apache.org/jira/browse/HIVE-8810 Repository: hive-git Description --- Fixed the Spark HashTableSinkOperator Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java 78d9012 ql/src/java/org/apache/hadoop/hive/ql/exec/OperatorFactory.java f1c3564 ql/src/java/org/apache/hadoop/hive/ql/exec/SparkHashTableSinkOperator.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkReduceSinkMapJoinProc.java a58a6c5 ql/src/java/org/apache/hadoop/hive/ql/plan/SparkHashTableSinkDesc.java PRE-CREATION Diff: https://reviews.apache.org/r/27933/diff/ Testing --- Thanks, Jimmy Xiang
Review Request 27955: HIVE-8842 - auto_join2.q produces incorrect tree [Spark Branch]
: string), _col10 (type: string), _col11 (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-7 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:3 Vertices: Map 1 Map Operator Tree: TableScan alias: src2 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) Stage: Stage-6 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:2 Vertices: Map 3 Map Operator Tree: TableScan alias: src1 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) outputColumnNames: _col0, _col1, _col5, _col6 input vertices: 1 Map 1 Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col0 + _col5) is not null (type: boolean) Statistics: Num rows: 8 Data size: 1653 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {_col0} {_col1} {_col5} {_col6} 1 {key} {value} keys: 0 (_col0 + _col5) (type: double) 1 UDFToDouble(key) 
(type: double) Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {noformat} Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java a8b7ac6 Diff: https://reviews.apache.org/r/27955/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27955: HIVE-8842 - auto_join2.q produces incorrect tree [Spark Branch]
: _col0 (type: string), _col1 (type: string), _col5 (type: string), _col6 (type: string), _col10 (type: string), _col11 (type: string) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5 Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-7 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:3 Vertices: Map 1 Map Operator Tree: TableScan alias: src2 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) Stage: Stage-6 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:2 Vertices: Map 3 Map Operator Tree: TableScan alias: src1 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) outputColumnNames: _col0, _col1, _col5, _col6 input vertices: 1 Map 1 Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col0 + _col5) is not null (type: boolean) Statistics: Num rows: 8 Data size: 1653 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {_col0} {_col1} {_col5} {_col6} 
1 {key} {value} keys: 0 (_col0 + _col5) (type: double) 1 UDFToDouble(key) (type: double) Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {noformat} Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java a8b7ac6 Diff: https://reviews.apache.org/r/27955/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27955: HIVE-8842 - auto_join2.q produces incorrect tree [Spark Branch]
On Nov. 13, 2014, 3:56 a.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 136 https://reviews.apache.org/r/27955/diff/1/?file=760901#file760901line136 It seems that originalWork is the work enclosed in originalTask. Do we really need both as parameters? You're right - originalWork is redundant. Let me change it. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27955/#review61198 --- On Nov. 13, 2014, 2:29 a.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27955/ --- (Updated Nov. 13, 2014, 2:29 a.m.) Review request for hive, Szehon Ho and Xuefu Zhang. Bugs: HIVE-8842 https://issues.apache.org/jira/browse/HIVE-8842 Repository: hive-git Description --- Enabling the SparkMapJoinResolver and SparkReduceSinkMapJoinProc, I see the following: {noformat} explain select * from src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key + src2.key = src3.key); {noformat} produces too many stages (six), and too many HashTableSink.
{noformat} STAGE DEPENDENCIES: Stage-5 is a root stage Stage-4 depends on stages: Stage-5 Stage-3 depends on stages: Stage-4 Stage-7 is a root stage Stage-6 depends on stages: Stage-7 Stage-0 is a root stage STAGE PLANS: Stage: Stage-5 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:3 Vertices: Map 1 Map Operator Tree: TableScan alias: src2 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) Stage: Stage-4 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:2 Vertices: Map 3 Map Operator Tree: TableScan alias: src1 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: key is not null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE Map Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {key} {value} 1 {key} {value} keys: 0 key (type: string) 1 key (type: string) outputColumnNames: _col0, _col1, _col5, _col6 input vertices: 1 Map 1 Statistics: Num rows: 16 Data size: 3306 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (_col0 + _col5) is not null (type: boolean) Statistics: Num rows: 8 Data size: 1653 Basic stats: COMPLETE Column stats: NONE HashTable Sink Operator condition expressions: 0 {_col0} {_col1} {_col5} {_col6} 1 {key} {value} keys: 0 (_col0 + _col5) (type: double) 1 UDFToDouble(key) (type: double) Stage: Stage-3 Spark DagName: szehon_20141112105656_dd50e07d-94ad-4f9d-899e-bcb6d9a39c13:1 Vertices: Map 2 Map Operator Tree: TableScan alias: src3 Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: UDFToDouble(key) is not 
null (type: boolean) Statistics: Num rows: 15 Data size: 3006 Basic stats
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 9, 2014, 10:39 p.m.) Review request for hive. Changes --- Adopting Xuefu's pseudo code. Now, for each BaseWork with a MJ operator, we use a SparkWork for its parent BaseWorks that contain HashTableSinkOperator. I manually tested this patch with several qfiles containing map-join queries, and the results look correct. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214 https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214 This assumes that the result SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS. w3 also contains a map join operator. Dependency in this scenario will be graph-like rather than linear. Chao Sun wrote: I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right? Otherwise, they will form a linear dependency too. Xuefu Zhang wrote: w1 and w2 are fine. They will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear. To add more detail, for each work that has a map join op, we need to create a SparkWork to handle its small tables. So, both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created. Chao Sun wrote: I'm not getting it - why is this dependency not linear? Can you give a counterexample? Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:

HTS_1  HTS_2    HTS_3  HTS_4
   \    /          \    /
    MJ_1            MJ_2
     |               |
   HTS_5           HTS_6
       \            /
           MJ_3

Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1; then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork SW_2, and MJ_3 in another SparkWork SW_3: SW_1 - SW_2 - SW_3. Xuefu Zhang wrote: I don't think we should put (HTS1,HTS2) and (HTS3, HTS4) in the same SparkWork. They belong to different MJs handling different sets of small tables. This will complicate things, making HashTableSinkOperator and HashTableLoader more complicated.
Per dependency, MJ1 doesn't need to wait for HTS3/HTS4 in order to run, and vice versa. Please refer to the pseudo code posted in the JIRA for implementation ideas. Thanks. Resolved via an offline chat. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60482 --- On Nov. 9, 2014, 10:39 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 9, 2014, 10:39 p.m.) Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 46d02bf Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
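The non-linearity being argued above can be made concrete with a small dependency map: the SparkWork holding w1 and w2 depends on two upstream small-table SparkWorks, so the plan is a DAG rather than a chain. The names below are illustrative stand-ins, not Hive's actual SparkWork API.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the scenario discussed above (hypothetical names, not Hive's API).
public class DependencySketch {
    // A plan is a chain only if every work has at most one upstream parent.
    public static boolean isLinear(Map<String, List<String>> deps) {
        return deps.values().stream().allMatch(parents -> parents.size() <= 1);
    }

    // w1 and w2 each get a SparkWork for their small tables; the SparkWork
    // containing (w1, w2) then depends on BOTH of them.
    public static Map<String, List<String>> sampleDeps() {
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("SW_small_w1", List.of());                          // small tables of w1
        deps.put("SW_small_w2", List.of());                          // small tables of w2
        deps.put("SW_w1_w2", List.of("SW_small_w1", "SW_small_w2")); // two parents
        deps.put("SW_w3", List.of("SW_w1_w2"));
        return deps;
    }

    public static void main(String[] args) {
        System.out.println(isLinear(sampleDeps())); // false: a DAG, not a chain
    }
}
```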
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214 https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214 This assumes that the result SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS. w3 also contains a map join operator. Dependency in this scenario will be graph-like rather than linear. I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right? Otherwise, they will form a linear dependency too. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60482 --- On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 6:07 p.m.) Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 8, 2014, 12:44 a.m., Szehon Ho wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 224 https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line224 I've been thinking about this, as you had brought up a pretty rare use-case where a big-table parent of mapjoin1 still had an HTS, but it's for another(!) mapjoin. I don't know if this is still a valid case, but do you think this handles it, as it just indiscriminately adds it to the parent map if it has an HTS? Fixed through an offline chat. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60380 --- On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 6:07 p.m.) Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 100 https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line100 It seems possible that current is MJWork, right? Are you going to add it to the target? Yes, it's possible. But that MJWork will be one whose HTSs are all already handled, so we can go through it to reach the HTSs for other MJWorks. On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 115 https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line115 Frankly, I'm not 100% following the logic. The diagram has operators mixed with works, which makes it hard. But I'm seeing where you're coming from. Maybe you can explain it to me better in person. Here the operator name (MJ, HTS) means a work that contains the operator, so MJ is a BaseWork containing an MJ operator, and the same for HTS. Yes, I think explaining in person would be better. On Nov. 7, 2014, 11:07 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 155 https://reviews.apache.org/r/27627/diff/2/?file=754549#file754549line155 I think there is a separate JIRA handling combining mapjoins, owned by Szehon. In my understanding, Szehon's JIRA is trying to put MJ operators in the same BaseWork. But there are some cases where we cannot apply this optimization, and MJ operators will be in different BaseWorks. My work here is to try to put them in the same SparkWork, if there's no dependency among them. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60403 --- On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 6:07 p.m.) Review request for hive. 
Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 8, 2014, 3:15 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 214 https://reviews.apache.org/r/27627/diff/3/?file=754597#file754597line214 This assumes that the resulting SparkWorks will be linearly dependent on each other, which isn't true in general. Let's say there are two works (w1 and w2), each having a map join operator. w1 and w2 are connected to w3 via HTS. w3 also contains a map join operator. The dependency in this scenario will be graph-shaped rather than linear. Chao Sun wrote: I was thinking, in this case, if there's no dependency between w1 and w2, they can be put in the same SparkWork, right? Otherwise, they will form a linear dependency too. Xuefu Zhang wrote: w1 and w2 are fine; they will be in the same SparkWork. This SparkWork will depend on both the SparkWork generated at w1 and the SparkWork generated at w2. This dependency is not linear. In more detail: for each work that has a map join op, we need to create a SparkWork to handle its small tables. So, both w1 and w2 will need to create such a SparkWork. While w1 and w2 are in the same SparkWork, this SparkWork depends on the two SparkWorks created. I'm not getting it: why is this dependency not linear? Can you give a counterexample? Suppose w1 (MJ_1), w2 (MJ_2), and w3 (MJ_3) are like the following:

    HTS_1  HTS_2    HTS_3  HTS_4
       \   /           \   /
       MJ_1            MJ_2
        |                |
      HTS_5            HTS_6
          \             /
           \           /
              MJ_3

Then, what I'm doing is to put HTS_1, HTS_2, HTS_3, and HTS_4 in the same SparkWork, say SW_1; then MJ_1, MJ_2, HTS_5, and HTS_6 will be in another SparkWork SW_2, and MJ_3 in another SparkWork SW_3. So: SW_1 -> SW_2 -> SW_3. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60482 --- On Nov. 7, 2014, 6:07 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 6:07 p.m.) Review request for hive. 
Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
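The SW_1 - SW_2 - SW_3 grouping Chao describes above amounts to layering works by their longest distance across HTS boundaries. The sketch below is purely illustrative: the work names come from the example in the thread, and `assignStages` is a hypothetical helper written for this email, not part of Hive's actual SparkMapJoinResolver. Works joined without an intervening HTS (e.g. MJ_1 and HTS_5) are modeled as a single node.

```java
import java.util.*;

// Illustrative sketch only: layer BaseWork groups into SparkWork stages so
// that every HashTableSink (HTS) producer lands in an earlier stage than
// the map-join work that consumes its small table.
public class StageSplitSketch {

    // htsEdges contains only edges that cross an HTS boundary:
    // producer work -> consumer work. Stage index = longest path from a source.
    static Map<String, Integer> assignStages(Set<String> works,
                                             Map<String, List<String>> htsEdges) {
        Map<String, Integer> stage = new HashMap<>();
        for (String w : works) stage.put(w, 0);
        boolean changed = true;
        while (changed) {                         // relax until fixpoint (DAG)
            changed = false;
            for (Map.Entry<String, List<String>> e : htsEdges.entrySet()) {
                for (String consumer : e.getValue()) {
                    int s = stage.get(e.getKey()) + 1;
                    if (stage.get(consumer) < s) {
                        stage.put(consumer, s);
                        changed = true;
                    }
                }
            }
        }
        return stage;
    }

    public static void main(String[] args) {
        // the example from the thread: MJ_1 shares a work with HTS_5,
        // and MJ_2 with HTS_6, so only three stages (SparkWorks) result
        Set<String> works = new HashSet<>(Arrays.asList(
            "HTS_1", "HTS_2", "HTS_3", "HTS_4", "MJ_1+HTS_5", "MJ_2+HTS_6", "MJ_3"));
        Map<String, List<String>> htsEdges = new HashMap<>();
        htsEdges.put("HTS_1", Arrays.asList("MJ_1+HTS_5"));
        htsEdges.put("HTS_2", Arrays.asList("MJ_1+HTS_5"));
        htsEdges.put("HTS_3", Arrays.asList("MJ_2+HTS_6"));
        htsEdges.put("HTS_4", Arrays.asList("MJ_2+HTS_6"));
        htsEdges.put("MJ_1+HTS_5", Arrays.asList("MJ_3"));
        htsEdges.put("MJ_2+HTS_6", Arrays.asList("MJ_3"));
        Map<String, Integer> stage = assignStages(works, htsEdges);
        // stage 0 = SW_1 (all four small-table HTS works),
        // stage 1 = SW_2 (MJ_1, MJ_2 with their HTSs), stage 2 = SW_3 (MJ_3)
        System.out.println("MJ_3 lands in SW_" + (stage.get("MJ_3") + 1));
    }
}
```

Xuefu's objection still holds in general: with works that cannot be merged, the stage graph is a DAG rather than a chain, and this layering only collapses to a linear SW_1 -> SW_2 -> SW_3 in examples like the one above.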
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 3:57 p.m.) Review request for hive. Changes --- Another patch, with a cleaner solution in my opinion. I tested it with subquery_multiinsert.q and the result looks fine. Please give suggestions! Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 7, 2014, 6:07 p.m.) Review request for hive. Changes --- Instead of using a Set, we should use a Map from a BaseWork w/ MJ to all its parent BaseWorks w/ HTSs. The principle is, we cannot process all BaseWorks below this MJ until all HTSs are processed. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/plan/SparkWork.java 66fd6b6 Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 5, 2014, 9:24 p.m., Szehon Ho wrote: Hi Chao, I left a review for a form of this patch at https://reviews.apache.org/r/27640/, as Suhas put it up for a separate review in combination with his patch. Thanks, I'll take a look there. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review60034 --- On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 5, 2014, 5:51 p.m.) Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27627: Split map-join plan into 2 SparkTasks in 3 stages [Spark Branch]
On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 128 https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line128 Do you mean parentTasks != null? That was a silly mistake. On Nov. 5, 2014, 7:16 p.m., Xuefu Zhang wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 185 https://reviews.apache.org/r/27627/diff/1/?file=750389#file750389line185 Merge with itself? Yes, in this case (current BaseWork has no MJ), we merge all parent SparkWorks into the current SparkWork. - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/#review59987 --- On Nov. 5, 2014, 5:51 p.m., Chao Sun wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27627/ --- (Updated Nov. 5, 2014, 5:51 p.m.) Review request for hive. Bugs: HIVE-8622 https://issues.apache.org/jira/browse/HIVE-8622 Repository: hive-git Description --- This is a sub-task of map-join for spark https://issues.apache.org/jira/browse/HIVE-7613 This can use the baseline patch for map-join https://issues.apache.org/jira/browse/HIVE-8616 Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION Diff: https://reviews.apache.org/r/27627/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27640: HIVE-8700 Replace ReduceSink to HashTableSink (or equi.) for small tables [Spark Branch]
On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 188 https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line188 Can you elaborate why we need this assumption? This may not be true in all cases. Actually, we don't need this assumption anymore. I'll remove it. On Nov. 5, 2014, 9:23 p.m., Szehon Ho wrote: ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java, line 141 https://reviews.apache.org/r/27640/diff/1/?file=750693#file750693line141 Please use proper javadoc notation for your javadocs. I didn't use javadoc since it's a private method. Maybe I can write a better description of what it does? - Chao --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27640/#review60031 --- On Nov. 5, 2014, 8:29 p.m., Suhas Satish wrote: --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27640/ --- (Updated Nov. 5, 2014, 8:29 p.m.) Review request for hive, Chao Sun, Jimmy Xiang, Szehon Ho, and Xuefu Zhang. Repository: hive-git Description --- This replaces ReduceSinks with HashTableSinks in smaller tables for a map-join. But the condition check field to detect map-join is actually being set in CommonJoinResolver, which doesn't exist yet. We need to decide where is the right place to populate this field. Diffs - ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/SparkMapJoinResolver.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 795a5d7 Diff: https://reviews.apache.org/r/27640/diff/ Testing --- Thanks, Suhas Satish
Re: Review Request 27117: HIVE-8457 - MapOperator initialization fails when multiple Spark threads is enabled [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27117/ --- (Updated Oct. 24, 2014, 4:51 p.m.) Review request for hive and Xuefu Zhang. Changes --- Thanks Xuefu for the comments. I've updated my patch. Bugs: HIVE-8457 https://issues.apache.org/jira/browse/HIVE-8457 Repository: hive-git Description --- Currently, on the Spark branch, each thread is bound to a thread-local IOContext, which gets initialized when we generate an input HadoopRDD, and is later used in MapOperator, FilterOperator, etc. Given the introduction of HIVE-8118, we may have multiple downstream RDDs that share the same input HadoopRDD, and we would like the HadoopRDD to be cached, to avoid scanning the same table multiple times. A typical case would be like the following:

    inputRDD   inputRDD
       |          |
     MT_11      MT_12
       |          |
     RT_1       RT_2

Here, MT_11 and MT_12 are MapTrans from a split MapWork, and RT_1 and RT_2 are two ReduceTrans. Note that this example is simplified, as we may also have a ShuffleTran between a MapTran and a ReduceTran. When multiple Spark threads are running, MT_11 may be executed first, and its request for an iterator from the HadoopRDD will trigger the creation of the iterator, which in turn triggers the initialization of the IOContext associated with that particular thread. Now, the problem is: when MT_12 starts executing, it will also ask for an iterator from the HadoopRDD, and since the RDD is already cached, instead of creating a new iterator, it will just fetch it from the cached result. However, this skips the initialization of the IOContext associated with this particular thread. So, when MT_12 starts executing, it will try to initialize the MapOperator, but since the IOContext is not initialized, this will fail miserably. 
Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java 20ea977 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 00a6f3d ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 4de3ad4 ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 58e1ceb ql/src/java/org/apache/hadoop/hive/ql/io/IOContext.java 5fb3b13 Diff: https://reviews.apache.org/r/27117/diff/ Testing --- All multi-insertion related tests are passing on my local machine. Thanks, Chao Sun
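The race described above boils down to needing an idempotent per-thread initialization guard: a thread whose input comes from the cached RDD never runs the reader-creation path that would normally set up its IOContext, so the consuming side must be able to initialize it on demand. This is a minimal, self-contained sketch of that pattern only; the class and method names are hypothetical stand-ins, not Hive's actual IOContext or SparkMapRecordHandler code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a per-thread init guard: whichever code path runs
// first on a thread (record-reader creation, or the map handler served from
// a cached RDD) performs the one-time setup; later calls are no-ops.
public class ThreadLocalInitGuard {
    private static final ThreadLocal<Boolean> initialized =
        ThreadLocal.withInitial(() -> Boolean.FALSE);

    // counts how many real initializations happened (for demonstration)
    public static final AtomicInteger initCount = new AtomicInteger();

    public static void ensureInitialized() {
        if (!initialized.get()) {
            initCount.incrementAndGet();   // stand-in for real IOContext setup
            initialized.set(Boolean.TRUE);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // each thread calls the guard twice; setup still runs once per thread
        Runnable task = () -> { ensureInitialized(); ensureInitialized(); };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        t1.join(); t2.join();
        System.out.println(initCount.get());
    }
}
```

The point of the guard is that correctness no longer depends on which thread happens to materialize the HadoopRDD iterator first.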
Review Request 27148: HIVE-8533 - Enable all q-tests for multi-insertion [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27148/ --- Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8533 https://issues.apache.org/jira/browse/HIVE-8533 Repository: hive-git Description --- As HIVE-8436 is done, we should be able to enable all multi-insertion related tests. This JIRA is created to track this and record any potential issue encountered. Diffs - itests/src/test/resources/testconfiguration.properties db8866d ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby10.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby11.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby3_map_skew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_map.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_noskew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby9.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_complex_types.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_complex_types_multi_single_reducer.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_multi_insert_common_distinct.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/pcr.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_13.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_15.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_16.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/table_access_keys_stats.q.out PRE-CREATION Diff: 
https://reviews.apache.org/r/27148/diff/ Testing --- auto_smb_mapjoin_14.q groupby10.q groupby11.q groupby3_map_skew.q groupby7.q groupby7_noskew_multi_single_reducer.q groupby8.q groupby8_map.q groupby8_map_skew.q groupby8_noskew.q groupby9.q groupby_complex_types.q groupby_complex_types_multi_single_reducer.q groupby_multi_insert_common_distinct.q pcr.q smb_mapjoin_13.q smb_mapjoin_15.q smb_mapjoin_16.q table_access_keys_stats.q Thanks, Chao Sun
Re: Review Request 27148: HIVE-8533 - Enable all q-tests for multi-insertion [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27148/ --- (Updated Oct. 24, 2014, 6:03 p.m.) Review request for hive, Brock Noland, Szehon Ho, and Xuefu Zhang. Bugs: HIVE-8533 https://issues.apache.org/jira/browse/HIVE-8533 Repository: hive-git Description --- As HIVE-8436 is done, we should be able to enable all multi-insertion related tests. This JIRA is created to track this and record any potential issue encountered. Diffs - itests/src/test/resources/testconfiguration.properties db8866d ql/src/test/results/clientpositive/spark/auto_smb_mapjoin_14.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby10.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby11.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby3_map_skew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_noskew_multi_single_reducer.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_map.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_map_skew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby8_noskew.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby9.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_complex_types.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_complex_types_multi_single_reducer.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby_multi_insert_common_distinct.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/pcr.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_13.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_15.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/smb_mapjoin_16.q.out PRE-CREATION 
ql/src/test/results/clientpositive/spark/table_access_keys_stats.q.out PRE-CREATION Diff: https://reviews.apache.org/r/27148/diff/ Testing --- auto_smb_mapjoin_14.q groupby10.q groupby11.q groupby3_map_skew.q groupby7.q groupby7_noskew_multi_single_reducer.q groupby8.q groupby8_map.q groupby8_map_skew.q groupby8_noskew.q groupby9.q groupby_complex_types.q groupby_complex_types_multi_single_reducer.q groupby_multi_insert_common_distinct.q pcr.q smb_mapjoin_13.q smb_mapjoin_15.q smb_mapjoin_16.q table_access_keys_stats.q Thanks, Chao Sun
Re: Build failure on trunk
Maybe it's because the patch didn't apply? 2014-10-24 17:14:50,934 INFO LocalCommand$CollectLogPolicy.handleOutput:69 The patch does not appear to apply with p0, p1, or p2 2014-10-24 17:14:50,938 INFO LocalCommand$CollectLogPolicy.handleOutput:69 + exit 1 2014-10-24 17:14:50,939 ERROR PTest.run:175 Test run exited with an unexpected error org.apache.hive.ptest.execution.ssh.NonZeroExitCodeException: Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n /usr/java/jdk1.7.0_45-cloudera ]] On Fri, Oct 24, 2014 at 2:19 PM, Prasanth Jayachandran pjayachand...@hortonworks.com wrote: Unit test run for HIVE-8454 spent 2hr 48mins but finally it says “no tests executed”. https://issues.apache.org/jira/browse/HIVE-8454?focusedCommentId=14183509page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14183509 http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-TRUNK-Build/1444/ Anyone know why? - Prasanth On Thu, Oct 23, 2014 at 9:26 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote: Thanks Xuefu - I appreciate it! On Thu, Oct 23, 2014 at 9:15 PM, Xuefu Zhang xzh...@cloudera.com wrote: You can add CLEAR LIBRARY CACHE in the description for any JIRA, which will clear local maven repo. I added it to HIVE-6165. On Thu, Oct 23, 2014 at 9:09 PM, Gunther Hagleitner ghagleit...@hortonworks.com wrote: Builds are running again (reverted patch). I've re-uploaded the patches that had a failed run because of it. Sorry about that... Thanks, Gunther. On Thu, Oct 23, 2014 at 8:07 PM, Gunther Hagleitner gunther.hagleit...@gmail.com wrote: The builds are failing right now on trunk after I committed a change that requires new/updated calcite libs. (Sorry about that). Is it possible for someone to wipe the .m2 cache on the build machine, so it would download a new version with the changes? Thank you, Gunther. 
-- CONFIDENTIALITY NOTICE NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.
Re: Review Request 27046: HIVE-8545 - Exception when casting Text to BytesWritable [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27046/ --- (Updated Oct. 23, 2014, 5:32 p.m.) Review request for hive, Brock Noland and Xuefu Zhang. Changes --- Thanks Xuefu for the suggestions. This patch uses a blank Configuration instead of serializing/deserializing the JobConf. Bugs: hive-8545 https://issues.apache.org/jira/browse/hive-8545 Repository: hive-git Description --- With the current multi-insertion implementation, when caching is enabled for the input RDD, a query may fail with the following exception: 2014-10-21 13:57:34,742 WARN [task-result-getter-0]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:67) org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:61) org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:234) org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) The fix should be easy. However, interestingly, this error doesn't show up when the caching is turned off. We need to find out why. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java dc5d148 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 9849b49 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 25a4515 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java 8a3dbf2 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 0f21b46 Diff: https://reviews.apache.org/r/27046/diff/ Testing --- Thanks, Chao Sun
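The ClassCastException above comes from copy code that assumed a concrete key type (BytesWritable) while the cached input actually carried Text keys. A type-agnostic copy that round-trips the value through its own serialization avoids the cast entirely. The sketch below is illustrative only: it uses a tiny stand-in Writable interface rather than Hadoop's, and `copy`/`SimpleText` are hypothetical names; a real patch would more likely lean on Hadoop's own cloning utilities.

```java
import java.io.*;

// Hedged sketch of type-agnostic record copying: instead of casting every
// key to one concrete class, serialize the value and deserialize it into a
// fresh instance of the same runtime class.
public class GenericCopySketch {
    // minimal stand-in for org.apache.hadoop.io.Writable
    public interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // copy without knowing the concrete type at compile time
    @SuppressWarnings("unchecked")
    public static <W extends Writable> W copy(W value) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        value.write(new DataOutputStream(buf));
        W fresh = (W) value.getClass().getDeclaredConstructor().newInstance();
        fresh.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        return fresh;
    }

    // minimal Text-like value for demonstration
    public static class SimpleText implements Writable {
        public String s = "";
        public SimpleText() {}
        public SimpleText(String s) { this.s = s; }
        public void write(DataOutput out) throws IOException { out.writeUTF(s); }
        public void readFields(DataInput in) throws IOException { s = in.readUTF(); }
    }

    public static void main(String[] args) throws Exception {
        SimpleText orig = new SimpleText("hello");
        SimpleText dup = copy(orig);
        // dup holds equal content but is a distinct object, which matters
        // when Hadoop record readers reuse the same key/value instances
        System.out.println(dup.s.equals(orig.s) && dup != orig);
    }
}
```

The deep copy matters here because cached RDD entries must not alias the record reader's reused key/value objects.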
Review Request 27117: HIVE-8457 - MapOperator initialization fails when multiple Spark threads is enabled [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27117/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-8457 https://issues.apache.org/jira/browse/HIVE-8457 Repository: hive-git Description --- Currently, on the Spark branch, each thread is bound to a thread-local IOContext, which gets initialized when we generate an input HadoopRDD, and is later used in MapOperator, FilterOperator, etc. Given the introduction of HIVE-8118, we may have multiple downstream RDDs that share the same input HadoopRDD, and we would like the HadoopRDD to be cached, to avoid scanning the same table multiple times. A typical case would be like the following:

    inputRDD   inputRDD
       |          |
     MT_11      MT_12
       |          |
     RT_1       RT_2

Here, MT_11 and MT_12 are MapTrans from a split MapWork, and RT_1 and RT_2 are two ReduceTrans. Note that this example is simplified, as we may also have a ShuffleTran between a MapTran and a ReduceTran. When multiple Spark threads are running, MT_11 may be executed first, and its request for an iterator from the HadoopRDD will trigger the creation of the iterator, which in turn triggers the initialization of the IOContext associated with that particular thread. Now, the problem is: when MT_12 starts executing, it will also ask for an iterator from the HadoopRDD, and since the RDD is already cached, instead of creating a new iterator, it will just fetch it from the cached result. However, this skips the initialization of the IOContext associated with this particular thread. So, when MT_12 starts executing, it will try to initialize the MapOperator, but since the IOContext is not initialized, this will fail miserably. 
Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkMapRecordHandler.java 20ea977 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 00a6f3d ql/src/java/org/apache/hadoop/hive/ql/io/HiveContextAwareRecordReader.java 58e1ceb Diff: https://reviews.apache.org/r/27117/diff/ Testing --- All multi-insertion related tests are passing on my local machine. Thanks, Chao Sun
Review Request 27046: HIVE-8545 - Exception when casting Text to BytesWritable [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/27046/ --- Review request for hive, Brock Noland and Xuefu Zhang. Bugs: hive-8545 https://issues.apache.org/jira/browse/hive-8545 Repository: hive-git Description --- With the current multi-insertion implementation, when caching is enabled for input RDD, query may fail with the following exception: 2014-10-21 13:57:34,742 WARN [task-result-getter-0]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:67) org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:61) org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002) scala.collection.Iterator$$anon$11.next(Iterator.scala:328) org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:234) org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163) org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70) org.apache.spark.rdd.RDD.iterator(RDD.scala:227) org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) org.apache.spark.rdd.RDD.iterator(RDD.scala:229) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) org.apache.spark.scheduler.Task.run(Task.scala:56) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) java.lang.Thread.run(Thread.java:745) The 
fix should be easy. However, interestingly, this error doesn't show up when the caching is turned off. We need to find out why. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java dc5d148 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveCopyFunction.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 9849b49 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 25a4515 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java 8a3dbf2 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 0f21b46 Diff: https://reviews.apache.org/r/27046/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 27046: HIVE-8545 - Exception when casting Text to BytesWritable [Spark Branch]
On Oct. 22, 2014, 11:40 p.m., Xuefu Zhang wrote:

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java, line 25
  https://reviews.apache.org/r/27046/diff/1/?file=728820#file728820line25

  Why does KO become Writable now? Should it be WritableComparable, according to MapInput?

My mistake, it should be WritableComparable. Thanks for pointing that out.

- Chao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27046/#review57932
---

On Oct. 22, 2014, 5:50 p.m., Chao Sun wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27046/
---

(Updated Oct. 22, 2014, 5:50 p.m.)

Review request for hive, Brock Noland and Xuefu Zhang.

Bugs: HIVE-8545
    https://issues.apache.org/jira/browse/HIVE-8545

Repository: hive-git

Description
---

With the current multi-insertion implementation, when caching is enabled for the input RDD, a query may fail with the following exception:

2014-10-21 13:57:34,742 WARN [task-result-getter-0]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:67)
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:61)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:234)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

The fix should be easy. However, interestingly, this error doesn't show up when caching is turned off. We need to find out why.

Diffs
---

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java dc5d148
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveCopyFunction.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 9849b49
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 25a4515
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java 8a3dbf2
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 0f21b46

Diff: https://reviews.apache.org/r/27046/diff/

Testing
---

Thanks,
Chao Sun
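The ClassCastException above comes from a copy function that hard-casts every key to one concrete type, even though cached keys can also be Text. The following self-contained sketch (hypothetical classes, not Hive's actual MapInput/HiveCopyFunction code) shows why the hard cast fails and how dispatching through the common interface, as the patch's move to WritableComparable does, avoids it:

```java
// Hypothetical stand-ins for Hadoop's Writable key types.
interface Key { Key copy(); }

final class TextKey implements Key {
    final String s;
    TextKey(String s) { this.s = s; }
    public Key copy() { return new TextKey(s); }
}

final class BytesKey implements Key {
    final byte[] b;
    BytesKey(byte[] b) { this.b = b.clone(); }
    public Key copy() { return new BytesKey(b); }
}

public class CopyDemo {
    // Buggy version: assumes every key is a BytesKey, like the original
    // CopyFunction's hard cast of Text to BytesWritable.
    static Key copyAssumingBytes(Key k) {
        return new BytesKey(((BytesKey) k).b);  // ClassCastException for TextKey
    }

    // Fixed version: dispatch through the common interface instead of casting.
    static Key copyPolymorphic(Key k) {
        return k.copy();
    }

    public static void main(String[] args) {
        Key text = new TextKey("row1");
        try {
            copyAssumingBytes(text);
        } catch (ClassCastException e) {
            System.out.println("ClassCastException, as in HIVE-8545");
        }
        System.out.println(copyPolymorphic(text) instanceof TextKey);
    }
}
```

This also suggests why the error only surfaces with caching enabled: the copy function only runs on the cache path, so without caching the mismatched cast is never executed.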
Re: Review Request 27046: HIVE-8545 - Exception when casting Text to BytesWritable [Spark Branch]
On Oct. 22, 2014, 11:36 p.m., Xuefu Zhang wrote:

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java, line 77
  https://reviews.apache.org/r/27046/diff/1/?file=728816#file728816line77

  I think we should let this stay in SparkUtils, which would otherwise become an empty class.

OK. To make it consistent, I also moved copyHiveKey to SparkUtilities.

- Chao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27046/#review57929
---

On Oct. 22, 2014, 5:50 p.m., Chao Sun wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27046/
---

(Updated Oct. 22, 2014, 5:50 p.m.)

Review request for hive, Brock Noland and Xuefu Zhang.

Bugs: HIVE-8545
    https://issues.apache.org/jira/browse/HIVE-8545

Repository: hive-git

Description
---

With the current multi-insertion implementation, when caching is enabled for the input RDD, a query may fail with the following exception:

2014-10-21 13:57:34,742 WARN [task-result-getter-0]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:67)
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:61)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:234)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

The fix should be easy. However, interestingly, this error doesn't show up when caching is turned off. We need to find out why.

Diffs
---

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java dc5d148
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveCopyFunction.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 9849b49
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 25a4515
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java 8a3dbf2
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 0f21b46

Diff: https://reviews.apache.org/r/27046/diff/

Testing
---

Thanks,
Chao Sun
Re: Review Request 27046: HIVE-8545 - Exception when casting Text to BytesWritable [Spark Branch]
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/27046/
---

(Updated Oct. 23, 2014, 12:42 a.m.)

Review request for hive, Brock Noland and Xuefu Zhang.

Changes
---

Thanks Xuefu for the comments. I've changed my patch accordingly.

Bugs: HIVE-8545
    https://issues.apache.org/jira/browse/HIVE-8545

Repository: hive-git

Description
---

With the current multi-insertion implementation, when caching is enabled for the input RDD, a query may fail with the following exception:

2014-10-21 13:57:34,742 WARN [task-result-getter-0]: scheduler.TaskSetManager (Logging.scala:logWarning(71)) - Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.BytesWritable
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:67)
    org.apache.hadoop.hive.ql.exec.spark.MapInput$CopyFunction.call(MapInput.java:61)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    org.apache.spark.api.java.JavaPairRDD$$anonfun$pairFunToScalaFun$1.apply(JavaPairRDD.scala:1002)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:234)
    org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:163)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    org.apache.spark.scheduler.Task.run(Task.scala:56)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:745)

The fix should be easy. However, interestingly, this error doesn't show up when caching is turned off. We need to find out why.

Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java dc5d148
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveCopyFunction.java PRE-CREATION
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 9849b49
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 25a4515
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java 8a3dbf2
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 0f21b46

Diff: https://reviews.apache.org/r/27046/diff/

Testing
---

Thanks,
Chao Sun
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
On Oct. 19, 2014, 12:15 a.m., Xuefu Zhang wrote:

  ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q, line 1
  https://reviews.apache.org/r/26706/diff/4/?file=724864#file724864line1

  Could we make this test Spark-only, since splitting doesn't apply to MR or Tez? I think we have a dir for Spark-only tests.

Chao Sun wrote:

  I also wanted to make this a Spark-only test, but the feature hasn't been implemented yet (I think Szehon is working on it). I made the file name start with spark_ so that in the future we can move it to the Spark-only test directory. Currently, though, there's no test dir for Spark, only a result dir.

Xuefu Zhang wrote:

  In that case, let's rename the test to something more generic. It's a valid test case for MR as well, but also a special case for Spark.

OK, thanks. I've updated the patch accordingly.

- Chao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/#review57286
---

On Oct. 19, 2014, 12:46 a.m., Chao Sun wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/
---

(Updated Oct. 19, 2014, 12:46 a.m.)

Review request for hive and Xuefu Zhang.

Bugs: HIVE-8436
    https://issues.apache.org/jira/browse/HIVE-8436

Repository: hive-git

Description
---

Based on the design doc, we need to split the operator tree of a work in SparkWork when the work is connected to multiple child works. The splitting is done by cloning the original work and removing the unwanted branches from the operator tree; please refer to the design doc for details. This process should happen right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. The process should also keep information about the original work and its clones, since that information will be needed during SparkPlan generation (HIVE-8437).
Diffs - itests/src/test/resources/testconfiguration.properties 558dd02 ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java c956101 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 3773dcb ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 
a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 5318a8b ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results
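The description above says a work connected to multiple child works is split by cloning the original work and pruning the unwanted branches. A minimal self-contained sketch of that idea (hypothetical `Op`/`SplitWorkDemo` classes, not Hive's actual SparkWork or operator types): an operator tree that fans out at one operator is cloned once per branch, and each clone keeps only its own branch, so every resulting work has a single output path.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical operator-tree node; Hive's real operators are far richer.
class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }

    // Deep-copy this operator and its entire subtree.
    Op deepCopy() {
        Op c = new Op(name);
        for (Op child : children) c.children.add(child.deepCopy());
        return c;
    }
}

public class SplitWorkDemo {
    // Clone the tree once per child of the branching operator, pruning the
    // other branches so each clone carries exactly one output path.
    static List<Op> split(Op root, Op branchPoint) {
        List<Op> works = new ArrayList<>();
        for (int i = 0; i < branchPoint.children.size(); i++) {
            Op clone = root.deepCopy();
            Op clonedBranch = find(clone, branchPoint.name);
            Op keep = clonedBranch.children.get(i);
            clonedBranch.children.clear();
            clonedBranch.children.add(keep);
            works.add(clone);
        }
        return works;
    }

    static Op find(Op node, String name) {
        if (node.name.equals(name)) return node;
        for (Op c : node.children) {
            Op r = find(c, name);
            if (r != null) return r;
        }
        return null;
    }

    public static void main(String[] args) {
        // TS -> SEL -> {FS1, FS2}: one table scan feeding two file sinks,
        // as in a multi-insert query.
        Op ts = new Op("TS"); Op sel = new Op("SEL");
        ts.children.add(sel);
        sel.children.add(new Op("FS1"));
        sel.children.add(new Op("FS2"));

        List<Op> works = split(ts, sel);
        System.out.println(works.size());  // one work per file sink: 2
    }
}
```

In the real patch the mapping from each clone back to the original work is also recorded, since SparkPlan generation (HIVE-8437) needs it; this sketch omits that bookkeeping.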
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
checked against the old results. Also, I created a new test, spark_multi_insert_split_work.q, to check that splitting won't generate duplicate FSs. Thanks, Chao Sun
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
On Oct. 20, 2014, 9:52 p.m., Xuefu Zhang wrote:

  itests/src/test/resources/testconfiguration.properties, line 509
  https://reviews.apache.org/r/26706/diff/7/?file=726397#file726397line509

  We might need to change this as well.

Can't believe I missed this. Sorry for the sloppiness!

- Chao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/#review57445
---

On Oct. 20, 2014, 9:10 p.m., Chao Sun wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/
---

(Updated Oct. 20, 2014, 9:10 p.m.)

Review request for hive and Xuefu Zhang.

Bugs: HIVE-8436
    https://issues.apache.org/jira/browse/HIVE-8436

Repository: hive-git

Description
---

Based on the design doc, we need to split the operator tree of a work in SparkWork when the work is connected to multiple child works. The splitting is done by cloning the original work and removing the unwanted branches from the operator tree; please refer to the design doc for details. This process should happen right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. The process should also keep information about the original work and its clones, since that information will be needed during SparkPlan generation (HIVE-8437).
Diffs - itests/src/test/resources/testconfiguration.properties 558dd02 ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveBaseFunctionResultList.java c956101 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkUtilities.java 3773dcb ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/queries/clientpositive/multi_insert_mixed.q PRE-CREATION ql/src/test/results/clientpositive/multi_insert_mixed.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_map.q.out 310f2fe ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out e6054c9 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out d0f3e76 ql/src/test/results/clientpositive/spark/groupby_cube1.q.out d40c7bb ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out b4ded62 ql/src/test/results/clientpositive/spark/groupby_position.q.out d2529bb ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 7fa6130 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out 4a4070b 
ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 62c179e ql/src/test/results/clientpositive/spark/input12.q.out a4b7a3c ql/src/test/results/clientpositive/spark/input13.q.out 5c799dc ql/src/test/results/clientpositive/spark/input1_limit.q.out 1105ed8 ql/src/test/results/clientpositive/spark/input_part2.q.out 514f54a ql/src/test/results/clientpositive/spark/insert1.q.out 1b88026 ql/src/test/results/clientpositive/spark/insert_into3.q.out 5b2aa78 ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out cbf7204 ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 3905d84 ql/src/test/results/clientpositive/spark/multi_insert.q.out 0404119 ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 903e966 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out 730fb4f ql/src/test/results/clientpositive/spark/multi_insert_mixed.q.out PRE-CREATION ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out 1f31f56 ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out 4ded9d2 ql/src/test/results
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
On Oct. 19, 2014, 12:15 a.m., Xuefu Zhang wrote:

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java, line 64
  https://reviews.apache.org/r/26706/diff/4/?file=724853#file724853line64

  Could we reuse this as a utility? I think we have the same/similar thing somewhere.

You're right - HiveBaseFunctionResultList has the same method. I've put it in SparkUtilities.

On Oct. 19, 2014, 12:15 a.m., Xuefu Zhang wrote:

  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java, line 250
  https://reviews.apache.org/r/26706/diff/4/?file=724854#file724854line250

  Do we need to disconnect it, or does remove do this automatically?

Yes, remove also removes all edges connected to this node.

On Oct. 19, 2014, 12:15 a.m., Xuefu Zhang wrote:

  ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q, line 1
  https://reviews.apache.org/r/26706/diff/4/?file=724864#file724864line1

  Could we make this test Spark-only, since splitting doesn't apply to MR or Tez? I think we have a dir for Spark-only tests.

I also wanted to make this a Spark-only test, but the feature hasn't been implemented yet (I think Szehon is working on it). I made the file name start with spark_ so that in the future we can move it to the Spark-only test directory. Currently, though, there's no test dir for Spark, only a result dir.

- Chao

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/#review57286
---

On Oct. 17, 2014, 9:24 p.m., Chao Sun wrote:

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/
---

(Updated Oct. 17, 2014, 9:24 p.m.)

Review request for hive and Xuefu Zhang.

Bugs: HIVE-8436
    https://issues.apache.org/jira/browse/HIVE-8436

Repository: hive-git

Description
---

Based on the design doc, we need to split the operator tree of a work in SparkWork when the work is connected to multiple child works. The splitting is done by cloning the original work and removing the unwanted branches from the operator tree; please refer to the design doc for details. This process should happen right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. The process should also keep information about the original work and its clones, since that information will be needed during SparkPlan generation (HIVE-8437).

Diffs
---

  itests/src/test/resources/testconfiguration.properties 558dd02
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a
  ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1
  ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q PRE-CREATION
  ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81
  ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985
  ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b
  ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa
ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src
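The exchange above about SparkPlanGenerator asks whether a node must be explicitly disconnected before removal, and the answer is that removing it also removes its incident edges. A tiny self-contained sketch of that semantics (hypothetical `GraphDemo` class, not Hive's actual graph or plan types):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical directed graph where remove(node) also drops every edge
// touching the node, so no separate disconnect step is needed.
public class GraphDemo {
    final Set<String> nodes = new HashSet<>();
    final Set<List<String>> edges = new HashSet<>();

    void connect(String a, String b) {
        nodes.add(a);
        nodes.add(b);
        edges.add(Arrays.asList(a, b));
    }

    // Remove the node and all edges incident to it in a single call.
    void remove(String n) {
        nodes.remove(n);
        edges.removeIf(e -> e.contains(n));
    }

    public static void main(String[] args) {
        GraphDemo g = new GraphDemo();
        g.connect("map1", "reduce1");
        g.connect("map2", "reduce1");
        g.remove("reduce1");
        System.out.println(g.edges.size());  // both incident edges are gone: 0
        System.out.println(g.nodes);         // map1 and map2 remain
    }
}
```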
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/
---

(Updated Oct. 17, 2014, 6:04 p.m.)

Review request for hive and Xuefu Zhang.

Changes
---

Added a test to check that splitting work doesn't create duplicate FSs.

Bugs: HIVE-8436
    https://issues.apache.org/jira/browse/HIVE-8436

Repository: hive-git

Description
---

Based on the design doc, we need to split the operator tree of a work in SparkWork when the work is connected to multiple child works. The splitting is done by cloning the original work and removing the unwanted branches from the operator tree; please refer to the design doc for details. This process should happen right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. The process should also keep information about the original work and its clones, since that information will be needed during SparkPlan generation (HIVE-8437).

Diffs (updated)
---

  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 5318a8b ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results/clientpositive/spark/multi_insert.q.out bae325f ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 280a893 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out b07c582 ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out fd477ca ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out 44991e3 ql/src/test/results/clientpositive/spark/ppd_multi_insert.q.out 96f2c06 ql/src/test/results/clientpositive/spark/ppd_transform.q.out 7ec5d8d ql/src/test/results/clientpositive/spark/spark_multi_insert_split_work.q.out PRE-CREATION 
ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out 2b4a331 ql/src/test/results/clientpositive/spark/union18.q.out f94fa0b ql/src/test/results/clientpositive/spark/union19.q.out 8dcb543 ql/src/test/results/clientpositive/spark/union_remove_6.q.out 6730010 ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out 909378b Diff: https://reviews.apache.org/r/26706/diff/ Testing --- Thanks, Chao Sun
Review Request 26884: HIVE-8496 - Re-enable statistics [Spark Branch]
/clientpositive/spark/union23.q.out 22aa965 ql/src/test/results/clientpositive/spark/union25.q.out bad0e5c ql/src/test/results/clientpositive/spark/union28.q.out 1478976 ql/src/test/results/clientpositive/spark/union3.q.out 8a7954b ql/src/test/results/clientpositive/spark/union30.q.out a33e999 ql/src/test/results/clientpositive/spark/union33.q.out 840cb4d ql/src/test/results/clientpositive/spark/union4.q.out 78c3979 ql/src/test/results/clientpositive/spark/union5.q.out 9717853 ql/src/test/results/clientpositive/spark/union6.q.out eb42a40 ql/src/test/results/clientpositive/spark/union7.q.out 8606278 ql/src/test/results/clientpositive/spark/union9.q.out 9db0539 ql/src/test/results/clientpositive/spark/union_ppr.q.out 15dec39 ql/src/test/results/clientpositive/spark/union_remove_1.q.out 0d0ec26 ql/src/test/results/clientpositive/spark/union_remove_10.q.out e03c2d9 ql/src/test/results/clientpositive/spark/union_remove_15.q.out ab98518 ql/src/test/results/clientpositive/spark/union_remove_16.q.out 90cb97c ql/src/test/results/clientpositive/spark/union_remove_18.q.out 83fab64 ql/src/test/results/clientpositive/spark/union_remove_19.q.out 07e1cc3 ql/src/test/results/clientpositive/spark/union_remove_2.q.out 00dd51e ql/src/test/results/clientpositive/spark/union_remove_20.q.out 9140453 ql/src/test/results/clientpositive/spark/union_remove_21.q.out b921b1a ql/src/test/results/clientpositive/spark/union_remove_24.q.out 7d54e78 ql/src/test/results/clientpositive/spark/union_remove_25.q.out d8292aa ql/src/test/results/clientpositive/spark/union_remove_4.q.out db816e4 ql/src/test/results/clientpositive/spark/union_remove_5.q.out 7c85791 ql/src/test/results/clientpositive/spark/union_remove_6.q.out 6730010 ql/src/test/results/clientpositive/spark/union_remove_7.q.out ed30b09 ql/src/test/results/clientpositive/spark/union_remove_8.q.out 16f15f4 ql/src/test/results/clientpositive/spark/union_remove_9.q.out 4a33436 ql/src/test/results/clientpositive/spark/vector_cast_constant.q.out 
f30c803 ql/src/test/results/clientpositive/spark/vector_data_types.q.out d21c68f ql/src/test/results/clientpositive/spark/vector_decimal_aggregate.q.out 99606d9 ql/src/test/results/clientpositive/spark/vector_left_outer_join.q.out 8c28349 ql/src/test/results/clientpositive/spark/vectorization_14.q.out f1e4916 ql/src/test/results/clientpositive/spark/vectorization_15.q.out 3eb3722 ql/src/test/results/clientpositive/spark/vectorization_9.q.out 21434d4 ql/src/test/results/clientpositive/spark/vectorization_part_project.q.out c6458ec ql/src/test/results/clientpositive/spark/vectorized_mapjoin.q.out e8751a6 ql/src/test/results/clientpositive/spark/vectorized_nested_mapjoin.q.out d163d42 ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out 909378b ql/src/test/results/clientpositive/spark/vectorized_shufflejoin.q.out e8751a6 ql/src/test/results/clientpositive/spark/vectorized_timestamp_funcs.q.out abf1d86 Diff: https://reviews.apache.org/r/26884/diff/ Testing --- All test results are regenerated. Thanks, Chao Sun
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/26706/
---

(Updated Oct. 17, 2014, 9:22 p.m.)

Review request for hive and Xuefu Zhang.

Changes
---

Included a qfile result for MR mode.

Bugs: HIVE-8436
    https://issues.apache.org/jira/browse/HIVE-8436

Repository: hive-git

Description
---

Based on the design doc, we need to split the operator tree of a work in SparkWork when the work is connected to multiple child works. The splitting is done by cloning the original work and removing the unwanted branches from the operator tree; please refer to the design doc for details. This process should happen right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. The process should also keep information about the original work and its clones, since that information will be needed during SparkPlan generation (HIVE-8437).

Diffs (updated)
---

  itests/src/test/resources/testconfiguration.properties 558dd02
  ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0
  ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344
  ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java
a62643a ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 5318a8b ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results/clientpositive/spark/multi_insert.q.out bae325f ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 280a893 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out b07c582 ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out fd477ca ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out 44991e3 ql/src/test/results/clientpositive/spark/ppd_multi_insert.q.out 96f2c06 ql/src/test/results/clientpositive/spark/ppd_transform.q.out 7ec5d8d ql/src/test/results/clientpositive/spark/spark_multi_insert_split_work.q.out PRE-CREATION 
ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out 2b4a331 ql/src/test/results/clientpositive/spark/union18.q.out f94fa0b ql/src/test/results/clientpositive/spark/union19.q.out 8dcb543 ql/src/test/results/clientpositive/spark/union_remove_6.q.out 6730010 ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out 909378b ql/src/test/results/clientpositive/spark_multi_insert_split_work.q.out PRE-CREATION Diff: https://reviews.apache.org/r/26706/diff/ Testing --- Thanks, Chao Sun
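The clone-and-prune splitting described in the review can be sketched in a simplified, self-contained form. The sketch below is a hypothetical illustration in plain Java, not the actual Hive `BaseWork`/`GenSparkUtils` API: an operator tree whose root fans out into multiple branches is "cloned" once per branch, and each clone keeps the shared root plus exactly one branch (so each clone ends in a single FileSink).

```java
import java.util.*;

public class SplitWorkSketch {
    // Clone the tree once per branch under the root: each "clone" keeps the
    // shared root plus one child branch, mirroring the clone-and-prune
    // splitting described in the review. Names are illustrative only.
    static List<List<String>> splitBranches(Map<String, List<String>> tree, String root) {
        List<List<String>> clones = new ArrayList<>();
        for (String child : tree.getOrDefault(root, Collections.<String>emptyList())) {
            List<String> clone = new ArrayList<>();
            clone.add(root);
            collect(tree, child, clone);
            clones.add(clone);
        }
        return clones;
    }

    // Depth-first walk that copies one branch into the clone.
    static void collect(Map<String, List<String>> tree, String node, List<String> out) {
        out.add(node);
        for (String c : tree.getOrDefault(node, Collections.<String>emptyList())) {
            collect(tree, c, out);
        }
    }

    public static void main(String[] args) {
        // A TableScan fans out into two branches, each ending in its own FileSink.
        Map<String, List<String>> tree = new HashMap<>();
        tree.put("TS", Arrays.asList("FIL1", "FIL2"));
        tree.put("FIL1", Arrays.asList("FS1"));
        tree.put("FIL2", Arrays.asList("FS2"));
        System.out.println(splitBranches(tree, "TS"));
        // prints [[TS, FIL1, FS1], [TS, FIL2, FS2]]
    }
}
```

In the real patch the clones must also remember which original work they came from, since SparkPlan generation (HIVE-8437) needs that mapping.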
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26706/ --- (Updated Oct. 17, 2014, 9:24 p.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-8436 https://issues.apache.org/jira/browse/HIVE-8436 Repository: hive-git Description --- Based on the design doc, we need to split the operator tree of a work in SparkWork if the work is connected to multiple child works. The splitting is performed by cloning the original work and removing the unwanted branches from the operator tree. Please refer to the design doc for details. This process should be done right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. This process should also keep the information about the original work and its clones. Such information will be needed during SparkPlan generation (HIVE-8437). Diffs - itests/src/test/resources/testconfiguration.properties 558dd02 ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a 
ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/queries/clientpositive/spark_multi_insert_split_work.q PRE-CREATION ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 5318a8b ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results/clientpositive/spark/multi_insert.q.out bae325f ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 280a893 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out b07c582 ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out fd477ca ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out 44991e3 ql/src/test/results/clientpositive/spark/ppd_multi_insert.q.out 96f2c06 ql/src/test/results/clientpositive/spark/ppd_transform.q.out 7ec5d8d ql/src/test/results/clientpositive/spark/spark_multi_insert_split_work.q.out PRE-CREATION 
ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out 2b4a331 ql/src/test/results/clientpositive/spark/union18.q.out f94fa0b ql/src/test/results/clientpositive/spark/union19.q.out 8dcb543 ql/src/test/results/clientpositive/spark/union_remove_6.q.out 6730010 ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out 909378b ql/src/test/results/clientpositive/spark_multi_insert_split_work.q.out PRE-CREATION Diff: https://reviews.apache.org/r/26706/diff/ Testing (updated) --- All multi-insertion-related results were regenerated and manually checked against the old results. I also created a new test, spark_multi_insert_split_work.q, to check that splitting won't generate duplicate FileSinks (FSs). Thanks, Chao Sun
Re: Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26706/ --- (Updated Oct. 16, 2014, 1:25 a.m.) Review request for hive and Xuefu Zhang. Changes --- Addressing the comments. Also, I'm thinking about adding another test for multi-insert in another JIRA, specifically to check whether the plan after splitting has the correct shape. Bugs: HIVE-8436 https://issues.apache.org/jira/browse/HIVE-8436 Repository: hive-git Description --- Based on the design doc, we need to split the operator tree of a work in SparkWork if the work is connected to multiple child works. The splitting is performed by cloning the original work and removing the unwanted branches from the operator tree. Please refer to the design doc for details. This process should be done right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. This process should also keep the information about the original work and its clones. Such information will be needed during SparkPlan generation (HIVE-8437). 
Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/results/clientpositive/spark/groupby7_map.q.out 2d99a81 ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out ca73985 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out 2d2c55b ql/src/test/results/clientpositive/spark/groupby_cube1.q.out 942cdaa ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 399fe41 ql/src/test/results/clientpositive/spark/groupby_position.q.out 5e68807 ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4259412 ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out e0e882e ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out a43921e ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 1f3b484 ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d 
ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 5318a8b ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results/clientpositive/spark/multi_insert.q.out bae325f ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 280a893 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out b07c582 ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out fd477ca ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out 44991e3 ql/src/test/results/clientpositive/spark/ppd_multi_insert.q.out 96f2c06 ql/src/test/results/clientpositive/spark/ppd_transform.q.out 7ec5d8d ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out 2b4a331 ql/src/test/results/clientpositive/spark/union18.q.out f94fa0b ql/src/test/results/clientpositive/spark/union19.q.out 8dcb543 ql/src/test/results/clientpositive/spark/union_remove_6.q.out 6730010 ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out 909378b Diff: https://reviews.apache.org/r/26706/diff/ Testing --- Thanks, Chao Sun
Review Request 26706: HIVE-8436 - Modify SparkWork to split works with multiple child works [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26706/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-8436 https://issues.apache.org/jira/browse/HIVE-8436 Repository: hive-git Description --- Based on the design doc, we need to split the operator tree of a work in SparkWork if the work is connected to multiple child works. The splitting is performed by cloning the original work and removing the unwanted branches from the operator tree. Please refer to the design doc for details. This process should be done right before we generate the SparkPlan. We should have a utility method that takes the original SparkWork and returns a modified SparkWork. This process should also keep the information about the original work and its clones. Such information will be needed during SparkPlan generation (HIVE-8437). Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java 7d9feac ql/src/java/org/apache/hadoop/hive/ql/exec/spark/HiveReduceFunction.java 5153885 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 3fd37a0 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 126cb9f ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkProcContext.java d7744e9 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkUtils.java 280edde ql/src/java/org/apache/hadoop/hive/ql/parse/spark/GenSparkWork.java ac94ea0 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkCompiler.java 644c681 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMergeTaskProcessor.java 1d01040 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkMultiInsertionProcessor.java 93940bc ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkProcessAnalyzeTable.java 20eb344 ql/src/java/org/apache/hadoop/hive/ql/parse/spark/SparkTableScanProcessor.java a62643a ql/src/java/org/apache/hadoop/hive/ql/plan/BaseWork.java 05be1f1 ql/src/test/results/clientpositive/spark/groupby7_map.q.out 95d7b59 
ql/src/test/results/clientpositive/spark/groupby7_map_skew.q.out b425c67 ql/src/test/results/clientpositive/spark/groupby7_noskew.q.out dc713b3 ql/src/test/results/clientpositive/spark/groupby_cube1.q.out cd8e85e ql/src/test/results/clientpositive/spark/groupby_multi_single_reducer.q.out 801ac8a ql/src/test/results/clientpositive/spark/groupby_position.q.out b04e55c ql/src/test/results/clientpositive/spark/groupby_rollup1.q.out 4bde6ea ql/src/test/results/clientpositive/spark/groupby_sort_1_23.q.out ab2fe84 ql/src/test/results/clientpositive/spark/groupby_sort_skew_1_23.q.out 5c1cbc4 ql/src/test/results/clientpositive/spark/input12.q.out 4b0cf44 ql/src/test/results/clientpositive/spark/input13.q.out 260a65a ql/src/test/results/clientpositive/spark/input1_limit.q.out 90bc8ea ql/src/test/results/clientpositive/spark/input_part2.q.out f2f3a2d ql/src/test/results/clientpositive/spark/insert1.q.out 65032cb ql/src/test/results/clientpositive/spark/insert_into3.q.out 7964802 ql/src/test/results/clientpositive/spark/load_dyn_part1.q.out 3b669fc ql/src/test/results/clientpositive/spark/load_dyn_part8.q.out 50c052d ql/src/test/results/clientpositive/spark/multi_insert.q.out 31ebbeb ql/src/test/results/clientpositive/spark/multi_insert_gby3.q.out 0a983d8 ql/src/test/results/clientpositive/spark/multi_insert_lateral_view.q.out 68b1312 ql/src/test/results/clientpositive/spark/multi_insert_move_tasks_share_dependencies.q.out f7867ac ql/src/test/results/clientpositive/spark/multigroupby_singlemr.q.out dbb78a6 ql/src/test/results/clientpositive/spark/orc_analyze.q.out a0af7ba ql/src/test/results/clientpositive/spark/parallel.q.out acd418f ql/src/test/results/clientpositive/spark/ppd_multi_insert.q.out 169d2f1 ql/src/test/results/clientpositive/spark/ppd_transform.q.out 54b8a8a ql/src/test/results/clientpositive/spark/subquery_multiinsert.q.out 6f8066d ql/src/test/results/clientpositive/spark/union18.q.out 07ea2c5 ql/src/test/results/clientpositive/spark/union19.q.out 2fefe8e 
ql/src/test/results/clientpositive/spark/union_remove_6.q.out 147f1fe ql/src/test/results/clientpositive/spark/vectorized_ptf.q.out e12943c Diff: https://reviews.apache.org/r/26706/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 26569: HIVE-8276 - Separate shuffle from ReduceTran and so create ShuffleTran [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26569/ --- (Updated Oct. 11, 2014, 3:01 p.m.) Review request for hive and Xuefu Zhang. Changes --- Addressing comments. Thanks Xuefu! Bugs: HIVE-8276 https://issues.apache.org/jira/browse/HIVE-8276 Repository: hive-git Description --- Currently ReduceTran captures both shuffle and reduce-side processing. Per HIVE-8118, the output RDD from the shuffle sometimes needs to be cached for better performance. Thus, it makes sense to separate the shuffle from ReduceTran and create a ShuffleTran class. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/IdentityTran.java 6c3cf2f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 0732e06 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java e62527c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 52ac724 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 8e251df ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java e770158 Diff: https://reviews.apache.org/r/26569/diff/ Testing --- Thanks, Chao Sun
Review Request 26569: HIVE-8276 - Separate shuffle from ReduceTran and so create ShuffleTran [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26569/ --- Review request for hive and Xuefu Zhang. Bugs: HIVE-8276 https://issues.apache.org/jira/browse/HIVE-8276 Repository: hive-git Description --- Currently ReduceTran captures both shuffle and reduce-side processing. Per HIVE-8118, the output RDD from the shuffle sometimes needs to be cached for better performance. Thus, it makes sense to separate the shuffle from ReduceTran and create a ShuffleTran class. Diffs - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/IdentityTran.java 6c3cf2f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 0732e06 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java e62527c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 52ac724 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 8e251df ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java e770158 Diff: https://reviews.apache.org/r/26569/diff/ Testing --- Thanks, Chao Sun
Re: Review Request 26569: HIVE-8276 - Separate shuffle from ReduceTran and so create ShuffleTran [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26569/ --- (Updated Oct. 10, 2014, 7:18 p.m.) Review request for hive and Xuefu Zhang. Changes --- Added a cache flag to ShuffleTran. Bugs: HIVE-8276 https://issues.apache.org/jira/browse/HIVE-8276 Repository: hive-git Description --- Currently ReduceTran captures both shuffle and reduce-side processing. Per HIVE-8118, the output RDD from the shuffle sometimes needs to be cached for better performance. Thus, it makes sense to separate the shuffle from ReduceTran and create a ShuffleTran class. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/IdentityTran.java 6c3cf2f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 0732e06 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java e62527c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 52ac724 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 8e251df ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java e770158 Diff: https://reviews.apache.org/r/26569/diff/ Testing --- Thanks, Chao Sun
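The separation with a cache flag can be sketched as follows. This is a minimal, hypothetical model in plain Java (not the actual Hive `SparkTran`/`ShuffleTran` code, which operates on Spark RDDs): the shuffle becomes its own transformation whose output can optionally be cached and reused, instead of being folded into the reduce-side processing.

```java
import java.util.*;

public class ShuffleTranSketch {
    // Simplified stand-in for the SparkTran interface; the real one
    // transforms RDDs, here we just transform lists of integers.
    interface Tran { List<Integer> transform(List<Integer> input); }

    // Shuffle as its own stage, with an optional cache flag (per HIVE-8118).
    static class ShuffleTran implements Tran {
        private final boolean toCache;
        private List<Integer> cached;
        ShuffleTran(boolean toCache) { this.toCache = toCache; }
        public List<Integer> transform(List<Integer> input) {
            if (cached != null) return cached;   // reuse the cached shuffle output
            List<Integer> out = new ArrayList<>(input);
            Collections.sort(out);               // stand-in for the actual shuffle
            if (toCache) cached = out;
            return out;
        }
    }

    // Reduce-side processing, now free of any shuffle logic.
    static class ReduceTran implements Tran {
        public List<Integer> transform(List<Integer> input) {
            int sum = 0;
            for (int v : input) sum += v;        // stand-in for the reducer
            return Collections.singletonList(sum);
        }
    }

    public static void main(String[] args) {
        Tran shuffle = new ShuffleTran(true);
        Tran reduce = new ReduceTran();
        List<Integer> shuffled = shuffle.transform(Arrays.asList(3, 1, 2));
        System.out.println(reduce.transform(shuffled)); // prints [6]
    }
}
```

With the two stages decoupled, a plan generator can insert the shuffle once, flip its cache flag when HIVE-8118 calls for it, and feed the (possibly cached) output to one or more downstream reduce stages.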
Re: Review Request 26569: HIVE-8276 - Separate shuffle from ReduceTran and so create ShuffleTran [Spark Branch]
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/26569/ --- (Updated Oct. 11, 2014, 12:31 a.m.) Review request for hive and Xuefu Zhang. Bugs: HIVE-8276 https://issues.apache.org/jira/browse/HIVE-8276 Repository: hive-git Description --- Currently ReduceTran captures both shuffle and reduce-side processing. Per HIVE-8118, the output RDD from the shuffle sometimes needs to be cached for better performance. Thus, it makes sense to separate the shuffle from ReduceTran and create a ShuffleTran class. Diffs (updated) - ql/src/java/org/apache/hadoop/hive/ql/exec/spark/IdentityTran.java 6c3cf2f ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapInput.java 0732e06 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/MapTran.java e62527c ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ReduceTran.java 52ac724 ql/src/java/org/apache/hadoop/hive/ql/exec/spark/ShuffleTran.java PRE-CREATION ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkPlanGenerator.java 8e251df ql/src/java/org/apache/hadoop/hive/ql/exec/spark/SparkTran.java e770158 Diff: https://reviews.apache.org/r/26569/diff/ Testing --- Thanks, Chao Sun