[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923492#comment-15923492 ]

liyunzhang_intel commented on HIVE-14919:
-----------------------------------------

[~lirui]:
{quote}
One thing I noted is the Xms flag was removed from the executor's options via SPARK-12384. We may want to set it the same as Xmx to achieve better performance.
{quote}
I don't quite understand this point, because Spark now does not allow Xmx to be used to specify the maximum heap size; only {{spark.executor.memory}} can be used. See org.apache.spark.SparkConf#validateSettings:
{code}
// Validate spark.executor.extraJavaOptions
getOption(executorOptsKey).foreach { javaOpts =>
  if (javaOpts.contains("-Dspark")) {
    val msg = s"$executorOptsKey is not allowed to set Spark options (was '$javaOpts'). " +
      "Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit."
    throw new Exception(msg)
  }
  if (javaOpts.contains("-Xmx")) {
    val msg = s"$executorOptsKey is not allowed to specify max heap memory settings " +
      s"(was '$javaOpts'). Use spark.executor.memory instead."
    throw new Exception(msg)
  }
}
{code}

> Improve the performance of Hive on Spark 2.0.0
> ----------------------------------------------
>
>                 Key: HIVE-14919
>                 URL: https://issues.apache.org/jira/browse/HIVE-14919
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ferdinand Xu
>            Assignee: Ferdinand Xu
>
> In HIVE-14029, we updated the Spark dependency to 2.0.0. We used Intel
> BigBench[1] to run a benchmark with Spark 2.0 over a 1 TB data set, comparing
> against Spark 1.6. We see performance improvements of about 5.4% in general
> and 45% in the best case. However, some queries don't show significant
> performance improvements. This JIRA is the umbrella ticket addressing those
> performance issues.
> [1] https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
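
Editorial note: the check quoted from SparkConf#validateSettings in the comment above rejects executor options only when they contain "-Dspark" or "-Xmx". The following is a minimal standalone sketch of those two substring checks, not Spark's actual code path: the object name {{ExecutorOptsCheck}}, the {{validate}} helper, the Either return type, and the sample values such as "4g" are assumptions made for illustration. It suggests that an option such as -Xms would pass this particular check.

```scala
// Standalone sketch that mirrors the two substring checks quoted from
// SparkConf#validateSettings. It does not call Spark; names are hypothetical.
object ExecutorOptsCheck {
  // Configuration key discussed in the thread.
  val executorOptsKey = "spark.executor.extraJavaOptions"

  // Returns Left(reason) when the options would be rejected by the quoted
  // checks, Right(javaOpts) otherwise.
  def validate(javaOpts: String): Either[String, String] =
    if (javaOpts.contains("-Dspark"))
      Left(s"$executorOptsKey is not allowed to set Spark options (was '$javaOpts').")
    else if (javaOpts.contains("-Xmx"))
      Left(s"$executorOptsKey is not allowed to specify max heap memory settings (was '$javaOpts').")
    else
      Right(javaOpts)

  def main(args: Array[String]): Unit = {
    // "4g" is an illustrative heap size, not a value taken from the thread.
    Seq("-Xms4g", "-Xmx4g", "-Dspark.foo=bar").foreach { opts =>
      println(s"$opts -> ${validate(opts)}")
    }
  }
}
```

Returning an Either instead of throwing, as the quoted code does, is only a convenience so that several inputs can be tried side by side.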
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927700#comment-15927700 ]

Rui Li commented on HIVE-14919:
-------------------------------

I mean we can set Xms, not Xmx.
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929407#comment-15929407 ]

liyunzhang_intel commented on HIVE-14919:
-----------------------------------------

[~lirui]: I guess you mean setting spark.executor.extraJavaOptions in conf/spark-defaults.conf as follows:
{code}
spark.executor.extraJavaOptions -Xms${the value of spark.executor.memory}
{code}
Is my understanding right?
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929448#comment-15929448 ]

Rui Li commented on HIVE-14919:
-------------------------------

Yeah that's what I meant.
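
Editorial note: as a rough illustration of the setting confirmed above, the sketch below derives an -Xms option from an executor memory value and renders the two matching conf/spark-defaults.conf lines. It treats the ${...} in the earlier comment as a placeholder and writes the value out literally. The object {{XmsFromExecutorMemory}}, its helpers, and the sample sizes are hypothetical, and the accepted size format (digits followed by k, m, or g) is deliberately narrower than what Spark itself may accept. Whether matching Xms to the executor memory actually helps performance is the thread's hypothesis, not something this sketch measures.

```scala
// Hypothetical helper: builds a -Xms option equal to spark.executor.memory.
object XmsFromExecutorMemory {
  // Accept only simple sizes such as "512m" or "4g" (an assumption of this sketch).
  private val Size = "([0-9]+)([kmgKMG])".r

  // Some("-Xms4g") for "4g"; None when the value is not in the simple form above.
  def xmsOption(executorMemory: String): Option[String] = executorMemory match {
    case Size(n, unit) => Some(s"-Xms$n${unit.toLowerCase}")
    case _             => None
  }

  // The two properties-file lines that keep the initial heap equal to the executor memory.
  def propertiesLines(executorMemory: String): Option[Seq[String]] =
    xmsOption(executorMemory).map { xms =>
      Seq(
        s"spark.executor.memory $executorMemory",
        s"spark.executor.extraJavaOptions $xms"
      )
    }

  def main(args: Array[String]): Unit =
    // "4g" is an illustrative value only.
    propertiesLines("4g").getOrElse(Seq("unsupported memory value")).foreach(println)
}
```

Keeping both lines generated from one value avoids the two settings drifting apart when the executor memory is changed later.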
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933440#comment-15933440 ]

Sahil Takiar commented on HIVE-14919:
-------------------------------------

[~lirui], [~Ferd] would it make sense to add HoS integration with Spark's DataFrame / DataSets API? From the Spark docs it sounds like the DataFrame / DataSets API could improve performance since the APIs require specifying column types.
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934145#comment-15934145 ]

liyunzhang_intel commented on HIVE-14919:
-----------------------------------------

[~stakiar]: I think integrating HoS with Spark's DataFrame/DataSets API is a good idea. My question is: which cases would benefit from it? Tables that contain complex column types?
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934965#comment-15934965 ]

Sahil Takiar commented on HIVE-14919:
-------------------------------------

[~kellyzly] my guess is that any Hive table should benefit from using the DataFrames API. I'm not a Spark expert, but I believe RDDs are just a distributed collection of objects, and those objects don't have a defined schema. A DataFrame is similar to a table in a database: it has a set of named columns. So naturally I would think Hive fits better into the DataFrames model, since it works with tables that have a set of pre-defined columns.

According to some blog posts on the DataFrames API, it has a number of performance optimizations built in because column types are known. These optimizations were not possible with RDDs because RDDs don't have a schema.

From a Databricks blog post:
{quote}
It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
{quote}

Sources:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935686#comment-15935686 ]

Rui Li commented on HIVE-14919:
-------------------------------

[~stakiar], [~kellyzly] we're using RDD with type like . That's in line with MR (probably Tez too). And I think that's what Hive operators and SerDe expect. I don't know much about the DataFrame/DataSet API, but from the discussion above, it seems to require a lot of work. [~xuefuz] could you share your thoughts?
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15643352#comment-15643352 ]

Rui Li commented on HIVE-14919:
-------------------------------

One thing I noted is the Xms flag was removed from the executor's options via SPARK-12384. We may want to set it the same as Xmx to achieve better performance.

> Improve the performance of Hive on Spark 2.0.0
> ----------------------------------------------
>
>                 Key: HIVE-14919
>                 URL: https://issues.apache.org/jira/browse/HIVE-14919
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Ferdinand Xu
>            Assignee: Ferdinand Xu
>         Attachments: benchmark.xlsx
>
> In HIVE-14029, we updated the Spark dependency to 2.0.0. We used Intel
> BigBench[1] to run a benchmark with Spark 2.0 over a 10 GB data set, comparing
> against Spark 1.6. We see considerable performance degradation for most of
> the BigBench queries. For detailed information, please see the attached file.
> This JIRA is the umbrella ticket addressing those performance issues.
> [1] https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0
[ https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15561385#comment-15561385 ]

Ferdinand Xu commented on HIVE-14919:
-------------------------------------

cc [~kellyzly] [~dapengsun]