[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-13 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15923492#comment-15923492
 ] 

liyunzhang_intel commented on HIVE-14919:
-

[~lirui]: {quote}
One thing I noted is the Xms flag was removed from the executor's options via 
SPARK-12384. We may want to set it the same as Xmx to achieve better 
performance.
{quote}

I don't quite understand this point: Spark now does not allow -Xmx in the
executor's Java options to specify the max heap size; only
{{spark.executor.memory}} can be used for that. See
org.apache.spark.SparkConf#validateSettings:
{code}
// Validate spark.executor.extraJavaOptions
getOption(executorOptsKey).foreach { javaOpts =>
  if (javaOpts.contains("-Dspark")) {
    val msg = s"$executorOptsKey is not allowed to set Spark options (was '$javaOpts'). " +
      "Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit."
    throw new Exception(msg)
  }
  if (javaOpts.contains("-Xmx")) {
    val msg = s"$executorOptsKey is not allowed to specify max heap memory settings " +
      s"(was '$javaOpts'). Use spark.executor.memory instead."
    throw new Exception(msg)
  }
}
{code}
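
To make the effect of these checks concrete, here is a minimal stand-alone sketch that mirrors the two substring tests quoted above. The names ExecutorOptsCheck and validate are hypothetical and exist only for illustration; they are not part of Spark. The sketch returns a value instead of throwing, which is an assumption made to keep it easy to try out.

```scala
// Illustrative re-statement of the two substring checks quoted above.
// ExecutorOptsCheck and validate are hypothetical names, not Spark APIs.
object ExecutorOptsCheck {
  /** Left(reason) if the options would be rejected, Right(opts) otherwise. */
  def validate(javaOpts: String): Either[String, String] =
    if (javaOpts.contains("-Dspark")) {
      Left(s"not allowed to set Spark options (was '$javaOpts')")
    } else if (javaOpts.contains("-Xmx")) {
      Left(s"not allowed to specify max heap memory settings (was '$javaOpts')")
    } else {
      // Anything else, including -Xms, passes these two checks.
      Right(javaOpts)
    }
}
```

Under this sketch, ExecutorOptsCheck.validate("-Xms4g") is a Right, while validate("-Xmx4g") is a Left, which suggests that only the max heap flag is blocked by the quoted validation.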

> Improve the performance of Hive on Spark 2.0.0
> --
>
> Key: HIVE-14919
> URL: https://issues.apache.org/jira/browse/HIVE-14919
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ferdinand Xu
>Assignee: Ferdinand Xu
>
> In HIVE-14029, we updated the Spark dependency to 2.0.0. We used Intel
> BigBench[1] to benchmark Spark 2.0 against Spark 1.6 over a 1 TB data set.
> We see performance improvements of about 5.4% in general and 45% in the
> best case. However, some queries don't show significant performance
> improvements. This JIRA is the umbrella ticket for addressing those
> performance issues.
> [1] https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927700#comment-15927700
 ] 

Rui Li commented on HIVE-14919:
---

I mean we can set Xms, not Xmx.



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-16 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929407#comment-15929407
 ] 

liyunzhang_intel commented on HIVE-14919:
-

[~lirui]: I guess you mean setting spark.executor.extraJavaOptions in
conf/spark-defaults.conf as follows:
{code}
spark.executor.extraJavaOptions -Xms${the value of spark.executor.memory}

{code}

Is my understanding right?
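
The same idea can be sketched programmatically. The snippet below is only an illustration under stated assumptions: ExecutorHeap and withMatchingXms are hypothetical names, the configuration is modelled as a plain Map rather than a SparkConf, and the value of spark.executor.memory (for example "4g") is assumed to be in a form the JVM also accepts after -Xms.

```scala
// Sketch only: ExecutorHeap and withMatchingXms are hypothetical names,
// and a plain Map stands in for the real Spark configuration.
object ExecutorHeap {
  private val OptsKey = "spark.executor.extraJavaOptions"
  private val MemKey = "spark.executor.memory"

  /** Appends -Xms<executor memory> to the executor's extra Java options. */
  def withMatchingXms(conf: Map[String, String]): Map[String, String] = {
    val existing = conf.getOrElse(OptsKey, "")
    conf.get(MemKey) match {
      // Only add the flag if executor memory is set and -Xms is not already present.
      case Some(mem) if !existing.contains("-Xms") =>
        conf.updated(OptsKey, s"$existing -Xms$mem".trim)
      case _ => conf
    }
  }
}
```

Existing options are kept and the -Xms flag is appended, so other JVM flags in spark.executor.extraJavaOptions are not lost.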



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-16 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929448#comment-15929448
 ] 

Rui Li commented on HIVE-14919:
---

Yeah that's what I meant.



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-20 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933440#comment-15933440
 ] 

Sahil Takiar commented on HIVE-14919:
-

[~lirui], [~Ferd] would it make sense to add HoS integration with Spark's 
DataFrame / DataSets API? From the Spark docs it sounds like the DataFrame / 
DataSets API could improve performance since the APIs require specifying column 
types.



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-20 Thread liyunzhang_intel (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934145#comment-15934145
 ] 

liyunzhang_intel commented on HIVE-14919:
-

[~stakiar]: I think integrating HoS with Spark's DataFrame/DataSets API is a
good idea. My question is: which cases would benefit from it? Tables that
contain complex column types?



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-21 Thread Sahil Takiar (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934965#comment-15934965
 ] 

Sahil Takiar commented on HIVE-14919:
-

[~kellyzly] my guess is that any Hive table should benefit from using the
DataFrames API. I'm not a Spark expert, but I believe RDDs are just
distributed collections of objects, and those objects don't have a defined
schema. A DataFrame is similar to a table in a database: it has a set of named
columns. So naturally I would think Hive fits better into the DataFrames model,
since it works with tables that have a set of pre-defined columns.

According to some blog posts on the DataFrames API, it has a number of 
performance optimizations built in due to the fact that column types are known. 
These optimizations were not possible with RDDs because RDDs don't have a 
schema.

From a Databricks blog post:

{quote}
It can also perform lower level optimizations such as eliminating expensive 
object allocations and reducing virtual function calls. As a result, we expect 
performance improvements for existing Spark programs when they migrate to 
DataFrames.
{quote}

Sources:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2017-03-21 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15935686#comment-15935686
 ] 

Rui Li commented on HIVE-14919:
---

[~stakiar], [~kellyzly] we're using RDD with type like . That's in line with MR (probably Tez too). And I think that's 
what Hive operators and SerDe expect.
I don't know much about the DataFrame/DataSet API, but from the discussion 
above, it seems to require a lot of work. [~xuefuz] could you share your 
thoughts?



[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2016-11-06 Thread Rui Li (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15643352#comment-15643352
 ] 

Rui Li commented on HIVE-14919:
---

One thing I noted is the Xms flag was removed from the executor's options via 
SPARK-12384. We may want to set it the same as Xmx to achieve better 
performance.

> Improve the performance of Hive on Spark 2.0.0
> --
>
> Key: HIVE-14919
> URL: https://issues.apache.org/jira/browse/HIVE-14919
> Project: Hive
>  Issue Type: Improvement
>Reporter: Ferdinand Xu
>Assignee: Ferdinand Xu
> Attachments: benchmark.xlsx
>
>
> In HIVE-14029, we updated the Spark dependency to 2.0.0. We used Intel
> BigBench[1] to benchmark Spark 2.0 against Spark 1.6 over a 10 GB data set.
> We see considerable performance degradation for most of the BigBench
> queries. For detailed information, please see the attached file. This JIRA
> is the umbrella ticket for addressing those performance issues.
> [1] https://github.com/intel-hadoop/Big-Data-Benchmark-for-Big-Bench



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (HIVE-14919) Improve the performance of Hive on Spark 2.0.0

2016-10-09 Thread Ferdinand Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/HIVE-14919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15561385#comment-15561385
 ] 

Ferdinand Xu commented on HIVE-14919:
-

cc [~kellyzly] [~dapengsun]
