[GitHub] spark pull request #11209: [SPARK-13325][SQL] Create a 64-bit hashcode expre...

2016-07-31 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/11209#discussion_r72918850
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/HashBenchmark.scala ---
@@ -119,11 +146,12 @@ object HashBenchmark {
   .add("map", mapOfInt)
   .add("mapOfMap", MapType(IntegerType, mapOfInt))
 /*
-Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
 Hash For map:   Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative
 
---
-interpreted version  1820 / 1861  0.0  
444347.2   1.0X
-codegen version   205 /  223  0.0  
 49936.5   8.9X
+interpreted version  1612 / 1618  0.0  
393553.4   1.0X
+codegen version   149 /  150  0.0  
 36381.2  10.8X
+codegen version 64-bit144 /  145  0.0  
 35122.1  11.2X
  */
--- End diff --

This result is because we have removed the `hashCode` implementation from 
`ArrayBasedMapData`, and the system default `hashCode` is really fast(almost 
no-op).

We did this because they have same logic at the time, but it's not true 
now. @gatorsmile do you wanna send a PR to fix this(use interpreted hash 
expression instead of `BaseGenericInternalRow.hashCode`)? thanks!




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11209: [SPARK-13325][SQL] Create a 64-bit hashcode expre...

2016-07-31 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/11209#discussion_r72902008
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/HashBenchmark.scala ---
@@ -119,11 +146,12 @@ object HashBenchmark {
   .add("map", mapOfInt)
   .add("mapOfMap", MapType(IntegerType, mapOfInt))
 /*
-Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
 Hash For map:   Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative
 
---
-interpreted version  1820 / 1861  0.0  
444347.2   1.0X
-codegen version   205 /  223  0.0  
 49936.5   8.9X
+interpreted version  1612 / 1618  0.0  
393553.4   1.0X
+codegen version   149 /  150  0.0  
 36381.2  10.8X
+codegen version 64-bit144 /  145  0.0  
 35122.1  11.2X
  */
--- End diff --

@gatorsmile I get the following results, running the benchmark:
```
[info] Running benchmark: Hash For map
[info]   Running case: interpreted version
[info]   Stopped after 28358 iterations, 2000 ms
[info]   Running case: codegen version
[info]   Stopped after 10 iterations, 2079 ms
[info]   Running case: codegen version 64-bit
[info]   Stopped after 13 iterations, 2133 ms
[info] 
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
[info] Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
[info] 
[info] Hash For map:Best/Avg Time(ms)
Rate(M/s)   Per Row(ns)   Relative
[info] 

[info] interpreted version  0 /0 
61.8  16.2   1.0X
[info] codegen version176 /  208  
0.0   42927.5   0.0X
[info] codegen version 64-bit 154 /  164  
0.0   37711.4   0.0X
```
So the results are similar. 

I expect that there is some JVM cleverness going on here; the JVM might be 
caching the `hashCode()` results for  `BaseGenericInternalRow.hashCode` or is 
really good at predicting what is going on. We added a warmup period to the 
benchmarking logic in order to make the results more reliable (see the lines 
like `Stopped after 28358 iterations, 2000 ms`) and this might trigger this 
behavior.

It is by design that we call `BaseGenericInternalRow.hashCode` (I think the 
result of the `Hash` expression and this method should yield the same result). 
@cloud-fan could you explain why we did this?




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11209: [SPARK-13325][SQL] Create a 64-bit hashcode expre...

2016-07-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11209#discussion_r72897064
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/HashBenchmark.scala ---
@@ -119,11 +146,12 @@ object HashBenchmark {
   .add("map", mapOfInt)
   .add("mapOfMap", MapType(IntegerType, mapOfInt))
 /*
-Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
 Hash For map:   Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative
 
---
-interpreted version  1820 / 1861  0.0  
444347.2   1.0X
-codegen version   205 /  223  0.0  
 49936.5   8.9X
+interpreted version  1612 / 1618  0.0  
393553.4   1.0X
+codegen version   149 /  150  0.0  
 36381.2  10.8X
+codegen version 64-bit144 /  145  0.0  
 35122.1  11.2X
  */
--- End diff --

The interpreted version is actually calling `BaseGenericInternalRow`'s 
`hashCode` function. It is not calling the expected hash function, right?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #11209: [SPARK-13325][SQL] Create a 64-bit hashcode expre...

2016-07-30 Thread gatorsmile
Github user gatorsmile commented on a diff in the pull request:

https://github.com/apache/spark/pull/11209#discussion_r72896977
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/HashBenchmark.scala ---
@@ -119,11 +146,12 @@ object HashBenchmark {
   .add("map", mapOfInt)
   .add("mapOfMap", MapType(IntegerType, mapOfInt))
 /*
-Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+Intel(R) Core(TM) i7-4750HQ CPU @ 2.00GHz
 Hash For map:   Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative
 
---
-interpreted version  1820 / 1861  0.0  
444347.2   1.0X
-codegen version   205 /  223  0.0  
 49936.5   8.9X
+interpreted version  1612 / 1618  0.0  
393553.4   1.0X
+codegen version   149 /  150  0.0  
 36381.2  10.8X
+codegen version 64-bit144 /  145  0.0  
 35122.1  11.2X
  */
--- End diff --

I am seeing totally different results. 
```
Java HotSpot(TM) 64-Bit Server VM 1.7.0_80-b15 on Mac OS X 10.11.6
Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz

Hash For map:Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative


interpreted version  0 /0 85.3  
11.7   1.0X
codegen version215 /  231  0.0  
 52443.1   0.0X
codegen version 64-bit 160 /  173  0.0  
 38973.6   0.0X
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org