[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-08-14 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
ping @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-08-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
ping @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-07-22 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
ping @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-07-05 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
ping @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-06-13 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
ping @rednaxelafx


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-30 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@rxin What do you think about this data?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-25 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@rxin @cloud-fan I am very sorry for preparing performance data since I was 
busy these weeks. 
I confirmed that `MemoryBlock` approach is faster for OnHeap and is equal 
to for OffHeap compared to `Platform` approach.
I ran [this benchmark 
program](https://gist.github.com/kiszk/3d0db455166a7e2835ffdaf15c633f29) that 
includes potentially polymorphic call for `MemoryBlock` like
```
def getMB(g: Int): MemoryBlock = {
  if (g < mIters) {
byteArrayMB
  } else if (g < mIters * 2) {
onHeapMB
  } else if (g < mIters * 3) {
offHeapMB
...
  } else {
null
  }
}

benchmark.addCase("ByteArrayMemoryBlock getInt()") { t: Int =>
  val mb = getMB(g)
  if (t >= 0) g += 1
  var sum = 0
  for (_ <- 0L until iters) {
var i = 0
while (i < N / 4) {
  sum += mb.getInt(i * 4)
  i += 1
}
  }
}

```

```
OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 
4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
Memory access benchmarks:Best/Avg Time(ms)Rate(M/s)   
Per Row(ns)   Relative


Platform getInt(Object byte[])1348 / 1355199.1  
 5.0   1.0X
ByteArrayMemoryBlock getInt()  375 /  375716.6  
 1.4   3.6X
OnHeapMemoryBlock getInt() 374 /  376717.1  
 1.4   3.6X
Platform getInt(Object offHeap)327 /  328820.0  
 1.2   4.1X
OffHeapMemoryBlock get()   325 /  403827.1  
 1.2   4.2X
Platform putInt(Object byte[]) 848 /  852316.5  
 3.2   1.6X
ByteArrayMemoryBlock putInt()  571 /  571470.4  
 2.1   2.4X
OnHeapMemoryBlock putInt() 575 /  575467.2  
 2.1   2.3X
Platform putInt(Object offHeap)582 /  584461.0  
 2.2   2.3X
OffHeapMemoryBlock putInt()583 /  584460.1  
 2.2   2.3X
```

WDYT?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-23 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
I think that one memory block in each iteration is more representative with 
having possibility of megamorphism. This is because in the typicalusages in 
Spark, a data structure is actually dominated by one of memory types.
For example, `UTF8String` uses only `ByteArrayMemoryBlock` while 
`OnHeapMemoryBlock` and `ByteArrayMemoryBlock` are loaded
In the future, I think that we will use only one of three Memoryblocks for 
`UnsafeRow` depends on the setting in `SparkConf`. We will not use 
`OffHeapMemoryBlock` for some of `UnsafeRow` and `OnHeapMemoryBlock` for the 
rest of `UnsafeRow`.

I think that current concern is whether there is performance degradation at 
possible megamorphic call sites when three MemoryBlock are loaded.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-22 Thread cloud-fan
Github user cloud-fan commented on the issue:

https://github.com/apache/spark/pull/19222
  
Instead of round-robin the memory block type across iterations, can we just 
operate on all the memory blocks in each iteration? Then we can remove the 
`if-else` and make the benchmark focus more on the memory block?

As a comparison, we can create a byte array, a long array, an offheap 
array, and also operate on them in each iteration.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-20 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@rxin Sorry, I do not have more data now. I will work for this next week.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-20 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
@kiszk do you have more data now?



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
OK thanks please do that. Does TPC-DS even trigger 2 call sites? E.g. 
ByteArrayMemoryBlock and OnHeapMemoryBlock. Even there it might introduce a 
conditional branch after JIT that could lead to perf degradation.

I also really worry about off-heap mode, in which all three callsites can 
exist and lead to massive degradation.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-18 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@rxin While I did not perf microbench for megamorphic (up to 3 
`ByteArrayMemoryBlock`, `OnHeapMemoryBlock`, and `OffHeapMemoryBlock`) 
callsites, we confirmed that there is no performance regression in TPC-DS. It 
seem to be representative regarding real workloads.

I will create and perf microbench for megamorphic (up to 3 
`ByteArrayMemoryBlock`, `OnHeapMemoryBlock`, and `OffHeapMemoryBlock`) 
callsites.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-18 Thread rxin
Github user rxin commented on the issue:

https://github.com/apache/spark/pull/19222
  
Sorry this thread is too long for me to follow. I might be bringing up a 
point that has been brought up before.

@kiszk did your perf tests take into account megamorphic callsites? It 
seems to me from a quick cursory look the benchmark result might not be 
accurate for real workloads if there are only one implementations of the 
MemoryBlock loaded.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...

2018-04-06 Thread kiszk
Github user kiszk commented on the issue:

https://github.com/apache/spark/pull/19222
  
@cloud-fan yeah, good point. I created a new JIRA entry which has sub-tasks 
for the following works.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org