[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 ping @rednaxelafx --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @rxin What do you think about this data? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @rxin @cloud-fan I am very sorry for preparing performance data since I was busy these weeks. I confirmed that `MemoryBlock` approach is faster for OnHeap and is equal to for OffHeap compared to `Platform` approach. I ran [this benchmark program](https://gist.github.com/kiszk/3d0db455166a7e2835ffdaf15c633f29) that includes potentially polymorphic call for `MemoryBlock` like ``` def getMB(g: Int): MemoryBlock = { if (g < mIters) { byteArrayMB } else if (g < mIters * 2) { onHeapMB } else if (g < mIters * 3) { offHeapMB ... } else { null } } benchmark.addCase("ByteArrayMemoryBlock getInt()") { t: Int => val mb = getMB(g) if (t >= 0) g += 1 var sum = 0 for (_ <- 0L until iters) { var i = 0 while (i < N / 4) { sum += mb.getInt(i * 4) i += 1 } } } ``` ``` OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz Memory access benchmarks:Best/Avg Time(ms)Rate(M/s) Per Row(ns) Relative Platform getInt(Object byte[])1348 / 1355199.1 5.0 1.0X ByteArrayMemoryBlock getInt() 375 / 375716.6 1.4 3.6X OnHeapMemoryBlock getInt() 374 / 376717.1 1.4 3.6X Platform getInt(Object offHeap)327 / 328820.0 1.2 4.1X OffHeapMemoryBlock get() 325 / 403827.1 1.2 4.2X Platform putInt(Object byte[]) 848 / 852316.5 3.2 1.6X ByteArrayMemoryBlock putInt() 571 / 571470.4 2.1 2.4X OnHeapMemoryBlock putInt() 575 / 575467.2 2.1 2.3X Platform putInt(Object offHeap)582 / 584461.0 2.2 2.3X OffHeapMemoryBlock putInt()583 / 584460.1 2.2 2.3X ``` WDYT? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 I think that one memory block in each iteration is more representative with having possibility of megamorphism. This is because in the typicalusages in Spark, a data structure is actually dominated by one of memory types. For example, `UTF8String` uses only `ByteArrayMemoryBlock` while `OnHeapMemoryBlock` and `ByteArrayMemoryBlock` are loaded In the future, I think that we will use only one of three Memoryblocks for `UnsafeRow` depends on the setting in `SparkConf`. We will not use `OffHeapMemoryBlock` for some of `UnsafeRow` and `OnHeapMemoryBlock` for the rest of `UnsafeRow`. I think that current concern is whether there is performance degradation at possible megamorphic call sites when three MemoryBlock are loaded. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/19222 Instead of round-robin the memory block type across iterations, can we just operate on all the memory blocks in each iteration? Then we can remove the `if-else` and make the benchmark focus more on the memory block? As a comparison, we can create a byte array, a long array, an offheap array, and also operate on them in each iteration. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @rxin Sorry, I do not have more data now. I will work for this next week. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 @kiszk do you have more data now? --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 OK thanks please do that. Does TPC-DS even trigger 2 call sites? E.g. ByteArrayMemoryBlock and OnHeapMemoryBlock. Even there it might introduce a conditional branch after JIT that could lead to perf degradation. I also really worry about off-heap mode, in which all three callsites can exist and lead to massive degradation. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @rxin While I did not perf microbench for megamorphic (up to 3 `ByteArrayMemoryBlock`, `OnHeapMemoryBlock`, and `OffHeapMemoryBlock`) callsites, we confirmed that there is no performance regression in TPC-DS. It seem to be representative regarding real workloads. I will create and perf microbench for megamorphic (up to 3 `ByteArrayMemoryBlock`, `OnHeapMemoryBlock`, and `OffHeapMemoryBlock`) callsites. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/19222 Sorry this thread is too long for me to follow. I might be bringing up a point that has been brought up before. @kiszk did your perf tests take into account megamorphic callsites? It seems to me from a quick cursory look the benchmark result might not be accurate for real workloads if there are only one implementations of the MemoryBlock loaded. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #19222: [SPARK-10399][SPARK-23879][CORE][SQL] Introduce multiple...
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/19222 @cloud-fan yeah, good point. I created a new JIRA entry which has sub-tasks for the following works. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org