[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 @ueshin Thanks for the fix --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user ueshin commented on the issue: https://github.com/apache/spark/pull/16541 I sent a pr #17473.
Github user brkyvz commented on the issue: https://github.com/apache/spark/pull/16541 This PR unfortunately broke Scala 2.10 compilation https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/4110/console
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 thanks, merging to master!
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75263/
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75263 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75263/testReport)** for PR 16541 at commit [`d04e043`](https://github.com/apache/spark/commit/d04e043fcd00204531553cb0a8ac1148d85436f4). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75263 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75263/testReport)** for PR 16541 at commit [`d04e043`](https://github.com/apache/spark/commit/d04e043fcd00204531553cb0a8ac1148d85436f4).
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 retest this please
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75255/
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 LGTM
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/16541 **[Test build #75255 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75255/testReport)** for PR 16541 at commit [`d04e043`](https://github.com/apache/spark/commit/d04e043fcd00204531553cb0a8ac1148d85436f4).
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 ok to test
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Thanks. Made the suggested changes in my latest commit. I also encountered a minor problem when doing final testing. When using a collection type that is a type alias (e.g., scala.List), the companion object's `newBuilder` could not be found. Fixed it by dealiasing the collection type before obtaining the companion object.
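The alias problem can be reproduced with plain Scala runtime reflection. This is a hypothetical illustration of the fix, not the PR's actual code: `scala.List` is itself a type alias for `scala.collection.immutable.List`, so the type has to be dealiased before looking up the companion object.

```scala
import scala.reflect.runtime.universe._

// scala.List is a type alias defined in the scala package object:
//   type List[+A] = scala.collection.immutable.List[A]
val aliased = typeOf[scala.List[Int]]

// Dealiasing resolves the alias to the underlying class, whose
// companion object is the one that actually provides newBuilder.
val dealiased = aliased.dealias
val companion = dealiased.typeSymbol.asClass.companion

println(dealiased.typeSymbol.fullName) // the underlying immutable.List class
```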
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 LGTM except 2 minor comments
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 That seems to be the case here, yes. What about the other benefits I mentioned (adding support for Java `List`s and future Scala 2.13 compatibility)? I think the codegen is also more straightforward/clear (and much shorter).
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16541 I didn't look into the details here, but very often scanning data twice doesn't necessarily slow things down, especially in the case of sequential scan.
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Well, technically yes. But I would say it's a little more than that. The current approach to deserialization of `Seq`s is to copy the data into an array, construct a `WrappedArray` (which extends `Seq`) and optionally copy the data (again) into the new collection. This needs to go through all the elements twice if anything other than a `Seq` (or `WrappedArray` directly) is requested. This PR takes a more straightforward approach and constructs a mutable collection builder (which is defined for every Scala collection), adds the elements to it and retrieves the result. I assumed this would be a performance improvement and am quite surprised that there is no difference. But I think this might be because the improvement is so small that it drowns in the usual operation overhead. Sadly, I do not have the resources to measure the operation on larger amounts of data, as the benchmarks fail on larger collections on my setup. I am also not familiar enough with Spark's internals to determine whether @kiszk's suggestion would improve operations in other ways, e.g. during operations in the cluster environment. That is why I decided to implement it but keep it separate from my proposed changes for the time being. As to the benefits this PR would bring other than performance improvements:
* I would like to implement support for Java `List`s in a manner similar to what I did with `Map`s in #16986 (I would also ask you to take a look at that when you have the time) by just slightly altering the code.
* I didn't know this when I implemented this, but Scala 2.13 will introduce a major [collections rework](https://www.scala-lang.org/blog/2017/02/28/collections-rework.html) that will change the way the `to` method (used for conversions in the current solution) works. This will require a rewrite, whereas I believe the method used here will remain largely the same.
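The single-pass, builder-based idea described above can be sketched in plain Scala. This is a simplified illustration of the technique, not Spark's actual generated code:

```scala
import scala.collection.mutable

// Append elements from the decoded array directly to the target collection's
// builder in a single pass, instead of array -> WrappedArray -> conversion
// (which traverses the elements twice for anything other than a Seq).
def toCollection[A, C](input: Array[A], builder: mutable.Builder[A, C]): C = {
  builder.sizeHint(input.length) // pre-allocate where the collection supports it
  var i = 0
  while (i < input.length) {
    builder += input(i)
    i += 1
  }
  builder.result()
}

// Works for any collection that exposes a builder via its companion object:
val list  = toCollection(Array(1, 2, 3), List.newBuilder[Int])
val queue = toCollection(Array(1, 2, 3), mutable.Queue.newBuilder[Int])
```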
Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16541 Is it a performance improvement? There is no difference in the benchmark results.
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Also please note the [UnsafeArrayData-producing branch](https://github.com/michalsenkyr/spark/compare/dataset-seq-builder...michalsenkyr:dataset-seq-builder-unsafe) that is not yet merged into this branch. I'd like to get somebody's opinion on that before I do it.
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Would it be possible for somebody to review this PR for me? I have a few ideas that are dependent on this and I'd like to get to work on them. Most notably support for Java Lists. Maybe @cloud-fan could take a look at this?
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Apologies for taking so long. I tried modifying the serialization logic as best as I could to serialize into `UnsafeArrayData` ([branch diff](https://github.com/michalsenkyr/spark/compare/dataset-seq-builder...michalsenkyr:dataset-seq-builder-unsafe)). I had to first convert into an array in order to use `fromPrimitiveArray` on the result. That's probably the reason why the benchmark came up slightly worse:
```
OpenJDK 64-Bit Server VM 1.8.0_121-b13 on Linux 4.9.6-1-ARCH
AMD A10-4600M APU with Radeon(tm) HD Graphics

collect:              Best/Avg Time(ms)    Rate(M/s)    Per Row(ns)    Relative
-------------------------------------------------------------------------------
Seq                         256 /  287          0,0       255670,1        1,0X
List                        161 /  220          0,0       161091,7        1,6X
mutable.Queue               304 /  324          0,0       303823,3        0,8X
```
I am not entirely sure how `GenericArrayData` and `UnsafeArrayData` are handled during transformations and shuffles, though, so it's possible that more complex tests will reveal better performance. However, I'm not sure that I can test this properly on my single-machine setup. I'd definitely be interested in benchmark results on a cluster setup.
Generated code:
```
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator[] inputs;
/* 008 */   private scala.collection.Iterator inputadapter_input;
/* 009 */   private boolean CollectObjects_loopIsNull1;
/* 010 */   private int CollectObjects_loopValue0;
/* 011 */   private UnsafeRow deserializetoobject_result;
/* 012 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder deserializetoobject_holder;
/* 013 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter deserializetoobject_rowWriter;
/* 014 */   private scala.collection.immutable.List mapelements_argValue;
/* 015 */   private UnsafeRow mapelements_result;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder mapelements_holder;
/* 017 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter mapelements_rowWriter;
/* 018 */   private scala.collection.immutable.List serializefromobject_argValue;
/* 019 */   private UnsafeRow serializefromobject_result;
/* 020 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
/* 021 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
/* 022 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter serializefromobject_arrayWriter;
/* 023 */
/* 024 */   public GeneratedIterator(Object[] references) {
/* 025 */     this.references = references;
/* 026 */   }
/* 027 */
/* 028 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 029 */     partitionIndex = index;
/* 030 */     this.inputs = inputs;
/* 031 */     inputadapter_input = inputs[0];
/* 032 */
/* 033 */     deserializetoobject_result = new UnsafeRow(1);
/* 034 */     this.deserializetoobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(deserializetoobject_result, 32);
/* 035 */     this.deserializetoobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(deserializetoobject_holder, 1);
/* 036 */
/* 037 */     mapelements_result = new UnsafeRow(1);
/* 038 */     this.mapelements_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(mapelements_result, 32);
/* 039 */     this.mapelements_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(mapelements_holder, 1);
/* 040 */
/* 041 */     serializefromobject_result = new UnsafeRow(1);
/* 042 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32);
/* 043 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
/* 044 */     this.serializefromobject_arrayWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter();
/* 045 */
/* 046 */   }
/* 047 */
/* 048 */   protected void process
```
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 I added the benchmarks based on the code you provided, but I am getting almost the same results before and after the optimization (see description). So either the added benefit is really small or I didn't write/tune the benchmarks quite right. I would appreciate it if you could take a look at them. I will also take a look at the `UnsafeArrayData` optimization later and try to include it in my next commit.
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16541 Could we get an additional performance improvement if we generated `UnsafeArrayData` instead of `GenericArrayData` for this statement?
```
/* 104 */ final ArrayData serializefromobject_value = false ? null : new org.apache.spark.sql.catalyst.util.GenericArrayData(serializefromobject_argValue);
```
Github user kiszk commented on the issue: https://github.com/apache/spark/pull/16541 This looks like a similar optimization to https://github.com/apache/spark/pull/15044. Does [this code](https://github.com/apache/spark/pull/15044/files#diff-d6f03c9d3e82f3774d1110559b039a6d) help you?
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Added benchmarks. I didn't find any standardized way of benchmarking codegen so I wrote a simple script for Spark Shell. Benchmarks were run on a laptop so the collections couldn't be too large. Nevertheless, the benchmarks are consistently (even if not significantly) faster.
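A simple spark-shell timing script of the kind described above could look like the following. This is a hypothetical sketch of a best-of-N timing helper, not the actual benchmark script from the PR:

```scala
// Hypothetical best-of-N timing helper for ad-hoc spark-shell benchmarks.
// Returns the name together with the best observed wall-clock time in ms.
def benchmark[T](name: String, iters: Int = 5)(body: => T): (String, Long) = {
  body // warm up once so JIT compilation doesn't dominate the measurement
  val times = (1 to iters).map { _ =>
    val start = System.nanoTime()
    body
    System.nanoTime() - start
  }
  (name, times.min / 1000000) // best time, nanoseconds -> milliseconds
}

// Usage in spark-shell would be along the lines of:
//   benchmark("List")(ds.map(identity).collect())
val noop = benchmark("noop") { 1 + 1 }
```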
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Added codegen comparison for a simple `List` dataset. I will also prepare a benchmark and add some results later. Those will be for `List`, `mutable.Queue` and `Seq`, where `List` and `mutable.Queue` should benefit from the change (one less pass) and `Seq` should stay approximately the same (as it is a supertype of `WrappedArray` and therefore skips the final conversion in the original approach).
Github user rxin commented on the issue: https://github.com/apache/spark/pull/16541 Is this a perf optimization? If yes, can you show some benchmarks? Also for codegen it's good to show the generated code before/after this change. You can get that with
```
df.queryExecution.debug.codegen()
```
Github user michalsenkyr commented on the issue: https://github.com/apache/spark/pull/16541 Also, the new `CollectObjects` copies quite a bit of code from `MapObjects`. Should I move the code into a common trait to reduce duplication, or should I leave it as is?
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/16541 Can one of the admins verify this patch?