[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

michalsenkyr Fri, 10 Mar 2017 14:46:02 -0800

Github user michalsenkyr commented on the issue:

    https://github.com/apache/spark/pull/16541
  
    Well, technically yes. But I would say it's a little more than that.
    
    The current approach to deserialization of `Seq`s is to copy the data into 
an array, construct a `WrappedArray` (which extends `Seq`) and optionally copy 
the data (again) into the new collection. This needs to go through all the 
elements twice if anything other than a `Seq` (or `WrappedArray` directly) is 
requested.
    This PR takes a more straightforward approach and constructs a mutable 
collection builder (which is defined for every Scala collection), adds the 
elements to it and retrieves the result.
    
    I assumed this would be a performance improvement and am quite surprised 
that there is no difference. But I think this might be due to the improvement 
being so small as to drown in the usual operation overhead. Sadly, I do not 
have the resources to measure the operation on larger amounts of data, having 
the benchmarks fail on larger collections on my setup.
    
    I am also not that familiar with Spark's internals to determine whether 
@kiszk's suggestion would improve operations in other ways, eg. during 
operations in the cluster environment. That is why I decided to implement it 
but keep it separate from my proposed changes for the time being.
    
    As to other benefits that this PR would bring other than peformance 
improvements:
    
    * I would like to implement Java `List`s support in a manner similar to 
what I did with `Map`s in #16986 (I would also ask you to take a look at that 
when you have the time) by just slightly altering the code.
    * I didn't know this when I implemented this but Scala 2.13 will introduce 
a major [collections 
rework](https://www.scala-lang.org/blog/2017/02/28/collections-rework.html) 
that will change the way the `to` method (used for conversions in the current 
solution) works. This will require a rewrite whereas I believe the method used 
here will remain largely the same.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] spark issue #16541: [SPARK-19088][SQL] Optimize sequence type deserializatio...

Reply via email to