GitHub user rdblue commented on the issue:

    https://github.com/apache/spark/pull/21118
  
    I just ran a performance test based on our Spark 2.1.1 build and a real table. I tested a full scan of an hour of data with a single data filter.
    
    The scan had 13,083 tasks and read 1084.8 GB. I used identical Spark applications with 100 executors, each with 1 core and 6 GB of memory.
    * **With projection to UnsafeRow**: wall time: 12m, total task time: 19h, longest task: 51s.
    * **Without projection, using InternalRow**: wall time: 11m, total task time: 17.8h, longest task: 26s.
    
    Clearly, this is not a rigorous benchmark. But it shows a roughly 6% reduction in total task time (17.8h vs. 19h) from not making unnecessary copies. Eliminating copies is a fairly easy way to get better performance, if we can update a few operators to work with both `InternalRow` and `UnsafeRow`; a sketch of the idea follows.
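    For illustration, here is a minimal sketch of the kind of change I mean, using a hypothetical `toUnsafe` helper (not part of this PR). Operators that genuinely require `UnsafeRow` could use something like this to project only when needed, while operators updated to accept `InternalRow` would skip the projection and copy entirely:

    ```scala
    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{UnsafeProjection, UnsafeRow}
    import org.apache.spark.sql.types.StructType

    // Hypothetical helper: produce an UnsafeRow only when the input is not
    // already one, instead of unconditionally projecting every row.
    def toUnsafe(schema: StructType): InternalRow => UnsafeRow = {
      // UnsafeProjection.create compiles a projection for the given schema.
      // Note that the projection reuses its output buffer, so callers that
      // buffer rows across calls still need to copy the result.
      lazy val proj = UnsafeProjection.create(schema)
      (row: InternalRow) => row match {
        case u: UnsafeRow => u           // already unsafe: pass through, no copy
        case other        => proj(other) // copies field data into an UnsafeRow
      }
    }
    ```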

