David C Navas created SPARK-17529:
-------------------------------------

             Summary: On highly skewed data, outer join merges are slow
                 Key: SPARK-17529
                 URL: https://issues.apache.org/jira/browse/SPARK-17529
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 2.0.0, 1.6.2
            Reporter: David C Navas
            Priority: Trivial


All timings were taken from 1.6.2, but it appears that 2.0.0 suffers from the 
same performance problem.

My co-worker Yewei Zhang was investigating a performance problem with a highly 
skewed dataset.
"Part of this query performs a full outer join over [an ID] on highly skewed 
data. On the left view, there is one record for id = 0 out of 2,272,486 
records; On the right view there are 8,353,097 records for id = 0 out of 
12,685,073 records"

The sub-query was taking 5.2 minutes.  We discovered that snappy was 
responsible for some measure of this problem and incorporated the new snappy 
release.  This brought the sub-query down to 2.4 minutes.  A large percentage 
of the remaining time was spent in the merge code which I tracked down to a 
BitSet clearing issue.  We have noted that you already have the snappy fix, 
this issue describes the problem with the BitSet.

The BitSet grows to handle the largest matching set of keys and is used to 
track joins.  The BitSet is re-used in all subsequent joins (unless it is too 
small)

The skewing of our data caused a very large BitSet to be allocated on the very 
first row joined.  Unfortunately, the entire BitSet is cleared on each re-use.  
For each of the remaining rows which likely match only a few rows on the other 
side, the entire 1MB of the BitSet is cleared.  If unpartitioned, this would 
happen roughly 6 million times.  The fix I developed for this is to clear only 
the portion of the BitSet that is needed.  After applying it, the sub-query 
dropped from 2.4 minutes to 29 seconds.

Small (0 or negative) IDs are often used as place-holders for null, empty, or 
unknown data, so I expect this fix to be generally useful, if rarely 
encountered to this particular degree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to