[ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-4273:
-----------------------------
    Assignee: Michael Armbrust

> Providing ExternalSet to avoid OOM when count(distinct)
> -------------------------------------------------------
>
>                 Key: SPARK-4273
>                 URL: https://issues.apache.org/jira/browse/SPARK-4273
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>            Reporter: YanTang Zhai
>            Assignee: Michael Armbrust
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> A task may OOM during count(distinct) if it needs to process many records.
> CombineSetsAndCountFunction puts all records into an OpenHashSet; if it
> fetches many records, the set may occupy a large amount of memory.
> I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap,
> could be provided to spill OpenHashSet data to disk when its size exceeds
> some threshold.
> For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1
> sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be
> illustrated as follows:
> ohs1 [d, b, c, a] => [a, b, c, d] => file1
> ohs2 [e, f, g, a] => [a, e, f, g] => file2
> ohs3 [e, h, i, g] => [e, g, h, i] => file3
> ohs4 [j, h, a] => [a, h, j] => sortedSet
> On output, all keys with the same hashCode are put into one OpenHashSet,
> and the iterator of that OpenHashSet is then traversed. The procedure could
> be illustrated as follows:
> file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
> file1 -> b -> ohsB; ohsB -> b;
> file1 -> c -> ohsC; ohsC -> c;
> file1 -> d -> ohsD; ohsD -> d;
> file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
> ...
> I think using the ExternalSet could avoid OOM in count(distinct). Comments
> are welcome.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
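The spill-and-merge procedure described in the issue can be sketched in a few lines of Python. This is only a minimal illustration of the proposed idea, not Spark code: a plain `set` stands in for OpenHashSet, in-memory sorted lists stand in for the spill files, and the `ExternalSet` class name and `threshold` parameter are assumptions taken from the description. Following the description, keys sharing a hashCode are collected into one set (the "ohsA" step) before being emitted, so hash collisions across spill files are still deduplicated.

```python
import heapq


class ExternalSet:
    """Sketch of the proposed ExternalSet: accumulate keys in an in-memory
    set, spill a hash-sorted run when a size threshold is exceeded, then
    merge all runs on iteration so each distinct key is yielded once."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.current = set()   # stands in for the in-memory OpenHashSet
        self.spilled = []      # each entry models one sorted spill file

    def insert(self, key):
        self.current.add(key)
        if len(self.current) >= self.threshold:
            # Spill the set sorted by hash, like file1 = [a, b, c, d].
            self.spilled.append(sorted(self.current, key=hash))
            self.current = set()

    def __iter__(self):
        # Merge the sorted runs plus the remaining in-memory keys.
        runs = [r for r in self.spilled + [sorted(self.current, key=hash)] if r]
        group, group_hash = set(), None  # "ohsA": all keys of one hashCode
        for key in heapq.merge(*runs, key=hash):
            h = hash(key)
            if group and h != group_hash:
                yield from group  # emit the finished hashCode group
                group = set()
            group_hash = h
            group.add(key)  # set membership dedupes equal keys across runs
        if group:
            yield from group


def count_distinct(keys, threshold=4):
    """count(distinct) over the sketch without holding all keys in one set."""
    s = ExternalSet(threshold)
    for k in keys:
        s.insert(k)
    return sum(1 for _ in s)
```

For instance, feeding the keys from the example above (`d b c a e f g a e h i g j h a`) through `count_distinct` yields 10, the same answer as `len(set(keys))`, while no single in-memory set ever grows past the spill threshold.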