[ https://issues.apache.org/jira/browse/SPARK-4273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen updated SPARK-4273:
-----------------------------
    Assignee: Michael Armbrust

> Providing ExternalSet to avoid OOM when count(distinct)
> -------------------------------------------------------
>
>                 Key: SPARK-4273
>                 URL: https://issues.apache.org/jira/browse/SPARK-4273
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>            Reporter: YanTang Zhai
>            Assignee: Michael Armbrust
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> A task may OOM during count(distinct) if it needs to process many records.
> CombineSetsAndCountFunction puts all records into an OpenHashSet; if it
> fetches many records, the set may occupy a large amount of memory.
> I think a data structure ExternalSet, analogous to ExternalAppendOnlyMap,
> could be provided to spill OpenHashSet data to disk when its size exceeds
> some threshold.
> For example, OpenHashSet1 (ohs1) holds [d, b, c, a]. It is spilled to file1
> sorted by hashCode, so file1 contains [a, b, c, d]. The procedure could be
> illustrated as follows:
> ohs1 [d, b, c, a] => [a, b, c, d] => file1
> ohs2 [e, f, g, a] => [a, e, f, g] => file2
> ohs3 [e, h, i, g] => [e, g, h, i] => file3
> ohs4 [j, h, a] => [a, h, j] => sortedSet
> On output, all keys with the same hashCode are put into one OpenHashSet,
> and the iterator of that OpenHashSet is then traversed. The procedure could
> be illustrated as follows:
> file1 -> a -> ohsA; file2 -> a -> ohsA; sortedSet -> a -> ohsA; ohsA -> a;
> file1 -> b -> ohsB; ohsB -> b;
> file1 -> c -> ohsC; ohsC -> c;
> file1 -> d -> ohsD; ohsD -> d;
> file2 -> e -> ohsE; file3 -> e -> ohsE; ohsE -> e;
> ...
> I think using the ExternalSet could avoid OOM in count(distinct). Comments
> are welcome.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
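The spill-and-merge procedure described in the issue can be sketched in a few lines of Python. This is only a minimal illustration of the proposed idea, not Spark code: a plain `set` stands in for OpenHashSet, in-memory sorted lists stand in for the spill files, and the `ExternalSet` class name and `threshold` parameter are assumptions taken from the description. Following the description, keys sharing a hashCode are collected into one set (the "ohsA" step) before being emitted, so hash collisions across spill files are still deduplicated.

```python
import heapq


class ExternalSet:
    """Sketch of the proposed ExternalSet: accumulate keys in an in-memory
    set, spill a hash-sorted run when a size threshold is exceeded, then
    merge all runs on iteration so each distinct key is yielded once."""

    def __init__(self, threshold=4):
        self.threshold = threshold
        self.current = set()   # stands in for the in-memory OpenHashSet
        self.spilled = []      # each entry models one sorted spill file

    def insert(self, key):
        self.current.add(key)
        if len(self.current) >= self.threshold:
            # Spill the set sorted by hash, like file1 = [a, b, c, d].
            self.spilled.append(sorted(self.current, key=hash))
            self.current = set()

    def __iter__(self):
        # Merge the sorted runs plus the remaining in-memory keys.
        runs = [r for r in self.spilled + [sorted(self.current, key=hash)] if r]
        group, group_hash = set(), None  # "ohsA": all keys of one hashCode
        for key in heapq.merge(*runs, key=hash):
            h = hash(key)
            if group and h != group_hash:
                yield from group  # emit the finished hashCode group
                group = set()
            group_hash = h
            group.add(key)  # set membership dedupes equal keys across runs
        if group:
            yield from group


def count_distinct(keys, threshold=4):
    """count(distinct) over the sketch without holding all keys in one set."""
    s = ExternalSet(threshold)
    for k in keys:
        s.insert(k)
    return sum(1 for _ in s)
```

For instance, feeding the keys from the example above (`d b c a e f g a e h i g j h a`) through `count_distinct` yields 10, the same answer as `len(set(keys))`, while no single in-memory set ever grows past the spill threshold.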