[ 
https://issues.apache.org/jira/browse/SPARK-19255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ashok Kumar updated SPARK-19255:
--------------------------------
    Description: 
Since it is difficult to load a huge dataset, the steps below will help reproduce the issue; an equivalent spark-shell sketch follows the steps.

Test steps:
1. CREATE TABLE sample(imei string, age int, task bigint, num double, level decimal(10,3), productdate timestamp, name string, point int) USING com.databricks.spark.csv OPTIONS (path "data.csv", header "false", inferSchema "false");

2. set spark.sql.shuffle.partitions=100000;
3. select count(*) from (select task,sum(age) from sample group by task) t;
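
For reference, here is the same repro as a minimal spark-shell sketch. This is an assumption-laden sketch, not part of the original report: it assumes Spark 2.x with the spark-csv package on the classpath, "data.csv" is the placeholder path from step 1, and the application name is made up.

{code:scala}
import org.apache.spark.sql.SparkSession

// Sketch only: assumes the com.databricks.spark.csv package is on the
// classpath and "data.csv" points at a small sample file.
val spark = SparkSession.builder().appName("SPARK-19255-repro").getOrCreate()

spark.sql(
  """CREATE TABLE sample(imei string, age int, task bigint, num double,
    |level decimal(10,3), productdate timestamp, name string, point int)
    |USING com.databricks.spark.csv
    |OPTIONS (path "data.csv", header "false", inferSchema "false")""".stripMargin)

// The very high shuffle partition count is the trigger:
// each shuffle partition becomes one task.
spark.sql("set spark.sql.shuffle.partitions=100000")
spark.sql("select count(*) from (select task, sum(age) from sample group by task) t").show()
{code}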

After running the above query, the number of objects held in the map variable _stageIdToStageMetrics grows very large; the growth is proportional to the number of shuffle partitions.

Please have a look at the attached screenshot.
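
The growth can be modeled with a small, self-contained sketch. The names below are hypothetical simplifications, not Spark's actual SQLListener code; the point is that the listener keeps one metrics entry per task per stage, and the task count equals the shuffle partition count, so retained heap scales linearly with spark.sql.shuffle.partitions.

{code:scala}
import scala.collection.mutable

// Hypothetical simplification of the listener's bookkeeping;
// not the real SQLListener implementation.
object ListenerGrowthModel {
  // stageId -> (taskId -> that task's accumulator updates)
  val stageIdToStageMetrics =
    mutable.HashMap.empty[Long, mutable.HashMap[Long, Array[Long]]]

  def recordTaskMetrics(stageId: Long, taskId: Long, updates: Array[Long]): Unit = {
    val taskMetrics = stageIdToStageMetrics.getOrElseUpdate(stageId, mutable.HashMap.empty)
    taskMetrics(taskId) = updates
  }

  def main(args: Array[String]): Unit = {
    val shufflePartitions = 100000L // one task per shuffle partition
    for (taskId <- 0L until shufflePartitions) {
      recordTaskMetrics(stageId = 1L, taskId = taskId, updates = Array.fill(8)(0L))
    }
    // Entry count, and therefore retained heap, is proportional to the
    // shuffle partition count; at this scale the driver eventually runs
    // out of memory.
    println(s"entries held for stage 1: ${stageIdToStageMetrics(1L).size}")
  }
}
{code}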




> SQL Listener causes out of memory when the data size is in petabytes.
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19255
>                 URL: https://issues.apache.org/jira/browse/SPARK-19255
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>         Environment: Linux
>            Reporter: Ashok Kumar
>            Priority: Minor
>         Attachments: spark_sqllistener_oom.png
>


