[ https://issues.apache.org/jira/browse/SPARK-16361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15361073#comment-15361073 ]
lichenglin commented on SPARK-16361:
------------------------------------

The data size is 1 million rows. I'm sure the job with 40 GB of memory is faster than the job with 20 GB. But building a cube over 1 million rows needing more than 40 GB of memory just to keep the GC time down is really not cool.

> It takes a long time for GC when building a cube with many fields
> ------------------------------------------------------------------
>
>                 Key: SPARK-16361
>                 URL: https://issues.apache.org/jira/browse/SPARK-16361
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.2
>            Reporter: lichenglin
>
> I'm using Spark to build a cube on a DataFrame with 1 million rows.
> I found that when I add too many fields (about 8 or more), the workers spend a lot of time in GC.
> I tried increasing the memory of each worker, but it did not help much, and I don't know why.
> Here is my code and the monitoring output; Cuber is a utility class for building cubes.
> {code:title=Bar.java|borderStyle=solid}
> sqlContext.udf().register("jidu", (Integer f) -> {
>     return (f - 1) / 3 + 1;
> }, DataTypes.IntegerType);
>
> DataFrame d = sqlContext.table("dw.dw_cust_info").selectExpr("*",
>     "cast (CUST_AGE as double) as c_age",
>     "month(day) as month", "year(day) as year",
>     "cast ((datediff(now(),INTIME)/365+1) as int) as zwsc",
>     "jidu(month(day)) as jidu");
>
> Bucketizer b = new Bucketizer()
>     .setInputCol("c_age")
>     .setSplits(new double[] { Double.NEGATIVE_INFINITY, 0, 10, 20, 30, 40,
>         50, 60, 70, 80, 90, 100, Double.POSITIVE_INFINITY })
>     .setOutputCol("age");
>
> DataFrame cube = new Cuber(b.transform(d))
>     .addFields("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
>         "month", "jidu", "year", "SUBTYPE")
>     .max("age").min("age").sum("zwsc").count()
>     .buildcube();
>
> cube.write().mode(SaveMode.Overwrite).saveAsTable("dt.cuberdemo");
> {code}
>
> Summary Metrics for 12 Completed Tasks:
> || Metric || Min || 25th percentile || Median || 75th percentile || Max ||
> | Duration | 2.6 min | 2.7 min | 2.7 min | 2.7 min | 2.7 min |
> | GC Time | 1.6 min | 1.6 min | 1.6 min | 1.6 min | 1.6 min |
> | Shuffle Read Size / Records | 728.4 KB / 21886 | 736.6 KB / 22258 | 738.7 KB / 22387 | 746.6 KB / 22542 | 748.6 KB / 22783 |
> | Shuffle Write Size / Records | 74.3 MB / 1926282 | 75.8 MB / 1965860 | 76.2 MB / 1976004 | 76.4 MB / 1981516 | 77.9 MB / 2021142 |
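
The blow-up here is inherent to CUBE itself. A minimal sketch, assuming the Cuber wrapper (whose source is not shown) ultimately delegates to DataFrame.cube(): CUBE over n grouping columns expands every input row into one row per grouping set, i.e. 2^n rows, before aggregation runs. With the 9 fields above that is 2^9 = 512 intermediate rows per input row, so 1 million rows become roughly 512 million rows competing for executor memory, which would be consistent with the long GC pauses at any reasonable heap size. The sketch reuses `b` and `d` from the snippet above; the direct cube() call is a hypothetical stand-in for Cuber's buildcube().

{code:title=CubeSketch.java|borderStyle=solid}
import static org.apache.spark.sql.functions.*;

// Hypothetical equivalent of the Cuber pipeline above: cube() generates all
// 2^9 = 512 grouping sets over these 9 columns, so the underlying Expand step
// emits 512 output rows for every one of the ~1M input rows before partial
// aggregation can shrink the data back down -- hence the allocation pressure.
DataFrame cube = b.transform(d)
    .cube("day", "AREA_CODE", "CUST_TYPE", "age", "zwsc",
          "month", "jidu", "year", "SUBTYPE")
    .agg(max("age"), min("age"), sum("zwsc"), count("*"));
{code}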