[ https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15567965#comment-15567965 ]
Littlestar commented on SPARK-6678:
-----------------------------------

I compared it to my other Spark code, which does the same thing: count distinct C_UID.
I think “org.apache.spark.rdd.RDD.collect(RDD.scala:813)” is very slow, and it is not necessary. I just need the distinct count, not each C_UID value.

> select count(DISTINCT C_UID) from parquetdir may be can optimize
> ----------------------------------------------------------------
>
>                 Key: SPARK-6678
>                 URL: https://issues.apache.org/jira/browse/SPARK-6678
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.3.0
>            Reporter: Littlestar
>            Priority: Minor
>
> 2.2T of parquet files (5000 files total, 100 billion records, 2 billion unique C_UID).
> I ran the following SQL; RDD.collect may be what makes it very slow:
> select count(DISTINCT C_UID) from parquetdir
> select count(DISTINCT C_UID) from parquetdir
> collect at SparkPlan.scala:83 +details
> org.apache.spark.rdd.RDD.collect(RDD.scala:813)
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815)
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:415)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source)
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
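
To illustrate the point made in the comment (only the distinct count is needed, not each C_UID value), here is a minimal plain-Python sketch of a two-stage distinct count in which only integer partial counts reach the driver. It is an illustration of the general idea, not Spark's actual execution plan; the names `distinct_count`, `partitions`, and `num_buckets` are hypothetical and do not come from the issue.

```python
from collections import defaultdict


def distinct_count(partitions, num_buckets=4):
    """Count distinct values across partitions without gathering the values
    themselves in one place; only per-bucket integer counts are summed."""
    # Stage 1 (map side): dedupe each partition locally, then route every
    # surviving value to a bucket chosen by its hash ("shuffle").
    buckets = defaultdict(set)
    for part in partitions:
        for value in set(part):
            buckets[hash(value) % num_buckets].add(value)

    # Stage 2 (reduce side): equal values always hash to the same bucket,
    # so buckets are disjoint and each bucket size is an exact partial count.
    partial_counts = [len(values) for values in buckets.values()]

    # "Driver": sum a handful of integers; no individual value is returned.
    return sum(partial_counts)


if __name__ == "__main__":
    # Three toy partitions standing in for parquet splits of C_UID values.
    sample = [["a", "b", "a"], ["b", "c"], ["c", "d", "e"]]
    print(distinct_count(sample))
```

The design choice here is that deduplication happens before and after the shuffle, so the final step handles one number per bucket rather than 2 billion identifiers.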