Littlestar created SPARK-6678:
---------------------------------

             Summary: select count(DISTINCT C_UID) from parquetdir could 
possibly be optimized
                 Key: SPARK-6678
                 URL: https://issues.apache.org/jira/browse/SPARK-6678
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.3.0
            Reporter: Littlestar
            Priority: Minor


The data set is 2.2 TB of Parquet files (5000 files in total, 100 billion 
records, 2 billion unique C_UID values).

I ran the following SQL; the RDD.collect step appears to be very slow:
select count(DISTINCT C_UID) from parquetdir

select count(DISTINCT C_UID) from parquetdir
collect at SparkPlan.scala:83
org.apache.spark.rdd.RDD.collect(RDD.scala:813)
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815)
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:415)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source)
org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)
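
To illustrate the kind of optimization that might apply here, the following is a minimal sketch in plain Python, not Spark code and not a description of how Spark 1.3.0 actually plans this query. It assumes the distinct count can be computed by hash-partitioning the C_UID values so that equal values always land in the same partition, counting distinct values inside each partition, and adding up the per-partition counts, so that only one small number per partition has to reach the caller. The function name and the partition count are hypothetical.

```python
from collections import defaultdict


def count_distinct_partitioned(values, num_partitions=4):
    """Count distinct values via hash partitioning (illustrative only)."""
    # Phase 1: route each value to a partition chosen by its hash, so that
    # equal values always end up in the same partition.
    partitions = defaultdict(set)
    for v in values:
        partitions[hash(v) % num_partitions].add(v)

    # Phase 2: each partition contributes only its local distinct count.
    # The partitions are disjoint, so the counts can simply be summed.
    return sum(len(p) for p in partitions.values())


# Hypothetical sample data standing in for the C_UID column.
c_uids = ["u1", "u2", "u1", "u3", "u2", "u4"]
print(count_distinct_partitioned(c_uids))
```

With this shape, the final step combines one integer per partition instead of gathering the distinct values themselves in one place.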
