[jira] [Commented] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2017-02-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880463#comment-15880463
 ] 

Hyukjin Kwon commented on SPARK-6678:
-

gentle ping [~cnstar9988]

> select count(DISTINCT C_UID) from parquetdir may be can optimize
> 
>
> Key: SPARK-6678
> URL: https://issues.apache.org/jira/browse/SPARK-6678
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.3.0
> Reporter: Littlestar
> Priority: Minor
>
> 2.2 TB of Parquet files (5000 files total, 100 billion records, 2 billion unique
> C_UID values).
> I ran the following SQL; it seems that RDD.collect is very slow:
> select count(DISTINCT C_UID) from parquetdir
> collect at SparkPlan.scala:83
> org.apache.spark.rdd.RDD.collect(RDD.scala:813)
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:83)
> org.apache.spark.sql.DataFrame.collect(DataFrame.scala:815)
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.run(Shim13.scala:178)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementInternal(HiveSessionImpl.java:231)
> org.apache.hive.service.cli.session.HiveSessionImpl.executeStatementAsync(HiveSessionImpl.java:218)
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> java.lang.reflect.Method.invoke(Method.java:606)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:79)
> org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:37)
> org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:64)
> java.security.AccessController.doPrivileged(Native Method)
> javax.security.auth.Subject.doAs(Subject.java:415)
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
> org.apache.hadoop.hive.shims.HadoopShimsSecure.doAs(HadoopShimsSecure.java:493)
> org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:60)
> com.sun.proxy.$Proxy23.executeStatementAsync(Unknown Source)
> org.apache.hive.service.cli.CLIService.executeStatementAsync(CLIService.java:233)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2017-02-11 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15862455#comment-15862455
 ] 

Hyukjin Kwon commented on SPARK-6678:
-

[~cnstar9988], would you mind sharing your other Spark code? I would like to run
both to verify the problem.




[jira] [Commented] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2016-10-12 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567965#comment-15567965
 ] 

Littlestar commented on SPARK-6678:
---

I compared it to my other Spark code, which does the same thing: count distinct C_UID.

I think “org.apache.spark.rdd.RDD.collect(RDD.scala:813)” is very slow, and it is
not necessary.
I just need the distinct count, not each C_UID value.
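
To illustrate the point above, here is a minimal sketch in plain Python of an exact distinct count in which only integer counts, never the C_UID values themselves, reach the caller. It is not Spark's implementation and makes no claim about what the query plan for this SQL actually does; the function name `count_distinct`, the `num_buckets` parameter, and the list-of-lists stand-in for partitions are all hypothetical.

```python
from collections import defaultdict


def count_distinct(partitions, num_buckets=4):
    """Exact distinct count over an iterable of partitions (hypothetical helper)."""
    # Phase 1 ("map side"): de-duplicate inside each partition, then route each
    # surviving value to a bucket chosen by its hash, so that equal values from
    # different partitions always land in the same bucket.
    buckets = defaultdict(set)
    for partition in partitions:
        for value in set(partition):
            buckets[hash(value) % num_buckets].add(value)

    # Phase 2 ("reduce side"): each bucket reports only its size.
    per_bucket_counts = [len(values) for values in buckets.values()]

    # "Driver": adds up a handful of integers and never sees the values.
    return sum(per_bucket_counts)


if __name__ == "__main__":
    # Three small partitions with overlapping IDs, purely for illustration.
    sample = [["u1", "u2", "u2"], ["u2", "u3"], ["u3", "u4", "u4"]]
    print(count_distinct(sample))
```

Because every value is assigned to exactly one bucket, the per-bucket sizes can be summed without double counting, so the final step handles `num_buckets` integers at most, regardless of how many unique IDs exist.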





[jira] [Commented] (SPARK-6678) select count(DISTINCT C_UID) from parquetdir may be can optimize

2016-10-08 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15557729#comment-15557729
 ] 

Hyukjin Kwon commented on SPARK-6678:
-

Hi [~cnstar9988], does this issue describe the slow performance by itself, without
any kind of target to compare against?
