[ https://issues.apache.org/jira/browse/SPARK-18457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15669150#comment-15669150 ]
Hyukjin Kwon edited comment on SPARK-18457 at 11/16/16 2:39 AM:
----------------------------------------------------------------

Hm, don't we already push down an empty column list (i.e. read no columns) when performing {{count(*)}}?

{code}
val data = (0 to 255).zip(0 to 255).toDF("a", "b")
val path = "/tmp/aa"
data.write.orc(path)
spark.read.orc(path).createOrReplaceTempView("a_orc_table")
spark.sql("select count(*) from a_orc_table").explain(true)
{code}

{code}
== Parsed Logical Plan ==
'Project [unresolvedalias('count(1), None)]
+- 'UnresolvedRelation `a_orc_table`

== Analyzed Logical Plan ==
count(1): bigint
Aggregate [count(1) AS count(1)#28L]
+- SubqueryAlias a_orc_table
   +- Relation[a#16,b#17] orc

== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#28L]
+- Project
   +- Relation[a#16,b#17] orc

== Physical Plan ==
*HashAggregate(keys=[], functions=[count(1)], output=[count(1)#28L])
+- Exchange SinglePartition
   +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#30L])
      +- *FileScan orc [] Batched: false, Format: ORC, Location: InMemoryFileIndex[file:/tmp/aa], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
{code}

Note the empty column list and {{ReadSchema: struct<>}} in the {{FileScan}}: it seems the ORC datasource does not try to read all columns here. I just verified this by debugging with an IDE.

> ORC and other columnar formats using HiveShim read all columns when doing a
> simple count
> ----------------------------------------------------------------------------------------
>
>                 Key: SPARK-18457
>                 URL: https://issues.apache.org/jira/browse/SPARK-18457
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.3, 2.0.2
>        Environment: Hadoop 2.7.0
>            Reporter: Andrew Ray
>            Priority: Minor
>
> Doing a `select count(*) from a_orc_table` reads all columns and is thus
> slower than a query selecting a single column, like `select count(a_column)
> from a_orc_table`. The data read can be seen in the UI (it appears to be
> accurate only for Hadoop 2.5+, based on the comment in FileScanRDD.scala line 80).
> I will create a PR shortly.
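The behaviour the comment verifies — {{count(*)}} served from a scan with an empty read schema ({{ReadSchema: struct<>}}) — can be sketched outside Spark. The snippet below is an illustration only, in plain Python rather than Spark/ORC internals; {{ColumnarFile}} and its fields are hypothetical, not ORC's actual layout. The point it demonstrates: columnar formats store a row count in file/stripe metadata, so {{count(*)}} need not decode any column, while {{count(a_column)}} must decode exactly one column to skip NULLs.

```python
# Illustrative sketch only (hypothetical names, not Spark or ORC internals):
# why an empty read schema is enough to answer count(*).

from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ColumnarFile:
    # Column name -> cell values; None models a NULL cell.
    columns: Dict[str, List[Optional[int]]]
    # Columnar formats such as ORC keep a row count in footer metadata.
    footer_row_count: int


def count_star(f: ColumnarFile) -> int:
    """count(*): answered from footer metadata alone; no column bytes read."""
    return f.footer_row_count


def count_column(f: ColumnarFile, name: str) -> int:
    """count(col): decodes exactly one column, counting non-NULL cells."""
    return sum(v is not None for v in f.columns[name])


f = ColumnarFile(
    columns={"a": [1, None, 3], "b": [4, 5, 6]},
    footer_row_count=3,
)

print(count_star(f))         # 3, without touching any column data
print(count_column(f, "a"))  # 2, reads only column "a" and skips the NULL
```

Under this model, a {{count(*)}} scan that still decodes every column — the behaviour the issue reports for the HiveShim-based ORC path — would be doing strictly unnecessary work.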
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org