gagan taneja created SPARK-19145:
------------------------------------

             Summary: Timestamp to String casting is slowing the query significantly
                 Key: SPARK-19145
                 URL: https://issues.apache.org/jira/browse/SPARK-19145
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: gagan taneja
I have a time series table with a timestamp column. The following query

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= '2017-01-02 19:53:51'
  AND `time` <= '2017-01-09 19:53:51'
LIMIT 50000

is significantly SLOWER than

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
  AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
LIMIT 50000

After investigating, I found that in the first query the time column is cast to String before the filter is applied, whereas in the second query no such cast is performed and the filter compares long values. Below is the generated physical plan for the slower execution, followed by the plan for the faster execution.

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= '2017-01-02 19:53:51'
  AND `time` <= '2017-01-09 19:53:51'
LIMIT 50000

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
         +- *Project
            +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
               +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct<time:timestamp>

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
  AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD HH24:MI:SS−0800')
LIMIT 50000

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
         +- *Project
            +- *Filter
((isnotnull(time#3262) && (time#3262 >= 1483404831000000)) && (time#3262 <= 1484009631000000))
               +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>

In Impala both queries run efficiently with no performance difference. Spark should be able to parse the date string and convert it to a Long/Timestamp while generating the optimized logical plan, so that both queries have similar performance.
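Until the optimizer handles this, a possible workaround (a sketch based on the queries above; the table name and literals are taken from this report) is to make the right-hand side of the comparison an explicit timestamp, so the column is never cast to String:

```sql
SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
  AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)
LIMIT 50000
```

With a timestamp on both sides, the filter should compare long values and be pushed down to the Parquet scan, as in the faster plan above. Note that CAST interprets the literal in the session time zone, which may differ from the to_utc_timestamp variant shown in this report.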