gagan taneja created SPARK-19145:
------------------------------------

             Summary: Timestamp to String casting is slowing the query 
significantly
                 Key: SPARK-19145
                 URL: https://issues.apache.org/jira/browse/SPARK-19145
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: gagan taneja


i have a time series table with timestamp column 

Following query
SELECT COUNT(*) AS `count`
               FROM `default`.`table`
               WHERE `time` >= '2017-01-02 19:53:51'
AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

is significantly SLOWER than 

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD 
HH24:MI:SS−0800')
  AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD 
HH24:MI:SS−0800') LIMIT 50000


After investigation i found that in the first query time colum is cast to 
String before applying the filter 
However in the second query no such casting is performed and its a filter with 
long value 

Below are the generate Physical plan for slower execution followed by physical 
plan for faster execution 

SELECT COUNT(*) AS `count`
               FROM `default`.`table`
               WHERE `time` >= '2017-01-02 19:53:51'
AND `time` <= '2017-01-09 19:53:51' LIMIT 50000

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], 
output=[count#3339L])
         +- *Project
            +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 
2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
               +- *FileScan parquet default.cstat[time#3314] Batched: true, 
Format: Parquet, Location: 
InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
struct<time:timestamp>

SELECT COUNT(*) AS `count`
FROM `default`.`table`
WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','YYYY-MM-DD 
HH24:MI:SS−0800')
  AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','YYYY-MM-DD 
HH24:MI:SS−0800') LIMIT 50000

== Physical Plan ==
CollectLimit 50000
+- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
   +- Exchange SinglePartition
      +- *HashAggregate(keys=[], functions=[partial_count(1)], 
output=[count#3287L])
         +- *Project
            +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
1483404831000000)) && (time#3262 <= 1484009631000000))
               +- *FileScan parquet default.cstat[time#3262] Batched: true, 
Format: Parquet, Location: 
InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
PartitionFilters: [], PushedFilters: [IsNotNull(time), 
GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
LessThanOrEqual(time,2017-01-09..., ReadSchema: struct<time:timestamp>

In Impala both query run efficiently without and performance difference
Spark should be able to parse the Date string and convert to Long/Timestamp 
during generation of Optimized Logical Plan so that both the query would have 
similar performance




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to