[ https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tomas updated SPARK-26841:
--------------------------
    Affects Version/s:     (was: 2.5.0)
                       2.4.0

> Timestamp pushdown on Kafka table
> ---------------------------------
>
>                 Key: SPARK-26841
>                 URL: https://issues.apache.org/jira/browse/SPARK-26841
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tomas
>            Priority: Major
>              Labels: Kafka, pushdown, timestamp
>
> As a Spark user, I'd like fast queries on a Kafka table restricted by
> timestamp.
> I'd like quick answers to questions like "What was inserted into Kafka in
> the past x minutes?" or "What was inserted into Kafka in a specified time
> range?"
> Example:
> {quote}select * from kafka_table where timestamp >
> from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp <
> $end_time
> {quote}
> Currently, timestamp restrictions are not pushed down to KafkaRelation, so
> a query by timestamp has to read the whole topic and takes a very long time
> to complete on a large Kafka topic.
> *Technical solution*
> Technically, it is possible to retrieve Kafka offsets for a given timestamp
> with the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..)
> method. Afterwards, we can query the Kafka topic by the retrieved offset
> ranges.
> Querying by offset range is already implemented, so this change should have
> just a minor impact.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
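
The technical solution in the issue body (turn the timestamp bounds into offsets, then read only that offset range) can be illustrated with a small sketch. This is not Spark or Kafka code: the partition is modeled as an in-memory list of record timestamps indexed by offset, and the names offset_for_time and offset_range are hypothetical. It assumes offsets start at 0 and that timestamps within the partition are non-decreasing. The lookup follows the documented contract of Consumer#offsetsForTimes, which returns the earliest offset whose timestamp is greater than or equal to the requested one, or nothing when no such record exists.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple


def offset_for_time(timestamps: List[int], target_ms: int) -> Optional[int]:
    """Earliest offset whose timestamp is >= target_ms, or None if there is none.

    Stand-in for a single-partition Consumer#offsetsForTimes lookup.
    Assumption: timestamps are non-decreasing, so binary search is valid.
    """
    idx = bisect_left(timestamps, target_ms)
    return idx if idx < len(timestamps) else None


def offset_range(timestamps: List[int], from_ms: int, end_ms: int) -> Tuple[int, int]:
    """Half-open offset range [start, end) for: from_ms < timestamp < end_ms."""
    log_end = len(timestamps)  # offset one past the last record
    # Strict lower bound: first record with timestamp >= from_ms + 1.
    start = offset_for_time(timestamps, from_ms + 1)
    # Strict upper bound: stop before the first record with timestamp >= end_ms.
    end = offset_for_time(timestamps, end_ms)
    start = log_end if start is None else start
    end = log_end if end is None else end
    return start, max(start, end)  # never return an inverted range


# Hypothetical partition: record timestamps in milliseconds, indexed by offset.
partition = [1000, 1500, 2000, 2000, 2600, 3100]
start, end = offset_range(partition, 1500, 3000)
selected = partition[start:end]  # only these offsets would need to be read
```

In a real connector the two lookups would be delegated to the Kafka consumer per topic partition, and the original timestamp filter would probably still be applied to the rows that are read, since record timestamps in a real partition are not guaranteed to be ordered.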