[ https://issues.apache.org/jira/browse/SPARK-26841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16763149#comment-16763149 ]
Jungtaek Lim commented on SPARK-26841:
--------------------------------------

[~Bartalos] Are you working on the patch? I ask because I'm interested in addressing offsets by timestamp, though my first goal is not pushdown but an alternative to startingOffsets/endingOffsets.

> Timestamp pushdown on Kafka table
> ---------------------------------
>
>                 Key: SPARK-26841
>                 URL: https://issues.apache.org/jira/browse/SPARK-26841
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.4.0
>            Reporter: Tomas Bartalos
>            Priority: Major
>              Labels: Kafka, pushdown, timestamp
>
> As a Spark user I'd like to have fast queries on a Kafka table restricted by
> timestamp.
> I'd like to have quick answers to questions like:
>  * What was inserted into Kafka in the past x minutes
>  * What was inserted into Kafka in a specified time range
> Example:
> {quote}select * from kafka_table where timestamp >
> from_unixtime(unix_timestamp() - 5 * 60, "yyyy-MM-dd HH:mm:ss")
> select * from kafka_table where timestamp > $from_time and timestamp <
> $end_time
> {quote}
> Currently timestamp restrictions are not pushed down to KafkaRelation, and
> querying by timestamp on a large Kafka topic takes forever to complete.
> *Technical solution*
> Technically it's possible to retrieve Kafka's offsets for a provided timestamp
> with the org.apache.kafka.clients.consumer.Consumer#offsetsForTimes(..) method.
> Afterwards we can query the Kafka topic by the retrieved offset ranges.
> Querying by offset range is already implemented, so this change should have
> minor impact.
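
For illustration only, the lookup described under "Technical solution" could be sketched roughly as below. This is not the Spark or Kafka implementation: the in-memory `Log` type and the helpers `offsets_for_time` and `offset_ranges` are hypothetical stand-ins. The first helper only mimics the documented contract of Consumer#offsetsForTimes (per partition, the earliest offset whose timestamp is greater than or equal to the requested one, or nothing if no such record exists). The sketch also assumes that timestamps do not decrease within a partition; with out-of-order timestamps the end bound could cut off matching records.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical in-memory stand-in for a topic:
# partition id -> list of (offset, timestamp_ms), ordered by offset.
Log = Dict[int, List[Tuple[int, int]]]


def offsets_for_time(log: Log, timestamp_ms: int) -> Dict[int, Optional[int]]:
    """Mimic the offsetsForTimes contract: for each partition, the earliest
    offset whose timestamp is >= timestamp_ms, or None if there is none."""
    return {
        partition: next(
            (offset for offset, ts in records if ts >= timestamp_ms), None
        )
        for partition, records in log.items()
    }


def offset_ranges(log: Log, from_ms: int, until_ms: int) -> Dict[int, Tuple[int, int]]:
    """Translate a timestamp window into half-open [start, end) offset ranges,
    one per partition. Assumes non-decreasing timestamps per partition."""
    starts = offsets_for_time(log, from_ms)
    ends = offsets_for_time(log, until_ms)
    ranges = {}
    for partition, records in log.items():
        # One past the last offset; used when no record reaches the timestamp.
        log_end = records[-1][0] + 1 if records else 0
        start = starts[partition] if starts[partition] is not None else log_end
        end = ends[partition] if ends[partition] is not None else log_end
        ranges[partition] = (start, max(start, end))
    return ranges


if __name__ == "__main__":
    sample: Log = {
        0: [(0, 1000), (1, 2000), (2, 3000), (3, 4000)],
        1: [(10, 1500), (11, 2500)],
    }
    print(offset_ranges(sample, 2000, 4000))
```

The resulting ranges only narrow the scan. Because the lookup uses "greater than or equal", a strict predicate such as `timestamp > $from_time` would still have to be applied to the rows that are read.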