[ 
https://issues.apache.org/jira/browse/KAFKA-1403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinyao Hu updated KAFKA-1403:
-----------------------------

    Description: 
Right now, Kafka does not store a timestamp per message. It assumes that all
the messages in the same file have the same timestamp, namely the mtime of
the file. This makes it inefficient to scan all the messages within a time
window, which is a valid use case in a lot of realtime data analysis.

One way to hack around this is to roll a new file at short intervals. However,
this results in a large number of open files (KAFKA-1404), which eventually
crashed the servers.

My guess is that this was not implemented for efficiency reasons. It would
cost an additional four bytes per message, which might be pinned in memory for
fast access. There might be some simple perf optimizations, such as
differential encoding plus variable-length encoding, which should bring the
cost down to 1-2 bytes per message on average.
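
To make the differential plus variable-length encoding idea concrete, here is
a minimal sketch in Python. It is an illustration under stated assumptions,
not existing Kafka code: the function names are hypothetical, timestamps are
assumed to be non-negative integers in non-decreasing order, and the varint
layout (7 payload bits per byte, low bits first) is one common choice rather
than anything the issue specifies.

```python
# Sketch only: delta + variable-length encoding of per-message timestamps.
# All names here are hypothetical; none of this is part of Kafka.

def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def encode_timestamps(timestamps):
    """Store the first timestamp in full, then each delta to the previous."""
    out = bytearray()
    prev = 0
    for ts in timestamps:
        # Assumes non-decreasing timestamps; a negative delta raises.
        out += encode_varint(ts - prev)
        prev = ts
    return bytes(out)


def decode_timestamps(data):
    """Inverse of encode_timestamps."""
    timestamps = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7               # continuation byte: keep accumulating
        else:
            prev += value            # delta complete: rebuild the timestamp
            timestamps.append(prev)
            value = 0
            shift = 0
    return timestamps
```

With this layout a delta below 128 takes one byte and a delta below 16384
takes two, whatever the time unit is, which is roughly where the 1-2 byte
estimate comes from. Out-of-order timestamps would need a signed encoding
(for example zigzag), which this sketch leaves out.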

Let me know if this makes sense. 

  was:
Right now, Kafka does not store a timestamp per message. It assumes that all
the messages in the same file have the same timestamp, namely the mtime of
the file. This makes it inefficient to scan all the messages within a time
window, which is a valid use case in a lot of realtime data analysis.

My guess is that this was not implemented for efficiency reasons. It would
cost an additional four bytes per message, which might be pinned in memory for
fast access. There might be some simple perf optimizations, such as
differential encoding plus variable-length encoding, which should bring the
cost down to 1-2 bytes per message on average.

Let me know if this makes sense. 


> Adding timestamp to kafka index structure
> -----------------------------------------
>
>                 Key: KAFKA-1403
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1403
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Xinyao Hu
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Right now, Kafka does not store a timestamp per message. It assumes that all
> the messages in the same file have the same timestamp, namely the mtime of
> the file. This makes it inefficient to scan all the messages within a time
> window, which is a valid use case in a lot of realtime data analysis.
>
> One way to hack around this is to roll a new file at short intervals.
> However, this results in a large number of open files (KAFKA-1404), which
> eventually crashed the servers.
>
> My guess is that this was not implemented for efficiency reasons. It would
> cost an additional four bytes per message, which might be pinned in memory
> for fast access. There might be some simple perf optimizations, such as
> differential encoding plus variable-length encoding, which should bring the
> cost down to 1-2 bytes per message on average.
>
> Let me know if this makes sense.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
