> You're right. If you want to keep all data in Kafka without ever deleting 
> them, you'd need to add partitions dynamically (which is currently possible 
> with APIs that back the CLI). On the other hand, using Kafka this way is 
> the wrong approach IMO. If you really need to keep the full event history, 
> keep old events on HDFS or wherever and only the more recent ones in Kafka 
> (where a full replay must first read from HDFS and then from Kafka) or use 
> a journal plugin that is explicitly designed for long-term event storage. 
>

That is what worried me about using Kafka in a situation where I would 
want to keep the events indefinitely (or at least for an unknown amount of 
time). What seemed nice is that I would have a journal/event store and a 
pub-sub solution implemented in one technology - basically I want to work 
around the current limitation of PersistentView. I wanted to use a Kafka 
topic and replay all events from the topic to dynamically added read 
models in my cluster. Maybe in this situation I should stick to 
distributed publish-subscribe in the cluster for current event delivery 
and Cassandra as the journal/snapshot store. I have not read that much 
about Cassandra and the way it stores data, so I do not know whether 
reading all events would be easy.
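A rough sketch of the split replay described in the quote above: older events come from a long-term store (HDFS or similar), newer ones from Kafka. The iterator shapes and the overlap handling by sequence number are my assumptions for illustration, not the plugin's actual API.

```python
# Sketch (illustrative, not plugin code): a full replay reads the archive
# first, then the Kafka tail, skipping events present in both stores.

def full_replay(archived_events, kafka_events):
    """Yield (sequence_nr, event) pairs in order: archive first, then the
    Kafka tail, deduplicated by sequence number."""
    last_seq = 0
    for seq, event in archived_events:
        yield seq, event
        last_seq = seq
    for seq, event in kafka_events:
        if seq > last_seq:  # the overlap was already replayed from the archive
            yield seq, event

# If the archive holds events 1..3 and Kafka still has 3..4, a full
# replay yields each event exactly once, in sequence order.
```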

> The main reason why I developed the Kafka plugin was to integrate my Akka 
> applications in unified log processing architectures as described in Jay 
> Kreps' excellent article 
> <http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying>.
>  
> Also mentioned in this article is a snapshotting strategy that fits typical 
> retention times in Kafka.
>

Thanks for the link. 

> The most interesting next Kafka plugin feature for me to develop is an HDFS 
> integration for long-term event storage (and full event history replay). 
> WDYT?
>

That would be an interesting feature - it would certainly make the Akka + 
Kafka combination viable for more use cases.

On Tuesday, 26 August 2014 19:44:05 UTC+2, Martin Krasser wrote:
>
>  
> On 26.08.14 16:44, Andrzej Dębski wrote:
>  
> My mind must have filtered out the possibility of making snapshots using 
> Views - thanks. 
>
>  About partitions: I suspected as much. The only thing that I am 
> wondering now is whether it is possible to dynamically create partitions 
> in Kafka. AFAIK the number of partitions is set during topic creation (be 
> it programmatically using the API or CLI tools) and there is a CLI tool 
> you can use to modify an existing topic: 
> https://cwiki.apache.org/confluence/display/KAFKA/Replication+tools#Replicationtools-5.AddPartitionTool.
>  
> To keep the invariant "PersistentActor is the only writer to a 
> partitioned journal topic" you would have to create those partitions 
> dynamically (usually you don't know up front how many PersistentActors 
> your system will have) on a per-PersistentActor basis.
>  
>
> You're right. If you want to keep all data in Kafka without ever deleting 
> them, you'd need to add partitions dynamically (which is currently possible 
> with APIs that back the CLI). On the other hand, using Kafka this way is 
> the wrong approach IMO. If you really need to keep the full event history, 
> keep old events on HDFS or wherever and only the more recent ones in Kafka 
> (where a full replay must first read from HDFS and then from Kafka) or use 
> a journal plugin that is explicitly designed for long-term event storage. 
>
> The main reason why I developed the Kafka plugin was to integrate my Akka 
> applications in unified log processing architectures as described in Jay 
> Kreps' excellent article 
> <http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying>.
>  
> Also mentioned in this article is a snapshotting strategy that fits typical 
> retention times in Kafka.
>
>  
>  On the other hand maybe you are assuming that each actor is writing to 
> a different topic 
>  
>
> yes, and the Kafka plugin is currently implemented that way.
>
>  - but I think this solution is not viable because the number of topics 
> is limited by ZK and other factors: 
> http://grokbase.com/t/kafka/users/133v60ng6v/limit-on-number-of-kafka-topic
>  
>
> A more in-depth discussion about these limitations is given at 
> http://www.quora.com/How-many-topics-can-be-created-in-Apache-Kafka with 
> a detailed comment from Jay. I'd say that if you designed your application 
> to run more than a few hundred persistent actors, then the Kafka plugin is 
> probably the wrong choice. I tend to design my applications to have only a 
> small number of persistent actors (which is in contrast to many other 
> discussions on akka-user) which makes the Kafka plugin a good candidate. 
>
> To recap, the Kafka plugin is a reasonable choice if
>
> - frequent snapshotting is done by persistent actors (every day or so)
> - you don't have more than a few hundred persistent actors and
> - your application is a component of a unified log processing architecture 
> (backed by Kafka)
>
> The most interesting next Kafka plugin feature for me to develop is an 
> HDFS integration for long-term event storage (and full event history 
> replay). WDYT?
>
>  
> On Tuesday, 26 August 2014 15:28:47 UTC+2, Martin Krasser wrote: 
>>
>>  Hi Andrzej,
>>
>> On 26.08.14 09:15, Andrzej Dębski wrote:
>>  
>> Hello 
>>
>>  Lately I have been reading about the possibility of using Apache Kafka 
>> as a journal/snapshot store for akka-persistence. 
>>
>> I am aware of the plugin created by Martin Krasser: 
>> https://github.com/krasserm/akka-persistence-kafka/ and I also read 
>> another thread about Kafka as a journal: 
>> https://groups.google.com/forum/#!searchin/akka-user/kakfka/akka-user/iIHmvC6bVrI/zeZJtW0_6FwJ
>>
>>  In both sources I linked, two ideas were presented:
>>
>>  1. Set log retention to 7 days, take snapshots every 3 days (example 
>> values)
>> 2. Set log retention to unlimited.
>>
>>  Here is the first question: in the first case, wouldn't it mean that 
>> persistent views would receive a skewed view of the PersistentActor state 
>> (only events from the last 7 days) - is it really a viable solution? As 
>> far as I know, a PersistentView can only receive events - it can't receive 
>> snapshots from the corresponding PersistentActor (which is good in the 
>> general case).
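The retention concern in this question can be illustrated with the example values from the post (7-day retention); the helper below is illustrative, not plugin code.

```python
# With time-based retention, events older than the window are deleted,
# so a view that can only replay the topic (no snapshots) reconstructs
# its state from the surviving events alone.

RETENTION_DAYS = 7  # example value from the post

def surviving_events(events, now_day):
    """Keep only the events still inside the retention window.
    'events' is a list of (day_written, event) pairs."""
    return [e for day, e in events if now_day - day < RETENTION_DAYS]

# Events written on days 0..9: on day 10 only those from day 4 onwards
# survive, so a replay-only view misses e0..e3.
events = [(day, "e%d" % day) for day in range(10)]
print(surviving_events(events, now_day=10))
```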
>>  
>>
>> PersistentViews can create their own snapshots which are isolated from 
>> the corresponding PersistentActor's snapshots.
>>
>>  
>>  Second question (more directed to Martin): in the thread I linked you 
>> wrote: 
>>
>>   I don't go into Kafka partitioning details here but it is possible to 
>>> implement the journal driver in a way that both a single persistent actor's 
>>> data are partitioned *and* kept in order
>>>
>>
>>   I am very interested in this idea. AFAIK it is not yet implemented in 
>> the current plugin, but I was wondering if you could share the high-level 
>> idea of how you would achieve that (one persistent actor, multiple 
>> partitions, ordering ensured)?
>>  
>>
>> The idea is to
>>
>> - first write events 1 to n to partition 1
>> - then write events n+1 to 2n to partition 2
>> - then write events 2n+1 to 3n to partition 3
>> - ... and so on
>>
>> This works because a PersistentActor is the only writer to a partitioned 
>> journal topic. During replay, you first replay partition 1, then partition 
>> 2 and so on. This should be rather easy to implement in the Kafka journal, 
>> I just didn't have time so far; pull requests are welcome :) Btw, the 
>> Cassandra 
>> journal <https://github.com/krasserm/akka-persistence-cassandra> follows 
>> the very same strategy for scaling with data volume (by using different 
>> partition keys). 
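The segment-per-partition idea above can be sketched as follows; the segment size n and the function names are illustrative assumptions, not the Kafka journal's actual code:

```python
# Route a single persistent actor's events across partitions in fixed-size
# segments: events 1..n go to partition 0, n+1..2n to partition 1, etc.
# Replaying partitions in ascending order then preserves event order,
# because the actor is the only writer to its journal topic.

EVENTS_PER_PARTITION = 3  # "n" in the description above (illustrative)

def partition_for(sequence_nr: int) -> int:
    """Map a 1-based sequence number to its partition segment."""
    return (sequence_nr - 1) // EVENTS_PER_PARTITION

def replay_order(max_sequence_nr: int) -> list:
    """Replay partition 0 first, then 1, and so on; within a partition
    Kafka already preserves the single writer's order."""
    order = []
    for partition in range(partition_for(max_sequence_nr) + 1):
        for seq in range(1, max_sequence_nr + 1):
            if partition_for(seq) == partition:
                order.append(seq)
    return order

# With n = 3, events 1..3 land on partition 0, 4..6 on partition 1, and
# event 7 on partition 2; a full replay visits them as 1, 2, ..., 7.
```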
>>
>> Cheers,
>> Martin
>>
>>  -- 
>> >>>>>>>>>> Read the docs: http://akka.io/docs/
>> >>>>>>>>>> Check the FAQ: 
>> http://doc.akka.io/docs/akka/current/additional/faq.html
>> >>>>>>>>>> Search the archives: https://groups.google.com/group/akka-user
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Akka User List" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to akka-user+...@googlegroups.com.
>> To post to this group, send email to akka...@googlegroups.com.
>> Visit this group at http://groups.google.com/group/akka-user.
>> For more options, visit https://groups.google.com/d/optout.
>>
>>
>> -- 
>> Martin Krasser
>>
>> blog:    http://krasserm.blogspot.com
>> code:    http://github.com/krasserm
>> twitter: http://twitter.com/mrt1nz
>>
>
> -- 
> Martin Krasser
>
> blog:    http://krasserm.blogspot.com
> code:    http://github.com/krasserm
> twitter: http://twitter.com/mrt1nz
>
> 
