Thanks for your reply, Roger! Very insightful (:

> 6. If there was a highly-optimized and reliable way of ingesting
> partitioned streams quickly into your online serving system, would that
> help you leverage Samza more effectively?

>> 6. Can you elaborate please?

Sure. The feature set I have in mind is the following:

  *   Provide a thinly-wrapped Kafka producer which does the appropriate 
partitioning and includes useful metadata (such as a production timestamp) 
alongside the payload. This producer would be used in the final step of a 
Samza topology, in order to emit to Kafka the processed/joined/enriched data 
destined for online serving (see the producer sketch after this list).
  *   Provide a consumer process which can be co-located on the same hosts as 
your data serving system. This process consumes from the appropriate partitions 
and checkpoints its offsets on its own. It leverages Kafka batching and 
compression to make consumption very efficient.
  *   For each record, the consumer process issues a put/insert locally to the 
co-located serving process. Since this is a local operation, it is very cheap 
and efficient.
  *   The consumer process can also optionally throttle its insertion rate by 
monitoring some performance metrics of the co-located data serving process. For 
example, if the data serving process exposes a p99 latency via JMX or other 
means, this can be used in a tight feedback loop to back off if read latency 
degrades beyond a certain threshold.
  *   This ingestion platform should be easy to integrate with any 
consistently-routed data serving system, by implementing some simple interfaces 
to let the ingestion system understand the key-to-partition assignment 
strategy, as well as the partition-to-node assignment strategy. Optionally, a 
hook to access performance metrics could also be implemented if throttling is 
deemed important (as described in the previous point). A rough sketch of these 
interfaces follows the list.
  *   Since the consumer lives in its own process, the system benefits from 
good isolation guarantees. The consumer process can be capped to a small heap, 
and its GC pauses are inconsequential for the serving platform. It's also 
possible to bounce the consumer and data serving processes independently of 
each other, if need be.

There are some more nuances and additional features which could be nice to 
have, but that's the general idea.


It seems to me like such a system would be valuable, but I'm wondering what 
other people in the open-source community think, which is why I was interested 
in starting this thread...


Thanks for your feedback!

-F
