Felix/Jordan, 1 - 2 is exactly what I was looking for as well. I want to expose web service calls to Kafka/Samza. Since there is no concept of a session, I was wondering how to send enriched data back to the web service request. Or am I way off on this? Meaning, is this a completely wrong use case for Kafka/Samza?
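(For what it's worth, the usual answer to the "no session" problem is a correlation id: the web tier tags each request message with an id, the stream job copies that id onto the enriched output, and a reply consumer matches it back to the parked HTTP request. Below is a minimal sketch of that pattern; all names are hypothetical and two in-memory queues stand in for the request and reply Kafka topics.)

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of correlation-id request/reply for fronting Kafka/Samza with a
 * web service. The BlockingQueue stands in for a "requests" Kafka topic;
 * the pending map plays the role of the reply-topic consumer's routing
 * table on the web tier.
 */
public class RequestReplySketch {
    // "requests" topic: web tier produces, the stream job consumes
    static final BlockingQueue<String[]> requests = new ArrayBlockingQueue<>(16);
    // parked web requests, keyed by correlation id
    static final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();

    // Web tier: attach a correlation id, publish the request, and park the
    // HTTP request (e.g. via an async servlet) until the enriched reply arrives.
    static CompletableFuture<String> handleWebRequest(String payload) {
        String correlationId = UUID.randomUUID().toString();
        CompletableFuture<String> reply = new CompletableFuture<>();
        pending.put(correlationId, reply);
        requests.add(new String[] {correlationId, payload});
        return reply;
    }

    // "Stream job": consume, enrich, and emit to the reply topic; here the
    // reply-topic consumer is inlined and completes the parked future.
    static void enrichOneMessage() throws InterruptedException {
        String[] msg = requests.take();
        String enriched = msg[1] + " [enriched]";
        CompletableFuture<String> waiter = pending.remove(msg[0]);
        if (waiter != null) waiter.complete(enriched);
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> reply = handleWebRequest("order-42");
        enrichOneMessage();
        System.out.println(reply.join()); // prints "order-42 [enriched]"
    }
}
```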
- Shekar

On Fri, Mar 27, 2015 at 12:42 PM, Jordan Shaw <jor...@pubnub.com> wrote:

> Felix,
> Here are my thoughts below.
>
> 1 - 2) I think a majority of Samza applications are internal so far.
> However, I've developed a Samza publisher for PubNub that would allow you
> to send data from process or window out over a data stream network. Right
> now it looks something like this:
>
> (.send collector (OutgoingMessageEnvelope. (SystemStream.
> "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
>
> At smaller scale you could do the same with socket.io etc. If you're
> interested in this I can send you the src or jar. If there is wider
> interest I can open source it on GitHub, but it needs some cleanup first.
>
> 3) We currently don't have the need to warehouse our stream, but we have
> thought about piping Samza-generated data into some Hadoop-based system
> for longer-term analysis, then running Hive queries or something similar
> over that data.
>
> 4) I can't comment on the throughput of the other systems (HBase etc.),
> but our Kafka/Samza throughput is pretty impressive considering the
> single-threaded nature of the system. We are seeing raw throughput per
> partition of well over 10MB/s.
>
> 5) I haven't run into this. To prevent data loss/backup when we can't
> process a message, we have considered dropping it into an "unprocessed"
> topic, but we haven't really run into this need. If you needed to
> reprocess all raw data it would be pretty straightforward; you could just
> add a partition to support the extra load.
>
> 6) Kafka is pretty good at ingesting things, so could you elaborate more
> on this?
>
> On Fri, Mar 27, 2015 at 9:52 AM, Felix GV <fville...@linkedin.com.invalid>
> wrote:
>
> > Hi Samza devs, users and enthusiasts,
> >
> > I've kept an eye on the Samza project for a while and I think it's super
> > cool!
> > I hope it continues to mature and expand, as it seems very promising (:
> >
> > One thing I've been wondering for a while is: how do people serve the
> > data they computed on Samza? More specifically:
> >
> > 1. How do you expose the output of Samza jobs to online applications
> > that need low-latency reads?
> > 2. Are these online apps mostly internal (i.e. analytics, dashboards,
> > etc.) or public/user-facing?
> > 3. What systems do you currently use (or plan to use in the short term)
> > to host the data generated in Samza? HBase? Cassandra? MySQL? Druid?
> > Others?
> > 4. Are you satisfied, or are you facing challenges, in terms of the
> > write throughput supported by these storage/serving systems? What about
> > read throughput?
> > 5. Are there situations where you wish to re-process all historical
> > data when making improvements to your Samza job, which results in the
> > need to re-ingest all of the Samza output into your online serving
> > system (as described in the Kappa Architecture
> > <http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html>)?
> > Is this easy breezy or painful? Do you need to throttle it lest your
> > serving system fall over?
> > 6. If there was a highly optimized and reliable way of ingesting
> > partitioned streams quickly into your online serving system, would that
> > help you leverage Samza more effectively?
> >
> > Your insights would be much appreciated!
> >
> > Thanks (:
> >
> > --
> > Felix
>
>
> --
> Jordan Shaw
> Full Stack Software Engineer
> PubNub Inc
> 1045 17th St
> San Francisco, CA 94107
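For readers less familiar with Clojure interop syntax, Jordan's send call maps onto Samza's Java API roughly as below. This is a sketch, not his actual publisher: the classes here are simplified stand-ins mirroring the shapes of org.apache.samza.system.SystemStream, OutgoingMessageEnvelope, and MessageCollector so the snippet runs without a Samza dependency, and splitting "pubnub.some-channel" into a system name and stream name is an assumption (the real SystemStream constructor takes system and stream as separate arguments).

```java
import java.util.Map;

// Stand-ins for the Samza classes (org.apache.samza.system.*), so this
// sketch runs standalone; in a real job you would import the real ones.
class SystemStream {
    final String system, stream;
    SystemStream(String system, String stream) { this.system = system; this.stream = stream; }
}

class OutgoingMessageEnvelope {
    final SystemStream systemStream;
    final Object key, message;
    OutgoingMessageEnvelope(SystemStream ss, Object key, Object message) {
        this.systemStream = ss; this.key = key; this.message = message;
    }
}

interface MessageCollector {
    void send(OutgoingMessageEnvelope envelope);
}

public class PubNubSendSketch {
    // Rough Java equivalent of the Clojure:
    // (.send collector (OutgoingMessageEnvelope.
    //   (SystemStream. "pubnub.some-channel") {:pub_key demo :sub_key demo} some-data))
    static void sendEnriched(MessageCollector collector, Object someData) {
        collector.send(new OutgoingMessageEnvelope(
            new SystemStream("pubnub", "some-channel"),
            Map.of("pub_key", "demo", "sub_key", "demo"), // hypothetical key payload
            someData));
    }

    public static void main(String[] args) {
        // A toy collector that just prints the envelope instead of producing it.
        sendEnriched(env -> System.out.println(
            env.systemStream.system + "/" + env.systemStream.stream + ": " + env.message),
            "hello"); // prints "pubnub/some-channel: hello"
    }
}
```

In a real deployment the "pubnub" system name would be backed by a SystemProducer registered in the job config, which is presumably what Jordan's publisher implements.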