Hi,

We are building an internal analytics application, essentially an event store, covering the usual analytics use cases: filtering, aggregation, segmentation, etc. So far our architecture has relied heavily on Elasticsearch, but it no longer scales for us. One hard requirement we have is that an event must be available for querying within 5 seconds of occurring. We were considering a lambda architecture: streaming data still goes to Elasticsearch (holding only the last day's data), while a batch pipeline writes to S3, and once a day a Spark job transforms that data and stores it back in S3. One problem we were not able to solve is how, when a query arrives, to aggregate results from the two data sources (Elasticsearch for current data and S3 for historical data). We felt this approach won't scale.
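As a side note on the merge problem above: combining results from two stores is tractable when the aggregate is decomposable into partial aggregates that can be merged (counts, sums, min/max). A minimal Python sketch, with hypothetical per-store results standing in for Elasticsearch and S3:

```python
from collections import Counter

def merge_counts(hot, cold):
    # Partial per-key counts from the "hot" store (e.g. Elasticsearch,
    # last 24 h) and the "cold" store (e.g. S3 history) can simply be
    # summed, because count is a commutative, associative aggregate.
    merged = Counter(cold)
    merged.update(hot)
    return dict(merged)

# Hypothetical partial results from the two stores:
hot = {"signup": 12, "click": 340}          # last 24 h
cold = {"signup": 9000, "click": 250000}    # history
print(merge_counts(hot, cold))
```

Averages need (sum, count) pairs merged the same way, and distinct counts need mergeable sketches (e.g. HyperLogLog); non-decomposable aggregates are where this serving-layer merge gets hard.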
Spark Structured Streaming seems to solve this; correct me if I am wrong. Would the following architecture work with Structured Streaming? Read data from Kafka using Spark; for every micro-batch, apply the transformations and store the result in S3; then, when a query arrives, query both S3 and the in-memory batch at the same time. Will this approach work? One more condition: queries must respond immediately, with a maximum latency of 1 second for simple queries and 5 seconds for complex ones. If the above is not the right way, please suggest an alternative.

Thanks,
Aravindh.S

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-in-validating-an-architecture-using-Structured-Streaming-tp27801.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
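For concreteness, the Kafka-to-S3 leg described above could be wired up roughly as follows. This is only a sketch: the broker address, topic name, S3 paths, and query name are all hypothetical, it assumes a running Kafka cluster and S3 credentials, and the memory sink is suitable only for low-volume serving of the most recent data, not as a production hot store.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("event-ingest").getOrCreate()

# Read the raw event stream from Kafka (broker and topic are hypothetical).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS json", "timestamp"))

# Append each transformed micro-batch to Parquet on S3; the checkpoint
# location gives exactly-once file output across restarts.
(events.writeStream
    .format("parquet")
    .option("path", "s3a://analytics/events/")           # hypothetical bucket
    .option("checkpointLocation", "s3a://analytics/_chk/")
    .trigger(processingTime="5 seconds")
    .start())

# Keep recent micro-batches queryable in memory as a table.
(events.writeStream
    .format("memory")
    .queryName("recent_events")
    .outputMode("append")
    .start())

# A query could then union the in-memory table with the S3 history:
#   spark.table("recent_events").union(
#       spark.read.parquet("s3a://analytics/events/"))
```

Whether the union query meets the 1 s / 5 s latency targets depends on the S3 scan cost, which is the part of the design that most needs validation.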