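If you use Spark direct streams, you get an end-to-end delivery guarantee for messages: Spark tracks the Kafka offsets itself, so a failed batch is re-read rather than lost, and with idempotent writes (upserts) the replay does not create duplicates.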
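
To make the replay argument concrete, here is a minimal sketch. It is not Spark or Kafka code; it only simulates the behaviour, with Python used for brevity. Everything in it is an assumption made for illustration: a list stands in for a Kafka partition, a dict for a keyed table that receives upserts, and the names `upsert`, `insert`, `run_batch` and `consume` are hypothetical. It shows that when a batch crashes and is replayed from the last committed offset, upserts leave the table unchanged while plain inserts duplicate a row.

```python
# Illustrative simulation only: a list stands in for a Kafka partition,
# a dict for a keyed table (upsert target), a list for an insert-only table.

def upsert(table, record):
    """Idempotent write: the same key always overwrites the same row."""
    table[record["id"]] = record["value"]

def insert(rows, record):
    """Non-idempotent write: every call appends a new row."""
    rows.append((record["id"], record["value"]))

def run_batch(partition, start, end, write, fail_at=None):
    """Write records at offsets [start, end) and return the new committed offset.

    If fail_at is given, raise before writing that offset, so the offset is
    never committed and the caller has to replay the whole batch.
    """
    for offset in range(start, end):
        if offset == fail_at:
            raise RuntimeError(f"simulated crash at offset {offset}")
        write(partition[offset])
    return end

def consume(partition, write, batch_size=2, fail_at=None):
    """Process the partition batch by batch, replaying a batch after a crash."""
    committed = 0
    while committed < len(partition):
        end = min(committed + batch_size, len(partition))
        try:
            committed = run_batch(partition, committed, end, write, fail_at)
        except RuntimeError:
            fail_at = None  # the crash happens once; retry from the committed offset
    return committed

partition = [{"id": f"k{i}", "value": i * 10} for i in range(5)]

table = {}
consume(partition, lambda r: upsert(table, r), fail_at=3)

rows = []
consume(partition, lambda r: insert(rows, r), fail_at=3)

print(len(table), len(rows))  # prints: 5 6
```

The record at offset 2 is written twice in both runs, because the batch covering offsets 2 and 3 crashes before its offset is committed. The upsert table still ends up with five rows; the insert-only table ends up with six.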

On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar <ali.rac...@gmail.com> wrote:
> My concern with Postgres / Cassandra is only scalability. I will look
> further into Postgres horizontal scaling, thanks.
>
> Writes could be idempotent if done as upserts, otherwise updates will be
> idempotent but not inserts.
>
> Data should not be lost. The system should be as fault tolerant as
> possible.
>
> What's the advantage of using Spark for reading Kafka instead of direct
> Kafka consumers?
>
> On Thu, Sep 29, 2016 at 8:28 PM, Cody Koeninger <c...@koeninger.org>
> wrote:
>
>> I wouldn't give up the flexibility and maturity of a relational
>> database, unless you have a very specific use case. I'm not trashing
>> cassandra, I've used cassandra, but if all I know is that you're doing
>> analytics, I wouldn't want to give up the ability to easily do ad-hoc
>> aggregations without a lot of forethought. If you're worried about
>> scaling, there are several options for horizontally scaling Postgres
>> in particular. One of the current best from what I've worked with is
>> Citus.
>>
>> On Thu, Sep 29, 2016 at 10:15 AM, Deepak Sharma <deepakmc...@gmail.com>
>> wrote:
>> > Hi Cody
>> > Spark direct stream is just fine for this use case.
>> > But why postgres and not cassandra?
>> > Is there anything specific here that I may not be aware of?
>> >
>> > Thanks
>> > Deepak
>> >
>> > On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger <c...@koeninger.org>
>> > wrote:
>> >>
>> >> How are you going to handle etl failures? Do you care about lost /
>> >> duplicated data? Are your writes idempotent?
>> >>
>> >> Absent any other information about the problem, I'd stay away from
>> >> cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream
>> >> feeding postgres.
>> >>
>> >> On Thu, Sep 29, 2016 at 10:04 AM, Ali Akhtar <ali.rac...@gmail.com>
>> >> wrote:
>> >> > Is there an advantage to that vs directly consuming from Kafka?
>> >> > Nothing is being done to the data except some light ETL and then
>> >> > storing it in Cassandra
>> >> >
>> >> > On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma
>> >> > <deepakmc...@gmail.com> wrote:
>> >> >>
>> >> >> It's better you use spark's direct stream to ingest from kafka.
>> >> >>
>> >> >> On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar <ali.rac...@gmail.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> I don't think I need a different speed storage and batch storage.
>> >> >>> Just taking in raw data from Kafka, standardizing, and storing it
>> >> >>> somewhere where the web UI can query it, seems like it will be
>> >> >>> enough.
>> >> >>>
>> >> >>> I'm thinking about:
>> >> >>>
>> >> >>> - Reading data from Kafka via Spark Streaming
>> >> >>> - Standardizing, then storing it in Cassandra
>> >> >>> - Querying Cassandra from the web ui
>> >> >>>
>> >> >>> That seems like it will work. My question now is whether to use
>> >> >>> Spark Streaming to read Kafka, or use Kafka consumers directly.
>> >> >>>
>> >> >>> On Thu, Sep 29, 2016 at 7:41 PM, Mich Talebzadeh
>> >> >>> <mich.talebza...@gmail.com> wrote:
>> >> >>>>
>> >> >>>> - Spark Streaming to read data from Kafka
>> >> >>>> - Storing the data on HDFS using Flume
>> >> >>>>
>> >> >>>> You don't need Spark streaming to read data from Kafka and store
>> >> >>>> on HDFS. It is a waste of resources.
>> >> >>>>
>> >> >>>> Couple Flume to use Kafka as source and HDFS as sink directly
>> >> >>>>
>> >> >>>> KafkaAgent.sources = kafka-sources
>> >> >>>> KafkaAgent.sinks.hdfs-sinks.type = hdfs
>> >> >>>>
>> >> >>>> That will be for your batch layer. To analyse you can directly
>> >> >>>> read from hdfs files with Spark or simply store data in a
>> >> >>>> database of your choice via cron or something. Do not mix your
>> >> >>>> batch layer with speed layer.
>> >> >>>>
>> >> >>>> Your speed layer will ingest the same data directly from Kafka
>> >> >>>> into spark streaming and that will be online or near real time
>> >> >>>> (defined by your window).
>> >> >>>>
>> >> >>>> Then you have a serving layer to present data from both speed
>> >> >>>> (the one from SS) and batch layer.
>> >> >>>>
>> >> >>>> HTH
>> >> >>>>
>> >> >>>> Dr Mich Talebzadeh
>> >> >>>>
>> >> >>>> LinkedIn
>> >> >>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> >> >>>>
>> >> >>>> http://talebzadehmich.wordpress.com
>> >> >>>>
>> >> >>>> Disclaimer: Use it at your own risk. Any and all responsibility
>> >> >>>> for any loss, damage or destruction of data or any other
>> >> >>>> property which may arise from relying on this email's technical
>> >> >>>> content is explicitly disclaimed. The author will in no case be
>> >> >>>> liable for any monetary damages arising from such loss, damage
>> >> >>>> or destruction.
>> >> >>>>
>> >> >>>> On 29 September 2016 at 15:15, Ali Akhtar <ali.rac...@gmail.com>
>> >> >>>> wrote:
>> >> >>>>>
>> >> >>>>> The web UI is actually the speed layer, it needs to be able to
>> >> >>>>> query the data online, and show the results in real-time.
>> >> >>>>>
>> >> >>>>> It also needs a custom front-end, so a system like Tableau
>> >> >>>>> can't be used, it must have a custom backend + front-end.
>> >> >>>>>
>> >> >>>>> Thanks for the recommendation of Flume. Do you think this will
>> >> >>>>> work:
>> >> >>>>>
>> >> >>>>> - Spark Streaming to read data from Kafka
>> >> >>>>> - Storing the data on HDFS using Flume
>> >> >>>>> - Using Spark to query the data in the backend of the web UI?
>> >> >>>>>
>> >> >>>>> On Thu, Sep 29, 2016 at 7:08 PM, Mich Talebzadeh
>> >> >>>>> <mich.talebza...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>> You need a batch layer and a speed layer. Data from Kafka can
>> >> >>>>>> be stored on HDFS using flume.
>> >> >>>>>>
>> >> >>>>>> - Query this data to generate reports / analytics (There will
>> >> >>>>>> be a web UI which will be the front-end to the data, and will
>> >> >>>>>> show the reports)
>> >> >>>>>>
>> >> >>>>>> This is basically batch layer and you need something like
>> >> >>>>>> Tableau or Zeppelin to query data
>> >> >>>>>>
>> >> >>>>>> You will also need spark streaming to query data online for
>> >> >>>>>> speed layer. That data could be stored in some transient
>> >> >>>>>> fabric like ignite or even druid.
>> >> >>>>>>
>> >> >>>>>> HTH
>> >> >>>>>>
>> >> >>>>>> Dr Mich Talebzadeh
>> >> >>>>>>
>> >> >>>>>> On 29 September 2016 at 15:01, Ali Akhtar
>> >> >>>>>> <ali.rac...@gmail.com> wrote:
>> >> >>>>>>>
>> >> >>>>>>> It needs to be able to scale to a very large amount of data,
>> >> >>>>>>> yes.
>> >> >>>>>>>
>> >> >>>>>>> On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma
>> >> >>>>>>> <deepakmc...@gmail.com> wrote:
>> >> >>>>>>>>
>> >> >>>>>>>> What is the message inflow?
>> >> >>>>>>>> If it's really high, definitely spark will be of great use.
>> >> >>>>>>>>
>> >> >>>>>>>> Thanks
>> >> >>>>>>>> Deepak
>> >> >>>>>>>>
>> >> >>>>>>>> On Sep 29, 2016 19:24, "Ali Akhtar" <ali.rac...@gmail.com>
>> >> >>>>>>>> wrote:
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have a somewhat tricky use case, and I'm looking for
>> >> >>>>>>>>> ideas.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I have 5-6 Kafka producers, reading various APIs, and
>> >> >>>>>>>>> writing their raw data into Kafka.
>> >> >>>>>>>>>
>> >> >>>>>>>>> I need to:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Do ETL on the data, and standardize it.
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Store the standardized data somewhere (HBase / Cassandra /
>> >> >>>>>>>>> Raw HDFS / ElasticSearch / Postgres)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Query this data to generate reports / analytics (There
>> >> >>>>>>>>> will be a web UI which will be the front-end to the data,
>> >> >>>>>>>>> and will show the reports)
>> >> >>>>>>>>>
>> >> >>>>>>>>> Java is being used as the backend language for everything
>> >> >>>>>>>>> (backend of the web UI, as well as the ETL layer)
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'm considering:
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using raw Kafka consumers, or Spark Streaming, as the ETL
>> >> >>>>>>>>> layer (receive raw data from Kafka, standardize & store it)
>> >> >>>>>>>>>
>> >> >>>>>>>>> - Using Cassandra, HBase, or raw HDFS, for storing the
>> >> >>>>>>>>> standardized data, and to allow queries
>> >> >>>>>>>>>
>> >> >>>>>>>>> - In the backend of the web UI, I could either use Spark to
>> >> >>>>>>>>> run queries across the data (mostly filters), or directly
>> >> >>>>>>>>> run queries against Cassandra / HBase
>> >> >>>>>>>>>
>> >> >>>>>>>>> I'd appreciate some thoughts / suggestions on which of these
>> >> >>>>>>>>> alternatives I should go with (e.g., using raw Kafka
>> >> >>>>>>>>> consumers vs Spark for ETL, which persistent data store to
>> >> >>>>>>>>> use, and how to query that data store in the backend of the
>> >> >>>>>>>>> web UI, for displaying the reports).
>> >> >>>>>>>>>
>> >> >>>>>>>>> Thanks.
>> >> >>
>> >> >> --
>> >> >> Thanks
>> >> >> Deepak
>> >> >> www.bigdatabig.com
>> >> >> www.keosha.net
>> >
>> > --
>> > Thanks
>> > Deepak
>> > www.bigdatabig.com
>> > www.keosha.net

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net