Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
OP mentioned HBase or HDFS as persisted storage. Therefore they have to be running YARN if they are considering Spark. (Assuming that you’re not trying to do a storage / compute model and use standalone Spark outside your cluster. You can, but you have more moving parts…) I never said

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
The OP didn't say anything about Yarn, and why are you contemplating putting Kafka or Spark on public networks to begin with? Gwen's right, absent any actual requirements this is kind of pointless. On Thu, Sep 29, 2016 at 1:27 PM, Michael Segel wrote: > Spark

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Spark standalone is not Yarn… or secure for that matter… ;-) > On Sep 29, 2016, at 11:18 AM, Cody Koeninger wrote: > > Spark streaming helps with aggregation because > > A. raw kafka consumers have no built in framework for shuffling > amongst nodes, short of writing into

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Michael, How about Druid here? Hive ORC tables are another option; they have streaming data ingest via Flume and Storm. However, Spark cannot read ORC transactional tables because of delta files, unless

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
> I still don't understand why writing to a transactional database with locking > and concurrency (read and writes) through JDBC will be fast for this sort of > data ingestion. Who cares about fast if your data is wrong? And it's still plenty fast enough

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
Spark streaming helps with aggregation because A. raw kafka consumers have no built in framework for shuffling amongst nodes, short of writing into an intermediate topic (I'm not touching Kafka Streams here, I don't have experience), and B. it deals with batches, so you can transactionally

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
The way I see this, there are three things involved. 1. Data ingestion through source to Kafka 2. Data conversion and storage ETL/ELT 3. Presentation Item 2 is the one that needs to be designed correctly. I presume raw data has to conform to some form of MDM that requires schema mapping

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Michael Segel
Ok… so what’s the tricky part? Spark Streaming isn’t real time so if you don’t mind a slight delay in processing… it would work. The drawback is that you now have a long running Spark Job (assuming under YARN) and that could become a problem in terms of security and resources. (How well does

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The business use case is to read a user's data from a variety of different services through their APIs, and then allow the user to query that data, on a per-service basis, as well as as an aggregate across all services. The way I'm considering doing it, is to do some basic ETL (drop all the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
No, direct stream in and of itself won't ensure an end-to-end guarantee, because it doesn't know anything about your output actions. You still need to do some work. The point is having easy access to offsets for batches on a per-partition basis makes it easier to do that work, especially in
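The pattern Cody is pointing at — committing a batch's results and its Kafka offsets in the same transaction, so neither can persist without the other — can be sketched against any transactional store. A minimal illustration below uses SQLite as a stand-in for Postgres; the table names and the shape of `counts` / `offset_ranges` are invented for the example, not taken from the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggregates (agg_key TEXT PRIMARY KEY, total INTEGER)")
conn.execute(
    "CREATE TABLE kafka_offsets (topic TEXT, part INTEGER, last_offset INTEGER, "
    "PRIMARY KEY (topic, part))")

def store_batch(conn, counts, offset_ranges):
    # One transaction covers both the aggregates and the offsets, so a crash
    # mid-write can never persist results without remembering how far we read
    # (or vice versa). On restart, resume from the stored offsets.
    with conn:  # commits on success, rolls back on exception
        for key, n in counts.items():
            conn.execute(
                "INSERT INTO aggregates VALUES (?, ?) "
                "ON CONFLICT(agg_key) DO UPDATE SET total = total + excluded.total",
                (key, n))
        for topic, part, last_offset in offset_ranges:
            conn.execute(
                "INSERT OR REPLACE INTO kafka_offsets VALUES (?, ?, ?)",
                (topic, part, last_offset))

# One micro-batch: 3 clicks counted, partition 0 of "events" read up to offset 42.
store_batch(conn, {"clicks": 3}, [("events", 0, 42)])
```

This is exactly the "some work" the direct stream leaves to you: it hands you per-partition offset ranges for each batch, but writing them atomically with the output is your job.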

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Hi Ali, What is the business use case for this? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
If you use Spark direct streams, it ensures an end-to-end guarantee for messages. On Thu, Sep 29, 2016 at 9:05 PM, Ali Akhtar wrote: > My concern with Postgres / Cassandra is only scalability. I will look > further into Postgres horizontal scaling, thanks. > > Writes could

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
Yes, but still these writes from Spark have to go through JDBC? Correct. Having said that, I don't see how doing this through Spark streaming to Postgres is going to be faster than source -> Kafka -> Flume (via ZooKeeper) -> HDFS. I believe there is direct streaming from Kafka to Hive as well and

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
If you're doing any kind of pre-aggregation during ETL, spark direct stream will let you more easily get the delivery semantics you need, especially if you're using a transactional data store. If you're literally just copying individual uniquely keyed items from kafka to a key-value store, use

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
My concern with Postgres / Cassandra is only scalability. I will look further into Postgres horizontal scaling, thanks. Writes could be idempotent if done as upserts, otherwise updates will be idempotent but not inserts. Data should not be lost. The system should be as fault tolerant as
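The upsert idea Ali mentions can be made concrete: if each write is keyed on a stable record id, replaying a redelivered message is harmless. A minimal sketch, again using SQLite in place of Postgres (the `events` table and `evt-001` id are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, payload TEXT)")

def upsert_event(conn, event_id, payload):
    # Keyed on event_id: replaying the same message rewrites the same row
    # instead of inserting a duplicate, which makes the write idempotent.
    with conn:
        conn.execute(
            "INSERT INTO events (event_id, payload) VALUES (?, ?) "
            "ON CONFLICT(event_id) DO UPDATE SET payload = excluded.payload",
            (event_id, payload))

upsert_event(conn, "evt-001", '{"service": "github", "stars": 7}')
upsert_event(conn, "evt-001", '{"service": "github", "stars": 7}')  # redelivery: still one row
```

This is why upserts sidestep the "updates idempotent but not inserts" problem: a plain INSERT replayed twice either fails or duplicates, while an upsert converges to the same row either way.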

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
I wouldn't give up the flexibility and maturity of a relational database, unless you have a very specific use case. I'm not trashing cassandra, I've used cassandra, but if all I know is that you're doing analytics, I wouldn't want to give up the ability to easily do ad-hoc aggregations without a

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Hi Cody Spark direct stream is just fine for this use case. But why Postgres and not Cassandra? Is there anything specific here that I may not be aware of? Thanks Deepak On Thu, Sep 29, 2016 at 8:41 PM, Cody Koeninger wrote: > How are you going to handle etl failures? Do you

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Cody Koeninger
How are you going to handle etl failures? Do you care about lost / duplicated data? Are your writes idempotent? Absent any other information about the problem, I'd stay away from cassandra/flume/hdfs/hbase/whatever, and use a spark direct stream feeding postgres. On Thu, Sep 29, 2016 at 10:04

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
Is there an advantage to that vs directly consuming from Kafka? Nothing is being done to the data except some light ETL and then storing it in Cassandra On Thu, Sep 29, 2016 at 7:58 PM, Deepak Sharma wrote: > Its better you use spark's direct stream to ingest from kafka.

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
It's better you use Spark's direct stream to ingest from Kafka. On Thu, Sep 29, 2016 at 8:24 PM, Ali Akhtar wrote: > I don't think I need a different speed storage and batch storage. Just > taking in raw data from Kafka, standardizing, and storing it somewhere > where the

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I don't think I need a different speed storage and batch storage. Just taking in raw data from Kafka, standardizing, and storing it somewhere where the web UI can query it, seems like it will be enough. I'm thinking about: - Reading data from Kafka via Spark Streaming - Standardizing, then

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
Since the inflow is huge, Flume would also need to be run with multiple channels in distributed fashion. In that case, the resource utilization will be high as well. Thanks Deepak On Thu, Sep 29, 2016 at 8:11 PM, Mich Talebzadeh wrote: > - Spark

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
- Spark Streaming to read data from Kafka - Storing the data on HDFS using Flume You don't need Spark streaming to read data from Kafka and store on HDFS. It is a waste of resources. Couple Flume to use Kafka as source and HDFS as sink directly: KafkaAgent.sources = kafka-sources
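The config fragment Mich starts can be sketched out as a full Flume agent. This assumes Flume 1.7+ (which ships a Kafka source speaking the new consumer API); the broker address, topic, and HDFS path are placeholders, not values from the thread:

```properties
# Flume agent: Kafka source -> memory channel -> HDFS sink, no Spark involved.
KafkaAgent.sources = kafka-sources
KafkaAgent.channels = mem-channel
KafkaAgent.sinks = hdfs-sink

KafkaAgent.sources.kafka-sources.type = org.apache.flume.source.kafka.KafkaSource
KafkaAgent.sources.kafka-sources.kafka.bootstrap.servers = broker1:9092
KafkaAgent.sources.kafka-sources.kafka.topics = raw-events
KafkaAgent.sources.kafka-sources.channels = mem-channel

KafkaAgent.channels.mem-channel.type = memory

KafkaAgent.sinks.hdfs-sink.type = hdfs
KafkaAgent.sinks.hdfs-sink.hdfs.path = /data/raw-events/%Y-%m-%d
KafkaAgent.sinks.hdfs-sink.hdfs.fileType = DataStream
# Needed so the %Y-%m-%d escapes resolve without a timestamp interceptor.
KafkaAgent.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
KafkaAgent.sinks.hdfs-sink.channel = mem-channel
```

The memory channel trades durability for speed; a file channel would survive agent restarts at the cost of throughput.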

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
For the UI, you need a DB such as Cassandra that is designed to work around queries. Ingest the data to Spark streaming (speed layer) and write to HDFS (for batch layer). Now you have data at rest as well as in motion (real time). From Spark streaming itself, do further processing and write the final

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Alonso Isidoro Roman
"Using Spark to query the data in the backend of the web UI?" Don't do that. I would recommend that the Spark streaming process stores data into some NoSQL or SQL database and the web UI queries data from that database. Alonso Isidoro Roman about.me/alonso.isidoro.roman

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
The web UI is actually the speed layer; it needs to be able to query the data online, and show the results in real-time. It also needs a custom front-end, so a system like Tableau can't be used; it must have a custom backend + front-end. Thanks for the recommendation of Flume. Do you think this

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Mich Talebzadeh
You need a batch layer and a speed layer. Data from Kafka can be stored on HDFS using flume. - Query this data to generate reports / analytics (There will be a web UI which will be the front-end to the data, and will show the reports) This is basically batch layer and you need something like

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
It needs to be able to scale to a very large amount of data, yes. On Thu, Sep 29, 2016 at 7:00 PM, Deepak Sharma wrote: > What is the message inflow ? > If it's really high , definitely spark will be of great use . > > Thanks > Deepak > > On Sep 29, 2016 19:24, "Ali

Re: Architecture recommendations for a tricky use case

2016-09-29 Thread Deepak Sharma
What is the message inflow? If it's really high, definitely Spark will be of great use. Thanks Deepak On Sep 29, 2016 19:24, "Ali Akhtar" wrote: > I have a somewhat tricky use case, and I'm looking for ideas. > > I have 5-6 Kafka producers, reading various APIs, and

Architecture recommendations for a tricky use case

2016-09-29 Thread Ali Akhtar
I have a somewhat tricky use case, and I'm looking for ideas. I have 5-6 Kafka producers, reading various APIs, and writing their raw data into Kafka. I need to: - Do ETL on the data, and standardize it. - Store the standardized data somewhere (HBase / Cassandra / Raw HDFS / ElasticSearch /