RE: using spark to load a data warehouse in real time

2017-03-07 Thread Adaryl Wakefield
Hi Henry, I didn’t catch your email until now. When you wrote to the database, how did you enforce the schema? Did the data frames just spit everything out with the necessary keys? Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685

RE: using spark to load a data warehouse in real time

2017-03-04 Thread Adaryl Wakefield
That does, thanks. I’m starting to think a straight Kafka solution would be more appropriate. Adaryl "Bob" Wakefield, MBA Principal Mass Street Analytics, LLC 913.938.6685 www.massstreet.net www.linkedin.com/in/bobwakefieldmba

RE: using spark to load a data warehouse in real time

2017-03-04 Thread Adaryl Wakefield
For all the work that is necessary to load a warehouse, could not that work be considered a special case of CEP? Real time means I’m trying to get to zero lag between an event happening in the transactional system and someone being able to do analytics on that data but not just from that

Re: using spark to load a data warehouse in real time

2017-03-01 Thread Sam Elamin
Hi Adaryl, Having come from a Web background myself, I completely understand your confusion, so let me try to clarify a few things. First and foremost, Spark is a data processing engine, not a general framework. In the Web applications and frameworks world you load the entities, map them to the UI

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Jörn Franke
I am not sure that Spark Streaming is what you want here. It is for streaming analytics, not for loading into a DWH. You also need to define what real time means and what is needed there - it will differ significantly from client to client. From my experience, just SQL is not enough for the users

RE: using spark to load a data warehouse in real time

2017-02-28 Thread Adaryl Wakefield
I’m actually trying to come up with a generalized use case that I can take from client to client. We have structured data coming from some application. Instead of dropping it into Hadoop and then using yet another technology to query that data, I just want to dump it into a relational MPP DW so

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Henry Tremblay
We did this all the time at my last position. 1. We had unstructured data in S3. 2. We read directly from S3 and then gave structure to the data using a DataFrame in Spark. 3. We wrote the results to S3. 4. We used Redshift's super fast parallel ability to load the results into a table. Henry
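
Step 4 of the list above can be sketched as follows. This assumes the "parallel ability" refers to Redshift's COPY command reading the files Spark wrote to S3; the message does not say so explicitly. The table name, bucket path, and IAM role used in the usage note are hypothetical placeholders, and the helper name build_copy_statement is made up for illustration.

```python
import re

# Accept only plain "table" or "schema.table" names so nothing unexpected
# ends up in the generated SQL.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)?$")


def build_copy_statement(table, s3_prefix, iam_role, fmt="CSV"):
    """Build a Redshift COPY statement for files under an S3 prefix.

    Assumption: the Spark job wrote its results as CSV or Parquet files
    under s3_prefix; the original post does not name a format.
    """
    if not _IDENT.match(table):
        raise ValueError("unexpected table name: %r" % table)
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    fmt = fmt.upper()
    if fmt not in ("CSV", "PARQUET"):
        raise ValueError("unsupported format: %r" % fmt)
    return "COPY %s FROM '%s' IAM_ROLE '%s' FORMAT AS %s;" % (
        table, s3_prefix, iam_role, fmt)
```

The resulting string would be executed against the cluster with any SQL client, for example build_copy_statement("public.events", "s3://example-bucket/out/", "arn:aws:iam::123456789012:role/ExampleLoadRole").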

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
You could try this as a blueprint: Read the data in through Spark Streaming. Iterate over it and convert each RDD into a DataFrame. Use these DataFrames to perform whatever processing is required and then save that DataFrame into your target relational warehouse. HTH Tariq,
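
A rough sketch of that blueprint in PySpark terms, under stated assumptions: spark is a SparkSession, schema is whatever structure the incoming records should have, and save is any function that writes a DataFrame to the warehouse. The helper name make_batch_handler is hypothetical; only isEmpty, createDataFrame, and foreachRDD are actual Spark calls.

```python
from typing import Any, Callable


def make_batch_handler(spark: Any, schema: Any,
                       save: Callable[[Any], None]) -> Callable[[Any], None]:
    """Return a function suitable for DStream.foreachRDD.

    spark  -- a pyspark.sql.SparkSession (assumed)
    schema -- a StructType or list of column names (assumed)
    save   -- caller-supplied function that persists one DataFrame
    """
    def handle(rdd: Any) -> None:
        # Skip empty micro-batches so no empty write hits the warehouse.
        if rdd.isEmpty():
            return
        # Convert the batch's RDD into a DataFrame with the given schema.
        df = spark.createDataFrame(rdd, schema)
        # Any per-batch processing would go here before saving.
        save(df)

    return handle
```

With a real stream this would be wired up as dstream.foreachRDD(make_batch_handler(spark, schema, save)).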

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Mohammad Tariq
Hi Adaryl, You could definitely load data into a warehouse using Spark's JDBC support for DataFrames. Could you please explain your use case a bit more? That'll help us in answering your query better. Tariq, Mohammad about.me/mti
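
A minimal sketch of what a JDBC load from a DataFrame might look like in PySpark. It assumes a PostgreSQL-compatible warehouse, so the driver class and the URL in the usage note are assumptions, and the helper names are hypothetical. The df argument is expected to be a pyspark.sql.DataFrame; DataFrameWriter.jdbc is the Spark call doing the work.

```python
from typing import Any, Dict


def jdbc_properties(user: str, password: str,
                    driver: str = "org.postgresql.Driver") -> Dict[str, str]:
    """Connection properties passed through to the JDBC driver.

    The PostgreSQL driver is an assumption; substitute the driver class
    for the actual target warehouse.
    """
    return {"user": user, "password": password, "driver": driver}


def append_to_warehouse(df: Any, url: str, table: str,
                        props: Dict[str, str]) -> None:
    """Append the rows of a DataFrame to an existing warehouse table."""
    # mode="append" inserts rows and leaves the existing table in place.
    df.write.jdbc(url=url, table=table, mode="append", properties=props)
```

Usage would look like append_to_warehouse(df, "jdbc:postgresql://dw-host:5432/dw", "public.events", jdbc_properties("loader", "secret")), with the driver jar available on the Spark classpath.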

RE: using spark to load a data warehouse in real time

2017-02-28 Thread Adaryl Wakefield
I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would, of course, have to be in any architecture, but it looks like they are suggesting that Kafka is all you need. My primary concern is the complexity of loading warehouses. I have a web development background so I have

Re: using spark to load a data warehouse in real time

2017-02-28 Thread Femi Anthony
Have you checked to see if there are any drivers to enable you to write to Greenplum directly from Spark? You can also take a look at this link: https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q Apparently GPDB is based on Postgres so maybe that approach may