RE: using spark to load a data warehouse in real time

Adaryl Wakefield Tue, 07 Mar 2017 09:33:12 -0800

Hi Henry,
I didn’t catch your email until now. When you wrote to the database, how did 
you enforce the schema? Did the data frames just spit everything out with the 
necessary keys?


Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Henry Tremblay [mailto:paulhtremb...@gmail.com]
Sent: Tuesday, February 28, 2017 3:56 PM
To: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time


We did this all the time at my last position.

1. We had unstructured data in S3.

2. We read directly from S3 and then gave the data structure with a DataFrame
in Spark.

3. We wrote the results to S3.

4. We used Redshift's very fast parallel loading to load the results into a
table.
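
A minimal sketch of the four steps above, assuming PySpark. It is an
illustration only, not the code from that project: the pipe-delimited record
layout, the column names, and the helper names (parse_line, copy_statement,
run_job) are all hypothetical. run_job covers steps 1 to 3, and copy_statement
builds the Redshift COPY command for step 4.

```python
from typing import Optional


def parse_line(line: str) -> Optional[tuple]:
    """Turn one raw 'order_id|customer|amount' line (assumed layout)
    into a typed row, or None if the line is malformed."""
    parts = line.strip().split("|")
    if len(parts) != 3:
        return None
    try:
        return (int(parts[0]), parts[1], float(parts[2]))
    except ValueError:
        return None


def copy_statement(table: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift COPY command that bulk-loads CSV files from S3."""
    return (
        f"COPY {table} FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role}' FORMAT AS CSV;"
    )


def run_job(raw_path: str, staged_path: str) -> None:
    """Steps 1-3: read raw text from S3, apply a schema, write CSV to S3."""
    # Imported here so the pure helpers above work without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import (
        DoubleType, LongType, StringType, StructField, StructType,
    )

    # Hypothetical target schema; this is where the structure is enforced.
    schema = StructType([
        StructField("order_id", LongType(), False),
        StructField("customer", StringType(), True),
        StructField("amount", DoubleType(), True),
    ])
    spark = SparkSession.builder.appName("s3-to-staging").getOrCreate()
    rows = (
        spark.sparkContext.textFile(raw_path)
        .map(parse_line)
        .filter(lambda row: row is not None)  # drop malformed lines
    )
    spark.createDataFrame(rows, schema).write.mode("overwrite").csv(staged_path)
```

The COPY string would then be run against Redshift with whatever SQL client
is in use. Dropping malformed lines silently is a choice made for brevity; a
real load would probably route them to a reject location instead.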

Henry

On 02/28/2017 11:04 AM, Mohammad Tariq wrote:
You could try this as a blueprint:

Read the data in through Spark Streaming. Iterate over it and convert each RDD 
into a DataFrame. Use these DataFrames to perform whatever processing is 
required and then save that DataFrame into your target relational warehouse.
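
A rough sketch of that blueprint in PySpark, using the DStream API and the
DataFrame JDBC writer. Everything specific here is an assumption for
illustration: the socket source, the 'sensor_id,reading' record layout, the
JDBC URL, the table name, and the credentials. The PostgreSQL driver is used
only because Greenplum is Postgres-based; whether it suits a given warehouse
needs checking.

```python
def to_row(line: str) -> tuple:
    """Parse 'sensor_id,reading' (assumed layout) into a typed tuple.
    Raises ValueError on malformed input."""
    sensor_id, reading = line.split(",")
    return (sensor_id.strip(), float(reading))


def save_batch(rdd, spark, jdbc_url: str, table: str, props: dict) -> None:
    """Convert one micro-batch RDD to a DataFrame and append it over JDBC."""
    if rdd.isEmpty():
        return  # nothing arrived in this interval
    df = spark.createDataFrame(rdd.map(to_row), ["sensor_id", "reading"])
    # Any cleaning or enrichment of df would go here before the write.
    df.write.jdbc(url=jdbc_url, table=table, mode="append", properties=props)


def main() -> None:
    # Imported here so the helpers above can be exercised without Spark.
    from pyspark.sql import SparkSession
    from pyspark.streaming import StreamingContext

    spark = SparkSession.builder.appName("stream-to-warehouse").getOrCreate()
    ssc = StreamingContext(spark.sparkContext, batchDuration=5)

    # Hypothetical source and target; replace with real endpoints.
    lines = ssc.socketTextStream("localhost", 9999)
    jdbc_url = "jdbc:postgresql://localhost:5432/dw"
    props = {
        "user": "loader",
        "password": "change-me",
        "driver": "org.postgresql.Driver",
    }

    lines.foreachRDD(
        lambda rdd: save_batch(rdd, spark, jdbc_url, "readings", props)
    )
    ssc.start()
    ssc.awaitTermination()


if __name__ == "__main__":
    main()
```

Note that mode="append" only issues inserts. Anything that needs update or
merge semantics would have to be handled separately, for example by loading
into a staging table first.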

HTH


Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:27 AM, Mohammad Tariq <donta...@gmail.com> wrote:
Hi Adaryl,

You could definitely load data into a warehouse with DataFrames through
Spark's JDBC support. Could you please explain your use case a bit more?
That'll help us answer your query better.

Tariq, Mohammad
about.me/mti

On Wed, Mar 1, 2017 at 12:15 AM, Adaryl Wakefield
<adaryl.wakefi...@hotmail.com> wrote:
I haven’t heard of Kafka Connect. I’ll have to look into it. Kafka would, of
course, have to be in any architecture, but it looks like they are suggesting
that Kafka is all you need.

My primary concern is the complexity of loading warehouses. I have a web
development background, so I have some idea of how to insert data into
a database from an application. I’ve since moved on to straight database
programming and don’t work with anything that reads from an app anymore.

Loading a warehouse requires a lot of data cleaning and grabbing
keys to maintain referential integrity. Usually that’s done in a batch process.
Now I have to do it record by record (or a few records at a time). I have some
ideas but I’m not quite there yet.
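
To make the key-lookup step concrete, here is a small sketch of loading one
micro-batch while keeping referential integrity. It is an illustration under
loud assumptions: SQLite stands in for the warehouse so the example runs
anywhere, the dim_customer and fact_sale tables are hypothetical, and
INSERT OR IGNORE is SQLite syntax, not Greenplum's. Only the pattern carries
over: insert missing dimension rows, look up the surrogate keys for the whole
batch in one query, then write the facts.

```python
import sqlite3


def make_warehouse() -> sqlite3.Connection:
    """Create an in-memory stand-in warehouse with one dimension and one fact."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        PRAGMA foreign_keys = ON;
        CREATE TABLE dim_customer (
            customer_key  INTEGER PRIMARY KEY,
            customer_name TEXT UNIQUE NOT NULL
        );
        CREATE TABLE fact_sale (
            customer_key INTEGER NOT NULL
                REFERENCES dim_customer (customer_key),
            amount       REAL NOT NULL
        );
        """
    )
    return conn


def load_batch(conn: sqlite3.Connection, batch: list) -> int:
    """Load a micro-batch of (customer_name, amount) records.
    Returns the number of fact rows written."""
    names = sorted({name for name, _ in batch})
    if not names:
        return 0

    # 1. Add any customers the dimension has not seen yet.
    conn.executemany(
        "INSERT OR IGNORE INTO dim_customer (customer_name) VALUES (?)",
        [(name,) for name in names],
    )

    # 2. Fetch surrogate keys for the whole batch in a single query.
    placeholders = ",".join("?" * len(names))
    keys = dict(
        conn.execute(
            "SELECT customer_name, customer_key FROM dim_customer "
            f"WHERE customer_name IN ({placeholders})",
            names,
        ).fetchall()
    )

    # 3. Write facts that reference the resolved keys.
    conn.executemany(
        "INSERT INTO fact_sale (customer_key, amount) VALUES (?, ?)",
        [(keys[name], amount) for name, amount in batch],
    )
    conn.commit()
    return len(batch)
```

For example, calling load_batch(conn, [("acme", 10.0), ("globex", 5.0)]) and
then load_batch(conn, [("acme", 2.5)]) leaves two dimension rows and three
fact rows, because the second batch reuses the existing key for "acme".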

I thought Spark SQL would be the way to get this done, but so far all the
examples I’ve seen are just SELECT statements, with no INSERT or MERGE
statements.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Femi Anthony [mailto:femib...@gmail.com]
Sent: Tuesday, February 28, 2017 4:13 AM
To: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Cc: user@spark.apache.org
Subject: Re: using spark to load a data warehouse in real time

Have you checked to see if there are any drivers that would let you write to
Greenplum directly from Spark?

You can also take a look at this link:

https://groups.google.com/a/greenplum.org/forum/m/#!topic/gpdb-users/lnm0Z7WBW6Q

Apparently GPDB is based on Postgres, so that approach may work.

Another approach may be for Spark Streaming to write to Kafka, and then have
another process read from Kafka and write to Greenplum.

Kafka Connect may be useful in this case:

https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/

Femi Anthony

On Feb 27, 2017, at 7:18 PM, Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
wrote:
Is anybody using Spark Streaming/SQL to load a relational data warehouse in
real time? There isn’t a lot of information on this use case out there. When I
google real-time data warehouse loading, nothing I find is up to date. It’s all
turn-of-the-century stuff that doesn’t take into account advancements in
database technology. Additionally, whenever I try to learn Spark, it’s always
the same thing: play with Twitter data, never structured data. All the CEP use
cases are about data science.

I’d like to use Spark to load Greenplum in real time. Intuitively, this should
be possible. I was thinking Spark Streaming with Spark SQL along with an ORM
should do it. Am I off base with this? Is the reason there are no examples
that there is a better way to do what I want?

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

--
Henry Tremblay
Robert Half Technology
