Brant,

You should be able to migrate most of your existing SQL code to Spark SQL, but 
keep in mind that Spark SQL does not yet support the full ANSI SQL standard, so 
you may need to rewrite some of your existing queries.
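
As a rough illustration, here is how an existing query might run against a 
Cassandra table through the DataStax spark-cassandra-connector's DataFrame 
source (an untested sketch; the keyspace, table, and column names are invented):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val conf = new SparkConf()
    .setAppName("LegacySqlOnSpark")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)
  val sqlContext = new SQLContext(sc)

  // Expose a Cassandra table to Spark SQL via the connector's DataFrame source
  // (requires the spark-cassandra-connector jar on the classpath).
  val encounters = sqlContext.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "healthcare", "table" -> "encounters"))
    .load()
  encounters.registerTempTable("encounters")

  // Many existing queries run unchanged; constructs outside Spark SQL's
  // supported subset are the ones that will need rewriting.
  sqlContext.sql(
    "SELECT patient_id, COUNT(*) AS visits FROM encounters GROUP BY patient_id"
  ).show()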

Another thing to keep in mind is that Spark SQL is not real-time. The response 
time of Spark SQL + Cassandra will not match that of a properly indexed 
database table (up to a certain size). On the other hand, the Spark SQL + 
Cassandra solution will scale better and provide higher throughput and 
availability more economically than an Oracle-based solution.

Mohammed

-----Original Message-----
From: Brant Seibert [mailto:brantseib...@hotmail.com] 
Sent: Friday, May 22, 2015 3:23 PM
To: user@spark.apache.org
Subject: Migrate Relational to Distributed

Hi,  The healthcare industry can do wonderful things with Apache Spark.  But 
there is already a very large base of data and applications firmly rooted in 
the relational paradigm, and they are resistant to change - stuck on Oracle.

**
QUESTION 1 - Should we migrate legacy relational data (plus new transactions) 
to distributed storage?

DISCUSSION 1 - The primary advantage I see is not having to engage in the 
lengthy (1+ years) process of creating a relational data warehouse and cubes.  
Just store the data in a distributed system and "analyze first" in memory with 
Spark.
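
By "analyze first" I mean something like the following sketch (the storage 
path, schema, and names are invented for illustration):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("AnalyzeFirst"))
  val sqlContext = new SQLContext(sc)

  // Land the raw data in distributed storage first, then explore it in memory.
  val events = sqlContext.read.json("hdfs:///healthcare/raw/events")
  events.cache()  // keep the working set in memory for repeated ad hoc queries
  events.registerTempTable("events")
  sqlContext.sql(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type"
  ).show()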

**
QUESTION 2 - Will we have to rewrite the enormous amount of logic that is 
already built for the old relational system?

DISCUSSION 2 - If we move the data to distributed storage, can we simply run 
that existing relational logic as Spark SQL queries?  [existing SQL --> Spark 
Context --> Cassandra --> process in Spark SQL --> display in existing UI]. 
Can we create an RDD that uses existing SQL?  Or do we need to rewrite all our 
SQL?
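
To make the question concrete, here is a rough sketch of what we imagine 
(table and column names are invented; assume the data has already been 
registered with Spark SQL, e.g., loaded from Cassandra):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  val sc = new SparkContext(new SparkConf().setAppName("ExistingSqlSketch"))
  val sqlContext = new SQLContext(sc)

  // Run an existing SQL statement as-is against a registered table.
  val openClaims = sqlContext.sql("SELECT * FROM claims WHERE status = 'OPEN'")

  // A Spark SQL result is a DataFrame; .rdd exposes it as an RDD of Rows,
  // so existing SQL output can feed RDD-based processing or the UI layer.
  val asRdd = openClaims.rdd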

**
DATA SIZE - We are adding many new data sources to a system that already 
manages healthcare data for over a million people.  The number of rows may not 
be enormous right now compared to, say, the advertising industry, but the 
number of dimensions runs well into the thousands.  If we add IoT data for 
each patient, generating billions of events per day, the number of rows grows 
by orders of magnitude.  We would like to be prepared to handle that huge data 
scenario.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Migrate-Relational-to-Distributed-tp22999.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
