Re: Spark SQL queries hive table, real time ?

2015-07-06 Thread Denny Lee
Within the context of your question, Spark SQL with the Hive context is
primarily about very fast queries. If you want real-time queries, I would
use Spark Streaming. Two great resources on this topic are the guest
lecture on Spark Streaming in Stanford CME 323: Distributed Algorithms and
Optimization
http://www.slideshare.net/tathadas/guest-lecture-on-spark-streaming-in-standford
and Recipes for Running Spark Streaming Applications in Production
https://spark-summit.org/2015/events/recipes-for-running-spark-streaming-applications-in-production/
(from the recent Spark Summit 2015).

HTH!


On Mon, Jul 6, 2015 at 3:23 PM spierki florian.spierc...@crisalid.com
wrote:

 Hello,

 I'm wondering about the performance of using Spark SQL with Hive to do
 real-time analytics.
 I know that Hive was created for batch processing, and that Spark is used
 for fast queries.

 But will using Spark SQL with Hive allow me to run real-time queries? Or
 will it just make queries faster, but not real time?
 Should I use another data store, like HBase?

 Thanks in advance for your time and consideration,
 Florian



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-queries-hive-table-real-time-tp23642.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




RE: Spark SQL queries hive table, real time ?

2015-07-06 Thread Mohammed Guller
Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the 
data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?

In theory, it is possible to use Spark SQL for real-time queries, but cost
increases as the data size grows. If you can store all of your data in memory,
then you should be able to query it in real-time ☺ At the other extreme, if
Spark SQL has to read a terabyte of data from spinning disk, there is no way it
can respond in real-time. To be fair, no software can read a terabyte of data
from a single HDD in real-time. Simple laws of physics. Either you will have to
spread the reads over a large number of disks and read them in parallel, or
you will have to index the data so that your queries don't have to read a
terabyte of data from disk.
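
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. The 150 MB/s sequential read rate, the 100-disk layout, and the 1 GB indexed read are illustrative assumptions, not measurements from any particular cluster, and the helper name scan_seconds is made up for this example. It only models raw sequential read time and ignores seeks, network, and CPU.

```python
def scan_seconds(data_bytes: float, bytes_per_second: float, disks: int = 1) -> float:
    """Ideal time to read data_bytes when the read is split evenly across disks."""
    if disks < 1 or bytes_per_second <= 0:
        raise ValueError("need at least one disk and a positive throughput")
    return data_bytes / (bytes_per_second * disks)


TB = 10**12                 # one terabyte (decimal)
GB = 10**9                  # one gigabyte (decimal)
HDD_RATE = 150 * 10**6      # ASSUMED sequential read rate per disk: 150 MB/s

# One spinning disk scanning the full terabyte.
single_disk = scan_seconds(TB, HDD_RATE)

# The same scan spread evenly over an ASSUMED 100 disks read in parallel.
hundred_disks = scan_seconds(TB, HDD_RATE, disks=100)

# An index that narrows the read to an ASSUMED 1 GB on a single disk.
indexed_read = scan_seconds(GB, HDD_RATE)

print(f"1 disk, full scan:    {single_disk / 3600:.2f} hours")
print(f"100 disks, full scan: {hundred_disks:.1f} seconds")
print(f"1 disk, 1 GB read:    {indexed_read:.1f} seconds")
```

Under these assumptions the single-disk scan takes close to two hours, the parallel scan about a minute, and the indexed read a few seconds, which is why both parallel reads and indexing matter long before the query engine itself becomes the bottleneck.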

Hope that helps.

Mohammed
