Hi Vikram,


There was a recent presentation at Strata that you might find useful: Hive on 
Spark is Blazing Fast... Or Is It? 
<http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final>



Generally those conclusions mirror my own observations: on large data sets, 
Hive still gives the best SQL performance, and its advantage narrows as the 
data sets get smaller. Of course, if you also want to build models from the 
data, then Spark is an attractive option with its unified programming model. 
HiveMall <https://github.com/myui/hivemall> might also be applicable in your 
case; I've seen increasing adoption of it within certain industries.
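
As a concrete illustration of that unified model, here is a minimal sketch 
(the 'sales' table and its columns are invented for illustration) that runs a 
SQL query and hands the result straight to MLlib in the same program:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext
    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    val sc = new SparkContext(new SparkConf().setAppName("sql-plus-mllib"))
    val hc = new HiveContext(sc)

    // SQL for the slicing; table and column names are hypothetical.
    val rows = hc.sql("SELECT amount, quantity FROM sales WHERE region = 'US'")

    // Feed the query result straight into MLlib, no export/import hop.
    val points = rows.map(r => Vectors.dense(r.getDouble(0), r.getDouble(1)))
    points.cache()
    val model = KMeans.train(points, 5, 20) // k = 5, 20 iterations

The point is less the clustering itself than that the SQL step and the 
modeling step share one cluster, one API, and one dataset.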



If you are going cloud, HDInsight is a good choice. You can run both Spark and 
R on HDInsight 
<http://azure.microsoft.com/blog/2014/11/17/azure-hdinsight-clusters-can-now-be-customized-to-run-a-variety-of-hadoop-projects-including-spark-and-r/>, 
as well as get the newest version of Hive (0.14, with Stinger enhancements 
from Microsoft 
<http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance>) 
for 'free', so once you get your data into a WASB container (Azure Blob 
storage) you can try all three methods and see which one works best for you. 
HDInsight works well for mixing and matching tools.
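
Once the data is in blob storage, a first smoke test from Spark on HDInsight 
can be as small as the sketch below; the container, account, and file layout 
are placeholders you would adjust to your data. It also shows the 
approximate-distinct-count trick that comes up later in the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("wasb-poc"))

    // wasb:// is the Azure Blob storage scheme HDInsight wires into Hadoop.
    val events = sc.textFile(
      "wasb://mycontainer@myaccount.blob.core.windows.net/facts/events/*")

    // HyperLogLog-based approximate distinct count: trades ~1% relative
    // error for a large speedup over an exact COUNT DISTINCT.
    val userIds = events.map(_.split('\t')(0)) // assumes user id is column 0
    println("~" + userIds.countApproxDistinct(0.01) + " distinct users")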



HTH,

- SteveN



-----Original Message-----
From: Dean Wampler [mailto:deanwamp...@gmail.com]
Sent: Thursday, 26 February, 2015 8:54
To: Vikram Kone
Cc: dev@spark.apache.org
Subject: Re: Need advice for Spark newbie



Historically, many orgs have replaced data warehouses with Hadoop clusters and 
used Hive along with Impala (on Cloudera deployments) or Drill (on MapR 
deployments) for SQL. Hive is older and slower, while Impala and Drill are 
newer and faster, but you typically need both for their complementary 
features, at least today.



Spark and Spark SQL are not yet complete replacements for them, but they'll get 
there over time. The good news is, you can mix and match these tools, as 
appropriate, because they can all work with the same datasets.
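
For example, a table registered in the Hive metastore (which Impala shares, 
and which points at plain HDFS files) can be queried from Spark SQL without 
copying anything. A rough sketch, assuming an existing SparkContext sc and a 
hypothetical table name:

    import org.apache.spark.sql.hive.HiveContext

    // HiveContext reuses the Hive metastore, so 'fact_sales' here is
    // whatever table Hive or Impala already defined over the HDFS files.
    val hive = new HiveContext(sc)
    val top = hive.sql(
      """SELECT product_id, COUNT(*) AS cnt
        |FROM fact_sales
        |GROUP BY product_id
        |ORDER BY cnt DESC
        |LIMIT 10""".stripMargin)
    top.collect().foreach(println)

Since each engine sees the same files, a POC can benchmark Hive, Impala/Drill, 
and Spark SQL against identical data.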



The challenge is all the tribal knowledge required to set up and manage Hadoop 
clusters, to organize your data properly for the best performance for your 
needs, and to use all these tools effectively, along with additional Hadoop 
ETL tools, etc. Fortunately, tools like Tableau already integrate with these 
systems.



However, none of this will be as polished and integrated as what you're used 
to. You're trading that polish for greater scalability and flexibility.



HTH.





Dean Wampler, Ph.D.

Author: Programming Scala, 2nd Edition 
<http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
Typesafe <http://typesafe.com>
@deanwampler <http://twitter.com/deanwampler>
http://polyglotprogramming.com



On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramk...@gmail.com> wrote:



> Hi,
>
> I'm a newbie when it comes to Spark and the Hadoop ecosystem in general.
> Our team has been predominantly a Microsoft shop that uses the MS stack
> for most of its BI needs. So we are talking SQL Server for storing
> relational data and SQL Server Analysis Services for building MOLAP
> cubes for sub-second query analysis.
>
> Lately, we have been hitting degradation in our cube query response
> times as our data sizes grew considerably over the past year. We are
> talking fact tables in the 10-100 billion row range and a few
> dimensions with tens to hundreds of millions of rows. We tried
> vertically scaling up our SSAS server, but queries are still taking a
> few minutes. In light of this, I was entrusted with the task of
> figuring out an open source solution that would scale to our current
> and future needs for data analysis.
>
> I looked at a bunch of open source tools like Apache Drill, Druid,
> AtScale, Spark, Storm, Kylin, etc., and settled on exploring Spark as
> the first step, given its recent rise in popularity and the growing
> ecosystem around it. Since we are also interested in doing deep data
> analysis like machine learning and graph algorithms on top of our
> data, Spark seems to be a good solution.
>
> I would like to build out a POC for our MOLAP cubes using Spark with
> HDFS/Hive as the data source and see how it scales for our
> queries/measures in real time with real data.

> Roughly, these are the requirements for our team:
> 1. Should be able to create facts, dimensions, and measures from our
>    data sets in an easy way.
> 2. Cubes should be queryable from Excel and Tableau.
> 3. Easily scale out by adding new nodes when data grows.
> 4. Very low maintenance and highly stable for production-level workloads.
> 5. Sub-second query latencies for COUNT DISTINCT measures (since the
>    majority of our expensive measures are of this type). We are OK with
>    approximate distinct counts for better performance.

>

> So given these requirements, is Spark the right solution to replace
> our on-premise MOLAP cubes? Are there any tutorials or documentation
> on how to build cubes using Spark? Is that even possible? Or even
> necessary? As long as our users can pivot/slice & dice the measures
> quickly from client tools by dragging and dropping dimensions into
> rows/columns without the need to join to the fact table, we are OK
> with however the data is laid out. It doesn't have to be a cube. It
> can be a flat file in HDFS for all we care. I would love to chat with
> someone who has successfully done this kind of migration from OLAP
> cubes to Spark in their team or company.

>

> That's it for now. Looking forward to a great discussion.

>

> P.S. We have decided on using Azure HDInsight as our managed Hadoop
> system in the cloud.

>
