Hi Steve,

Thanks for the info. I will look into HiveMall. Are you saying that we can create star/snowflake data models using Spark so they can be queried from Tableau?
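For what it's worth, a star layout is just a fact table joined to small dimension tables at query time. Here is a toy, self-contained illustration in plain Python (table names and values are made up); in Spark the same shape would be DataFrames/Hive tables, and a BI tool like Tableau would issue the equivalent join + GROUP BY in SQL against them:

```python
from collections import defaultdict

# Hypothetical star schema: one fact table plus dimension lookup tables.
dim_product = {1: "laptop", 2: "phone"}   # product_key -> product name
dim_region = {10: "EMEA", 20: "AMER"}     # region_key  -> region name

fact_sales = [                            # (product_key, region_key, amount)
    (1, 10, 1200.0),
    (2, 10, 650.0),
    (1, 20, 1150.0),
    (2, 20, 700.0),
    (2, 20, 690.0),
]

# Slice & dice: total sales by product name (a star join + GROUP BY).
totals = defaultdict(float)
for product_key, region_key, amount in fact_sales:
    totals[dim_product[product_key]] += amount

print(dict(totals))  # -> {'laptop': 2350.0, 'phone': 2040.0}
```

The point is that the data stays normalized (small dimensions, one big fact table); no pre-built cube is required as long as the query engine can do the join fast enough.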
On Thursday, February 26, 2015, Steve Nunez <snu...@hortonworks.com> wrote:

> Hi Vikram,
>
> There was a recent presentation at Strata that you might find useful: "Hive on Spark is Blazing Fast .. Or Is It?" <http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final>
>
> Generally those conclusions mirror my own observations: on large data sets, Hive still gives the best SQL performance, and its advantage drops off as the data sets get smaller. Of course, if you also want to build models from the data, then Spark is an attractive option with its unified programming model. HiveMall <https://github.com/myui/hivemall> might also be applicable in your case; I've seen increasing adoption of it within certain industries.
>
> If you are going cloud, HDInsight is a good choice. You can run both Spark and R on HDInsight <http://azure.microsoft.com/blog/2014/11/17/azure-hdinsight-clusters-can-now-be-customized-to-run-a-variety-of-hadoop-projects-including-spark-and-r/>, as well as get the newest version of Hive (0.14, with Stinger enhancements from Microsoft <http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance>) for 'free', so once you get your data into WASB you can try all three methods and see which one works best for you. HDInsight works well for mixing & matching tools.
>
> HTH,
> - SteveN
>
> -----Original Message-----
> From: Dean Wampler [mailto:deanwamp...@gmail.com]
> Sent: Thursday, 26 February, 2015 8:54
> To: Vikram Kone
> Cc: dev@spark.apache.org
> Subject: Re: Need advice for Spark newbie
>
> Historically, many orgs. have replaced data warehouses with Hadoop clusters and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR deployments) for SQL.
> Hive is older and slower, while Impala and Drill are newer and faster, but you typically need both for their complementary features, at least today.
>
> Spark and Spark SQL are not yet complete replacements for them, but they'll get there over time. The good news is that you can mix and match these tools, as appropriate, because they can all work with the same datasets.
>
> The challenge is all the tribal knowledge required to set up and manage Hadoop clusters, to properly organize your data for the best performance for your needs, to use all these tools effectively, along with additional Hadoop ETL tools, etc. Fortunately, tools like Tableau are already integrated here.
>
> However, none of this will be as polished and integrated as what you're used to. You're trading that polish for greater scalability and flexibility.
>
> HTH.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramk...@gmail.com> wrote:
>
> > > Hi,
> > > I'm a newbie when it comes to Spark and the Hadoop ecosystem in general. Our team has been predominantly a Microsoft shop that uses the MS stack for most of its BI needs. So we are talking SQL Server for storing relational data and SQL Server Analysis Services for building MOLAP cubes for sub-second query analysis.
> > > Lately, we have been seeing degradation in our cube query response times as our data sizes grew considerably over the past year. We are talking fact tables in the 10-100 billion row range and a few dimensions in the 10-100 million row range. We tried vertically scaling up our SSAS server, but queries are still taking a few minutes.
> > > In light of this, I was entrusted with the task of figuring out an open-source solution that would scale to our current and future needs for data analysis.
> > > I looked at a bunch of open-source tools like Apache Drill, Druid, AtScale, Spark, Storm, Kylin, etc., and settled on exploring Spark as the first step, given its recent rise in popularity and the growing ecosystem around it. Since we are also interested in doing deep data analysis like machine learning and graph algorithms on top of our data, Spark seems to be a good solution.
> > > I would like to build out a POC for our MOLAP cubes using Spark with HDFS/Hive as the data source and see how it scales for our queries/measures in real time with real data.
> > > Roughly, these are the requirements for our team:
> > > 1. Should be able to create facts, dimensions and measures from our data sets in an easy way.
> > > 2. Cubes should be queryable from Excel and Tableau.
> > > 3. Easily scale out by adding new nodes when data grows.
> > > 4. Minimal maintenance and highly stable for production-level workloads.
> > > 5. Sub-second query latencies for COUNT DISTINCT measures (since the majority of our expensive measures are of this type). We are OK with approximate distinct counts for better performance.
> > >
> > > So given these requirements, is Spark the right solution to replace our on-premise MOLAP cubes?
> > > Are there any tutorials or documentation on how to build cubes using Spark? Is that even possible? Or even necessary? As long as our users can pivot/slice & dice the measures quickly from client tools by dragging and dropping dimensions into rows/columns, without the need to join to the fact table, we are OK with however the data is laid out. It doesn't have to be a cube. It can be a flat file in HDFS for all we care.
> > > I would love to chat with someone who has successfully done this kind of migration from OLAP cubes to Spark in their team or company.
> > >
> > > This is it for now. Looking forward to a great discussion.
> > >
> > > P.S. We have decided on using Azure HDInsight as our managed Hadoop system in the cloud.
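On the approximate COUNT DISTINCT requirement (item 5 above): Spark SQL ships a HyperLogLog-based approximate distinct count, which is what makes "approx counts for better perf" practical. To make the trade-off concrete, here is a minimal, self-contained HyperLogLog sketch in plain Python; it is an illustration of the algorithm family only, not Spark's actual implementation, and the register count and test values are arbitrary:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate COUNT DISTINCT.

    Illustrative only; real engines use tuned variants (HyperLogLog++).
    """

    def __init__(self, p=10):
        self.p = p                 # 2**p registers; more registers = less error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)     # low p bits choose a register
        w = h >> self.p            # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in w (1-based).
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:      # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 10000; typical error ~3% at p=10
```

The attraction for the COUNT DISTINCT use case is that sketches like this are tiny (here, 1024 registers) and mergeable, so they can be precomputed per dimension value and combined at query time instead of rescanning the fact table.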