Hi Steve,
Thanks for the info. I will look into HiveMall. Are you saying that we can
create star/snowflake data models using Spark so they can be queried from
Tableau?

On Thursday, February 26, 2015, Steve Nunez <snu...@hortonworks.com> wrote:

>  Hi Vikram,
> There was a recent presentation at Strata that you might find useful: Hive
> on Spark is Blazing Fast .. Or Is It?
> <http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final>
> Generally those conclusions mirror my own observations: on large data
> sets, Hive still gives the best SQL performance and the curve drops off as
> the data sets get smaller. Of course if you also want to build models from
> the data, then Spark is an attractive option with its unified programming
> model. HiveMall <https://github.com/myui/hivemall> might also be
> applicable in your case; I’ve seen increasing adoption of it within certain
> industries.
> If you are going cloud, HDInsight is a good choice. You can run both Spark
> and R on HDInsight
> <http://azure.microsoft.com/blog/2014/11/17/azure-hdinsight-clusters-can-now-be-customized-to-run-a-variety-of-hadoop-projects-including-spark-and-r/>,
> as well as get the newest version of Hive (0.14, with Stinger
> enhancements from Microsoft
> <http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance>)
> for ‘free’, so once you get your data into WASB you can try all three
> methods and see which one works best for you. HDInsight works well for
> mixing & matching tools.
> HTH,
> - SteveN
> -----Original Message-----
> From: Dean Wampler [mailto:deanwamp...@gmail.com]
> Sent: Thursday, 26 February, 2015 8:54
> To: Vikram Kone
> Cc: dev@spark.apache.org
> Subject: Re: Need advice for Spark newbie
> Historically, many organizations have replaced data warehouses with Hadoop
> clusters and used Hive along with Impala (on Cloudera deployments) or Drill
> (on MapR deployments) for SQL. Hive is older and slower, while Impala and
> Drill are newer and faster, but you typically need both for their
> complementary features, at least today.
> Spark and Spark SQL are not yet complete replacements for them, but
> they'll get there over time. The good news is, you can mix and match these
> tools, as appropriate, because they can all work with the same datasets.
> The challenge is all the tribal knowledge required to set up and manage
> Hadoop clusters, to properly organize your data for best performance for
> your needs, to use all these tools effectively, along with additional
> Hadoop ETL tools, etc. Fortunately, tools like Tableau are already
> integrated here.
> However, none of this will be as polished and integrated as what you're
> used to. You're trading that polish for greater scalability and flexibility.
> HTH.
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
> On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramk...@gmail.com> wrote:
> > Hi,
> > I'm a newbie when it comes to Spark and the Hadoop ecosystem in general.
> > Our team has been predominantly a Microsoft shop that uses the MS stack
> > for most of its BI needs. So we are talking SQL Server for storing
> > relational data and SQL Server Analysis Services for building MOLAP
> > cubes for sub-second query analysis.
> > Lately, we have been hitting degradation in our cube query response
> > times as our data sizes grew considerably over the past year. We are
> > talking about fact tables in the range of 10-100 billion rows and a
> > few dimensions in the 10s-100s of millions of rows. We tried
> > vertically scaling up our SSAS server but queries are still taking a few
> > minutes. In light of this, I was entrusted with the task of figuring out
> > an open source solution that would scale to our current and future needs
> > for data analysis.
> > I looked at a bunch of open source tools like Apache Drill, Druid,
> > AtScale, Spark, Storm, Kylin, etc. and settled on exploring Spark as the
> > first step given its recent rise in popularity and the growing ecosystem
> > around it.
> > Since we are also interested in doing deep data analysis like machine
> > learning and graph algorithms on top of our data, Spark seems to be a
> > good solution.
> > I would like to build out a POC for our MOLAP cubes using Spark with
> > HDFS/Hive as the data source and see how it scales for our
> > queries/measures in real time with real data.
> > Roughly, these are the requirements for our team:
> > 1. Should be able to create facts, dimensions and measures from our
> > data sets easily.
> > 2. Cubes should be queryable from Excel and Tableau.
> > 3. Easily scale out by adding new nodes when data grows.
> > 4. Very low maintenance and highly stable for production-level workloads.
> > 5. Sub-second query latencies for COUNT DISTINCT measures (since the
> > majority of our expensive measures are of this type). We are OK with
> > approximate distinct counts for better performance.
> >
> > So given these requirements, is Spark the right solution to replace
> > our on-premise MOLAP cubes?
> > Are there any tutorials or documentation on how to build cubes using
> > Spark?
> > Is that even possible? Or even necessary? As long as our users can
> > pivot/slice & dice the measures quickly from client tools by dragging
> > and dropping dimensions into rows/columns without needing to join to the
> > fact table, we are OK with however the data is laid out. It doesn't have
> > to be a cube. It can be a flat file in HDFS for all we care. I would love
> > to chat with someone who has successfully done this kind of migration
> > from OLAP cubes to Spark in their team or company.
> >
> > This is it for now. Looking forward to a great discussion.
> >
> > P.S. We have decided on using Azure HDInsight as our managed Hadoop
> > system in the cloud.
> >
