Hi Steve,

Thanks for the info. I will look into HiveMall. Are you saying that we can create star/snowflake data models using Spark so they can be queried from Tableau?
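For what it's worth, a star layout is just a fact table joined to small dimension tables at query time. Here is a toy, self-contained illustration in plain Python (table names and values are made up); in Spark the same shape would be DataFrames/Hive tables, and a BI tool like Tableau would issue the equivalent join + GROUP BY in SQL against them:

```python
from collections import defaultdict

# Hypothetical star schema: one fact table plus dimension lookup tables.
dim_product = {1: "laptop", 2: "phone"}   # product_key -> product name
dim_region = {10: "EMEA", 20: "AMER"}     # region_key  -> region name

fact_sales = [                            # (product_key, region_key, amount)
    (1, 10, 1200.0),
    (2, 10, 650.0),
    (1, 20, 1150.0),
    (2, 20, 700.0),
    (2, 20, 690.0),
]

# Slice & dice: total sales by product name (a star join + GROUP BY).
totals = defaultdict(float)
for product_key, region_key, amount in fact_sales:
    totals[dim_product[product_key]] += amount

print(dict(totals))  # -> {'laptop': 2350.0, 'phone': 2040.0}
```

The point is that the data stays normalized (small dimensions, one big fact table); no pre-built cube is required as long as the query engine can do the join fast enough.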
On Thursday, February 26, 2015, Steve Nunez <snu...@hortonworks.com> wrote:

> Hi Vikram,
>
> There was a recent presentation at Strata that you might find useful: "Hive on Spark is Blazing Fast .. Or Is It?" <http://www.slideshare.net/hortonworks/hive-on-spark-is-blazing-fast-or-is-it-final>
>
> Generally those conclusions mirror my own observations: on large data sets, Hive still gives the best SQL performance, and its advantage drops off as the data sets get smaller. Of course, if you also want to build models from the data, then Spark is an attractive option with its unified programming model. HiveMall <https://github.com/myui/hivemall> might also be applicable in your case; I've seen increasing adoption of it within certain industries.
>
> If you are going cloud, HDInsight is a good choice. You can run both Spark and R on HDInsight <http://azure.microsoft.com/blog/2014/11/17/azure-hdinsight-clusters-can-now-be-customized-to-run-a-variety-of-hadoop-projects-including-spark-and-r/>, as well as get the newest version of Hive (0.14, with Stinger enhancements from Microsoft <http://www.slideshare.net/hugfrance/recent-enhancements-to-apache-hive-query-performance>) for 'free', so once you get your data into WASB you can try all three methods and see which one works best for you. HDInsight works well for mixing & matching tools.
>
> HTH,
> - SteveN
>
> -----Original Message-----
> From: Dean Wampler [mailto:deanwamp...@gmail.com]
> Sent: Thursday, 26 February, 2015 8:54
> To: Vikram Kone
> Cc: dev@spark.apache.org
> Subject: Re: Need advice for Spark newbie
>
> Historically, many orgs. have replaced data warehouses with Hadoop clusters and used Hive along with Impala (on Cloudera deployments) or Drill (on MapR deployments) for SQL.
> Hive is older and slower, while Impala and Drill are newer and faster, but you typically need both for their complementary features, at least today.
>
> Spark and Spark SQL are not yet complete replacements for them, but they'll get there over time. The good news is that you can mix and match these tools, as appropriate, because they can all work with the same datasets.
>
> The challenge is all the tribal knowledge required to set up and manage Hadoop clusters, to properly organize your data for the best performance for your needs, to use all these tools effectively, along with additional Hadoop ETL tools, etc. Fortunately, tools like Tableau are already integrated here.
>
> However, none of this will be as polished and integrated as what you're used to. You're trading that polish for greater scalability and flexibility.
>
> HTH.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramk...@gmail.com> wrote:
>
> > > Hi,
> > > I'm a newbie when it comes to Spark and the Hadoop ecosystem in general. Our team has been predominantly a Microsoft shop that uses the MS stack for most of its BI needs. So we are talking SQL Server for storing relational data and SQL Server Analysis Services for building MOLAP cubes for sub-second query analysis.
> > > Lately, we have been seeing degradation in our cube query response times as our data sizes grew considerably over the past year. We are talking fact tables in the 10-100 billion row range and a few dimensions in the 10-100 million row range. We tried vertically scaling up our SSAS server, but queries are still taking a few minutes.
> > > In light of this, I was entrusted with the task of figuring out an open-source solution that would scale to our current and future needs for data analysis.
> > > I looked at a bunch of open-source tools like Apache Drill, Druid, AtScale, Spark, Storm, Kylin, etc., and settled on exploring Spark as the first step, given its recent rise in popularity and the growing ecosystem around it. Since we are also interested in doing deep data analysis like machine learning and graph algorithms on top of our data, Spark seems to be a good solution.
> > > I would like to build out a POC for our MOLAP cubes using Spark with HDFS/Hive as the data source and see how it scales for our queries/measures in real time with real data.
> > > Roughly, these are the requirements for our team:
> > > 1. Should be able to create facts, dimensions and measures from our data sets in an easy way.
> > > 2. Cubes should be queryable from Excel and Tableau.
> > > 3. Easily scale out by adding new nodes when data grows.
> > > 4. Minimal maintenance and highly stable for production-level workloads.
> > > 5. Sub-second query latencies for COUNT DISTINCT measures (since the majority of our expensive measures are of this type). We are OK with approximate distinct counts for better performance.
> > >
> > > So given these requirements, is Spark the right solution to replace our on-premise MOLAP cubes?
> > > Are there any tutorials or documentation on how to build cubes using Spark? Is that even possible? Or even necessary? As long as our users can pivot/slice & dice the measures quickly from client tools by dragging and dropping dimensions into rows/columns, without the need to join to the fact table, we are OK with however the data is laid out. It doesn't have to be a cube. It can be a flat file in HDFS for all we care.
> > > I would love to chat with someone who has successfully done this kind of migration from OLAP cubes to Spark in their team or company.
> > >
> > > This is it for now. Looking forward to a great discussion.
> > >
> > > P.S. We have decided on using Azure HDInsight as our managed Hadoop system in the cloud.
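On the approximate COUNT DISTINCT requirement (item 5 above): Spark SQL ships a HyperLogLog-based approximate distinct count, which is what makes "approx counts for better perf" practical. To make the trade-off concrete, here is a minimal, self-contained HyperLogLog sketch in plain Python; it is an illustration of the algorithm family only, not Spark's actual implementation, and the register count and test values are arbitrary:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate COUNT DISTINCT.

    Illustrative only; real engines use tuned variants (HyperLogLog++).
    """

    def __init__(self, p=10):
        self.p = p                 # 2**p registers; more registers = less error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        idx = h & (self.m - 1)     # low p bits choose a register
        w = h >> self.p            # remaining 64-p bits
        # Rank = position of the leftmost 1-bit in w (1-based).
        rank = (64 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias correction, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:      # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw

hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
print(round(hll.estimate()))  # close to 10000; typical error ~3% at p=10
```

The attraction for the COUNT DISTINCT use case is that sketches like this are tiny (here, 1024 registers) and mergeable, so they can be precomputed per dimension value and combined at query time instead of rescanning the fact table.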