Dean,
Thanks for the info. Are you saying that we can create star/snowflake data models using Spark so that they can be queried from Tableau?
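For illustration only, here is the shape of star-schema query the question has in mind: a fact table joined to a dimension, then aggregated. To keep the sketch runnable anywhere it uses Python's built-in sqlite3 as a stand-in engine, not Spark SQL or Hive. The table and column names (fact_sales, dim_product, user_id, category, amount) are hypothetical.

```python
import sqlite3

# Stand-in engine: in-memory SQLite, NOT Spark. All names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (user_id INTEGER, product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'books'), (3, 'games');
INSERT INTO fact_sales VALUES
  (10, 1, 5.0), (10, 2, 7.5), (11, 1, 5.0), (12, 3, 20.0), (12, 3, 20.0);
""")

# Star-schema query: join the fact table to one dimension, group by a
# dimension attribute, and compute a distinct count plus a sum.
STAR_QUERY = """
SELECT d.category,
       COUNT(DISTINCT f.user_id) AS distinct_users,
       SUM(f.amount)             AS total_amount
FROM fact_sales AS f
JOIN dim_product AS d ON d.product_id = f.product_id
GROUP BY d.category
ORDER BY d.category
"""

rows = conn.execute(STAR_QUERY).fetchall()
for row in rows:
    print(row)
```

A BI tool connected over a SQL interface would presumably issue queries of roughly this shape. Whether a given engine answers them fast enough at billions of rows is a separate question that this sketch does not address.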
On Thursday, February 26, 2015, Dean Wampler <deanwamp...@gmail.com> wrote:

> Historically, many orgs have replaced data warehouses with Hadoop
> clusters and used Hive along with Impala (on Cloudera deployments) or
> Drill (on MapR deployments) for SQL. Hive is older and slower, while
> Impala and Drill are newer and faster, but you typically need both for
> their complementary features, at least today.
>
> Spark and Spark SQL are not yet complete replacements for them, but
> they'll get there over time. The good news is, you can mix and match
> these tools, as appropriate, because they can all work with the same
> datasets.
>
> The challenge is all the tribal knowledge required to set up and manage
> Hadoop clusters, to properly organize your data for best performance for
> your needs, to use all these tools effectively, along with additional
> Hadoop ETL tools, etc. Fortunately, tools like Tableau are already
> integrated here.
>
> However, none of this will be as polished and integrated as what you're
> used to. You're trading that polish for greater scalability and
> flexibility.
>
> HTH.
>
> Dean Wampler, Ph.D.
> Author: Programming Scala, 2nd Edition
> <http://shop.oreilly.com/product/0636920033073.do> (O'Reilly)
> Typesafe <http://typesafe.com>
> @deanwampler <http://twitter.com/deanwampler>
> http://polyglotprogramming.com
>
> On Thu, Feb 26, 2015 at 1:56 AM, Vikram Kone <vikramk...@gmail.com> wrote:
>
>> Hi,
>> I'm a newbie when it comes to Spark and the Hadoop ecosystem in general.
>> Our team has been predominantly a Microsoft shop that uses the MS stack
>> for most of its BI needs. So we are talking SQL Server for storing
>> relational data and SQL Server Analysis Services for building MOLAP cubes
>> for sub-second query analysis.
>> Lately, we have been hitting degradation in our cube query response times
>> as our data sizes grew considerably over the past year.
>> We are talking fact tables in the range of 10-100 billion rows and a few
>> dimensions in the 10s to 100s of millions of rows. We tried vertically
>> scaling up our SSAS server, but queries are still taking a few minutes.
>> In light of this, I was entrusted with the task of figuring out an open
>> source solution that would scale to our current and future needs for
>> data analysis.
>> I looked at a bunch of open source tools like Apache Drill, Druid,
>> AtScale, Spark, Storm, Kylin, etc. and settled on exploring Spark as the
>> first step, given its recent rise in popularity and the growing
>> ecosystem around it. Since we are also interested in doing deep data
>> analysis like machine learning and graph algorithms on top of our data,
>> Spark seems to be a good solution.
>> I would like to build out a POC for our MOLAP cubes using Spark with
>> HDFS/Hive as the data source and see how it scales for our
>> queries/measures in real time with real data.
>> Roughly, these are the requirements for our team:
>> 1. Should be able to create facts, dimensions and measures from our data
>>    sets in an easier way.
>> 2. Cubes should be queryable from Excel and Tableau.
>> 3. Easily scale out by adding new nodes when data grows.
>> 4. Very little maintenance and highly stable for production-level
>>    workloads.
>> 5. Sub-second query latencies for COUNT DISTINCT measures (since the
>>    majority of our expensive measures are of this type). We are OK with
>>    approximate distinct counts for better performance.
>>
>> So given these requirements, is Spark the right solution to replace our
>> on-premise MOLAP cubes?
>> Are there any tutorials or documentation on how to build cubes using
>> Spark? Is that even possible? Or even necessary? As long as our users
>> can pivot/slice and dice the measures quickly from client tools by
>> dragging and dropping dimensions into rows/columns without needing to
>> join to the fact table, we are OK with however the data is laid out. It
>> doesn't have to be a cube.
>> It can be a flat file in HDFS for all we care. I would love to chat with
>> someone who has successfully done this kind of migration from OLAP cubes
>> to Spark in their team or company.
>>
>> This is it for now. Looking forward to a great discussion.
>>
>> P.S. We have decided on using Azure HDInsight as our managed Hadoop
>> system in the cloud.
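
Requirement 5 in the quoted message accepts approximate distinct counts in exchange for speed. As a rough illustration of why that trade can be cheap, here is a minimal K-minimum-values (KMV) estimator in plain Python: it keeps only the k smallest hash values seen, so memory stays fixed no matter how many rows stream past. This is a teaching sketch under stated assumptions, not the algorithm that Spark, Hive or any other engine named in this thread implements. The function name approx_distinct and the parameter k are made up for this example.

```python
import hashlib
import heapq

HASH_SPACE = 2 ** 64  # hashes are treated as integers in [0, 2**64)

def approx_distinct(values, k=1024):
    """Estimate the number of distinct items using a KMV sketch (hypothetical helper)."""
    heap = []        # max-heap via negation: holds the k smallest hashes seen
    members = set()  # the same hashes, for cheap duplicate detection
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        if h in members:
            continue  # this hash is already in the sketch
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return len(heap)  # fewer than k distinct hashes: the count is exact
    kth_smallest = -heap[0]
    # If k hashes fall below kth_smallest, the density of distinct items
    # over the whole hash space is about (k - 1) / kth_smallest.
    return (k - 1) * HASH_SPACE // kth_smallest

# Usage: 200,000 rows containing 50,000 distinct ids.
ids = [i % 50_000 for i in range(200_000)]
print(approx_distinct(ids))
```

The relative error of a KMV estimate shrinks roughly as 1/sqrt(k), so a larger k buys accuracy at the cost of a bigger sketch. Sketches of this kind can also be merged across partitions, which is what makes the approach attractive on a cluster.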