Use case question

2014-11-24 Thread Gordon Benjamin
hi, We are building an analytics dashboard. Data will be updated every 5 minutes for now and eventually every 1 minute, maybe more frequently. The amount of data coming in is not huge, per customer maybe 30 records per minute, although we could have 500 customers. Is streaming correct for this instead
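
A minimal sketch of what a Spark Streaming (1.x-era) version of this ingest might look like. The socket source, host, and port are placeholders, and the per-batch count stands in for whatever aggregation the dashboard actually needs:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("DashboardIngest")
    // 60-second micro-batches match the eventual 1-minute refresh target.
    val ssc = new StreamingContext(conf, Seconds(60))

    // Hypothetical source; at ~30 records/min per customer x 500 customers,
    // a single receiver handles the volume comfortably.
    val records = ssc.socketTextStream("ingest-host", 9999)
    records.count().print()  // stand-in for the real per-batch aggregation

    ssc.start()
    ssc.awaitTermination()

At this volume, scheduled batch jobs would also work; streaming mainly buys the sub-minute refresh path later.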

Re: Use case question

2014-11-24 Thread Gordon Benjamin
at 4:34 PM, Gordon Benjamin gordon.benjami...@gmail.com wrote: hi, We are building an analytics dashboard. Data will be updated every 5 minutes for now and eventually every 1 minute, maybe more frequent. The amount of data coming

Re: Use case question

2014-11-24 Thread Gordon Benjamin
will be updated with the new data. And yes, the end user won't feel anything while you do the coalesce/repartition and all, but after that your dashboards will be refreshed with new data. Thanks Best Regards On Mon, Nov 24, 2014 at 4:54 PM, Gordon Benjamin gordon.benjami...@gmail.com
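
The coalesce/repartition step mentioned here can be sketched roughly like this; the paths and the target partition count are assumptions, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("DashboardRefresh"))

    // Hypothetical locations of the base data and the latest increment.
    val base  = sc.textFile("hdfs:///data/customers/base")
    val fresh = sc.textFile("hdfs:///data/customers/increment")

    // Coalesce the union so repeated loads don't leave many tiny partitions,
    // then re-cache and materialize before dashboard queries hit it.
    val combined = base.union(fresh).coalesce(16)
    combined.cache()
    combined.count()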

Incremental loading data slows performance

2014-11-20 Thread Gordon Benjamin
Hi, We are seeing bad performance as we incrementally load data. Here is the config: Spark standalone cluster: spark01 (spark master, shark, hadoop namenode): 15GB RAM, 4 vCPUs; spark02 (spark worker, hadoop datanode): 15GB RAM, 8 vCPUs; spark03 (spark worker): 15GB RAM, 8 vCPUs; spark04 (spark

Re: Incremental loading data slows performance

2014-11-20 Thread Gordon Benjamin
from ..._incremental. Perhaps this helps in understanding our issue. On Thursday, November 20, 2014, Gordon Benjamin gordon.benjami...@gmail.com wrote: Hi, We are seeing bad performance as we incrementally load data. Here is the config: Spark standalone cluster spark01 (spark master, shark
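
One plausible mechanism behind the slowdown, sketched with placeholder paths: each incremental union adds partitions and lengthens lineage, so cached scans get slower over time unless the table is periodically compacted:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("IncrementalLoad"))

    var customers: RDD[String] = sc.textFile("hdfs:///warehouse/customers")
    for (batch <- 1 to 10) {
      customers = customers.union(
        sc.textFile(s"hdfs:///warehouse/customers_incremental/batch=$batch"))
    }
    // Fragmentation is easy to confirm: partition count grows with each load.
    println(s"partitions after 10 loads: ${customers.partitions.length}")

    // One remedy: compact to a fixed partition count and re-cache.
    val compacted = customers.coalesce(32)
    compacted.cache()
    compacted.count()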

Debug Sql execution

2014-11-20 Thread Gordon Benjamin
hey, Can anyone tell me how to debug a SQL execution? Perhaps something that shows what the query is doing and how long each step takes?
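
Shark accepts Hive's EXPLAIN, and Spark SQL (1.x) supports EXPLAIN EXTENDED for the logical and physical plan; per-stage timings then show up in the web UI on port 4040. A small sketch, with the table name as a placeholder:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("ExplainDemo"))
    val sqlContext = new SQLContext(sc)

    // Prints the parsed, analyzed, optimized, and physical plans.
    sqlContext.sql("EXPLAIN EXTENDED SELECT COUNT(*) FROM customers")
      .collect()
      .foreach(println)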

partitioning to speed up queries

2014-11-07 Thread Gordon Benjamin
Hi All, I'm using Spark/Shark as the foundation for some reporting that I'm doing and have a customers table with approximately 3 million rows that I've cached in memory. I've also created a partitioned table, also cached in memory, on a per-day basis FROM customers_cached INSERT
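
The truncated "FROM customers_cached INSERT" reads like Hive-style multi-insert syntax with dynamic partitioning. A hedged reconstruction through Spark's HiveContext; the column names and target table are assumptions, not from the thread:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("PartitionByDay"))
    val hive = new HiveContext(sc)

    // Dynamic partitioning must be switched on for per-day partitions.
    hive.sql("SET hive.exec.dynamic.partition=true")
    hive.sql("SET hive.exec.dynamic.partition.mode=nonstrict")

    // Assumed schema: the partition column must come last in the SELECT.
    hive.sql("""
      FROM customers_cached
      INSERT OVERWRITE TABLE customers_by_day PARTITION (day)
      SELECT id, name, created_date AS day
    """)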