Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
Hi,

I need to store terabytes of data that will be queried from BI tools such as
QlikView.

Queries may filter on any column.

We currently use Redshift for this, and I am exploring alternatives.

Is it possible to get better performance from Spark than from Redshift?

If so, please suggest the best way to achieve it.


Thanks!!
Kundan


Re: Handling Big data for interactive BI tools

2015-03-26 Thread Akhil Das
Yes, you can easily configure the Spark Thrift server and connect BI tools to it.
Here's an example
https://hadoopi.wordpress.com/2014/12/31/spark-connect-tableau-desktop-to-sparksql/
showing how to integrate SparkSQL with Tableau dashboards.

Thanks
Best Regards


Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
You can also preaggregate results for the queries your users run. Depending
on which queries they use, this may be necessary whatever the underlying
technology.
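
A minimal sketch of the preaggregation idea, assuming a hypothetical sales
table with `region`, `product` and `amount` columns (none of these names come
from the thread). Detail rows are rolled up once at load time, so a dashboard
query reads a handful of summary rows instead of scanning the raw data:

```python
from collections import defaultdict


def preaggregate(rows, dimensions, measure):
    """Roll detail rows up to one summary entry per distinct dimension tuple."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key]["sum"] += row[measure]
        totals[key]["count"] += 1
    return dict(totals)


# Hypothetical detail rows; in practice this would be the raw fact table.
rows = [
    {"region": "EU", "product": "A", "amount": 10.0},
    {"region": "EU", "product": "A", "amount": 5.0},
    {"region": "EU", "product": "B", "amount": 7.0},
    {"region": "US", "product": "A", "amount": 3.0},
]

# Built once during load, then read many times by the BI tool.
by_region_product = preaggregate(rows, ["region", "product"], "amount")
by_region = preaggregate(rows, ["region"], "amount")
```

Keeping both the sum and the count lets averages be derived from the rollup.
The trade-off is that a rollup only answers queries on the dimensions it was
built for, which is why the choice depends on the queries users actually run.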

Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across

http://www.jethrodata.com/


Re: Handling Big data for interactive BI tools

2015-03-26 Thread kundan kumar
I was looking for some options and came across JethroData.

http://www.jethrodata.com/

JethroData stores the data with indexes on every column, which looks
promising, and it claims better performance than Impala.
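
To illustrate why an index on every column helps with filters on arbitrary
columns, here is a toy sketch. It is not a description of how JethroData works
internally; the `ColumnIndex` class and the sample rows are hypothetical. Each
column maps every value to the set of row ids that hold it, and a filter on
several columns becomes an intersection of those sets:

```python
from collections import defaultdict


class ColumnIndex:
    """Toy per-column inverted index: column -> value -> set of row ids."""

    def __init__(self, rows):
        self.rows = rows
        self.index = defaultdict(lambda: defaultdict(set))
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                self.index[column][value].add(row_id)

    def filter(self, **conditions):
        """Return rows matching every column=value condition."""
        matching = None
        for column, value in conditions.items():
            ids = self.index[column].get(value, set())
            matching = ids if matching is None else matching & ids
        if matching is None:  # no conditions given: every row matches
            matching = set(range(len(self.rows)))
        return [self.rows[i] for i in sorted(matching)]


# Hypothetical sample data.
rows = [
    {"country": "IN", "device": "mobile", "year": 2014},
    {"country": "IN", "device": "desktop", "year": 2015},
    {"country": "US", "device": "mobile", "year": 2015},
]
idx = ColumnIndex(rows)
mobile_2015 = idx.filter(device="mobile", year=2015)
```

The price is extra work on every insert, since each row touches one index
entry per column.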

Earlier I had tried Apache Phoenix because of its secondary indexing
feature. The major challenge I faced there was that secondary indexes were
not supported by the bulk loading process.
Only the sequential loading process maintained the secondary indexes, and it
took longer.


Any comments on this?




Re: Handling Big data for interactive BI tools

2015-03-26 Thread Denny Lee
BTW, a tool I have been using to preaggregate data with HyperLogLog in
combination with Spark is AtScale (http://atscale.com/).
It builds the aggregations and makes use of the speed of SparkSQL, all
within the context of a model that is accessible from Tableau or Qlik.
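
For readers unfamiliar with HyperLogLog, the sketch below shows why it suits
preaggregation: distinct counts cannot be summed across partitions, but
HyperLogLog sketches can be merged. This is a simplified illustration, not
AtScale's implementation; the class, the register count and the `user-N` ids
are assumptions made for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Simplified HyperLogLog with 2**p registers and a 64-bit hash."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Position of the leftmost 1-bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Combine two sketches by taking the register-wise maximum."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b)
                            for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# One sketch per day, built at load time (hypothetical user ids).
day1, day2 = HyperLogLog(), HyperLogLog()
for u in range(6000):
    day1.add(f"user-{u}")
for u in range(4000, 10000):
    day2.add(f"user-{u}")

# Approximate distinct users over both days, without rescanning raw rows.
both_days = day1.merge(day2)
estimate = both_days.count()
```

The merged sketch has the same registers as one built from all the raw values
directly, so daily sketches can be stored and combined at query time.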

Re: Handling Big data for interactive BI tools

2015-03-26 Thread Jörn Franke
As I wrote previously, indexing is not your only choice. You can
preaggregate data during load or, depending on your needs, consider other
data structures such as graphs, HyperLogLog, or Bloom filters (these are a
challenge to integrate with standard BI tools).
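
As an illustration of one of these structures, here is a minimal Bloom filter
sketch. The class name, sizes and sample values are assumptions made for the
example. It answers "might this value be present?" from a small bit array, so
a query engine could skip data blocks that definitely do not contain the
filter value:

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        """False means definitely absent; True means possibly present."""
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical values stored in one data block.
block_filter = BloomFilter()
for customer in ["alice", "bob", "carol"]:
    block_filter.add(customer)

should_scan_block = block_filter.might_contain("bob")
```

A negative answer is always correct, while a positive answer can be a false
positive, so the filter only ever saves work and never hides matching rows.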