BTW, a tool I have been using to preaggregate data with HyperLogLog in combination with Spark is AtScale (http://atscale.com/). It builds the aggregations and takes advantage of the speed of Spark SQL, all within the context of a model that is accessible from Tableau or Qlik.
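For anyone wondering why HyperLogLog fits preaggregation so well: the sketches are small and mergeable, so you can build one per day (or per partition) at load time and combine them at query time to answer distinct-count questions without rescanning the raw rows. Below is a minimal Python sketch of that idea. It is not how AtScale or Spark implements it; the `HyperLogLog` class, the `preaggregate` helper, the choice of SHA-1 as the hash, and the sample data are all my own assumptions for illustration.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog sketch with 2**p registers (hypothetical helper)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        # Derive a 64-bit hash from SHA-1 (an assumption; any good hash works).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                   # top p bits pick the register
        rest = h & ((1 << width) - 1)      # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Merging is a register-wise max, which is what makes
        # preaggregated sketches combinable at query time.
        if self.p != other.p:
            raise ValueError("sketches must use the same precision")
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)   # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return m * math.log(m / zeros)
        return raw


def preaggregate(events):
    """Build one sketch per day from (day, user_id) pairs (hypothetical helper)."""
    sketches = {}
    for day, user in events:
        sketches.setdefault(day, HyperLogLog()).add(user)
    return sketches


# Made-up sample data: two days with overlapping users (10,000 distinct overall).
events = [("2015-03-25", f"user{i}") for i in range(0, 6000)]
events += [("2015-03-26", f"user{i}") for i in range(4000, 10000)]

daily = preaggregate(events)
total = daily["2015-03-25"].merge(daily["2015-03-26"]).estimate()
print("approx. distinct users over both days:", round(total))
```

The point of the design is that only the per-day sketches need to be stored. Any date range can then be answered by merging the relevant sketches, at the cost of a small approximation error.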
On Thu, Mar 26, 2015 at 8:55 AM Jörn Franke <jornfra...@gmail.com> wrote:

> As I wrote previously, indexing is not your only choice. You can
> preaggregate data during load or, depending on your needs, think
> about other data structures such as graphs, HyperLogLog, Bloom filters,
> etc. (the challenge is integrating them into standard BI tools).
>
> On 26 March 2015 13:34, "kundan kumar" <iitr.kun...@gmail.com> wrote:
>
>> I was looking for some options and came across JethroData.
>>
>> http://www.jethrodata.com/
>>
>> It stores the data with indexes maintained over all the columns, which
>> seems good, and it claims to have better performance than Impala.
>>
>> Earlier I had tried Apache Phoenix because of its secondary indexing
>> feature. But the major challenge I faced there was that secondary
>> indexing was not supported for the bulk loading process.
>> Only the sequential loading process supported secondary indexes,
>> and it took longer.
>>
>> Any comments on this?
>>
>> On Thu, Mar 26, 2015 at 5:59 PM, kundan kumar <iitr.kun...@gmail.com>
>> wrote:
>>
>>> I was looking for some options and came across
>>>
>>> http://www.jethrodata.com/
>>>
>>> On Thu, Mar 26, 2015 at 5:47 PM, Jörn Franke <jornfra...@gmail.com>
>>> wrote:
>>>
>>>> You can also preaggregate results for the users' queries.
>>>> Depending on what queries they use, this might be necessary for any
>>>> underlying technology.
>>>>
>>>> On 26 March 2015 11:27, "kundan kumar" <iitr.kun...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I need to store terabytes of data which will be used by BI tools like
>>>>> QlikView.
>>>>>
>>>>> The queries can filter on any column.
>>>>>
>>>>> Currently, we are using Redshift for this purpose.
>>>>>
>>>>> I am trying to explore options other than Redshift.
>>>>>
>>>>> Is it possible to get better performance in Spark than in
>>>>> Redshift?
>>>>>
>>>>> If yes, please suggest the best way to achieve this.
>>>>>
>>>>> Thanks!!
>>>>> Kundan