Hi All,
I believe what we are looking for here is a serving layer where user queries
can be executed on a subset of the processed data.
In our case, we use Impala for this because it provides layered caching: it
caches part of the data set in memory and part in HDFS, while the full set
lives on S3.
Our pipeline is Spark Streaming + HBase -> extracts to Parquet on S3 ->
Impala as the serving layer for user requests. Impala also has a SQL interface.
The drawback is that Impala is not managed via YARN and has its own resource
manager, so you would have to figure out a way to make YARN and Impala coexist.
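
To make the layered lookup concrete, here is a toy sketch of the idea only. It
is not how Impala implements its caching; the class name, the tier capacities
and the LRU eviction are all assumptions made for illustration. A read checks
memory first, then the HDFS tier, then falls back to the full set on S3, and
promotes whatever it finds into the faster tiers.

```python
from collections import OrderedDict


class TieredStore:
    """Toy read-through lookup across tiers ordered fastest to slowest.

    Hypothetical illustration only: plain dicts stand in for the
    in-memory cache, the HDFS cache and the full data set on S3.
    """

    def __init__(self, memory_capacity, hdfs_capacity, s3_data):
        self.memory = OrderedDict()  # hot set, smallest tier
        self.hdfs = OrderedDict()    # warm set, stands in for HDFS
        self.s3 = dict(s3_data)      # full set, stands in for S3
        self.memory_capacity = memory_capacity
        self.hdfs_capacity = hdfs_capacity

    @staticmethod
    def _put(tier, capacity, key, value):
        # Insert as most recently used, then evict least recently used
        # entries until the tier fits its (assumed) capacity.
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > capacity:
            tier.popitem(last=False)

    def get(self, key):
        """Return (value, tier_name) and promote the entry to faster tiers."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        if key in self.hdfs:
            value = self.hdfs[key]
            self.hdfs.move_to_end(key)
            self._put(self.memory, self.memory_capacity, key, value)
            return value, "hdfs"
        value = self.s3[key]  # raises KeyError if the key does not exist
        self._put(self.hdfs, self.hdfs_capacity, key, value)
        self._put(self.memory, self.memory_capacity, key, value)
        return value, "s3"


if __name__ == "__main__":
    store = TieredStore(memory_capacity=1, hdfs_capacity=2,
                        s3_data={"part-0": "rows-0", "part-1": "rows-1"})
    print(store.get("part-0"))  # first read falls through to the S3 tier
    print(store.get("part-0"))  # repeat read is served from memory
```

The point of the sketch is just that repeated queries on a hot subset stop
paying the S3 round trip, which is what makes this workable as a serving layer.
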
Thanks,
Edwin
On Mar 16, 2017, 5:44 AM -0400, yohann jardin wrote:
> Hello everyone,
>
> I'm also really interested in the answers as I will be facing the same issue
> soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I
> also noticed Alluxio, which can store Spark results in memory and which you
> might want to investigate.
>
> In my case I want to use them for a real-time dashboard (or one where
> refining the view takes only a few seconds), and that use case seems similar
> to your filtering/aggregating of previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet
> might also be an option. Of course, management-wise it has much more overhead
> than using ES, since you need to manually define partitions and buckets,
> which is suboptimal. On the other hand, for querying, you can probably get
> some decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark
> proves too slow/cumbersome.
> Depending on your particular access patterns, this may not be very practical,
> but as a general approach it might be one way to get intermediate results
> quicker, and with less of a storage-zoo than some alternatives.
>
> > On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal wrote:
> > > I do think Kafka is overkill in this case. There are no streaming use
> > > cases here that need a queue for pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv" wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to
> > > > > >> ElasticSearch?
> > > > >
> > > > > I do not think so. In this case you will still be able to process
> > > > > Parquet files as usual, and Kafka will keep your Elasticsearch
> > > > > cluster stable regardless of the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > >
> > > > >
> > > > > On Mar 16, 2017 7:52 AM, jasbir.s...@accenture.com wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Would MongoDB not fit this use case?
> > > > > >
> > > > > >
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar
> > > > > > Cc: vincent gromakowski; Richard Siebeling; user; Shiva Ramagopal
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > Hi Muthu,
> > > > > >
> > > > > > I did not catch this from your message: what performance do you
> > > > > > expect from subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > According to the Elasticsearch documentation (I have to look at
> > > > > > > the logical plan of a query to confirm), its filters (regex, ...)
> > > > > > > are considerably richer than Cassandra's. As for aggregates, I
> > > > > > > think Spark DataFrames are rich enough to handle them.
> > > > > > > Let me know your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Muthu
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 15, 2017 at 10:55 AM, vvshvv wrote:
> > > > > > > > Hi Muthu,
> > > > > > > >
> > > > > > > > I agree with Shiva. Cassandra also supports SASI indexes, which
> > > > > > > > can partially replace Elasticsearch functionality.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Uladzimir
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mar 15, 2017 5:57 PM, Shiva Ramagopal wrote:
> > > > > > > > > Cassandra is probably a good choice if you are mainly looking
> > > > > > > > > for a datastore that supports fast writes. You can ingest the
> > > > > > > > > data into a table and define one or more materialized views
> > > > > > > > > on top of it to support your queries. Since you mention that
> > > > >