Hi All, I believe what we are looking for here is a serving layer where user queries can be executed on a subset of the processed data. We use Impala for this because it provides layered caching: in our use case it caches part of the data in memory, part in HDFS, and the full set lives on S3.
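To make the layering concrete, here is a toy read-through model of the three tiers described above. It is only a sketch of the idea, not Impala's API or its actual caching implementation: the `TieredStore` class, its capacities, and the tier names are hypothetical, and plain dictionaries stand in for memory, HDFS, and S3.

```python
from collections import OrderedDict


class TieredStore:
    """Toy read-through cache: memory -> local (HDFS-like) -> remote (S3-like).

    Hypothetical illustration only; it does not model Impala internals.
    """

    def __init__(self, remote, memory_capacity=2, local_capacity=4):
        self.remote = dict(remote)   # full data set, stands in for S3
        self.local = OrderedDict()   # middle tier, stands in for HDFS
        self.memory = OrderedDict()  # hot tier, stands in for RAM
        self.memory_capacity = memory_capacity
        self.local_capacity = local_capacity

    @staticmethod
    def _put(tier, capacity, key, value):
        # Insert or refresh the entry, then evict least recently used ones.
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > capacity:
            tier.popitem(last=False)

    def get(self, key):
        """Return (value, tier_name); tier_name is the tier that served the read."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        if key in self.local:
            value = self.local[key]
            self.local.move_to_end(key)
            # Promote to the hot tier so the next read is served from memory.
            self._put(self.memory, self.memory_capacity, key, value)
            return value, "local"
        value = self.remote[key]     # raises KeyError if the key does not exist
        self._put(self.local, self.local_capacity, key, value)
        self._put(self.memory, self.memory_capacity, key, value)
        return value, "remote"
```

A first read of a key is served from the remote tier and populates both caches; repeated reads then come from memory until the entry is evicted, at which point the middle tier can still answer without going back to the remote store.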
Our pipeline is: Spark Streaming + HBase as the processing layer -> extracts to Parquet on S3 -> Impala as the serving layer for user requests. Impala also has a SQL interface. The drawback is that Impala is not managed by YARN: it has its own resource manager, so you would have to figure out a way to make YARN and Impala co-exist.

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin <yohannjar...@hotmail.com>, wrote:
> Hello everyone,
>
> I'm also really interested in the answers as I will be facing the same issue soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I also noticed Alluxio to store Spark results in memory, which you might want to investigate.
>
> In my case I want to use them to have a real-time dashboard (or like waiting very few seconds to refine a dashboard), and that use case seems similar to your filter/aggregate of previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz <rah...@gmail.com>
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet might also be an option. Of course, management-wise it has much more overhead than using ES, since you need to manually define partitions and buckets, which is suboptimal. On the other hand, for querying, you can probably get some decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark were too slow/cumbersome.
> Depending on your particular access patterns, this may not be very practical, but as a general approach it might be one way to get intermediate results quicker, and with less of a storage-zoo than some alternatives.
>
> On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> > > I do think Kafka is overkill in this case. There are no streaming use cases that need a queue to do pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv" <vvs...@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
> > > > >
> > > > > I do not think so, in this case you will be able to process Parquet files as usual, but Kafka will allow your Elasticsearch cluster to be stable and survive regarding the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > > On jasbir.s...@accenture.com, Mar 16, 2017 7:52 AM wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Will MongoDB not fit this solution?
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar <bablo...@gmail.com>
> > > > > > Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; Richard Siebeling <rsiebel...@gmail.com>; user <user@spark.apache.org>; Shiva Ramagopal <tr.s...@gmail.com>
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > Hi Muthu,
> > > > > >
> > > > > > I did not catch from your message, what performance do you expect from subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > From ElasticSearch documentation (i have to see the logical plan of a query to confirm), the richness of filters (like regex,..) is pretty good while comparing to Cassandra. As for aggregates, i think Spark Dataframes is quite rich enough to tackle.
> > > > > > > Let me know your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Muthu
> > > > > > >
> > > > > > > On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
> > > > > > > > Hi muthu,
> > > > > > > >
> > > > > > > > I agree with Shiva, Cassandra also supports SASI indexes, which can partially replace Elasticsearch functionality.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Uladzimir
> > > > > > > >
> > > > > > > > Sent from my Mi phone
> > > > > > > > On Shiva Ramagopal <tr.s...@gmail.com>, Mar 15, 2017 5:57 PM wrote:
> > > > > > > > > Probably Cassandra is a good choice if you are mainly looking for a datastore that supports fast writes. You can ingest the data into a table and define one or more materialized views on top of it to support your queries. Since you mention that your queries are going to be simple you can define your indexes in the materialized views according to how you want to query the data.
> > > > > > > > > Thanks,
> > > > > > > > > Shiva
> > > > > > > > >
> > > > > > > > > On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
> > > > > > > > > > Hello Vincent,
> > > > > > > > > >
> > > > > > > > > > Cassandra may not fit my bill if I need to define my partition and other indexes upfront. Is this right?
> > > > > > > > > >
> > > > > > > > > > Hello Richard,
> > > > > > > > > >
> > > > > > > > > > Let me evaluate Apache Ignite. I did evaluate it 3 months back and back then the connector to Apache Spark did not support Spark 2.0.
> > > > > > > > > >
> > > > > > > > > > Another drastic thought may be to repartition the result count to 1 (but I have to be cautious to make sure I don't run into heap issues if the result is too large to fit into an executor) and write to a relational database like MySQL / Postgres. But, I believe I can do the same using ElasticSearch too.
> > > > > > > > > >
> > > > > > > > > > A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
> > > > > > > > > >
> > > > > > > > > > More thoughts welcome please.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Muthu
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebel...@gmail.com> wrote:
> > > > > > > > > > > maybe Apache Ignite does fit your requirements
> > > > > > > > > > >
> > > > > > > > > > > On 15 March 2017 at 08:44, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
> > > > > > > > > > > > Hi
> > > > > > > > > > > > If queries are static and filters are on the same columns, Cassandra is a good option.
> > > > > > > > > > > >
> > > > > > > > > > > > On Mar 15, 2017 7:04 AM, "muthu" <bablo...@gmail.com> wrote:
> > > > > > > > > > > > > Hello there,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I have one or more parquet files to read and perform some aggregate queries using Spark Dataframe. I would like to find a reasonably fast datastore that allows me to write the results for subsequent (simpler) queries.
> > > > > > > > > > > > > I did attempt to use ElasticSearch to write the query results using ElasticSearch Hadoop connector.
> > > > > > > > > > > > > But I am running into connector write issues if the number of Spark executors is too many for ElasticSearch to handle.
> > > > > > > > > > > > > But in the schema sense, this seems a great fit as ElasticSearch has smarts in place to discover the schema. Also in the query sense, I can perform simple filters and sort using ElasticSearch and for more complex aggregates, Spark Dataframe can come back to the rescue :).
> > > > > > > > > > > > > Please advise on other possible data-stores I could use?
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Muthu
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> > > > > > > > > > > > > Sent from the Apache Spark User List mailing list archive at Nabble.com.