RE: RE: Fast write datastore...

Mal Edwin Thu, 16 Mar 2017 04:32:28 -0700

Hi All,
I believe what we are looking for here is a serving layer where user queries 
can be executed on a subset of the processed data.
For this scenario we use Impala, because it provides layered caching: in our 
use case it caches part of the data set in memory and more of it in HDFS, while 
the full set is on S3.
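
To make the layered lookup concrete, here is a minimal Python sketch of a read-through tiered cache. It only illustrates the idea and is not how Impala implements its caching; the class name TieredStore, the capacities, and the plain dictionaries standing in for memory, HDFS and S3 are all assumptions made up for this example.

```python
from collections import OrderedDict


class TieredStore:
    """Read-through lookup across tiers ordered fastest to slowest.

    Hypothetical illustration only: 'memory' stands in for RAM,
    'local' for HDFS, and 'full_set' for the complete data on S3.
    """

    def __init__(self, memory_capacity, local_capacity, full_set):
        self.memory = OrderedDict()   # smallest, fastest tier
        self.local = OrderedDict()    # larger, slower tier
        self.full_set = full_set      # complete data set, slowest tier
        self.memory_capacity = memory_capacity
        self.local_capacity = local_capacity

    @staticmethod
    def _put(tier, capacity, key, value):
        # Insert as most recently used, then evict the least recently used.
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > capacity:
            tier.popitem(last=False)

    def get(self, key):
        """Return (value, name of the tier that served the read)."""
        if key in self.memory:
            self.memory.move_to_end(key)
            return self.memory[key], "memory"
        if key in self.local:
            value = self.local[key]
            self.local.move_to_end(key)
            # Promote to the faster tier for the next read.
            self._put(self.memory, self.memory_capacity, key, value)
            return value, "local"
        value = self.full_set[key]  # raises KeyError if the key is unknown
        self._put(self.local, self.local_capacity, key, value)
        self._put(self.memory, self.memory_capacity, key, value)
        return value, "full"
```

A first read of a key is served from the full set and populates both caches; repeated reads are then served from the faster tiers until the entry is evicted.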


Our processing layer is Spark Streaming + HBase -> extracts to Parquet on S3 -> 
Impala as the serving layer handling user requests. Impala also has a SQL 
interface. The drawback is that Impala is not managed via YARN and has its own 
resource manager, so you would have to figure out a way to make YARN and Impala 
co-exist.

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin <yohannjar...@hotmail.com>, wrote:
> Hello everyone,
>
> I'm also really interested in the answers, as I will be facing the same issue 
> soon.
> Muthu, if you evaluate Apache Ignite again, could you share your results? I 
> also noticed Alluxio, which can store Spark results in memory; you might want 
> to investigate it.
>
> In my case I want to use them to power a real-time dashboard (or one where 
> waiting a few seconds to refine the view is acceptable), and that use case 
> seems similar to your filtering/aggregating of previously computed Spark 
> results.
>
> Regards,
> Yohann
>
> From: Rick Moritz <rah...@gmail.com>
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet 
> might also be an option. Of course, management-wise it has much more overhead 
> than using ES, since you need to manually define partitions and buckets, 
> which is suboptimal. On the other hand, for querying, you can probably get 
> some decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark 
> were too slow/cumbersome.
> Depending on your particular access patterns, this may not be very practical, 
> but as a general approach it might be one way to get intermediate results 
> quicker, and with less of a storage-zoo than some alternatives.
>
> > On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
> > > > I do think Kafka is overkill in this case. There are no streaming use 
> > > > cases here that need a queue for pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv" <vvs...@gmail.com> wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to 
> > > > > >> ElasticSearch?
> > > > >
> > > > > > I do not think so. In this case you will still be able to process the 
> > > > > > Parquet files as usual, but Kafka will allow your Elasticsearch 
> > > > > > cluster to stay stable regardless of the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > >
> > > > >
> > > > > > On Mar 16, 2017 7:52 AM, jasbir.s...@accenture.com wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Will MongoDB not fit this solution?
> > > > > >
> > > > > >
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar <bablo...@gmail.com>
> > > > > > Cc: vincent gromakowski <vincent.gromakow...@gmail.com>; Richard 
> > > > > > Siebeling <rsiebel...@gmail.com>; user <user@spark.apache.org>; 
> > > > > > Shiva Ramagopal <tr.s...@gmail.com>
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > > Hi Muthu,
> > > > > > >
> > > > > > > I did not catch this from your message: what performance do you 
> > > > > > > expect from the subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> 
> > > > > > wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > > From the ElasticSearch documentation (I have to see the logical 
> > > > > > > > plan of a query to confirm), the richness of its filters (regex, 
> > > > > > > > etc.) is pretty good compared to Cassandra. As for aggregates, I 
> > > > > > > > think Spark DataFrames are rich enough to handle them.
> > > > > > > Let me know your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Muthu
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
> > > > > > > > > Hi Muthu,
> > > > > > > > >
> > > > > > > > > I agree with Shiva. Cassandra also supports SASI indexes, which 
> > > > > > > > > can partially replace Elasticsearch functionality.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Uladzimir
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Mar 15, 2017 5:57 PM, Shiva Ramagopal <tr.s...@gmail.com> 
> > > > > > > > > wrote:
> > > > > > > > > > Cassandra is probably a good choice if you are mainly looking 
> > > > > > > > > > for a datastore that supports fast writes. You can ingest the 
> > > > > > > > > > data into a table and define one or more materialized views 
> > > > > > > > > > on top of it to support your queries. Since you mention that 
> > > > > > > > > > your queries are going to be simple, you can define your 
> > > > > > > > > > indexes in the materialized views according to how you want 
> > > > > > > > > > to query the data.
> > > > > > > > > Thanks,
> > > > > > > > > Shiva
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar 
> > > > > > > > > <bablo...@gmail.com> wrote:
> > > > > > > > > > Hello Vincent,
> > > > > > > > > >
> > > > > > > > > > Cassandra may not fit my bill if I need to define my 
> > > > > > > > > > partition and other indexes upfront. Is this right?
> > > > > > > > > >
> > > > > > > > > > Hello Richard,
> > > > > > > > > >
> > > > > > > > > > > Let me evaluate Apache Ignite. I did evaluate it three 
> > > > > > > > > > > months ago, and at that time the connector to Apache Spark 
> > > > > > > > > > > did not support Spark 2.0.
> > > > > > > > > >
> > > > > > > > > > > Another drastic thought may be to repartition the result to 
> > > > > > > > > > > a single partition (while being cautious not to run into 
> > > > > > > > > > > heap issues if the result is too large to fit into one 
> > > > > > > > > > > executor) and write it to a relational database like MySQL / 
> > > > > > > > > > > Postgres. But I believe I can do the same using 
> > > > > > > > > > > ElasticSearch too.
> > > > > > > > > >
> > > > > > > > > > A slightly over-kill solution may be Spark to Kafka to 
> > > > > > > > > > ElasticSearch?
> > > > > > > > > >
> > > > > > > > > > More thoughts welcome please.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Muthu
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling 
> > > > > > > > > > <rsiebel...@gmail.com> wrote:
> > > > > > > > > > > > Maybe Apache Ignite fits your requirements.
> > > > > > > > > > >
> > > > > > > > > > > On 15 March 2017 at 08:44, vincent gromakowski 
> > > > > > > > > > > <vincent.gromakow...@gmail.com> wrote:
> > > > > > > > > > > > Hi
> > > > > > > > > > > > > If the queries are static and the filters are on the 
> > > > > > > > > > > > > same columns, Cassandra is a good option.
> > > > > > > > > > > >
> > > > > > > > > > > > > On 15 March 2017 7:04 AM, "muthu" <bablo...@gmail.com> 
> > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > Hello there,
> > > > > > > > > > > > >
> > > > > > > > > > > > > > I have one or more Parquet files to read and want to 
> > > > > > > > > > > > > > perform some aggregate queries using Spark 
> > > > > > > > > > > > > > DataFrames. I would like to find a reasonably fast 
> > > > > > > > > > > > > > datastore that allows me to write the results for 
> > > > > > > > > > > > > > subsequent (simpler) queries.
> > > > > > > > > > > > > > I did attempt to use ElasticSearch to write the query 
> > > > > > > > > > > > > > results using the ElasticSearch Hadoop connector. But 
> > > > > > > > > > > > > > I am running into connector write issues if the 
> > > > > > > > > > > > > > number of Spark executors is too high for 
> > > > > > > > > > > > > > ElasticSearch to handle.
> > > > > > > > > > > > > > But in the schema sense, this seems a great fit, as 
> > > > > > > > > > > > > > ElasticSearch has smarts in place to discover the 
> > > > > > > > > > > > > > schema. Also in the query sense, I can perform simple 
> > > > > > > > > > > > > > filters and sorts using ElasticSearch, and for more 
> > > > > > > > > > > > > > complex aggregates, Spark DataFrames can come back to 
> > > > > > > > > > > > > > the rescue :).
> > > > > > > > > > > > > > Please advise on other possible datastores I could 
> > > > > > > > > > > > > > use.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Thanks,
> > > > > > > > > > > > > Muthu
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > View this message in context: 
> > > > > > > > > > > > > http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
> > > > > > > > > > > > > Sent from the Apache Spark User List mailing list 
> > > > > > > > > > > > > archive at Nabble.com.
> > > > > > > > > > > > >
> > > > > > > > > > > > > ---------------------------------------------------------------------
> > > > > > > > > > > > > To unsubscribe e-mail: 
> > > > > > > > > > > > > user-unsubscr...@spark.apache.org
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > > >
>
