
Re: RE: Fast write datastore...

2017-03-16 Thread Sudhir Menon

I am extremely leery of pushing a product on this forum and have refrained
from it in the past. But since you are talking about loading Parquet data
into Spark, running some aggregate queries, and then writing the results to a
fast data store, and are specifically asking for product options, it makes
sense to consider SnappyData. SnappyData turns Spark into a fast read-write
store, so you can do what you are trying to do with a single cluster that
hosts both Spark and the database. It is an in-memory store that supports
high concurrency, fast lookups, and queries via ODBC/JDBC/Thrift. The tables
stored in the database are accessible as DataFrames, and you can use the
Spark API to access the data.

Check it out here <http://www.snappydata.io/download>. Happy to answer any
questions (there are tons of resources on the site, and you can post
questions on the Slack channel <https://snappydata-public.slack.com>).


RE: RE: Fast write datastore...

2017-03-16 Thread Mal Edwin

Hi All,
I believe what we are looking for here is a serving layer where user queries
can be executed on a subset of the processed data.
We use Impala for this because it provides layered caching: in our use case
it caches some of the data in memory and some in HDFS, while the full set is
on S3.

Our processing layer is Spark Streaming + HBase -> extracts to Parquet on S3 ->
Impala as the serving layer for user requests. Impala also has a SQL interface.
The drawback is that Impala is not managed via YARN and has its own resource
manager, so you would have to figure out a way to make YARN and Impala co-exist.

Thanks,
Edwin


RE: RE: Fast write datastore...

2017-03-16 Thread yohann jardin

Hello everyone,

I'm also really interested in the answers, as I will be facing the same issue
soon.
Muthu, if you evaluate Apache Ignite again, could you share your results? I
also came across Alluxio, which can store Spark results in memory and which
you might want to investigate.

In my case I want to use the stored results to back a real-time dashboard (or
one where refining the view takes only a few seconds), and that use case
seems similar to your filter/aggregate queries over previously computed Spark
results.

Regards,
Yohann


Re: RE: Fast write datastore...

2017-03-16 Thread Rick Moritz

If you have enough RAM/SSDs available, tiered HDFS storage with Parquet
might also be an option. Of course, management-wise it has much more
overhead than using ES, since you need to define partitions and buckets
manually, which is suboptimal. On the other hand, for querying, you can
probably get decent performance by hooking up Impala, Presto, or LLAP-Hive
if Spark proves too slow or cumbersome.
Depending on your particular access patterns, this may not be very
practical, but as a general approach it might be one way to get
intermediate results quicker, and with less of a storage zoo than some
alternatives.

On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal  wrote:

> I do think Kafka is overkill in this case. There are no streaming use
> cases that need a queue to do pub-sub.
>
> On 16-Mar-2017 11:47 AM, "vvshvv"  wrote:
>
>> Hi,
>>
>> >> A slightly over-kill solution may be Spark to Kafka to ElasticSearch?
>>
>> I do not think so. In this case you will be able to process Parquet files
>> as usual, and Kafka will let your Elasticsearch cluster stay stable
>> regardless of the number of rows.
>>
>> Regards,
>> Uladzimir

RE: Fast write datastore...

2017-03-15 Thread jasbir.sing

Hi,

Will MongoDB not fit this solution?

From: Vova Shelgunov [mailto:vvs...@gmail.com]
Sent: Wednesday, March 15, 2017 11:51 PM
To: Muthu Jayakumar
Cc: vincent gromakowski; Richard Siebeling; user; Shiva Ramagopal
Subject: Re: Fast write datastore...

Hi Muthu,

I could not tell from your message: what performance do you expect from
subsequent queries?

Regards,
Uladzimir

On Mar 15, 2017 9:03 PM, "Muthu Jayakumar" <bablo...@gmail.com> wrote:
Hello Uladzimir / Shiva,

From the Elasticsearch documentation (I have to look at the logical plan of a
query to confirm), the richness of filters (regex and the like) is pretty good
compared to Cassandra. As for aggregates, I think Spark DataFrames are rich
enough to handle them.
Let me know your thoughts.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 10:55 AM, vvshvv <vvs...@gmail.com> wrote:
Hi Muthu,

I agree with Shiva. Cassandra also supports SASI indexes, which can partially
replace Elasticsearch functionality.

Regards,
Uladzimir

On Mar 15, 2017 5:57 PM, Shiva Ramagopal <tr.s...@gmail.com> wrote:
Cassandra is probably a good choice if you are mainly looking for a datastore
that supports fast writes. You can ingest the data into a table and define one
or more materialized views on top of it to support your queries. Since you
mention that your queries are going to be simple, you can define your indexes
in the materialized views according to how you want to query the data.
Thanks,
Shiva

On Wed, Mar 15, 2017 at 7:58 PM, Muthu Jayakumar <bablo...@gmail.com> wrote:
Hello Vincent,

Cassandra may not fit the bill if I need to define my partition key and other
indexes upfront. Is this right?

Hello Richard,

Let me evaluate Apache Ignite. I did evaluate it 3 months back, and back then
the connector to Apache Spark did not support Spark 2.0.

Another drastic thought may be to repartition the result to 1 partition (but I
have to be cautious to make sure I don't run into heap issues if the result is
too large to fit into an executor) and write to a relational database like
MySQL / Postgres. But I believe I can do the same using Elasticsearch too.

A slightly overkill solution may be Spark to Kafka to Elasticsearch?

More thoughts welcome please.

Thanks,
Muthu

On Wed, Mar 15, 2017 at 4:53 AM, Richard Siebeling <rsiebel...@gmail.com> wrote:
Maybe Apache Ignite fits your requirements.

On 15 March 2017 at 08:44, vincent gromakowski <vincent.gromakow...@gmail.com> wrote:
Hi
If queries are static and filters are on the same columns, Cassandra is a good
option.

On 15 March 2017 7:04 AM, "muthu" <bablo...@gmail.com> wrote:
Hello there,

I have one or more Parquet files to read and perform some aggregate queries on
using Spark DataFrames. I would like to find a reasonably fast datastore that
allows me to write the results for subsequent (simpler) queries.
I did attempt to use Elasticsearch to write the query results using the
Elasticsearch Hadoop connector. But I am running into connector write issues
if the number of Spark executors is too high for Elasticsearch to handle.
But in the schema sense, this seems a great fit, as Elasticsearch has the
smarts in place to discover the schema. Also, in the query sense, I can perform
simple filters and sorts using Elasticsearch, and for more complex aggregates,
Spark DataFrames can come to the rescue :).
Please advise on other possible datastores I could use.

Thanks,
Muthu

--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Fast-write-datastore-tp28497.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar

> Reading your original question again, it seems to me probably you don't
> need a fast data store
Shiva, you are right. I only asked about fast writes and never mentioned
reads :). For us, Cassandra may not be a good choice for reads because of:
a. limitations on server-side pagination support
b. less rich filters compared to Elasticsearch... but this can be worked
around by using Spark DataFrames.
c. a possibly larger limitation for me, which is the mandate to create a
partition key column beforehand. I may not be able to determine this
beforehand.
But materialized views and SSTable Attached Secondary Indexes (SASI) can help
alleviate this to some extent.

> what performance do you expect from subsequent queries?
Uladzimir, here is what we do now...
Step 1: Run an aggregate query over a large number of Parquet files
(generally ranging from a few MBs to a few GBs) using Spark DataFrames.
Step 2: Attempt to store these query results in a 'fast datastore' (the one I
asked for recommendations on in this post). The data usually ranges from
250K to 600 million rows... Also, the schema from Step 1 is not known
beforehand and is usually deduced from the DataFrame schema or so. In most
cases the fields are simple, non-structural ones.
Step 3: Run one or more queries on the results stored in Step 2... These are
as simple as pagination, filters (think simple string contains, regex,
number in range, ...) and sort. For any operation more complex than this, I
have been planning to run it through a DataFrame.

Koert makes valid points on the issues with Elasticsearch.

On a side note, we do use Cassandra for Spark Streaming use cases where we
sink the data into Cassandra (for efficient upsert capabilities) and
eventually write into Parquet for long-term storage and trend analysis with
full-table-scan scenarios.

I am thankful for the many ideas and perspectives on how this could be
looked at.

Thanks,
Muthu
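
To make the Step 3 query shapes concrete, here is a minimal sketch in plain Python of the operations named above: string-contains, regex, and number-in-range filters, followed by a sort and pagination. It is only an illustration of what the datastore would need to support, not how any particular store implements it. The row layout, field names (host, region, hits) and helper names are all made up for the example.

```python
import re
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]
Filter = Callable[[Row], bool]


def contains(field: str, needle: str) -> Filter:
    """Filter: the field, read as a string, contains a substring."""
    return lambda row: needle in str(row.get(field, ""))


def matches(field: str, pattern: str) -> Filter:
    """Filter: the field, read as a string, matches a regular expression."""
    compiled = re.compile(pattern)
    return lambda row: compiled.search(str(row.get(field, ""))) is not None


def in_range(field: str, low: float, high: float) -> Filter:
    """Filter: the numeric field lies in the closed interval [low, high]."""
    return lambda row: row.get(field) is not None and low <= row[field] <= high


def query(rows: List[Row], filters: List[Filter], sort_by: str,
          page: int, page_size: int) -> List[Row]:
    """Keep rows passing every filter, sort by one field, return a 0-based page."""
    kept = [row for row in rows if all(f(row) for f in filters)]
    kept.sort(key=lambda row: row[sort_by])
    start = page * page_size
    return kept[start:start + page_size]


# Hypothetical rows standing in for aggregate results stored in Step 2.
RESULTS: List[Row] = [
    {"host": "web-01", "region": "eu", "hits": 120},
    {"host": "web-02", "region": "us", "hits": 75},
    {"host": "db-01", "region": "eu", "hits": 310},
    {"host": "web-03", "region": "eu", "hits": 42},
]

FILTERS = [
    contains("region", "eu"),
    matches("host", r"^web-\d+$"),
    in_range("hits", 40, 200),
]

first_page = query(RESULTS, FILTERS, sort_by="hits", page=0, page_size=1)
second_page = query(RESULTS, FILTERS, sort_by="hits", page=1, page_size=1)
```

With 250K to 600 million rows, the point of the datastore is to evaluate these filters and the sort on the server side, using an index, rather than scanning every row as this sketch does.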



Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal

Hi,

The choice of ES vs Cassandra should really be made depending on your query
use cases. ES and Cassandra have their own strengths, which should be
matched to what you want to do rather than making a choice based on their
respective feature sets.

Reading your original question again, it seems to me that you probably don't
need a fast data store, since you are doing batch-like processing (reading
from Parquet files) and it is possible to control this part fully. It
also seems like you want to use ES. You can try reducing the number of
Spark executors to throttle the writes to ES.

-Shiva
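
As a rough sketch of the arithmetic behind "reduce the number of Spark executors": the number of tasks that can write at the same time is bounded by executors times task slots per executor. The Python below works backwards from a writer budget to an executor count. The budget of 16 concurrent writers is a made-up figure, not a measured Elasticsearch limit, and the helper names are illustrative. The configuration keys in the final dictionary are standard Spark settings; with dynamic allocation enabled, the executor cap would go on spark.dynamicAllocation.maxExecutors instead.

```python
def slots_per_executor(executor_cores: int, task_cpus: int = 1) -> int:
    """Tasks one executor can run at once (cores divided by CPUs per task)."""
    slots = executor_cores // task_cpus
    if slots < 1:
        raise ValueError("executor_cores must be at least task_cpus")
    return slots


def concurrent_write_tasks(executor_instances: int, executor_cores: int,
                           task_cpus: int = 1) -> int:
    """Upper bound on tasks (and so writers) running at the same time."""
    return executor_instances * slots_per_executor(executor_cores, task_cpus)


def max_executors_for_budget(writer_budget: int, executor_cores: int,
                             task_cpus: int = 1) -> int:
    """Largest executor count that keeps concurrent writers within the budget.

    Never returns less than 1, so a budget smaller than one executor's slots
    is exceeded; in that case lower executor_cores as well.
    """
    return max(1, writer_budget // slots_per_executor(executor_cores, task_cpus))


# Assumption for illustration: the ES cluster copes with 16 concurrent writers.
WRITER_BUDGET = 16
EXECUTOR_CORES = 4

executors = max_executors_for_budget(WRITER_BUDGET, EXECUTOR_CORES)

# Settings for the write job only; other jobs can keep their usual sizing.
write_job_conf = {
    "spark.executor.instances": str(executors),
    "spark.executor.cores": str(EXECUTOR_CORES),
    "spark.task.cpus": "1",
}
```

Sizing only the job that writes to ES this way leaves the aggregate step free to use the whole cluster, at the cost of running the write as a separate, slower stage.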


Re: Fast write datastore...

2017-03-15 Thread Koert Kuipers

we are using elasticsearch for this.

the issue of elasticsearch falling over if the number of partitions/cores
in spark writing to it is too high does suck indeed. and the answer every
time i asked about it on elasticsearch mailing list has been to reduce
spark tasks or increase elasticsearch nodes, which is not very useful.

we ended up putting the spark jobs that write to elasticsearch on a yarn
queue that limits cores. not ideal but it does the job.
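
a per-job alternative to the queue setup above (not what is described in this message, just a sketch) is to cap the number of partitions right before the write, since the number of tasks writing at once can never exceed the number of partitions. the python below only does the arithmetic. the node count and writers-per-node figures are made-up assumptions, and the function names are illustrative. in spark the resulting number would be passed to DataFrame.coalesce before the write.

```python
def concurrent_writers(partitions: int, task_slots: int) -> int:
    """Tasks writing at once: limited by both partition count and free slots."""
    return min(partitions, task_slots)


def capped_partitions(current_partitions: int, es_nodes: int,
                      writers_per_node: int) -> int:
    """Partition count to coalesce to so writers stay within the ES budget."""
    budget = es_nodes * writers_per_node
    return max(1, min(current_partitions, budget))


# Assumptions for illustration only: 3 ES nodes, 4 concurrent writers each.
target = capped_partitions(current_partitions=200, es_nodes=3, writers_per_node=4)

# Even with 64 free task slots, only `target` tasks can write at the same time.
writers = concurrent_writers(target, task_slots=64)

# In Spark this would be applied as df.coalesce(target) before the write.
```

the trade-off is that fewer partitions means larger partitions, so each write task handles more rows and needs more executor memory.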


Re: Fast write datastore...

2017-03-15 Thread Vova Shelgunov

Hi Muthu,

I did not catch from your message what performance you expect from
subsequent queries.

Regards,
Uladzimir

Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar

Hello Uladzimir / Shiva,

From the Elasticsearch documentation (I have to see the logical plan of a
query to confirm), the richness of its filters (regex and so on) looks
pretty good compared to Cassandra. As for aggregates, I think Spark
DataFrames are rich enough to handle them.
Let me know your thoughts.

Thanks,
Muthu


Re: Fast write datastore...

2017-03-15 Thread Shiva Ramagopal

Cassandra is probably a good choice if you are mainly looking for a
datastore that supports fast writes. You can ingest the data into a table
and define one or more materialized views on top of it to support your
queries. Since you mention that your queries are going to be simple, you
can key each materialized view according to how you want to query the
data.
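
To make the materialized-view idea concrete, here is a small sketch that
builds the CQL for a view keyed by the column you want to filter on. The
table, view and column names are hypothetical, and the helper itself is
only for illustration. It relies on the Cassandra rules (3.0 and later)
that a view's primary key must contain every primary-key column of the base
table and that each of those columns needs an IS NOT NULL restriction.

```python
def materialized_view_cql(base_table, view_name, query_column, base_primary_key):
    """Build a CREATE MATERIALIZED VIEW statement keyed by query_column.

    base_primary_key lists the base table's primary-key columns; they are
    appended after query_column so the view key still identifies each row.
    """
    key_columns = [query_column] + [c for c in base_primary_key if c != query_column]
    not_null = " AND ".join(f"{c} IS NOT NULL" for c in key_columns)
    primary_key = ", ".join(key_columns)
    return (
        f"CREATE MATERIALIZED VIEW {view_name} AS "
        f"SELECT * FROM {base_table} "
        f"WHERE {not_null} "
        f"PRIMARY KEY ({primary_key})"
    )


# Hypothetical example: aggregated results looked up by region.
print(materialized_view_cql("results", "results_by_region", "region", ["result_id"]))
```

One view per query pattern keeps reads simple, at the cost of extra write
work for every view.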

Thanks,
Shiva



Re: Fast write datastore...

2017-03-15 Thread Muthu Jayakumar

Hello Vincent,

Cassandra may not fit the bill if I need to define my partitioning and
other indexes up front. Is that right?

Hello Richard,

Let me evaluate Apache Ignite. I evaluated it 3 months ago, and at the
time its connector to Apache Spark did not support Spark 2.0.

Another, more drastic thought is to repartition the result down to 1
partition (while being cautious not to run into heap issues if the result
is too large to fit into one executor) and write it to a relational
database like MySQL / Postgres. But I believe I can do the same with
Elasticsearch too.
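
A rough sketch of the single-writer idea that keeps the heap concern in
check: stream the rows in fixed-size batches instead of materializing the
whole result. It uses the standard-library sqlite3 module purely as a
stand-in for MySQL / Postgres, and the `results` table with its
`key`/`value` columns is hypothetical. With Spark, the rows could come from
`df.toLocalIterator()`, or the write could be handed to
`df.coalesce(1).write.jdbc(...)` instead.

```python
import sqlite3
from itertools import islice


def write_in_batches(conn, rows, batch_size=1000):
    """Insert (key, value) rows in fixed-size batches and return the row count.

    Only one batch is held in memory at a time, so the full result never
    has to fit on the heap of the single writer.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (key TEXT PRIMARY KEY, value REAL)"
    )
    iterator = iter(rows)
    written = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        conn.executemany("INSERT INTO results (key, value) VALUES (?, ?)", batch)
        conn.commit()  # commit per batch so a failure loses at most one batch
        written += len(batch)
    return written


# Usage with an in-memory database and a generator of made-up rows.
conn = sqlite3.connect(":memory:")
count = write_in_batches(conn, ((f"k{i}", float(i)) for i in range(2500)))
```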

A slightly overkill solution might be Spark to Kafka to Elasticsearch?

More thoughts are welcome.

Thanks,
Muthu

Re: Fast write datastore...

2017-03-15 Thread Richard Siebeling

Maybe Apache Ignite fits your requirements.

Re: Fast write datastore...

2017-03-15 Thread vincent gromakowski

Hi,
If the queries are static and the filters are on the same columns,
Cassandra is a good option.

On 15 March 2017 at 7:04 AM, "muthu" wrote:

Hello there,

I have one or more Parquet files to read and run some aggregate queries on
using Spark DataFrames. I would like to find a reasonably fast datastore
that lets me write the results for subsequent (simpler) queries.
I did attempt to write the query results to Elasticsearch using the
Elasticsearch Hadoop connector, but I run into connector write issues
when there are more Spark executors than Elasticsearch can handle.
Schema-wise this seems a great fit, as Elasticsearch has the smarts
to discover the schema. Query-wise, I can perform simple filters and
sorts in Elasticsearch, and for more complex aggregates
Spark DataFrames can come back to the rescue :).
Please advise on other possible datastores I could use.

Thanks,
Muthu



