RE: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-10 Thread Durity, Sean R
RF in the Analytics DC can be 2 (or even 1) if storage cost is more important 
than availability. There is a storage (and CPU and network latency) cost for a 
separate Spark cluster. So, the variables of your specific use case may swing 
the decision in different directions.
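Sean's storage trade-off can be made concrete with back-of-the-envelope arithmetic: total raw storage is the unique dataset size times the sum of replication factors across DCs. A minimal sketch (the dataset size and RF values are illustrative, not from the thread):

```python
def total_storage_tb(dataset_tb: float, rf_per_dc: dict) -> float:
    """Raw storage needed when each DC keeps rf full copies of the dataset."""
    return dataset_tb * sum(rf_per_dc.values())

# 10 TB of unique data, RF=3 in the realtime DC only:
realtime_only = total_storage_tb(10, {"realtime": 3})                        # 30 TB

# Adding an analytics DC at RF=2 (the cheaper option) vs. matching RF=3:
with_analytics_rf2 = total_storage_tb(10, {"realtime": 3, "analytics": 2})   # 50 TB
with_analytics_rf3 = total_storage_tb(10, {"realtime": 3, "analytics": 3})   # 60 TB

print(realtime_only, with_analytics_rf2, with_analytics_rf3)
```

Dropping the analytics DC to RF=2 (or 1) trims the duplication cost Dor objects to below, at the price of availability in that DC.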


Sean Durity
From: Dor Laor 
Sent: Wednesday, January 09, 2019 11:23 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
<sean_r_dur...@homedepot.com> wrote:
I think you could consider option C: create a (new) analytics DC in Cassandra
and run your Spark nodes there. Then you can address the scaling just on that
DC. You can also use fewer vnodes, replicate only certain keyspaces, etc. in
order to perform the analytics more efficiently.
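The "replicate only certain keyspaces" part of option C is expressed per keyspace with NetworkTopologyStrategy. A hedged CQL sketch (keyspace names, DC names, and RF values are hypothetical):

```sql
-- Replicate the keyspace the analytics jobs actually read into the new DC:
ALTER KEYSPACE orders
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'realtime_dc': '3',
    'analytics_dc': '2'
  };

-- Leave purely transactional keyspaces out of the analytics DC entirely:
ALTER KEYSPACE sessions
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'realtime_dc': '3'
  };
```

A repair (e.g. `nodetool rebuild -- analytics_dc`) is needed after adding the DC so existing data streams in.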

But this way you duplicate the entire dataset RF times over, which is very
expensive. It is common practice to run Spark against a separate Cassandra
(virtual) datacenter, but that is done to isolate the analytics workload from
the realtime workload and preserve low-latency guarantees.
We addressed this problem elsewhere, beyond this scope.



Sean Durity

From: Dor Laor <d...@scylladb.com>
Sent: Friday, January 04, 2019 4:21 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

I strongly recommend option B, separate clusters. Reasons:
 - Node-to-node networking is negligible compared to networking within the node.
 - Different scaling considerations.
   Your workload may require 10 Spark nodes and 20 database nodes, so why
   bundle them? This ratio may also change over time as your application
   evolves and the amount of data changes.
 - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't want it
   to affect Cassandra, and vice versa.
   If you isolate it with cgroups, you may have too much idle time when the
   above doesn't happen.
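With option B, pointing the separate Spark cluster at the Cassandra cluster is just connection configuration. A sketch of a `spark-defaults.conf`, assuming the DataStax spark-cassandra-connector is used (hosts and the connector version are hypothetical; match the version to your Spark/Scala build):

```properties
# Contact points are the remote Cassandra cluster, not local nodes:
spark.cassandra.connection.host   10.0.1.10,10.0.1.11,10.0.1.12
spark.cassandra.connection.port   9042
# Pull in the connector package:
spark.jars.packages               com.datastax.spark:spark-cassandra-connector_2.11:2.4.0
```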


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
<goutham.chiru...@gmail.com> wrote:
Hi,
We have a requirement for heavy data lifting and analytics, and have decided
to go with Apache Spark. In the process we have come up with two patterns:
a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
b. Apache Spark on one independent cluster and Apache Cassandra as one 
independent cluster.

We need a good pattern for using the analytics engine with Cassandra. Thanks in
advance.

Regards
Goutham.



The information in this Internet Email is confidential and may be legally 
privileged. It is intended solely for the addressee. Access to this Email by 
anyone else is unauthorized. If you are not the intended recipient, any 
disclosure, copying, distribution or any action taken or omitted to be taken in 
reliance on it, is prohibited and may be unlawful. When addressed to our 
clients any opinions or advice contained in this Email are subject to the terms 
and conditions expressed in any applicable governing The Home Depot terms of 
business or client engagement letter. The Home Depot disclaims all 
responsibility and liability for the accuracy and content of this attachment 
and for any damages or losses arising from any inaccuracies, errors, viruses, 
e.g., worms, trojan horses, etc., or other items of a destructive nature, which 
may be contained in this attachment and shall not be liable for direct, 
indirect, consequential or special damages in connection with this e-mail 
message or its attachment.





RE: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-10 Thread Durity, Sean R
At this point, I would be talking to DataStax. They already have Spark and 
Solr/search fully embedded in their product. You can look at their docs for
some idea of the RAM and CPU required for combined Search/Analytics use cases. 
I would expect this to be a much faster route to production.


Sean Durity
From: Goutham reddy 
Sent: Wednesday, January 09, 2019 11:29 AM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

Thanks Sean. But what if I want to have both Spark and Elasticsearch with
Cassandra as a separate data center? Does that cause any overhead?

On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
<sean_r_dur...@homedepot.com> wrote:
I think you could consider option C: create a (new) analytics DC in Cassandra
and run your Spark nodes there. Then you can address the scaling just on that
DC. You can also use fewer vnodes, replicate only certain keyspaces, etc. in
order to perform the analytics more efficiently.


Sean Durity

From: Dor Laor <d...@scylladb.com>
Sent: Friday, January 04, 2019 4:21 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

I strongly recommend option B, separate clusters. Reasons:
 - Node-to-node networking is negligible compared to networking within the node.
 - Different scaling considerations.
   Your workload may require 10 Spark nodes and 20 database nodes, so why
   bundle them? This ratio may also change over time as your application
   evolves and the amount of data changes.
 - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't want it
   to affect Cassandra, and vice versa.
   If you isolate it with cgroups, you may have too much idle time when the
   above doesn't happen.


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
<goutham.chiru...@gmail.com> wrote:
Hi,
We have a requirement for heavy data lifting and analytics, and have decided
to go with Apache Spark. In the process we have come up with two patterns:
a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
b. Apache Spark on one independent cluster and Apache Cassandra as one 
independent cluster.

We need a good pattern for using the analytics engine with Cassandra. Thanks in
advance.

Regards
Goutham.



--
Regards
Goutham Reddy





Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Dor Laor
On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
wrote:

> I think you could consider option C: Create a (new) analytics DC in
> Cassandra and run your spark nodes there. Then you can address the scaling
> just on that DC. You can also use less vnodes, only replicate certain
> keyspaces, etc. in order to perform the analytics more efficiently.
>

But this way you duplicate the entire dataset RF times over, which is very
expensive. It is common practice to run Spark against a separate Cassandra
(virtual) datacenter, but that is done to isolate the analytics workload from
the realtime workload and preserve low-latency guarantees.
We addressed this problem elsewhere, beyond this scope.


>
>
>
> Sean Durity
>
>
>
> *From:* Dor Laor 
> *Sent:* Friday, January 04, 2019 4:21 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Good way of configuring Apache spark with
> Apache Cassandra
>
>
>
> I strongly recommend option B, separate clusters. Reasons:
>
>  - Networking of node-node is negligible compared to networking within the
> node
>
>  - Different scaling considerations
>
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>
>This ratio may also change over time as your application evolves and
> amount of data changes.
>
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
>
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
> Hi,
>
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
>
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
>
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
>
>
> Regards
>
> Goutham.
>
>


Re: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Goutham reddy
Thanks Sean. But what if I want to have both Spark and Elasticsearch with
Cassandra as a separate data center? Does that cause any overhead?

On Wed, Jan 9, 2019 at 7:28 AM Durity, Sean R 
wrote:

> I think you could consider option C: Create a (new) analytics DC in
> Cassandra and run your spark nodes there. Then you can address the scaling
> just on that DC. You can also use less vnodes, only replicate certain
> keyspaces, etc. in order to perform the analytics more efficiently.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Dor Laor 
> *Sent:* Friday, January 04, 2019 4:21 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Good way of configuring Apache spark with
> Apache Cassandra
>
>
>
> I strongly recommend option B, separate clusters. Reasons:
>
>  - Networking of node-node is negligible compared to networking within the
> node
>
>  - Different scaling considerations
>
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>
>This ratio may also change over time as your application evolves and
> amount of data changes.
>
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
>
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
> Hi,
>
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
>
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
>
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
>
>
> Regards
>
> Goutham.
>
>
-- 
Regards
Goutham Reddy


RE: [EXTERNAL] Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-09 Thread Durity, Sean R
I think you could consider option C: create a (new) analytics DC in Cassandra
and run your Spark nodes there. Then you can address the scaling just on that
DC. You can also use fewer vnodes, replicate only certain keyspaces, etc. in
order to perform the analytics more efficiently.


Sean Durity

From: Dor Laor 
Sent: Friday, January 04, 2019 4:21 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Good way of configuring Apache spark with Apache 
Cassandra

I strongly recommend option B, separate clusters. Reasons:
 - Node-to-node networking is negligible compared to networking within the node.
 - Different scaling considerations.
   Your workload may require 10 Spark nodes and 20 database nodes, so why
   bundle them? This ratio may also change over time as your application
   evolves and the amount of data changes.
 - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't want it
   to affect Cassandra, and vice versa.
   If you isolate it with cgroups, you may have too much idle time when the
   above doesn't happen.


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
<goutham.chiru...@gmail.com> wrote:
Hi,
We have a requirement for heavy data lifting and analytics, and have decided
to go with Apache Spark. In the process we have come up with two patterns:
a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
b. Apache Spark on one independent cluster and Apache Cassandra as one 
independent cluster.

We need a good pattern for using the analytics engine with Cassandra. Thanks in
advance.

Regards
Goutham.





Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Thanks Jonathan. I believe we have to reconsider the way our analytics is
performed.

On Fri, Jan 4, 2019 at 1:46 PM Jonathan Haddad  wrote:

> If you absolutely have to use Cassandra as the source of your data, I
> agree with Dor.
>
> That being said, if you're going to be doing a lot of analytics, I
> recommend using something other than Cassandra with Spark.  The performance
> isn't particularly wonderful and you'll likely get anywhere from 10-50x
> improvement from putting the data in an analytics friendly format (parquet)
> and on a block / blob store (DFS or S3) instead.
>
> On Fri, Jan 4, 2019 at 1:43 PM Goutham reddy 
> wrote:
>
>> Thank you very much Dor for the detailed information, yes that should be
>> the primary reason why we have to isolate from Cassandra.
>>
>> Thanks and Regards,
>> Goutham Reddy
>>
>>
>> On Fri, Jan 4, 2019 at 1:29 PM Dor Laor  wrote:
>>
>>> I strongly recommend option B, separate clusters. Reasons:
>>>  - Networking of node-node is negligible compared to networking within
>>> the node
>>>  - Different scaling considerations
>>>Your workload may require 10 Spark nodes and 20 database nodes, so
>>> why bundle them?
>>>This ratio may also change over time as your application evolves and
>>> amount of data changes.
>>>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
>>> want it to affect Cassandra and the opposite.
>>>If you isolate it with cgroups, you may have too much idle time when
>>> the above doesn't happen.
>>>
>>>
>>> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy <
>>> goutham.chiru...@gmail.com> wrote:
>>>
 Hi,
 We have requirement of heavy data lifting and analytics requirement and
 decided to go with Apache Spark. In the process we have come up with two
 patterns
 a. Apache Spark and Apache Cassandra co-located and shared on same
 nodes.
 b. Apache Spark on one independent cluster and Apache Cassandra as one
 independent cluster.

 Need good pattern how to use the analytic engine for Cassandra. Thanks
 in advance.

 Regards
 Goutham.

>>>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>
-- 
Regards
Goutham Reddy


Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Jonathan Haddad
If you absolutely have to use Cassandra as the source of your data, I agree
with Dor.

That being said, if you're going to be doing a lot of analytics, I recommend
using something other than Cassandra with Spark. The performance isn't
particularly wonderful, and you'll likely get anywhere from a 10-50x
improvement from putting the data in an analytics-friendly format (Parquet)
on a block / blob store (DFS or S3) instead.
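Jonathan's suggestion amounts to a periodic export job. A minimal PySpark sketch, assuming a Spark environment with the spark-cassandra-connector on the classpath (keyspace, table, and bucket names are hypothetical; not runnable outside a Spark deployment):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cassandra-to-parquet").getOrCreate()

# Read the Cassandra table through the connector...
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="orders_ks", table="orders")
      .load())

# ...and land it as Parquet on blob storage for downstream analytics:
df.write.mode("overwrite").parquet("s3a://analytics-bucket/orders/")
```

Subsequent analytics jobs then read the Parquet copy instead of hammering the Cassandra cluster.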

On Fri, Jan 4, 2019 at 1:43 PM Goutham reddy 
wrote:

> Thank you very much Dor for the detailed information, yes that should be
> the primary reason why we have to isolate from Cassandra.
>
> Thanks and Regards,
> Goutham Reddy
>
>
> On Fri, Jan 4, 2019 at 1:29 PM Dor Laor  wrote:
>
>> I strongly recommend option B, separate clusters. Reasons:
>>  - Networking of node-node is negligible compared to networking within
>> the node
>>  - Different scaling considerations
>>Your workload may require 10 Spark nodes and 20 database nodes, so why
>> bundle them?
>>This ratio may also change over time as your application evolves and
>> amount of data changes.
>>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
>> want it to affect Cassandra and the opposite.
>>If you isolate it with cgroups, you may have too much idle time when
>> the above doesn't happen.
>>
>>
>> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
>> wrote:
>>
>>> Hi,
>>> We have requirement of heavy data lifting and analytics requirement and
>>> decided to go with Apache Spark. In the process we have come up with two
>>> patterns
>>> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>>> b. Apache Spark on one independent cluster and Apache Cassandra as one
>>> independent cluster.
>>>
>>> Need good pattern how to use the analytic engine for Cassandra. Thanks
>>> in advance.
>>>
>>> Regards
>>> Goutham.
>>>
>>

-- 
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade


Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Thank you very much Dor for the detailed information, yes that should be
the primary reason why we have to isolate from Cassandra.

Thanks and Regards,
Goutham Reddy


On Fri, Jan 4, 2019 at 1:29 PM Dor Laor  wrote:

> I strongly recommend option B, separate clusters. Reasons:
>  - Networking of node-node is negligible compared to networking within the
> node
>  - Different scaling considerations
>Your workload may require 10 Spark nodes and 20 database nodes, so why
> bundle them?
>This ratio may also change over time as your application evolves and
> amount of data changes.
>  - Isolation - If Spark has a spike in cpu/IO utilization, you wouldn't
> want it to affect Cassandra and the opposite.
>If you isolate it with cgroups, you may have too much idle time when
> the above doesn't happen.
>
>
> On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
> wrote:
>
>> Hi,
>> We have requirement of heavy data lifting and analytics requirement and
>> decided to go with Apache Spark. In the process we have come up with two
>> patterns
>> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
>> b. Apache Spark on one independent cluster and Apache Cassandra as one
>> independent cluster.
>>
>> Need good pattern how to use the analytic engine for Cassandra. Thanks in
>> advance.
>>
>> Regards
>> Goutham.
>>
>


Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Dor Laor
I strongly recommend option B, separate clusters. Reasons:
 - Node-to-node networking is negligible compared to networking within the node.
 - Different scaling considerations.
   Your workload may require 10 Spark nodes and 20 database nodes, so why
   bundle them? This ratio may also change over time as your application
   evolves and the amount of data changes.
 - Isolation. If Spark has a spike in CPU/IO utilization, you wouldn't want it
   to affect Cassandra, and vice versa.
   If you isolate it with cgroups, you may have too much idle time when the
   above doesn't happen.


On Fri, Jan 4, 2019 at 12:47 PM Goutham reddy 
wrote:

> Hi,
> We have requirement of heavy data lifting and analytics requirement and
> decided to go with Apache Spark. In the process we have come up with two
> patterns
> a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
> b. Apache Spark on one independent cluster and Apache Cassandra as one
> independent cluster.
>
> Need good pattern how to use the analytic engine for Cassandra. Thanks in
> advance.
>
> Regards
> Goutham.
>


Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Hi,
We have a requirement for heavy data lifting and analytics, and have decided
to go with Apache Spark. In the process we have come up with two patterns:
a. Apache Spark and Apache Cassandra co-located and shared on same nodes.
b. Apache Spark on one independent cluster and Apache Cassandra as one
independent cluster.

We need a good pattern for using the analytics engine with Cassandra. Thanks in
advance.

Regards
Goutham.