Spark shuffle: FileNotFound exception

2016-12-03 Thread Swapnil Shinde
Hello All,
I am facing a FileNotFoundException for a shuffle index file when running a
job with large data. The same job runs fine with smaller datasets. These are my
cluster specifications:

No of nodes - 19
Total cores - 380
Memory per executor - 32G
Spark 1.6 mapr version
spark.shuffle.service.enabled - false

 I am running the job with 28G memory, 50 executors and 1 core per
executor. The job is failing at a stage with a DataFrame explode, where each row
gets multiplied into 6 rows. Here are the exception details:

Caused by: java.lang.RuntimeException: java.io.FileNotFoundException:
/tmp/hadoop-mapr/nm-local-dir/usercache/sshinde/appcache/application_1480622725467_0071/blockmgr-3b2051f5-81c8-40a5-a332-9d32b4586a5d/38/shuffle_14_229_0.index
(No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at org.apache.spark.shuffle.IndexShuffleBlockResolver.getBlockData(IndexShuffleBlockResolver.scala:191)
at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:291)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:58)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

I tried the configurations below, but nothing worked out:
conf.set("spark.io.compression.codec", "lz4")
conf.set("spark.network.timeout", "1000s")
conf.set("spark.sql.shuffle.partitions", "2500")
  spark.yarn.executor.memoryOverhead should already be high given 32g of
executor memory (10% of 32g).
  Increased the number of partitions up to 15000.
  I checked the YARN logs briefly, and nothing stands out apart from the above
exception.
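
Putting those together, a minimal sketch of how the settings above could be
applied on a SparkConf (Spark 1.6 style); the app name and the 3200 MB
memoryOverhead value are illustrative only, the latter being the roughly
10%-of-32g figure mentioned above rather than a value from the original job:

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative consolidation of the settings discussed above; values are examples only.
val conf = new SparkConf()
  .setAppName("large-data-job")                      // hypothetical app name
  .set("spark.io.compression.codec", "lz4")
  .set("spark.network.timeout", "1000s")
  .set("spark.sql.shuffle.partitions", "2500")
  .set("spark.yarn.executor.memoryOverhead", "3200") // MB, ~10% of the 32g executor memory

val sc = new SparkContext(conf)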


Please let me know if there is something I am missing, or alternatives to make
large data jobs run. Thank you.

Thanks
Swapnil


Re: Unsubscribe

2016-12-03 Thread kote rao
unsubscribe


From: S Malligarjunan 
Sent: Saturday, December 3, 2016 11:55:41 AM
To: user@spark.apache.org
Subject: Re: Unsubscribe

Unsubscribe

Thanks and Regards,
Malligarjunan S.



On Saturday, 3 December 2016, 20:42, Sivakumar S  
wrote:


Unsubscribe




Unsubscribe

2016-12-03 Thread S Malligarjunan
Unsubscribe

Thanks and Regards,
Malligarjunan S.


Re: Unsubscribe

2016-12-03 Thread S Malligarjunan
Unsubscribe

Thanks and Regards,
Malligarjunan S.

On Saturday, 3 December 2016, 20:42, Sivakumar S  wrote:

Unsubscribe

Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Ephemeral storage on SSD will be very painful to maintain, especially with
large datasets; we will pretty soon have somewhere in the PB range.

I am thinking of leveraging something like the below, but I am not sure how
much performance gain we could get out of it.

https://github.com/stec-inc/EnhanceIO

On Sat, Dec 3, 2016 at 8:28 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> What about ephemeral storage on ssd ? If performance is required it's
> generally for production so the cluster would never be stopped. Then a
> spark job to backup/restore on S3 allows to shut down completely the cluster
>
> Le 3 déc. 2016 1:28 PM, "David Mitchell"  a
> écrit :
>
>> To get a node local read from Spark to Cassandra, one has to use a read
>> consistency level of LOCAL_ONE.  For some use cases, this is not an
>> option.  For example, if you need to use a read consistency level
>> of LOCAL_QUORUM, as many use cases demand, then one is not going to get a
>> node local read.
>>
>> Also, to insure a node local read, one has to set spark.locality.wait to
>> zero.  Whether or not a partition will be streamed to another node or
>> computed locally is dependent on the spark.locality.wait parameters. This
>> parameter can be set to 0 to force all partitions to only be computed on
>> local nodes.
>>
>> If you do some testing, please post your performance numbers.
>>
>>
>>


Design patterns for Spark implementation

2016-12-03 Thread Vasu Gourabathina
Hi,

I know this is a broad question. If this is not the right forum, I'd appreciate
it if you could point me to other sites/areas that may be helpful.

Before posing this question, I did use our friend Google, but sifting through
the query results from the angle of my needs hasn't been easy.

Who I am:
   - Have done data processing and analytics, but relatively new to the Spark
world

What I am looking for:
  - Architecture/design of an ML system using Spark
  - In particular, looking for best practices that can support/bridge both
the Engineering and Data Science teams

Engineering:
   - Build a system that meets typical engineering needs: data processing,
scalability, reliability, availability, fault-tolerance, etc.
   - System monitoring, etc.
Data Science:
   - Build a system for the Data Science team to do data exploration activities
   - Develop models using supervised learning, and tweak those models

Data:
  - Batch and incremental updates - mostly structured or semi-structured
(some data from transaction systems, weblogs, click streams etc.)
  - Streaming in the near term, but not to begin with

Data Storage:
  - Data is expected to grow on a daily basis, so the system should be able
to support and handle big data
  - Maybe, after further analysis, there might be a possibility/need to
archive some of the data; it all depends on how the ML models are built
and how the results are stored/used in the future

Data Analysis:
  - Obvious data-related aspects, such as data cleansing, data
transformation, data partitioning, etc.
  - Maybe run models on windows of data, for example: the last 1 year, 2 years,
etc.

ML models:
  - Ability to store model versions and previous results
  - Compare results of different variants of models

Consumers:
  - RESTful webservice clients to look at the results

So, the questions I have are:
1) Are there architectural and design patterns I can use, based on
industry best practices? In particular:
  - data ingestion
  - data storage (e.g., whether to go with HDFS or not)
  - data partitioning, especially in the Spark world
  - running parallel ML models and combining results, etc.
  - consumption of final results by clients (e.g., by pushing results
to Cassandra, NoSQL DBs, etc.)

Again, I know this is a broad question. Pointers to some best practices
in some of these areas, if not all, would be highly appreciated. I am open to
purchasing any books that may have relevant information.

Thanks much folks,
Vasu.


Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
What about ephemeral storage on SSD? If performance is required, it's
generally for production, so the cluster would never be stopped. Then a
Spark job to backup/restore on S3 allows the cluster to be shut down completely.

On 3 Dec 2016 1:28 PM, "David Mitchell"  wrote:

> To get a node local read from Spark to Cassandra, one has to use a read
> consistency level of LOCAL_ONE.  For some use cases, this is not an
> option.  For example, if you need to use a read consistency level
> of LOCAL_QUORUM, as many use cases demand, then one is not going to get a
> node local read.
>
> Also, to insure a node local read, one has to set spark.locality.wait to
> zero.  Whether or not a partition will be streamed to another node or
> computed locally is dependent on the spark.locality.wait parameters. This
> parameter can be set to 0 to force all partitions to only be computed on
> local nodes.
>
> If you do some testing, please post your performance numbers.
>
>
>


Unsubscribe

2016-12-03 Thread Sivakumar S
Unsubscribe


Re: What benefits do we really get out of colocation?

2016-12-03 Thread David Mitchell
To get a node-local read from Spark to Cassandra, one has to use a read
consistency level of LOCAL_ONE.  For some use cases, this is not an
option.  For example, if you need to use a read consistency level
of LOCAL_QUORUM, as many use cases demand, then you are not going to get a
node-local read.

Also, to ensure a node-local read, one has to set spark.locality.wait to
zero.  Whether a partition will be streamed to another node or
computed locally depends on the spark.locality.wait parameter. This
parameter can be set to 0 to force all partitions to be computed only on
local nodes.
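
As a rough sketch of what those two settings look like in practice (this
assumes the DataStax Spark Cassandra connector; the connector property name,
host, keyspace and table below are illustrative, not taken from this thread):

import org.apache.spark.{SparkConf, SparkContext}

// Apply the two settings discussed above (illustrative values).
val conf = new SparkConf()
  .setAppName("colocated-cassandra-read")
  .set("spark.locality.wait", "0")                             // set per the suggestion above
  .set("spark.cassandra.connection.host", "127.0.0.1")         // assumed: the colocated Cassandra node
  .set("spark.cassandra.input.consistency.level", "LOCAL_ONE") // assumed connector property for read consistency

val sc = new SparkContext(conf)
// e.g. with the connector on the classpath:
//   import com.datastax.spark.connector._
//   sc.cassandraTable("my_keyspace", "my_table").count()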

If you do some testing, please post your performance numbers.


Parquet timestamp storage in Hive and possible use case of spark instead of impala

2016-12-03 Thread Mich Talebzadeh
Guys,

This is my suggestion: use Spark SQL instead of Impala to read Hive tables, so
you get correct timestamp values all the time. The situation is explained below:


I have come across a situation where a multi-tenant cluster is being used
to read and write Parquet files.

This causes some issues. As I understand it, when Hive stores a timestamp in
Parquet format, it converts local time into UTC time, and when it reads the
data out, it converts it back to local time.

Impala, on the other hand, does not do any conversion when it reads the
timestamp column from a Parquet file, so the UTC time is returned instead
of local time.

So there are multiple issues:

  - Data read by Impala is not converted from UTC to local time
  - A flag can be set to make Impala convert, but at the cluster level only
  - One group is saying they don't want to do the conversion at the application
level

So it will cure certain problems but make other tenants less happy with the
conversion.

Now, my understanding is that this issue comes about because Impala bypasses
the Hive metadata and goes directly to the Parquet files.

There is an impact to the business.

My suggestion is that if they want performant reads, they should use Spark
SQL on Hive; it will always return the same values as stored by Hive.
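
For illustration, a minimal sketch of reading such a Hive table through Spark
SQL (Spark 1.6's HiveContext; the database, table and column names here are
hypothetical, not from the cluster being discussed):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

// Per the suggestion above: read via the Hive metastore so the timestamp
// values come back as Hive stored them.
val sc = new SparkContext(new SparkConf().setAppName("hive-parquet-timestamps"))
val hiveContext = new HiveContext(sc)

val df = hiveContext.sql("SELECT event_id, event_ts FROM mydb.events") // hypothetical table
df.show(10)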


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.


Re: What benefits do we really get out of colocation?

2016-12-03 Thread Steve Loughran

On 3 Dec 2016, at 09:16, Manish Malhotra 
> wrote:

thanks for sharing number as well !

Now a days even network can be with very high throughput, and might out perform 
the disk, but as Sean mentioned data on network will have other dependencies 
like network hops, like if its across rack, which can have switch in between.

But yes people are discussing and talking about Mesos + high performance 
network and not worried about the colocation for various use cases.

AWS emphmerial is not good for reliable storage file system, EBS is the 
expensive alternative :)


If you're working with HDFS, then on Linux HDFS can bypass the entire network
stack: after opening a block for an authenticated user, HDFS passes the open
file handle back to the caller for them to talk directly to the filesystem. You
can't get any faster than that.
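
A rough sketch of the client-side settings usually involved in that
short-circuit read path, expressed here through the Hadoop Configuration API
rather than hdfs-site.xml; the socket path is illustrative, and the datanodes
need the matching configuration as well:

import org.apache.hadoop.conf.Configuration

// Enable HDFS short-circuit local reads on the client (illustrative sketch).
val hdfsConf = new Configuration()
hdfsConf.setBoolean("dfs.client.read.shortcircuit", true)
hdfsConf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket") // example path, must match the datanode setting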

On AWS, well, your life is complex, as networking is now something you get to
pay for in your choice of VM and storage options; it is generally going to
offer lower performance than a physical cluster.

Me? I'd recommend using HDFS for transient storage and then S3 for persistent
storage of the final data.


On Sat, Dec 3, 2016 at 1:12 AM, kant kodali 
> wrote:
Thanks Sean! Just for the record I am currently seeing 95 MB/s RX (Receive 
throughput ) on my spark worker machine when I do `sudo iftop -B`

The problem with instance store on AWS is that they all are ephemeral so 
placing Cassandra on top doesn't make a lot of sense. so In short, AWS doesn't 
seem to be the right place for colocating in theory. I would still give you the 
benefit of doubt and colocate :) but just the numbers are not reflecting 
significant margins in terms of performance gains for AWS


On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen 
> wrote:
I'm sure he meant that this is downside to not colocating.
You are asking the right question. While networking is traditionally much 
slower than disk, that changes a bit in the cloud, where attached storage is 
remote too.
The disk throughput here is mostly achievable in normal workloads. However I 
think you'll find it's going to be much harder to get 1Gbps out of network 
transfers. That's just the speed of the local interface, and of course the 
transfer speed depends on hops across the network beyond that. Network latency 
is going to be higher than disk too, though that's not as much an issue in this 
context.

On Sat, Dec 3, 2016 at 8:42 AM kant kodali 
> wrote:
wait, how is that a benefit? isn't that a bad thing if you are saying 
colocating leads to more latency  and overall execution time is longer?

On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski 
> wrote:

You get more latency on reads so overall execution time is longer

Le 3 déc. 2016 7:39 AM, "kant kodali" 
> a écrit :

I wonder what benefits do I really I get If I colocate my spark worker process 
and Cassandra server process on each node?

I understand the concept of moving compute towards the data instead of moving 
data towards computation but It sounds more like one is trying to optimize for 
network latency.

Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per second) 
Network throughput.

and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

so In this case I don't see how colocation can help even if there is one to one 
mapping from spark worker node to a colocated Cassandra node where say we are 
doing a table scan of billion rows ?

Thanks!







Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
hmm GCE pretty much seems to follow the same model as AWS.

On Sat, Dec 3, 2016 at 1:22 AM, kant kodali  wrote:

> GCE seems to have better options. Any one had any experience with GCE?
>
> On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
> manish.malhotra.w...@gmail.com> wrote:
>
>> thanks for sharing number as well !
>>
>> Now a days even network can be with very high throughput, and might out
>> perform the disk, but as Sean mentioned data on network will have other
>> dependencies like network hops, like if its across rack, which can have
>> switch in between.
>>
>> But yes people are discussing and talking about Mesos + high performance
>> network and not worried about the colocation for various use cases.
>>
>> AWS emphmerial is not good for reliable storage file system, EBS is the
>> expensive alternative :)
>>
>> On Sat, Dec 3, 2016 at 1:12 AM, kant kodali  wrote:
>>
>>> Thanks Sean! Just for the record I am currently seeing 95 MB/s RX
>>> (Receive throughput ) on my spark worker machine when I do `sudo iftop -B`
>>>
>>> The problem with instance store on AWS is that they all are ephemeral so
>>> placing Cassandra on top doesn't make a lot of sense. so In short, AWS
>>> doesn't seem to be the right place for colocating in theory. I would still
>>> give you the benefit of doubt and colocate :) but just the numbers are not
>>> reflecting significant margins in terms of performance gains for AWS
>>>
>>>
>>> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen  wrote:
>>>
 I'm sure he meant that this is downside to not colocating.
 You are asking the right question. While networking is traditionally
 much slower than disk, that changes a bit in the cloud, where attached
 storage is remote too.
 The disk throughput here is mostly achievable in normal workloads.
 However I think you'll find it's going to be much harder to get 1Gbps out
 of network transfers. That's just the speed of the local interface, and of
 course the transfer speed depends on hops across the network beyond that.
 Network latency is going to be higher than disk too, though that's not as
 much an issue in this context.

 On Sat, Dec 3, 2016 at 8:42 AM kant kodali  wrote:

> wait, how is that a benefit? isn't that a bad thing if you are saying
> colocating leads to more latency  and overall execution time is longer?
>
> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
> You get more latency on reads so overall execution time is longer
>
> Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :
>
>
> I wonder what benefits do I really I get If I colocate my spark worker
> process and Cassandra server process on each node?
>
> I understand the concept of moving compute towards the data instead of
> moving data towards computation but It sounds more like one is trying to
> optimize for network latency.
>
> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
> second) Network throughput.
>
> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>
> so In this case I don't see how colocation can help even if there is
> one to one mapping from spark worker node to a colocated Cassandra node
> where say we are doing a table scan of billion rows ?
>
> Thanks!
>
>
>
>>>
>>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
GCE seems to have better options. Has anyone had any experience with GCE?

On Sat, Dec 3, 2016 at 1:16 AM, Manish Malhotra <
manish.malhotra.w...@gmail.com> wrote:

> thanks for sharing number as well !
>
> Now a days even network can be with very high throughput, and might out
> perform the disk, but as Sean mentioned data on network will have other
> dependencies like network hops, like if its across rack, which can have
> switch in between.
>
> But yes people are discussing and talking about Mesos + high performance
> network and not worried about the colocation for various use cases.
>
> AWS emphmerial is not good for reliable storage file system, EBS is the
> expensive alternative :)
>
> On Sat, Dec 3, 2016 at 1:12 AM, kant kodali  wrote:
>
>> Thanks Sean! Just for the record I am currently seeing 95 MB/s RX
>> (Receive throughput ) on my spark worker machine when I do `sudo iftop -B`
>>
>> The problem with instance store on AWS is that they all are ephemeral so
>> placing Cassandra on top doesn't make a lot of sense. so In short, AWS
>> doesn't seem to be the right place for colocating in theory. I would still
>> give you the benefit of doubt and colocate :) but just the numbers are not
>> reflecting significant margins in terms of performance gains for AWS
>>
>>
>> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen  wrote:
>>
>>> I'm sure he meant that this is downside to not colocating.
>>> You are asking the right question. While networking is traditionally
>>> much slower than disk, that changes a bit in the cloud, where attached
>>> storage is remote too.
>>> The disk throughput here is mostly achievable in normal workloads.
>>> However I think you'll find it's going to be much harder to get 1Gbps out
>>> of network transfers. That's just the speed of the local interface, and of
>>> course the transfer speed depends on hops across the network beyond that.
>>> Network latency is going to be higher than disk too, though that's not as
>>> much an issue in this context.
>>>
>>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali  wrote:
>>>
 wait, how is that a benefit? isn't that a bad thing if you are saying
 colocating leads to more latency  and overall execution time is longer?

 On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
 vincent.gromakow...@gmail.com> wrote:

 You get more latency on reads so overall execution time is longer

 Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :


 I wonder what benefits do I really I get If I colocate my spark worker
 process and Cassandra server process on each node?

 I understand the concept of moving compute towards the data instead of
 moving data towards computation but It sounds more like one is trying to
 optimize for network latency.

 Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
 second) Network throughput.

 and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)

 http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

 so In this case I don't see how colocation can help even if there is
 one to one mapping from spark worker node to a colocated Cassandra node
 where say we are doing a table scan of billion rows ?

 Thanks!



>>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread Manish Malhotra
Thanks for sharing the numbers as well!

Nowadays even the network can have very high throughput and might outperform
the disk, but as Sean mentioned, data on the network will have other
dependencies like network hops, e.g. if it's across racks, which can have a
switch in between.

But yes, people are discussing and talking about Mesos plus high-performance
networking, and are not worried about colocation for various use cases.

AWS ephemeral storage is not good for a reliable storage file system; EBS is
the expensive alternative :)

On Sat, Dec 3, 2016 at 1:12 AM, kant kodali  wrote:

> Thanks Sean! Just for the record I am currently seeing 95 MB/s RX (Receive
> throughput ) on my spark worker machine when I do `sudo iftop -B`
>
> The problem with instance store on AWS is that they all are ephemeral so
> placing Cassandra on top doesn't make a lot of sense. so In short, AWS
> doesn't seem to be the right place for colocating in theory. I would still
> give you the benefit of doubt and colocate :) but just the numbers are not
> reflecting significant margins in terms of performance gains for AWS
>
>
> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen  wrote:
>
>> I'm sure he meant that this is downside to not colocating.
>> You are asking the right question. While networking is traditionally much
>> slower than disk, that changes a bit in the cloud, where attached storage
>> is remote too.
>> The disk throughput here is mostly achievable in normal workloads.
>> However I think you'll find it's going to be much harder to get 1Gbps out
>> of network transfers. That's just the speed of the local interface, and of
>> course the transfer speed depends on hops across the network beyond that.
>> Network latency is going to be higher than disk too, though that's not as
>> much an issue in this context.
>>
>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali  wrote:
>>
>>> wait, how is that a benefit? isn't that a bad thing if you are saying
>>> colocating leads to more latency  and overall execution time is longer?
>>>
>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>> You get more latency on reads so overall execution time is longer
>>>
>>> Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :
>>>
>>>
>>> I wonder what benefits do I really I get If I colocate my spark worker
>>> process and Cassandra server process on each node?
>>>
>>> I understand the concept of moving compute towards the data instead of
>>> moving data towards computation but It sounds more like one is trying to
>>> optimize for network latency.
>>>
>>> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
>>> second) Network throughput.
>>>
>>> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>>>
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>>
>>> so In this case I don't see how colocation can help even if there is one
>>> to one mapping from spark worker node to a colocated Cassandra node where
>>> say we are doing a table scan of billion rows ?
>>>
>>> Thanks!
>>>
>>>
>>>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Forgot to mention that my entire cluster is in one DC, so if it were across
multiple DCs then colocating does make sense in theory as well.

On Sat, Dec 3, 2016 at 1:12 AM, kant kodali  wrote:

> Thanks Sean! Just for the record I am currently seeing 95 MB/s RX (Receive
> throughput ) on my spark worker machine when I do `sudo iftop -B`
>
> The problem with instance store on AWS is that they all are ephemeral so
> placing Cassandra on top doesn't make a lot of sense. so In short, AWS
> doesn't seem to be the right place for colocating in theory. I would still
> give you the benefit of doubt and colocate :) but just the numbers are not
> reflecting significant margins in terms of performance gains for AWS
>
>
> On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen  wrote:
>
>> I'm sure he meant that this is downside to not colocating.
>> You are asking the right question. While networking is traditionally much
>> slower than disk, that changes a bit in the cloud, where attached storage
>> is remote too.
>> The disk throughput here is mostly achievable in normal workloads.
>> However I think you'll find it's going to be much harder to get 1Gbps out
>> of network transfers. That's just the speed of the local interface, and of
>> course the transfer speed depends on hops across the network beyond that.
>> Network latency is going to be higher than disk too, though that's not as
>> much an issue in this context.
>>
>> On Sat, Dec 3, 2016 at 8:42 AM kant kodali  wrote:
>>
>>> wait, how is that a benefit? isn't that a bad thing if you are saying
>>> colocating leads to more latency  and overall execution time is longer?
>>>
>>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
>>> vincent.gromakow...@gmail.com> wrote:
>>>
>>> You get more latency on reads so overall execution time is longer
>>>
>>> Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :
>>>
>>>
>>> I wonder what benefits do I really I get If I colocate my spark worker
>>> process and Cassandra server process on each node?
>>>
>>> I understand the concept of moving compute towards the data instead of
>>> moving data towards computation but It sounds more like one is trying to
>>> optimize for network latency.
>>>
>>> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
>>> second) Network throughput.
>>>
>>> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>>>
>>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>>
>>> so In this case I don't see how colocation can help even if there is one
>>> to one mapping from spark worker node to a colocated Cassandra node where
>>> say we are doing a table scan of billion rows ?
>>>
>>> Thanks!
>>>
>>>
>>>
>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Thanks Sean! Just for the record, I am currently seeing 95 MB/s RX (receive
throughput) on my Spark worker machine when I do `sudo iftop -B`.

The problem with instance store on AWS is that it is all ephemeral, so
placing Cassandra on top doesn't make a lot of sense. So in short, AWS doesn't
seem to be the right place for colocating, in theory. I would still give you
the benefit of the doubt and colocate :) but the numbers just aren't reflecting
significant margins in terms of performance gains on AWS.


On Sat, Dec 3, 2016 at 12:56 AM, Sean Owen  wrote:

> I'm sure he meant that this is downside to not colocating.
> You are asking the right question. While networking is traditionally much
> slower than disk, that changes a bit in the cloud, where attached storage
> is remote too.
> The disk throughput here is mostly achievable in normal workloads. However
> I think you'll find it's going to be much harder to get 1Gbps out of
> network transfers. That's just the speed of the local interface, and of
> course the transfer speed depends on hops across the network beyond that.
> Network latency is going to be higher than disk too, though that's not as
> much an issue in this context.
>
> On Sat, Dec 3, 2016 at 8:42 AM kant kodali  wrote:
>
>> wait, how is that a benefit? isn't that a bad thing if you are saying
>> colocating leads to more latency  and overall execution time is longer?
>>
>> On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>> You get more latency on reads so overall execution time is longer
>>
>> Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :
>>
>>
>> I wonder what benefits do I really I get If I colocate my spark worker
>> process and Cassandra server process on each node?
>>
>> I understand the concept of moving compute towards the data instead of
>> moving data towards computation but It sounds more like one is trying to
>> optimize for network latency.
>>
>> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
>> second) Network throughput.
>>
>> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>>
>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>
>> so In this case I don't see how colocation can help even if there is one
>> to one mapping from spark worker node to a colocated Cassandra node where
>> say we are doing a table scan of billion rows ?
>>
>> Thanks!
>>
>>
>>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread kant kodali
Wait, how is that a benefit? Isn't that a bad thing, if you are saying
colocating leads to more latency and the overall execution time is longer?

On Sat, Dec 3, 2016 at 12:34 AM, vincent gromakowski <
vincent.gromakow...@gmail.com> wrote:

> You get more latency on reads so overall execution time is longer
>
> Le 3 déc. 2016 7:39 AM, "kant kodali"  a écrit :
>
>>
>> I wonder what benefits do I really I get If I colocate my spark worker
>> process and Cassandra server process on each node?
>>
>> I understand the concept of moving compute towards the data instead of
>> moving data towards computation but It sounds more like one is trying to
>> optimize for network latency.
>>
>> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
>> second) Network throughput.
>>
>> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>>
>> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>>
>> so In this case I don't see how colocation can help even if there is one
>> to one mapping from spark worker node to a colocated Cassandra node where
>> say we are doing a table scan of billion rows ?
>>
>> Thanks!
>>
>>


Re: What benefits do we really get out of colocation?

2016-12-03 Thread vincent gromakowski
You get more latency on reads so overall execution time is longer

On 3 Dec 2016 7:39 AM, "kant kodali"  wrote:

>
> I wonder what benefits do I really I get If I colocate my spark worker
> process and Cassandra server process on each node?
>
> I understand the concept of moving compute towards the data instead of
> moving data towards computation but It sounds more like one is trying to
> optimize for network latency.
>
> Majority of my nodes (m4.xlarge)  have 1Gbps = 125MB/s (Megabytes per
> second) Network throughput.
>
> and the DISK throughput for m4.xlarge is 93.75 MB/s (link below)
>
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
>
> so In this case I don't see how colocation can help even if there is one
> to one mapping from spark worker node to a colocated Cassandra node where
> say we are doing a table scan of billion rows ?
>
> Thanks!
>
>