Re: How does spark work?

2017-09-12 Thread Jules Damji

Alternatively, watch these Spark Summit talks on memory management to get insight
from a developer's perspective.

https://spark-summit.org/2016/events/deep-dive-apache-spark-memory-management/

https://spark-summit.org/2017/events/a-developers-view-into-sparks-memory-model/
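
For quick reference, those talks walk through the unified memory model, which is
controlled by a handful of settings. Below is a minimal Scala sketch, assuming
Spark 2.x defaults; the app name and executor size are made up for illustration:

    import org.apache.spark.sql.SparkSession

    // Illustrative values only: 0.6 and 0.5 are the Spark 2.x defaults.
    val spark = SparkSession.builder()
      .appName("memory-model-demo")
      .config("spark.executor.memory", "16g")        // heap size per executor
      .config("spark.memory.fraction", "0.6")        // share of (heap - 300MB) for execution + storage
      .config("spark.memory.storageFraction", "0.5") // part of that pool protected for cached blocks
      .getOrCreate()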

Cheers 
Jules 

Sent from my iPhone
Pardon the dumb thumb typos :)

> On Sep 12, 2017, at 4:07 AM, Vikash Pareek  wrote:
> 
> Obviously, you can't store 900GB of data in 80GB of memory.
> There is a concept in Spark called disk spill: when your data grows beyond
> what fits in memory, the excess is spilled out to disk.
> 
> Also, Spark doesn't use the whole memory for storing data; some fraction of
> the memory is used for processing, shuffling, and internal data structures too.
> For more detail, you can have a look at
> https://0x0fff.com/spark-memory-management/
> 
> Hope this will help you.
> 


Re: How does spark work?

2017-09-12 Thread nguyen duc Tuan
In general, an RDD, which is the central concept of Spark, is just a
definition of how to get data and how to process it. Each partition of an RDD
defines how to get/process one partition of the data, and a series of
transformations transforms every partition of the previous RDD. Here is a very
simple example of what I mean: suppose you need to count the number of lines
in 100 files. Your RDD will consist of 100 partitions, where each partition
corresponds to one file path. Say you only have 3 executors / 3 cores to do
that. The executors will first pick up 3 tasks, do the count, send the results
to the driver, then execute the next tasks, and so on until all tasks are
done. The driver then just does a simple reduce operation to get the final
number of lines. Note that a partition only defines how to get the data, so
you don't have to send data to each executor. If some tasks fail, you just
need to run them again.
If you want to understand more internal details, I recommend this:
https://github.com/JerryLead/SparkInternals.
Hope this will help you.
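
To make the counting example above concrete, here is a hedged Scala sketch.
The file paths are hypothetical, and it reads local files with scala.io.Source
for simplicity; HDFS paths would need the Hadoop FileSystem API instead:

    import scala.io.Source
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("count-lines-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical list of 100 file paths, one partition per path.
    val paths = (1 to 100).map(i => f"/data/part-$i%03d.txt")

    // A partition only describes WHICH file to read; no data is shipped to
    // the executors up front.
    val lineCounts = sc.parallelize(paths, numSlices = paths.size)
      .map { path =>
        val src = Source.fromFile(path)
        try src.getLines().size finally src.close()
      }

    // With 3 cores, 3 of the 100 tasks run at a time; the driver just sums
    // the per-file counts as they come back.
    val total = lineCounts.reduce(_ + _)
    println(s"total lines: $total")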

2017-09-12 18:07 GMT+07:00 Vikash Pareek :

> Obviously, you can't store 900GB of data in 80GB of memory.
> There is a concept in Spark called disk spill: when your data grows beyond
> what fits in memory, the excess is spilled out to disk.
>
> Also, Spark doesn't use the whole memory for storing data; some fraction of
> the memory is used for processing, shuffling, and internal data structures too.
> For more detail, you can have a look at
> https://0x0fff.com/spark-memory-management/
>
> Hope this will help you.
>
>


Re: How does spark work?

2017-09-12 Thread Vikash Pareek
Obviously, you can't store 900GB of data in 80GB of memory.
There is a concept in Spark called disk spill: when your data grows beyond
what fits in memory, the excess is spilled out to disk.

Also, Spark doesn't use the whole memory for storing data; some fraction of
the memory is used for processing, shuffling, and internal data structures too.
For more detail, you can have a look at
https://0x0fff.com/spark-memory-management/

Hope this will help you.
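
A related, concrete handle on this from the RDD API is the MEMORY_AND_DISK
storage level: partitions that don't fit in the storage memory pool go to
local disk instead of failing the job. A minimal sketch, assuming a made-up
HDFS path:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    val spark = SparkSession.builder().appName("spill-demo").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical 900GB dataset on HDFS.
    val lines = sc.textFile("hdfs:///big/dataset")

    // Keep what fits in the storage memory pool, spill the rest to local disk.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    // The job still streams through all 900GB; only the cached fraction stays in RAM.
    println(lines.count())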






-----
__Vikash Pareek



How does spark work?

2017-09-11 Thread 陈卓
Hi
I'm a newbie.
In my Spark cluster there are 5 machines, each with 16GB of memory, but my
data may be more than 900GB. The source may be HDFS or MongoDB. I want to know
how Spark can put this 900GB of data into the cluster's memory when I only
have a total memory space of 80GB. How does Spark work?




Thanks!
Jetty