I just ran a test: even on a single node (local deployment), Spark can handle data much larger than the total memory.
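
(For reference, a single-node shell like this is typically launched with spark-shell and an explicit driver memory cap, e.g. the line below; that's just an illustration, not necessarily the exact options used for this test.)

$ spark-shell --master "local[2]" --driver-memory 1g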

My test VM (2g ram, 2 cores):

$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1992        1845          92          19          54          36
Swap:          1023         285         738


The data size:

$ du -h rate.csv
3.2G    rate.csv


Loading this file into Spark and running a calculation on it works without errors:

scala> val df = spark.read.format("csv").option("inferSchema", true).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]

scala> df.printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: double (nullable = true)
 |-- _c3: integer (nullable = true)
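
Note that option("inferSchema", true) makes Spark scan the whole 3.2G file once just to guess the column types. If you want to skip that extra pass, the schema printed above can be supplied explicitly. A rough sketch (column names and types copied from the inferred result above):

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("_c0", StringType),
  StructField("_c1", StringType),
  StructField("_c2", DoubleType),
  StructField("_c3", IntegerType)
))

// read with the explicit schema, so the file is only scanned when the query runs
val df2 = spark.read.format("csv").schema(schema).load("skydrive/rate.csv")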


scala> df.groupBy("_c1").agg(avg("_c2").alias("avg_rating")).orderBy(desc("avg_rating")).show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+----------+----------+
|       _c1|avg_rating|
+----------+----------+
|0001360000|       5.0|
|0001711474|       5.0|
|0001360779|       5.0|
|0001006657|       5.0|
|0001361155|       5.0|
|0001018043|       5.0|
|000136118X|       5.0|
|0000202010|       5.0|
|0001371037|       5.0|
|0000401048|       5.0|
|0001371045|       5.0|
|0001203010|       5.0|
|0001381245|       5.0|
|0001048236|       5.0|
|0001436163|       5.0|
|000104897X|       5.0|
|0001437879|       5.0|
|0001056107|       5.0|
|0001468685|       5.0|
|0001061240|       5.0|
+----------+----------+
only showing top 20 rows
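
show only brings 20 rows back to the driver, which is part of why this runs comfortably in 2G of RAM; Spark spills shuffle data to disk as needed for the grouping itself. If the full aggregated result were needed, writing it out keeps it off the driver too. A minimal sketch (the output path is just an example):

import org.apache.spark.sql.functions.{avg, desc}

df.groupBy("_c1")
  .agg(avg("_c2").alias("avg_rating"))
  .orderBy(desc("avg_rating"))
  .write.mode("overwrite")
  .parquet("skydrive/avg_rating_by_item")   // example output path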


So as you can see, Spark handles a file larger than its memory just fine. :)

Thanks


rajat kumar wrote:
With autoscaling we can have any number of executors.
