Hi,

   - Spark 1.4 on a single-node machine, running spark-shell.
   - Reading from a Parquet file with a bunch of text columns and a couple
   of amount columns in decimal(14,4). The on-disk size of the file is 376 MB
   and it has ~100 million rows.
   - val rdd1 = sqlContext.read.parquet(...)
   - rdd1.cache()
   - val group_by_df =
   rdd1.groupBy("a").agg(sum(rdd1("amount1")), sum(rdd1("amount2")))
   - group_by_df.cache()
   - group_by_df.count() // Triggers the action; results in 725 rows (the
   full sequence is sketched below this list)
   - Run top on the machine.
   - In the Spark UI, the Storage tab shows the base Parquet RDD size as
   2.3 GB (a multiple of the 376 MB on-disk size) and the size of
   group_by_df as 43.2 KB. This seems OK.
   - However, top shows the process's resident memory (RES) jumping from
   2 GB at start to 31 GB after the count. This seems excessive for a single
   group-by operation and will lead to trouble for repeated similar
   operations on the data ...
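
For reference, here is the sequence as it looks in spark-shell (Spark 1.4);
the input path is just a placeholder for the 376 MB file described above,
and the column names "a", "amount1", "amount2" are the real ones:

    import org.apache.spark.sql.functions.sum

    // Placeholder path for the Parquet file described above
    val rdd1 = sqlContext.read.parquet("/path/to/data.parquet")
    rdd1.cache()

    val group_by_df = rdd1
      .groupBy("a")
      .agg(sum(rdd1("amount1")), sum(rdd1("amount2")))
    group_by_df.cache()

    // First action: materializes both cached DataFrames and runs the
    // aggregation; returns 725 rows
    group_by_df.count()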

Any thoughts ?

Thanks,
