Re: Understanding and optimizing Spark disk usage during a job.

2014-11-29 Thread Vikas Agarwal
I may not be correct (in fact I may be completely off), but here is my guess: assuming 8 bytes per double, 4000 vectors of dimension 400 for each of 12k images would amount to 153.6 GB (12k * 4000 * 400 * 8 bytes), which may well account for the amount of data written to disk. Without compression, that much output seems plausible.
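
For reference, a quick back-of-the-envelope check of that figure, using only the assumptions stated above (12k images, 4000 descriptors per image, 400 doubles per descriptor, 8 bytes per double, no compression):

    // Rough size estimate (Scala), using the numbers from the message above.
    val images         = 12000L
    val vectorsPerImg  = 4000L
    val dim            = 400L
    val bytesPerDouble = 8L
    val totalBytes = images * vectorsPerImg * dim * bytesPerDouble
    println(f"${totalBytes / 1e9}%.1f GB")   // prints 153.6 GB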

Understanding and optimizing Spark disk usage during a job.

2014-11-28 Thread Jaonary Rabarisoa
Dear all, I have a job that crashes before it finishes because there is no space left on the device, and I noticed that this job generates a lot of temporary data on my disk. To be precise, the job is a simple map job that takes a set of images, extracts local features, and saves these local features as a sequence file.
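
To make the setup concrete, here is a minimal sketch of the kind of job described above; the feature extractor, the input listing, and both paths are placeholders of mine, not the actual pipeline:

    import org.apache.spark.{SparkConf, SparkContext}

    object FeatureExtractionSketch {
      // Stand-in for the real local-feature extractor
      // (on the order of thousands of descriptors of ~400 doubles per image).
      def extractLocalFeatures(imagePath: String): Array[Array[Double]] =
        Array.empty

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("feature-extraction"))

        // Assumed input: a text file listing one image path per line.
        val imagePaths = sc.textFile("hdfs:///path/to/image_list.txt")

        val features = imagePaths.map(p => (p, extractLocalFeatures(p)))

        // saveAsObjectFile writes a Hadoop SequenceFile of serialized objects,
        // so every descriptor of every image is materialized on disk here.
        features.saveAsObjectFile("hdfs:///path/to/features")

        sc.stop()
      }
    }

If the output really is in the hundreds of gigabytes, two things worth checking are where spark.local.dir points (that is where temporary and shuffle files land during the job) and whether the output can be compressed; saveAsSequenceFile accepts an optional compression codec argument.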