Once again I am trying to read a directory tree of binary files with binaryFiles.

My directory tree has a root dir ROOTDIR and subdirs in which the files are located, i.e.:

ROOTDIR/1
ROOTDIR/2
ROOTDIR/...
ROOTDIR/100

A total of 1 million files split across 100 subdirs.

Calling binaryFiles on the whole tree requires too much memory on the driver. I've also tried building a binaryFiles RDD per subdir, unioning them with ++, and then calling rdd.saveAsObjectFile("outputDir"); that just shifts the problem, requiring a lot of memory on the executors instead.
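
For reference, this is roughly what I've been running (a sketch only; ROOTDIR, outputDir, and the 1-to-100 subdir names stand in for my real paths):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.input.PortableDataStream
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("readBinaryTree"))

// Attempt 1: one binaryFiles call over the whole tree. The driver has to
// list and plan splits for all ~1M files, which is where it runs out of memory.
val all: RDD[(String, PortableDataStream)] = sc.binaryFiles("ROOTDIR/*")

// Attempt 2: one binaryFiles RDD per subdir, unioned with ++, then written
// out as an object file. This moves the memory pressure onto the executors.
val perDir = (1 to 100).map(i => sc.binaryFiles(s"ROOTDIR/$i"))
val unioned = perDir.reduce(_ ++ _)
unioned.saveAsObjectFile("outputDir")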

What is the proper way to use binaryFiles with this number of files?

Thanks
