Once again I am trying to read a directory tree using binaryFiles.
My directory tree has a root dir ROOTDIR and subdirs where the files are
located, i.e.
ROOTDIR/1
ROOTDIR/2
ROOTDIR/..
ROOTDIR/100
A total of 1 million files split across 100 subdirs.
Calling binaryFiles on the whole tree requires too much memory on the
driver. I've also tried building one binaryFiles RDD per subdir,
unioning them with ++, and then calling
rdd.saveAsObjectFile("outputDir"). That requires a lot of memory in the
executors instead!
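Roughly, the second attempt looks like this (a sketch only; the SparkContext setup and the exact paths ROOTDIR/outputDir are placeholders, and the subdir loop is my assumption of how the per-subdir RDDs were built):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumed setup -- an existing SparkContext named sc.
val sc = new SparkContext(new SparkConf().setAppName("read-binary-tree"))

// Attempt 1: one binaryFiles call over the whole tree.
// This is the call that needed too much driver memory:
// val all = sc.binaryFiles("ROOTDIR/*")

// Attempt 2: one binaryFiles RDD per subdir (ROOTDIR/1 .. ROOTDIR/100),
// unioned together with ++, then written out as an object file.
// This is the variant that needed a lot of executor memory instead.
val perDir  = (1 to 100).map(i => sc.binaryFiles(s"ROOTDIR/$i"))
val combined = perDir.reduce(_ ++ _)
combined.saveAsObjectFile("outputDir")
```

Each element of the resulting RDD is a (path, PortableDataStream) pair, so the file contents are not read eagerly, but the union of 100 RDDs over a million files still produces a very large number of partitions and tasks.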
What is the proper way to use binaryFiles with this number of files?
Thanks
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org