Hi,

I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to read data in ORC format as an RDD. I ran some benchmarks comparing ORC input against Text input for MLlib and hit a few issues with ORC.

Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory, 2g executor memoryOverhead, 1g driver memory. The cluster nodes have sufficient resources for this setup.
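For reference, the loading code looks roughly like this (a minimal sketch: the driver class, app name and the way the directory is passed in are just illustrative, and the imports assume the Hive ORC classes):

import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class OrcRddLoad {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("orc-vs-text-benchmark"));
        String directory = args[0]; // HDFS directory holding the ORC files

        // Read ORC with the old (mapred) InputFormat API; each record arrives
        // as a (NullWritable, OrcStruct) pair.
        JavaPairRDD<NullWritable, OrcStruct> orcRdd =
            sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);

        // In the real job the pairs are mapped to MLlib's input types and cached
        // before training; a count is enough to materialise the RDD here.
        System.out.println("records: " + orcRdd.count());
        sc.stop();
    }
}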
Logistic regression: with a 1 GB ORC input (stored in 4 blocks on HDFS), only one block (25%) gets cached and only one executor is used, even though the whole RDD fits in the cache even as a text file (that is around 5.5 GB). Is it possible to make Spark use the available resources?

Decision tree: with an 8 GB ORC input, the job fails every time with the "Size exceeds Integer.MAX_VALUE" error, and the container logs also report that the container "is running beyond physical memory limits". Is it possible to avoid this when using the ORC input format? I tried setting the min.split.size/max.split.size and dfs.blocksize properties (see the snippet below), but that didn't help. Again, none of this happens with Text input.
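For completeness, this is roughly how I set those properties before creating the RDD (continuing the snippet above); the exact property names and sizes shown are just what I tried, not something I am sure the ORC input format honours:

// Set split-size hints on the Hadoop configuration before the RDD is created.
org.apache.hadoop.conf.Configuration hadoopConf = sc.hadoopConfiguration();
hadoopConf.set("mapred.min.split.size", String.valueOf(64L * 1024 * 1024));   // 64 MB
hadoopConf.set("mapred.max.split.size", String.valueOf(256L * 1024 * 1024));  // 256 MB
hadoopConf.set("dfs.blocksize", String.valueOf(128L * 1024 * 1024));          // 128 MB

JavaPairRDD<NullWritable, OrcStruct> orcRdd =
    sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);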
Cheers,
Zsolt