Hi,

I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class) to read data in ORC format as an RDD. I ran some benchmarks comparing ORC input against Text input for MLlib and hit a few issues with ORC.

Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory, 2g executor memoryOverhead, 1g driver memory. The cluster nodes have sufficient resources for this setup.
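For reference, the loading code looks roughly like this (a minimal sketch: the driver class, app name and the way the directory is passed in are just illustrative, and the imports assume the Hive ORC classes):

import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
import org.apache.hadoop.io.NullWritable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class OrcRddLoad {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("orc-vs-text-benchmark"));
        String directory = args[0]; // HDFS directory holding the ORC files

        // Read ORC with the old (mapred) InputFormat API; each record arrives
        // as a (NullWritable, OrcStruct) pair.
        JavaPairRDD<NullWritable, OrcStruct> orcRdd =
            sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);

        // In the real job the pairs are mapped to MLlib's input types and cached
        // before training; a count is enough to materialise the RDD here.
        System.out.println("records: " + orcRdd.count());
        sc.stop();
    }
}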
Logistic regression: with a 1 GB ORC input (stored in 4 blocks on HDFS), only one block (25%) gets cached and only one executor is used, even though the whole RDD fits in the cache even as a text file (that is around 5.5 GB). Is it possible to make Spark use the available resources?

Decision tree: with an 8 GB ORC input, the job fails every time with the "Size exceeds Integer.MAX_VALUE" error, and the container logs also report that the container "is running beyond physical memory limits". Is it possible to avoid this when using the ORC input format? I tried setting the min.split.size/max.split.size and dfs.blocksize properties (see the snippet below), but that didn't help. Again, none of this happens with Text input.
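For completeness, this is roughly how I set those properties before creating the RDD (continuing the snippet above); the exact property names and sizes shown are just what I tried, not something I am sure the ORC input format honours:

// Set split-size hints on the Hadoop configuration before the RDD is created.
org.apache.hadoop.conf.Configuration hadoopConf = sc.hadoopConfiguration();
hadoopConf.set("mapred.min.split.size", String.valueOf(64L * 1024 * 1024));   // 64 MB
hadoopConf.set("mapred.max.split.size", String.valueOf(256L * 1024 * 1024));  // 256 MB
hadoopConf.set("dfs.blocksize", String.valueOf(128L * 1024 * 1024));          // 128 MB

JavaPairRDD<NullWritable, OrcStruct> orcRdd =
    sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);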
Cheers,
Zsolt