Thanks for your answer! Unfortunately I can't use Spark SQL for some reason.
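
For reference, this is roughly how I read the ORC data into an RDD before handing it to MLlib (a minimal sketch in Java; sc is the JavaSparkContext, directory is the HDFS input path, and the conversion of each OrcStruct row into a LabeledPoint is left out because it depends on my schema):

  import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
  import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.spark.api.java.JavaPairRDD;

  // Read the ORC directory as a (NullWritable, OrcStruct) pair RDD.
  JavaPairRDD<NullWritable, OrcStruct> orcRows =
      sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);

  // Cache it before training; this is the RDD where only ~25% ends up in the cache.
  orcRows.cache();
  System.out.println("row count: " + orcRows.count());
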
If anyone has experience in using ORC as hadoopFile, I'd be happy to read some hints/thoughts about my issues.

Zsolt

2015-03-27 19:07 GMT+01:00 Xiangrui Meng <[email protected]>:

> This is a PR in review to support ORC via the SQL data source API:
> https://github.com/apache/spark/pull/3753. You can try pulling that PR
> and help test it. -Xiangrui
>
> On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth <[email protected]> wrote:
> > Hi,
> >
> > I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class,
> > OrcStruct.class) to use data in ORC format as an RDD. I did some
> > benchmarking of ORC input vs. Text input for MLlib and ran into a few
> > issues with ORC.
> >
> > Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory, 2g
> > executor memoryOverhead, 1g driver memory. The cluster nodes have
> > sufficient resources for this setup.
> >
> > Logistic regression: When using 1GB ORC input (stored in 4 blocks on
> > HDFS), only one block (25%) is cached and only one executor is used,
> > even though the whole RDD can be cached even as a text file (that's
> > around 5.5GB). Is it possible to make Spark use the available resources?
> >
> > Decision tree: Using 8GB ORC input, the job fails every time with the
> > "Size exceeds Integer.MAX_VALUE" error. I also see errors from the JVM
> > in the logs saying that the "container is running beyond physical memory
> > limits". Is it possible to avoid this when using the ORC input format?
> > I tried to set min.split.size/max.split.size or dfs.blocksize, but that
> > didn't help.
> >
> > Again, none of these happen with Text input.
> >
> > Cheers,
> > Zsolt
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
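
P.S. In case the exact settings matter: this is roughly how I tried to force smaller input splits before re-running the decision tree job (just a sketch; I set both the old and the new Hadoop property names because I wasn't sure which one the ORC input format picks up, and the 256 MB value is only an example):

  // Sketch: request smaller input splits via the Hadoop configuration
  // before calling hadoopFile. Both the old (mapred.*) and the new
  // (mapreduce.*) property names are set; 256 MB is just an example value.
  long maxSplit = 256L * 1024 * 1024;
  sc.hadoopConfiguration().set("mapred.max.split.size", Long.toString(maxSplit));
  sc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", Long.toString(maxSplit));

As mentioned above, this didn't change the behaviour for me.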
