Thanks for your answer! Unfortunately I can't use Spark SQL for some reason.
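
For reference, this is roughly how I read the ORC data into an RDD before handing it to MLlib (a minimal sketch in Java; sc is the JavaSparkContext, directory is the HDFS input path, and the conversion of each OrcStruct row into a LabeledPoint is left out because it depends on my schema):

  import org.apache.hadoop.hive.ql.io.orc.OrcInputFormat;
  import org.apache.hadoop.hive.ql.io.orc.OrcStruct;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.spark.api.java.JavaPairRDD;

  // Read the ORC directory as a (NullWritable, OrcStruct) pair RDD.
  JavaPairRDD<NullWritable, OrcStruct> orcRows =
      sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class, OrcStruct.class);

  // Cache it before training; this is the RDD where only ~25% ends up in the cache.
  orcRows.cache();
  System.out.println("row count: " + orcRows.count());
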
If anyone has experience in using ORC as hadoopFile, I'd be happy to read some hints/thoughts about my issues.

Zsolt

2015-03-27 19:07 GMT+01:00 Xiangrui Meng <[email protected]>:

> This is a PR in review to support ORC via the SQL data source API:
> https://github.com/apache/spark/pull/3753. You can try pulling that PR
> and help test it. -Xiangrui
>
> On Wed, Mar 25, 2015 at 5:03 AM, Zsolt Tóth <[email protected]> wrote:
> > Hi,
> >
> > I use sc.hadoopFile(directory, OrcInputFormat.class, NullWritable.class,
> > OrcStruct.class) to use data in ORC format as an RDD. I did some
> > benchmarking of ORC input vs. Text input for MLlib and ran into a few
> > issues with ORC.
> >
> > Setup: yarn-cluster mode, 11 executors, 4 cores, 9g executor memory, 2g
> > executor memoryOverhead, 1g driver memory. The cluster nodes have
> > sufficient resources for this setup.
> >
> > Logistic regression: When using 1GB ORC input (stored in 4 blocks on
> > HDFS), only one block (25%) is cached and only one executor is used,
> > even though the whole RDD can be cached even as a text file (that's
> > around 5.5GB). Is it possible to make Spark use the available resources?
> >
> > Decision tree: Using 8GB ORC input, the job fails every time with the
> > "Size exceeds Integer.MAX_VALUE" error. I also see errors from the JVM
> > in the logs saying that the "container is running beyond physical memory
> > limits". Is it possible to avoid this when using the ORC input format?
> > I tried to set min.split.size/max.split.size or dfs.blocksize, but that
> > didn't help.
> >
> > Again, none of these happen with Text input.
> >
> > Cheers,
> > Zsolt
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
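
P.S. In case the exact settings matter: this is roughly how I tried to force smaller input splits before re-running the decision tree job (just a sketch; I set both the old and the new Hadoop property names because I wasn't sure which one the ORC input format picks up, and the 256 MB value is only an example):

  // Sketch: request smaller input splits via the Hadoop configuration
  // before calling hadoopFile. Both the old (mapred.*) and the new
  // (mapreduce.*) property names are set; 256 MB is just an example value.
  long maxSplit = 256L * 1024 * 1024;
  sc.hadoopConfiguration().set("mapred.max.split.size", Long.toString(maxSplit));
  sc.hadoopConfiguration().set("mapreduce.input.fileinputformat.split.maxsize", Long.toString(maxSplit));

As mentioned above, this didn't change the behaviour for me.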
