So did you actually try to run your use case with Spark 2.0 and ORC files? It's hard to understand your 'apparently...'.
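For reference, a minimal way to check it yourself on Spark 2.0 would be something like the sketch below (assuming spark-hive is on the classpath; the database, table, and path names are made up for illustration):

```scala
// Sketch: querying a partitioned ORC Hive table with Spark 2.0.
// The table `mydb.events` and the HDFS path are hypothetical.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("orc-check")
  .enableHiveSupport()   // route SQL through the Hive metastore
  .getOrCreate()
import spark.implicits._

// Query a partitioned ORC table registered in the Hive metastore...
spark.sql("SELECT count(*) FROM mydb.events WHERE dt = '2016-07-01'").show()

// ...or read the ORC files directly from the partitioned directory layout.
val df = spark.read.orc("hdfs:///warehouse/mydb.db/events")
df.filter($"dt" === "2016-07-01").count()
```

Timing both paths against the same data would make the comparison with Hive-on-Tez concrete instead of anecdotal.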
Best,
Ovidiu

> On 26 Jul 2016, at 13:10, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
>
> If you have ever tried to use ORC via SPARK you will know that SPARK's
> promise of accessing ORC files is a sham. SPARK cannot access partitioned
> tables via HIVEcontext which are ORC, SPARK cannot stripe through ORC faster
> and what more, if you are using SQL and have thought of using HIVE with ORC
> on TEZ, then it runs way better, faster and leaner than SPARK.
>
> I can process almost a few billion records close to a terabyte in a cluster
> with around 100GB RAM and 40 cores in a few hours, and find it a challenge
> doing the same with SPARK.
>
> But apparently, everything is resolved in SPARK 2.0.
>
> Regards,
> Gourav Sengupta
>
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> One additional point specific to Spark 2.0 - for the alpha Structured
> Streaming API (only), the file sink only supports Parquet format (I'm sure
> that limitation will be lifted in a future release before Structured
> Streaming is GA):
> "File sink - Stores the output to a directory. As of Spark 2.0, this
> only supports Parquet file format, and Append output mode."
>
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
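For what it's worth, the Parquet-only file sink quoted above looks roughly like this in the 2.0 Structured Streaming API (a sketch; the input schema and paths are made up):

```scala
// Sketch: Structured Streaming file sink in Spark 2.0.
// Input directory, schema, and output paths are hypothetical.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("ss-file-sink").getOrCreate()

// Read a stream of CSV files as they land in a directory (schema required).
val schema = new StructType().add("id", LongType).add("value", StringType)
val input = spark.readStream.schema(schema).csv("/data/incoming")

// As of Spark 2.0 the file sink supports only Parquet and Append mode.
val query = input.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoints")
  .outputMode("append")
  .start()

query.awaitTermination()
```

Trying any other format string or output mode with the file sink in 2.0 should fail at `start()`, which matches the documentation excerpt Ofir quoted.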