So did you actually try to run your use case with Spark 2.0 and ORC files?
It's hard to tell what your 'apparently...' is based on.

Best,
Ovidiu
> On 26 Jul 2016, at 13:10, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> 
> If you have ever tried to use ORC via Spark you will know that Spark's 
> promise of accessing ORC files is a sham. Spark cannot access partitioned 
> ORC tables via HiveContext, it does not stripe through ORC any faster, and 
> what's more, if you are using SQL and have thought of using Hive with ORC 
> on Tez, that combination runs way better, faster and leaner than Spark. 
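> 
> To be concrete, this is the access pattern I mean; a minimal sketch in 
> Spark 2.0 terms (where SparkSession with Hive support replaces 
> HiveContext), assuming a metastore table named events partitioned by dt 
> and stored as ORC - the table, column and paths are made up:
> 
>     import org.apache.spark.sql.SparkSession
> 
>     // Spark 2.0: SparkSession with enableHiveSupport() replaces HiveContext.
>     val spark = SparkSession.builder()
>       .appName("OrcPartitionedRead")
>       .enableHiveSupport()
>       .getOrCreate()
> 
>     // Query the table through the metastore; the filter on the partition
>     // column dt should prune to a single partition instead of a full scan.
>     val df = spark.sql("SELECT * FROM events WHERE dt = '2016-07-01'")
>     df.show()
> 
>     // Or read the ORC files directly by path, bypassing the metastore.
>     val raw = spark.read.orc("/warehouse/events/dt=2016-07-01")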
> 
> I can process a few billion records, close to a terabyte, in a few hours 
> on a cluster with around 100 GB of RAM and 40 cores, and find it a 
> challenge doing the same with Spark. 
> 
> But apparently, everything is resolved in Spark 2.0.
> 
> 
> Regards,
> Gourav Sengupta
> 
> On Tue, Jul 26, 2016 at 11:50 AM, Ofir Manor <ofir.ma...@equalum.io> wrote:
> One additional point specific to Spark 2.0: for the alpha Structured 
> Streaming API (only), the file sink supports only the Parquet format (I'm 
> sure that limitation will be lifted in a future release before Structured 
> Streaming is GA):
>      "File sink - Stores the output to a directory. As of Spark 2.0, this 
> only supports Parquet file format, and Append output mode."
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc5-docs/structured-streaming-programming-guide.html#where-to-go-from-here
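> 
> To illustrate, a minimal sketch against the 2.0 API, using a socket 
> source as a stand-in input (the host, port and paths below are made up):
> 
>     import org.apache.spark.sql.SparkSession
> 
>     val spark = SparkSession.builder()
>       .appName("ParquetFileSink")
>       .getOrCreate()
> 
>     // Toy streaming source: text lines from a local socket.
>     val lines = spark.readStream
>       .format("socket")
>       .option("host", "localhost")
>       .option("port", 9999)
>       .load()
> 
>     // As of Spark 2.0 the file sink accepts only Parquet and Append mode;
>     // it also requires a checkpoint location for recovery.
>     val query = lines.writeStream
>       .format("parquet")
>       .option("path", "/tmp/streaming-out")
>       .option("checkpointLocation", "/tmp/streaming-checkpoint")
>       .outputMode("append")
>       .start()
> 
>     query.awaitTermination()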
> 
