I haven't gone through many of the details of the Spark Catalyst optimizer
and the Tungsten project, but we have been advised by Databricks support to
use DataFrames to resolve the OOM errors we are getting during join and
groupBy operations. We use Spark 1.3.1, and it looks like it cannot perform
an external sort and blows up with an OOM:
https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
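
For reference, the kind of change being suggested looks roughly like this --
a minimal sketch, assuming a simple sum aggregation over a pair RDD (the RDD
and column names are made up for illustration):

    import sqlContext.implicits._
    import org.apache.spark.sql.functions.sum

    // RDD version that was hitting OOM: groupByKey pulls every value for a
    // key into memory at once
    val totalsRdd = eventsRdd.groupByKey().mapValues(_.sum)  // eventsRdd: RDD[(String, Long)]

    // DataFrame version: the aggregation runs through Catalyst, which (with
    // the Tungsten work in 1.5) can spill to disk instead of blowing up
    val totalsDf = eventsRdd.toDF("key", "value")
      .groupBy("key")
      .agg(sum("value"))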

Now it's great that this has been addressed in the Spark 1.5 release, but why
is Databricks advocating a switch to DataFrames? It may make sense for batch
jobs or near-real-time jobs, but I'm not sure it does when you are developing
real-time analytics where you want to optimize every millisecond you can.
That said, I am still educating myself on the DataFrame APIs and
optimizations, and I will benchmark them against RDDs for our batch and
real-time use cases as well.
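
When I get to that benchmark I'll probably start with nothing fancier than a
wall-clock harness like this (rddA/rddB and dfA/dfB are placeholders for our
actual datasets, and the DataFrame join-on-column overload needs 1.4+):

    // Crude timing helper: runs the body once and prints elapsed wall time
    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
      result
    }

    time("RDD join")       { rddA.join(rddB).count() }
    time("DataFrame join") { dfA.join(dfB, "key").count() }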

On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> What do you think is preventing you from optimizing your own RDD-level
> transformations and actions?  AFAIK, nothing that has been added in
> Catalyst precludes you from doing that.  The fact of the matter is, though,
> that there is less type and semantic information available to Spark from
> the raw RDD API than from using Spark SQL, DataFrames or DataSets.  That
> means that Spark itself can't optimize for raw RDDs the same way that it
> can for higher-level constructs that can leverage Catalyst; but if you want
> to write your own optimizations based on your own knowledge of the data
> types and semantics that are hiding in your raw RDDs, there's no reason
> that you can't do that.
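>
> For example, knowing that your aggregation is associative lets you do the
> map-side combine yourself -- a minimal sketch (the RDD is illustrative):
>
>     // groupByKey ships every value for a key across the network:
>     val slow = pairs.groupByKey().mapValues(_.sum)  // pairs: RDD[(String, Long)]
>
>     // reduceByKey combines locally on each partition before shuffling --
>     // semantic knowledge Spark can't infer from an opaque RDD:
>     val fast = pairs.reduceByKey(_ + _)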
>
> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com>
> wrote:
>
>> Hi,
>>
>> Perhaps I should write a blog post about why Spark is focusing more on
>> making Spark jobs easier to write while hiding the underlying performance
>> optimization details from seasoned Spark users. It's one thing to provide
>> an abstract framework that does the optimization for you, so that a data
>> scientist or data analyst doesn't have to worry about it, but what about
>> developers who do not want the overhead of SQL, optimizers, and
>> unnecessary abstractions? Application designers who know their data and
>> queries should be able to optimize at the level of RDD transformations and
>> actions. Does Spark provide a way to achieve the same level of
>> optimization using either SQL/Catalyst or raw RDD transformations?
>>
>> Thanks
>>
