I think Spark DataFrame supports more than just SQL; it is more like a pandas
DataFrame. (I rarely use the SQL feature.)
There are a lot of new optimizations in DataFrame, so I think it is quite
optimized for many tasks. The in-memory data structure is very memory
efficient. I just changed a very slow RDD program to use DataFrame. The
performance gain was about 2x while using less CPU. Of course, if you are very
good at optimizing your code, then use pure RDD.
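The memory-efficiency claim is easy to see even outside Spark. This is not Spark code, just a plain-Python sketch (stdlib `array` module, sizes via `sys.getsizeof`, so the numbers are approximate) of why a packed columnar layout, like the one DataFrames use internally, takes far less memory than boxed per-row objects:

```python
import sys
from array import array

n = 100_000

# Row-oriented: a tuple per row, with one boxed Python object per value.
# (Approximate: shared small-int objects are counted once per row.)
rows = [(i, float(i)) for i in range(n)]
row_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(r) + sys.getsizeof(r[0]) + sys.getsizeof(r[1]) for r in rows
)

# Column-oriented: two packed arrays of fixed-width machine values.
col_ints = array("q", range(n))                         # 8 bytes per int64
col_floats = array("d", (float(i) for i in range(n)))   # 8 bytes per float64
col_bytes = sys.getsizeof(col_ints) + sys.getsizeof(col_floats)

print(f"row layout:    {row_bytes / 1e6:.1f} MB")
print(f"column layout: {col_bytes / 1e6:.1f} MB")
```

The exact numbers vary by interpreter, but the packed layout typically comes out several times smaller, which is the same effect Tungsten gets by keeping rows in binary form instead of as JVM objects.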


On Tue, Feb 2, 2016 at 8:08 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Dataset will have access to some of the Catalyst/Tungsten optimizations
> while also giving you Scala and types. However, that is currently
> experimental and not yet as efficient as it could be.
> On Feb 2, 2016 7:50 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>
>> Sure, having a common distributed query and compute engine for all kinds
>> of data sources is an alluring concept to market and advertise and to
>> attract potential customers (non-engineers, analysts, data scientists). But
>> it's nothing new! It's darn old school, taking bits and pieces from
>> existing SQL and NoSQL technology, and it lacks much of the polish of a
>> robust SQL engine. I think what sets Spark apart from everything else on
>> the market is RDD, and the flexibility and Scala-like programming style
>> given to developers, which is simply much more attractive to write than SQL
>> syntax, schemas, and string constants that fall apart left and right.
>> Writing SQL is old school, period. Good luck making money though :)
>>
>> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> To have a product Databricks can charge for, their SQL engine needs to be
>>> competitive. That's why they have these optimizations in Catalyst. RDD is
>>> simply no longer the focus.
>>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>
>>>> So the latest optimizations in the Spark 1.4 and 1.5 releases are mostly
>>>> from project Tungsten. The docs say it uses sun.misc.Unsafe to convert the
>>>> physical RDD structure into byte arrays at some point for optimized GC and
>>>> memory use. My question is: why is this only applicable to SQL/DataFrames
>>>> and not RDDs? RDDs have types too!
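For intuition about the byte-array point above: this is plain Python, not Spark internals, and the record layout is made up for illustration. The stdlib `struct` module can sketch the idea of laying records out as fixed-width fields in one contiguous buffer, addressed by offset arithmetic rather than object references:

```python
import struct

# Hypothetical record schema: (id: int64, score: float64) -> 16 bytes each.
RECORD = struct.Struct("<qd")

def write_records(records):
    """Pack records into one contiguous bytearray (no per-record objects)."""
    buf = bytearray(RECORD.size * len(records))
    for i, (rid, score) in enumerate(records):
        RECORD.pack_into(buf, i * RECORD.size, rid, score)
    return buf

def read_record(buf, i):
    """Random access by offset arithmetic instead of pointer chasing."""
    return RECORD.unpack_from(buf, i * RECORD.size)

buf = write_records([(1, 0.5), (2, 1.5), (3, 2.5)])
print(len(buf))             # 48 bytes for 3 records
print(read_record(buf, 1))  # (2, 1.5)
```

One buffer holding many records means the garbage collector sees a single object instead of millions, which is the GC benefit the docs describe. A schema-aware layer like SQL/DataFrames knows the field types and widths up front and can choose such a layout; with an opaque RDD[T], Spark only sees serialized objects.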
>>>>
>>>>
>>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com>
>>>> wrote:
>>>>
>>>>> I haven't gone through many of the details of the Spark Catalyst
>>>>> optimizer and the Tungsten project, but we have been advised by
>>>>> Databricks support to use DataFrames to resolve the OOM errors we are
>>>>> getting during Join and GroupBy operations. We use Spark 1.3.1, and it
>>>>> looks like it cannot perform an external sort and blows up with OOM.
>>>>>
>>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>>
>>>>> Now it's great that this has been addressed in the Spark 1.5 release,
>>>>> but why is Databricks advocating switching to DataFrames? It may make
>>>>> sense for batch jobs or near-real-time jobs, but I am not sure it does
>>>>> when you are developing real-time analytics where you want to optimize
>>>>> every millisecond you can. That said, I am still educating myself on the
>>>>> DataFrame APIs and optimizations, and I will benchmark them against RDDs
>>>>> for our batch and real-time use cases as well.
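The "external sort" mentioned above is the classic spill-to-disk merge sort. This is not Spark's implementation, just a minimal stdlib sketch of the idea: sort bounded chunks in memory, spill each sorted run to a temp file, then stream-merge the runs with `heapq.merge` so memory stays bounded regardless of input size:

```python
import heapq
import itertools
import tempfile

def external_sort(values, chunk_size=4):
    """Sort an arbitrarily large iterable of ints using bounded memory."""
    it = iter(values)
    runs = []
    while True:
        chunk = sorted(itertools.islice(it, chunk_size))  # in-memory sort of one chunk
        if not chunk:
            break
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in chunk)             # spill the sorted run to disk
        f.seek(0)
        runs.append(f)
    streams = ((int(line) for line in f) for f in runs)
    return list(heapq.merge(*streams))                    # streaming k-way merge

print(external_sort([9, 1, 7, 3, 8, 2, 6, 4, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Without the spill step (as in 1.3.1's aggregation path, per the report above), the whole dataset for a key-heavy join or group-by has to fit in memory at once, which is exactly where the OOM comes from.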
>>>>>
>>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com
>>>>> > wrote:
>>>>>
>>>>>> What do you think is preventing you from optimizing your
>>>>>> own RDD-level transformations and actions?  AFAIK, nothing that has been
>>>>>> added in Catalyst precludes you from doing that.  The fact of the matter
>>>>>> is, though, that there is less type and semantic information available to
>>>>>> Spark from the raw RDD API than from using Spark SQL, DataFrames or
>>>>>> DataSets.  That means that Spark itself can't optimize for raw RDDs the
>>>>>> same way that it can for higher-level constructs that can leverage
>>>>>> Catalyst; but if you want to write your own optimizations based on your 
>>>>>> own
>>>>>> knowledge of the data types and semantics that are hiding in your raw 
>>>>>> RDDs,
>>>>>> there's no reason that you can't do that.
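One concrete instance of the hand optimization described above: if you know your values are cheaply combinable, you can pre-aggregate within each partition before shuffling (the difference between a naive group-then-sum and what `reduceByKey` does). Sketched here in plain Python, with lists standing in for partitions, since the shape of the optimization rather than Spark's API is the point:

```python
from collections import defaultdict

partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("b", 4), ("a", 5)],
]

def map_side_combine(partition):
    """Pre-sum within one partition so only one record per key is 'shuffled'."""
    acc = defaultdict(int)
    for key, value in partition:
        acc[key] += value
    return list(acc.items())

def reduce_by_key(parts):
    # 'Shuffle' only the combined records, then merge the partial sums.
    combined = [rec for p in parts for rec in map_side_combine(p)]
    totals = defaultdict(int)
    for key, value in combined:
        totals[key] += value
    return dict(totals)

print(reduce_by_key(partitions))  # {'a': 9, 'b': 6}
```

This is an optimization Catalyst applies automatically for DataFrame aggregations, but as noted above, nothing stops you from applying the same knowledge of your data by hand at the RDD level.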
>>>>>>
>>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Perhaps I should write a blog about why Spark is focusing more on
>>>>>>> making Spark jobs easier to write while hiding the underlying
>>>>>>> performance-optimization details from seasoned Spark users. It's one
>>>>>>> thing to provide an abstract framework that does the optimization for
>>>>>>> you, so you don't have to worry about it as a data scientist or data
>>>>>>> analyst, but what about developers who do not want the overhead of SQL,
>>>>>>> optimizers, and unnecessary abstractions? Application designers who
>>>>>>> know their data and queries should be able to optimize at the level of
>>>>>>> RDD transformations and actions. Does Spark provide a way to achieve
>>>>>>> the same level of optimization using either the SQL Catalyst optimizer
>>>>>>> or raw RDD transformations?
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [image: What's New with Xactly]
>>>>>>> <http://www.xactlycorp.com/email-click/>
>>>>>>>
>>>>>>> <https://www.nyse.com/quote/XNYS:XTLY>  [image: LinkedIn]
>>>>>>> <https://www.linkedin.com/company/xactly-corporation>  [image:
>>>>>>> Twitter] <https://twitter.com/Xactly>  [image: Facebook]
>>>>>>> <https://www.facebook.com/XactlyCorp>  [image: YouTube]
>>>>>>> <http://www.youtube.com/xactlycorporation>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
