I think Spark DataFrames support more than just SQL; they are more like pandas DataFrames. (I rarely use the SQL feature.) There are a lot of novelties in DataFrames, so I think they are quite optimized for many tasks. The in-memory data structure is very memory-efficient. I just changed a very slow RDD program to use DataFrames; the performance gain was about 2x while using less CPU. Of course, if you are very good at optimizing your code, then use pure RDDs.
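A minimal sketch of the kind of RDD-to-DataFrame migration described above. The `Event` schema and the aggregation are hypothetical stand-ins for the actual job; the API shown is the Spark 1.x `SQLContext` style that matches this thread's era.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Event(userId: String, amount: Double)

object RddToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-to-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val events = sc.parallelize(
      Seq(Event("a", 1.0), Event("a", 2.0), Event("b", 3.0)))

    // RDD version: rows live as JVM objects on the heap, and the closures
    // are opaque to Spark, so it cannot rewrite or fuse them.
    val rddTotals = events.map(e => (e.userId, e.amount)).reduceByKey(_ + _)

    // DataFrame version: a declarative plan over Tungsten's compact binary
    // row format, which Catalyst can optimize as a whole.
    val dfTotals = events.toDF().groupBy("userId").sum("amount")

    println(rddTotals.collect().toMap)
    dfTotals.show()
    sc.stop()
  }
}
```

Both forms compute the same per-user totals; the difference is whether Spark sees an opaque closure or an inspectable plan.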
On Tue, Feb 2, 2016 at 8:08 PM, Koert Kuipers <ko...@tresata.com> wrote:

> Dataset will have access to some of the Catalyst/Tungsten optimizations
> while also giving you Scala and types. However, that is currently
> experimental and not yet as efficient as it could be.
>
> On Feb 2, 2016 7:50 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>
>> Sure, having a common distributed query and compute engine for all kinds
>> of data sources is an alluring concept to market and advertise and to
>> attract potential customers (non-engineers, analysts, data scientists).
>> But it's nothing new; it's old school, taking bits and pieces from
>> existing SQL and NoSQL technology, and it lacks much of the polish of a
>> robust SQL engine. I think what sets Spark apart from everything else on
>> the market is the RDD, and the flexibility and Scala-like programming
>> style given to developers, which is simply much more attractive to write
>> than SQL syntax, schemas, and string constants that fall apart left and
>> right. Writing SQL is old school, period. Good luck making money though :)
>>
>> On Tue, Feb 2, 2016 at 4:38 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> To have a product Databricks can charge for, their SQL engine needs to
>>> be competitive. That's why they have these optimizations in Catalyst.
>>> RDD is simply no longer the focus.
>>>
>>> On Feb 2, 2016 7:17 PM, "Nirav Patel" <npa...@xactlycorp.com> wrote:
>>>
>>>> So the latest optimizations in the Spark 1.4 and 1.5 releases are
>>>> mostly from project Tungsten. The docs say it uses sun.misc.Unsafe to
>>>> convert the physical RDD structure into byte arrays at some point, for
>>>> optimized GC and memory use. My question is: why is this only
>>>> applicable to SQL/DataFrames and not RDDs? RDDs have types too!
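The experimental Dataset API Koert mentions (added in Spark 1.6) can be sketched roughly as follows; `Person` and the sample data are hypothetical, and a `SQLContext` with its implicits imported is assumed to be in scope.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dataset-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Typed like an RDD: the lambda below is checked against Person at
    // compile time, unlike a DataFrame's string column names.
    val people = Seq(Person("ann", 34), Person("bob", 15)).toDS()

    // Yet the rows are stored in Tungsten's binary format via an implicit
    // Encoder, and the plan still goes through Catalyst.
    val adults = people.filter(_.age >= 18)

    println(adults.collect().toSeq)
    sc.stop()
  }
}
```

This is the middle ground the thread is circling: RDD-style typed code that still benefits from some Catalyst/Tungsten optimizations, with the caveat (per Koert) that in 1.6 it was still experimental.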
>>>> On Mon, Jan 25, 2016 at 11:10 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>
>>>>> I haven't gone through many details of the Spark Catalyst optimizer
>>>>> and the Tungsten project, but we have been advised by Databricks
>>>>> support to use DataFrames to resolve OOM errors we are getting during
>>>>> join and groupBy operations. We use Spark 1.3.1, and it looks like it
>>>>> cannot perform an external sort and blows up with an OOM.
>>>>>
>>>>> https://forums.databricks.com/questions/2082/i-got-the-akka-frame-size-exceeded-exception.html
>>>>>
>>>>> Now it's great that this has been addressed in the Spark 1.5 release,
>>>>> but why is Databricks advocating a switch to DataFrames? It may make
>>>>> sense for batch or near-real-time jobs, but I am not sure it does when
>>>>> you are developing real-time analytics where you want to optimize
>>>>> every millisecond that you can. I am still educating myself on the
>>>>> DataFrame APIs and optimizations, and I will benchmark them against
>>>>> RDDs for our batch and real-time use cases as well.
>>>>>
>>>>> On Mon, Jan 25, 2016 at 9:47 AM, Mark Hamstra <m...@clearstorydata.com> wrote:
>>>>>
>>>>>> What do you think is preventing you from optimizing your own
>>>>>> RDD-level transformations and actions? AFAIK, nothing that has been
>>>>>> added in Catalyst precludes you from doing that. The fact of the
>>>>>> matter is, though, that there is less type and semantic information
>>>>>> available to Spark from the raw RDD API than from Spark SQL,
>>>>>> DataFrames, or Datasets. That means that Spark itself can't optimize
>>>>>> raw RDDs the same way it can optimize higher-level constructs that
>>>>>> leverage Catalyst; but if you want to write your own optimizations
>>>>>> based on your own knowledge of the data types and semantics hiding in
>>>>>> your raw RDDs, there's no reason you can't do that.
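The OOM pattern discussed above can be sketched with hypothetical data: an RDD `groupByKey` buffers every value for a key in memory, `reduceByKey` pre-aggregates map-side, and a DataFrame aggregate is planned by Catalyst as a partial + final aggregate (with external, spillable aggregation from Spark 1.5 onward). Again assuming a `SQLContext` and its implicits in scope.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("groupby-sketch").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val pairs = sc.parallelize(Seq(("k", 1), ("k", 2), ("j", 5)))

    // Risky under skew: every value for a key is held in one in-memory
    // buffer on a single executor before the closure runs.
    val grouped = pairs.groupByKey().mapValues(_.sum)

    // Safer RDD form: combines within each partition before the shuffle,
    // so only partial sums cross the wire.
    val reduced = pairs.reduceByKey(_ + _)

    // DataFrame form: the same intent expressed declaratively, letting
    // the planner choose the partial/final aggregation strategy.
    val summed = pairs.toDF("key", "value").groupBy("key").sum("value")

    println(reduced.collect().toMap)
    summed.show()
    sc.stop()
  }
}
```

All three produce the same per-key sums; the difference is how much memory the aggregation can demand for a single hot key.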
>>>>>> On Mon, Jan 25, 2016 at 9:35 AM, Nirav Patel <npa...@xactlycorp.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Perhaps I should write a blog post about why Spark is focusing more
>>>>>>> on making Spark jobs easier to write while hiding the underlying
>>>>>>> performance-optimization details from seasoned Spark users. It's one
>>>>>>> thing to provide an abstract framework that does the optimization
>>>>>>> for you, so that as a data scientist or data analyst you don't have
>>>>>>> to worry about it, but what about developers who do not want the
>>>>>>> overhead of SQL, optimizers, and unnecessary abstractions?
>>>>>>> Application designers who know their data and queries should be able
>>>>>>> to optimize at the level of RDD transformations and actions. Does
>>>>>>> Spark provide a way to achieve the same level of optimization using
>>>>>>> either SQL/Catalyst or raw RDD transformations?
>>>>>>>
>>>>>>> Thanks