https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Jakob,
>
> My understanding of a Dataset is that it is basically an RDD with some
> optimization gone into it, while an RDD is meant to deal with
> unstructured data. A DataFrame is then the tabular form of an RDD,
> designed for tabular work: CSV, SQL and so on.
>
> When you mention that DataFrame is just an alias for Dataset[Row], does
> that mean it converts an RDD to a Dataset, thus producing a tabular
> format?
>
> Thanks
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:
>
>> > However, what really worries me is not having Dataset APIs at all in
>> > Python. I think that's a deal breaker.
>>
>> What functionality are you missing? In Spark 2.0 a DataFrame is just an
>> alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
>> core/.../o/a/s/sql/package.scala). Since Python is dynamically typed,
>> you wouldn't really gain anything by using Datasets anyway.
>>
>> On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Thanks all for your replies.
>>>
>>> Feature parity:
>>>
>>> MLlib, RDD and DataFrame features are fully comparable, and streaming
>>> is now at par in functionality too, I believe. However, what really
>>> worries me is not having Dataset APIs at all in Python. I think that's
>>> a deal breaker.
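[Editor's note: Jakob's point about the alias can be sketched with a self-contained miniature. This is NOT Spark itself; `Row`, `Dataset` and `firstCsv` below are stand-in types invented for illustration, mirroring the one-line definition he quotes from sql/package.scala.]

```scala
// Miniature of "type DataFrame = Dataset[Row]": a DataFrame is not a
// separate class, it is literally a Dataset whose element type is the
// untyped Row. These stand-in types are hypothetical, not Spark's.
final case class Row(values: Any*)          // untyped record, like Spark's Row
final case class Dataset[T](data: Seq[T])   // stand-in for Spark's Dataset[T]

object Aliases {
  type DataFrame = Dataset[Row]             // the entire "DataFrame" definition

  // Any function written against DataFrame accepts a Dataset[Row]
  // directly, because the two are the same type, not merely convertible.
  def firstCsv(df: DataFrame): String =
    df.data.head.values.mkString(",")
}
```

In real Spark 2.x the same relationship is why `df.as[Person]` merely re-views the data with a typed Encoder: there is no RDD-to-Dataset conversion step, which answers Mich's question above.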
>>> Performance:
>>>
>>> I do get this bit when RDDs are involved, but not when a DataFrame is
>>> the only construct I am operating on. DataFrames are supposed to be
>>> language-agnostic in terms of performance, so why do people think
>>> Python is slower? Is it because of using UDFs? Any other reason?
>>>
>>> *Is there any kind of benchmarking/stats around a Python UDF vs Scala
>>> UDF comparison, like the ones out there for RDDs?*
>>>
>>> @Kant: I am not comparing ANY applications, I am comparing SPARK
>>> applications only. I would be glad to hear your opinion on why PySpark
>>> applications will not work, and if you have any benchmarks please
>>> share them if possible.
>>>
>>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> C'mon, this is a no-brainer: dynamically typed languages make
>>>> absolutely no sense for large code bases or large-scale distributed
>>>> systems. I could write a ten-page essay on why that wouldn't work so
>>>> great. You might be wondering why Spark has one then; probably
>>>> because of its ease of use for ML (that would be my best guess).
>>>>
>>>> On Wed, Aug 31, 2016 11:45 PM, Assaf Mendelson
>>>> <assaf.mendel...@rsa.com> wrote:
>>>>
>>>>> I believe this would greatly depend on your use case and your
>>>>> familiarity with the languages.
>>>>>
>>>>> In general, Scala has much better performance than Python, and not
>>>>> all interfaces are available in Python. That said, if you are
>>>>> planning to use DataFrames without any UDFs, the performance hit is
>>>>> practically nonexistent. Even if you need UDFs, it is possible to
>>>>> write them in Scala, wrap them for Python, and still avoid the
>>>>> performance hit. Python does not have interfaces for UDAFs.
>>>>>
>>>>> I believe that if you have large structured data and do not
>>>>> generally need UDFs/UDAFs, you can certainly work in Python without
>>>>> losing too much.
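[Editor's note: Assaf's suggestion of writing the UDF in Scala and wrapping it for Python can be sketched as below. This is a sketch under assumptions, not a tested recipe: `normalize` and `RegisterUdfs` are made-up names, and it assumes the Scala registration runs on the same SparkSession the PySpark shell is attached to. The point is that the function then executes inside the JVM, avoiding the per-row serialization round-trip to a Python worker that makes native Python UDFs slow.]

```scala
// Hypothetical sketch: implement the UDF once in Scala, register it
// under a SQL-visible name, then call it from PySpark by name so the
// work stays in the JVM instead of round-tripping rows to Python.
import org.apache.spark.sql.SparkSession

object RegisterUdfs {
  // "normalize" is an illustrative name, not a built-in.
  def register(spark: SparkSession): Unit = {
    spark.udf.register("normalize", (s: String) => s.trim.toLowerCase)
  }
}

// From PySpark, after RegisterUdfs.register has run on the same session:
//   spark.sql("SELECT normalize(name) FROM people")
// or on a DataFrame:
//   df.selectExpr("normalize(name)")
```

This pattern also partially answers ayan's performance question: the slowdown people report for PySpark is largely the Python UDF serialization boundary, not the DataFrame operations themselves, which compile to the same plans in both languages.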
>>>>>
>>>>> *From:* ayan guha [mailto:[hidden email]]
>>>>> *Sent:* Thursday, September 01, 2016 5:03 AM
>>>>> *To:* user
>>>>> *Subject:* Scala Vs Python
>>>>>
>>>>> Hi Users,
>>>>>
>>>>> Thought to ask (again and again) the question: while I am building a
>>>>> production application, should I use Scala or Python?
>>>>>
>>>>> I have read many if not most articles on this, but all seem to be
>>>>> pre-Spark 2. Has anything changed with Spark 2, either in a
>>>>> pro-Scala or a pro-Python way?
>>>>>
>>>>> I am thinking of performance, feature parity and future direction,
>>>>> not so much of skill set or ease of use.
>>>>>
>>>>> Or, if you think it is a moot point, please say so as well.
>>>>>
>>>>> Any real-life examples, production experience, anecdotes, personal
>>>>> taste and profanity are all welcome :)
>>>>>
>>>>> --
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>> ------------------------------
>>>>> View this message in context: RE: Scala Vs Python
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/RE-Scala-Vs-Python-tp27637.html>
>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha