> However, what really worries me is not having Dataset APIs at all in Python. I think that's a deal breaker.
What is the functionality you are missing? In Spark 2.0 a DataFrame is just
an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
core/.../o/a/s/sql/package.scala). Since Python is dynamically typed, you
wouldn't really gain anything by using Datasets anyway.

On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:

> Thanks all for your replies.
>
> Feature parity:
>
> MLlib, RDD and DataFrame features are totally comparable. Streaming is
> now at par in functionality too, I believe. However, what really worries
> me is not having Dataset APIs at all in Python. I think that's a deal
> breaker.
>
> Performance:
>
> I do get this bit when RDDs are involved, but not when DataFrame is the
> only construct I am operating on. DataFrames are supposed to be
> language-agnostic in terms of performance. So why do people think Python
> is slower? Is it because of UDFs? Any other reason?
>
> *Is there any kind of benchmarking/stats around Python UDF vs Scala UDF
> comparison, like the ones out there comparing RDDs?*
>
> @Kant: I am not comparing ANY applications. I am comparing SPARK
> applications only. I would be glad to hear your opinion on why pyspark
> applications will not work; if you have any benchmarks, please share if
> possible.
>
> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com> wrote:
>
>> C'mon man, this is a no-brainer: dynamically typed languages for large
>> code bases or large-scale distributed systems make absolutely no sense.
>> I could write a 10-page essay on why that wouldn't work so well. You
>> might be wondering why Spark would have it, then? Probably because of
>> its ease of use for ML (that would be my best guess).
>>
>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson assaf.mendel...@rsa.com
>> wrote:
>>
>>> I believe this would greatly depend on your use case and your
>>> familiarity with the languages.
>>>
>>> In general, Scala will have much better performance than Python, and
>>> not all interfaces are available in Python.
>>>
>>> That said, if you are planning to use DataFrames without any UDFs,
>>> then the performance hit is practically nonexistent.
>>>
>>> Even if you need UDFs, it is possible to write them in Scala, wrap
>>> them for Python, and still get away without the performance hit.
>>>
>>> Python does not have interfaces for UDAFs.
>>>
>>> I believe that if you have large structured data and do not generally
>>> need UDFs/UDAFs, you can certainly work in Python without losing too
>>> much.
>>>
>>> *From:* ayan guha [mailto:[hidden email]]
>>> *Sent:* Thursday, September 01, 2016 5:03 AM
>>> *To:* user
>>> *Subject:* Scala Vs Python
>>>
>>> Hi Users
>>>
>>> Thought to ask (again and again) the question: while building any
>>> production application, should I use Scala or Python?
>>>
>>> I have read many if not most articles, but all seem pre-Spark 2. Has
>>> anything changed with Spark 2, either in a pro-Scala or a pro-Python
>>> way?
>>>
>>> I am thinking of performance, feature parity and future direction, not
>>> so much in terms of skill set or ease of use.
>>>
>>> Or, if you think it is a moot point, please say so as well.
>>>
>>> Any real-life examples, production experience, anecdotes, personal
>>> taste, even profanity, all are welcome :)
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>
>
> --
> Best Regards,
> Ayan Guha
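
The dynamic-typing point made at the top of the thread (that a typed Dataset API would buy Python users little) can be sketched without Spark at all. The `Row` and `DataFrame` names below are plain-Python stand-ins for illustration, not the real pyspark classes:

```python
from typing import Dict, List

Row = Dict[str, object]  # stand-in for Spark's Row

# In Scala, DataFrame is literally `type DataFrame = Dataset[Row]`.
# The Python analogue below is only a type alias: nothing checks it
# at runtime, which is why a typed Dataset API adds little in Python.
DataFrame = List[Row]

def first_name(df: DataFrame) -> object:
    """Return the 'name' field of the first row."""
    return df[0]["name"]

print(first_name([{"name": "ayan"}]))  # ayan

# The annotation is not enforced; a "mistyped" argument still runs,
# so the compile-time safety Dataset[T] gives Scala has no Python twin.
print(first_name([{"name": 42}]))  # 42
```

In Scala, `Dataset[Person].map(_.age)` is checked by the compiler; the sketch above shows why the interpreter can offer no equivalent guarantee.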
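
On the UDF question raised in the thread: a Python UDF is slower than a built-in Column expression because every row must be serialized from the JVM to a Python worker and back, while built-in expressions stay inside the JVM. A rough stdlib-only simulation of that per-row round trip (pickle stands in for the real worker serialization; actual Spark numbers will differ):

```python
import pickle
import time

data = list(range(200_000))

def python_udf_style(values):
    """Mimic a Python UDF: every value crosses a serialization
    boundary, as rows do between the JVM and the Python worker."""
    return [pickle.loads(pickle.dumps(v)) + 1 for v in values]

def builtin_expr_style(values):
    """Mimic a built-in Column expression: one bulk pass with no
    per-row serialization."""
    return [v + 1 for v in values]

t0 = time.perf_counter()
a = python_udf_style(data)
t1 = time.perf_counter()
b = builtin_expr_style(data)
t2 = time.perf_counter()

assert a == b  # same result, very different cost
print(f"simulated Python UDF: {t1 - t0:.3f}s")
print(f"simulated built-in:   {t2 - t1:.3f}s")
```

This also explains the advice above about writing UDFs in Scala and wrapping them for Python: a Scala UDF runs inside the JVM, so the per-row boundary crossing never happens.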