https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Hi Jacob.
>
> My understanding of a Dataset is that it is basically an RDD with some
> optimization applied to it. Is an RDD meant to deal with unstructured data?
>
> Now, a DataFrame is the tabular form of an RDD, designed for tabular work:
> CSV, SQL and so on.
>
> When you mention that DataFrame is just an alias for Dataset[Row], does that
> mean it converts an RDD to a Dataset, thus producing a tabular format?
>
> Thanks
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:
>
>> > However, what really worries me is not having Dataset APIs at all in
>> > Python. I think that's a deal breaker.
>>
>> What is the functionality you are missing? In Spark 2.0, a DataFrame is
>> just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
>> core/.../o/a/s/sql/package.scala).
>> Since Python is dynamically typed, you wouldn't really gain anything by
>> using Datasets anyway.
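>>
>> To make that concrete, here is a rough spark-shell sketch (Spark 2.0; the
>> case class and file name are only for illustration):
>>
>>   import org.apache.spark.sql.{DataFrame, Dataset}
>>   import spark.implicits._
>>
>>   // Untyped: a DataFrame is literally a Dataset[Row].
>>   val df: DataFrame = spark.read.json("people.json")
>>
>>   // Typed: a case class plus its implicit encoder gives a Dataset[Person],
>>   // so field names and types are checked at compile time.
>>   case class Person(name: String, age: Long)
>>   val ds: Dataset[Person] = df.as[Person]
>>   val older: Dataset[Long] = ds.map(_.age + 1)
>>
>> That compile-time checking is essentially all a typed Dataset adds over a
>> DataFrame, and it is exactly the part a dynamically typed language like
>> Python cannot take advantage of.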
>>
>> On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>
>>> Thanks All for your replies.
>>>
>>> Feature Parity:
>>>
>>> MLlib, RDD and DataFrame features are fully comparable, and I believe
>>> Streaming is now at par in functionality too. However, what really worries
>>> me is not having Dataset APIs at all in Python. I think that's a deal
>>> breaker.
>>>
>>> Performance:
>>> I do get this when RDDs are involved, but not when a DataFrame is the
>>> only construct I am operating on. DataFrames are supposed to be
>>> language-agnostic in terms of performance. So why do people think Python
>>> is slower? Is it because of UDFs? Any other reason?
>>>
>>> Is there any kind of benchmarking/stats around a Python UDF vs Scala UDF
>>> comparison, like the ones out there for RDDs?
>>>
>>> @Kant: I am not comparing ANY applications, only SPARK applications. I
>>> would be glad to hear your opinion on why PySpark applications will not
>>> work; if you have any benchmarks, please share them if possible.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com> wrote:
>>>
>>>> C'mon, man, this is a no-brainer: dynamically typed languages for large
>>>> code bases or large-scale distributed systems make absolutely no sense. I
>>>> could write a 10-page essay on why that wouldn't work so well. You might
>>>> be wondering why Spark has it then? Well, probably because of its ease of
>>>> use for ML (that would be my best guess).
>>>>
>>>>
>>>>
>>>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson assaf.mendel...@rsa.com
>>>> wrote:
>>>>
>>>>> I believe this would greatly depend on your use case and your
>>>>> familiarity with the languages.
>>>>>
>>>>>
>>>>>
>>>>> In general, Scala will have much better performance than Python, and not
>>>>> all interfaces are available in Python.
>>>>>
>>>>> That said, if you are planning to use DataFrames without any UDFs, then
>>>>> the performance hit is practically nonexistent.
>>>>>
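>>>>> A small Scala sketch of where that difference comes from (assuming a
>>>>> SparkSession named spark; the column and values are made up):
>>>>>
>>>>>   import spark.implicits._
>>>>>   import org.apache.spark.sql.functions.{col, udf}
>>>>>
>>>>>   val df = Seq(1, 2, 3).toDF("x")
>>>>>
>>>>>   // Built-in expression: Catalyst can optimize it, so the plan (and the
>>>>>   // speed) is the same whether it was written from Scala or from Python.
>>>>>   val viaBuiltin = df.withColumn("doubled", col("x") * 2)
>>>>>
>>>>>   // UDF: a black box to the optimizer; in PySpark it additionally means
>>>>>   // serializing every row out to Python worker processes and back.
>>>>>   val double = udf((x: Int) => x * 2)
>>>>>   val viaUdf = df.withColumn("doubled", double(col("x")))
>>>>>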
>>>>> Even if you need UDFs, it is possible to write them in Scala, wrap them
>>>>> for Python, and still avoid the performance hit.
>>>>>
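>>>>> A minimal sketch of that wrapping pattern (the package, object and
>>>>> function names are made up, and the Python-side call goes through py4j
>>>>> internals, so treat it as an illustration only):
>>>>>
>>>>>   package com.example.udfs
>>>>>
>>>>>   import org.apache.spark.sql.SparkSession
>>>>>
>>>>>   object MyUdfs {
>>>>>     // Registers a plain Scala function as a SQL function on the session.
>>>>>     def register(spark: SparkSession): Unit =
>>>>>       spark.udf.register("plusOne", (x: Int) => x + 1)
>>>>>   }
>>>>>
>>>>> Ship that in a jar (--jars), call the registration once from PySpark, e.g.
>>>>> spark._jvm.com.example.udfs.MyUdfs.register(spark._jsparkSession), and then
>>>>> use plusOne from Python via spark.sql(...) or expr("plusOne(x)"); the UDF
>>>>> runs in the JVM, so there is no per-row round trip to Python workers.
>>>>>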
>>>>> Python does not have interfaces for UDAFs.
>>>>>
>>>>>
>>>>>
>>>>> I believe that if you have large, structured data and do not generally
>>>>> need UDFs/UDAFs, you can certainly work in Python without losing too much.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> From: ayan guha
>>>>> Sent: Thursday, September 01, 2016 5:03 AM
>>>>> To: user
>>>>> Subject: Scala Vs Python
>>>>>
>>>>>
>>>>>
>>>>> Hi Users
>>>>>
>>>>>
>>>>>
>>>>> I thought I would ask (again and again) the question: while building a
>>>>> production application, should I use Scala or Python?
>>>>>
>>>>>
>>>>>
>>>>> I have read many if not most of the articles on this, but they all seem
>>>>> to be pre-Spark 2. Has anything changed with Spark 2, either in a
>>>>> pro-Scala or a pro-Python way?
>>>>>
>>>>>
>>>>>
>>>>> I am thinking in terms of performance, feature parity and future
>>>>> direction, not so much skill set or ease of use.
>>>>>
>>>>>
>>>>>
>>>>> Or, if you think it is a moot point, please say so as well.
>>>>>
>>>>>
>>>>>
>>>>> Any real-life examples, production experience, anecdotes, personal
>>>>> taste, even profanity, are all welcome :)
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Best Regards,
>>>>> Ayan Guha
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Ayan Guha
>>>
>>
>>
>
