This topic is a concern for us as well. In the data science world no one uses 
native Scala or Java by choice; it's R and Python, and Python is growing. Yet 
in Spark, Python is third in line for feature support, if at all.
This is why we have decoupled from Spark in our project. It's really 
unfortunate the Spark team has invested so heavily in Scala.
As for speed, it comes from horizontal scaling and throughput. When you can 
scale outward, individual VM performance is less of an issue. Basic HPC principles.


Sent from my Verizon, Samsung Galaxy smartphone
-------- Original message --------
From: Mich Talebzadeh <mich.talebza...@gmail.com>
Date: 9/1/16 6:01 PM (GMT-05:00)
To: Jakob Odersky <ja...@odersky.com>
Cc: ayan guha <guha.a...@gmail.com>, kant kodali <kanth...@gmail.com>, Assaf Mendelson <assaf.mendel...@rsa.com>, user <user@spark.apache.org>
Subject: Re: Scala Vs Python
Hi Jakob,
My understanding of a Dataset is that it is basically an RDD with some 
optimization gone into it. Is an RDD meant to deal with unstructured data?
Now a DataFrame is the tabular form of an RDD, designed for tabular work: CSV, 
SQL and so on.
When you mention that DataFrame is just an alias for Dataset[Row], does that 
mean it converts an RDD to a Dataset, thus producing a tabular format?
Thanks
Thanks



Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com

Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:
> However, what really worries me is not having the Dataset API at all in Python.
> I think that's a deal breaker.

What is the functionality you are missing? In Spark 2.0 a DataFrame is just an 
alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in 
core/.../o/a/s/sql/package.scala).
Since Python is dynamically typed, you wouldn't really gain anything by using 
Datasets anyway.
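To illustrate Jakob's point with a plain-Python sketch (no Spark involved; `Person` is a made-up example type): the main thing a typed Dataset buys you in Scala is compile-time checking of field names and types, and that check simply has no counterpart in a dynamically typed language.

```python
from dataclasses import dataclass

# In Scala, a Dataset[Person] would reject a misspelled field at compile
# time. Python resolves attribute names only when the line executes, so
# a typed-Dataset API could not add that safety here.

@dataclass
class Person:
    name: str
    age: int

p = Person("Ada", 36)
print(p.age)  # 36 -- a valid access works as expected

try:
    p.agee  # typo: nothing catches this before the code runs
except AttributeError:
    print("typo surfaced only at runtime")
```

Whether or not you add type annotations, Python discovers the typo only when that line runs, which is why a Dataset[T]-style API would not gain Python users much over the existing DataFrame one.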

On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
Thanks All for your replies.
Feature Parity:
MLlib, RDD and DataFrame features are totally comparable. Streaming is now at 
par in functionality too, I believe. However, what really worries me is not 
having the Dataset API at all in Python. I think that's a deal breaker.
Performance: I do get this bit when RDDs are involved, but not when DataFrames 
are the only construct I am operating on. DataFrames are supposed to be 
language-agnostic in terms of performance. So why do people think Python is 
slower? Is it because of UDFs? Any other reason?
Is there any kind of benchmarking/stats around Python UDF vs Scala UDF 
comparisons, like the ones out there for RDDs?
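Not a Spark benchmark, but a self-contained sketch of where Python UDF overhead typically comes from: each row has to cross a serialization boundary between the JVM and the Python worker process, while built-in DataFrame expressions never leave the engine. Simulating that boundary with per-value pickle round-trips (an assumption for illustration only, not Spark's actual wire format) shows the cost sits in the marshalling, not the arithmetic:

```python
import pickle
import timeit

data = list(range(10_000))

def native_style(rows):
    # Stays "inside the engine": plain arithmetic, no marshalling.
    return [v + 1 for v in rows]

def udf_style(rows):
    # Mimics a Python UDF: every value is serialized and deserialized
    # on its way to and from the user function.
    return [pickle.loads(pickle.dumps(v)) + 1 for v in rows]

# Both produce identical results; only the cost differs.
assert native_style(data) == udf_style(data)

t_native = timeit.timeit(lambda: native_style(data), number=20)
t_udf = timeit.timeit(lambda: udf_style(data), number=20)
print(f"native: {t_native:.3f}s  udf-style: {t_udf:.3f}s")
```

The exact ratio varies by machine, but the marshalling version is consistently slower, which is the usual explanation for why DataFrame-only PySpark jobs match Scala while Python-UDF-heavy ones do not.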
@Kant: I am not comparing ANY applications, I am comparing SPARK applications 
only. I would be glad to hear your opinion on why PySpark applications will not 
work; if you have any benchmarks, please share if possible.





On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com> wrote:

C'mon man, this is a no-brainer. Dynamically typed languages for large code 
bases or large-scale distributed systems make absolutely no sense; I could 
write a 10-page essay on why that wouldn't work so great. You might be 
wondering, why would Spark have it then? Well, probably because of its ease of 
use for ML (that would be my best guess).

On Wed, Aug 31, 2016 11:45 PM, Assaf Mendelson <assaf.mendel...@rsa.com> wrote:

I believe this would greatly depend on your use case and your familiarity with 
the languages.

In general, Scala has much better performance than Python, and not all 
interfaces are available in Python.

That said, if you are planning to use DataFrames without any UDFs, the 
performance hit is practically nonexistent. Even if you need UDFs, it is 
possible to write them in Scala and wrap them for Python, and still get away 
without the performance hit. Python does not have an interface for UDAFs.

I believe that if you have large structured data and do not generally need 
UDFs/UDAFs, you can certainly work in Python without losing too much.
 
 
From: ayan guha [mailto:[hidden email]]
Sent: Thursday, September 01, 2016 5:03 AM
To: user
Subject: Scala Vs Python


Hi Users

Thought to ask (again and again) the question: while I am building any 
production application, should I use Scala or Python?

I have read many if not most articles, but all seem pre-Spark 2. Has anything 
changed with Spark 2, either in a pro-Scala way or a pro-Python way?

I am thinking of performance, feature parity and future direction, not so much 
in terms of skill set or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real-life examples, production experience, anecdotes, personal taste, 
profanity: all are welcome :)

-- 
Best Regards,
Ayan Guha








        
        
        


Sent from the Apache Spark User List mailing list archive at Nabble.com.


-- 
Best Regards,
Ayan Guha





