Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Anthony May
A sensible default is to use the language in which a system was developed,
or a highly compatible one. For Spark that would be Scala; however, I
assume you don't currently know Scala as well as Python, or at all. In
that case, the decision should also weigh your personal/team productivity
and your project constraints.
If you have the time, or you need bleeding-edge features and maximum
performance, then learning or strengthening your Scala is worth it and you
should use the Scala API.
If you're already very productive in Python, have tighter time
constraints, and don't need the bleeding-edge features or maximum
performance, then I'd recommend the Python API.

On Mon, 21 Nov 2016 at 11:58 Jon Gregg  wrote:

> Spark is written in Scala, so yes, it's still the strongest option. You
> also get the Dataset type with Scala (compile-time type safety), which
> isn't available in Python.
>
> That said, I think the Python API is a viable candidate if you use Pandas
> for data science. There are similarities between the DataFrame and Pandas
> APIs, and you can convert a Spark DataFrame to a Pandas DataFrame.
>
> On Mon, Nov 21, 2016 at 1:51 PM, Brandon White 
> wrote:
>
> Hello all,
>
> I will be starting a new Spark codebase and I would like to get opinions
> on using Python over Scala. Historically, the Scala API has always been the
> strongest interface to Spark. Is this still true? Are there still many
> benefits and additional features in the Scala API that are not available in
> the Python API? Are there any performance concerns using the Python API
> that do not exist when using the Scala API? Anything else I should know
> about?
>
> I appreciate any insight you have on using the Scala API over the Python
> API.
>
> Brandon
>


Re: Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Jon Gregg
Spark is written in Scala, so yes, it's still the strongest option. You
also get the Dataset type with Scala (compile-time type safety), which
isn't available in Python.

That said, I think the Python API is a viable candidate if you use Pandas
for data science. There are similarities between the DataFrame and Pandas
APIs, and you can convert a Spark DataFrame to a Pandas DataFrame.
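
To make the type-safety point concrete, here is a minimal sketch of a typed
Dataset versus an untyped DataFrame (the `Order` case class and its fields
are hypothetical, and it assumes a local Spark 2.x SparkSession):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type, for illustration only.
case class Order(id: Long, amount: Double)

object DatasetSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataset-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq(Order(1L, 9.99), Order(2L, 20.00)).toDS()

    // Typed Dataset: the compiler checks that Order has an `amount` field,
    // so a typo here fails at compile time.
    val big = orders.filter(_.amount > 10.0)

    // Untyped DataFrame: a misspelled column name ("ammount") would only
    // fail at runtime, when the query is analyzed.
    val bigDf = orders.toDF().filter($"amount" > 10.0)

    spark.stop()
  }
}
```

The Python API only offers the untyped DataFrame style, which is where the
compile-time checks are lost.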

On Mon, Nov 21, 2016 at 1:51 PM, Brandon White 
wrote:

> Hello all,
>
> I will be starting a new Spark codebase and I would like to get opinions
> on using Python over Scala. Historically, the Scala API has always been the
> strongest interface to Spark. Is this still true? Are there still many
> benefits and additional features in the Scala API that are not available in
> the Python API? Are there any performance concerns using the Python API
> that do not exist when using the Scala API? Anything else I should know
> about?
>
> I appreciate any insight you have on using the Scala API over the Python
> API.
>
> Brandon
>


Starting a new Spark codebase, Python or Scala / Java?

2016-11-21 Thread Brandon White
Hello all,

I will be starting a new Spark codebase and I would like to get opinions on
using Python over Scala. Historically, the Scala API has always been the
strongest interface to Spark. Is this still true? Are there still many
benefits and additional features in the Scala API that are not available in
the Python API? Are there any performance concerns using the Python API
that do not exist when using the Scala API? Anything else I should know
about?

I appreciate any insight you have on using the Scala API over the Python
API.

Brandon