Re: Attempting to avoid a shuffle on join

2019-07-05 Thread Chris Teoh
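DataFrames have a partitionBy function too, but it lives on the writer
(df.write.partitionBy) and only controls how output files are laid out. The
closest in-memory equivalent is repartition on the join column, which is
itself a shuffle.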

You can avoid a shuffle if one of your datasets is small enough to
broadcast.
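
As a rough illustration of why a broadcast join needs no shuffle, here is a
plain-Python sketch (not Spark code): every partition of the large dataset is
joined locally against a full copy of the small one, so no rows of the large
side change partition. The function and variable names are made up for this
example. In PySpark the corresponding hint is pyspark.sql.functions.broadcast,
as in large_df.join(broadcast(small_df), "key"), and Spark may also broadcast
automatically for tables below spark.sql.autoBroadcastJoinThreshold.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_rows):
    """Inner-join each partition of the large side against a copy of the small side.

    large_partitions: list of partitions, each a list of (key, value) pairs.
    small_rows: list of (key, value) pairs, assumed small enough to copy everywhere.
    Returns one output partition per input partition; no rows change partition.
    """
    # Build the lookup table once; this is the part that would be "broadcast".
    lookup = defaultdict(list)
    for key, value in small_rows:
        lookup[key].append(value)

    joined = []
    for partition in large_partitions:
        # Each partition only consults its local copy of the lookup table.
        joined.append([
            (key, left, right)
            for key, left in partition
            for right in lookup.get(key, [])
        ])
    return joined


# Hypothetical data: two partitions of a "large" dataset and one small lookup set.
large = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
small = [("a", "x"), ("b", "y")]
result = broadcast_hash_join(large, small)
```

The trade-off is memory: the small side has to fit on every executor, which is
why this only helps when one dataset really is small.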

On Thu., 4 Jul. 2019, 7:34 am Mkal,  wrote:

> Please keep in mind I'm fairly new to Spark.
> I have some Spark code where I load two text files as datasets and, after
> some map and filter operations to bring the columns into a specific shape,
> I join the datasets.
>
> The join takes place on a common column (of type string).
> Is there any way to avoid the exchange/shuffle before the join?
>
> As I understand it, the idea is that if I initially hash partition the
> datasets on the join column, then the join only has to look within
> matching partitions, thus avoiding a shuffle.
>
> In the RDD API, you can create a hash partitioner and use partitionBy when
> creating the RDDs (though I'm not sure this is a sure way to avoid the
> shuffle on the join). Is there any similar method for the Dataframe/Dataset
> API?
>
> I would also like to avoid repartition, repartitionByRange and bucketing,
> since I only intend to do one join and these also require shuffling
> beforehand.


Re: Learning Spark

2019-07-05 Thread Alex A. Reda
Hello,

I also second Gourav's point regarding the "Spark: The Definitive Guide" book.
It is great for learning both Scala-based and Python-based Spark. But as others
mentioned, you will need to keep reading the documentation, as Spark is
still undergoing a lot of improvements. I list additional resources below,
no plug :)

-   Excellent training on Spark 2 on Udemy by Jose Portilla. This one
is on PySpark; he also has a training on Scala. Not super advanced, but it
covers the basics very well.
https://www.udemy.com/apache-spark-with-python-big-data-with-pyspark-and-spark/

-   Great book on Spark 2, "Learning PySpark" by Tomasz Drabas and Denny
Lee, so far the best in the resource lineup for both Scala-based and
Python-based Spark.
https://www.packtpub.com/big-data-and-business-intelligence/learning-pyspark
(Read Chapters 1, 2, 4, and 6 to get immediate benefits.)

-   Great book on Spark, "Spark: The Definitive Guide" by Bill Chambers and
Matei Zaharia.
https://www.amazon.com/Spark-Definitive-Guide-Processing-Simple/dp/1491912219/ref=sr_1_1?ie=UTF8&qid=1540567390&sr=8-1&keywords=spark+the+definitive+guide
(Parts I, II, and VI are the most important to get started.) Apparently
they have a new edition; I am referring to the 2017 edition.

-   A bit dated now because Spark has evolved so much, but I like Jeffrey
Aven's book and style of writing too: "Sams Teach Yourself Apache Spark in
24 Hours".

In terms of actually learning, I would suggest practicing the code. Based
on my experience, you are better off installing Spark on your local PC; I
found this a much better way of learning than using an enterprise cluster.
Depending on which route you take, if you decide to focus on PySpark,
learning scikit-learn will give you a lot of transferable skills.

One final note: I am offering these suggestions from the perspective of a
data scientist.

Kind regards,

Alex Reda







On Fri, Jul 5, 2019 at 9:24 AM Gourav Sengupta 
wrote:

> Okay, this is all something I would disagree with.
>
> Dr. Matei Zaharia created Spark.
> He and Bill Chambers then wrote a book on Spark recently.
> He is still the main thinking power behind Spark (look at his research at
> Stanford).
> The name of the book is "Spark: The Definitive Guide"; it is the best
> book on, and introduction to, Spark.
>
> I have been through a lot of documentation and at least 40 books on Spark,
> and nothing even comes close to this book. It also puts to rest many of
> the arguments around which language to choose.
>
> Thanks and Regards,
> Gourav Sengupta
>
> On Fri, Jul 5, 2019 at 11:55 AM Vikas Garg  wrote:
>
>> Thanks!!!
>>
>> On Fri, 5 Jul 2019 at 15:38, Chris Teoh  wrote:
>>
>>> Scala is better suited to data engineering work. It also has better
>>> integration with other components like HBase, Kafka, etc.
>>>
>>> Python is great for data scientists as there are more data science
>>> libraries available in Python.
>>>
>>> On Fri., 5 Jul. 2019, 7:40 pm Vikas Garg,  wrote:
>>>
>>>> Is there any disadvantage of using Python? I have gone through multiple
>>>> articles which say that Python has advantages over Scala.
>>>>
>>>> Scala is super fast in comparison but Python has more pre-built
>>>> libraries and options for analytics.
>>>>
>>>> Should I still go with Scala?
>>>>
>>>> On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer wrote:
>>>>
>>>>> Since you are a data engineer I would start by learning Scala. The
>>>>> parts of Scala you would need to learn are pretty basic. Start with the
>>>>> examples on the Spark website, which gives examples in multiple languages.
>>>>> Think of Scala as a typed version of Python. You will find that the error
>>>>> messages tend to be much more meaningful in Scala because that is the
>>>>> native language of Spark. If you don’t want to install the JVM and
>>>>> Scala, I highly recommend Databricks community edition as a place to
>>>>> start.
>>>>>
>>>>> On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg
>>>>> wrote:
>>>>>
>>>>>> I am currently working as a data engineer, working with Power BI and
>>>>>> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
>>>>>> also able to run queries through Spark on a multi-node cluster DB (I am
>>>>>> using Vertica DB and will later move to HDFS or SQL Server).
>>>>>>
>>>>>> I also have good knowledge of Python.
>>>>>>
>>>>>> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer
>>>>>> wrote:
>>>>>>
>>>>>>> Are you a data scientist or data engineer?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I am a new Spark learner. Can someone guide me on a strategy
>>>>>>>> for gaining expertise in PySpark?
>>>>>>>>
>>>>>>>> Thanks!!!
>>>>>>>>
>>>>>>>


Re: Learning Spark

2019-07-05 Thread Gourav Sengupta
Okay, this is all something I would disagree with.

Dr. Matei Zaharia created Spark.
He and Bill Chambers then wrote a book on Spark recently.
He is still the main thinking power behind Spark (look at his research at
Stanford).
The name of the book is "Spark: The Definitive Guide"; it is the best
book on, and introduction to, Spark.

I have been through a lot of documentation and at least 40 books on Spark,
and nothing even comes close to this book. It also puts to rest many of
the arguments around which language to choose.

Thanks and Regards,
Gourav Sengupta

On Fri, Jul 5, 2019 at 11:55 AM Vikas Garg  wrote:

> Thanks!!!
>
> On Fri, 5 Jul 2019 at 15:38, Chris Teoh  wrote:
>
>> Scala is better suited to data engineering work. It also has better
>> integration with other components like HBase, Kafka, etc.
>>
>> Python is great for data scientists as there are more data science
>> libraries available in Python.
>>
>> On Fri., 5 Jul. 2019, 7:40 pm Vikas Garg,  wrote:
>>
>>> Is there any disadvantage of using Python? I have gone through multiple
>>> articles which say that Python has advantages over Scala.
>>>
>>> Scala is super fast in comparison but Python has more pre-built
>>> libraries and options for analytics.
>>>
>>> Should I still go with Scala?
>>>
>>> On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer  wrote:
>>>
>>>> Since you are a data engineer I would start by learning Scala. The
>>>> parts of Scala you would need to learn are pretty basic. Start with the
>>>> examples on the Spark website, which gives examples in multiple languages.
>>>> Think of Scala as a typed version of Python. You will find that the error
>>>> messages tend to be much more meaningful in Scala because that is the
>>>> native language of Spark. If you don’t want to install the JVM and
>>>> Scala, I highly recommend Databricks community edition as a place to start.
>>>>
>>>> On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg wrote:
>>>>
>>>>> I am currently working as a data engineer, working with Power BI and
>>>>> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
>>>>> also able to run queries through Spark on a multi-node cluster DB (I am
>>>>> using Vertica DB and will later move to HDFS or SQL Server).
>>>>>
>>>>> I also have good knowledge of Python.
>>>>>
>>>>> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer
>>>>> wrote:
>>>>>
>>>>>> Are you a data scientist or data engineer?
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am a new Spark learner. Can someone guide me on a strategy
>>>>>>> for gaining expertise in PySpark?
>>>>>>>
>>>>>>> Thanks!!!
>>>>>>>
>>>>>>


Re: Learning Spark

2019-07-05 Thread Vikas Garg
Thanks!!!

On Fri, 5 Jul 2019 at 15:38, Chris Teoh  wrote:

> Scala is better suited to data engineering work. It also has better
> integration with other components like HBase, Kafka, etc.
>
> Python is great for data scientists as there are more data science
> libraries available in Python.
>
> On Fri., 5 Jul. 2019, 7:40 pm Vikas Garg,  wrote:
>
>> Is there any disadvantage of using Python? I have gone through multiple
>> articles which say that Python has advantages over Scala.
>>
>> Scala is super fast in comparison but Python has more pre-built libraries
>> and options for analytics.
>>
>> Should I still go with Scala?
>>
>> On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer  wrote:
>>
>>> Since you are a data engineer I would start by learning Scala. The parts
>>> of Scala you would need to learn are pretty basic. Start with the examples
>>> on the Spark website, which gives examples in multiple languages. Think of
>>> Scala as a typed version of Python. You will find that the error messages
>>> tend to be much more meaningful in Scala because that is the native
>>> language of Spark. If you don’t want to install the JVM and Scala, I
>>> highly recommend Databricks community edition as a place to start.
>>>
>>> On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg  wrote:
>>>
>>>> I am currently working as a data engineer, working with Power BI and
>>>> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
>>>> also able to run queries through Spark on a multi-node cluster DB (I am
>>>> using Vertica DB and will later move to HDFS or SQL Server).
>>>>
>>>> I also have good knowledge of Python.
>>>>
>>>> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer wrote:
>>>>
>>>>> Are you a data scientist or data engineer?
>>>>>
>>>>>
>>>>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I am a new Spark learner. Can someone guide me on a strategy
>>>>>> for gaining expertise in PySpark?
>>>>>>
>>>>>> Thanks!!!
>>>>>>
>>>>>


Re: Learning Spark

2019-07-05 Thread Chris Teoh
Scala is better suited to data engineering work. It also has better
integration with other components like HBase, Kafka, etc.

Python is great for data scientists as there are more data science
libraries available in Python.

On Fri., 5 Jul. 2019, 7:40 pm Vikas Garg,  wrote:

> Is there any disadvantage of using Python? I have gone through multiple
> articles which say that Python has advantages over Scala.
>
> Scala is super fast in comparison but Python has more pre-built libraries
> and options for analytics.
>
> Should I still go with Scala?
>
> On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer  wrote:
>
>> Since you are a data engineer I would start by learning Scala. The parts
>> of Scala you would need to learn are pretty basic. Start with the examples
>> on the Spark website, which gives examples in multiple languages. Think of
>> Scala as a typed version of Python. You will find that the error messages
>> tend to be much more meaningful in Scala because that is the native
>> language of Spark. If you don’t want to install the JVM and Scala, I
>> highly recommend Databricks community edition as a place to start.
>>
>> On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg  wrote:
>>
>>> I am currently working as a data engineer, working with Power BI and
>>> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
>>> also able to run queries through Spark on a multi-node cluster DB (I am
>>> using Vertica DB and will later move to HDFS or SQL Server).
>>>
>>> I also have good knowledge of Python.
>>>
>>> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer  wrote:
>>>
>>>> Are you a data scientist or data engineer?
>>>>
>>>>
>>>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am a new Spark learner. Can someone guide me on a strategy for
>>>>> gaining expertise in PySpark?
>>>>>
>>>>> Thanks!!!
>>>>>



Re: Learning Spark

2019-07-05 Thread Vikas Garg
Is there any disadvantage of using Python? I have gone through multiple
articles which say that Python has advantages over Scala.

Scala is super fast in comparison but Python has more pre-built libraries
and options for analytics.

Should I still go with Scala?

On Fri, 5 Jul 2019 at 13:07, Kurt Fehlhauer  wrote:

> Since you are a data engineer I would start by learning Scala. The parts
> of Scala you would need to learn are pretty basic. Start with the examples
> on the Spark website, which gives examples in multiple languages. Think of
> Scala as a typed version of Python. You will find that the error messages
> tend to be much more meaningful in Scala because that is the native
> language of Spark. If you don’t want to install the JVM and Scala, I
> highly recommend Databricks community edition as a place to start.
>
> On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg  wrote:
>
>> I am currently working as a data engineer, working with Power BI and
>> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
>> also able to run queries through Spark on a multi-node cluster DB (I am
>> using Vertica DB and will later move to HDFS or SQL Server).
>>
>> I also have good knowledge of Python.
>>
>> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer  wrote:
>>
>>> Are you a data scientist or data engineer?
>>>
>>>
>>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg  wrote:
>>>
>>>> Hi,
>>>>
>>>> I am a new Spark learner. Can someone guide me on a strategy for
>>>> gaining expertise in PySpark?
>>>>
>>>> Thanks!!!
>>>>
>>>


Re: Learning Spark

2019-07-05 Thread Kurt Fehlhauer
Since you are a data engineer I would start by learning Scala. The parts of
Scala you would need to learn are pretty basic. Start with the examples on
the Spark website, which gives examples in multiple languages. Think of
Scala as a typed version of Python. You will find that the error messages
tend to be much more meaningful in Scala because that is the native
language of Spark. If you don’t want to install the JVM and Scala, I
highly recommend Databricks community edition as a place to start.

On Thu, Jul 4, 2019 at 11:22 PM Vikas Garg  wrote:

> I am currently working as a data engineer, working with Power BI and
> SSIS (an ETL tool). For learning purposes, I have set up PySpark and am
> also able to run queries through Spark on a multi-node cluster DB (I am
> using Vertica DB and will later move to HDFS or SQL Server).
>
> I also have good knowledge of Python.
>
> On Fri, 5 Jul 2019 at 10:32, Kurt Fehlhauer  wrote:
>
>> Are you a data scientist or data engineer?
>>
>>
>> On Thu, Jul 4, 2019 at 10:34 PM Vikas Garg  wrote:
>>
>>> Hi,
>>>
>>> I am a new Spark learner. Can someone guide me on a strategy for
>>> gaining expertise in PySpark?
>>>
>>> Thanks!!!
>>>
>>