Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread Hariharan
Hi Huizhe,

You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
That way your spark job will use S3 for all operations that need HDFS.
Intermediate data will still be stored on local disk though.
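
For example, a minimal core-site.xml entry for this might look like the following
(the bucket name here is just a placeholder):

  <property>
    <name>fs.defaultFS</name>
    <value>s3a://your-bucket</value>
  </property>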

Thanks,
Hari

On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari 
wrote:

> While Spark can read from S3 directly in EMR, I believe it still needs
> HDFS to perform shuffles and to write intermediate data to disk when
> doing jobs (i.e. when the in-memory data needs to spill over to disk).
>
> For these operations, Spark does need a distributed file system - You
> could use something like EMRFS (which is like a HDFS backed by S3) on
> Amazon.
>
> The issue could be something else too - so a stacktrace or error message
> could help in understanding the problem.
>
>
>
> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>
>> Hi,
>>
>> I want to use Spark on YARN without HDFS. I store my resources in AWS and
>> use s3a to access them. However, when I used stop-dfs.sh to stop the NameNode and
>> DataNode, I got an error when using yarn cluster mode. Can I use YARN
>> without starting DFS, and how would I use this mode?
>>
>> Yours,
>> Jane
>>
>


Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Hariharan
Hi Akshay,

I believe HDP uses the capacity scheduler by default. In the capacity
scheduler, assignment of multiple containers on the same node is
determined by the option
yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled,
which is true by default. If you would like YARN to spread out the
containers, you can set this to false.

You can read about this and the associated parameters here:
https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
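
For example, setting it to false in capacity-scheduler.xml (a sketch, using the
property name referenced above):

  <property>
    <name>yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled</name>
    <value>false</value>
  </property>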

~ Hari


On Mon, May 20, 2019 at 11:16 AM Akshay Bhardwaj
 wrote:
>
> Hi All,
>
> Just floating this email again. Grateful for any suggestions.
>
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Mon, May 20, 2019 at 12:25 AM Akshay Bhardwaj 
>  wrote:
>>
>> Hi All,
>>
>> I am running Spark 2.3 on YARN using HDP 2.6
>>
>> I am running a Spark job using dynamic resource allocation on YARN with a
>> minimum of 2 executors and a maximum of 6. My job reads data from parquet files
>> on S3 buckets and stores some enriched data to Cassandra.
>>
>> My question is, how does YARN decide which nodes to launch containers on?
>> I have around 12 YARN nodes running in the cluster, but I still see repeated
>> patterns of 3-4 containers launched on the same node for a particular job.
>>
>> What is the best way to start debugging this?
>>
>> Akshay Bhardwaj
>> +91-97111-33849




Re: [spark on yarn] spark on yarn without DFS

2019-05-20 Thread JB Data31
There is a kind of check in the *yarn-site.xml*


  <property>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/var/yarn/logs</value>
  </property>

Using *hdfs://:9000* as *fs.defaultFS* in *core-site.xml*, you have to run *hdfs
dfs -mkdir /var/yarn/logs*.
Using *S3://* as *fs.defaultFS*...

Take care of the *.dir* properties in *hdfs-site.xml*. They must point to local or
S3 values.

Curious to see *YARN* working without *DFS*.

@*JB*Δ 

Le lun. 20 mai 2019 à 09:54, Hariharan  a écrit :

> Hi Huizhe,
>
> You can set the "fs.defaultFS" field in core-site.xml to some path on s3.
> That way your spark job will use S3 for all operations that need HDFS.
> Intermediate data will still be stored on local disk though.
>
> Thanks,
> Hari
>
> On Mon, May 20, 2019 at 10:14 AM Abdeali Kothari 
> wrote:
>
>> While spark can read from S3 directly in EMR, I believe it still needs
>> the HDFS to perform shuffles and to write intermediate data into disk when
>> doing jobs (I.e. when the in memory need stop spill over to disk)
>>
>> For these operations, Spark does need a distributed file system - You
>> could use something like EMRFS (which is like a HDFS backed by S3) on
>> Amazon.
>>
>> The issue could be something else too - so a stacktrace or error message
>> could help in understanding the problem.
>>
>>
>>
>> On Mon, May 20, 2019, 07:20 Huizhe Wang  wrote:
>>
>>> Hi,
>>>
>>> I wanna to use Spark on Yarn without HDFS.I store my resource in AWS and
>>> using s3a to get them. However, when I use stop-dfs.sh stoped Namenode and
>>> DataNode. I got an error when using yarn cluster mode. Could I using yarn
>>> without start DFS, how could I use this mode?
>>>
>>> Yours,
>>> Jane
>>>
>>


Watermark handling on initial query start (Structured Streaming)

2019-05-20 Thread Joe Ammann
Hi all

I'm currently developing a Spark structured streaming application which 
joins/aggregates messages from ~7 Kafka topics and produces messages onto 
another Kafka topic.

Quite often in my development cycle, I want to "reprocess from scratch": I stop 
the program, delete the target topic and associated checkpoint information, and 
restart the application with the query.

My assumption would be that the newly started query then processes all messages
that are on the input topics, sets the watermark according to the freshest
messages on the topic, and produces the output messages which have moved past
the watermark and can thus be safely emitted. As an example, if the freshest
message on the topic has an event time of "2019-05-20 10:13", I restart the
query at "2019-05-20 11:30", and I have a watermark duration of 10 minutes, I
would expect the query to have an eventTime watermark of "2019-05-20 10:03" and
all earlier results to be produced.
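
As a minimal sketch of the kind of query shape described (the broker, topic, and
checkpoint names here are placeholders, not the actual application):

  from pyspark.sql import functions as F

  events = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
      .option("subscribe", "input_topic")                  # placeholder topic
      .load()
      .select("timestamp", "value"))

  counts = (events
      .withWatermark("timestamp", "10 minutes")            # 10-minute watermark, as in the example
      .groupBy(F.window("timestamp", "5 minutes"))
      .count())

  query = (counts.writeStream
      .outputMode("append")                                # rows emitted once the watermark passes the window
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/demo")  # placeholder path
      .start())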

But my observations indicate that after initial query startup and reading all
input topics, the watermark stays at the Unix epoch (1970-01-01) and no messages
are produced. Only once a new message comes in, after the start of the query,
is the watermark moved ahead and all the messages produced.

Is this the expected behaviour and my assumption wrong? Or am I doing
something wrong during query setup?

-- 
CU, Joe




Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Akshay Bhardwaj
Hi Hari,

Thanks for this information.

Do you have any resources on (or can you explain) why YARN has this as the
default behaviour? What would be the advantages/scenarios of having multiple
assignments in a single heartbeat?


Regards
Akshay Bhardwaj
+91-97111-33849


On Mon, May 20, 2019 at 1:29 PM Hariharan  wrote:

> Hi Akshay,
>
> I believe HDP uses the capacity scheduler by default. In the capacity
> scheduler, assignment of multiple containers on the same node is
> determined by the option
> yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled,
> which is true by default. If you would like YARN to spread out the
> containers, you can set this for false.
>
> You can read learn about this and associated parameters here
> -
> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>
> ~ Hari
>
>
> On Mon, May 20, 2019 at 11:16 AM Akshay Bhardwaj
>  wrote:
> >
> > Hi All,
> >
> > Just floating this email again. Grateful for any suggestions.
> >
> > Akshay Bhardwaj
> > +91-97111-33849
> >
> >
> > On Mon, May 20, 2019 at 12:25 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
> >>
> >> Hi All,
> >>
> >> I am running Spark 2.3 on YARN using HDP 2.6
> >>
> >> I am running spark job using dynamic resource allocation on YARN with
> minimum 2 executors and maximum 6. My job read data from parquet files
> which are present on S3 buckets and store some enriched data to cassandra.
> >>
> >> My question is, how does YARN decide which nodes to launch containers?
> >> I have around 12 YARN nodes running in the cluster, but still i see
> repeated patterns of 3-4 containers launched on the same node for a
> particular job.
> >>
> >> What is the best way to start debugging this reason?
> >>
> >> Akshay Bhardwaj
> >> +91-97111-33849
>


Fetching LinkedIn data into PySpark using OAuth2.0

2019-05-20 Thread Aakash Basu
Hi,

Just curious to know if anyone was successful in connecting LinkedIn using
OAuth2.0, client ID and client secret to fetch data and process in
Python/PySpark.

I'm getting stuck at connection establishment.

Any help?

Thanks,
Aakash.


Re: Spark-YARN | Scheduling of containers

2019-05-20 Thread Hariharan
It makes scheduling faster. If you have a node that can accommodate 20
containers and you schedule one container per heartbeat, it would take 20
seconds (at the default one-second node heartbeat interval) to schedule all
the containers. OTOH, if you schedule multiple containers per heartbeat, it is
much faster.

- Hari

On Mon, 20 May 2019, 15:40 Akshay Bhardwaj, 
wrote:

> Hi Hari,
>
> Thanks for this information.
>
> Do you have any resources on/can explain, why YARN has this as default
> behaviour? What would be the advantages/scenarios to have multiple
> assignments in single heartbeat?
>
>
> Regards
> Akshay Bhardwaj
> +91-97111-33849
>
>
> On Mon, May 20, 2019 at 1:29 PM Hariharan  wrote:
>
>> Hi Akshay,
>>
>> I believe HDP uses the capacity scheduler by default. In the capacity
>> scheduler, assignment of multiple containers on the same node is
>> determined by the option
>> yarn.scheduler.capacity.per-node-heartbeat.multiple-assignments-enabled,
>> which is true by default. If you would like YARN to spread out the
>> containers, you can set this for false.
>>
>> You can read learn about this and associated parameters here
>> -
>> https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
>>
>> ~ Hari
>>
>>
>> On Mon, May 20, 2019 at 11:16 AM Akshay Bhardwaj
>>  wrote:
>> >
>> > Hi All,
>> >
>> > Just floating this email again. Grateful for any suggestions.
>> >
>> > Akshay Bhardwaj
>> > +91-97111-33849
>> >
>> >
>> > On Mon, May 20, 2019 at 12:25 AM Akshay Bhardwaj <
>> akshay.bhardwaj1...@gmail.com> wrote:
>> >>
>> >> Hi All,
>> >>
>> >> I am running Spark 2.3 on YARN using HDP 2.6
>> >>
>> >> I am running spark job using dynamic resource allocation on YARN with
>> minimum 2 executors and maximum 6. My job read data from parquet files
>> which are present on S3 buckets and store some enriched data to cassandra.
>> >>
>> >> My question is, how does YARN decide which nodes to launch containers?
>> >> I have around 12 YARN nodes running in the cluster, but still i see
>> repeated patterns of 3-4 containers launched on the same node for a
>> particular job.
>> >>
>> >> What is the best way to start debugging this reason?
>> >>
>> >> Akshay Bhardwaj
>> >> +91-97111-33849
>>
>


run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
Hi

I am wondering whether it is feasible to:
- build a spark application (with sbt/maven) based on spark2.4
- deploy that jar on yarn on a spark2.3 based installation

thanks in advance,


-- 
nicolas




Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
yarn can happily run multiple spark versions side-by-side
you will need the spark version you intend to launch with on the machine
you launch from and point to the correct spark-submit
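
For example (the paths, version, and jar name here are placeholders; the install
is only needed on the machine you submit from):

  export SPARK_HOME=/opt/spark-2.4.3-bin-hadoop2.7
  export HADOOP_CONF_DIR=/etc/hadoop/conf
  $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster my-app.jar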

On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
wrote:

> Hi
>
> I am wondering whether that's feasible to:
> - build a spark application (with sbt/maven) based on spark2.4
> - deploy that jar on yarn on a spark2.3 based installation
>
> thanks by advance,
>
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
> you will need the spark version you intend to launch with on the machine you
> launch from and point to the correct spark-submit

does this mean installing a second spark version (2.4) on the cluster?

thanks

On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> yarn can happily run multiple spark versions side-by-side
> you will need the spark version you intend to launch with on the machine you
> launch from and point to the correct spark-submit
> 
> On Mon, May 20, 2019 at 1:50 PM Nicolas Paris  
> wrote:
> 
> Hi
> 
> I am wondering whether that's feasible to:
> - build a spark application (with sbt/maven) based on spark2.4
> - deploy that jar on yarn on a spark2.3 based installation
> 
> thanks by advance,
> 
> 
> --
> nicolas
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 

-- 
nicolas




Re: run new spark version on old spark cluster ?

2019-05-20 Thread Pat Ferrel
It is always dangerous to run a NEWER version of code on an OLDER cluster.
The danger increases with the size of the semver change, and this one is not
just a build #. In other words, 2.4 is considered to be a fairly major change
from 2.3. Not much else can be said.


From: Nicolas Paris  
Reply: user@spark.apache.org  
Date: May 20, 2019 at 11:02:49 AM
To: user@spark.apache.org  
Subject:  Re: run new spark version on old spark cluster ?

> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit

does this mean to install a second spark version (2.4) on the cluster ?

thanks

On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> yarn can happily run multiple spark versions side-by-side
> you will need the spark version you intend to launch with on the machine
you
> launch from and point to the correct spark-submit
>
> On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
wrote:
>
> Hi
>
> I am wondering whether that's feasible to:
> - build a spark application (with sbt/maven) based on spark2.4
> - deploy that jar on yarn on a spark2.3 based installation
>
> thanks by advance,
>
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
nicolas



Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
correct. note that you only need to install spark on the node you launch it
from. spark doesn't need to be installed on the cluster itself.

the shared components between spark jobs on yarn are only really the
spark-shuffle-service in yarn and the spark-history-server. i have found
compatibility for these to be good. it's best if these run the latest version.
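
For reference, the usual wiring for the YARN external shuffle service looks
roughly like this (a sketch based on the standard Spark-on-YARN docs, not this
particular cluster):

  # spark-defaults.conf on the launch node
  spark.shuffle.service.enabled    true
  spark.dynamicAllocation.enabled  true

  <!-- yarn-site.xml on each NodeManager -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle,spark_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
    <value>org.apache.spark.network.yarn.YarnShuffleService</value>
  </property>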

On Mon, May 20, 2019 at 2:02 PM Nicolas Paris 
wrote:

> > you will need the spark version you intend to launch with on the machine
> you
> > launch from and point to the correct spark-submit
>
> does this mean to install a second spark version (2.4) on the cluster ?
>
> thanks
>
> On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> > yarn can happily run multiple spark versions side-by-side
> > you will need the spark version you intend to launch with on the machine
> you
> > launch from and point to the correct spark-submit
> >
> > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
> wrote:
> >
> > Hi
> >
> > I am wondering whether that's feasible to:
> > - build a spark application (with sbt/maven) based on spark2.4
> > - deploy that jar on yarn on a spark2.3 based installation
> >
> > thanks by advance,
> >
> >
> > --
> > nicolas
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
> correct. note that you only need to install spark on the node you launch it
> from. spark doesnt need to be installed on cluster itself.

That sounds reasonably doable for me. My guess is I will have some
trouble making that spark version work with both hive & hdfs installed
on the cluster - or maybe it's actually plug-&-play, i don't know.

thanks

On Mon, May 20, 2019 at 02:16:43PM -0400, Koert Kuipers wrote:
> correct. note that you only need to install spark on the node you launch it
> from. spark doesnt need to be installed on cluster itself.
> 
> the shared components between spark jobs on yarn are only really
> spark-shuffle-service in yarn and spark-history-server. i have found
> compatibility for these to be good. its best if these run latest version.
> 
> On Mon, May 20, 2019 at 2:02 PM Nicolas Paris  
> wrote:
> 
> > you will need the spark version you intend to launch with on the machine
> you
> > launch from and point to the correct spark-submit
> 
> does this mean to install a second spark version (2.4) on the cluster ?
> 
> thanks
> 
> On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> > yarn can happily run multiple spark versions side-by-side
> > you will need the spark version you intend to launch with on the machine
> you
> > launch from and point to the correct spark-submit
> >
> > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris 
> wrote:
> >
> >     Hi
> >
> >     I am wondering whether that's feasible to:
> >     - build a spark application (with sbt/maven) based on spark2.4
> >     - deploy that jar on yarn on a spark2.3 based installation
> >
> >     thanks by advance,
> >
> >
> >     --
> >     nicolas
> >
> >     
> -
> >     To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> 
> --
> nicolas
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 

-- 
nicolas




High level explanation of dropDuplicates

2019-05-20 Thread Yeikel
Hi,

I am looking for a high-level explanation (overview) of how dropDuplicates [1]
works.

[1]
https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326

Could someone please explain?

Thank you



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
we had very few issues with hdfs or hive, but then we use hive only for
basic reading and writing of tables.

depending on your vendor you might have to add a few settings to your
spark-defaults.conf. i remember on hdp you had to set the hdp.version
somehow.
we prefer to build spark with hadoop being provided, and then add hadoop
classpath to spark classpath. this works well on cdh, hdp, and also for
cloud providers.

for example this is a typical build with hive for cdh 5 (which is based on
hadoop 2.6, you change hadoop version based on vendor):
dev/make-distribution.sh --name  --tgz -Phadoop-2.6
-Dhadoop.version=2.6.0 -Pyarn -Phadoop-provided -Phive
add hadoop classpath to the spark classpath in spark-env.sh:
export SPARK_DIST_CLASSPATH=$(hadoop classpath)

i think certain vendors support multiple "vendor supported" installs, so
you could also look into that if you are not comfortable with running your
own spark build.

On Mon, May 20, 2019 at 2:24 PM Nicolas Paris 
wrote:

> > correct. note that you only need to install spark on the node you launch
> it
> > from. spark doesnt need to be installed on cluster itself.
>
> That sound reasonably doable for me. My guess is I will have some
> troubles to make that spark version work with both hive & hdfs installed
> on the cluster - or maybe that's finally plug-&-play i don't know.
>
> thanks
>
> On Mon, May 20, 2019 at 02:16:43PM -0400, Koert Kuipers wrote:
> > correct. note that you only need to install spark on the node you launch
> it
> > from. spark doesnt need to be installed on cluster itself.
> >
> > the shared components between spark jobs on yarn are only really
> > spark-shuffle-service in yarn and spark-history-server. i have found
> > compatibility for these to be good. its best if these run latest version.
> >
> > On Mon, May 20, 2019 at 2:02 PM Nicolas Paris 
> wrote:
> >
> > > you will need the spark version you intend to launch with on the
> machine
> > you
> > > launch from and point to the correct spark-submit
> >
> > does this mean to install a second spark version (2.4) on the
> cluster ?
> >
> > thanks
> >
> > On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> > > yarn can happily run multiple spark versions side-by-side
> > > you will need the spark version you intend to launch with on the
> machine
> > you
> > > launch from and point to the correct spark-submit
> > >
> > > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > wrote:
> > >
> > > Hi
> > >
> > > I am wondering whether that's feasible to:
> > > - build a spark application (with sbt/maven) based on spark2.4
> > > - deploy that jar on yarn on a spark2.3 based installation
> > >
> > > thanks by advance,
> > >
> > >
> > > --
> > > nicolas
> > >
> > >
>  -
> > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> > >
> > >
> >
> > --
> > nicolas
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: run new spark version on old spark cluster ?

2019-05-20 Thread Nicolas Paris
Finally, it was easy to connect to both hive/hdfs. I just had to copy
the hive-site.xml from the old spark version, and that worked instantly
after unzipping.

Right now I am stuck on connecting to yarn.


On Mon, May 20, 2019 at 02:50:44PM -0400, Koert Kuipers wrote:
> we had very little issues with hdfs or hive, but then we use hive only for
> basic reading and writing of tables.
> 
> depending on your vendor you might have to add a few settings to your
> spark-defaults.conf. i remember on hdp you had to set the hdp.version somehow.
> we prefer to build spark with hadoop being provided, and then add hadoop
> classpath to spark classpath. this works well on cdh, hdp, and also for cloud
> providers.
> 
> for example this is a typical build with hive for cdh 5 (which is based on
> hadoop 2.6, you change hadoop version based on vendor):
> dev/make-distribution.sh --name  --tgz -Phadoop-2.6 
> -Dhadoop.version=
> 2.6.0 -Pyarn -Phadoop-provided -Phive
> add hadoop classpath to the spark classpath in spark-env.sh:
> export SPARK_DIST_CLASSPATH=$(hadoop classpath)
> 
> i think certain vendors support multiple "vendor supported" installs, so you
> could also look into that if you are not comfortable with running your own
> spark build.
> 
> On Mon, May 20, 2019 at 2:24 PM Nicolas Paris  
> wrote:
> 
> > correct. note that you only need to install spark on the node you launch
> it
> > from. spark doesnt need to be installed on cluster itself.
> 
> That sound reasonably doable for me. My guess is I will have some
> troubles to make that spark version work with both hive & hdfs installed
> on the cluster - or maybe that's finally plug-&-play i don't know.
> 
> thanks
> 
> On Mon, May 20, 2019 at 02:16:43PM -0400, Koert Kuipers wrote:
> > correct. note that you only need to install spark on the node you launch
> it
> > from. spark doesnt need to be installed on cluster itself.
> >
> > the shared components between spark jobs on yarn are only really
> > spark-shuffle-service in yarn and spark-history-server. i have found
> > compatibility for these to be good. its best if these run latest 
> version.
> >
> > On Mon, May 20, 2019 at 2:02 PM Nicolas Paris 
> wrote:
> >
> >     > you will need the spark version you intend to launch with on the
> machine
> >     you
> >     > launch from and point to the correct spark-submit
> >
> >     does this mean to install a second spark version (2.4) on the 
> cluster
> ?
> >
> >     thanks
> >
> >     On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> >     > yarn can happily run multiple spark versions side-by-side
> >     > you will need the spark version you intend to launch with on the
> machine
> >     you
> >     > launch from and point to the correct spark-submit
> >     >
> >     > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris <
> nicolas.pa...@riseup.net>
> >     wrote:
> >     >
> >     >     Hi
> >     >
> >     >     I am wondering whether that's feasible to:
> >     >     - build a spark application (with sbt/maven) based on spark2.4
> >     >     - deploy that jar on yarn on a spark2.3 based installation
> >     >
> >     >     thanks by advance,
> >     >
> >     >
> >     >     --
> >     >     nicolas
> >     >
> >     >   
>  -
> >     >     To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >     >
> >     >
> >
> >     --
> >     nicolas
> >
> >     
> -
> >     To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
> 
> --
> nicolas
> 
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> 
> 

-- 
nicolas




Re: High level explanation of dropDuplicates

2019-05-20 Thread Nicholas Hakobian
From doing some searching around in the spark codebase, I found the
following:

https://github.com/apache/spark/blob/163a6e298213f216f74f4764e241ee6298ea30b6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1452-L1474

So it appears there is no direct operation called dropDuplicates or
Deduplicate, but there is an optimizer rule that converts this logical
operation to a physical operation that is equivalent to grouping by all the
columns you want to deduplicate across (or all columns if you are doing
something like distinct), and taking the First() value. So (using a pySpark
code example):

df = input_df.dropDuplicates(['col1', 'col2'])

Is effectively shorthand for saying something like:

df = (input_df
    .groupBy('col1', 'col2')
    .agg(first(struct(input_df.columns)).alias('data'))
    .select('data.*'))

Except I assume that it has some internal optimization so it doesn't need
to pack/unpack the column data, and just returns the whole Row.
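
As a concrete (hypothetical) illustration of the behaviour, assuming an existing
SparkSession named spark and made-up sample data:

  from pyspark.sql.functions import first, struct

  data = [(1, 'a', 10), (1, 'a', 20), (2, 'b', 30)]
  input_df = spark.createDataFrame(data, ['col1', 'col2', 'col3'])

  # dropDuplicates keeps one (arbitrary) row per distinct (col1, col2) pair
  input_df.dropDuplicates(['col1', 'col2']).show()

  # roughly the equivalent "groupBy + first" form described above
  (input_df
      .groupBy('col1', 'col2')
      .agg(first(struct(input_df.columns)).alias('data'))
      .select('data.*')
      .show())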

Nicholas Szandor Hakobian, Ph.D.
Principal Data Scientist
Rally Health
nicholas.hakob...@rallyhealth.com



On Mon, May 20, 2019 at 11:38 AM Yeikel  wrote:

> Hi ,
>
> I am looking for a high level explanation(overview) on how
> dropDuplicates[1]
> works.
>
> [1]
>
> https://github.com/apache/spark/blob/db24b04cad421ed508413d397c6beec01f723aee/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2326
>
> Could someone please explain?
>
> Thank you
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: run new spark version on old spark cluster ?

2019-05-20 Thread Koert Kuipers
most likely have to set something in spark-defaults.conf like

spark.master yarn
spark.submit.deployMode client

On Mon, May 20, 2019 at 3:14 PM Nicolas Paris 
wrote:

> Finally that was easy to connect to both hive/hdfs. I just had to copy
> the hive-site.xml from the old spark version and that worked instantly
> after unzipping.
>
> Right now I am stuck on connecting to yarn.
>
>
> On Mon, May 20, 2019 at 02:50:44PM -0400, Koert Kuipers wrote:
> > we had very little issues with hdfs or hive, but then we use hive only
> for
> > basic reading and writing of tables.
> >
> > depending on your vendor you might have to add a few settings to your
> > spark-defaults.conf. i remember on hdp you had to set the hdp.version
> somehow.
> > we prefer to build spark with hadoop being provided, and then add hadoop
> > classpath to spark classpath. this works well on cdh, hdp, and also for
> cloud
> > providers.
> >
> > for example this is a typical build with hive for cdh 5 (which is based
> on
> > hadoop 2.6, you change hadoop version based on vendor):
> > dev/make-distribution.sh --name  --tgz -Phadoop-2.6
> -Dhadoop.version=
> > 2.6.0 -Pyarn -Phadoop-provided -Phive
> > add hadoop classpath to the spark classpath in spark-env.sh:
> > export SPARK_DIST_CLASSPATH=$(hadoop classpath)
> >
> > i think certain vendors support multiple "vendor supported" installs, so
> you
> > could also look into that if you are not comfortable with running your
> own
> > spark build.
> >
> > On Mon, May 20, 2019 at 2:24 PM Nicolas Paris 
> wrote:
> >
> > > correct. note that you only need to install spark on the node you
> launch
> > it
> > > from. spark doesnt need to be installed on cluster itself.
> >
> > That sound reasonably doable for me. My guess is I will have some
> > troubles to make that spark version work with both hive & hdfs
> installed
> > on the cluster - or maybe that's finally plug-&-play i don't know.
> >
> > thanks
> >
> > On Mon, May 20, 2019 at 02:16:43PM -0400, Koert Kuipers wrote:
> > > correct. note that you only need to install spark on the node you
> launch
> > it
> > > from. spark doesnt need to be installed on cluster itself.
> > >
> > > the shared components between spark jobs on yarn are only really
> > > spark-shuffle-service in yarn and spark-history-server. i have
> found
> > > compatibility for these to be good. its best if these run latest
> version.
> > >
> > > On Mon, May 20, 2019 at 2:02 PM Nicolas Paris <
> nicolas.pa...@riseup.net>
> > wrote:
> > >
> > > > you will need the spark version you intend to launch with on
> the
> > machine
> > > you
> > > > launch from and point to the correct spark-submit
> > >
> > > does this mean to install a second spark version (2.4) on the
> cluster
> > ?
> > >
> > > thanks
> > >
> > > On Mon, May 20, 2019 at 01:58:11PM -0400, Koert Kuipers wrote:
> > > > yarn can happily run multiple spark versions side-by-side
> > > > you will need the spark version you intend to launch with on
> the
> > machine
> > > you
> > > > launch from and point to the correct spark-submit
> > > >
> > > > On Mon, May 20, 2019 at 1:50 PM Nicolas Paris <
> > nicolas.pa...@riseup.net>
> > > wrote:
> > > >
> > > > Hi
> > > >
> > > > I am wondering whether that's feasible to:
> > > > - build a spark application (with sbt/maven) based on
> spark2.4
> > > > - deploy that jar on yarn on a spark2.3 based
> installation
> > > >
> > > > thanks by advance,
> > > >
> > > >
> > > > --
> > > > nicolas
> > > >
> > > >
> >
>   -
> > > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> > > >
> > > >
> > >
> > > --
> > > nicolas
> > >
> > >
>  -
> > > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> > >
> > >
> >
> > --
> > nicolas
> >
> > -
> > To unsubscribe e-mail: user-unsubscr...@spark.apache.org
> >
> >
>
> --
> nicolas
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


[pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Rishi Shah
Hi All,

Just wanted to confirm my understanding of actions on a dataframe. If a
dataframe is not persisted at any point, and count() is called on it
followed by a write action, this would trigger the dataframe computation twice
(which could be a performance hit for a larger dataframe). Could anyone
please help confirm?

-- 
Regards,

Rishi Shah


Re: [pyspark 2.3] count followed by write on dataframe

2019-05-20 Thread Keith Chapman
Yes, that is correct: that would cause the computation to happen twice. If you
want the computation to happen only once, you can cache the dataframe and call
count and write on the cached dataframe.
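
For example (a minimal PySpark sketch; the input and output paths are
placeholders):

  df = spark.read.parquet("s3a://bucket/input")    # placeholder input
  df.cache()                                       # mark for caching (materialized lazily)
  n = df.count()                                   # first action: computes the dataframe and fills the cache
  df.write.parquet("s3a://bucket/output")          # second action: reads from the cache instead of recomputing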

Regards,
Keith.

http://keith-chapman.com


On Mon, May 20, 2019 at 6:43 PM Rishi Shah  wrote:

> Hi All,
>
> Just wanted to confirm my understanding around actions on dataframe. If
> dataframe is not persisted at any point, & count() is called on a dataframe
> followed by write action --> this would trigger dataframe computation twice
> (which could be the performance hit for a larger dataframe).. Could anyone
> please help confirm?
>
> --
> Regards,
>
> Rishi Shah
>