may I need a join here?

2022-01-23 Thread Bitfox
>>> df.show(3)

+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+

only showing top 3 rows


>>> df2.show(3)

+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+

only showing top 3 rows


>>> df3=df.filter(~col("word").isin(df2.stopword ))

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 1733, in filter
    jdf = self._jdf.filter(condition._jc)
  File "/opt/spark/python/lib/py4j-0.10.9.2-src.zip/py4j/java_gateway.py", line 1310, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Resolved attribute(s) stopword#4
missing from word#0,count#1L in operator !Filter NOT word#0 IN
(stopword#4).;
!Filter NOT word#0 IN (stopword#4)
+- LogicalRDD [word#0, count#1L], false





The filter method doesn't work here.

Maybe I need a join between the two DataFrames?

What's the syntax for this?
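
A left anti join would be one way to express this filter; a rough sketch
against the two DataFrames shown above:

>>> from pyspark.sql.functions import col
>>> # keep only the rows of df whose "word" has no match in df2.stopword
>>> df3 = df.join(df2, col("word") == col("stopword"), "left_anti")
>>> df3.show(3)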



Thank you and regards,

Bitfox


Question about ports in spark

2022-01-23 Thread Bitfox
Hello

When Spark started on my home server, I saw two ports open: 8080 for the
master and 8081 for the worker.
If I keep these two ports open without any network filter, are there any
security issues?
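
For reference, those are the default standalone master/worker web UI ports;
a sketch of one way to move or restrict them (conf/spark-env.sh, assuming a
single-host standalone setup -- adjust to your deployment):

# conf/spark-env.sh -- sketch only
SPARK_MASTER_WEBUI_PORT=8080   # master web UI port (default)
SPARK_WORKER_WEBUI_PORT=8081   # worker web UI port (default)
# Binding the master to localhost keeps it off public interfaces, but only
# works if no remote workers need to connect:
SPARK_MASTER_HOST=127.0.0.1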

Thanks


Re: What are your experiences using google cloud platform

2022-01-23 Thread German Schiavon
Hi,

Changing cloud providers won't help if your job is slow, has skew, etc... I
think first you have to see why "big jobs" are not completing.
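
A quick way to check for key skew (a sketch; "key" is a placeholder for
whatever column the job groups or joins on):

>>> from pyspark.sql.functions import desc
>>> # a handful of groups with far larger counts than the rest suggests skew
>>> df.groupBy("key").count().orderBy(desc("count")).show(10)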


On Sun, 23 Jan 2022 at 22:18, Andrew Davidson 
wrote:

> Hi, I recently started using GCP Dataproc Spark.
>
>
>
> I seem to have trouble getting big jobs to complete. I am using
> checkpoints. I am wondering if maybe I should look for another cloud
> solution.
>
>
>
> Kind regards
>
>
>
> Andy
>


Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread ayan guha
Just a couple of points to add:

1. A "partition" is more of a logical construct, so partitions cannot fail. A
task which is reading from persistent storage into an RDD can fail, and thus
can be rerun to reprocess the partition. What Ranadip mentioned above is
true, with the caveat that data will actually be read into memory only after
an action is encountered; everything prior to that is a logical plan, and
because of Spark's lazy evaluation the data won't be materialized until it is
really required (see the sketch below).

2. Between storage and tasks, there is a concept called splits. Each task
actually runs on a split, and there are various ways to define the splits. In
practice, splits and partitions can be 1:1 mapped, but I think various
strategies can be defined to map splits to partitions.
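
A minimal sketch of the lazy-evaluation point in (1); the file path is
hypothetical:

>>> rdd = sc.textFile("/data/words.txt")       # nothing is read yet -- only a logical plan
>>> counts = rdd.flatMap(lambda line: line.split()) \
...           .map(lambda w: (w, 1)) \
...           .reduceByKey(lambda a, b: a + b)  # still lazy, still just a plan
>>> counts.count()                              # an action: only now are the partitions read and processed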

HTH.

On Mon, Jan 24, 2022 at 7:39 AM Ranadip Chatterjee 
wrote:

> Interesting question! I think this goes back to the roots of Spark. You
> ask "But suppose if I am reading a file that is distributed across nodes in
> partitions. So, what will happen if a partition fails that holds some
> data?". Assuming you mean the distributed file system that holds the file
> suffers a failure in the node that hosts the said partition - here's
> something that might explain this better:
>
> Spark does not come with its own persistent distributed file storage
> system. So Spark relies on the underlying file storage system to provide
> resilience over input file failure. Commonly, an open source stack will see
> HDFS (Hadoop Distributed File System) being the persistent distributed file
> system that Spark reads from and writes back to. In that case, your
> specific case will likely mean a failure of the datanode of HDFS that was
> hosting the partition of data that failed.
>
> Now, if the failure happens after Spark has completely read that partition
> and is in the middle of the job, Spark will progress with the job
> unhindered because it does not need to go back and re-read the data.
> Remember, Spark will save (as RDDs) all intermediate states of the data
> between stages and Spark stages continue to save the snippet of DAG from
> one RDD to the next. So, Spark can recover from its own node failures in
> the intermediate stages by simply rerunning the DAGs from the last saved
> RDD to recover to the next stage. The only case where Spark will need to
> reach out to HDFS is if the very first Spark stage encounters a failure
> before it has created the RDD for that partition.
>
> In that specific edge case, Spark will reach out to HDFS to request the
> failed HDFS block. At that point, if HDFS detects that the datanode hosting
> that block is not responding, it will transparently redirect Spark to
> another replica of the same block. So, the job will progress unhindered in
> this case (perhaps a tad slower as the read may no longer be node-local).
> Only in the extreme scenarios where HDFS has a catastrophic failure, all
> the replicas are offline at that exact moment or the file was saved with
> only 1 replica - will the Spark job fail as there is no way for it to
> recover the said partitions. (In my rather long time working with Spark, I
> have never come across this scenario yet).
>
> Other distributed file systems behave similarly - e.g. Google Cloud
> Storage or Amazon S3 will have slightly different nuances but will behave
> very similarly to HDFS in this scenario.
>
> So, for all practical purposes, it is safe to say Spark will progress the
> job to completion in nearly all practical cases.
>
> Regards,
> Ranadip Chatterjee
>
>
> On Fri, 21 Jan 2022 at 20:40, Sean Owen  wrote:
>
>> Probably, because Spark prefers locality, but not necessarily.
>>
>> On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
>> kalgaonkarsiddh...@gmail.com> wrote:
>>
>>> Thank you so much for this information, Sean. One more question, that
>>> when it wants to re-run the failed partition, where does it run? On the
>>> same node or some other node?
>>>
>>>
>>> On Fri, 21 Jan 2022, 23:41 Sean Owen,  wrote:
>>>
 The Spark program already knows the partitions of the data and where
 they exist; that's just defined by the data layout. It doesn't care what
 data is inside. It knows partition 1 needs to be processed and if the task
 processing it fails, needs to be run again. I'm not sure where you're
 seeing data loss here? the data is already stored to begin with, not
 somehow consumed and deleted.

 On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
 kalgaonkarsiddh...@gmail.com> wrote:

> Okay, so suppose I have 10 records distributed across 5 nodes and the
> partition of the first node holding 2 records failed. I understand that it
> will re-process this partition but how will it come to know that XYZ
> partition was holding XYZ data so that it will pick again only those
> records and reprocess it? In case of failure of a partition, is there a
> data loss? or is it stored somewhere?
>
> Maybe 

What are your experiences using google cloud platform

2022-01-23 Thread Andrew Davidson
Hi, I recently started using GCP Dataproc Spark.

I seem to have trouble getting big jobs to complete. I am using checkpoints.
I am wondering if maybe I should look for another cloud solution.

Kind regards

Andy


Re: What are the most common operators for shuffle in Spark

2022-01-23 Thread Khalid Mammadov
I don't know the actual implementation, but to me it's still necessary: each
worker reads its data separately and reduces it to get a local distinct;
these local results then need to be shuffled to find the actual (global)
distinct.
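
One way to see the shuffle is to look at the physical plan of a distinct; a
sketch (any small DataFrame works), where the printed plan typically shows a
partial HashAggregate, an Exchange (the shuffle), and a final HashAggregate:

>>> from pyspark.sql.functions import col
>>> df = spark.range(10).withColumn("k", col("id") % 3)
>>> df.select("k").distinct().explain()   # look for an Exchange hashpartitioning step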

On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID,
 wrote:

> Hello,
>
> I know some operators in Spark are expensive because of shuffle.
>
> This document describes shuffle
>
> https://www.educba.com/spark-shuffle/
>
> and says
> More shufflings in numbers are not always bad. Memory constraints and
> other impossibilities can be overcome by shuffling.
>
> In RDD, the below are a few operations and examples of shuffle:
> – subtractByKey
> – groupBy
> – foldByKey
> – reduceByKey
> – aggregateByKey
> – transformations of a join of any type
> – distinct
> – cogroup
> I know some operations like reduceByKey are well known for creating
> shuffle, but what I don't understand is why the distinct operation should
> cause a shuffle!
>
>
> Thanking
>
>
>
>
>
>


Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread Ranadip Chatterjee
Interesting question! I think this goes back to the roots of Spark. You ask
"But suppose if I am reading a file that is distributed across nodes in
partitions. So, what will happen if a partition fails that holds some
data?". Assuming you mean the distributed file system that holds the file
suffers a failure in the node that hosts the said partition - here's
something that might explain this better:

Spark does not come with its own persistent distributed file storage
system. So Spark relies on the underlying file storage system to provide
resilience over input file failure. Commonly, an open source stack will see
HDFS (Hadoop Distributed File System) being the persistent distributed file
system that Spark reads from and writes back to. In that case, your
specific case will likely mean a failure of the datanode of HDFS that was
hosting the partition of data that failed.

Now, if the failure happens after Spark has completely read that partition
and is in the middle of the job, Spark will progress with the job
unhindered because it does not need to go back and re-read the data.
Remember, Spark will save (as RDDs) all intermediate states of the data
between stages and Spark stages continue to save the snippet of DAG from
one RDD to the next. So, Spark can recover from its own node failures in
the intermediate stages by simply rerunning the DAGs from the last saved
RDD to recover to the next stage. The only case where Spark will need to
reach out to HDFS is if the very first Spark stage encounters a failure
before it has created the RDD for that partition.

In that specific edge case, Spark will reach out to HDFS to request the
failed HDFS block. At that point, if HDFS detects that the datanode hosting
that block is not responding, it will transparently redirect Spark to
another replica of the same block. So, the job will progress unhindered in
this case (perhaps a tad slower as the read may no longer be node-local).
Only in the extreme scenarios where HDFS has a catastrophic failure, all
the replicas are offline at that exact moment or the file was saved with
only 1 replica - will the Spark job fail as there is no way for it to
recover the said partitions. (In my rather long time working with Spark, I
have never come across this scenario yet).

Other distributed file systems behave similarly - e.g. Google Cloud Storage
or Amazon S3 will have slightly different nuances but will behave very
similarly to HDFS in this scenario.

So, for all practical purposes, it is safe to say Spark will progress the
job to completion in nearly all practical cases.
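
The lineage Spark keeps for this recomputation can be inspected directly; a
minimal sketch (the path is hypothetical):

>>> rdd = sc.textFile("hdfs:///data/input.txt").map(lambda l: l.upper())
>>> rdd.toDebugString()   # the chain of dependencies Spark can replay after a failure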

Regards,
Ranadip Chatterjee


On Fri, 21 Jan 2022 at 20:40, Sean Owen  wrote:

> Probably, because Spark prefers locality, but not necessarily.
>
> On Fri, Jan 21, 2022 at 2:10 PM Siddhesh Kalgaonkar <
> kalgaonkarsiddh...@gmail.com> wrote:
>
>> Thank you so much for this information, Sean. One more question, that
>> when it wants to re-run the failed partition, where does it run? On the
>> same node or some other node?
>>
>>
>> On Fri, 21 Jan 2022, 23:41 Sean Owen,  wrote:
>>
>>> The Spark program already knows the partitions of the data and where
>>> they exist; that's just defined by the data layout. It doesn't care what
>>> data is inside. It knows partition 1 needs to be processed and if the task
>>> processing it fails, needs to be run again. I'm not sure where you're
>>> seeing data loss here? the data is already stored to begin with, not
>>> somehow consumed and deleted.
>>>
>>> On Fri, Jan 21, 2022 at 12:07 PM Siddhesh Kalgaonkar <
>>> kalgaonkarsiddh...@gmail.com> wrote:
>>>
 Okay, so suppose I have 10 records distributed across 5 nodes and the
 partition of the first node holding 2 records failed. I understand that it
 will re-process this partition but how will it come to know that XYZ
 partition was holding XYZ data so that it will pick again only those
 records and reprocess it? In case of failure of a partition, is there a
 data loss? or is it stored somewhere?

 Maybe my question is very naive but I am trying to understand it in a
 better way.

 On Fri, Jan 21, 2022 at 11:32 PM Sean Owen  wrote:

> In that case, the file exists in parts across machines. No, tasks
> won't re-read the whole file; no task does or can do that. Failed
> partitions are reprocessed, but as in the first pass, the same partition 
> is
> processed.
>
> On Fri, Jan 21, 2022 at 12:00 PM Siddhesh Kalgaonkar <
> kalgaonkarsiddh...@gmail.com> wrote:
>
>> Hello team,
>>
>> I am aware that in case of memory issues when a task fails, it will
>> try to restart 4 times since it is a default number and if it still fails
>> then it will cause the entire job to fail.
>>
>> But suppose if I am reading a file that is distributed across nodes
>> in partitions. So, what will happen if a partition fails that holds some
>> data? Will it re-read the entire file and get that 

What are the most common operators for shuffle in Spark

2022-01-23 Thread ashok34...@yahoo.com.INVALID
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle
https://www.educba.com/spark-shuffle/

and says:
More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling.

In RDD, the below are a few operations and examples of shuffle:
– subtractByKey
– groupBy
– foldByKey
– reduceByKey
– aggregateByKey
– transformations of a join of any type
– distinct
– cogroup
I know some operations like reduceByKey are well known for creating shuffle,
but what I don't understand is why the distinct operation should cause a
shuffle!

Thanking