>>> df.show(3)
+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+
only showing top 3 rows
>>> df2.show(3)
+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+
only showing top 3 rows
Hello
When Spark started on my home server, I saw that two ports were open:
8080 for the master, 8081 for the worker.
If I keep these two ports open without any network filter, are there
security issues?
Thanks
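One common answer to the question above is to restrict those ports at the host firewall rather than leave them world-reachable, since the standalone web UIs expose job, environment, and executor details. Below is a minimal sketch assuming Ubuntu with `ufw` and a standalone deployment; the `192.168.1.0/24` subnet is an illustrative placeholder for your own trusted network, and 7077 is the default master RPC port.

```shell
# Hedged sketch: allow the Spark standalone web UIs (8080 master, 8081 worker)
# and the master RPC port (7077 by default) only from the local LAN.
# Replace 192.168.1.0/24 with your actual trusted subnet.
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp   # master web UI
sudo ufw allow from 192.168.1.0/24 to any port 8081 proto tcp   # worker web UI
sudo ufw allow from 192.168.1.0/24 to any port 7077 proto tcp   # master RPC
sudo ufw enable
```

The UI ports themselves can also be moved via `SPARK_MASTER_WEBUI_PORT` and `SPARK_WORKER_WEBUI_PORT` in `spark-env.sh`, but changing the port number is not a substitute for filtering who can reach it.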
Hi,
Changing cloud providers won't help if your job is slow, has skew, etc. I
think you first have to see why the "big jobs" are not completing.
On Sun, 23 Jan 2022 at 22:18, Andrew Davidson
wrote:
> Hi, recently started using GCP Dataproc Spark.
>
> Seem to have trouble getting big jobs to complete.
Just a couple of points to add:
1. "partition" is more of a logical construct, so partitions cannot fail. A
task that is reading from persistent storage into an RDD can fail, and thus can
be rerun to reprocess the partition. What Ranadip mentioned above is
true, with a caveat that data will be actual
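The point above can be illustrated with a toy sketch in plain Python. This is not Spark's actual implementation; it just shows the idea that a partition is a recomputable slice of the source data, so when a task fails, the scheduler reruns it and re-reads the same partition. The function names and the failure simulation are all hypothetical.

```python
# Toy sketch (not Spark internals): a "partition" is just a lazily
# recomputable slice of the source data, so a failed task can simply
# be rerun; the partition itself never "fails".
def read_partition(source, index, num_partitions):
    """Re-reads partition `index` from persistent storage (here, a list)."""
    return [x for i, x in enumerate(source) if i % num_partitions == index]

def run_task(source, index, num_partitions, transform, fail_once=None):
    """Runs `transform` over one partition, rerunning after a simulated failure."""
    if fail_once is None:
        fail_once = [True]
    try:
        if fail_once[0]:
            fail_once[0] = False
            raise IOError("simulated task failure")
        return [transform(x) for x in read_partition(source, index, num_partitions)]
    except IOError:
        # The scheduler retries the task: re-read the partition and reprocess.
        return [transform(x) for x in read_partition(source, index, num_partitions)]

source = list(range(10))
print(run_task(source, 0, 2, lambda x: x * x))  # [0, 4, 16, 36, 64]
```

Checkpointing (mentioned later in this thread) exists precisely because rerunning from the original source can be expensive when the lineage is long.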
Hi, I recently started using GCP Dataproc Spark.
I seem to have trouble getting big jobs to complete. I am using checkpoints. I
am wondering if maybe I should look for another cloud solution.
Kind regards
Andy
I don't know the actual implementation, but to me it's still necessary:
each worker reads data separately and reduces to get a local distinct; these
will then need to be shuffled to find the actual distinct values.
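The two-phase scheme described above can be sketched in plain Python. This is a hedged toy model, not Spark's implementation of `distinct()`: each "worker" deduplicates its own partition, then values are routed by hash so that copies of the same value from different workers land on the same reducer and the cross-worker duplicates can be removed.

```python
# Hedged sketch (not Spark code) of why distinct() needs a shuffle:
# local dedup alone cannot see duplicates that live on *different* workers.
def distinct(partitions, num_reducers=2):
    # Map side: local distinct per partition (cheap, no network).
    local = [set(p) for p in partitions]
    # Shuffle: hash-partition values so equal values meet on one reducer.
    reducers = [set() for _ in range(num_reducers)]
    for part in local:
        for v in part:
            reducers[hash(v) % num_reducers].add(v)
    # Reduce side: each reducer's set is duplicate-free; their union is
    # the global distinct.
    return sorted(v for r in reducers for v in r)

parts = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
print(distinct(parts))  # [1, 2, 3, 4, 5]
```

Note how the value 3 appears on two workers: only the shuffle step can merge those two copies into one.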
On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID,
wrote:
> Hello,
>
> I know some operat
Interesting question! I think this goes back to the roots of Spark. You ask
"But suppose if I am reading a file that is distributed across nodes in
partitions. So, what will happen if a partition fails that holds some
data?". Assuming you mean the distributed file system that holds the file
suffers
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle:
https://www.educba.com/spark-shuffle/
and says: "More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling."
In RDD, the below are a
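A toy model can make the cost difference between shuffle-heavy operators concrete. The sketch below is plain Python, not Spark code, and the function names are illustrative: it counts how many records cross the "network" for a word count done groupByKey-style (ship every record) versus reduceByKey-style (pre-aggregate on the map side, ship one record per key per partition).

```python
from collections import Counter, defaultdict

# Hedged toy model (not Spark internals): compare shuffle volume for a
# word count with and without map-side combining.
def shuffled_records(partitions, combine):
    shuffled = 0
    merged = defaultdict(int)
    for part in partitions:
        if combine:
            # reduceByKey-style: pre-aggregate locally, then ship one
            # (word, count) record per distinct key in this partition.
            local = Counter(part)
            shuffled += len(local)
            for word, n in local.items():
                merged[word] += n
        else:
            # groupByKey-style: ship every record across the shuffle as-is.
            shuffled += len(part)
            for word in part:
                merged[word] += 1
    return shuffled, dict(merged)

parts = [["on", "dec", "on", "on"], ["dec", "2020", "on"]]
print(shuffled_records(parts, combine=False))  # 7 records shuffled
print(shuffled_records(parts, combine=True))   # 5 records shuffled
```

Both variants produce the same counts; the combining variant just moves fewer records, which is the usual reason reduceByKey-style operators are preferred over groupByKey when the downstream aggregation allows it.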