may I need a join here?

2022-01-23 Thread Bitfox
>>> df.show(3)
+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+
only showing top 3 rows
>>> df2.show(3)
+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+
only showing
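One way to read the question is: drop every row of df whose word also appears in the stopword column of df2. That is what a left anti join computes. Below is a minimal plain-Python sketch of the idea, not the poster's code. The rows mirror the output shown above, except for one hypothetical row ("about", 1) added to the word list so that the filter has something to remove; the helper name left_anti_join is also made up for illustration.

```python
# Rows mirror the df.show(3) / df2.show(3) output above.
# The ("about", 1) row in `words` is hypothetical, added so the join removes something.
words = [("on", 1), ("dec", 1), ("2020", 1), ("about", 1)]
stopwords = [("able", 1), ("about", 1), ("above", 1)]


def left_anti_join(left, right):
    """Keep the left rows whose key (first field) has no match in right."""
    right_keys = {key for key, _ in right}
    return [row for row in left if row[0] not in right_keys]


filtered = left_anti_join(words, stopwords)
print(filtered)
```

In PySpark the same result should be obtainable with df.join(df2, df.word == df2.stopword, "left_anti"), which keeps only the rows of df that have no matching stopword.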

Question about ports in spark

2022-01-23 Thread Bitfox
Hello, When Spark started on my home server, I saw that two ports were open: 8080 for the master and 8081 for the worker. If I keep these two ports open without any network filter, does that pose a security risk? Thanks

Re: What are your experiences using google cloud platform

2022-01-23 Thread German Schiavon
Hi, Changing cloud providers won't help if your job is slow, has skew, etc. I think first you have to see why "big jobs" are not completing. On Sun, 23 Jan 2022 at 22:18, Andrew Davidson wrote: > Hi recently started using GCP dataproc spark. > Seem to have trouble getting big jobs to

Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread ayan guha
Just a couple of points to add: 1. A "partition" is more of a logical construct, so partitions cannot fail. A task that is reading from persistent storage into an RDD can fail, and can then be rerun to reprocess the partition. What Ranadip mentioned above is true, with a caveat that data will be
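The point that a failed task can simply be rerun against the same partition can be sketched in plain Python. This is only an illustration of the idea, not Spark's scheduler: the STORAGE dict stands in for persistent storage, flaky_read and run_with_retries are hypothetical names, and the attempt limit is an arbitrary illustrative value.

```python
# Stand-in for persistent storage: partition id -> records (hypothetical data).
STORAGE = {0: ["x", "y"], 1: ["z"]}


def flaky_read(partition_id, attempt):
    """Read one partition; the first attempt on partition 1 fails on purpose."""
    if partition_id == 1 and attempt == 1:
        raise IOError("simulated task failure")
    return list(STORAGE[partition_id])


def run_with_retries(read_partition, partition_id, max_attempts=4):
    """Rerun the task for one partition until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return read_partition(partition_id, attempt)
        except IOError as err:
            last_error = err  # the data is still in storage, so just try again
    raise RuntimeError(
        f"partition {partition_id} failed after {max_attempts} attempts"
    ) from last_error


loaded = [run_with_retries(flaky_read, pid) for pid in sorted(STORAGE)]
print(loaded)
```

The partition itself never "fails" here: only the attempt to read it does, and the rerun reads the same records again from storage.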

What are your experiences using google cloud platform

2022-01-23 Thread Andrew Davidson
Hi, I recently started using GCP Dataproc Spark. I seem to have trouble getting big jobs to complete. I am using checkpoints. I am wondering if maybe I should look for another cloud solution. Kind regards, Andy

Re: What are the most common operators for shuffle in Spark

2022-01-23 Thread Khalid Mammadov
I don't know the actual implementation, but to me it's still necessary: each worker reads data separately and reduces it to get a local distinct, and these local results then need to be shuffled to find the actual distinct. On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID, wrote: > Hello, > > I know some
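The reasoning in this reply can be sketched in plain Python: deduplicating inside each partition is not enough, because the same value may survive in several partitions, so the local results still have to be redistributed by key before the final answer is known. This is a toy model under stated assumptions, not Spark's implementation; the function names and the sample partitions are hypothetical.

```python
from collections import defaultdict


def local_distinct(partition):
    """Map side: each worker deduplicates only the records it read."""
    return set(partition)


def shuffle(local_sets, num_reducers):
    """Route every value to a reducer chosen by its hash, so equal values meet."""
    buckets = defaultdict(set)
    for values in local_sets:
        for value in values:
            buckets[hash(value) % num_reducers].add(value)
    return buckets


def distributed_distinct(partitions, num_reducers=2):
    """Local distinct per partition, then a shuffle, then a union of the reducers."""
    local_sets = [local_distinct(p) for p in partitions]
    buckets = shuffle(local_sets, num_reducers)
    return set().union(*buckets.values())


# Hypothetical input: three partitions read by three workers.
partitions = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
result = distributed_distinct(partitions)
print(sorted(result))
```

With this input the three local results hold six values in total, but only four of them are globally distinct, which is why the shuffle step cannot be skipped.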

Re: What happens when a partition that holds data under a task fails

2022-01-23 Thread Ranadip Chatterjee
Interesting question! I think this goes back to the roots of Spark. You ask "But suppose if I am reading a file that is distributed across nodes in partitions. So, what will happen if a partition fails that holds some data?". Assuming you mean the distributed file system that holds the file

What are the most common operators for shuffle in Spark

2022-01-23 Thread ashok34...@yahoo.com.INVALID
Hello, I know some operators in Spark are expensive because of shuffle. This document describes shuffle https://www.educba.com/spark-shuffle/ and says: More shufflings in numbers are not always bad. Memory constraints and other impossibilities can be overcome by shuffling. In RDD, the below are a