>>> df.show(3)
+----+-----+
|word|count|
+----+-----+
|  on|    1|
| dec|    1|
|2020|    1|
+----+-----+
only showing top 3 rows
>>> df2.show(3)
+--------+-----+
|stopword|count|
+--------+-----+
|    able|    1|
|   about|    1|
|   above|    1|
+--------+-----+
only showing top 3 rows
Hello
When Spark started on my home server, I saw that two ports were open:
8080 for the master, 8081 for the worker.
If I keep these two ports open without any network filter, are there
security issues?
Thanks
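One common answer to the question above is to restrict those ports at the host firewall rather than leave them world-reachable, since the standalone web UIs expose job, environment, and executor details. Below is a minimal sketch assuming Ubuntu with `ufw` and a standalone deployment; the `192.168.1.0/24` subnet is an illustrative placeholder for your own trusted network, and 7077 is the default master RPC port.

```shell
# Hedged sketch: allow the Spark standalone web UIs (8080 master, 8081 worker)
# and the master RPC port (7077 by default) only from the local LAN.
# Replace 192.168.1.0/24 with your actual trusted subnet.
sudo ufw default deny incoming
sudo ufw allow from 192.168.1.0/24 to any port 8080 proto tcp   # master web UI
sudo ufw allow from 192.168.1.0/24 to any port 8081 proto tcp   # worker web UI
sudo ufw allow from 192.168.1.0/24 to any port 7077 proto tcp   # master RPC
sudo ufw enable
```

The UI ports themselves can also be moved via `SPARK_MASTER_WEBUI_PORT` and `SPARK_WORKER_WEBUI_PORT` in `spark-env.sh`, but changing the port number is not a substitute for filtering who can reach it.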
Hi,
Changing cloud providers won't help if your job is slow, has skew, etc. I
think you first have to see why the "big jobs" are not completing.
On Sun, 23 Jan 2022 at 22:18, Andrew Davidson
wrote:
> Hi, recently started using GCP Dataproc Spark.
>
> Seem to have trouble getting big jobs to complete.
Just a couple of points to add:
1. "partition" is more of a logical construct, so partitions cannot fail. A
task that is reading from persistent storage into an RDD can fail, and thus can
be rerun to reprocess the partition. What Ranadip mentioned above is
true, with a caveat that data will be actual
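The point above can be illustrated with a toy sketch in plain Python. This is not Spark's actual implementation; it just shows the idea that a partition is a recomputable slice of the source data, so when a task fails, the scheduler reruns it and re-reads the same partition. The function names and the failure simulation are all hypothetical.

```python
# Toy sketch (not Spark internals): a "partition" is just a lazily
# recomputable slice of the source data, so a failed task can simply
# be rerun; the partition itself never "fails".
def read_partition(source, index, num_partitions):
    """Re-reads partition `index` from persistent storage (here, a list)."""
    return [x for i, x in enumerate(source) if i % num_partitions == index]

def run_task(source, index, num_partitions, transform, fail_once=None):
    """Runs `transform` over one partition, rerunning after a simulated failure."""
    if fail_once is None:
        fail_once = [True]
    try:
        if fail_once[0]:
            fail_once[0] = False
            raise IOError("simulated task failure")
        return [transform(x) for x in read_partition(source, index, num_partitions)]
    except IOError:
        # The scheduler retries the task: re-read the partition and reprocess.
        return [transform(x) for x in read_partition(source, index, num_partitions)]

source = list(range(10))
print(run_task(source, 0, 2, lambda x: x * x))  # [0, 4, 16, 36, 64]
```

Checkpointing (mentioned later in this thread) exists precisely because rerunning from the original source can be expensive when the lineage is long.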
Hi, I recently started using GCP Dataproc Spark.
I seem to have trouble getting big jobs to complete. I am using checkpoints. I
am wondering if maybe I should look for another cloud solution.
Kind regards
Andy
I don't know the actual implementation, but to me it's still necessary:
each worker reads data separately and reduces to get a local distinct; these
will then need to be shuffled to find the actual distinct values.
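The two-phase scheme described above can be sketched in plain Python. This is a hedged toy model, not Spark's implementation of `distinct()`: each "worker" deduplicates its own partition, then values are routed by hash so that copies of the same value from different workers land on the same reducer and the cross-worker duplicates can be removed.

```python
# Hedged sketch (not Spark code) of why distinct() needs a shuffle:
# local dedup alone cannot see duplicates that live on *different* workers.
def distinct(partitions, num_reducers=2):
    # Map side: local distinct per partition (cheap, no network).
    local = [set(p) for p in partitions]
    # Shuffle: hash-partition values so equal values meet on one reducer.
    reducers = [set() for _ in range(num_reducers)]
    for part in local:
        for v in part:
            reducers[hash(v) % num_reducers].add(v)
    # Reduce side: each reducer's set is duplicate-free; their union is
    # the global distinct.
    return sorted(v for r in reducers for v in r)

parts = [[1, 2, 2, 3], [3, 4, 4], [1, 5]]
print(distinct(parts))  # [1, 2, 3, 4, 5]
```

Note how the value 3 appears on two workers: only the shuffle step can merge those two copies into one.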
On Sun, 23 Jan 2022, 17:39 ashok34...@yahoo.com.INVALID,
wrote:
> Hello,
>
> I know some operat
Interesting question! I think this goes back to the roots of Spark. You ask
"But suppose if I am reading a file that is distributed across nodes in
partitions. So, what will happen if a partition fails that holds some
data?". Assuming you mean the distributed file system that holds the file
suffers
Hello,
I know some operators in Spark are expensive because of shuffle.
This document describes shuffle:
https://www.educba.com/spark-shuffle/
and says: "More shufflings in numbers are not always bad. Memory constraints and
other impossibilities can be overcome by shuffling."
In RDD, the below are a
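A toy model can make the cost difference between shuffle-heavy operators concrete. The sketch below is plain Python, not Spark code, and the function names are illustrative: it counts how many records cross the "network" for a word count done groupByKey-style (ship every record) versus reduceByKey-style (pre-aggregate on the map side, ship one record per key per partition).

```python
from collections import Counter, defaultdict

# Hedged toy model (not Spark internals): compare shuffle volume for a
# word count with and without map-side combining.
def shuffled_records(partitions, combine):
    shuffled = 0
    merged = defaultdict(int)
    for part in partitions:
        if combine:
            # reduceByKey-style: pre-aggregate locally, then ship one
            # (word, count) record per distinct key in this partition.
            local = Counter(part)
            shuffled += len(local)
            for word, n in local.items():
                merged[word] += n
        else:
            # groupByKey-style: ship every record across the shuffle as-is.
            shuffled += len(part)
            for word in part:
                merged[word] += 1
    return shuffled, dict(merged)

parts = [["on", "dec", "on", "on"], ["dec", "2020", "on"]]
print(shuffled_records(parts, combine=False))  # 7 records shuffled
print(shuffled_records(parts, combine=True))   # 5 records shuffled
```

Both variants produce the same counts; the combining variant just moves fewer records, which is the usual reason reduceByKey-style operators are preferred over groupByKey when the downstream aggregation allows it.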