Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Sean Owen
You certainly shouldn't just sprinkle them in, no, that has never been the idea here. It can help in some cases, but is just overhead in others. Be thoughtful about why you are adding these statements.
On Wed, Apr 27, 2022 at 11:16 AM Koert Kuipers wrote:
> we have quite a few persists

Re: When should we cache / persist ? After or Before Actions?

2022-04-27 Thread Koert Kuipers
we have quite a few persist statements in our codebase whenever we are reusing a dataframe. we noticed that it slows things down quite a bit (sometimes doubles the runtime), while providing little benefit, since spark already re-uses the shuffle files underlying the dataframe efficiently even if

Re: Dealing with large number of small files

2022-04-27 Thread Sid
Yes, it created a list of records separated by commas, and it was created faster as well.
On Wed, 27 Apr 2022, 13:42 Gourav Sengupta wrote:
> Hi,
> did that result in valid JSON in the output file?
>
> Regards,
> Gourav Sengupta
>
> On Tue, Apr 26, 2022 at 8:18 PM Sid wrote:
>
>> I have .txt

[window aggregate][debug] Rows not dropping with watermark and window

2022-04-27 Thread Xavier Gervilla
Hi team, With your help last week I was able to adapt a project I'm developing and apply a sentiment analysis and NER retrieval to streaming tweets. One of the next steps in order to ensure that memory doesn't collapse is applying windows and watermarks to discard tweets after some time.
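The watermark idea mentioned in this message can be illustrated without Spark. What follows is only a minimal pure-Python sketch, not Spark's implementation: `WatermarkFilter`, `delay_seconds`, and the sample timestamps are hypothetical names and values made up for this example. It tracks the newest event time seen so far and drops any record whose event time falls behind that maximum by more than the allowed delay. As a simplification, the watermark here advances on every record, whereas Structured Streaming updates it between micro-batches.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WatermarkFilter:
    """Hypothetical helper: rejects records older than (newest event time - delay)."""

    delay_seconds: float
    max_event_time: Optional[float] = None
    accepted: List[float] = field(default_factory=list)

    def offer(self, event_time: float) -> bool:
        # Advance the high-water mark if this record is the newest seen so far.
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        # Anything older than the watermark is considered too late and is dropped.
        watermark = self.max_event_time - self.delay_seconds
        if event_time < watermark:
            return False
        self.accepted.append(event_time)
        return True


def run_demo() -> List[bool]:
    # Event times in seconds; 109 arrives after 120 has pushed the watermark to 110.
    wf = WatermarkFilter(delay_seconds=10)
    return [wf.offer(t) for t in [100, 105, 120, 112, 109, 121]]


if __name__ == "__main__":
    print(run_demo())
```

In this sketch only state older than the watermark becomes eligible for cleanup, which is why rows may appear not to drop until newer event times have actually arrived.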

Re: Dealing with large number of small files

2022-04-27 Thread Gourav Sengupta
Hi,
did that result in valid JSON in the output file?
Regards,
Gourav Sengupta
On Tue, Apr 26, 2022 at 8:18 PM Sid wrote:
> I have .txt files with JSON inside it. It is generated by some API calls
> by the Client.
>
> On Wed, Apr 27, 2022 at 12:39 AM Bjørn Jørgensen
> wrote:
>
>> What is that