[DISCUSS] Time to evaluate "continuous mode" in SS?

2020-09-14 Thread Jungtaek Lim
Hi devs, It was Spark 2.3 in Feb 2018 which introduced continuous mode in Structured Streaming as "experimental". Now we are here at 2.5 years after its release - I feel it would be a good time to evaluate the mode, whether the mode has been widely used or not, and the mode has been making

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread kalyan
+1 Will positively improve the performance and reliability of Spark. Looking forward to this. Regards, Kalyan. On Tue, Sep 15, 2020, 9:26 AM Joseph Torres wrote: > +1 > > On Mon, Sep 14, 2020 at 6:39 PM angers.zhu wrote: > >> +1 >> >> angers.zhu >> angers@gmail.com >> >>

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Joseph Torres
+1 On Mon, Sep 14, 2020 at 6:39 PM angers.zhu wrote: > +1 > > angers.zhu > angers@gmail.com > >

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Chang Chen
I see. In our case, we use SingleBufferInputStream, so the time is spent duplicating the backing byte buffer. Thanks, Chang. On Tue, Sep 15, 2020 at 2:04 AM, Ryan Blue wrote: > Before, the input was a byte array so we could read from it directly. Now, > the input is a `ByteBufferInputStream` so that Parquet can

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread angers . zhu
+1

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xiao Li
+1 Xiao. On Mon, Sep 14, 2020 at 4:09 PM, DB Tsai wrote: > +1 > > On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh wrote: > >> +1 >> >> Chandni >> >> On Mon, Sep 14, 2020 at 11:41 AM Tom Graves >> wrote: >> >>> +1 >>> >>> Tom >>> >>> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan < >>>

Re: How to clear spark Shuffle files

2020-09-14 Thread lsn248
Our use case is as follows: we repartition 6 months' worth of data for each client on clientId and recordcreationdate, so that it can write one file per partition. Our partition is on client and recordcreationdate. The job fills up the disk after it processes, say, 30 tenants out of 50. I am

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread DB Tsai
+1 On Mon, Sep 14, 2020 at 12:30 PM Chandni Singh wrote: > +1 > > Chandni > > On Mon, Sep 14, 2020 at 11:41 AM Tom Graves > wrote: > >> +1 >> >> Tom >> >> On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan < >> mri...@gmail.com> wrote: >> >> >> Hi, >> >> I'd like to call for a

Re: How to clear spark Shuffle files

2020-09-14 Thread Holden Karau
There's a second new mechanism which uses TTL for cleanup of shuffle files. Can you share more about your use case? On Mon, Sep 14, 2020 at 1:33 PM Edward Mitchell wrote: > We've also had some similar disk fill issues. > > For Java/Scala RDDs, shuffle file cleanup is done as part of the JVM >

Re: How to clear spark Shuffle files

2020-09-14 Thread Edward Mitchell
We've also had some similar disk fill issues. For Java/Scala RDDs, shuffle file cleanup is done as part of the JVM garbage collection. I've noticed that if RDDs maintain references in the code and cannot be garbage collected, then intermediate shuffle files hang around. Best way to handle this is

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Chandni Singh
+1 Chandni On Mon, Sep 14, 2020 at 11:41 AM Tom Graves wrote: > +1 > > Tom > > On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan < > mri...@gmail.com> wrote: > > > Hi, > > I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based > shuffle to improve shuffle

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Tom Graves
+1 Tom On Sunday, September 13, 2020, 10:00:05 PM CDT, Mridul Muralidharan wrote: Hi, I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based shuffle to improve shuffle efficiency. Please take a look at: - SPIP jira: https://issues.apache.org/jira/browse/SPARK-30602

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Venkatakrishnan Sowrirajan
+1. Interesting indeed :) Regards Venkata krishnan On Mon, Sep 14, 2020 at 11:14 AM Xingbo Jiang wrote: > +1 This is an exciting new feature! > > On Sun, Sep 13, 2020 at 8:00 PM Mridul Muralidharan > wrote: > >> Hi, >> >> I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based

Re: [VOTE][SPARK-30602] SPIP: Support push-based shuffle to improve shuffle efficiency

2020-09-14 Thread Xingbo Jiang
+1 This is an exciting new feature! On Sun, Sep 13, 2020 at 8:00 PM Mridul Muralidharan wrote: > Hi, > > I'd like to call for a vote on SPARK-30602 - SPIP: Support push-based > shuffle to improve shuffle efficiency. > Please take a look at: > >- SPIP jira:

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Ryan Blue
Before, the input was a byte array so we could read from it directly. Now, the input is a `ByteBufferInputStream` so that Parquet can choose how to allocate buffers. For example, we use vectored reads from S3 that pull back multiple buffers in parallel. Now that the input is a stream based on

How to clear spark Shuffle files

2020-09-14 Thread lsn248
Hi, I have a long-running application and Spark seems to fill up the disk with shuffle files. Eventually the job fails, running out of disk space. Is there a way for me to clean up the shuffle files? Thanks

Re: Performance of VectorizedRleValuesReader

2020-09-14 Thread Sean Owen
Ryan, do you happen to have any opinion there? That particular section was introduced in the Parquet 1.10 update: https://github.com/apache/spark/commit/cac9b1dea1bb44fa42abf77829c05bf93f70cf20 It looks like it didn't use to make a ByteBuffer each time, but read from `in`. On Sun, Sep 13, 2020 at