[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-05 Thread ScrapCodes
Github user ScrapCodes commented on the issue:

https://github.com/apache/spark/pull/22339
  
Thank you @srowen and @steveloughran.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-04 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged to master


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96887/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-03 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96887 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96887/testReport)**
 for PR 22339 at commit 
[`d91c815`](https://github.com/apache/spark/commit/d91c815774bff070bdb3cb149678ff080bc06b45).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96885/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96885 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)**
 for PR 22339 at commit 
[`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96887 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96887/testReport)**
 for PR 22339 at commit 
[`d91c815`](https://github.com/apache/spark/commit/d91c815774bff070bdb3cb149678ff080bc06b45).


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3649/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96886/
Test FAILed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96886 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)**
 for PR 22339 at commit 
[`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96886 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96886/testReport)**
 for PR 22339 at commit 
[`dab9bf3`](https://github.com/apache/spark/commit/dab9bf3771994989e5de2857f91d117dc8b74623).


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3648/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96885 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96885/testReport)**
 for PR 22339 at commit 
[`542872c`](https://github.com/apache/spark/commit/542872cb5459fae1a66ee45aa193986e9a37fb96).


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3647/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96843/
Test FAILed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test FAILed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96843 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96843/testReport)**
 for PR 22339 at commit 
[`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3616/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96843 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96843/testReport)**
 for PR 22339 at commit 
[`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-10-02 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22339
  
Retest this please.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-28 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/22339
  
No, no cost penalties. Slightly lower namenode load too. If you had many, 
many Spark Streaming clients scanning directories, HDFS ops teams would 
eventually get upset. This will postpone that day.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-28 Thread srowen
Github user srowen commented on the issue:

https://github.com/apache/spark/pull/22339
  
Yeah I agree, I was saying I do think it will speed things up. If it's a 
non-trivial win it's worthwhile even if it isn't the last optimization here. Is 
there any downside to this?


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-28 Thread steveloughran
Github user steveloughran commented on the issue:

https://github.com/apache/spark/pull/22339
  
Why the speedups? The cost came from the glob filter calling getFileStatus() 
on every entry, which is 1-3 HTTP requests and a few hundred millis per call, 
when that can instead be handled later. As a result, the more files you have in 
a path, the more time the scan takes, until eventually the scan time exceeds 
the window interval, at which point your code is dead.
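
A toy model of that cost, as a rough sketch only: `CountingStore` and both scan functions are hypothetical names for illustration, not Hadoop or Spark APIs, and each call is counted as a single request even though a real store may need 1-3 per status lookup.

```python
class CountingStore:
    """Hypothetical in-memory stand-in for an object store that counts requests."""

    def __init__(self, mtimes):
        self.mtimes = dict(mtimes)  # path -> modification time
        self.requests = 0

    def list_status(self):
        # One listing call returns every entry together with its metadata.
        self.requests += 1
        return sorted(self.mtimes.items())

    def get_file_status(self, path):
        # One metadata call per path (assumed to cost a single request here).
        self.requests += 1
        return self.mtimes[path]


def scan_with_per_entry_status(store, newer_than):
    # Re-query each listed entry individually: cost grows with the file count.
    paths = [path for path, _ in store.list_status()]
    return [p for p in paths if store.get_file_status(p) > newer_than]


def scan_with_listing_only(store, newer_than):
    # Reuse the metadata already returned by the listing.
    return [p for p, mtime in store.list_status() if mtime > newer_than]


files = {f"dir/file-{i:02d}": i for i in range(50)}

slow = CountingStore(files)
fast = CountingStore(files)
slow_result = scan_with_per_entry_status(slow, newer_than=39)
fast_result = scan_with_listing_only(fast, newer_than=39)

print(slow.requests, fast.requests, slow_result == fast_result)
```

Both scans select the same files; only the request count differs, and the per-entry variant grows linearly with the number of entries in the path.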

The other stuff is simply associated optimisations.

Now, I'm obviously happy with this, especially as I seem to be getting credit. 
And it will help speed up working with any store. But I need to warn people: it 
is not sufficient.

The key problem here is: files uploaded by S3 multipart upload get a 
timestamp of when the upload began, not when it finished, yet they only become 
visible at the end of the upload. If a caller starts an upload in window t, and 
doesn't complete it until window t+1, then it may get missed.
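
To make the failure mode concrete, here is a simplified sketch. It assumes each scan only accepts files whose timestamp falls inside that scan's own window; that is an assumption made for illustration, not a description of the actual file stream selection logic, and `Upload` and `picked_up` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Upload:
    name: str
    started: int    # timestamp the store records for the object
    completed: int  # moment the object becomes visible to listings


def picked_up(upload, window, scans):
    """Return the scan time that selects the upload, or None if it is missed."""
    for scan_time in range(window, window * (scans + 1), window):
        visible = upload.completed <= scan_time
        in_window = scan_time - window < upload.started <= scan_time
        if visible and in_window:
            return scan_time
    return None


quick = Upload("quick", started=3, completed=4)
straddling = Upload("straddling", started=5, completed=12)

print(picked_up(quick, window=10, scans=3))
print(picked_up(straddling, window=10, scans=3))
```

Under these assumptions the upload that starts at 5 and completes at 12 is invisible to the scan at 10 and already too old for the scan at 20, so no scan selects it.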

There's not much that can be done here, except documenting the risk.

What is a good solution? It'd be to use the cloud infrastructure provider's 
own event notification mechanism and subscribe to changes in a store. AWS, 
Azure and GCS all offer something like this.

There's a home for the S3 one of those in aws-kinesis, perhaps.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-28 Thread ScrapCodes
Github user ScrapCodes commented on the issue:

https://github.com/apache/spark/pull/22339
  
Hi @srowen, would you like to take a look? Is there anything I can do, if 
this patch is missing something? I have tested it thoroughly against an object 
store.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-26 Thread ScrapCodes
Github user ScrapCodes commented on the issue:

https://github.com/apache/spark/pull/22339
  
For numbers: while testing against an object store holding 50 files/dirs, 
two batches took 130 REST requests to complete without this patch and 56 REST 
requests with it. So the number of REST calls is reduced, and this translates 
to a speedup. How much depends on the number of files, but for the particular 
test I ran, it was 2x.
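
The arithmetic behind those figures, as a quick sketch. Splitting the totals evenly across the two batches is an assumption; only the totals were reported. The request ratio is also not the same thing as the wall-clock speedup, which was measured separately.

```python
requests_before = 130  # REST requests for 2 batches without the patch
requests_after = 56    # REST requests for 2 batches with the patch
batches = 2

saved = requests_before - requests_after

# Assumes requests are spread evenly across the batches.
print(f"per batch: {requests_before / batches:.0f} -> {requests_after / batches:.0f}")
print(f"requests saved: {saved} ({saved / requests_before:.1%})")
print(f"request ratio: {requests_before / requests_after:.2f}x")
```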


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96047/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96047 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96047/testReport)**
 for PR 22339 at commit 
[`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22339
  
**[Test build #96047 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96047/testReport)**
 for PR 22339 at commit 
[`2fba9af`](https://github.com/apache/spark/commit/2fba9af597349fc023e04a845d1cfacfc3ab7d9e).


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Merged build finished. Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22339
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3093/
Test PASSed.


---




[GitHub] spark issue #22339: [SPARK-17159][STREAM] Significant speed up for running s...

2018-09-13 Thread dongjoon-hyun
Github user dongjoon-hyun commented on the issue:

https://github.com/apache/spark/pull/22339
  
Retest this please.


---
