Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/2087
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19419188
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
val key: K = reader.createKey()
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19387877
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
val key: K = reader.createKey()
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60528666
Hey @sryza this looks good - I tested it locally and it worked. I stumbled
a bit with the test because I was using coalesce() and these metrics don't work
well with coalesce
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19382961
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -224,18 +223,18 @@ class HadoopRDD[K, V](
val key: K = reader.createKey()
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60458950
Ah sorry about that - I'm out until tomorrow morning but I can look then. I
just wanted to test this locally with a few Hadoop versions to check it; this
looks good. In
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60454724
Anything else needed here? Sorry to keep pestering - I have an output
metrics patch that depends on this that I'm eager to post.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60198189
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60198186
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull)
for PR 2087 at commit
[`23010b8`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60193928
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22061/consoleFull)
for PR 2087 at commit
[`23010b8`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60193683
Oops, sorry about that. Posted a new patch.
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60193547
Just had some minor style comments - there were four cases which used the
confusing invocation style but you only changed one of them.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19259861
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
throw new java.util.N
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19259865
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,37 @@ class NewHadoopRDD[K, V](
throw new java.util.N
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19259854
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -244,12 +243,35 @@ class HadoopRDD[K, V](
case eof: EOFException =>
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-60193215
@pwendell any further comments on this?
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59970613
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull)
for PR 2087 at commit
[`74fc9bb`](https://github.com/a
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59970626
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59959931
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21995/consoleFull)
for PR 2087 at commit
[`74fc9bb`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59959228
Small change to make a method I added private
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59892045
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull)
for PR 2087 at commit
[`1ab662d`](https://github.com/a
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59892050
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59891774
Jenkins, retest this please.
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59891129
**[Tests timed
out](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)**
for PR 2087 at commit
[`1ab662d`](https://github.com/apac
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59891135
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59886278
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21976/consoleFull)
for PR 2087 at commit
[`1ab662d`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59885707
Jenkins, retest this please.
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59885695
Cool, updated patch addresses comments. It looks like the failure is caused
by a failure to fetch from git.
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59882564
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59882086
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21971/consoleFull)
for PR 2087 at commit
[`1ab662d`](https://github.com/ap
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r19113109
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupI
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18939723
--- Diff:
core/src/test/scala/org/apache/spark/metrics/InputMetricsSuite.scala ---
@@ -0,0 +1,53 @@
+/*
+ * Licensed to the Apache Software Foundatio
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59318880
Yeah you are totally right - the performance bit was not correct from my
end. I added some more comments on this.
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18939679
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
case eof: EOFException =>
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18939658
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18939560
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18939525
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18833578
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupI
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-59057738
> I.e. the Hadoop RDD should look up the entire function for the computing
thread at the beginning, then it can invoke that function within the hot loop
only.
Comm
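A rough sketch of the pattern quoted above (resolve the bytes-read function once per task, then only invoke the resolved function inside the record loop); `readRecord`, `updateBytesRead`, and the batching interval are placeholders for illustration, not Spark's actual APIs:

```scala
// Resolve the callback before entering the hot loop; inside the loop we only
// invoke the already-resolved function, and only every N records.
def readPartition(
    bytesReadCallback: Option[() => Long],  // resolved once, outside the loop
    readRecord: () => Boolean,              // stands in for reader.next(key, value)
    updateBytesRead: Long => Unit): Unit = {
  var recordsSinceLastUpdate = 0
  while (readRecord()) {
    recordsSinceLastUpdate += 1
    if (recordsSinceLastUpdate == 256) {    // arbitrary batching interval
      bytesReadCallback.foreach(f => updateBytesRead(f()))
      recordsSinceLastUpdate = 0
    }
  }
  bytesReadCallback.foreach(f => updateBytesRead(f())) // final update when the read finishes
}
```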
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18832984
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupI
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18832748
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
case eof: EOFException =>
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18832502
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupI
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18831556
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGroupI
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58988193
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58984415
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/376/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58979294
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58975673
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/367/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/a
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18797454
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-58960161
Hey Sandy, had a couple questions about behavior and assumptions from
Hadoop. A couple of things here. The current approach does a lot of reflection
every time we invoke
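To illustrate the concern about per-invocation reflection (and the usual remedy), a generic sketch of caching the reflective lookup once and reusing the resolved `Method`; this is the general pattern, not the PR's code:

```scala
import java.lang.reflect.Method

// Resolve the Method once; each subsequent call is a plain invoke, not a lookup.
class CachedLongGetter(target: AnyRef, methodName: String) {
  private val method: Method = target.getClass.getMethod(methodName) // reflection happens here, once
  def get(): Long = method.invoke(target).asInstanceOf[Long]         // cheap on the hot path
}
```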
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18796711
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
case eof: EOFException =>
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18796643
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18796436
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -222,12 +221,33 @@ class HadoopRDD[K, V](
case eof: EOFException =>
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18796228
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18795775
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18795734
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
---
@@ -121,6 +125,31 @@ class SparkHadoopUtil extends Logging {
UserGro
Github user pwendell commented on a diff in the pull request:
https://github.com/apache/spark/pull/2087#discussion_r18795643
--- Diff: core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala ---
@@ -147,12 +150,36 @@ class NewHadoopRDD[K, V](
throw new java.util.N
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57425658
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57425652
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57421546
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21091/consoleFull)
for PR 2087 at commit
[`305ad9f`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57421448
Updated patch switches from the pull to push model as requested by
@pwendell and adds a test. I verified that the test succeeds against both
Hadoop 2.2 and Hadoop 2.5 (whi
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57260156
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/21013/consoleFull)
for PR 2087 at commit
[`a5486af`](https://github.com/ap
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57244182
Yeah so I just prefer keeping the TaskMetrics/InputMetrics as simple as
possible rather than having callback registration and other state in them. The
simplest possible
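A simplified contrast of the two shapes being weighed in this exchange (the "pull to push" switch mentioned earlier in the thread); neither class is Spark's actual InputMetrics, they only illustrate "callback and state inside the metrics object" versus "plain field that the reading code pushes into":

```scala
// "Pull" style: the metrics object owns a callback and computes the value on demand.
class PullStyleInputMetrics(bytesReadCallback: () => Long) {
  def bytesRead: Long = bytesReadCallback()
}

// "Push" style: the metrics object is plain data; the RDD's read loop updates it.
class PushStyleInputMetrics {
  @volatile var bytesRead: Long = 0L
}
```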
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57236118
> The current approach couples the updating of this metric with the
heartbeats in a way that seems strange.
The heartbeats (and task completion, which, my bad, I ne
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-57229133
Hey @sryza so it seems like there are two things going on here. One is
adding incremental update and the other is changing the way we deal with
tracking read bytes for H
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-56461580
MapReduce doesn't use getPos, but it does look like it might be helpful in
some situations. One caveat is that pos only means # bytes for file input
formats. For example,
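For reference, a sketch of what a getPos()-based approach for older Hadoop versions might look like, with the caveat from this comment baked in (the position is only a byte offset for file-based input formats); the names and split arithmetic are illustrative:

```scala
import org.apache.hadoop.mapred.{FileSplit, RecordReader}

// Approximate bytes consumed so far within a file split: current file offset
// minus the split's start. Only meaningful when getPos() really is a byte
// offset, i.e. for file-based input formats.
def bytesReadInSplit[K, V](reader: RecordReader[K, V], split: FileSplit): Long = {
  math.max(0L, reader.getPos() - split.getStart())
}
```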
Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-56307027
@aarondav @sryza Did you consider using reader.getPos() to get the correct
metrics for older versions of Hadoop (as in here:
https://github.com/kayousterhout/spark-
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-55358462
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull)
for PR 2087 at commit
[`8bfaa24`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-55355492
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20199/consoleFull)
for PR 2087 at commit
[`8bfaa24`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-55355339
Updated patch includes fallback to the split size
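A hedged sketch of what "fallback to the split size" could look like: use the accurate per-thread statistics when a callback is available, otherwise report the split length as an estimate (names are illustrative, not the patch's exact code):

```scala
import org.apache.hadoop.mapred.InputSplit

// Prefer the accurate thread-level FileSystem statistics; if they are unavailable
// (pre-2.5 Hadoop), fall back to the split length, which over-reports for tasks
// that do not read their whole split.
def reportedBytesRead(bytesReadCallback: Option[() => Long], split: InputSplit): Long = {
  bytesReadCallback match {
    case Some(getBytesRead) => getBytesRead()    // accurate, from FS thread statistics
    case None               => split.getLength() // estimate: assume the whole split was read
  }
}
```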
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54915960
I think we need some indication of the bytes being read from Hadoop. If
this is our only current mechanism, then I think removing the code is not worth
the behavioral re
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54867619
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull)
for PR 2087 at commit
[`0034292`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54859158
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19983/consoleFull)
for PR 2087 at commit
[`0034292`](https://github.com/ap
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54852759
Just to make sure it's clear, the issue isn't only that we can be a few
bytes off when we're reading outside of split boundaries, but that it'll look
like we read the full
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54851789
retest this please
Github user aarondav commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54807921
FWIW, I think mostly-accurate metrics are much better than no metrics in
this case. The read/write bytes are very useful from Hadoop FSes, and Hadoop
<2.5 is still very
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54780033
It looks like all the core tests are passing, but there are some failures
in streaming and SQL tests. Have those been showing up elsewhere?
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54698426
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull)
for PR 2087 at commit
[`0034292`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54695093
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19865/consoleFull)
for PR 2087 at commit
[`0034292`](https://github.com/ap
Github user pwendell commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54563520
Jenkins, test this please.
Github user andrewor14 commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54563140
Hm, test this please
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54521510
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19793/consoleFull)
for PR 2087 at commit
[`0034292`](https://github.com/ap
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-54519238
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19790/consoleFull)
for PR 2087 at commit
[`0a743c0`](https://github.com/ap
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-53005751
[QA tests have
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19069/consoleFull)
for PR 2087 at commit
[`32daf1f`](https://github.com/a
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/2087#issuecomment-52998163
[QA tests have
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/19069/consoleFull)
for PR 2087 at commit
[`32daf1f`](https://github.com/ap
GitHub user sryza opened a pull request:
https://github.com/apache/spark/pull/2087
SPARK-2621. Update task InputMetrics incrementally
The patch takes advantage of an API provided in Hadoop 2.5 that allows getting
accurate data on Hadoop FileSystem bytes read. It eliminates the old met
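A minimal sketch of the kind of lookup this approach relies on, assuming Hadoop's `FileSystem.getAllStatistics` plus the Hadoop 2.5+ `getThreadStatistics`/`getBytesRead` methods accessed reflectively; the helper name and error handling are illustrative, not the exact code merged in this PR:

```scala
import org.apache.hadoop.fs.FileSystem
import scala.collection.JavaConverters._

// Hypothetical helper: returns a thunk reporting bytes read by the current thread
// across all registered FileSystem statistics, or None on pre-2.5 Hadoop where the
// thread-level statistics methods do not exist.
def bytesReadOnThreadCallback(): Option[() => Long] = {
  try {
    // getThreadStatistics is only present in Hadoop 2.5+, so look it up reflectively.
    val threadStats = FileSystem.getAllStatistics.asScala.map { stats =>
      stats.getClass.getMethod("getThreadStatistics").invoke(stats)
    }
    val getBytesRead = (data: AnyRef) =>
      data.getClass.getMethod("getBytesRead").invoke(data).asInstanceOf[Long]
    Some(() => threadStats.map(getBytesRead).sum)
  } catch {
    case _: NoSuchMethodException => None // older Hadoop: caller falls back, e.g. to split size
  }
}
```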