Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
OK, the weight has been removed from the calculation.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
The bucketing is trying to bucket into buckets of equal P(x). It's a
condition on P(y | x). That said, the right split point isn't knowable from the
training data, and splitting to balance P(x) on either
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
By the way, it's safe to use the mean value, as it matches the other libraries.
If requested, I'd be happy to modify the PR.
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
For a (training) sample from a continuous series, say {x0, x1, x2, x3, ..., x100},
Spark currently selects quantiles as the split points.
Suppose 10-quantiles are used, x2 is the 1st quantile, and x10 is the 2nd
quantile
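As a rough sketch of the quantile-based candidate splits described above (this is not Spark's actual API, just a NumPy illustration of the 10-quantiles idea):

```python
import numpy as np

# Hypothetical illustration: pick the 10%, 20%, ..., 90% points of the
# empirical distribution as candidate split thresholds, in the spirit of
# the 10-quantiles example above.
x = np.arange(101, dtype=float)   # {x0, ..., x100} = {0.0, ..., 100.0}

candidate_splits = np.percentile(x, np.arange(10, 100, 10))
print(candidate_splits)  # -> [10. 20. ... 90.]
```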
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
Ah OK I should think about this more first. Say you have a continuous
predictor x and binary output y. Say the optimal split is found to be between
0.1 and 0.2, with 1 observation of 0.1 and 99 of 0.
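To make the two candidate thresholds concrete for the numbers in this example (1 observation at 0.1 and 99 at 0.2; the variable names here are mine):

```python
# Hypothetical numeric check of the example above: two adjacent feature
# values with very different observation counts.
values = [0.1, 0.2]
counts = [1, 99]

# Plain midpoint of the two adjacent values: (x + y) / 2
midpoint = (values[0] + values[1]) / 2

# Count-weighted mean: (c_x * x + c_y * y) / (c_x + c_y)
weighted = (counts[0] * values[0] + counts[1] * values[1]) / sum(counts)

# The weighted threshold is pulled toward the value with more observations.
print(round(midpoint, 3), round(weighted, 3))  # 0.15 0.199
```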
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
@sethah what's the issue there ... train/test ought to be from the same
distribution, in theory. The empirical distribution of the test data will of
course be a little different, but what is the issue?
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/17556
I don't mind the weighted midpoints. However, if for a continuous feature
we find that many points have the exact same value, we are assuming we may find
data points in the test set that are close to
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3677 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3677/testReport)**
for PR 17556 at commit
[`031c61a`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3677 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3677/testReport)**
for PR 17556 at commit
[`031c61a`](https://github.com/apache/spark/commit/0
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
Fixed the failing case; please retest it.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3673 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3673/testReport)**
for PR 17556 at commit
[`19eab3a`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3673 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3673/testReport)**
for PR 17556 at commit
[`19eab3a`](https://github.com/apache/spark/commit/1
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
I scanned the split criteria of sklearn and xgboost.
1. sklearn
counts all continuous values and splits at the mean value.
commit 5147fd09c6a063188efde444f47bd006fa5f95f0
sk
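sklearn's actual split search lives in its Cython tree code; as a rough pure-Python sketch of the "split at the mean of adjacent values" rule described above (the function name is mine, not sklearn's):

```python
# Hypothetical sketch: between each pair of adjacent sorted unique feature
# values, the candidate threshold is their plain mean (x_i + x_{i+1}) / 2.
def midpoint_thresholds(values):
    uniq = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]

print(midpoint_thresholds([1.0, 1.0, 2.0, 4.0]))  # [1.5, 3.0]
```

Note that duplicate values collapse to a single point, so the threshold between 2.0 and 4.0 is 3.0 regardless of how many observations sit at each value.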
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
That's good info. It's a tough call -- matching a known package is always
nice. However I agree that a weighted split is a little more theoretically
sound (don't have a reference on that though). I'd
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
Hi, I have checked R GBM's code and found that:
R's gbm uses mean value $(x + y) / 2$, not weighted mean $(c_x * x + c_y *
y) / (c_x + c_y)$ described in [JIRA
SPARK-16957](https://issues.apache.
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
@sethah Perhaps it's hard to compare R with Spark's behavior, since many
factors are involved. I'd like to read R GBM's code and verify the consistency
of both sides' designs on split criteria. Is that OK?
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/17556
Seems like a reasonable change. Just left some minor comments.
Github user sethah commented on the issue:
https://github.com/apache/spark/pull/17556
If we are attempting to match R GBM, it would be great to show, at least on
the PR, that we get the same results.
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
many thanks, @srowen
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
It's looking good, and the R tests pass. I'll also ask @mengxr or maybe
@dbtsai if they have any concerns about this change?
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3662 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3662/testReport)**
for PR 17556 at commit
[`b74702a`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3662 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3662/testReport)**
for PR 17556 at commit
[`b74702a`](https://github.com/apache/spark/commit/b
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
I have run all the MLlib unit test cases in Python. However, I am not
familiar with R, and I don't want to waste too much time setting up an R
environment.
Could the CI retest the PR? We can check
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
http://spark.apache.org/docs/latest/building-spark.html
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
@srowen Hi, I forgot the unit tests in Python and R. Where can I find
documentation about setting up a development environment? Thanks.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3655 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)**
for PR 17556 at commit
[`9ca5750`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3655 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3655/testReport)**
for PR 17556 at commit
[`9ca5750`](https://github.com/apache/spark/commit/9
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3654 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)**
for PR 17556 at commit
[`9ca5750`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3654 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3654/testReport)**
for PR 17556 at commit
[`9ca5750`](https://github.com/apache/spark/commit/9
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
Just a flaky test. It can't be related.
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
```
Test Result (1 failure / +1)
org.apache.spark.storage.TopologyAwareBlockReplicationPolicyBehavior.Peers in 2
racks
```
Does anyone know what this is?
Github user facaiy commented on the issue:
https://github.com/apache/spark/pull/17556
Is there something wrong with the Spark CI?
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17556
**[Test build #3652 has
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/3652/testReport)**
for PR 17556 at commit
[`9ca5750`](https://github.com/apache/spark/commit/9
Github user srowen commented on the issue:
https://github.com/apache/spark/pull/17556
It seems OK to me but @sethah or @jkbradley might be good as a second set
of eyes. It does slightly alter behavior, but, it does seem like something that
should work better in general.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17556
Can one of the admins verify this patch?