[jira] [Resolved] (SPARK-31326) create Function docs structure for SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31326. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/28099] > create Function docs structure for SQL Reference > > > Key: SPARK-31326 > URL: https://issues.apache.org/jira/browse/SPARK-31326 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > create Function docs structure for SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31328. - Resolution: Fixed Issue resolved by pull request 28101 [https://github.com/apache/spark/pull/28101] > Incorrect timestamps rebasing on autumn daylight saving time > > > Key: SPARK-31328 > URL: https://issues.apache.org/jira/browse/SPARK-31328 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Run the following code in the *America/Los_Angeles* time zone: > {code:scala} > test("rebasing differences") { > withDefaultTimeZone(getZoneId("America/Los_Angeles")) { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += 30 * MICROS_PER_MINUTE > } > println(s"counter = $counter") > } > } > {code} > The rebased and original micros must be the same after 1883-11-18 because the > standard zone offset and DST offset are the same in Proleptic Gregorian > calendar and in the hybrid calendar (Julian+Gregorian) but actually there are > differences of 60 minutes: > {code:java} > local date-time = 0001-01-01T00:00 diff = -2872 minutes > local date-time = 0100-03-01T00:00 diff = -1432 minutes > local date-time = 0200-03-01T00:00 diff = 7 minutes > local date-time = 0300-03-01T00:00 diff = 1447 minutes > local date-time = 0500-03-01T00:00 diff = 2887 minutes > local date-time = 0600-03-01T00:00 diff = 4327 minutes > local date-time = 0700-03-01T00:00 diff = 5767 minutes > local date-time = 0900-03-01T00:00 diff = 7207 minutes > local date-time = 1000-03-01T00:00 diff = 8647 minutes > local date-time = 1100-03-01T00:00 diff = 10087 minutes > local date-time = 1300-03-01T00:00 diff = 11527 minutes > local date-time = 1400-03-01T00:00 diff = 12967 minutes > local date-time = 1500-03-01T00:00 diff = 14407 minutes > local date-time = 1582-10-15T00:00 diff = 7 minutes > local date-time = 1883-11-18T12:22:58 diff = 0 minutes > local date-time = 1918-10-27T01:22:58 diff = 60 minutes > local date-time = 1918-10-27T01:22:58 diff = 0 minutes > local date-time = 1919-10-26T01:22:58 diff = 60 minutes > local date-time = 1919-10-26T01:22:58 diff = 0 minutes > local date-time = 1945-09-30T01:22:58 diff = 60 minutes > local date-time = 1945-09-30T01:22:58 diff = 0 minutes > local date-time = 1949-01-01T01:22:58 diff = 60 minutes > local date-time = 1949-01-01T01:22:58 diff = 0 minutes > local date-time = 1950-09-24T01:22:58 diff = 60 minutes > local date-time = 1950-09-24T01:22:58 diff = 0 minutes > local date-time = 1951-09-30T01:22:58 diff = 60 minutes > local date-time = 1951-09-30T01:22:58 diff = 0 minutes > local date-time = 1952-09-28T01:22:58 diff = 60 minutes > local date-time = 1952-09-28T01:22:58 diff = 0 minutes > local date-time = 1953-09-27T01:22:58 diff = 60 minutes > local date-time = 1953-09-27T01:22:58 diff = 0 minutes > local 
date-time = 1954-09-26T01:22:58 diff = 60 minutes > local date-time = 1954-09-26T01:22:58 diff = 0 minutes > local date-time = 1955-09-25T01:22:58 diff = 60 minutes > local date-time = 1955-09-25T01:22:58 diff = 0 minutes > local date-time = 1956-09-30T01:22:58 diff = 60 minutes > local date-time = 1956-09-30T01:22:58 diff = 0 minutes > local date-time = 1957-09-29T01:22:58 diff = 60 minutes > local date-time = 1957-09-29T01:22:58 diff = 0 minutes > local date-time = 1958-09-28T01:22:58 diff = 60 minutes > local date-time = 1958-09-28T01:22:58 diff = 0 minutes > local date-time = 1959-09-27T01:22:58 diff = 60 minutes > local date-time = 1959-09-27T01:22:58 diff = 0 minutes > local date-time = 1960-09-25T01:22:58 diff = 60 minutes > local date-time = 1960-09-25T01:22:58 diff = 0 minutes > local date-time = 1961-09-24T01:22:58 diff = 60 minutes > local date-time = 1961-09-24T01:22:58 diff = 0 minutes > local date-time = 1962-10-28T01:22:58 diff = 60 minutes > local date-time = 1962-10-28T01:22:58 diff = 0 minutes > local date-time = 1963-10-27T01:22:58 diff = 60 minutes > local
[jira] [Resolved] (SPARK-31325) Control a plan explain mode in the events of SQL listeners via SQLConf
[ https://issues.apache.org/jira/browse/SPARK-31325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31325. Assignee: Takeshi Yamamuro Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28097 > Control a plan explain mode in the events of SQL listeners via SQLConf > -- > > Key: SPARK-31325 > URL: https://issues.apache.org/jira/browse/SPARK-31325 > Project: Spark > Issue Type: Improvement > Components: SQL, Web UI >Affects Versions: 3.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Major > > This proposes to add a new SQL config for controlling a plan explain mode in > the events of (e.g., `SparkListenerSQLExecutionStart` and > `SparkListenerSQLAdaptiveExecutionUpdate`) SQL listeners. > In the current master, the output of `QueryExecution.toString` (this is > equivalent to the "extended" explain mode) is stored in these events. I think > it is useful to control the content via SQLConf. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
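For context, here is a minimal sketch (not part of the patch) of a listener consuming the event mentioned above; the plan description it receives is the string whose explain mode the proposed SQLConf would control. It uses Spark's existing listener API; the listener class name is illustrative.

{code:scala}
import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart

// Logs the plan description carried by SparkListenerSQLExecutionStart.
// In the current master this string is always the "extended" explain output;
// the proposal is to make its format configurable via SQLConf.
class PlanLoggingListener extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLExecutionStart =>
      println(s"SQL execution ${e.executionId} plan:\n${e.physicalPlanDescription}")
    case _ => // ignore other events
  }
}
{code}

Registered via {{sparkContext.addSparkListener(new PlanLoggingListener)}}, such a listener only ever sees whatever string was stored in the event, which is why the format has to be decided when the event is emitted.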
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074217#comment-17074217 ] Nicholas Chammas commented on SPARK-31330: -- Hmm, I didn't see anything from you on the mailing list. But thanks for these references! This is very helpful. Looks like you had Infra enable autolabeler for the Avro project over in INFRA-17367. I will ask Infra to do the same for Spark and cc [~hyukjin.kwon] for committer approval (which I guess Infra may ask for). > Automatically label PRs based on the paths they touch > - > > Key: SPARK-31330 > URL: https://issues.apache.org/jira/browse/SPARK-31330 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We can potentially leverage the added labels to drive testing, review, or > other project tooling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31333) Document Join Hints
Xiao Li created SPARK-31333: --- Summary: Document Join Hints Key: SPARK-31333 URL: https://issues.apache.org/jira/browse/SPARK-31333 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Xiao Li Assignee: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074142#comment-17074142 ] Ismaël Mejía commented on SPARK-31330: -- What about the approach I suggested in the ML? The autolabeler does not have the mentioned limitation and it has already been used by various Apache projects: https://github.com/mithro/autolabeler https://github.com/apache/avro/blob/master/.github/autolabeler.yml > Automatically label PRs based on the paths they touch > - > > Key: SPARK-31330 > URL: https://issues.apache.org/jira/browse/SPARK-31330 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We can potentially leverage the added labels to drive testing, review, or > other project tooling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31330) Automatically label PRs based on the paths they touch
[ https://issues.apache.org/jira/browse/SPARK-31330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17074124#comment-17074124 ] Nicholas Chammas commented on SPARK-31330: -- Unfortunately, it seems I jumped the gun on sending that dev email about the GitHub PR labeler action. It has a fundamental limitation that currently makes it [useless for us|https://github.com/actions/labeler/tree/d2c408e7ed8498dfdf675c5f8d133ab37b6f8520#pull-request-labeler]: {quote}Note that only pull requests being opened from the same repository can be labeled. This action will not currently work for pull requests from forks – like is common in open source projects – because the token for forked pull request workflows does not have write permissions. {quote} Additional detail: [https://github.com/actions/labeler/issues/12#issuecomment-525762657] I'll keep my eye on that Action in case they somehow lift or work around the limitation on forked repositories. Of course, we can always implement this functionality ourselves, but the attraction of the GitHub Action was that we could reuse an existing, tested, and widely adopted implementation. > Automatically label PRs based on the paths they touch > - > > Key: SPARK-31330 > URL: https://issues.apache.org/jira/browse/SPARK-31330 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > We can potentially leverage the added labels to drive testing, review, or > other project tooling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest
[ https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanley Poon updated SPARK-31332: - Description: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned in [https://dzone.com/articles/classification-using-random-forest-with-spark-20.|https://dzone.com/articles/classification-using-random-forest-with-spark-20] h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage, proximity requires O(NxT) memory, and it may still not fit in memory: where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. Without this, currently there is no work around to compute proximity measure. h4. Experiment on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] was: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned in [https://dzone.com/articles/classification-using-random-forest-with-spark-20.|https://dzone.com/articles/classification-using-random-forest-with-spark-20] h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage of NxT, it may still not fit in memory, where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. 
And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. Without this, currently there is no work around to compute proximity measure. h4. Experiment on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] > Proposal to add Proximity Measure in Random Forest > -- > > Key: SPARK-31332 > URL: https://issues.apache.org/jira/browse/SPARK-31332 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5 > Environment: The proposal should apply to any Spark version and OS's > that are supported by Spark. > Specifically, the observations reported were based on: > * Spark 2.3.1 and 2.4.5 > * Ubuntu 16.04.6 LTS > * Mac OS 10.13.6 > >
[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest
[ https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanley Poon updated SPARK-31332: - Description: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned [here|[https://dzone.com/articles/classification-using-random-forest-with-spark-20]]. h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage of NxT, it may still not fit in memory, where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. Without this, currently there is no work around to compute proximity measure. h4. Experiment on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] was: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned [here|[https://dzone.com/articles/classification-using-random-forest-with-spark-20]]. h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage of NxT, it may still not fit in memory, where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. Without this, currently there is no work around to compute proximity measure. h4. 
Experiment Based on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] > Proposal to add Proximity Measure in Random Forest > -- > > Key: SPARK-31332 > URL: https://issues.apache.org/jira/browse/SPARK-31332 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5 > Environment: The proposal should apply to any Spark version and OS's > that are supported by Spark. > Specifically, the observations reported were based on: > * Spark 2.3.1 and 2.4.5 > * Ubuntu 16.04.6 LTS > * Mac OS 10.13.6 > >Reporter: Stanley Poon >Priority: Major > Labels: Proximity, RandomForest, ml > > h3. Background > The RandomForest model does not provide proximity
[jira] [Updated] (SPARK-31332) Proposal to add Proximity Measure in Random Forest
[ https://issues.apache.org/jira/browse/SPARK-31332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stanley Poon updated SPARK-31332: - Description: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned in [https://dzone.com/articles/classification-using-random-forest-with-spark-20.|https://dzone.com/articles/classification-using-random-forest-with-spark-20] h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage of NxT, it may still not fit in memory, where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. Without this, currently there is no work around to compute proximity measure. h4. Experiment on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] was: h3. Background The RandomForest model does not provide proximity measure as described in [Breiman|[https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]]. There are many important use cases of proximity: - more accurate replacement for missing data - identify outliers - clustering or multi-dimensional scaling - compute the proximities of test set in the training set - unsupervised learning Performance and storage concerns are among reasons that proximities are not computed and kept during prediction, as mentioned [here|[https://dzone.com/articles/classification-using-random-forest-with-spark-20]]. h3. Proposal RF in Spark is optimized for massive scalability on large-scale dataset where the number of data points, features and trees can be very big. Even with optimized storage of NxT, it may still not fit in memory, where N is number of data points and T is number of trees in the forest. We propose to add a column in the prediction output to return the node-id (or hash) of the terminal node for each sample data point. The required changes on the current RF implementation will not increase the computation and storage by significant amounts. And it will leave the possibility open for computing some form of proximity after prediction. It us up to the users how to use the extra column of node-ids. 
Without this, currently there is no work around to compute proximity measure. h4. Experiment on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel. And then we use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] > Proposal to add Proximity Measure in Random Forest > -- > > Key: SPARK-31332 > URL: https://issues.apache.org/jira/browse/SPARK-31332 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.4.5 > Environment: The proposal should apply to any Spark version and OS's > that are supported by Spark. > Specifically, the observations reported were based on: > * Spark 2.3.1 and 2.4.5 > * Ubuntu 16.04.6 LTS > * Mac OS 10.13.6 > >Reporter: Stanley Poon >Priority: Major > Labels: Proximity, RandomForest, ml > >
[jira] [Created] (SPARK-31332) Proposal to add Proximity Measure in Random Forest
Stanley Poon created SPARK-31332: Summary: Proposal to add Proximity Measure in Random Forest Key: SPARK-31332 URL: https://issues.apache.org/jira/browse/SPARK-31332 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.4.5 Environment: The proposal should apply to any Spark version and operating system supported by Spark. Specifically, the observations reported were based on: * Spark 2.3.1 and 2.4.5 * Ubuntu 16.04.6 LTS * Mac OS 10.13.6 Reporter: Stanley Poon h3. Background The RandomForest model does not provide the proximity measure described in [Breiman|https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm]. There are many important use cases of proximity: - more accurate replacement for missing data - identifying outliers - clustering or multi-dimensional scaling - computing the proximities of a test set in the training set - unsupervised learning Performance and storage concerns are among the reasons that proximities are not computed and kept during prediction, as mentioned [here|https://dzone.com/articles/classification-using-random-forest-with-spark-20]. h3. Proposal RF in Spark is optimized for massive scalability on large-scale datasets where the number of data points, features, and trees can be very large. Even with optimized storage of N x T, it may still not fit in memory, where N is the number of data points and T is the number of trees in the forest. We propose to add a column to the prediction output that returns the node id (or hash) of the terminal node for each sample data point. The required changes to the current RF implementation will not increase computation or storage by significant amounts, and they leave the possibility open for computing some form of proximity after prediction. It is up to the users how to use the extra column of node ids. Without this, there is currently no workaround to compute the proximity measure. h4. Experiment Based on Spark 2.3.1 and 2.4.5 In one prototype, we output the terminal node id for each prediction from RandomForestClassificationModel and then use Spark’s LSHModel to cluster prediction results by terminal node ids. The performance of the whole pipeline was reasonable for the size of our dataset. h3. References * L. Breiman. Manual on setting up, using, and understanding random forests v3.1, 2002. [https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm] * [https://dzone.com/articles/classification-using-random-forest-with-spark-20] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
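As a rough illustration of how the proposed output could be consumed, the following hedged sketch computes pairwise proximity from a hypothetical {{leafIds}} array column (one terminal node id per tree) attached to the prediction output. Neither {{leafIds}} nor the {{id}} column exists in the current RandomForestClassificationModel API; both are assumptions of the sketch, not the proposed implementation.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

// Breiman-style proximity of two rows: the fraction of trees in which both
// rows land in the same terminal node. `id` and `leafIds` are hypothetical
// columns assumed to be present on `predictions`.
def pairwiseProximity(predictions: DataFrame): DataFrame = {
  val proximity = udf { (a: Seq[Int], b: Seq[Int]) =>
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }
  val left  = predictions.select(col("id").as("idA"), col("leafIds").as("leafA"))
  val right = predictions.select(col("id").as("idB"), col("leafIds").as("leafB"))
  left.crossJoin(right)
    .filter(col("idA") < col("idB")) // keep each unordered pair once
    .select(col("idA"), col("idB"), proximity(col("leafA"), col("leafB")).as("proximity"))
}
{code}

The O(N^2) cross join is exactly why the proposal stops at emitting node ids and leaves any proximity computation (or an approximation such as LSH over node ids, as in the prototype) to the user.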
[jira] [Created] (SPARK-31331) Document Spark integration with Hive UDFs/UDAFs/UDTFs
Huaxin Gao created SPARK-31331: -- Summary: Document Spark integration with Hive UDFs/UDAFs/UDTFs Key: SPARK-31331 URL: https://issues.apache.org/jira/browse/SPARK-31331 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Document Spark integration with Hive UDFs/UDAFs/UDTFs -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31330) Automatically label PRs based on the paths they touch
Nicholas Chammas created SPARK-31330: Summary: Automatically label PRs based on the paths they touch Key: SPARK-31330 URL: https://issues.apache.org/jira/browse/SPARK-31330 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.1.0 Reporter: Nicholas Chammas We can potentially leverage the added labels to drive testing, review, or other project tooling. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31329) Modify executor monitor to allow for moving shuffle blocks
Holden Karau created SPARK-31329: Summary: Modify executor monitor to allow for moving shuffle blocks Key: SPARK-31329 URL: https://issues.apache.org/jira/browse/SPARK-31329 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Holden Karau To enable SPARK-20629 we need to revisit code that assumes shuffle blocks don't move. Currently, the executor monitor assumes that shuffle blocks are immovable. We should modify this code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073951#comment-17073951 ] Michael Armbrust commented on SPARK-29358: -- Sure, but it is very easy to make this not a behavior change. Add an optional boolean parameter, {{allowMissingColumns}} (or something) that defaults to {{false}}. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > |   x|   y| > +----+----+ > |   1|null| > |   2|null| > |   3|null| > |null|   a| > |null|   b| > |null|   c| > +----+----+ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
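To make the suggestion concrete, here is a hedged sketch of the proposed semantics built on top of the existing API; the {{allowMissingColumns}} parameter mentioned above would simply make {{Dataset.unionByName}} do this padding internally. The helper name is an assumption, and this is not Spark's implementation.

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Pad each side with null literals for the columns it is missing, align the
// column order, then union by name.
def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val allCols = left.columns ++ right.columns.filterNot(left.columns.contains)
  def pad(df: DataFrame): DataFrame = {
    val present = df.columns.toSet
    val padded = allCols.foldLeft(df) { (acc, c) =>
      if (present.contains(c)) acc else acc.withColumn(c, lit(null))
    }
    padded.select(allCols.map(col): _*)
  }
  pad(left).unionByName(pad(right))
}

// unionByNameFillNulls(df1, df2) yields the x/y DataFrame shown in the issue,
// with nulls filling the columns each input lacks.
{code}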
[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution
[ https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073928#comment-17073928 ] L. C. Hsieh commented on SPARK-27913: - As we support schema merging in ORC by SPARK-11412, is this still an issue? > Spark SQL's native ORC reader implements its own schema evolution > - > > Key: SPARK-27913 > URL: https://issues.apache.org/jira/browse/SPARK-27913 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.3 >Reporter: Owen O'Malley >Priority: Major > > ORC's reader handles a wide range of schema evolution, but the Spark SQL > native ORC bindings do not provide the desired schema to the ORC reader. This > causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
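For reference, a minimal sketch of the ORC schema merging added by SPARK-11412, assuming Spark 3.0+ with an active SparkSession named {{spark}}; the path is a placeholder.

{code:scala}
// Merge the schemas of all ORC part-files at read time instead of taking the
// schema of a single file.
val merged = spark.read
  .option("mergeSchema", "true")
  .orc("/path/to/orc/table") // placeholder path
merged.printSchema()
{code}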
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} The rebased and original micros must be the same after 1883-11-18 because the standard zone offset and DST offset are the same in Proleptic Gregorian calendar and in the hybrid calendar (Julian+Gregorian) but actually there are differences of 60 minutes: {code:java} local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes local date-time = 1918-10-27T01:22:58 diff = 60 minutes local date-time = 1918-10-27T01:22:58 diff = 0 minutes local date-time = 1919-10-26T01:22:58 diff = 60 minutes local date-time = 1919-10-26T01:22:58 diff = 0 minutes local date-time = 1945-09-30T01:22:58 diff = 60 minutes local date-time = 1945-09-30T01:22:58 diff = 0 minutes local date-time = 1949-01-01T01:22:58 diff = 60 minutes local date-time = 1949-01-01T01:22:58 diff = 0 minutes local date-time = 1950-09-24T01:22:58 diff = 60 minutes local date-time = 1950-09-24T01:22:58 diff = 0 minutes local date-time = 1951-09-30T01:22:58 diff = 60 minutes local date-time = 1951-09-30T01:22:58 diff = 0 minutes local date-time = 1952-09-28T01:22:58 diff = 60 minutes local date-time = 1952-09-28T01:22:58 diff = 0 minutes local date-time = 1953-09-27T01:22:58 diff = 60 minutes local date-time = 1953-09-27T01:22:58 diff = 0 minutes local date-time = 1954-09-26T01:22:58 diff = 60 minutes local date-time = 1954-09-26T01:22:58 diff = 0 minutes local date-time = 1955-09-25T01:22:58 diff = 60 minutes local date-time = 1955-09-25T01:22:58 diff = 0 minutes local date-time = 1956-09-30T01:22:58 diff = 60 minutes local date-time = 1956-09-30T01:22:58 diff = 0 minutes local date-time = 1957-09-29T01:22:58 diff = 60 minutes local date-time = 1957-09-29T01:22:58 diff = 0 minutes local date-time = 1958-09-28T01:22:58 diff = 60 minutes local date-time = 1958-09-28T01:22:58 diff 
= 0 minutes local date-time = 1959-09-27T01:22:58 diff = 60 minutes local date-time = 1959-09-27T01:22:58 diff = 0 minutes local date-time = 1960-09-25T01:22:58 diff = 60 minutes local date-time = 1960-09-25T01:22:58 diff = 0 minutes local date-time = 1961-09-24T01:22:58 diff = 60 minutes local date-time = 1961-09-24T01:22:58 diff = 0 minutes local date-time = 1962-10-28T01:22:58 diff = 60 minutes local date-time = 1962-10-28T01:22:58 diff = 0 minutes local date-time = 1963-10-27T01:22:58 diff = 60 minutes local date-time = 1963-10-27T01:22:58 diff = 0 minutes local date-time = 1964-10-25T01:22:58 diff = 60 minutes local date-time = 1964-10-25T01:22:58 diff = 0 minutes local date-time = 1965-10-31T01:22:58 diff = 60 minutes local date-time = 1965-10-31T01:22:58 diff = 0 minutes local date-time = 1966-10-30T01:22:58 diff = 60 minutes local date-time = 1966-10-30T01:22:58 diff = 0 minutes local date-time = 1967-10-29T01:22:58 diff = 60 minutes local date-time = 1967-10-29T01:22:58 diff = 0 minutes local date-time = 1968-10-27T01:22:58 diff = 60 minutes local date-time = 1968-10-27T01:22:58 diff = 0 minutes local date-time = 1969-10-26T01:22:58 diff = 60 minutes local date-time = 1969-10-26T01:22:58 diff = 0 minutes local date-time = 1970-10-25T01:22:58
[jira] [Updated] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
[ https://issues.apache.org/jira/browse/SPARK-31328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31328: --- Description: Run the following code in the *America/Los_Angeles* time zone: {code:scala} test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time = 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time 
= 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local date-time = 1997-10-25T15:52:58 diff = 60 minutes local date-time = 1997-10-25T16:52:58 diff = 0 minutes local date-time = 1998-10-24T15:52:58 diff = 60 minutes local date-time = 1998-10-24T16:52:58 diff = 0 minutes local date-time = 1999-10-30T15:52:58 diff = 60 minutes local date-time = 1999-10-30T16:52:58 diff = 0 minutes local date-time = 2000-10-28T15:52:58 diff = 60 minutes local date-time
[jira] [Created] (SPARK-31328) Incorrect timestamps rebasing on autumn daylight saving time
Maxim Gekk created SPARK-31328: -- Summary: Incorrect timestamps rebasing on autumn daylight saving time Key: SPARK-31328 URL: https://issues.apache.org/jira/browse/SPARK-31328 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 I do believe it is possible to speed up date-time rebasing by building a map of micros to diffs between original and rebased micros. And look up at the map via binary search. For example, the *America/Los_Angeles* time zone has less than 100 points when diff changes: {code:scala} test("optimize rebasing") { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += MICROS_PER_HOUR } println(s"counter = $counter") } {code} {code:java} local date-time = 0001-01-01T00:00 diff = -2909 minutes local date-time = 0100-02-28T14:00 diff = -1469 minutes local date-time = 0200-02-28T14:00 diff = -29 minutes local date-time = 0300-02-28T14:00 diff = 1410 minutes local date-time = 0500-02-28T14:00 diff = 2850 minutes local date-time = 0600-02-28T14:00 diff = 4290 minutes local date-time = 0700-02-28T14:00 diff = 5730 minutes local date-time = 0900-02-28T14:00 diff = 7170 minutes local date-time = 1000-02-28T14:00 diff = 8610 minutes local date-time = 1100-02-28T14:00 diff = 10050 minutes local date-time = 1300-02-28T14:00 diff = 11490 minutes local date-time = 1400-02-28T14:00 diff = 12930 minutes local date-time = 1500-02-28T14:00 diff = 14370 minutes local date-time = 1582-10-14T14:00 diff = -29 minutes local date-time = 1899-12-31T16:52:58 diff = 0 minutes local date-time = 1917-12-27T11:52:58 diff = 60 minutes local date-time = 1917-12-27T12:52:58 diff = 0 minutes local date-time = 1918-09-15T12:52:58 diff = 60 minutes local date-time = 1918-09-15T13:52:58 diff = 0 minutes local date-time = 1919-06-30T16:52:58 diff = 31 minutes local date-time = 1919-06-30T17:52:58 diff = 0 minutes local date-time = 1919-08-15T12:52:58 diff = 60 minutes local date-time = 1919-08-15T13:52:58 diff = 0 minutes local date-time = 1921-08-31T10:52:58 diff = 60 minutes local date-time = 1921-08-31T11:52:58 diff = 0 minutes local date-time = 1921-09-30T11:52:58 diff = 60 minutes local date-time = 1921-09-30T12:52:58 diff = 0 minutes local date-time = 1922-09-30T12:52:58 diff = 60 minutes local date-time = 1922-09-30T13:52:58 diff = 0 minutes local date-time = 1981-09-30T12:52:58 diff = 60 minutes local date-time = 1981-09-30T13:52:58 diff = 0 minutes local date-time = 1982-09-30T12:52:58 diff = 60 minutes local date-time = 1982-09-30T13:52:58 diff = 0 minutes local date-time = 1983-09-30T12:52:58 diff = 60 minutes local date-time = 1983-09-30T13:52:58 diff = 0 minutes local date-time = 1984-09-29T15:52:58 diff = 60 minutes local date-time = 1984-09-29T16:52:58 diff = 0 minutes local date-time = 1985-09-28T15:52:58 diff = 60 minutes local date-time = 1985-09-28T16:52:58 diff = 0 minutes local date-time = 1986-09-27T15:52:58 diff = 60 minutes local date-time 
= 1986-09-27T16:52:58 diff = 0 minutes local date-time = 1987-09-26T15:52:58 diff = 60 minutes local date-time = 1987-09-26T16:52:58 diff = 0 minutes local date-time = 1988-09-24T15:52:58 diff = 60 minutes local date-time = 1988-09-24T16:52:58 diff = 0 minutes local date-time = 1989-09-23T15:52:58 diff = 60 minutes local date-time = 1989-09-23T16:52:58 diff = 0 minutes local date-time = 1990-09-29T15:52:58 diff = 60 minutes local date-time = 1990-09-29T16:52:58 diff = 0 minutes local date-time = 1991-09-28T16:52:58 diff = 60 minutes local date-time = 1991-09-28T17:52:58 diff = 0 minutes local date-time = 1992-09-26T15:52:58 diff = 60 minutes local date-time = 1992-09-26T16:52:58 diff = 0 minutes local date-time = 1993-09-25T15:52:58 diff = 60 minutes local date-time = 1993-09-25T16:52:58 diff = 0 minutes local date-time = 1994-09-24T15:52:58 diff = 60 minutes local date-time = 1994-09-24T16:52:58 diff = 0 minutes local date-time = 1995-09-23T15:52:58 diff = 60 minutes local date-time = 1995-09-23T16:52:58 diff = 0 minutes local date-time = 1996-10-26T15:52:58 diff = 60 minutes local date-time = 1996-10-26T16:52:58 diff = 0 minutes local
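A hedged sketch of the lookup idea described above; the array names, the precomputation step, and the fallback are assumptions, not Spark's actual rebasing code.

{code:scala}
import java.util.Arrays

// `switches` holds the micros values at which the Gregorian -> Julian
// difference changes (sorted ascending), and diffs(i) is the difference in
// micros that applies from switches(i) until switches(i + 1). Both arrays
// would be precomputed once per time zone with the slow rebasing function,
// exactly as enumerated by the test above (fewer than 100 entries for
// America/Los_Angeles).
def rebaseViaTable(switches: Array[Long], diffs: Array[Long], micros: Long): Long = {
  val i = Arrays.binarySearch(switches, micros)
  // binarySearch returns (-insertionPoint - 1) when the key is absent; the
  // interval containing `micros` starts at insertionPoint - 1.
  val idx = if (i >= 0) i else -i - 2
  if (idx < 0) micros // before the first switch point: fall back to the slow path
  else micros + diffs(idx)
}
{code}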
[jira] [Created] (SPARK-31327) write spark version to avro file metadata
Wenchen Fan created SPARK-31327: --- Summary: write spark version to avro file metadata Key: SPARK-31327 URL: https://issues.apache.org/jira/browse/SPARK-31327 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29153) ResourceProfile conflict resolution stage level scheduling
[ https://issues.apache.org/jira/browse/SPARK-29153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-29153. --- Fix Version/s: 3.1.0 Assignee: Thomas Graves Resolution: Fixed > ResourceProfile conflict resolution stage level scheduling > -- > > Key: SPARK-29153 > URL: https://issues.apache.org/jira/browse/SPARK-29153 > Project: Spark > Issue Type: Story > Components: Scheduler >Affects Versions: 3.0.0 >Reporter: Thomas Graves >Assignee: Thomas Graves >Priority: Major > Fix For: 3.1.0 > > > For stage-level scheduling, if a stage has ResourceProfiles from multiple > RDDs that conflict, we have to resolve that conflict. > We may have 2 approaches: > # Default to erroring out on a conflict, so that the user realizes what is going > on, with a config to turn this behavior on and off. > # If the config to error out is off, resolve the conflict. See below from > the design doc on the SPIP. > For the merge strategy we can choose the max from the ResourceProfiles to > make the largest container required. This in general will work, but there are > a few cases where people may have intended them to be a sum. For instance, let's say > one RDD needs X memory and another RDD needs Y memory. It might be that when those > get combined into a stage you really need X+Y memory vs. max(X, Y). > Another example might be union, where you would want to sum the resources of > each RDD. I think we can document what we choose for now and later add in > the ability to have alternatives other than max. Or perhaps we do need to > change what we do either per operation or per resource type. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
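As a rough illustration of the two strategies discussed above, here is a hedged sketch over a simplified resource map (name -> amount) rather than Spark's actual ResourceProfile classes; all names and numbers are illustrative.

{code:scala}
// "Max" merge: the largest requested amount per resource wins, producing the
// largest container any single requirement needs.
def mergeByMax(profiles: Seq[Map[String, Long]]): Map[String, Long] =
  profiles.flatten.groupBy(_._1).map { case (r, entries) => r -> entries.map(_._2).max }

// "Sum" merge: add the requirements, which a union of two RDDs might really need.
def mergeBySum(profiles: Seq[Map[String, Long]]): Map[String, Long] =
  profiles.flatten.groupBy(_._1).map { case (r, entries) => r -> entries.map(_._2).sum }

val rdd1 = Map("memoryMb" -> 4096L, "cores" -> 2L)
val rdd2 = Map("memoryMb" -> 8192L, "gpus" -> 1L)
mergeByMax(Seq(rdd1, rdd2)) // Map(memoryMb -> 8192, cores -> 2, gpus -> 1)
mergeBySum(Seq(rdd1, rdd2)) // Map(memoryMb -> 12288, cores -> 2, gpus -> 1)
{code}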
[jira] [Resolved] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait
[ https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-31179. --- Fix Version/s: 3.1.0 Assignee: feiwang Resolution: Fixed > Fast fail the connection while last shuffle connection failed in the last > retry IO wait > > > Key: SPARK-31179 > URL: https://issues.apache.org/jira/browse/SPARK-31179 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.1.0 >Reporter: feiwang >Assignee: feiwang >Priority: Major > Fix For: 3.1.0 > > > When reading shuffle data, several fetch requests may be sent to the same shuffle > server. > There is a client pool, and these requests may share the same client. > When the shuffle server is busy, the request connections may time out. > For example, suppose there are two request connections, rc1 and rc2, > io.numConnectionsPerPeer is 1, and the connection timeout is 2 > minutes. > 1: rc1 holds the client lock and times out after 2 minutes. > 2: rc2 holds the client lock and times out after 2 minutes. > 3: rc1 starts the second retry, holds the lock, and times out after 2 minutes. > 4: rc2 starts the second retry, holds the lock, and times out after 2 minutes. > 5: rc1 starts the third retry, holds the lock, and times out after 2 minutes. > 6: rc2 starts the third retry, holds the lock, and times out after 2 minutes. > This wastes a lot of time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
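To make the idea concrete, a hedged sketch of the "fast fail" behavior in the title: if the previous connection attempt to the same shuffle server already failed within the retry wait window, give up immediately instead of waiting through another full connection timeout. The class and method names are illustrative, not the actual Spark networking code.

{code:scala}
import java.util.concurrent.ConcurrentHashMap

class FastFailTracker(ioRetryWaitMs: Long) {
  private val lastFailureMs = new ConcurrentHashMap[String, java.lang.Long]()

  // Called when a connection attempt to `address` fails or times out.
  def recordFailure(address: String): Unit =
    lastFailureMs.put(address, System.currentTimeMillis())

  // Before blocking on the shared client lock, check whether the last attempt
  // to this address failed inside the retry wait window; if so, fail fast
  // rather than queueing up for another 2-minute timeout (the rc1/rc2
  // scenario above).
  def shouldFastFail(address: String): Boolean = {
    val ts = lastFailureMs.get(address)
    ts != null && System.currentTimeMillis() - ts < ioRetryWaitMs
  }
}
{code}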
[jira] [Resolved] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.
[ https://issues.apache.org/jira/browse/SPARK-31315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31315. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28081 [https://github.com/apache/spark/pull/28081] > SQLQueryTestSuite: Display the total compile time for generated java code. > -- > > Key: SPARK-31315 > URL: https://issues.apache.org/jira/browse/SPARK-31315 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > SQLQueryTestSuite spent a lot of time compiling the generated java code. > We should display the total compile time for generated java code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.
[ https://issues.apache.org/jira/browse/SPARK-31315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31315: --- Assignee: jiaan.geng > SQLQueryTestSuite: Display the total compile time for generated java code. > -- > > Key: SPARK-31315 > URL: https://issues.apache.org/jira/browse/SPARK-31315 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > SQLQueryTestSuite spent a lot of time compiling the generated java code. > We should display the total compile time for generated java code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30839. -- Resolution: Done Thanks, [~beliefer], for working on this. > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark's ConfigEntry and ConfigBuilder are missing the Spark version in which each > configuration was released. This is inconvenient for Spark users when they visit > the Spark configuration page. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
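For illustration, a hedged sketch of what the requested metadata looks like on Spark's internal {{ConfigBuilder}} (a private[spark] API, so this only compiles inside Spark's own source tree; the builder method name and the example entry are assumptions here):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Each ConfigEntry records the release that introduced it, so the generated
// configuration docs can show a "Since Version" column.
val SHUFFLE_COMPRESS = ConfigBuilder("spark.shuffle.compress")
  .doc("Whether to compress map output files.")
  .version("0.6.0") // release in which the config first appeared (assumed for this example)
  .booleanConf
  .createWithDefault(true)
{code}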
[jira] [Assigned] (SPARK-31321) Remove SaveMode check in v2 FileWriteBuilder
[ https://issues.apache.org/jira/browse/SPARK-31321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31321: --- Assignee: Kent Yao > Remove SaveMode check in v2 FileWriteBuilder > > > Key: SPARK-31321 > URL: https://issues.apache.org/jira/browse/SPARK-31321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > > SaveMode is never assigned, so it will fail when calling `validateInputs` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31321) Remove SaveMode check in v2 FileWriteBuilder
[ https://issues.apache.org/jira/browse/SPARK-31321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31321. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28090 [https://github.com/apache/spark/pull/28090] > Remove SaveMode check in v2 FileWriteBuilder > > > Key: SPARK-31321 > URL: https://issues.apache.org/jira/browse/SPARK-31321 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > SaveMode is never assigned, so it will fail when calling `validateInputs` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30951) Potential data loss for legacy applications after switch to proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-30951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073463#comment-17073463 ] Wenchen Fan commented on SPARK-30951: - Theoretically, the Parquet spec implicitly requires the Gregorian calendar by referring to the Java 8 time API: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#timestamp That said, Spark 2.x writes "wrong" datetime values to Parquet, and I don't think we should keep this "wrong" behavior by default in 3.0. Besides, you will hit mixed-calendar Parquet files anyway if the data is written by multiple systems (e.g. Spark and Hive). I'd suggest users turn on the legacy config only if they have legacy datetime values in Parquet that are before 1582. To make it easier for users to realize that such legacy data exists, we can fail by default when reading datetime values before 1582 from Parquet files. > Potential data loss for legacy applications after switch to proleptic > Gregorian calendar > > > Key: SPARK-30951 > URL: https://issues.apache.org/jira/browse/SPARK-30951 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Bruce Robbins >Assignee: Maxim Gekk >Priority: Blocker > Labels: release-notes > Fix For: 3.0.0 > > > tl;dr: We recently discovered some Spark 2.x sites that have lots of data > containing dates before October 15, 1582. This could be an issue when such > sites try to upgrade to Spark 3.0. > From SPARK-26651: > {quote}"The changes might impact on the results for dates and timestamps > before October 15, 1582 (Gregorian)" > {quote} > We recently discovered that some large-scale Spark 2.x applications rely on > dates before October 15, 1582. > Two cases came up recently: > * An application that uses a commercial third-party library to encode > sensitive dates. On insert, the library encodes the actual date as some other > date. On select, the library decodes the date back to the original date. The > encoded value could be any date, including one before October 15, 1582 (e.g., > "0602-04-04"). > * An application that uses a specific unlikely date (e.g., "1200-01-01") as > a marker to indicate "unknown date" (in lieu of null). > Both sites ran into problems after another component in their system was > upgraded to use the proleptic Gregorian calendar. Spark applications that > read files created by the upgraded component were interpreting encoded or > marker dates incorrectly, and vice versa. Also, their data now had a mix of > calendars (hybrid and proleptic Gregorian) with no metadata to indicate which > file used which calendar. > Both sites had enormous amounts of existing data, so re-encoding the dates > using some other scheme was not a feasible solution. > This is relevant to Spark 3: > Any Spark 2 application that uses such date-encoding schemes may run into > trouble when run on Spark 3. The application may not properly interpret the > dates previously written by Spark 2. Also, once the Spark 3 version of the > application writes data, the tables will have a mix of calendars (hybrid and > proleptic Gregorian) with no metadata to indicate which file uses which > calendar. > Similarly, sites might run with mixed Spark versions, resulting in data > written by one version that cannot be interpreted by the other. And as above, > the tables will now have a mix of calendars with no way to detect which file > uses which calendar. 
> As with the two real-life example cases, these applications may have enormous > amounts of legacy data, so re-encoding the dates using some other scheme may > not be feasible. > We might want to consider a configuration setting to allow the user to > specify the calendar for storing and retrieving date and timestamp values > (not sure how such a flag would affect other date and timestamp-related > functions). I realize the change is far bigger than just adding a > configuration setting. > Here's a quick example of where trouble may happen, using the real-life case > of the marker date. > In Spark 2.4: > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 1 > scala> > {noformat} > In Spark 3.0 (reading from the same legacy file): > {noformat} > scala> spark.read.orc(s"$home/data/datefile").filter("dt == > '1200-01-01'").count > res0: Long = 0 > scala> > {noformat} > By the way, Hive had a similar problem. Hive switched from hybrid calendar to > proleptic Gregorian calendar between 2.x and 3.x. After some upgrade > headaches related to dates before 1582, the Hive community made the following > changes: > * When writing date or timestamp data to ORC, Parquet, and Avro
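For readers unfamiliar with the two calendars discussed above, the following self-contained sketch (plain JVM code, not Spark) shows how the legacy hybrid Julian/Gregorian calendar and the proleptic Gregorian calendar disagree about the marker date 1200-01-01 used in the example; the multi-day gap is what makes the same stored value read back as a different date.

{code:scala}
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Hybrid calendar (Julian before the 1582-10-15 cutover), as used by the
// legacy java.util date classes.
val hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
hybrid.clear()
hybrid.set(1200, Calendar.JANUARY, 1)
val hybridEpochDay = hybrid.getTimeInMillis / 86400000L

// Proleptic Gregorian calendar, as used by java.time.
val prolepticEpochDay = LocalDate.of(1200, 1, 1).toEpochDay

// The two readings of "1200-01-01" land on instants several days apart, so a
// value written under one calendar is misread under the other.
println(s"hybrid epoch day = $hybridEpochDay, proleptic epoch day = $prolepticEpochDay, " +
  s"difference = ${hybridEpochDay - prolepticEpochDay} days")
{code}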
[jira] [Created] (SPARK-31326) create Function docs structure for SQL Reference
Huaxin Gao created SPARK-31326: -- Summary: create Function docs structure for SQL Reference Key: SPARK-31326 URL: https://issues.apache.org/jira/browse/SPARK-31326 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao create Function docs structure for SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31299) Pyspark.ml.clustering illegalArgumentException with dataframe created from rows
[ https://issues.apache.org/jira/browse/SPARK-31299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073422#comment-17073422 ] Lukas Thaler commented on SPARK-31299: -- Oh dear. Now, that's embarrassing. Thank you for pointing this out. > Pyspark.ml.clustering illegalArgumentException with dataframe created from > rows > --- > > Key: SPARK-31299 > URL: https://issues.apache.org/jira/browse/SPARK-31299 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Lukas Thaler >Priority: Major > > I hope this is the right place and way to report a bug in (at least) the > PySpark API: > BisectingKMeans in the following example is only exemplary; the error occurs > with all clustering algorithms: > {code:python} > from pyspark.sql import Row > from pyspark.mllib.linalg import DenseVector > from pyspark.ml.clustering import BisectingKMeans > data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, > 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])), > Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])), > Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])), > Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])), > Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))]) > kmeans = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1) > model = kmeans.fit(data) > {code} > The .fit call in the last line will fail with the following error: > {code:java} > Py4JJavaError: An error occurred while calling o51.fit. > : java.lang.IllegalArgumentException: requirement failed: Column > test_features must be of type equal to one of the following types: > [struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, > array<double>, array<float>] but was actually of type > struct<type:tinyint,size:int,indices:array<int>,values:array<double>>. > {code} > As can be seen, the data type reported to be passed to the function is the > first data type in the list of allowed data types, yet the call ends in an > error because of it. > See my [StackOverflow issue|https://stackoverflow.com/questions/60884142/pyspark-py4j-illegalargumentexception-with-spark-createdataframe-and-pyspark-ml] > for more context -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
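A plausible reading of the self-contradictory-looking error above, consistent with the pyspark.mllib.linalg import in the repro but not spelled out in the ticket itself, is that the input column carries the old mllib vector type while the ml estimator only accepts the ml vector type; the two render to the same catalog string, so the message appears to reject the very type it lists as allowed. The sketch below uses hypothetical stand-in types (not the real VectorUDT classes) to show how an equality-based schema check can fail even though the printed strings match.

{code:scala}
// Hypothetical stand-ins for two user-defined types that describe the same
// struct layout (so they print identically) but are different types.
sealed trait ColumnType { def catalogString: String }
case object MlVectorType extends ColumnType {
  val catalogString = "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"
}
case object MllibVectorType extends ColumnType {
  val catalogString = "struct<type:tinyint,size:int,indices:array<int>,values:array<double>>"
}

def checkColumnType(actual: ColumnType, allowed: Seq[ColumnType]): Unit =
  require(allowed.contains(actual),
    s"Column must be of type equal to one of the following types: " +
      s"[${allowed.map(_.catalogString).mkString(", ")}] " +
      s"but was actually of type ${actual.catalogString}.")

// The allowed list contains only the ml type; passing the mllib type fails the
// equality check, yet the rendered strings in the message look identical.
// checkColumnType(MllibVectorType, Seq(MlVectorType))
{code}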
[jira] [Commented] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation
[ https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073409#comment-17073409 ] Dongjoon Hyun commented on SPARK-31312: --- Since I've seen your opinion, I won't ping you about that again. > Transforming Hive simple UDF (using JAR) expression may incur CNFE in later > evaluation > -- > > Key: SPARK-31312 > URL: https://issues.apache.org/jira/browse/SPARK-31312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0, 2.4.6 > > > In SPARK-26560, we ensured that a Hive UDF using a JAR is executed regardless of > the current thread context classloader. > [~cloud_fan] pointed out another potential issue in post-review of > SPARK-26560 - quoting the comment: > {quote} > Found a potential problem: here we call HiveSimpleUDF.dataType (which is a > lazy val), to force loading the class with the corrected class loader. > However, if the expression gets transformed later, which copies > HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class > loading, and at that time there is no guarantee that the corrected > classloader is used. > I think we should materialize the loaded class in HiveSimpleUDF. > {quote} > This JIRA issue is to track the effort of verifying the potential issue and > fixing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
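The hazard quoted above can be illustrated with a small self-contained sketch (not the real HiveSimpleUDF code; the names are made up): a lazy val that resolves the UDF class through the thread context classloader is re-evaluated on every copy of the expression, whereas materializing the loaded Class once and carrying it along avoids the second lookup.

{code:scala}
// Illustrative only; `udfClassName` stands in for the Hive UDF wiring.
final case class LazyUdf(udfClassName: String) {
  // Resolved on first access of *each* instance. A transformed (copied)
  // expression re-triggers the lookup under whatever context classloader is
  // active at that later moment, which may no longer see the UDF jar.
  lazy val udfClass: Class[_] =
    Class.forName(udfClassName, true, Thread.currentThread().getContextClassLoader)
}

// One way to "materialize" the loaded class: resolve it once up front and carry
// the Class object itself, so copies reuse the same instance instead of
// re-resolving the name later.
final case class MaterializedUdf(udfClass: Class[_])

object MaterializedUdf {
  def fromName(udfClassName: String): MaterializedUdf =
    MaterializedUdf(
      Class.forName(udfClassName, true, Thread.currentThread().getContextClassLoader))
}
{code}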
[jira] [Commented] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation
[ https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17073407#comment-17073407 ] Dongjoon Hyun commented on SPARK-31312: --- It's for informing the users (and the downstream distributors) of the risk and recommending that they upgrade their versions. If we set 2.4.5 only, it can also be read as a bug that occurred only in 2.4.5. If we set at least 2.3.x, all 2.4.0 ~ 2.4.4 users will also understand the risk. > Transforming Hive simple UDF (using JAR) expression may incur CNFE in later > evaluation > -- > > Key: SPARK-31312 > URL: https://issues.apache.org/jira/browse/SPARK-31312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0, 2.4.6 > > > In SPARK-26560, we ensured that a Hive UDF using a JAR is executed regardless of > the current thread context classloader. > [~cloud_fan] pointed out another potential issue in post-review of > SPARK-26560 - quoting the comment: > {quote} > Found a potential problem: here we call HiveSimpleUDF.dataType (which is a > lazy val), to force loading the class with the corrected class loader. > However, if the expression gets transformed later, which copies > HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class > loading, and at that time there is no guarantee that the corrected > classloader is used. > I think we should materialize the loaded class in HiveSimpleUDF. > {quote} > This JIRA issue is to track the effort of verifying the potential issue and > fixing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org