[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671326 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23245/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] kiszk commented on issue #27577: [DOC] add config naming guideline
kiszk commented on issue #27577: [DOC] add config naming guideline URL: https://github.com/apache/spark/pull/27577#issuecomment-586671458 Beyond this PR, can we create a sanity checker?
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/ Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671757 **[Test build #118488 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)** for PR 27398 at commit [`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/ Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671228 **[Test build #118488 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)** for PR 27398 at commit [`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776).
[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md
AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793 **[Test build #118489 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)** for PR 27565 at commit [`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848 Merged build finished. Test PASSed.
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
URL: https://github.com/apache/spark/pull/27596#discussion_r379880203

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala ##
@@ -51,7 +51,6 @@ trait TimeZoneAwareExpression extends Expression {
   /** Returns a copy of this expression with the specified timeZoneId. */
   def withTimeZone(timeZoneId: String): TimeZoneAwareExpression

-  @transient lazy val timeZone: TimeZone = DateTimeUtils.getTimeZone(timeZoneId.get)

Review comment: Finally, all expressions bound to the legacy `TimeZone` are gone.
[GitHub] [spark] SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586675997 **[Test build #118490 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118490/testReport)** for PR 24936 at commit [`f99a528`](https://github.com/apache/spark/commit/f99a528fafa9f59a37de8fd7b824c4168d7f4e26).
[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676070 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay
AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay URL: https://github.com/apache/spark/pull/24936#issuecomment-586676070 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA commented on issue #25987: [SPARK-29314][SS] Don't overwrite the metric "updated" of state operator to 0 if empty batch is run
SparkQA commented on issue #25987: [SPARK-29314][SS] Don't overwrite the metric "updated" of state operator to 0 if empty batch is run URL: https://github.com/apache/spark/pull/25987#issuecomment-586676448 **[Test build #118492 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118492/testReport)** for PR 25987 at commit [`c6993d0`](https://github.com/apache/spark/commit/c6993d04917a9327cf3e9e32276997c055da2935).
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449 **[Test build #118491 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)** for PR 27565 at commit [`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/ Test PASSed.
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881028

## File path: python/pyspark/sql/dataframe.py ##
@@ -2153,6 +2153,59 @@ def transform(self, func):
                 "should have been DataFrame." % type(result)
         return result

+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False

Review comment:
```
Failed example:
    df1.withColumn("col1", df1.id * 2).semanticHash() == df3.withColumn("col1", df3.id + 2).semanticHash()
Differences (ndiff with -expected +actual):
    - False
    + True
```
Now we have another unexpected result. (Note L2176 passed, which is expected.)
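The docstrings quoted above describe comparing plans while "tolerating the cosmetic differences such as attribute names". A minimal sketch of that idea in plain Python follows. It is a toy model, not Spark's implementation: the tuple-based plan representation and the helper names `canonical`, `same_semantics`, `semantic_hash`, and `with_column` are all hypothetical and chosen only for illustration.

```python
# Toy model (an assumption, not Spark code): a plan is (source, projections),
# where each projection is an (alias, expression) pair and an expression is a
# nested tuple such as ("*", ("attr", "id"), ("lit", 2)).

def with_column(plan, alias, expr):
    """Return a new plan with one more projected column."""
    source, projections = plan
    return (source, projections + ((alias, expr),))

def canonical(plan):
    """Drop output aliases so only the computation itself is compared."""
    source, projections = plan
    return (source, tuple(expr for _alias, expr in projections))

def same_semantics(left, right):
    """True when both plans have the same canonical form."""
    return canonical(left) == canonical(right)

def semantic_hash(plan):
    """Hash of the canonical form; equal canonical forms give equal hashes."""
    return hash(canonical(plan))

range100 = (("range", 100), ())
times2 = ("*", ("attr", "id"), ("lit", 2))
plus2 = ("+", ("attr", "id"), ("lit", 2))

p1 = with_column(range100, "col1", times2)
p2 = with_column(range100, "col1", times2)
p3 = with_column(range100, "col1", plus2)   # different computation
p4 = with_column(range100, "col0", times2)  # same computation, different name

print(same_semantics(p1, p2))  # same plan
print(same_semantics(p1, p3))  # different expression
print(same_semantics(p1, p4))  # differs only in the column name
print(semantic_hash(p1) == semantic_hash(p4))
```

In this toy model, plans that compare equal always hash equal, but the reverse is not guaranteed: two different canonical forms may collide, so a hash comparison alone cannot prove that two plans differ.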
[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866 **[Test build #118493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)** for PR 27565 at commit [`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).
[GitHub] [spark] maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
URL: https://github.com/apache/spark/pull/27592#discussion_r379881204

## File path: core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala ##
@@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] (
   def defaultValue: Option[T] = None

   override def toString: String = {
-    s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, public=$isPublic)"
+    s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " +
+      s"public=$isPublic, version = $version)"

Review comment: nit: `version=$version`
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/ Test PASSed.
[GitHub] [spark] maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder URL: https://github.com/apache/spark/pull/27592#issuecomment-586677152 For reviewers, can you add a screenshot of an HTML document generated by `gen-sql-config-docs.py` to the PR description?
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#discussion_r379881509 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ## @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper { } test("hours") { -var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 13) -assert(getHours(input, TimeZoneGMT) === 20) -input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 2) -assert(getHours(input, TimeZoneGMT) === 10) +var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST) +assert(getHours(input, zonePST) === 13) +assert(getHours(input, zoneGMT) === 20) +input = date(2015, 12, 8, 2, 7, 9, 0, zonePST) +assert(getHours(input, zonePST) === 2) +assert(getHours(input, zoneGMT) === 10) +input = date(10, 1, 1, 0, 0, 0, 0, zonePST) +assert(getHours(input, zonePST) === 0) Review comment: This is new test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year
MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year URL: https://github.com/apache/spark/pull/27596#discussion_r379881626 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala ## @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with Matchers with SQLHelper { } test("hours") { -var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 13) -assert(getHours(input, TimeZoneGMT) === 20) -input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST) -assert(getHours(input, TimeZonePST) === 2) -assert(getHours(input, TimeZoneGMT) === 10) +var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST) +assert(getHours(input, zonePST) === 13) +assert(getHours(input, zoneGMT) === 20) +input = date(2015, 12, 8, 2, 7, 9, 0, zonePST) +assert(getHours(input, zonePST) === 2) +assert(getHours(input, zoneGMT) === 10) +input = date(10, 1, 1, 0, 0, 0, 0, zonePST) +assert(getHours(input, zonePST) === 0) Review comment: Before the changes: ```sql spark-sql> select hour(timestamp '0010-01-01 00:00:00'); 23 ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder URL: https://github.com/apache/spark/pull/27592#discussion_r379881659 ## File path: core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala ## @@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] ( def defaultValue: Option[T] = None override def toString: String = { -s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, public=$isPublic)" +s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " + + s"public=$isPublic, version = $version)" Review comment: OK
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880677 ## File path: R/pkg/tests/fulltests/test_mllib_classification.R ## @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", { expect_equal(class(collect(predictions)$clicked[1]), "character") }) +test_that("spark.fmClassifier", { + df <- withColumn( +suppressWarnings(createDataFrame(iris)), +"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), "Not-Setosa") + ) + + model1 <- spark.fmClassifier( +df, Species ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3 + ) + + prediction1 <- predict(model1, df) + expect_is(prediction1, "SparkDataFrame") Review comment: Can we also check the prediction result here?
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#discussion_r379881773 ## File path: python/pyspark/sql/dataframe.py ## @@ -2153,6 +2153,59 @@ def transform(self, func): "should have been DataFrame." % type(result) return result +@since(3.1) +def sameSemantics(self, other): +""" +Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and +therefore return same results. + +.. note:: The equality comparison here is simplified by tolerating the cosmetic differences +such as attribute names. + +.. note::This API can compare both :class:`DataFrame`\\s very fast but can still return +`False` on the :class:`DataFrame` that return the same results, for instance, from +different plans. Such false negative semantic can be useful when caching as an example. + +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2)) +True +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2)) +False +>>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2)) +True +""" +if not isinstance(other, DataFrame): +raise ValueError("other parameter should be of DataFrame; however, got %s" + % type(other)) +return self._jdf.sameSemantics(other._jdf) + +@since(3.1) +def semanticHash(self): +""" +Returns a hash code of the logical query plan against this :class:`DataFrame`. + +.. note:: Unlike the standard hash code, the hash is calculated against the query plan +simplified by tolerating the cosmetic differences such as attribute names. 
+ +>>> df1 = spark.range(100) +>>> df2 = spark.range(100) +>>> df3 = spark.range(100) +>>> df4 = spark.range(100) +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df2.withColumn("col1", df2.id * 2).semanticHash() +True +>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \ +df3.withColumn("col1", df3.id + 2).semanticHash() +False Review comment: More tests:
```
>>> df1=spark.range(100)
>>> df2=spark.range(100)
>>> df3=spark.range(100)
>>> df11=df1.withColumn("col1", df1.id +1)
>>> df21=df2.withColumn("col1", df2.id -1)
>>> df31=df3.withColumn("col1", df3.id *2)
>>> df32=df3.withColumn("col1", df3.id +2)
>>> df33=df3.withColumn("col1", df3.id /2)
>>> df34=df3.withColumn("col1", df3.id -2)
>>> df11.semanticHash()
1855039936
>>> df21.semanticHash()
1855039936
>>> df31.semanticHash()
-1719131362
>>> df32.semanticHash()
-1719131362
>>> df32.semanticHash()
-1719131362
>>> df34.semanticHash()
-706037631
```
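The docstring under review describes a comparison that tolerates "cosmetic differences such as attribute names". A minimal pure-Python sketch of that idea follows. It is a toy model and not Spark's implementation: the `Attr`, `Lit`, `BinOp` and `Alias` classes and the `canonicalize`, `same_semantics` and `semantic_hash` helpers are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Attr:
    name: str     # cosmetic: the column name
    ordinal: int  # structural: position of the column in the input


@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class BinOp:
    op: str
    left: "Expr"
    right: "Expr"


@dataclass(frozen=True)
class Alias:
    child: "Expr"
    name: str     # cosmetic: the output column name


Expr = Union[Attr, Lit, BinOp, Alias]


def canonicalize(e: Expr) -> Expr:
    """Erase cosmetic names so only the structure of the expression remains."""
    if isinstance(e, Attr):
        return Attr("none", e.ordinal)
    if isinstance(e, Alias):
        return Alias(canonicalize(e.child), "none")
    if isinstance(e, BinOp):
        return BinOp(e.op, canonicalize(e.left), canonicalize(e.right))
    return e  # literals carry no cosmetic information


def same_semantics(a: Expr, b: Expr) -> bool:
    """Hypothetical analogue of sameSemantics: compare canonical forms."""
    return canonicalize(a) == canonicalize(b)


def semantic_hash(e: Expr) -> int:
    """Hypothetical analogue of semanticHash: hash the canonical form."""
    return hash(canonicalize(e))


# id * 2 AS col1, id * 2 AS col0, and id + 2 AS col1
times_two_col1 = Alias(BinOp("*", Attr("id", 0), Lit(2)), "col1")
times_two_col0 = Alias(BinOp("*", Attr("id", 0), Lit(2)), "col0")
plus_two_col1 = Alias(BinOp("+", Attr("id", 0), Lit(2)), "col1")

print(same_semantics(times_two_col1, times_two_col0))  # differs only in the alias name
print(same_semantics(times_two_col1, plus_two_col1))   # differs in the operator
```

In this toy model two expressions that compare equal after canonicalization always hash equally, but the converse is not guaranteed for any hash function, so equal hash values on their own do not establish that two plans have the same semantics.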
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379881557 ## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala ## @@ -0,0 +1,176 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.ml.r + +import org.apache.hadoop.fs.Path +import org.json4s._ +import org.json4s.JsonDSL._ +import org.json4s.jackson.JsonMethods._ + +import org.apache.spark.ml.{Pipeline, PipelineModel} +import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier} +import org.apache.spark.ml.feature.{IndexToString, RFormula} +import org.apache.spark.ml.r.RWrapperUtils._ +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset} + +private[r] class FMClassifierWrapper private ( +val pipeline: PipelineModel, +val features: Array[String], +val labels: Array[String]) extends MLWritable { + import FMClassifierWrapper._ + + private val fmClassificationModel: FMClassificationModel = +pipeline.stages(1).asInstanceOf[FMClassificationModel] + + lazy val rFeatures: Array[String] = if (fmClassificationModel.getFitIntercept) { +Array("(Intercept)") ++ features + } else { +features + } + + lazy val rCoefficients: Array[Double] = if (fmClassificationModel.getFitIntercept) { +Array(fmClassificationModel.intercept) ++ fmClassificationModel.linear.toArray + } else { +fmClassificationModel.linear.toArray + } + + lazy val rFactors = fmClassificationModel.factors.toArray + + lazy val numClasses: Int = fmClassificationModel.numClasses + + lazy val numFeatures: Int = fmClassificationModel.numFeatures + + lazy val factorSize: Int = fmClassificationModel.getFactorSize + + def transform(dataset: Dataset[_]): DataFrame = { +pipeline.transform(dataset) + .drop(PREDICTED_LABEL_INDEX_COL) + .drop(fmClassificationModel.getFeaturesCol) + .drop(fmClassificationModel.getLabelCol) + } + + override def write: MLWriter = new FMClassifierWrapper.FMClassifierWrapperWriter(this) +} + +private[r] object FMClassifierWrapper + extends MLReadable[FMClassifierWrapper] { + + val PREDICTED_LABEL_INDEX_COL = "pred_label_idx" + val PREDICTED_LABEL_COL = "prediction" + + def fit( // scalastyle:ignore + data: DataFrame, + formula: String, + factorSize: 
Int, + fitLinear: Boolean, + regParam: Double, + miniBatchFraction: Double, + initStd: Double, + maxIter: Int, + stepSize: Double, + tol: Double, + solver: String, + seed: String, + thresholds: Array[Double], + handleInvalid: String): FMClassifierWrapper = { + +val rFormula = new RFormula() + .setFormula(formula) + .setForceIndexLabel(true) + .setHandleInvalid(handleInvalid) +checkDataColumns(rFormula, data) +val rFormulaModel = rFormula.fit(data) + +val fitIntercept = rFormula.hasIntercept + +// get labels and feature names from output schema +val (features, labels) = getFeaturesAndLabels(rFormulaModel, data) + +// assemble and fit the pipeline +val fmc = new FMClassifier() + .setFactorSize(factorSize) + .setFitLinear(fitLinear) + .setRegParam(regParam) + .setMiniBatchFraction(miniBatchFraction) + .setInitStd(initStd) + .setMaxIter(maxIter) + .setTol(tol) + .setSolver(solver) + .setFitIntercept(fitIntercept) + .setFeaturesCol(rFormula.getFeaturesCol) + .setLabelCol(rFormula.getLabelCol) + .setPredictionCol(PREDICTED_LABEL_INDEX_COL) Review comment: add ```setStepSize```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586574117 Can one of the admins verify this patch?
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880577 ## File path: R/pkg/tests/fulltests/test_mllib_classification.R ## @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", { expect_equal(class(collect(predictions)$clicked[1]), "character") }) +test_that("spark.fmClassifier", { + df <- withColumn( +suppressWarnings(createDataFrame(iris)), +"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), "Not-Setosa") + ) + + model1 <- spark.fmClassifier( +df, Species ~ ., +regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3 + ) + + prediction1 <- predict(model1, df) + expect_is(prediction1, "SparkDataFrame") + expect_equal(summary(model1)$factorSize, 3) + + # Test model save/load + if (windows_with_hadoop()) { +modelPath <- tempfile(pattern = "spark-fmclassifier", fileext = ".tmp") +write.ml(model1, modelPath) +model2 <- read.ml(modelPath) + +expect_is(model2, "FMClassificationModel") + +prediction2 <- predict(model2, df) +expect_equal( + collect(drop(prediction1, c("rawPrediction", "probability"))), + collect(drop(prediction2, c("rawPrediction", "probability"))) +) + } +}) + + Review comment: nit: delete extra line This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880985 ## File path: R/pkg/R/mllib_classification.R ## @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = "NaiveBayesModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Classification Model +#' +#' \code{spark.fmClassifier} fits a factorization classification model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param thresholds in binary classification, in range [0, 1]. If the estimated probability of +#' class label 1 is > threshold, then predict 1, else 0. A high threshold +#' encourages the model to predict 0 more often; a low threshold encourages the +#' model to predict 1 more often. Note: Setting this with threshold p is +#' equivalent to setting thresholds c(1-p, p). 
+#' @param seed seed parameter for weights initialization. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and +#' label column of string type. +#' Supported options: "skip" (filter out rows with invalid data), +#' "error" (throw an error), "keep" (put invalid data in +#' a special additional bucket, at index numLabels). Default +#' is "error". +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines Classification Model. +#' @rdname spark.fmClassifier +#' @aliases spark.fmClassifier,SparkDataFrame,formula-method +#' @name spark.fmClassifier +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Classification Model +#' model <- spark.fmClassifier( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmClassifier since 3.0.0 +setMethod("spark.fmClassifier", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0, + miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0, + tol = 1e-6, solver = c("adamW", "gd"), thresholds = NULL, seed = NULL, + handleInvalid = c("error", "keep", "skip")) { Review comment: any reason why ```fitIntercept``` is not here? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379879918 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R Review comment: decisionTree.R -> fmClassifier.R
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880122 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmclasfier-example") + +# $example on:classification$ +# Load training data +df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm") +training <- df +test <- df + +# Fit a FM classification model +model <- spark.fmClassifier(df, label ~ features) + Review comment: add ```summary(model)``` as an example too? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880075 ## File path: examples/src/main/r/ml/fmClassifier.R ## @@ -0,0 +1,38 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +#http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +# To run this example use +# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R + +# Load SparkR library into your R session +library(SparkR) + +# Initialize SparkSession +sparkR.session(appName = "SparkR-ML-fmclasfier-example") + +# $example on:classification$ +# Load training data +df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm") +training <- df +test <- df + +# Fit a FM classification model +model <- spark.fmClassifier(df, label ~ features) + +# Prediction +predictions <- predict(model, test) Review comment: add ```head(predictions)```? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677693 Since this fix is trivial, it looks fine to me if the tests pass. cc: @srowen
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880740 ## File path: R/pkg/R/mllib_classification.R ## @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = "NaiveBayesModel", path = "character"), function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Classification Model +#' +#' \code{spark.fmClassifier} fits a factorization classification model against a SparkDataFrame. +#' Users can call \code{summary} to print a summary of the fitted model, \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' Only categorical data is supported. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param thresholds in binary classification, in range [0, 1]. If the estimated probability of +#' class label 1 is > threshold, then predict 1, else 0. A high threshold +#' encourages the model to predict 0 more often; a low threshold encourages the +#' model to predict 1 more often. Note: Setting this with threshold p is +#' equivalent to setting thresholds c(1-p, p). 
+#' @param seed seed parameter for weights initialization. +#' @param handleInvalid How to handle invalid data (unseen labels or NULL values) in features and +#' label column of string type. +#' Supported options: "skip" (filter out rows with invalid data), +#' "error" (throw an error), "keep" (put invalid data in +#' a special additional bucket, at index numLabels). Default +#' is "error". +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines Classification Model. +#' @rdname spark.fmClassifier +#' @aliases spark.fmClassifier,SparkDataFrame,formula-method +#' @name spark.fmClassifier +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Classification Model +#' model <- spark.fmClassifier( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmClassifier since 3.0.0 Review comment: 3.1.0? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
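The `thresholds` documentation quoted above states that a single threshold p is equivalent to thresholds c(1-p, p). The following pure-Python sketch checks that equivalence on a few values. It assumes the per-class rule is "predict the class with the largest probability divided by its threshold", which is my understanding of how Spark ML applies `thresholds` and is not confirmed by this thread; both function names are hypothetical.

```python
from typing import Sequence


def predict_with_threshold(prob1: float, p: float) -> int:
    """Binary rule from the doc: predict 1 if P(label = 1) > p, else 0."""
    return 1 if prob1 > p else 0


def predict_with_thresholds(probs: Sequence[float], thresholds: Sequence[float]) -> int:
    """Assumed multiclass rule: pick the class with the largest prob / threshold.

    Ties resolve to the lowest class index, which matches "else 0" above.
    Thresholds must be strictly positive in this sketch.
    """
    scaled = [prob / t for prob, t in zip(probs, thresholds)]
    return max(range(len(scaled)), key=scaled.__getitem__)


# Compare the two rules on a small grid of probabilities and thresholds.
for p in (0.2, 0.5, 0.8):
    for prob1 in (0.1, 0.3, 0.6, 0.9):
        single = predict_with_threshold(prob1, p)
        pair = predict_with_thresholds([1.0 - prob1, prob1], [1.0 - p, p])
        print(p, prob1, single, pair)
```

Under that assumption the equivalence follows from prob1 / p > (1 - prob1) / (1 - p), which rearranges to prob1 > p whenever 0 < p < 1.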
[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677656 ok to test
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880095

## File path: examples/src/main/r/ml/fmClassifier.R ##
@@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmclasfier-example")
+
+# $example on:classification$
+# Load training data
+df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
+training <- df
+test <- df
+
+# Fit a FM classification model
+model <- spark.fmClassifier(df, label ~ features)
+
+# Prediction
+predictions <- predict(model, test)
+# $example off:classification$

Review comment: add ```sparkR.session.stop()```?
[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR URL: https://github.com/apache/spark/pull/27570#discussion_r379880497

## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala ##
@@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier}
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMClassifierWrapper private (
+    val pipeline: PipelineModel,
+    val features: Array[String],
+    val labels: Array[String]) extends MLWritable {
+  import FMClassifierWrapper._
+
+  private val fmClassificationModel: FMClassificationModel =
+    pipeline.stages(1).asInstanceOf[FMClassificationModel]
+
+  lazy val rFeatures: Array[String] = if (fmClassificationModel.getFitIntercept) {
+    Array("(Intercept)") ++ features
+  } else {
+    features
+  }
+
+  lazy val rCoefficients: Array[Double] = if (fmClassificationModel.getFitIntercept) {
+    Array(fmClassificationModel.intercept) ++ fmClassificationModel.linear.toArray
+  } else {
+    fmClassificationModel.linear.toArray
+  }
+
+  lazy val rFactors = fmClassificationModel.factors.toArray
+
+  lazy val numClasses: Int = fmClassificationModel.numClasses
+
+  lazy val numFeatures: Int = fmClassificationModel.numFeatures
+
+  lazy val factorSize: Int = fmClassificationModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+    pipeline.transform(dataset)
+      .drop(PREDICTED_LABEL_INDEX_COL)
+      .drop(fmClassificationModel.getFeaturesCol)
+      .drop(fmClassificationModel.getLabelCol)
+  }
+
+  override def write: MLWriter = new FMClassifierWrapper.FMClassifierWrapperWriter(this)
+}
+
+private[r] object FMClassifierWrapper
+  extends MLReadable[FMClassifierWrapper] {
+
+  val PREDICTED_LABEL_INDEX_COL = "pred_label_idx"
+  val PREDICTED_LABEL_COL = "prediction"
+
+  def fit( // scalastyle:ignore
+      data: DataFrame,
+      formula: String,
+      factorSize: Int,
+      fitLinear: Boolean,
+      regParam: Double,
+      miniBatchFraction: Double,
+      initStd: Double,
+      maxIter: Int,
+      stepSize: Double,
+      tol: Double,
+      solver: String,
+      seed: String,
+      thresholds: Array[Double],
+      handleInvalid: String): FMClassifierWrapper = {
+
+    val rFormula = new RFormula()
+      .setFormula(formula)
+      .setForceIndexLabel(true)
+      .setHandleInvalid(handleInvalid)
+    checkDataColumns(rFormula, data)
+    val rFormulaModel = rFormula.fit(data)
+
+    val fitIntercept = rFormula.hasIntercept
+
+    // get labels and feature names from output schema
+    val (features, labels) = getFeaturesAndLabels(rFormulaModel, data)
+
+    // assemble and fit the pipeline
+    val fmc = new FMClassifier()
+      .setFactorSize(factorSize)
+      .setFitLinear(fitLinear)
+      .setRegParam(regParam)
+      .setMiniBatchFraction(miniBatchFraction)
+      .setInitStd(initStd)
+      .setMaxIter(maxIter)
+      .setTol(tol)
+      .setSolver(solver)
+      .setFitIntercept(fitIntercept)
+      .setFeaturesCol(rFormula.getFeaturesCol)
+      .setLabelCol(rFormula.getLabelCol)
+      .setPredictionCol(PREDICTED_LABEL_INDEX_COL)
+
+    if (seed != null) {

Review comment: ```if (seed != null && seed.length > 0)```?
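The guard suggested in the review comment above can be illustrated outside Spark. The following is a minimal Python sketch, assuming the seed arrives from the R side as a string that may be missing or empty and is then converted to an integer (the quoted diff does not show that conversion). The name `parse_seed` is hypothetical and is not part of Spark.

```python
from typing import Optional


def parse_seed(seed: Optional[str]) -> Optional[int]:
    """Return the seed as an int, or None when no usable seed was passed.

    Hypothetical helper: mirrors the suggested
    `seed != null && seed.length > 0` check, so an empty string is
    treated the same way as a missing seed.
    """
    if seed is not None and len(seed) > 0:
        return int(seed)
    return None


if __name__ == "__main__":
    print(parse_seed(None))  # missing seed
    print(parse_seed(""))    # empty string, handled like a missing seed
    print(parse_seed("42"))  # usable seed
```

With only the null check, the empty-string case would reach the integer conversion and fail there, which is presumably why the reviewer asks for the extra length test.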
[GitHub] [spark] SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677822 **[Test build #118494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)** for PR 27594 at commit [`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677913 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23250/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912 Merged build finished. Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677913 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23250/ Test PASSed.
[GitHub] [spark] maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments
maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments URL: https://github.com/apache/spark/pull/27495#discussion_r379882153

## File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/comments.sql ##
@@ -47,4 +45,5 @@ Now just one deep...
 */
 'deeply nested example' AS sixth;
 --QUERY-DELIMITER-END
-/* and this is the end of the file */
+-- [SPARK-30824] Support submit sql content only contains comments.

Review comment: Is this an ANSI-related issue? https://issues.apache.org/jira/browse/SPARK-30824
[GitHub] [spark] maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments
maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments URL: https://github.com/apache/spark/pull/27495#issuecomment-586678147 Looks fine now except for one comment.
[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset
liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset URL: https://github.com/apache/spark/pull/27565#discussion_r379882384

## File path: python/pyspark/sql/dataframe.py ##
@@ -2153,6 +2153,59 @@ def transform(self, func):
                                               "should have been DataFrame." % type(result)
         return result
 
+    @since(3.1)
+    def sameSemantics(self, other):
+        """
+        Returns `True` when the logical query plans inside both :class:`DataFrame`\\s are equal and
+        therefore return same results.
+
+        .. note:: The equality comparison here is simplified by tolerating the cosmetic differences
+            such as attribute names.
+
+        .. note::This API can compare both :class:`DataFrame`\\s very fast but can still return
+            `False` on the :class:`DataFrame` that return the same results, for instance, from
+            different plans. Such false negative semantic can be useful when caching as an example.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+        True
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+        False
+        >>> df1.withColumn("col1", df1.id * 2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+        True
+        """
+        if not isinstance(other, DataFrame):
+            raise ValueError("other parameter should be of DataFrame; however, got %s"
+                             % type(other))
+        return self._jdf.sameSemantics(other._jdf)
+
+    @since(3.1)
+    def semanticHash(self):
+        """
+        Returns a hash code of the logical query plan against this :class:`DataFrame`.
+
+        .. note:: Unlike the standard hash code, the hash is calculated against the query plan
+            simplified by tolerating the cosmetic differences such as attribute names.
+
+        >>> df1 = spark.range(100)
+        >>> df2 = spark.range(100)
+        >>> df3 = spark.range(100)
+        >>> df4 = spark.range(100)
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df2.withColumn("col1", df2.id * 2).semanticHash()
+        True
+        >>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+            df3.withColumn("col1", df3.id + 2).semanticHash()
+        False

Review comment: Same behavior for dataframe from `spark.read.load()`
```
>>> df4=spark.read.load(csv_file_path, format="csv", inferSchema="true", header="true")
>>> df4.schema
StructType(List(StructField(bool_col,BooleanType,true),StructField(float_col,DoubleType,true),StructField(double_col,DoubleType,true),StructField(int_col,IntegerType,true),StructField(long_col,IntegerType,true)))
>>> df4.withColumn("col1", df4.int_col *2).semanticHash()
-1746346451
>>> df4.withColumn("col1", df4.int_col +2).semanticHash()
-1746346451
```
[GitHub] [spark] SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678746 **[Test build #118494 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)** for PR 27594 at commit [`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678766 Merged build finished. Test PASSed.
[GitHub] [spark] SparkQA removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
SparkQA removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586677822 **[Test build #118494 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)** for PR 27594 at commit [`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678768 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118494/ Test PASSed.
[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678768 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118494/ Test PASSed.
[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest URL: https://github.com/apache/spark/pull/27594#issuecomment-586678766 Merged build finished. Test PASSed.
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882027

## File path: R/pkg/R/mllib_regression.R ##
@@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+
+#' Factorization Machines Regression Model Model

Review comment: nit: ```Model Model```
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882145

## File path: R/pkg/R/mllib_regression.R ##
@@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+
+#' Factorization Machines Regression Model Model
+#'
+#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame.
+#' Users can call \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.

Review comment: I guess also mention ```summary``` here?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882502

## File path: R/pkg/tests/fulltests/test_mllib_regression.R ##
@@ -551,4 +551,33 @@ test_that("spark.survreg", {
   }
 })
 
+
+test_that("spark.fmRegressor", {
+  df <- suppressWarnings(createDataFrame(iris))
+
+  model <- spark.fmRegressor(
+    df, Sepal_Width ~ .,
+    regParam = 0.01, maxIter = 10, fitLinear = TRUE
+  )
+
+  prediction1 <- predict(model, df)
+  expect_is(prediction1, "SparkDataFrame")
+
+  # Test model save/load
+  if (windows_with_hadoop()) {
+    modelPath <- tempfile(pattern = "spark-fmregressor", fileext = ".tmp")
+    write.ml(model, modelPath)
+    model2 <- read.ml(modelPath)
+
+    expect_is(model2, "FMRegressionModel")
+
+    prediction2 <- predict(model2, df)
+    expect_equal(
+      collect(prediction1),
+      collect(prediction2)
+    )
+  }
+})
+
+

Review comment: nit: delete extra line
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379881943

## File path: R/pkg/R/mllib_regression.R ##
@@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+

Review comment: nit: delete extra line?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882615

## File path: examples/src/main/r/ml/fmRegressor.R ##
@@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R

Review comment: change this ```decisionTree.R```?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882600

## File path: examples/src/main/r/ml/fmRegressor.R ##
@@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmRegressor-example")
+
+# $example on
+# Load training data
+df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm")
+training_test <- randomSplit(df, c(0.7, 0.3))
+training <- training_test[[1]]
+test <- training_test[[2]]
+
+

Review comment: nit: delete extra line
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379883014

## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala ##

@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMRegressorWrapper private (
+    val pipeline: PipelineModel,
+    val features: Array[String]) extends MLWritable {
+  import FMRegressorWrapper._
+
+  private val fmRegressionModel: FMRegressionModel =
+    pipeline.stages(1).asInstanceOf[FMRegressionModel]
+
+  lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) {
+    Array("(Intercept)") ++ features
+  } else {
+    features
+  }
+
+  lazy val rCoefficients: Array[Double] = if (fmRegressionModel.getFitIntercept) {
+    Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray
+  } else {
+    fmRegressionModel.linear.toArray
+  }
+
+  lazy val rFactors = fmRegressionModel.factors.toArray
+
+  lazy val numFeatures: Int = fmRegressionModel.numFeatures
+
+  lazy val factorSize: Int = fmRegressionModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+    pipeline.transform(dataset)
+      .drop(fmRegressionModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new FMRegressorWrapper.FMRegressorWrapperWriter(this)
+}
+
+private[r] object FMRegressorWrapper
+  extends MLReadable[FMRegressorWrapper] {
+
+  def fit( // scalastyle:ignore
+      data: DataFrame,
+      formula: String,
+      factorSize: Int,
+      fitLinear: Boolean,
+      regParam: Double,
+      miniBatchFraction: Double,
+      initStd: Double,
+      maxIter: Int,
+      stepSize: Double,
+      tol: Double,
+      solver: String,
+      seed: String,
+      stringIndexerOrderType: String): FMRegressorWrapper = {
+
+    val rFormula = new RFormula()
+      .setFormula(formula)
+      .setStringIndexerOrderType(stringIndexerOrderType)
+    checkDataColumns(rFormula, data)
+    val rFormulaModel = rFormula.fit(data)
+
+    val fitIntercept = rFormula.hasIntercept
+
+    // get feature names from output schema
+    val schema = rFormulaModel.transform(data).schema
+    val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
+      .attributes.get
+    val features = featureAttrs.map(_.name.get)
+
+    // assemble and fit the pipeline
+    val fmr = new FMRegressor()
+      .setFactorSize(factorSize)
+      .setFitLinear(fitLinear)
+      .setRegParam(regParam)
+      .setMiniBatchFraction(miniBatchFraction)
+      .setInitStd(initStd)
+      .setMaxIter(maxIter)
+      .setTol(tol)
+      .setSolver(solver)
+      .setFitIntercept(fitIntercept)
+      .setFeaturesCol(rFormula.getFeaturesCol)
+
+    if (seed != null) {

Review comment:
also check ```seed.length > 0```?

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
With regards, Apache Git Services
- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
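The guard suggested in the review comment above can be sketched in isolation. This is a minimal sketch, assuming the seed reaches the wrapper from R as either null or a string; `SeedGuard` and `parseSeed` are hypothetical names introduced for illustration and are not part of the PR. Since `"".toLong` throws a `NumberFormatException`, a check for null alone would let an empty string fail at parse time.

```scala
object SeedGuard {
  // Hypothetical helper (not in the PR): treats both null and "" as "no seed supplied".
  // Option(null) is None, and the filter drops the empty string before toLong runs.
  def parseSeed(seed: String): Option[Long] =
    Option(seed).filter(_.length > 0).map(_.toLong)
}
```

In the wrapper, something like `SeedGuard.parseSeed(seed).foreach(s => fmr.setSeed(s))` could then stand in for the bare null check; this illustrates the idea only and is not necessarily the change that was merged.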
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882860

## File path: examples/src/main/r/ml/fmRegressor.R ##

@@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmRegressor-example")
+
+# $example on
+# Load training data
+df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm")
+training_test <- randomSplit(df, c(0.7, 0.3))
+training <- training_test[[1]]
+test <- training_test[[2]]
+
+
+# Fit a FM regression model
+model <- spark.fmRegressor(training, label ~ features)
+
+# Prediction
+predictions <- predict(model, test)

Review comment:
same as the classifier example, I guess add ```summary(model)```, ```head(predictions)``` and also add ```sparkR.session.stop()``` in the end?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882920

## File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala ##

@@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMRegressorWrapper private (
+    val pipeline: PipelineModel,
+    val features: Array[String]) extends MLWritable {
+  import FMRegressorWrapper._
+
+  private val fmRegressionModel: FMRegressionModel =
+    pipeline.stages(1).asInstanceOf[FMRegressionModel]
+
+  lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) {
+    Array("(Intercept)") ++ features
+  } else {
+    features
+  }
+
+  lazy val rCoefficients: Array[Double] = if (fmRegressionModel.getFitIntercept) {
+    Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray
+  } else {
+    fmRegressionModel.linear.toArray
+  }
+
+  lazy val rFactors = fmRegressionModel.factors.toArray
+
+  lazy val numFeatures: Int = fmRegressionModel.numFeatures
+
+  lazy val factorSize: Int = fmRegressionModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+    pipeline.transform(dataset)
+      .drop(fmRegressionModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new FMRegressorWrapper.FMRegressorWrapperWriter(this)
+}
+
+private[r] object FMRegressorWrapper
+  extends MLReadable[FMRegressorWrapper] {
+
+  def fit( // scalastyle:ignore
+      data: DataFrame,
+      formula: String,
+      factorSize: Int,
+      fitLinear: Boolean,
+      regParam: Double,
+      miniBatchFraction: Double,
+      initStd: Double,
+      maxIter: Int,
+      stepSize: Double,
+      tol: Double,
+      solver: String,
+      seed: String,
+      stringIndexerOrderType: String): FMRegressorWrapper = {
+
+    val rFormula = new RFormula()
+      .setFormula(formula)
+      .setStringIndexerOrderType(stringIndexerOrderType)
+    checkDataColumns(rFormula, data)
+    val rFormulaModel = rFormula.fit(data)
+
+    val fitIntercept = rFormula.hasIntercept
+
+    // get feature names from output schema
+    val schema = rFormulaModel.transform(data).schema
+    val featureAttrs = AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
+      .attributes.get
+    val features = featureAttrs.map(_.name.get)
+
+    // assemble and fit the pipeline
+    val fmr = new FMRegressor()
+      .setFactorSize(factorSize)
+      .setFitLinear(fitLinear)
+      .setRegParam(regParam)
+      .setMiniBatchFraction(miniBatchFraction)
+      .setInitStd(initStd)
+      .setMaxIter(maxIter)
+      .setTol(tol)
+      .setSolver(solver)
+      .setFitIntercept(fitIntercept)
+      .setFeaturesCol(rFormula.getFeaturesCol)

Review comment:
add ```setStepSize```?
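The `setStepSize` suggestion in the review comment above can be sketched as follows. This is a minimal sketch rather than the PR's final code: it assumes Spark MLlib 3.x is on the classpath, and `StepSizeSketch`, `FmParams`, and `buildRegressor` are hypothetical names introduced only for illustration. It shows a shortened version of the builder chain from the diff with `stepSize` forwarded, since the quoted chain accepts `stepSize` as an argument of `fit` but does not appear to pass it on.

```scala
import org.apache.spark.ml.regression.FMRegressor

object StepSizeSketch {
  // Hypothetical container for a subset of the arguments that `fit` receives.
  final case class FmParams(
      factorSize: Int,
      maxIter: Int,
      stepSize: Double,
      tol: Double,
      solver: String)

  // Builds the estimator in the style of the diff, additionally forwarding stepSize.
  def buildRegressor(p: FmParams): FMRegressor =
    new FMRegressor()
      .setFactorSize(p.factorSize)
      .setMaxIter(p.maxIter)
      .setStepSize(p.stepSize) // the call the review comment asks for
      .setTol(p.tol)
      .setSolver(p.solver)
}
```

Without this call the estimator would keep its own default step size, whatever value the R caller supplied.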
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882476

## File path: R/pkg/tests/fulltests/test_mllib_regression.R ##

@@ -551,4 +551,33 @@ test_that("spark.survreg", {
   }
 })
+
+test_that("spark.fmRegressor", {
+  df <- suppressWarnings(createDataFrame(iris))
+
+  model <- spark.fmRegressor(
+    df, Sepal_Width ~ .,
+    regParam = 0.01, maxIter = 10, fitLinear = TRUE
+  )
+
+  prediction1 <- predict(model, df)
+  expect_is(prediction1, "SparkDataFrame")

Review comment:
I guess we may want to check the predict result too?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882343

## File path: R/pkg/R/mllib_regression.R ##

@@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c
           function(object, path, overwrite = FALSE) {
             write_internal(object, path, overwrite)
           })
+
+
+#' Factorization Machines Regression Model Model
+#'
+#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame.
+#' Users can call \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models.
+#'
+#' @param data a \code{SparkDataFrame} of observations and labels for model fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently only a few formula
+#'                operators are supported, including '~', '.', ':', '+', and '-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula?
+#' @param regParam the regularization parameter.
+#' @param miniBatchFraction the mini-batch fraction parameter.
+#' @param initStd the standard deviation of initial coefficients.
+#' @param maxIter maximum iteration number.
+#' @param stepSize stepSize parameter.
+#' @param tol convergence tolerance of iterations.
+#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW".
+#' @param seed seed parameter for weights initialization.
+#' @param stringIndexerOrderType how to order categories of a string feature column. This is used to
+#'                               decide the base level of a string feature as the last category
+#'                               after ordering is dropped when encoding strings. Supported options
+#'                               are "frequencyDesc", "frequencyAsc", "alphabetDesc", and
+#'                               "alphabetAsc". The default value is "frequencyDesc". When the
+#'                               ordering is set to "alphabetDesc", this drops the same category
+#'                               as R when encoding strings.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines Regression Model.
+#'
+#' @rdname spark.fmRegressor
+#' @aliases spark.fmRegressor,SparkDataFrame,formula-method
+#' @name spark.fmRegressor
+#' @seealso \link{read.ml}
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm")
+#'
+#' # fit Factorization Machines Regression Model
+#' model <- spark.fmRegressor(
+#'   df, label ~ features,
+#'   regParam = 0.01, maxIter = 10, fitLinear = TRUE
+#' )
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.fmRegressor since 3.1.0
+setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = "formula"),
+          function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0,
+                   miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0,
+                   tol = 1e-6, solver = c("adamW", "gd"), seed = NULL,
+                   stringIndexerOrderType = c("frequencyDesc", "frequencyAsc",
+                                              "alphabetDesc", "alphabetAsc")) {
+
+            formula <- paste(deparse(formula), collapse = "")
+
+            if (!is.null(seed)) {
+              seed <- as.character(as.integer(seed))
+            }
+
+            solver <- match.arg(solver)
+            stringIndexerOrderType <- match.arg(stringIndexerOrderType)
+
+            jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper",
+                                "fit",
+                                data@sdf,
+                                formula,
+                                as.integer(factorSize),
+                                as.logical(fitLinear),
+                                as.numeric(regParam),
+                                as.numeric(miniBatchFraction),
+                                as.numeric(initStd),
+                                as.integer(maxIter),
+                                as.numeric(stepSize),
+                                as.numeric(tol),
+                                solver,
+                                seed,
+                                stringIndexerOrderType)
+            new("FMRegressionModel", jobj = jobj)
+          })
+
+

Review comment:
nit: delete extra line?
[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR
huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR URL: https://github.com/apache/spark/pull/27571#discussion_r379882358 ## File path: R/pkg/R/mllib_regression.R ## @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = "AFTSurvivalRegressionModel", path = "c function(object, path, overwrite = FALSE) { write_internal(object, path, overwrite) }) + + +#' Factorization Machines Regression Model Model +#' +#' \code{spark.fmRegressor} fits a factorization regression model against a SparkDataFrame. +#' Users can call \code{predict} to make +#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load fitted models. +#' +#' @param data a \code{SparkDataFrame} of observations and labels for model fitting. +#' @param formula a symbolic description of the model to be fitted. Currently only a few formula +#'operators are supported, including '~', '.', ':', '+', and '-'. +#' @param factorSize dimensionality of the factors. +#' @param fitLinear whether to fit linear term. # TODO Can we express this with formula? +#' @param regParam the regularization parameter. +#' @param miniBatchFraction the mini-batch fraction parameter. +#' @param initStd the standard deviation of initial coefficients. +#' @param maxIter maximum iteration number. +#' @param stepSize stepSize parameter. +#' @param tol convergence tolerance of iterations. +#' @param solver solver parameter, supported options: "gd" (minibatch gradient descent) or "adamW". +#' @param seed seed parameter for weights initialization. +#' @param stringIndexerOrderType how to order categories of a string feature column. This is used to +#' decide the base level of a string feature as the last category +#' after ordering is dropped when encoding strings. Supported options +#' are "frequencyDesc", "frequencyAsc", "alphabetDesc", and +#' "alphabetAsc". The default value is "frequencyDesc". 
When the +#' ordering is set to "alphabetDesc", this drops the same category +#' as R when encoding strings. +#' @param ... additional arguments passed to the method. +#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines Regression Model. +#' +#' @rdname spark.fmRegressor +#' @aliases spark.fmRegressor,SparkDataFrame,formula-method +#' @name spark.fmRegressor +#' @seealso \link{read.ml} +#' @examples +#' \dontrun{ +#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = "libsvm") +#' +#' # fit Factorization Machines Regression Model +#' model <- spark.fmRegressor( +#'df, label ~ features, +#'regParam = 0.01, maxIter = 10, fitLinear = TRUE +#' ) +#' +#' # get the summary of the model +#' summary(model) +#' +#' # make predictions +#' predictions <- predict(model, df) +#' +#' # save and load the model +#' path <- "path/to/model" +#' write.ml(model, path) +#' savedModel <- read.ml(path) +#' summary(savedModel) +#' } +#' @note spark.fmRegressor since 3.1.0 +setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = "formula"), + function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 0.0, + miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, stepSize=1.0, + tol = 1e-6, solver = c("adamW", "gd"), seed = NULL, + stringIndexerOrderType = c("frequencyDesc", "frequencyAsc", + "alphabetDesc", "alphabetAsc")) { + +formula <- paste(deparse(formula), collapse = "") + +if (!is.null(seed)) { + seed <- as.character(as.integer(seed)) +} + +solver <- match.arg(solver) +stringIndexerOrderType <- match.arg(stringIndexerOrderType) + +jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper", +"fit", +data@sdf, +formula, +as.integer(factorSize), +as.logical(fitLinear), +as.numeric(regParam), +as.numeric(miniBatchFraction), +as.numeric(initStd), +as.integer(maxIter), +as.numeric(stepSize), +as.numeric(tol), +solver, +seed, +stringIndexerOrderType) +new("FMRegressionModel", jobj = jobj) + }) + + +# Returns the 
summary of a FM Regression model produced by \code{spark.fmRegressor} + +#' @param obj