date:20200215

[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] 
Document event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671326
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23245/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] kiszk commented on issue #27577: [DOC] add config naming guideline

2020-02-15 Thread GitBox

kiszk commented on issue #27577: [DOC] add config naming guideline
URL: https://github.com/apache/spark/pull/27577#issuecomment-586671458
 
 
   Beyond this PR, can we create a sanity checker?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document 
event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

SparkQA commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event 
log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671757
 
 
   **[Test build #118488 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)**
 for PR 27398 at commit 
[`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] 
Document event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671777
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118488/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

SparkQA removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] 
Document event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671228
 
 
   **[Test build #118488 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118488/testReport)**
 for PR 27398 at commit 
[`803663f`](https://github.com/apache/spark/commit/803663fd3e8f6d73ee5731f0f5a0228e4f65d776).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] 
Document event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document event log compaction into new section of monitoring.md

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27398: [SPARK-30481][DOCS][FOLLOWUP] Document 
event log compaction into new section of monitoring.md
URL: https://github.com/apache/spark/pull/27398#issuecomment-586671776
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674793
 
 
   **[Test build #118489 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118489/testReport)**
 for PR 27565 at commit 
[`a1d4ba1`](https://github.com/apache/spark/commit/a1d4ba1f33c81435da84cbeee3c7e579e5dd8061).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674850
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23246/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586674848
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year

2020-02-15 Thread GitBox

MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time 
components before 1582 year
URL: https://github.com/apache/spark/pull/27596#discussion_r379880203
 
 

 ##
 File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala
 ##
 @@ -51,7 +51,6 @@ trait TimeZoneAwareExpression extends Expression {
   /** Returns a copy of this expression with the specified timeZoneId. */
   def withTimeZone(timeZoneId: String): TimeZoneAwareExpression
 
-  @transient lazy val timeZone: TimeZone = 
DateTimeUtils.getTimeZone(timeZoneId.get)
 
 Review comment:
   Finally, all expressions bound on legacy `TimeZone` have been gone.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay

2020-02-15 Thread GitBox

SparkQA commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding 
number of rows later than watermark plus allowed delay
URL: https://github.com/apache/spark/pull/24936#issuecomment-586675997
 
 
   **[Test build #118490 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118490/testReport)**
 for PR 24936 at commit 
[`f99a528`](https://github.com/apache/spark/commit/f99a528fafa9f59a37de8fd7b824c4168d7f4e26).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new 
metric regarding number of rows later than watermark plus allowed delay
URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric 
regarding number of rows later than watermark plus allowed delay
URL: https://github.com/apache/spark/pull/24936#issuecomment-586676073
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23247/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #24936: [SPARK-24634][SS] Add a new 
metric regarding number of rows later than watermark plus allowed delay
URL: https://github.com/apache/spark/pull/24936#issuecomment-586676070
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric regarding number of rows later than watermark plus allowed delay

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #24936: [SPARK-24634][SS] Add a new metric 
regarding number of rows later than watermark plus allowed delay
URL: https://github.com/apache/spark/pull/24936#issuecomment-586676070
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #25987: [SPARK-29314][SS] Don't overwrite the metric "updated" of state operator to 0 if empty batch is run

2020-02-15 Thread GitBox

SparkQA commented on issue #25987: [SPARK-29314][SS] Don't overwrite the metric 
"updated" of state operator to 0 if empty batch is run
URL: https://github.com/apache/spark/pull/25987#issuecomment-586676448
 
 
   **[Test build #118492 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118492/testReport)**
 for PR 25987 at commit 
[`c6993d0`](https://github.com/apache/spark/commit/c6993d04917a9327cf3e9e32276997c055da2935).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676449
 
 
   **[Test build #118491 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118491/testReport)**
 for PR 27565 at commit 
[`ddba494`](https://github.com/apache/spark/commit/ddba494405ea7ce79e03a8e93f97cd64a3a2acfc).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676502
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676503
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23248/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

liangz1 commented on a change in pull request #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881028
 
 

 ##
 File path: python/pyspark/sql/dataframe.py
 ##
 @@ -2153,6 +2153,59 @@ def transform(self, func):
   "should have been DataFrame." % 
type(result)
 return result
 
+@since(3.1)
+def sameSemantics(self, other):
+"""
+Returns `True` when the logical query plans inside both 
:class:`DataFrame`\\s are equal and
+therefore return same results.
+
+.. note:: The equality comparison here is simplified by tolerating the 
cosmetic differences
+such as attribute names.
+
+.. note::This API can compare both :class:`DataFrame`\\s very fast but 
can still return
+`False` on the :class:`DataFrame` that return the same results, 
for instance, from
+different plans. Such false negative semantic can be useful when 
caching as an example.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+True
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+False
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+True
+"""
+if not isinstance(other, DataFrame):
+raise ValueError("other parameter should be of DataFrame; however, 
got %s"
+ % type(other))
+return self._jdf.sameSemantics(other._jdf)
+
+@since(3.1)
+def semanticHash(self):
+"""
+Returns a hash code of the logical query plan against this 
:class:`DataFrame`.
+
+.. note:: Unlike the standard hash code, the hash is calculated 
against the query plan
+simplified by tolerating the cosmetic differences such as 
attribute names.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df2.withColumn("col1", df2.id * 2).semanticHash()
+True
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df3.withColumn("col1", df3.id + 2).semanticHash()
+False
 
 Review comment:
   ```
   Failed example:
   df1.withColumn("col1", df1.id * 2).semanticHash() == 
df3.withColumn("col1", df3.id + 2).semanticHash()
   Differences (ndiff with -expected +actual):
   - False
   + True
   ```
   Now we have another unexpected result. (Note L2176 passed, which is 
expected.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

SparkQA commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676866
 
 
   **[Test build #118493 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118493/testReport)**
 for PR 27565 at commit 
[`61f7ca1`](https://github.com/apache/spark/commit/61f7ca11af14d399d0e2512c51c2f37c4aa4a38f).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder

2020-02-15 Thread GitBox

maropu commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] 
Add version property for ConfigEntry and ConfigBuilder
URL: https://github.com/apache/spark/pull/27592#discussion_r379881204
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
 ##
 @@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] (
   def defaultValue: Option[T] = None
 
   override def toString: String = {
-s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, 
public=$isPublic)"
+s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " +
+  s"public=$isPublic, version = $version)"
 
 Review comment:
   nit: `version=$version`


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676945
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 
'sameSemantics' and 'sementicHash' methods in Dataset
URL: https://github.com/apache/spark/pull/27565#issuecomment-586676946
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23249/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder

2020-02-15 Thread GitBox

maropu commented on issue #27592: [SPARK-30840][CORE][SQL] Add version property 
for ConfigEntry and ConfigBuilder
URL: https://github.com/apache/spark/pull/27592#issuecomment-586677152
 
 
   For reviewers, can you add a screenshot of a html document generated by 
`gen-sql-config-docs.py` in the PR description?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year

2020-02-15 Thread GitBox

MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time 
components before 1582 year
URL: https://github.com/apache/spark/pull/27596#discussion_r379881509
 
 

 ##
 File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
 ##
 @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with 
Matchers with SQLHelper {
   }
 
   test("hours") {
-var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST)
-assert(getHours(input, TimeZonePST) === 13)
-assert(getHours(input, TimeZoneGMT) === 20)
-input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST)
-assert(getHours(input, TimeZonePST) === 2)
-assert(getHours(input, TimeZoneGMT) === 10)
+var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST)
+assert(getHours(input, zonePST) === 13)
+assert(getHours(input, zoneGMT) === 20)
+input = date(2015, 12, 8, 2, 7, 9, 0, zonePST)
+assert(getHours(input, zonePST) === 2)
+assert(getHours(input, zoneGMT) === 10)
+input = date(10, 1, 1, 0, 0, 0, 0, zonePST)
+assert(getHours(input, zonePST) === 0)
 
 Review comment:
   This is new test.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time components before 1582 year

2020-02-15 Thread GitBox

MaxGekk commented on a change in pull request #27596: [WIP] Fix getting of time 
components before 1582 year
URL: https://github.com/apache/spark/pull/27596#discussion_r379881626
 
 

 ##
 File path: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/DateTimeUtilsSuite.scala
 ##
 @@ -290,32 +293,38 @@ class DateTimeUtilsSuite extends SparkFunSuite with 
Matchers with SQLHelper {
   }
 
   test("hours") {
-var input = date(2015, 3, 18, 13, 2, 11, 0, TimeZonePST)
-assert(getHours(input, TimeZonePST) === 13)
-assert(getHours(input, TimeZoneGMT) === 20)
-input = date(2015, 12, 8, 2, 7, 9, 0, TimeZonePST)
-assert(getHours(input, TimeZonePST) === 2)
-assert(getHours(input, TimeZoneGMT) === 10)
+var input = date(2015, 3, 18, 13, 2, 11, 0, zonePST)
+assert(getHours(input, zonePST) === 13)
+assert(getHours(input, zoneGMT) === 20)
+input = date(2015, 12, 8, 2, 7, 9, 0, zonePST)
+assert(getHours(input, zonePST) === 2)
+assert(getHours(input, zoneGMT) === 10)
+input = date(10, 1, 1, 0, 0, 0, 0, zonePST)
+assert(getHours(input, zonePST) === 0)
 
 Review comment:
   Before the changes:
   ```sql
   spark-sql> select hour(timestamp '0010-01-01 00:00:00');
   23
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder

2020-02-15 Thread GitBox

beliefer commented on a change in pull request #27592: [SPARK-30840][CORE][SQL] 
Add version property for ConfigEntry and ConfigBuilder
URL: https://github.com/apache/spark/pull/27592#discussion_r379881659
 
 

 ##
 File path: 
core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
 ##
 @@ -74,7 +76,8 @@ private[spark] abstract class ConfigEntry[T] (
   def defaultValue: Option[T] = None
 
   override def toString: String = {
-s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, 
public=$isPublic)"
+s"ConfigEntry(key=$key, defaultValue=$defaultValueString, doc=$doc, " +
+  s"public=$isPublic, version = $version)"
 
 Review comment:
   OK


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880677
 
 

 ##
 File path: R/pkg/tests/fulltests/test_mllib_classification.R
 ##
 @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", {
   expect_equal(class(collect(predictions)$clicked[1]), "character")
 })
 
+test_that("spark.fmClassifier", {
+  df <- withColumn(
+suppressWarnings(createDataFrame(iris)),
+"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), 
"Not-Setosa")
+  )
+
+  model1 <- spark.fmClassifier(
+df,  Species ~ .,
+regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3
+  )
+
+  prediction1 <- predict(model1, df)
+  expect_is(prediction1, "SparkDataFrame")
 
 Review comment:
Can we also check the predict result here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

liangz1 commented on a change in pull request #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379881773
 
 

 ##
 File path: python/pyspark/sql/dataframe.py
 ##
 @@ -2153,6 +2153,59 @@ def transform(self, func):
   "should have been DataFrame." % 
type(result)
 return result
 
+@since(3.1)
+def sameSemantics(self, other):
+"""
+Returns `True` when the logical query plans inside both 
:class:`DataFrame`\\s are equal and
+therefore return same results.
+
+.. note:: The equality comparison here is simplified by tolerating the 
cosmetic differences
+such as attribute names.
+
+.. note::This API can compare both :class:`DataFrame`\\s very fast but 
can still return
+`False` on the :class:`DataFrame` that return the same results, 
for instance, from
+different plans. Such false negative semantic can be useful when 
caching as an example.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+True
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+False
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+True
+"""
+if not isinstance(other, DataFrame):
+raise ValueError("other parameter should be of DataFrame; however, 
got %s"
+ % type(other))
+return self._jdf.sameSemantics(other._jdf)
+
+@since(3.1)
+def semanticHash(self):
+"""
+Returns a hash code of the logical query plan against this 
:class:`DataFrame`.
+
+.. note:: Unlike the standard hash code, the hash is calculated 
against the query plan
+simplified by tolerating the cosmetic differences such as 
attribute names.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df2.withColumn("col1", df2.id * 2).semanticHash()
+True
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df3.withColumn("col1", df3.id + 2).semanticHash()
+False
 
 Review comment:
   More tests:
   ```
   >>> df1=spark.range(100)
   >>> df2=spark.range(100)
   >>> df3=spark.range(100)
   >>> df11=df1.withColumn("col1", df1.id +1)
   >>> df21=df2.withColumn("col1", df2.id -1)
   >>> df31=df3.withColumn("col1", df3.id *2)
   >>> df32=df3.withColumn("col1", df3.id +2)
   >>> df33=df3.withColumn("col1", df3.id /2)
   >>> df34=df3.withColumn("col1", df3.id -2)
   
   >>> df11.semanticHash()
   1855039936
   >>> df21.semanticHash()
   1855039936
   
   >>> df31.semanticHash()
   -1719131362
   >>> df32.semanticHash()
   -1719131362
   
   >>> df32.semanticHash()
   -1719131362
   >>> df34.semanticHash()
   -706037631


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379881557
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala
 ##
 @@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier}
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMClassifierWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String],
+val labels: Array[String]) extends MLWritable {
+  import FMClassifierWrapper._
+
+  private val fmClassificationModel: FMClassificationModel =
+pipeline.stages(1).asInstanceOf[FMClassificationModel]
+
+  lazy val rFeatures: Array[String] = if 
(fmClassificationModel.getFitIntercept) {
+Array("(Intercept)") ++ features
+  } else {
+features
+  }
+
+  lazy val rCoefficients: Array[Double] = if 
(fmClassificationModel.getFitIntercept) {
+Array(fmClassificationModel.intercept) ++ 
fmClassificationModel.linear.toArray
+  } else {
+fmClassificationModel.linear.toArray
+  }
+
+  lazy val rFactors = fmClassificationModel.factors.toArray
+
+  lazy val numClasses: Int = fmClassificationModel.numClasses
+
+  lazy val numFeatures: Int = fmClassificationModel.numFeatures
+
+  lazy val factorSize: Int = fmClassificationModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(PREDICTED_LABEL_INDEX_COL)
+  .drop(fmClassificationModel.getFeaturesCol)
+  .drop(fmClassificationModel.getLabelCol)
+  }
+
+  override def write: MLWriter = new 
FMClassifierWrapper.FMClassifierWrapperWriter(this)
+}
+
+private[r] object FMClassifierWrapper
+  extends MLReadable[FMClassifierWrapper] {
+
+  val PREDICTED_LABEL_INDEX_COL = "pred_label_idx"
+  val PREDICTED_LABEL_COL = "prediction"
+
+  def fit(  // scalastyle:ignore
+  data: DataFrame,
+  formula: String,
+  factorSize: Int,
+  fitLinear: Boolean,
+  regParam: Double,
+  miniBatchFraction: Double,
+  initStd: Double,
+  maxIter: Int,
+  stepSize: Double,
+  tol: Double,
+  solver: String,
+  seed: String,
+  thresholds: Array[Double],
+  handleInvalid: String): FMClassifierWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setForceIndexLabel(true)
+  .setHandleInvalid(handleInvalid)
+checkDataColumns(rFormula, data)
+val rFormulaModel = rFormula.fit(data)
+
+val fitIntercept = rFormula.hasIntercept
+
+// get labels and feature names from output schema
+val (features, labels) = getFeaturesAndLabels(rFormulaModel, data)
+
+// assemble and fit the pipeline
+val fmc = new FMClassifier()
+  .setFactorSize(factorSize)
+  .setFitLinear(fitLinear)
+  .setRegParam(regParam)
+  .setMiniBatchFraction(miniBatchFraction)
+  .setInitStd(initStd)
+  .setMaxIter(maxIter)
+  .setTol(tol)
+  .setSolver(solver)
+  .setFitIntercept(fitIntercept)
+  .setFeaturesCol(rFormula.getFeaturesCol)
+  .setLabelCol(rFormula.getLabelCol)
+  .setPredictionCol(PREDICTED_LABEL_INDEX_COL)
 
 Review comment:
   add ```setStepSize```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...

[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo 
setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586574117
 
 
   Can one of the admins verify this patch?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880577
 
 

 ##
 File path: R/pkg/tests/fulltests/test_mllib_classification.R
 ##
 @@ -488,4 +488,36 @@ test_that("spark.naiveBayes", {
   expect_equal(class(collect(predictions)$clicked[1]), "character")
 })
 
+test_that("spark.fmClassifier", {
+  df <- withColumn(
+suppressWarnings(createDataFrame(iris)),
+"Species", otherwise(when(column("Species") == "Setosa", "Setosa"), 
"Not-Setosa")
+  )
+
+  model1 <- spark.fmClassifier(
+df,  Species ~ .,
+regParam = 0.01, maxIter = 10, fitLinear = TRUE, factorSize = 3
+  )
+
+  prediction1 <- predict(model1, df)
+  expect_is(prediction1, "SparkDataFrame")
+  expect_equal(summary(model1)$factorSize, 3)
+
+  # Test model save/load
+  if (windows_with_hadoop()) {
+modelPath <- tempfile(pattern = "spark-fmclassifier", fileext = ".tmp")
+write.ml(model1, modelPath)
+model2 <- read.ml(modelPath)
+
+expect_is(model2, "FMClassificationModel")
+
+prediction2 <- predict(model2, df)
+expect_equal(
+  collect(drop(prediction1, c("rawPrediction", "probability"))),
+  collect(drop(prediction2, c("rawPrediction", "probability")))
+)
+  }
+})
+
+
 
 Review comment:
   nit: delete extra line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880985
 
 

 ##
 File path: R/pkg/R/mllib_classification.R
 ##
 @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = 
"NaiveBayesModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Classification Model
+#'
+#' \code{spark.fmClassifier} fits a factorization classification model against 
a SparkDataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted models.
+#' Only categorical data is supported.
+#'
+#' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term.  # TODO Can we express this 
with formula?
+#' @param regParam the regularization parameter.
+#' @param miniBatchFraction the mini-batch fraction parameter.
+#' @param initStd the standard deviation of initial coefficients.
+#' @param maxIter maximum iteration number.
+#' @param stepSize stepSize parameter.
+#' @param tol convergence tolerance of iterations.
+#' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "adamW".
+#' @param thresholds in binary classification, in range [0, 1]. If the 
estimated probability of
+#'   class label 1 is > threshold, then predict 1, else 0. A 
high threshold
+#'   encourages the model to predict 0 more often; a low 
threshold encourages the
+#'   model to predict 1 more often. Note: Setting this with 
threshold p is
+#'   equivalent to setting thresholds c(1-p, p).
+#' @param seed seed parameter for weights initialization.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL 
values) in features and
+#'  label column of string type.
+#'  Supported options: "skip" (filter out rows with 
invalid data),
+#' "error" (throw an error), "keep" 
(put invalid data in
+#' a special additional bucket, at 
index numLabels). Default
+#' is "error".
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines 
Classification Model.
+#' @rdname spark.fmClassifier
+#' @aliases spark.fmClassifier,SparkDataFrame,formula-method
+#' @name spark.fmClassifier
+#' @seealso \link{read.ml}
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = 
"libsvm")
+#'
+#' # fit Factorization Machines Classification Model
+#' model <- spark.fmClassifier(
+#'df, label ~ features,
+#'regParam = 0.01, maxIter = 10, fitLinear = TRUE
+#'  )
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.fmClassifier since 3.0.0
+setMethod("spark.fmClassifier", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 
0.0,
+   miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, 
stepSize=1.0,
+   tol = 1e-6, solver = c("adamW", "gd"), thresholds = NULL, 
seed = NULL,
+   handleInvalid = c("error", "keep", "skip")) {
 
 Review comment:
   any reason why ```fitIntercept``` is not here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379879918
 
 

 ##
 File path: examples/src/main/r/ml/fmClassifier.R
 ##
 @@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
 
 Review comment:
   decisionTree.R -> fmClassifier.R


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880122
 
 

 ##
 File path: examples/src/main/r/ml/fmClassifier.R
 ##
 @@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmclasfier-example")
+
+# $example on:classification$
+# Load training data
+df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
+training <- df
+test <- df
+
+# Fit a FM classification model
+model <- spark.fmClassifier(df, label ~ features)
+
 
 Review comment:
   add ```summary(model)``` as an example too?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880075
 
 

 ##
 File path: examples/src/main/r/ml/fmClassifier.R
 ##
 @@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmclasfier-example")
+
+# $example on:classification$
+# Load training data
+df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
+training <- df
+test <- df
+
+# Fit a FM classification model
+model <- spark.fmClassifier(df, label ~ features)
+
+# Prediction
+predictions <- predict(model, test)
 
 Review comment:
   add ```head(predictions)```?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677693
 
 
   Since this fix is trivial, it looks fine to me if the tests passed. cc: 
@srowen 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880740
 
 

 ##
 File path: R/pkg/R/mllib_classification.R
 ##
 @@ -649,3 +655,155 @@ setMethod("write.ml", signature(object = 
"NaiveBayesModel", path = "character"),
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Classification Model
+#'
+#' \code{spark.fmClassifier} fits a factorization classification model against 
a SparkDataFrame.
+#' Users can call \code{summary} to print a summary of the fitted model, 
\code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted models.
+#' Only categorical data is supported.
+#'
+#' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term.  # TODO Can we express this 
with formula?
+#' @param regParam the regularization parameter.
+#' @param miniBatchFraction the mini-batch fraction parameter.
+#' @param initStd the standard deviation of initial coefficients.
+#' @param maxIter maximum iteration number.
+#' @param stepSize stepSize parameter.
+#' @param tol convergence tolerance of iterations.
+#' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "adamW".
+#' @param thresholds in binary classification, in range [0, 1]. If the 
estimated probability of
+#'   class label 1 is > threshold, then predict 1, else 0. A 
high threshold
+#'   encourages the model to predict 0 more often; a low 
threshold encourages the
+#'   model to predict 1 more often. Note: Setting this with 
threshold p is
+#'   equivalent to setting thresholds c(1-p, p).
+#' @param seed seed parameter for weights initialization.
+#' @param handleInvalid How to handle invalid data (unseen labels or NULL 
values) in features and
+#'  label column of string type.
+#'  Supported options: "skip" (filter out rows with 
invalid data),
+#' "error" (throw an error), "keep" 
(put invalid data in
+#' a special additional bucket, at 
index numLabels). Default
+#' is "error".
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.fmClassifier} returns a fitted Factorization Machines 
Classification Model.
+#' @rdname spark.fmClassifier
+#' @aliases spark.fmClassifier,SparkDataFrame,formula-method
+#' @name spark.fmClassifier
+#' @seealso \link{read.ml}
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_binary_classification_data.txt", source = 
"libsvm")
+#'
+#' # fit Factorization Machines Classification Model
+#' model <- spark.fmClassifier(
+#'df, label ~ features,
+#'regParam = 0.01, maxIter = 10, fitLinear = TRUE
+#'  )
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.fmClassifier since 3.0.0
 
 Review comment:
   3.1.0?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

maropu commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677656
 
 
   ok to test


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880095
 
 

 ##
 File path: examples/src/main/r/ml/fmClassifier.R
 ##
 @@ -0,0 +1,38 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmclasfier-example")
+
+# $example on:classification$
+# Load training data
+df <- read.df("data/mllib/sample_libsvm_data.txt", source = "libsvm")
+training <- df
+test <- df
+
+# Fit a FM classification model
+model <- spark.fmClassifier(df, label ~ features)
+
+# Prediction
+predictions <- predict(model, test)
+# $example off:classification$
 
 Review comment:
   add ```sparkR.session.stop()```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27570: [SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27570: 
[SPARK-30820][SPARKR][ML] Add FMClassifier to SparkR
URL: https://github.com/apache/spark/pull/27570#discussion_r379880497
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/r/FMClassifierWrapper.scala
 ##
 @@ -0,0 +1,176 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.classification.{FMClassificationModel, FMClassifier}
+import org.apache.spark.ml.feature.{IndexToString, RFormula}
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMClassifierWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String],
+val labels: Array[String]) extends MLWritable {
+  import FMClassifierWrapper._
+
+  private val fmClassificationModel: FMClassificationModel =
+pipeline.stages(1).asInstanceOf[FMClassificationModel]
+
+  lazy val rFeatures: Array[String] = if 
(fmClassificationModel.getFitIntercept) {
+Array("(Intercept)") ++ features
+  } else {
+features
+  }
+
+  lazy val rCoefficients: Array[Double] = if 
(fmClassificationModel.getFitIntercept) {
+Array(fmClassificationModel.intercept) ++ 
fmClassificationModel.linear.toArray
+  } else {
+fmClassificationModel.linear.toArray
+  }
+
+  lazy val rFactors = fmClassificationModel.factors.toArray
+
+  lazy val numClasses: Int = fmClassificationModel.numClasses
+
+  lazy val numFeatures: Int = fmClassificationModel.numFeatures
+
+  lazy val factorSize: Int = fmClassificationModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(PREDICTED_LABEL_INDEX_COL)
+  .drop(fmClassificationModel.getFeaturesCol)
+  .drop(fmClassificationModel.getLabelCol)
+  }
+
+  override def write: MLWriter = new 
FMClassifierWrapper.FMClassifierWrapperWriter(this)
+}
+
+private[r] object FMClassifierWrapper
+  extends MLReadable[FMClassifierWrapper] {
+
+  val PREDICTED_LABEL_INDEX_COL = "pred_label_idx"
+  val PREDICTED_LABEL_COL = "prediction"
+
+  def fit(  // scalastyle:ignore
+  data: DataFrame,
+  formula: String,
+  factorSize: Int,
+  fitLinear: Boolean,
+  regParam: Double,
+  miniBatchFraction: Double,
+  initStd: Double,
+  maxIter: Int,
+  stepSize: Double,
+  tol: Double,
+  solver: String,
+  seed: String,
+  thresholds: Array[Double],
+  handleInvalid: String): FMClassifierWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setForceIndexLabel(true)
+  .setHandleInvalid(handleInvalid)
+checkDataColumns(rFormula, data)
+val rFormulaModel = rFormula.fit(data)
+
+val fitIntercept = rFormula.hasIntercept
+
+// get labels and feature names from output schema
+val (features, labels) = getFeaturesAndLabels(rFormulaModel, data)
+
+// assemble and fit the pipeline
+val fmc = new FMClassifier()
+  .setFactorSize(factorSize)
+  .setFitLinear(fitLinear)
+  .setRegParam(regParam)
+  .setMiniBatchFraction(miniBatchFraction)
+  .setInitStd(initStd)
+  .setMaxIter(maxIter)
+  .setTol(tol)
+  .setSolver(solver)
+  .setFitIntercept(fitIntercept)
+  .setFeaturesCol(rFormula.getFeaturesCol)
+  .setLabelCol(rFormula.getLabelCol)
+  .setPredictionCol(PREDICTED_LABEL_INDEX_COL)
+
+if (seed != null) {
 
 Review comment:
   ```if (seed != null && seed.length > 0)```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.o

[GitHub] [spark] SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677822
 
 
   **[Test build #118494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)**
 for PR 27594 at commit 
[`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => 
setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => 
setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677913
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23250/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo 
setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677912
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo 
setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677913
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/23250/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments

2020-02-15 Thread GitBox

maropu commented on a change in pull request #27495: [SPARK-28880][SQL] Support 
ANSI nested bracketed comments
URL: https://github.com/apache/spark/pull/27495#discussion_r379882153
 
 

 ##
 File path: sql/core/src/test/resources/sql-tests/inputs/postgreSQL/comments.sql
 ##
 @@ -47,4 +45,5 @@ Now just one deep...
 */
 'deeply nested example' AS sixth;
 --QUERY-DELIMITER-END
-/* and this is the end of the file */
+-- [SPARK-30824] Support submit sql content only contains comments.
 
 Review comment:
   Is this an ANSI-related issue? 
https://issues.apache.org/jira/browse/SPARK-30824


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested bracketed comments

2020-02-15 Thread GitBox

maropu commented on issue #27495: [SPARK-28880][SQL] Support ANSI nested 
bracketed comments
URL: https://github.com/apache/spark/pull/27495#issuecomment-586678147
 
 
   Looks fine now except for one comment.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] liangz1 commented on a change in pull request #27565: [WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset

2020-02-15 Thread GitBox

liangz1 commented on a change in pull request #27565: 
[WIP][SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods 
in Dataset
URL: https://github.com/apache/spark/pull/27565#discussion_r379882384
 
 

 ##
 File path: python/pyspark/sql/dataframe.py
 ##
 @@ -2153,6 +2153,59 @@ def transform(self, func):
   "should have been DataFrame." % 
type(result)
 return result
 
+@since(3.1)
+def sameSemantics(self, other):
+"""
+Returns `True` when the logical query plans inside both 
:class:`DataFrame`\\s are equal and
+therefore return same results.
+
+.. note:: The equality comparison here is simplified by tolerating the 
cosmetic differences
+such as attribute names.
+
+.. note::This API can compare both :class:`DataFrame`\\s very fast but 
can still return
+`False` on the :class:`DataFrame` that return the same results, 
for instance, from
+different plans. Such false negative semantic can be useful when 
caching as an example.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df2.withColumn("col1", df2.id * 2))
+True
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df3.withColumn("col1", df3.id + 2))
+False
+>>> df1.withColumn("col1", df1.id * 
2).sameSemantics(df4.withColumn("col0", df4.id * 2))
+True
+"""
+if not isinstance(other, DataFrame):
+raise ValueError("other parameter should be of DataFrame; however, 
got %s"
+ % type(other))
+return self._jdf.sameSemantics(other._jdf)
+
+@since(3.1)
+def semanticHash(self):
+"""
+Returns a hash code of the logical query plan against this 
:class:`DataFrame`.
+
+.. note:: Unlike the standard hash code, the hash is calculated 
against the query plan
+simplified by tolerating the cosmetic differences such as 
attribute names.
+
+>>> df1 = spark.range(100)
+>>> df2 = spark.range(100)
+>>> df3 = spark.range(100)
+>>> df4 = spark.range(100)
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df2.withColumn("col1", df2.id * 2).semanticHash()
+True
+>>> df1.withColumn("col1", df1.id * 2).semanticHash() == \
+df3.withColumn("col1", df3.id + 2).semanticHash()
+False
 
 Review comment:
   Same behavior for dataframe from `spark.read.load()`
   ```
   >>> df4=spark.read.load(csv_file_path, format="csv", inferSchema="true", 
header="true")
   >>> df4.schema
   
StructType(List(StructField(bool_col,BooleanType,true),StructField(float_col,DoubleType,true),StructField(double_col,DoubleType,true),StructField(int_col,IntegerType,true),StructField(long_col,IntegerType,true)))
   >>> df4.withColumn("col1", df4.int_col *2).semanticHash()
   -1746346451
   >>> df4.withColumn("col1", df4.int_col +2).semanticHash()
   -1746346451


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

SparkQA commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586678746
 
 
   **[Test build #118494 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)**
 for PR 27594 at commit 
[`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo 
setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586678766
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] SparkQA removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

SparkQA removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => 
setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586677822
 
 
   **[Test build #118494 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/118494/testReport)**
 for PR 27594 at commit 
[`0bb6301`](https://github.com/apache/spark/commit/0bb630176af979773779ff2516a8f82d37970150).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => 
setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586678768
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118494/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins removed a comment on issue #27594: [GRAPHX] [MINOR] Fix typo 
setRest => setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586678768
 
 
   Test PASSed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/118494/
   Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => setDest

2020-02-15 Thread GitBox

AmplabJenkins commented on issue #27594: [GRAPHX] [MINOR] Fix typo setRest => 
setDest
URL: https://github.com/apache/spark/pull/27594#issuecomment-586678766
 
 
   Merged build finished. Test PASSed.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882027
 
 

 ##
 File path: R/pkg/R/mllib_regression.R
 ##
 @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = 
"AFTSurvivalRegressionModel", path = "c
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Regression Model Model
 
 Review comment:
   nit: ```Model Model```


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882145
 
 

 ##
 File path: R/pkg/R/mllib_regression.R
 ##
 @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = 
"AFTSurvivalRegressionModel", path = "c
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Regression Model Model
+#'
+#' \code{spark.fmRegressor} fits a factorization regression model against a 
SparkDataFrame.
+#' Users can call \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted models.
 
 Review comment:
   I guess also mention ```summary``` here?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882502
 
 

 ##
 File path: R/pkg/tests/fulltests/test_mllib_regression.R
 ##
 @@ -551,4 +551,33 @@ test_that("spark.survreg", {
   }
 })
 
+
+test_that("spark.fmRegressor", {
+  df <- suppressWarnings(createDataFrame(iris))
+
+  model <- spark.fmRegressor(
+df,  Sepal_Width ~ .,
+regParam = 0.01, maxIter = 10, fitLinear = TRUE
+  )
+
+  prediction1 <- predict(model, df)
+  expect_is(prediction1, "SparkDataFrame")
+
+  # Test model save/load
+  if (windows_with_hadoop()) {
+modelPath <- tempfile(pattern = "spark-fmregressor", fileext = ".tmp")
+write.ml(model, modelPath)
+model2 <- read.ml(modelPath)
+
+expect_is(model2, "FMRegressionModel")
+
+prediction2 <- predict(model2, df)
+expect_equal(
+  collect(prediction1),
+  collect(prediction2)
+)
+  }
+})
+
+
 
 Review comment:
   nit: delete extra line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379881943
 
 

 ##
 File path: R/pkg/R/mllib_regression.R
 ##
 @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = 
"AFTSurvivalRegressionModel", path = "c
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
 
 Review comment:
   nit: delete extra line?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882615
 
 

 ##
 File path: examples/src/main/r/ml/fmRegressor.R
 ##
 @@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
 
 Review comment:
   change this ```decisionTree.R```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882600
 
 

 ##
 File path: examples/src/main/r/ml/fmRegressor.R
 ##
 @@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmRegressor-example")
+
+# $example on
+# Load training data
+df <- read.df("data/mllib/sample_linear_regression_data.txt", source = 
"libsvm")
+training_test <- randomSplit(df, c(0.7, 0.3))
+training <- training_test[[1]]
+test <- training_test[[2]]
+
+
 
 Review comment:
   nit: delete extra line


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379883014
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala
 ##
 @@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMRegressorWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String]) extends MLWritable {
+  import FMRegressorWrapper._
+
+  private val fmRegressionModel: FMRegressionModel =
+pipeline.stages(1).asInstanceOf[FMRegressionModel]
+
+  lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) {
+Array("(Intercept)") ++ features
+  } else {
+features
+  }
+
+  lazy val rCoefficients: Array[Double] = if 
(fmRegressionModel.getFitIntercept) {
+Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray
+  } else {
+fmRegressionModel.linear.toArray
+  }
+
+  lazy val rFactors = fmRegressionModel.factors.toArray
+
+  lazy val numFeatures: Int = fmRegressionModel.numFeatures
+
+  lazy val factorSize: Int = fmRegressionModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(fmRegressionModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new 
FMRegressorWrapper.FMRegressorWrapperWriter(this)
+}
+
+private[r] object FMRegressorWrapper
+  extends MLReadable[FMRegressorWrapper] {
+
+  def fit(  // scalastyle:ignore
+  data: DataFrame,
+  formula: String,
+  factorSize: Int,
+  fitLinear: Boolean,
+  regParam: Double,
+  miniBatchFraction: Double,
+  initStd: Double,
+  maxIter: Int,
+  stepSize: Double,
+  tol: Double,
+  solver: String,
+  seed: String,
+  stringIndexerOrderType: String): FMRegressorWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setStringIndexerOrderType(stringIndexerOrderType)
+checkDataColumns(rFormula, data)
+val rFormulaModel = rFormula.fit(data)
+
+val fitIntercept = rFormula.hasIntercept
+
+// get feature names from output schema
+val schema = rFormulaModel.transform(data).schema
+val featureAttrs = 
AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
+  .attributes.get
+val features = featureAttrs.map(_.name.get)
+
+// assemble and fit the pipeline
+val fmr = new FMRegressor()
+  .setFactorSize(factorSize)
+  .setFitLinear(fitLinear)
+  .setRegParam(regParam)
+  .setMiniBatchFraction(miniBatchFraction)
+  .setInitStd(initStd)
+  .setMaxIter(maxIter)
+  .setTol(tol)
+  .setSolver(solver)
+  .setFitIntercept(fitIntercept)
+  .setFeaturesCol(rFormula.getFeaturesCol)
+
+if (seed != null) {
 
 Review comment:
   also check ```seed.length > 0```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882860
 
 

 ##
 File path: examples/src/main/r/ml/fmRegressor.R
 ##
 @@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/decisionTree.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-fmRegressor-example")
+
+# $example on
+# Load training data
+df <- read.df("data/mllib/sample_linear_regression_data.txt", source = 
"libsvm")
+training_test <- randomSplit(df, c(0.7, 0.3))
+training <- training_test[[1]]
+test <- training_test[[2]]
+
+
+# Fit a FM regression model
+model <- spark.fmRegressor(training, label ~ features)
+
+# Prediction
+predictions <- predict(model, test)
 
 Review comment:
   same as the classifier example, I guess add ```summary(model)```, 
```head(predictions)``` and also add ```sparkR.session.stop()``` in the end?
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882920
 
 

 ##
 File path: mllib/src/main/scala/org/apache/spark/ml/r/FMRegressorWrapper.scala
 ##
 @@ -0,0 +1,157 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.r
+
+import org.apache.hadoop.fs.Path
+import org.json4s._
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.ml.{Pipeline, PipelineModel}
+import org.apache.spark.ml.attribute.AttributeGroup
+import org.apache.spark.ml.feature.RFormula
+import org.apache.spark.ml.r.RWrapperUtils._
+import org.apache.spark.ml.regression.{FMRegressionModel, FMRegressor}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset}
+
+private[r] class FMRegressorWrapper private (
+val pipeline: PipelineModel,
+val features: Array[String]) extends MLWritable {
+  import FMRegressorWrapper._
+
+  private val fmRegressionModel: FMRegressionModel =
+pipeline.stages(1).asInstanceOf[FMRegressionModel]
+
+  lazy val rFeatures: Array[String] = if (fmRegressionModel.getFitIntercept) {
+Array("(Intercept)") ++ features
+  } else {
+features
+  }
+
+  lazy val rCoefficients: Array[Double] = if 
(fmRegressionModel.getFitIntercept) {
+Array(fmRegressionModel.intercept) ++ fmRegressionModel.linear.toArray
+  } else {
+fmRegressionModel.linear.toArray
+  }
+
+  lazy val rFactors = fmRegressionModel.factors.toArray
+
+  lazy val numFeatures: Int = fmRegressionModel.numFeatures
+
+  lazy val factorSize: Int = fmRegressionModel.getFactorSize
+
+  def transform(dataset: Dataset[_]): DataFrame = {
+pipeline.transform(dataset)
+  .drop(fmRegressionModel.getFeaturesCol)
+  }
+
+  override def write: MLWriter = new 
FMRegressorWrapper.FMRegressorWrapperWriter(this)
+}
+
+private[r] object FMRegressorWrapper
+  extends MLReadable[FMRegressorWrapper] {
+
+  def fit(  // scalastyle:ignore
+  data: DataFrame,
+  formula: String,
+  factorSize: Int,
+  fitLinear: Boolean,
+  regParam: Double,
+  miniBatchFraction: Double,
+  initStd: Double,
+  maxIter: Int,
+  stepSize: Double,
+  tol: Double,
+  solver: String,
+  seed: String,
+  stringIndexerOrderType: String): FMRegressorWrapper = {
+
+val rFormula = new RFormula()
+  .setFormula(formula)
+  .setStringIndexerOrderType(stringIndexerOrderType)
+checkDataColumns(rFormula, data)
+val rFormulaModel = rFormula.fit(data)
+
+val fitIntercept = rFormula.hasIntercept
+
+// get feature names from output schema
+val schema = rFormulaModel.transform(data).schema
+val featureAttrs = 
AttributeGroup.fromStructField(schema(rFormulaModel.getFeaturesCol))
+  .attributes.get
+val features = featureAttrs.map(_.name.get)
+
+// assemble and fit the pipeline
+val fmr = new FMRegressor()
+  .setFactorSize(factorSize)
+  .setFitLinear(fitLinear)
+  .setRegParam(regParam)
+  .setMiniBatchFraction(miniBatchFraction)
+  .setInitStd(initStd)
+  .setMaxIter(maxIter)
+  .setTol(tol)
+  .setSolver(solver)
+  .setFitIntercept(fitIntercept)
+  .setFeaturesCol(rFormula.getFeaturesCol)
 
 Review comment:
   add ```setStepSize```?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882476
 
 

 ##
 File path: R/pkg/tests/fulltests/test_mllib_regression.R
 ##
 @@ -551,4 +551,33 @@ test_that("spark.survreg", {
   }
 })
 
+
+test_that("spark.fmRegressor", {
+  df <- suppressWarnings(createDataFrame(iris))
+
+  model <- spark.fmRegressor(
+df,  Sepal_Width ~ .,
+regParam = 0.01, maxIter = 10, fitLinear = TRUE
+  )
+
+  prediction1 <- predict(model, df)
+  expect_is(prediction1, "SparkDataFrame")
 
 Review comment:
   I guess we may want to check the predict result too?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882343
 
 

 ##
 File path: R/pkg/R/mllib_regression.R
 ##
 @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = 
"AFTSurvivalRegressionModel", path = "c
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Regression Model Model
+#'
+#' \code{spark.fmRegressor} fits a factorization regression model against a 
SparkDataFrame.
+#' Users can call \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted models.
+#'
+#' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term.  # TODO Can we express this 
with formula?
+#' @param regParam the regularization parameter.
+#' @param miniBatchFraction the mini-batch fraction parameter.
+#' @param initStd the standard deviation of initial coefficients.
+#' @param maxIter maximum iteration number.
+#' @param stepSize stepSize parameter.
+#' @param tol convergence tolerance of iterations.
+#' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "adamW".
+#' @param seed seed parameter for weights initialization.
+#' @param stringIndexerOrderType how to order categories of a string feature 
column. This is used to
+#'   decide the base level of a string feature as 
the last category
+#'   after ordering is dropped when encoding 
strings. Supported options
+#'   are "frequencyDesc", "frequencyAsc", 
"alphabetDesc", and
+#'   "alphabetAsc". The default value is 
"frequencyDesc". When the
+#'   ordering is set to "alphabetDesc", this drops 
the same category
+#'   as R when encoding strings.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines 
Regression Model.
+#'
+#' @rdname spark.fmRegressor
+#' @aliases spark.fmRegressor,SparkDataFrame,formula-method
+#' @name spark.fmRegressor
+#' @seealso \link{read.ml}
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = 
"libsvm")
+#'
+#' # fit Factorization Machines Regression Model
+#' model <- spark.fmRegressor(
+#'df, label ~ features,
+#'regParam = 0.01, maxIter = 10, fitLinear = TRUE
+#'  )
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.fmRegressor since 3.1.0
+setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 
0.0,
+   miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, 
stepSize=1.0,
+   tol = 1e-6, solver = c("adamW", "gd"), seed = NULL,
+   stringIndexerOrderType = c("frequencyDesc", "frequencyAsc",
+  "alphabetDesc", "alphabetAsc")) {
+
+formula <- paste(deparse(formula), collapse = "")
+
+if (!is.null(seed)) {
+  seed <- as.character(as.integer(seed))
+}
+
+solver <- match.arg(solver)
+stringIndexerOrderType <- match.arg(stringIndexerOrderType)
+
+jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper",
+"fit",
+data@sdf,
+formula,
+as.integer(factorSize),
+as.logical(fitLinear),
+as.numeric(regParam),
+as.numeric(miniBatchFraction),
+as.numeric(initStd),
+as.integer(maxIter),
+as.numeric(stepSize),
+as.numeric(tol),
+solver,
+seed,
+stringIndexerOrderType)
+new("FMRegressionModel", jobj = jobj)
+  })
+
+
 
 Review comment:
   nit: delete extra line?

[GitHub] [spark] huaxingao commented on a change in pull request #27571: [SPARK-30819][SPARKR][ML] Add FMRegressor wrapper to SparkR

2020-02-15 Thread GitBox

huaxingao commented on a change in pull request #27571: 
[SPARK-30819][SPARKR][ML]  Add FMRegressor wrapper to SparkR
URL: https://github.com/apache/spark/pull/27571#discussion_r379882358
 
 

 ##
 File path: R/pkg/R/mllib_regression.R
 ##
 @@ -540,3 +546,150 @@ setMethod("write.ml", signature(object = 
"AFTSurvivalRegressionModel", path = "c
   function(object, path, overwrite = FALSE) {
 write_internal(object, path, overwrite)
   })
+
+
+#' Factorization Machines Regression Model Model
+#'
+#' \code{spark.fmRegressor} fits a factorization regression model against a 
SparkDataFrame.
+#' Users can call \code{predict} to make
+#' predictions on new data, and \code{write.ml}/\code{read.ml} to save/load 
fitted models.
+#'
+#' @param data a \code{SparkDataFrame} of observations and labels for model 
fitting.
+#' @param formula a symbolic description of the model to be fitted. Currently 
only a few formula
+#'operators are supported, including '~', '.', ':', '+', and 
'-'.
+#' @param factorSize dimensionality of the factors.
+#' @param fitLinear whether to fit linear term.  # TODO Can we express this 
with formula?
+#' @param regParam the regularization parameter.
+#' @param miniBatchFraction the mini-batch fraction parameter.
+#' @param initStd the standard deviation of initial coefficients.
+#' @param maxIter maximum iteration number.
+#' @param stepSize stepSize parameter.
+#' @param tol convergence tolerance of iterations.
+#' @param solver solver parameter, supported options: "gd" (minibatch gradient 
descent) or "adamW".
+#' @param seed seed parameter for weights initialization.
+#' @param stringIndexerOrderType how to order categories of a string feature 
column. This is used to
+#'   decide the base level of a string feature as 
the last category
+#'   after ordering is dropped when encoding 
strings. Supported options
+#'   are "frequencyDesc", "frequencyAsc", 
"alphabetDesc", and
+#'   "alphabetAsc". The default value is 
"frequencyDesc". When the
+#'   ordering is set to "alphabetDesc", this drops 
the same category
+#'   as R when encoding strings.
+#' @param ... additional arguments passed to the method.
+#' @return \code{spark.fmRegressor} returns a fitted Factorization Machines 
Regression Model.
+#'
+#' @rdname spark.fmRegressor
+#' @aliases spark.fmRegressor,SparkDataFrame,formula-method
+#' @name spark.fmRegressor
+#' @seealso \link{read.ml}
+#' @examples
+#' \dontrun{
+#' df <- read.df("data/mllib/sample_linear_regression_data.txt", source = 
"libsvm")
+#'
+#' # fit Factorization Machines Regression Model
+#' model <- spark.fmRegressor(
+#'df, label ~ features,
+#'regParam = 0.01, maxIter = 10, fitLinear = TRUE
+#'  )
+#'
+#' # get the summary of the model
+#' summary(model)
+#'
+#' # make predictions
+#' predictions <- predict(model, df)
+#'
+#' # save and load the model
+#' path <- "path/to/model"
+#' write.ml(model, path)
+#' savedModel <- read.ml(path)
+#' summary(savedModel)
+#' }
+#' @note spark.fmRegressor since 3.1.0
+setMethod("spark.fmRegressor", signature(data = "SparkDataFrame", formula = 
"formula"),
+  function(data, formula, factorSize = 8, fitLinear = TRUE, regParam = 
0.0,
+   miniBatchFraction = 1.0, initStd = 0.01, maxIter = 100, 
stepSize=1.0,
+   tol = 1e-6, solver = c("adamW", "gd"), seed = NULL,
+   stringIndexerOrderType = c("frequencyDesc", "frequencyAsc",
+  "alphabetDesc", "alphabetAsc")) {
+
+formula <- paste(deparse(formula), collapse = "")
+
+if (!is.null(seed)) {
+  seed <- as.character(as.integer(seed))
+}
+
+solver <- match.arg(solver)
+stringIndexerOrderType <- match.arg(stringIndexerOrderType)
+
+jobj <- callJStatic("org.apache.spark.ml.r.FMRegressorWrapper",
+"fit",
+data@sdf,
+formula,
+as.integer(factorSize),
+as.logical(fitLinear),
+as.numeric(regParam),
+as.numeric(miniBatchFraction),
+as.numeric(initStd),
+as.integer(maxIter),
+as.numeric(stepSize),
+as.numeric(tol),
+solver,
+seed,
+stringIndexerOrderType)
+new("FMRegressionModel", jobj = jobj)
+  })
+
+
+#  Returns the summary of a FM Regression model produced by 
\code{spark.fmRegressor}
+
+#' @param obj

< 1 2 3 4

301 - 376 of 376 matches

Mail list logo