[jira] [Commented] (SPARK-25770) support SparkDataFrame pretty print

2020-07-29 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167324#comment-17167324
 ] 

S Daniel Zafar commented on SPARK-25770:


[~adrian555], what would your preferred print look like?

> support SparkDataFrame pretty print
> ---
>
> Key: SPARK-25770
> URL: https://issues.apache.org/jira/browse/SPARK-25770
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.3.0, 2.3.1, 2.3.2
>Reporter: Weiqiang Zhuang
>Priority: Minor
>
> This is for continuous discussion with a requirement added in 
> [https://github.com/apache/spark/pull/22455#discussion_r223197863.]
>  
> Summary:
> SparkDataFrame is an S4 object, and `show()` is the default method for 
> displaying the data frame on screen. Currently the output is simply the string 
> returned by the `showString()` call, which pre-formats the data frame and 
> displays it as a table. This lacks the flexibility to re-format the output in a 
> more user-friendly fashion, as seen in 1) the S3 `print()` method, which 
> allows specifying arguments such as `quote` to control the output; and 2) 
> external tools such as Jupyter R notebooks, which implement their own 
> customized display.
>  
> This Jira aims to explore a feasible solution for improving the screen output 
> experience, both by supporting pretty printing from within the SparkR package 
> and by offering a common hook through which external tools can customize the 
> display function.
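The "common hook" idea described above can be sketched in plain Python (all names here are hypothetical for illustration; this is not an existing SparkR or PySpark API):

```python
# Hypothetical sketch of a display hook: a package-level formatter that
# defaults to a showString()-style fixed-width table, but which an external
# tool (e.g. a notebook front-end) may replace with its own renderer.
_display_hook = None


def set_display_hook(fn):
    """Register a custom formatter taking (columns, rows) -> str."""
    global _display_hook
    _display_hook = fn


def show(columns, rows):
    """Render a small result set, deferring to the installed hook if any."""
    if _display_hook is not None:
        return _display_hook(columns, rows)
    # Default: a plain fixed-width table, akin to showString()'s output.
    widths = [max([len(str(c))] + [len(str(r[i])) for r in rows])
              for i, c in enumerate(columns)]
    header = " | ".join(str(c).ljust(w) for c, w in zip(columns, widths))
    body = [" | ".join(str(v).ljust(w) for v, w in zip(r, widths))
            for r in rows]
    return "\n".join([header, "-" * len(header)] + body)
```

With this shape, the package's default output stays unchanged while a notebook could call `set_display_hook` once to take over rendering.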



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30255) Support explain mode in SparkR df.explain

2020-07-29 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17167316#comment-17167316
 ] 

S Daniel Zafar commented on SPARK-30255:


Hello- I would like to knock this one out as a first issue. It seems pretty 
straightforward. I'm planning to copy the PySpark API directly, such that both 
`extended` and `mode` are arguments, but if `extended` comes in as an object of 
class character then it is treated as `mode`. Does that sound like a good plan? I 
have written it up but need a little guidance on how folks typically build 
Spark for local testing. 
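The argument dispatch described above could look roughly like this in plain Python (a sketch of the proposed handling only, not the actual PySpark or SparkR implementation; the function name and return values are placeholders):

```python
def resolve_explain_mode(extended=None, mode=None):
    """Sketch of the proposed extended/mode dispatch for df.explain."""
    # If `extended` arrives as a string (class "character" in R),
    # treat it as the explain mode instead of a boolean flag.
    if isinstance(extended, str) and mode is None:
        extended, mode = None, extended
    if extended is not None and mode is not None:
        raise ValueError("specify either extended or mode, not both")
    if mode is not None:
        # Spark's explain modes: "simple", "extended", "codegen",
        # "cost", "formatted"
        return mode
    return "extended" if extended else "simple"
```

This mirrors the PySpark behavior of accepting a mode string in the first positional slot while keeping the legacy boolean `extended` working.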

If it's okay for me to work on this please assign this task to me.

> Support explain mode in SparkR df.explain
> -
>
> Key: SPARK-30255
> URL: https://issues.apache.org/jira/browse/SPARK-30255
> Project: Spark
>  Issue Type: Improvement
>  Components: R, SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This PR intends to support the explain modes implemented in SPARK-30200 
> (#26829) in SparkR.






[jira] [Comment Edited] (SPARK-30817) SparkR ML algorithms parity

2020-07-28 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166571#comment-17166571
 ] 

S Daniel Zafar edited comment on SPARK-30817 at 7/28/20, 5:19 PM:
--

I would like to work on this issue, is that all right [~hyukjin.kwon]? It would 
be my first.


was (Author: dan_z):
I would like to address this issue, is that all right [~hyukjin.kwon]?

> SparkR ML algorithms parity 
> 
>
> Key: SPARK-30817
> URL: https://issues.apache.org/jira/browse/SPARK-30817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> As of 3.0 the following algorithms are missing from SparkR:
> * {{LinearRegression}} 
> * {{FMRegressor}} (Added to ML in 3.0)
> * {{FMClassifier}} (Added to ML in 3.0)






[jira] [Commented] (SPARK-30817) SparkR ML algorithms parity

2020-07-28 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166571#comment-17166571
 ] 

S Daniel Zafar commented on SPARK-30817:


I would like to address this issue, is that all right [~hyukjin.kwon]?

> SparkR ML algorithms parity 
> 
>
> Key: SPARK-30817
> URL: https://issues.apache.org/jira/browse/SPARK-30817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, SparkR
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>
> As of 3.0 the following algorithms are missing from SparkR:
> * {{LinearRegression}} 
> * {{FMRegressor}} (Added to ML in 3.0)
> * {{FMClassifier}} (Added to ML in 3.0)






[jira] [Commented] (SPARK-12172) Consider removing SparkR internal RDD APIs

2020-07-28 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-12172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166569#comment-17166569
 ] 

S Daniel Zafar commented on SPARK-12172:


My opinion is that it makes sense to keep these methods, since they exist in 
PySpark. Removing basic things like `map` seems counterintuitive. The PR is 
closed; should we close this as well?

> Consider removing SparkR internal RDD APIs
> --
>
> Key: SPARK-12172
> URL: https://issues.apache.org/jira/browse/SPARK-12172
> Project: Spark
>  Issue Type: Task
>  Components: SparkR
>Reporter: Felix Cheung
>Priority: Major
>







[jira] [Commented] (SPARK-20684) expose createOrReplaceGlobalTempView/createGlobalTempView and dropGlobalTempView in SparkR

2020-07-28 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166559#comment-17166559
 ] 

S Daniel Zafar commented on SPARK-20684:


The PR ([https://github.com/apache/spark/pull/17941]) was closed. I think we 
can close this.

> expose createOrReplaceGlobalTempView/createGlobalTempView and 
> dropGlobalTempView in SparkR
> --
>
> Key: SPARK-20684
> URL: https://issues.apache.org/jira/browse/SPARK-20684
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>Priority: Major
>
> This is a useful API that is not exposed in SparkR. It will help with moving 
> data between languages within a single Spark application.






[jira] [Issue Comment Deleted] (SPARK-31137) Opportunity to simplify execution plan when passing empty dataframes to subtract()

2020-03-13 Thread S Daniel Zafar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

S Daniel Zafar updated SPARK-31137:
---
Comment: was deleted

(was: Moving this to Databricks internal board.)

> Opportunity to simplify execution plan when passing empty dataframes to 
> subtract()
> --
>
> Key: SPARK-31137
> URL: https://issues.apache.org/jira/browse/SPARK-31137
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: S Daniel Zafar
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Execution plans are similar when passing an empty versus non-empty DataFrame 
> to pyspark's subtract call.
> {code:java}
> df.subtract(regDf){code}
> yields the same physical plan as:
> {code:java}
> df.subtract(emptyDf){code}
>  Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both 
> DataFrames, this can yield some significant performance speed-ups because if 
> the incoming DF is empty no processing should happen.
>  
> Should be a quick fix for a seasoned committer.






[jira] [Resolved] (SPARK-31137) Opportunity to simplify execution plan when passing empty dataframes to subtract()

2020-03-13 Thread S Daniel Zafar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31137?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

S Daniel Zafar resolved SPARK-31137.

Resolution: Won't Do

moving to Databricks internal board.

> Opportunity to simplify execution plan when passing empty dataframes to 
> subtract()
> --
>
> Key: SPARK-31137
> URL: https://issues.apache.org/jira/browse/SPARK-31137
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: S Daniel Zafar
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Execution plans are similar when passing an empty versus non-empty DataFrame 
> to pyspark's subtract call.
> {code:java}
> df.subtract(regDf){code}
> yields the same physical plan as:
> {code:java}
> df.subtract(emptyDf){code}
>  Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both 
> DataFrames, this can yield some significant performance speed-ups because if 
> the incoming DF is empty no processing should happen.
>  
> Should be a quick fix for a seasoned committer.






[jira] [Commented] (SPARK-31137) Opportunity to simplify execution plan when passing empty dataframes to subtract()

2020-03-13 Thread S Daniel Zafar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058872#comment-17058872
 ] 

S Daniel Zafar commented on SPARK-31137:


Moving this to Databricks internal board.

> Opportunity to simplify execution plan when passing empty dataframes to 
> subtract()
> --
>
> Key: SPARK-31137
> URL: https://issues.apache.org/jira/browse/SPARK-31137
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 2.4.5
>Reporter: S Daniel Zafar
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Execution plans are similar when passing an empty versus non-empty DataFrame 
> to pyspark's subtract call.
> {code:java}
> df.subtract(regDf){code}
> yields the same physical plan as:
> {code:java}
> df.subtract(emptyDf){code}
>  Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both 
> DataFrames, this can yield some significant performance speed-ups because if 
> the incoming DF is empty no processing should happen.
>  
> Should be a quick fix for a seasoned committer.






[jira] [Created] (SPARK-31137) Opportunity to simplify execution plan when passing empty dataframes to subtract()

2020-03-12 Thread S Daniel Zafar (Jira)
S Daniel Zafar created SPARK-31137:
--

 Summary: Opportunity to simplify execution plan when passing empty 
dataframes to subtract()
 Key: SPARK-31137
 URL: https://issues.apache.org/jira/browse/SPARK-31137
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 2.4.5
Reporter: S Daniel Zafar


Execution plans are identical when passing an empty versus a non-empty DataFrame 
to PySpark's subtract call.
{code:java}
df.subtract(regDf){code}
yields the same physical plan as:
{code:java}
df.subtract(emptyDf){code}
 Since the operation (EXCEPT DISTINCT in Spark SQL) requires a sort on both 
DataFrames, short-circuiting could yield significant speed-ups: if the incoming 
DataFrame is empty, no processing should happen.

 

Should be a quick fix for a seasoned committer.
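The proposed short-circuit can be illustrated with a plain-Python model of EXCEPT DISTINCT semantics (illustrative only, not Spark code or the Catalyst rule itself):

```python
def except_distinct(left, right):
    """Model of EXCEPT DISTINCT: distinct rows of `left` absent from `right`."""
    if not right:
        # Proposed fast path: with an empty right side there is nothing to
        # anti-join against, so only deduplication of `left` is needed --
        # the sort/shuffle over both sides could be skipped entirely.
        return list(dict.fromkeys(left))
    right_set = set(right)
    return [row for row in dict.fromkeys(left) if row not in right_set]
```

In Spark terms, the optimizer could rewrite `df.subtract(emptyDf)` into a plain `df.distinct()` when the right-hand plan is statically known to be empty.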


