[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590 ]
Michael Patterson commented on SPARK-20456:
-------------------------------------------

I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or to relational algebra. For example, some of my coworkers didn't know you could do `df.agg(mean(col))` without doing a `groupby` first. There are also no links to `groupby` in any of the aggregate functions' docstrings. I also didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding.

The same applies to `lit`. It took me a while to learn what it did.

For the datetime functions, this line, for example, has a column named 'd':
https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926
I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> ------------------------------------------------
>
>                 Key: SPARK-20456
>                 URL: https://issues.apache.org/jira/browse/SPARK-20456
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.1.0
>            Reporter: Michael Patterson
>            Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)