[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590 ]
Michael Patterson commented on SPARK-20456:
-------------------------------------------

I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or to relational algebra. For example, some of my coworkers didn't know you could do `df.agg(mean(col))` without doing a `groupby` first. There are also no links to `groupby` in any of the aggregate functions' docstrings. I also didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding.

The same applies to `lit`. It took me a while to learn what it did.

For the datetime functions, this line, for example, has a column named 'd':
https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926
I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?

> Document major aggregation functions for pyspark
> ------------------------------------------------
>
>                 Key: SPARK-20456
>                 URL: https://issues.apache.org/jira/browse/SPARK-20456
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.1.0
>            Reporter: Michael Patterson
>            Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)