[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983888#comment-15983888 ]

Hyukjin Kwon commented on SPARK-20456:
--------------------------------------

I simply left the comment above because the current status does not seem to match the description and the title. Let's fix the title and the description here.

> Document major aggregation functions for pyspark
> ------------------------------------------------
>
>                 Key: SPARK-20456
>                 URL: https://issues.apache.org/jira/browse/SPARK-20456
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation
>    Affects Versions: 2.1.0
>            Reporter: Michael Patterson
>            Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`
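As a sketch of what item 3 might look like in the docstrings (the data is made up; it assumes a SparkSession named `spark` is in scope, as in the existing doctests):

    from pyspark.sql import functions as F

    df = spark.createDataFrame([('2015-04-08 13:08:15',)], ['time'])

    # unix_timestamp parses a timestamp string into seconds since the
    # epoch, using the default format 'yyyy-MM-dd HH:mm:ss'.
    df.select(F.unix_timestamp(df.time).alias('unix_time')).show()

    # from_unixtime does the reverse: epoch seconds back to a string.
    df2 = spark.createDataFrame([(1428498495,)], ['unix_time'])
    df2.select(F.from_unixtime(df2.unix_time).alias('time')).show()

The sketch deliberately avoids asserting exact output values, since both functions depend on the session time zone.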
[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15983590#comment-15983590 ]

Michael Patterson commented on SPARK-20456:
-------------------------------------------

I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or to relational algebra. For example, some of my coworkers didn't know you could do `df.agg(mean(col))` without doing a `groupby`. There are also no links to `groupby` in any of the aggregate functions, and I didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding; the kind of example I mean is sketched below. The same thing applies to `lit`: it took me a while to learn what it does.

For the datetime examples, this line, for instance, has a column named 'd': https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926 I think it would be more informative to name it 'date' or 'time'.

Do these sound reasonable?
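The kind of aggregation example I mean (a minimal sketch with made-up data, assuming an active SparkSession named `spark`):

    from pyspark.sql.functions import collect_set, mean

    df = spark.createDataFrame(
        [('a', 1.0), ('a', 2.0), ('b', 3.0)], ['key', 'value'])

    # Aggregate over the whole DataFrame -- no groupby required.
    df.agg(mean('value')).show()

    # The same aggregate functions also work after a groupby.
    df.groupBy('key').agg(collect_set('value')).show()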
[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark
[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15982507#comment-15982507 ]

Hyukjin Kwon commented on SPARK-20456:
--------------------------------------

> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)

I think we have documentation for ...

min - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.min
max - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.max
mean - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.mean
count - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.count
collect_set - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set
collect_list - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
stddev - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
variance - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.variance

in https://github.com/apache/spark/blob/3fbf0a5f9297f438bc92db11f106d4a0ae568613/python/pyspark/sql/functions.py

> 2. Rename columns in datetime examples.

Could you give some pointers?

> 5. Document `lit`

lit - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lit

It seems documented.
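For completeness, what an added usage example for `lit` could look like (a sketch; the column names and values are made up, and it assumes an active SparkSession named `spark`):

    from pyspark.sql.functions import col, lit

    df = spark.createDataFrame([(1,), (2,)], ['x'])

    # lit wraps a plain Python value into a Column, so constants can be
    # combined with column expressions or added as new columns.
    df.withColumn('x_plus_ten', col('x') + lit(10)).show()
    df.withColumn('label', lit('constant')).show()

Without `lit`, the second `withColumn` call would fail, since `withColumn` expects a Column rather than a bare Python value.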