[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983888#comment-15983888 ]

Hyukjin Kwon commented on SPARK-20456:
--

I left the comment above because the current status does not seem to match the title and the description. Let's fix the title and the description here.

> Document major aggregation functions for pyspark
> 
>
> Key: SPARK-20456
> URL: https://issues.apache.org/jira/browse/SPARK-20456
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 2.1.0
>Reporter: Michael Patterson
>Priority: Minor
>
> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, 
> `collect_set`, `collect_list`, `stddev`, `variance`)
> 2. Rename columns in datetime examples.
> 3. Add examples for `unix_timestamp` and `from_unixtime`
> 4. Add note to all trigonometry functions that units are radians.
> 5. Document `lit`




[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Michael Patterson (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983590#comment-15983590 ]

Michael Patterson commented on SPARK-20456:
---

I saw that there are short docstrings for the aggregate functions, but I think they can be unclear for people new to Spark or to relational algebra. For example, some of my coworkers didn't know you could do `df.agg(mean(col))` without doing a `groupby` first. There are also no links to `groupby` in any of the aggregate functions' docstrings. I also didn't know about `collect_set` for a long time. I think adding examples would help with visibility and understanding.
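
For example, a docstring snippet along these lines would show aggregation both with and without a `groupby` (just a sketch; the DataFrame and column names are invented for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import mean, collect_set

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 'a'), (2, 'a'), (3, 'b')], ['value', 'key'])

    # Aggregate over the whole DataFrame; no groupBy is required
    df.agg(mean('value')).show()

    # collect_set gathers the distinct values of a column into an array
    df.agg(collect_set('key')).show()

    # The same functions work after a groupBy as well
    df.groupBy('key').agg(mean('value')).show()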

The same thing applies to `lit`. It took me a while to learn what it did.
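
Even a two-line example would make it click (again only a sketch, with invented data):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import lit

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1,), (2,)], ['value'])

    # lit wraps a Python literal in a Column, e.g. to add a constant column
    df.withColumn('source', lit('web')).show()

    # or to supply a constant where a Column expression is expected
    df.select((df.value + lit(10)).alias('shifted')).show()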

For the datetime examples, this line, for instance, has a column named 'd': https://github.com/map222/spark/blob/master/python/pyspark/sql/functions.py#L926

I think it would be more informative to name it 'date' or 'time'.
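
Something like this, with a descriptive column name, would also cover the `unix_timestamp`/`from_unixtime` examples from the description (only a sketch; the date value and format string are placeholders):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import unix_timestamp, from_unixtime

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([('2015-04-08',)], ['date'])

    # unix_timestamp parses a date string into seconds since the epoch
    df.select(unix_timestamp('date', 'yyyy-MM-dd').alias('unix_time')).show()

    # from_unixtime formats seconds since the epoch back into a timestamp string
    df.select(from_unixtime(unix_timestamp('date', 'yyyy-MM-dd')).alias('ts')).show()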

Do these sound reasonable?




[jira] [Commented] (SPARK-20456) Document major aggregation functions for pyspark

2017-04-25 Thread Hyukjin Kwon (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-20456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15982507#comment-15982507 ]

Hyukjin Kwon commented on SPARK-20456:
--


> Document `sql.functions.py`:
> 1. Document the common aggregate functions (`min`, `max`, `mean`, `count`, `collect_set`, `collect_list`, `stddev`, `variance`)

I think we have documentation for ...

min - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.min
max - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.max
mean - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.mean
count - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.count
collect_set - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_set
collect_list - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.collect_list
stddev - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.stddev
variance - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.variance

in https://github.com/apache/spark/blob/3fbf0a5f9297f438bc92db11f106d4a0ae568613/python/pyspark/sql/functions.py

> 2. Rename columns in datetime examples.

Could you give some pointers?

> 5. Document `lit`

lit - https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.lit

It seems documented.
