[
https://issues.apache.org/jira/browse/SPARK-17963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-17963:
---------------------------------
Description:
Currently, function documentation seems inconsistent and mostly lacks examples
({{extend}}).
For example, some functions have bad indentation, as below:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED approx_count_distinct;
Function: approx_count_distinct
Class: org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
Usage: approx_count_distinct(expr) - Returns the estimated cardinality by
HyperLogLog++.
approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated
cardinality by HyperLogLog++
with relativeSD, the maximum estimation error allowed.
Extended Usage:
No example for approx_count_distinct.
{code}
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}
whereas some are nicely formatted:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
percentile_approx(col, percentage [, accuracy]) - Returns the approximate
percentile value of numeric
column `col` at the given percentage. The value of percentage must be
between 0.0
and 1.0. The `accuracy` parameter (default: 10000) is a positive integer
literal which
controls approximation accuracy at the cost of memory. Higher value of
`accuracy` yields
better accuracy, `1.0/accuracy` is the relative error of the
approximation.
percentile_approx(col, array(percentage1 [, percentage2]...) [,
accuracy]) - Returns the approximate
percentile array of column `col` at the given percentage array. Each
value of the
percentage array must be between 0.0 and 1.0. The `accuracy` parameter
(default: 10000) is
a positive integer literal which controls approximation accuracy at the
cost of memory.
Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the
relative error of
the approximation.
Extended Usage:
No example for percentile_approx.
{code}
Also, spacing is inconsistent in several places, for example {{_FUNC_(a,b)}}
and {{_FUNC_(a, b)}} (note the spacing between arguments).
It would be nicer if most of them had a good example with possible argument types.
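The argument-spacing inconsistency mentioned above could be detected mechanically. The following is only a minimal sketch in Python, not part of Spark: the helper name {{has_inconsistent_spacing}} and the regular expressions are assumptions made for illustration, and it only looks at simple, non-nested argument lists.

```python
import re

# Hypothetical helper, not part of Spark: matches a call-like pattern such as
# name(arg1, arg2) and captures the text between the parentheses.
_CALL = re.compile(r"\w+\(([^()]*)\)")


def has_inconsistent_spacing(usage: str) -> bool:
    """Return True if any argument list has a comma not followed by whitespace."""
    for match in _CALL.finditer(usage):
        # A comma directly followed by a non-space character (or by nothing).
        if re.search(r",(?!\s)", match.group(1)):
            return True
    return False


samples = [
    "_FUNC_(a,b) - Returns ...",
    "_FUNC_(a, b) - Returns ...",
]
for sample in samples:
    print(sample, "->", has_inconsistent_spacing(sample))
```

Running such a check over every usage string would list the functions whose signatures still need a space after each comma.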
The suggested format for multi-line usage is as below:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
rand() - Returns a random column with i.i.d. uniformly distributed values
in [0, 1].
seed is given randomly.
rand(seed) - Returns a random column with i.i.d. uniformly distributed
values in [0, 1].
seed should be an integer/long/NULL literal.
Extended Usage:
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(NULL);
0.8446490682263027
{code}
For single-line usage:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after
start_date.
Extended Usage:
> SELECT date_add('2016-07-30', 1);
'2016-07-31'
{code}
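To make the two suggested layouts concrete, here is a rough Python sketch that renders text in the same shape. It is an illustration only, not how Spark produces the output: the function {{render_description}}, its parameters, and the four-space indent are all assumptions.

```python
def render_description(name, class_name, usages, examples, indent="    "):
    """Render text shaped like the suggested DESCRIBE FUNCTION EXTENDED output.

    Hypothetical helper: `usages` is a list of usage strings and `examples`
    is a list of (query, result) pairs. The indent width is an assumption.
    """
    lines = [f"Function: {name}", f"Class: {class_name}"]
    if len(usages) == 1:
        # Single-line usage stays on the same line as the label.
        lines.append(f"Usage: {usages[0]}")
    else:
        # Multi-line usage: the label sits alone, each usage is indented below it.
        lines.append("Usage:")
        lines.extend(indent + usage for usage in usages)
    lines.append("Extended Usage:")
    if examples:
        for query, result in examples:
            lines.append(f"{indent}> {query}")
            lines.append(f"{indent}  {result}")
    else:
        lines.append(f"{indent}No example for {name}.")
    return "\n".join(lines)


print(render_description(
    "date_add",
    "org.apache.spark.sql.catalyst.expressions.DateAdd",
    ["date_add(start_date, num_days) - Returns the date that is num_days after start_date."],
    [("SELECT date_add('2016-07-30', 1);", "'2016-07-31'")],
))
```

Passing two or more usage strings switches to the multi-line layout, with {{Usage:}} on its own line.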
was:
Currently, function documentation seems inconsistent and mostly lacks examples
({{extend}}).
For example, some functions have bad indentation, as below:
{code}
spark-sql> DESCRIBE FUNCTION last;
Function: last
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Last
Usage: last(expr,isIgnoreNull) - Returns the last value of `child` for a group
of rows.
last(expr,isIgnoreNull=false) - Returns the last value of `child` for a
group of rows.
If isIgnoreNull is true, returns only non-null values.
{code}
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED count;
Function: count
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Count
Usage: count(*) - Returns the total number of retrieved rows, including rows
containing NULL values.
count(expr) - Returns the number of rows for which the supplied expression
is non-NULL.
count(DISTINCT expr[, expr...]) - Returns the number of rows for which the
supplied expression(s) are unique and non-NULL.
Extended Usage:
No example for count.
{code}
whereas some are nicely formatted:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED percentile_approx;
Function: percentile_approx
Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
percentile_approx(col, percentage [, accuracy]) - Returns the approximate
percentile value of numeric
column `col` at the given percentage. The value of percentage must be
between 0.0
and 1.0. The `accuracy` parameter (default: 10000) is a positive integer
literal which
controls approximation accuracy at the cost of memory. Higher value of
`accuracy` yields
better accuracy, `1.0/accuracy` is the relative error of the
approximation.
percentile_approx(col, array(percentage1 [, percentage2]...) [,
accuracy]) - Returns the approximate
percentile array of column `col` at the given percentage array. Each
value of the
percentage array must be between 0.0 and 1.0. The `accuracy` parameter
(default: 10000) is
a positive integer literal which controls approximation accuracy at the
cost of memory.
Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the
relative error of
the approximation.
Extended Usage:
No example for percentile_approx.
{code}
Also, spacing is inconsistent in several places, for example {{_FUNC_(a,b)}}
and {{_FUNC_(a, b)}} (note the spacing between arguments).
It would be nicer if most of them had a good example with possible argument types.
The suggested format for multi-line usage is as below:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED rand;
Function: rand
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage:
rand() - Returns a random column with i.i.d. uniformly distributed values
in [0, 1].
seed is given randomly.
rand(seed) - Returns a random column with i.i.d. uniformly distributed
values in [0, 1].
seed should be an integer/long/NULL literal.
Extended Usage:
> SELECT rand();
0.9629742951434543
> SELECT rand(0);
0.8446490682263027
> SELECT rand(NULL);
0.8446490682263027
{code}
For single-line usage:
{code}
spark-sql> DESCRIBE FUNCTION EXTENDED date_add;
Function: date_add
Class: org.apache.spark.sql.catalyst.expressions.DateAdd
Usage: date_add(start_date, num_days) - Returns the date that is num_days after
start_date.
Extended Usage:
> SELECT date_add('2016-07-30', 1);
'2016-07-31'
{code}
> Add examples (extend) in each function and improve documentation with
> arguments
> -------------------------------------------------------------------------------
>
> Key: SPARK-17963
> URL: https://issues.apache.org/jira/browse/SPARK-17963
> Project: Spark
> Issue Type: Documentation
> Components: SQL
> Reporter: Hyukjin Kwon