GitHub user xuanyuanking commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22746#discussion_r226011439

    --- Diff: docs/sql-reference.md ---
    @@ -0,0 +1,641 @@
    +---
    +layout: global
    +title: Reference
    +displayTitle: Reference
    +---
    +
    +* Table of contents
    +{:toc}
    +
    +## Data Types
    +
    +Spark SQL and DataFrames support the following data types:
    +
    +* Numeric types
    +  - `ByteType`: Represents 1-byte signed integer numbers.
    +  The range of numbers is from `-128` to `127`.
    +  - `ShortType`: Represents 2-byte signed integer numbers.
    +  The range of numbers is from `-32768` to `32767`.
    +  - `IntegerType`: Represents 4-byte signed integer numbers.
    +  The range of numbers is from `-2147483648` to `2147483647`.
    +  - `LongType`: Represents 8-byte signed integer numbers.
    +  The range of numbers is from `-9223372036854775808` to `9223372036854775807`.
    +  - `FloatType`: Represents 4-byte single-precision floating point numbers.
    +  - `DoubleType`: Represents 8-byte double-precision floating point numbers.
    +  - `DecimalType`: Represents arbitrary-precision signed decimal numbers. Backed internally by `java.math.BigDecimal`. A `BigDecimal` consists of an arbitrary precision integer unscaled value and a 32-bit integer scale.
    +* String type
    +  - `StringType`: Represents character string values.
    +* Binary type
    +  - `BinaryType`: Represents byte sequence values.
    +* Boolean type
    +  - `BooleanType`: Represents boolean values.
    +* Datetime type
    +  - `TimestampType`: Represents values comprising values of fields year, month, day,
    +  hour, minute, and second.
    +  - `DateType`: Represents values comprising values of fields year, month, day.
    +* Complex types
    +  - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
    +  elements with the type of `elementType`. `containsNull` is used to indicate if
    +  elements in a `ArrayType` value can have `null` values.
    +  - `MapType(keyType, valueType, valueContainsNull)`:
    +  Represents values comprising a set of key-value pairs.
The data type of keys are + described by `keyType` and the data type of values are described by `valueType`. + For a `MapType` value, keys are not allowed to have `null` values. `valueContainsNull` + is used to indicate if values of a `MapType` value can have `null` values. + - `StructType(fields)`: Represents values with the structure described by + a sequence of `StructField`s (`fields`). + * `StructField(name, dataType, nullable)`: Represents a field in a `StructType`. + The name of a field is indicated by `name`. The data type of a field is indicated + by `dataType`. `nullable` is used to indicate if values of this fields can have + `null` values. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +All data types of Spark SQL are located in the package `org.apache.spark.sql.types`. +You can access them by doing + +{% include_example data_types scala/org/apache/spark/examples/sql/SparkSQLExample.scala %} + +<table class="table"> +<tr> + <th style="width:20%">Data type</th> + <th style="width:40%">Value type in Scala</th> + <th>API to access or create a data type</th></tr> +<tr> + <td> <b>ByteType</b> </td> + <td> Byte </td> + <td> + ByteType + </td> +</tr> +<tr> + <td> <b>ShortType</b> </td> + <td> Short </td> + <td> + ShortType + </td> +</tr> +<tr> + <td> <b>IntegerType</b> </td> + <td> Int </td> + <td> + IntegerType + </td> +</tr> +<tr> + <td> <b>LongType</b> </td> + <td> Long </td> + <td> + LongType + </td> +</tr> +<tr> + <td> <b>FloatType</b> </td> + <td> Float </td> + <td> + FloatType + </td> +</tr> +<tr> + <td> <b>DoubleType</b> </td> + <td> Double </td> + <td> + DoubleType + </td> +</tr> +<tr> + <td> <b>DecimalType</b> </td> + <td> java.math.BigDecimal </td> + <td> + DecimalType + </td> +</tr> +<tr> + <td> <b>StringType</b> </td> + <td> String </td> + <td> + StringType + </td> +</tr> +<tr> + <td> <b>BinaryType</b> </td> + <td> Array[Byte] </td> + <td> + BinaryType + </td> +</tr> +<tr> + <td> <b>BooleanType</b> </td> + <td> Boolean </td> + 
<td> + BooleanType + </td> +</tr> +<tr> + <td> <b>TimestampType</b> </td> + <td> java.sql.Timestamp </td> + <td> + TimestampType + </td> +</tr> +<tr> + <td> <b>DateType</b> </td> + <td> java.sql.Date </td> + <td> + DateType + </td> +</tr> +<tr> + <td> <b>ArrayType</b> </td> + <td> scala.collection.Seq </td> + <td> + ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> + <b>Note:</b> The default value of <i>containsNull</i> is <i>true</i>. + </td> +</tr> +<tr> + <td> <b>MapType</b> </td> + <td> scala.collection.Map </td> + <td> + MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> + <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>true</i>. + </td> +</tr> +<tr> + <td> <b>StructType</b> </td> + <td> org.apache.spark.sql.Row </td> + <td> + StructType(<i>fields</i>)<br /> + <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same + name are not allowed. + </td> +</tr> +<tr> + <td> <b>StructField</b> </td> + <td> The value type in Scala of the data type of this field + (For example, Int for a StructField with the data type IntegerType) </td> + <td> + StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br /> + <b>Note:</b> The default value of <i>nullable</i> is <i>true</i>. + </td> +</tr> +</table> + +</div> + +<div data-lang="java" markdown="1"> + +All data types of Spark SQL are located in the package of +`org.apache.spark.sql.types`. To access or create a data type, +please use factory methods provided in +`org.apache.spark.sql.types.DataTypes`. 
+ +<table class="table"> +<tr> + <th style="width:20%">Data type</th> + <th style="width:40%">Value type in Java</th> + <th>API to access or create a data type</th></tr> +<tr> + <td> <b>ByteType</b> </td> + <td> byte or Byte </td> + <td> + DataTypes.ByteType + </td> +</tr> +<tr> + <td> <b>ShortType</b> </td> + <td> short or Short </td> + <td> + DataTypes.ShortType + </td> +</tr> +<tr> + <td> <b>IntegerType</b> </td> + <td> int or Integer </td> + <td> + DataTypes.IntegerType + </td> +</tr> +<tr> + <td> <b>LongType</b> </td> + <td> long or Long </td> + <td> + DataTypes.LongType + </td> +</tr> +<tr> + <td> <b>FloatType</b> </td> + <td> float or Float </td> + <td> + DataTypes.FloatType + </td> +</tr> +<tr> + <td> <b>DoubleType</b> </td> + <td> double or Double </td> + <td> + DataTypes.DoubleType + </td> +</tr> +<tr> + <td> <b>DecimalType</b> </td> + <td> java.math.BigDecimal </td> + <td> + DataTypes.createDecimalType()<br /> + DataTypes.createDecimalType(<i>precision</i>, <i>scale</i>). + </td> +</tr> +<tr> + <td> <b>StringType</b> </td> + <td> String </td> + <td> + DataTypes.StringType + </td> +</tr> +<tr> + <td> <b>BinaryType</b> </td> + <td> byte[] </td> + <td> + DataTypes.BinaryType + </td> +</tr> +<tr> + <td> <b>BooleanType</b> </td> + <td> boolean or Boolean </td> + <td> + DataTypes.BooleanType + </td> +</tr> +<tr> + <td> <b>TimestampType</b> </td> + <td> java.sql.Timestamp </td> + <td> + DataTypes.TimestampType + </td> +</tr> +<tr> + <td> <b>DateType</b> </td> + <td> java.sql.Date </td> + <td> + DataTypes.DateType + </td> +</tr> +<tr> + <td> <b>ArrayType</b> </td> + <td> java.util.List </td> + <td> + DataTypes.createArrayType(<i>elementType</i>)<br /> + <b>Note:</b> The value of <i>containsNull</i> will be <i>true</i><br /> + DataTypes.createArrayType(<i>elementType</i>, <i>containsNull</i>). 
+ </td> +</tr> +<tr> + <td> <b>MapType</b> </td> + <td> java.util.Map </td> + <td> + DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>)<br /> + <b>Note:</b> The value of <i>valueContainsNull</i> will be <i>true</i>.<br /> + DataTypes.createMapType(<i>keyType</i>, <i>valueType</i>, <i>valueContainsNull</i>)<br /> + </td> +</tr> +<tr> + <td> <b>StructType</b> </td> + <td> org.apache.spark.sql.Row </td> + <td> + DataTypes.createStructType(<i>fields</i>)<br /> + <b>Note:</b> <i>fields</i> is a List or an array of StructFields. + Also, two fields with the same name are not allowed. + </td> +</tr> +<tr> + <td> <b>StructField</b> </td> + <td> The value type in Java of the data type of this field + (For example, int for a StructField with the data type IntegerType) </td> + <td> + DataTypes.createStructField(<i>name</i>, <i>dataType</i>, <i>nullable</i>) + </td> +</tr> +</table> + +</div> + +<div data-lang="python" markdown="1"> + +All data types of Spark SQL are located in the package of `pyspark.sql.types`. +You can access them by doing +{% highlight python %} +from pyspark.sql.types import * +{% endhighlight %} + +<table class="table"> +<tr> + <th style="width:20%">Data type</th> + <th style="width:40%">Value type in Python</th> + <th>API to access or create a data type</th></tr> +<tr> + <td> <b>ByteType</b> </td> + <td> + int or long <br /> + <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of -128 to 127. + </td> + <td> + ByteType() + </td> +</tr> +<tr> + <td> <b>ShortType</b> </td> + <td> + int or long <br /> + <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of -32768 to 32767. 
+ </td> + <td> + ShortType() + </td> +</tr> +<tr> + <td> <b>IntegerType</b> </td> + <td> int or long </td> + <td> + IntegerType() + </td> +</tr> +<tr> + <td> <b>LongType</b> </td> + <td> + long <br /> + <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of + -9223372036854775808 to 9223372036854775807. + Otherwise, please convert data to decimal.Decimal and use DecimalType. + </td> + <td> + LongType() + </td> +</tr> +<tr> + <td> <b>FloatType</b> </td> + <td> + float <br /> + <b>Note:</b> Numbers will be converted to 4-byte single-precision floating + point numbers at runtime. + </td> + <td> + FloatType() + </td> +</tr> +<tr> + <td> <b>DoubleType</b> </td> + <td> float </td> + <td> + DoubleType() + </td> +</tr> +<tr> + <td> <b>DecimalType</b> </td> + <td> decimal.Decimal </td> + <td> + DecimalType() + </td> +</tr> +<tr> + <td> <b>StringType</b> </td> + <td> string </td> + <td> + StringType() + </td> +</tr> +<tr> + <td> <b>BinaryType</b> </td> + <td> bytearray </td> + <td> + BinaryType() + </td> +</tr> +<tr> + <td> <b>BooleanType</b> </td> + <td> bool </td> + <td> + BooleanType() + </td> +</tr> +<tr> + <td> <b>TimestampType</b> </td> + <td> datetime.datetime </td> + <td> + TimestampType() + </td> +</tr> +<tr> + <td> <b>DateType</b> </td> + <td> datetime.date </td> + <td> + DateType() + </td> +</tr> +<tr> + <td> <b>ArrayType</b> </td> + <td> list, tuple, or array </td> + <td> + ArrayType(<i>elementType</i>, [<i>containsNull</i>])<br /> + <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>. + </td> +</tr> +<tr> + <td> <b>MapType</b> </td> + <td> dict </td> + <td> + MapType(<i>keyType</i>, <i>valueType</i>, [<i>valueContainsNull</i>])<br /> + <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>. 
+ </td> +</tr> +<tr> + <td> <b>StructType</b> </td> + <td> list or tuple </td> + <td> + StructType(<i>fields</i>)<br /> + <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same + name are not allowed. + </td> +</tr> +<tr> + <td> <b>StructField</b> </td> + <td> The value type in Python of the data type of this field + (For example, Int for a StructField with the data type IntegerType) </td> + <td> + StructField(<i>name</i>, <i>dataType</i>, [<i>nullable</i>])<br /> + <b>Note:</b> The default value of <i>nullable</i> is <i>True</i>. + </td> +</tr> +</table> + +</div> + +<div data-lang="r" markdown="1"> + +<table class="table"> +<tr> + <th style="width:20%">Data type</th> + <th style="width:40%">Value type in R</th> + <th>API to access or create a data type</th></tr> +<tr> + <td> <b>ByteType</b> </td> + <td> + integer <br /> + <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of -128 to 127. + </td> + <td> + "byte" + </td> +</tr> +<tr> + <td> <b>ShortType</b> </td> + <td> + integer <br /> + <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of -32768 to 32767. + </td> + <td> + "short" + </td> +</tr> +<tr> + <td> <b>IntegerType</b> </td> + <td> integer </td> + <td> + "integer" + </td> +</tr> +<tr> + <td> <b>LongType</b> </td> + <td> + integer <br /> + <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime. + Please make sure that numbers are within the range of + -9223372036854775808 to 9223372036854775807. + Otherwise, please convert data to decimal.Decimal and use DecimalType. + </td> + <td> + "long" + </td> +</tr> +<tr> + <td> <b>FloatType</b> </td> + <td> + numeric <br /> + <b>Note:</b> Numbers will be converted to 4-byte single-precision floating + point numbers at runtime. 
+ </td> + <td> + "float" + </td> +</tr> +<tr> + <td> <b>DoubleType</b> </td> + <td> numeric </td> + <td> + "double" + </td> +</tr> +<tr> + <td> <b>DecimalType</b> </td> + <td> Not supported </td> + <td> + Not supported + </td> +</tr> +<tr> + <td> <b>StringType</b> </td> + <td> character </td> + <td> + "string" + </td> +</tr> +<tr> + <td> <b>BinaryType</b> </td> + <td> raw </td> + <td> + "binary" + </td> +</tr> +<tr> + <td> <b>BooleanType</b> </td> + <td> logical </td> + <td> + "bool" + </td> +</tr> +<tr> + <td> <b>TimestampType</b> </td> + <td> POSIXct </td> + <td> + "timestamp" + </td> +</tr> +<tr> + <td> <b>DateType</b> </td> + <td> Date </td> + <td> + "date" + </td> +</tr> +<tr> + <td> <b>ArrayType</b> </td> + <td> vector or list </td> + <td> + list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br /> + <b>Note:</b> The default value of <i>containsNull</i> is <i>TRUE</i>. + </td> +</tr> +<tr> + <td> <b>MapType</b> </td> + <td> environment </td> + <td> + list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br /> + <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>TRUE</i>. + </td> +</tr> +<tr> + <td> <b>StructType</b> </td> + <td> named list</td> + <td> + list(type="struct", fields=<i>fields</i>)<br /> + <b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same + name are not allowed. + </td> +</tr> +<tr> + <td> <b>StructField</b> </td> + <td> The value type in R of the data type of this field + (For example, integer for a StructField with the data type IntegerType) </td> + <td> + list(name=<i>name</i>, type=<i>dataType</i>, nullable=[<i>nullable</i>])<br /> + <b>Note:</b> The default value of <i>nullable</i> is <i>TRUE</i>. 
    +  </td>
    +</tr>
    +</table>
    +
    +</div>
    +
    +</div>
    +
    +## NaN Semantics
    +
    +There is specially handling for not-a-number (NaN) when dealing with `float` or `double` types that
    +does not exactly match standard floating point semantics.
    +Specifically:
    +
    + - NaN = NaN returns true.
    + - In aggregations, all NaN values are grouped together.
    + - NaN is treated as a normal value in join keys.
    + - NaN values go last when in ascending order, larger than any other numeric value.
    +
    + ## Arithmetic operations
    --- End diff --

    ah thanks! Fixed in b3fc39d.
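
For readers of this thread, the NaN rules quoted in the diff can be illustrated with a minimal sketch in plain Python. This is not Spark code and does not call any Spark API; it only mimics the four listed rules. The helper names `nan_eq`, `nan_last_key`, and `group_key` are hypothetical and exist only for this illustration.

```python
import math


def nan_eq(a: float, b: float) -> bool:
    """Equality in which NaN equals NaN (unlike plain IEEE 754 comparison)."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


def nan_last_key(x: float):
    """Ascending sort key that places NaN after every other value, including +inf."""
    is_nan = math.isnan(x)
    return (is_nan, 0.0 if is_nan else x)


def group_key(x: float):
    """Canonical key so that all NaN values land in one group or join bucket."""
    return "NaN" if math.isnan(x) else x


values = [float("nan"), 1.0, float("inf"), -2.0, float("nan")]

# Ascending order with NaN last, "larger" than any other numeric value.
ordered = sorted(values, key=nan_last_key)

# Aggregation: count occurrences per group, with all NaN values grouped together.
groups = {}
for v in values:
    k = group_key(v)
    groups[k] = groups.get(k, 0) + 1
```

Here `ordered` ends with the two NaN values and `groups` holds a single NaN bucket, whereas Python's built-in `==` would treat each NaN as unequal to every value, itself included.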