Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21296#discussion_r187823861
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1814,6 +1814,7 @@ working with timestamps in `pandas_udf`s to get the 
best performance, see
       - In version 2.3 and earlier, `to_utc_timestamp` and `from_utc_timestamp` respect the timezone in the input timestamp string, which breaks the assumption that the input timestamp is in a specific timezone. Therefore, these two functions can return unexpected results. In version 2.4 and later, this problem has been fixed: `to_utc_timestamp` and `from_utc_timestamp` return null if the input timestamp string contains a timezone. As an example, `from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')` returns `2000-10-10 01:00:00` in both Spark 2.3 and 2.4. However, `from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')`, assuming a local timezone of GMT+8, returns `2000-10-10 09:00:00` in Spark 2.3 but `null` in 2.4. If you do not need this check and want to retain the previous behavior to keep your queries unchanged, you can set `spark.sql.function.rejectTimezoneInString` to false. This option will be removed in Spark 3.0 and should only be used as a temporary workaround.
       - In version 2.3 and earlier, Spark converts Parquet Hive tables by default but ignores table properties like `TBLPROPERTIES (parquet.compression 'NONE')`. The same happens for ORC Hive table properties like `TBLPROPERTIES (orc.compress 'NONE')` when `spark.sql.hive.convertMetastoreOrc=true`. Since Spark 2.4, Spark respects Parquet/ORC specific table properties while converting Parquet/ORC Hive tables. As an example, `CREATE TABLE t(id int) STORED AS PARQUET TBLPROPERTIES (parquet.compression 'NONE')` would generate Snappy-compressed Parquet files during insertion in Spark 2.3, whereas in Spark 2.4 the result would be uncompressed Parquet files.
       - Since Spark 2.0, Spark converts Parquet Hive tables by default for better performance. Since Spark 2.4, Spark converts ORC Hive tables by default, too. This means Spark uses its own ORC support by default instead of Hive SerDe. As an example, `CREATE TABLE t(id int) STORED AS ORC` would be handled with Hive SerDe in Spark 2.3, whereas in Spark 2.4 it would be converted into Spark's ORC data source table and ORC vectorization would be applied. Setting `spark.sql.hive.convertMetastoreOrc` to `false` restores the previous behavior.
     +  - Since Spark 2.4, the handling of malformed rows in CSV files has changed. Previously, all column values of every row were parsed regardless of whether they were actually needed. A row was considered malformed if the CSV parser could not handle any column value in the row, even if that value was not requested. Starting from version 2.4, only the requested column values are parsed, and other values can be ignored. As a result, rows whose requested column values are valid are no longer treated as malformed just because of malformed values in unrequested columns.
    --- End diff --
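
    For reference, a minimal sketch of the `from_utc_timestamp` change documented above; the GMT+8 local timezone and the expected outputs are taken straight from the migration note, not verified here:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tz-migration-sketch")
      .config("spark.sql.session.timeZone", "GMT+8") // local timezone assumed in the note
      .getOrCreate()

    // No timezone in the input string: 2000-10-10 01:00:00 in both 2.3 and 2.4.
    spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00', 'GMT+1')").show(false)

    // Timezone embedded in the input string:
    // Spark 2.3 returns 2000-10-10 09:00:00, Spark 2.4 returns null.
    spark.sql("SELECT from_utc_timestamp('2000-10-10 00:00:00+00:00', 'GMT+1')").show(false)

    // Temporary escape hatch named in the note (to be removed in Spark 3.0):
    spark.conf.set("spark.sql.function.rejectTimezoneInString", "false")
    ```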
    
    Do we really want this behaviour change, @cloud-fan and @gatorsmile?
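
    For context, a minimal sketch of the difference; the schema, file contents and DROPMALFORMED mode are made up for illustration:

    ```scala
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical input: the `age` value on the second data row is not a valid integer.
    val lines = Seq("name,age", "alice,30", "bob,oops").toDS()

    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .schema("name STRING, age INT")
      .csv(lines)

    // Only `name` is requested. Per the proposed note, Spark 2.3 still parses `age`,
    // marks the second row as malformed and drops it; Spark 2.4 skips the
    // unrequested column, so "bob" is kept.
    df.select("name").show()
    ```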

