[jira] [Commented] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped
[ https://issues.apache.org/jira/browse/SPARK-25686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646394#comment-16646394 ]

Zack Behringer commented on SPARK-25686:
----------------------------------------

OK, I thought that might be the case.

> date_trunc Spark SQL function silently returns null if parameters are swapped
> ------------------------------------------------------------------------------
[jira] [Commented] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped
[ https://issues.apache.org/jira/browse/SPARK-25686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646280#comment-16646280 ]

Zack Behringer commented on SPARK-25686:
----------------------------------------

SPARK-24378 only seems to fix the examples. I wanted to take it a step further and have the function validate its parameters, specifically to catch the swapped case, but I could also see it helping when someone has a typo in the format argument, like

{code:java}
date_trunc('minut', a_timestamp){code}

> date_trunc Spark SQL function silently returns null if parameters are swapped
> ------------------------------------------------------------------------------
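For illustration, a minimal sketch of the caller-side validation suggested above, written against the Scala API. The helper name dateTruncStrict is hypothetical; Spark itself performs no such check as of 2.3.x:

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.date_trunc

// Units accepted by date_trunc, per the Spark 2.3 documentation.
val validUnits = Set("year", "yyyy", "yy", "quarter", "month", "mon", "mm",
  "week", "day", "dd", "hour", "minute", "second")

// Hypothetical wrapper: fails fast on a bad unit string ('minut', etc.)
// instead of letting date_trunc silently return null.
def dateTruncStrict(fmt: String, ts: Column): Column = {
  require(validUnits.contains(fmt.toLowerCase),
    s"date_trunc: unrecognized unit '$fmt'")
  date_trunc(fmt, ts)
}
{code}

As a side effect, the typed (String, Column) signature also turns the swapped-argument mistake into a compile error when calling through the Scala API, though it cannot help callers writing raw SQL.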
[jira] [Created] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped
Zack Behringer created SPARK-25686:
---------------------------------------

             Summary: date_trunc Spark SQL function silently returns null if parameters are swapped
                 Key: SPARK-25686
                 URL: https://issues.apache.org/jira/browse/SPARK-25686
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.3.0, 2.3.1
            Reporter: Zack Behringer
            Priority: Minor

date_trunc(a_timestamp, 'minute') returns null.
date_trunc('minute', a_timestamp) returns a valid timestamp.

It would be nice to have a runtime error to help catch the problem. This was not helped by the fact that the doc examples had it swapped, but yes, I should have tested our use of it more thoroughly.
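For reference, a minimal reproduction sketch against the Scala API (assumes a local SparkSession; the view and column names are illustrative):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("repro").getOrCreate()
import spark.implicits._

Seq("2018-10-10 12:34:56").toDF("s")
  .selectExpr("cast(s as timestamp) as a_timestamp")
  .createOrReplaceTempView("t")

// Swapped arguments: analyzes without complaint, but every row is NULL
// because the literal 'minute' cannot be cast to a timestamp.
spark.sql("SELECT date_trunc(a_timestamp, 'minute') FROM t").show()

// Correct order: the unit string comes first, then the timestamp.
spark.sql("SELECT date_trunc('minute', a_timestamp) FROM t").show()
{code}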
[jira] [Created] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping
Zack Behringer created SPARK-22299:
---------------------------------------

             Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping
                 Key: SPARK-22299
                 URL: https://issues.apache.org/jira/browse/SPARK-22299
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
            Reporter: Zack Behringer
            Priority: Minor

Loading a large table (300M rows) over JDBC can be partitioned into tasks using the column, numPartitions, lowerBound and upperBound parameters on DataFrameReader.jdbc(), but that becomes troublesome if the column is skewed or fragmented (as when somebody used a global sequence for the partition column instead of a sequence specific to the table, or when the table becomes fragmented by deletes, etc.).

This can be worked around by using a modulus operation on the column, but that will be slow unless there is already an index on the modulus expression with the exact numPartitions value, so it doesn't scale well if you want to change the number of partitions. Another way would be to use an expression index on a hash of the partition column, but I'm not sure if JDBC striping is smart enough to create hash ranges for each stripe from hashes of the lower and upper bound parameters. If it is, that is great, but it still requires a very large index just for this use case.

A less invasive approach would be to use the table's physical ordering along with OFFSET and LIMIT, so that only the total number of records to read would need to be known beforehand in order to distribute them evenly; no indexes needed. I realize that OFFSET and LIMIT are not standard SQL keywords. I also see that a list of custom predicates can be defined (a sketch of that overload follows below); I haven't tried that to see whether I can embed numPartitions-specific predicates, each with its own OFFSET and LIMIT range. Some relational databases take quite a long time to count the number of records needed to determine the stripe size, though, so this can also be troublesome.

Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in conjunction with the number of executors to read manageable batches (internally using OFFSET and LIMIT) until there are no more available results?
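For illustration, a minimal sketch of the modulus workaround mentioned above, using the predicate-list overload of DataFrameReader.jdbc(). The URL, table, and column names are assumptions:

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-striping").getOrCreate()

val numPartitions = 8
// One predicate per partition; each becomes the WHERE clause of its own
// query, so the stripes stay even regardless of gaps in the id sequence.
val predicates = (0 until numPartitions)
  .map(i => s"mod(id, $numPartitions) = $i")
  .toArray

val props = new Properties()
props.setProperty("driver", "org.postgresql.Driver") // illustrative driver

val df = spark.read.jdbc(
  "jdbc:postgresql://dbhost:5432/mydb", // illustrative URL
  "big_table",                          // illustrative table with a bigint id
  predicates,
  props)
{code}

Note that each predicate is injected into the WHERE clause of its partition's generated query, so raw OFFSET/LIMIT clauses are unlikely to be accepted there by most databases; the OFFSET/LIMIT striping proposed in this issue would need support inside the JDBC source itself.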