[jira] [Commented] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped

2018-10-11 Thread Zack Behringer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646394#comment-16646394 ]

Zack Behringer commented on SPARK-25686:


OK, I thought that might be the case.

> date_trunc Spark SQL function silently returns null if parameters are swapped
> -
>
> Key: SPARK-25686
> URL: https://issues.apache.org/jira/browse/SPARK-25686
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.3.1
>Reporter: Zack Behringer
>Priority: Minor
>
> date_trunc(a_timestamp, 'minute') returns null.
> date_trunc('minute', a_timestamp) returns a valid timestamp.
> It would be nice to have a runtime error to help catch the problem.
> This was not helped by the fact that the doc examples had the arguments 
> swapped, but yes, I should have tested our use of it more thoroughly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped

2018-10-11 Thread Zack Behringer (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-25686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16646280#comment-16646280 ]

Zack Behringer commented on SPARK-25686:


SPARK-24378 only seems to fix the examples. I wanted to take it a step further 
and have the function validate its parameters, specifically catching the case 
where they are swapped, but I could also see it helping if someone had a typo 
in the unit argument, like
{code:java}
date_trunc('minut', a_timestamp){code}
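The kind of validation I have in mind could look something like the following Python sketch. This is a hypothetical helper, not Spark's actual implementation; the set of accepted unit names is my reading of the date_trunc documentation and may not be exhaustive.

```python
# Hypothetical sketch of the validation suggested above: fail fast on an
# unrecognized unit string (e.g. the typo 'minut') instead of silently
# producing null. Not Spark code; the unit list is an assumption.
VALID_UNITS = {
    "year", "yyyy", "yy", "quarter", "month", "mon", "mm",
    "week", "day", "dd", "hour", "minute", "second",
}

def validate_trunc_unit(fmt):
    """Return the normalized unit, or raise a clear error for a typo."""
    unit = fmt.lower()
    if unit not in VALID_UNITS:
        raise ValueError(
            "date_trunc: unknown unit %r; expected one of %s"
            % (fmt, sorted(VALID_UNITS))
        )
    return unit
```

With this in place, date_trunc('minut', a_timestamp) would raise an error at the call site rather than returning null downstream.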







[jira] [Created] (SPARK-25686) date_trunc Spark SQL function silently returns null if parameters are swapped

2018-10-09 Thread Zack Behringer (JIRA)
Zack Behringer created SPARK-25686:
--

 Summary: date_trunc Spark SQL function silently returns null if parameters are swapped
 Key: SPARK-25686
 URL: https://issues.apache.org/jira/browse/SPARK-25686
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1, 2.3.0
Reporter: Zack Behringer


date_trunc(a_timestamp, 'minute') returns null.

date_trunc('minute', a_timestamp) returns a valid timestamp.

It would be nice to have a runtime error to help catch the problem.

This was not helped by the fact that the doc examples had the arguments 
swapped, but yes, I should have tested our use of it more thoroughly.
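For reference, truncating to the minute just zeroes the fields below it. A plain-Python sketch of the semantics the correct call produces (illustrative only, not using Spark):

```python
from datetime import datetime

# Plain-Python illustration of what date_trunc('minute', a_timestamp)
# computes: zero out everything below the minute. Not Spark code, just
# the expected semantics of the correct argument order.
def trunc_to_minute(ts):
    return ts.replace(second=0, microsecond=0)

ts = datetime(2018, 10, 9, 14, 37, 52, 123456)
print(trunc_to_minute(ts))  # 2018-10-09 14:37:00
```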






[jira] [Created] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping

2017-10-17 Thread Zack Behringer (JIRA)
Zack Behringer created SPARK-22299:
--

 Summary: Use OFFSET and LIMIT for JDBC DataFrameReader striping
 Key: SPARK-22299
 URL: https://issues.apache.org/jira/browse/SPARK-22299
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0, 2.1.0, 2.0.0, 1.6.0, 1.5.0, 1.4.0
Reporter: Zack Behringer
Priority: Minor


Loading a large table (300M rows) over JDBC can be partitioned into tasks using 
the column, numPartitions, lowerBound and upperBound parameters on 
DataFrameReader.jdbc(), but that becomes troublesome if the column is 
skewed or fragmented (as when somebody used a global sequence for the partition 
column instead of a sequence specific to the table, or when the table becomes 
fragmented by deletes, etc.).
This can be worked around by using a modulus operation on the column, but that 
will be slow unless there is already an index on the modulus expression 
with the exact numPartitions value, so it doesn't scale well if you want to 
change the number of partitions. Another way would be to use an expression 
index on a hash of the partition column, but I'm not sure whether JDBC striping 
is smart enough to create hash ranges for each stripe using hashes of the lower 
and upper bound parameters. If it is, that is great, but it still requires a 
very large index just for this use case.

A less invasive approach would be to use the table's physical ordering along 
with OFFSET and LIMIT, so that only the total number of records to read would 
need to be known beforehand in order to distribute them evenly; no indexes 
needed. I realize that OFFSET and LIMIT are not standard SQL keywords.

I also see that a list of custom predicates can be defined. I haven't tried 
that to see whether I can embed numPartitions-specific predicates, each with 
its own OFFSET and LIMIT range.
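The modulus workaround mentioned above can also be expressed through the predicates overload of DataFrameReader.jdbc, which takes one WHERE-clause fragment per partition. A minimal sketch that just builds those fragments (the column name is hypothetical, and this assumes the database accepts `%` as the modulus operator):

```python
# Sketch: one WHERE-clause predicate per partition, using a modulus on the
# (possibly skewed) partition column. The resulting strings would be passed
# as the predicates argument of DataFrameReader.jdbc(url, table, predicates,
# properties). The column name "id" is an assumption for illustration.
def modulus_predicates(column, num_partitions):
    return [
        "%s %% %d = %d" % (column, num_partitions, k)
        for k in range(num_partitions)
    ]

preds = modulus_predicates("id", 4)
# ["id % 4 = 0", "id % 4 = 1", "id % 4 = 2", "id % 4 = 3"]
```

As noted above, without an index on the modulus expression each of these predicates still forces a scan on the database side.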

Some relational databases take quite a long time to count the number of records 
needed to determine the stripe size, though, so this can also be troublesome. 
Could a feature similar to "spark.sql.files.maxRecordsPerFile" be used in 
conjunction with the number of executors to read manageable batches (internally 
using OFFSET and LIMIT) until there are no more results available?
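The OFFSET/LIMIT striping proposed above could be sketched as follows, assuming the total row count is known up front. The table name, ordering column, and exact OFFSET/LIMIT syntax are assumptions; the syntax in particular varies across databases.

```python
# Sketch of evenly sized OFFSET/LIMIT stripes over a stable ordering,
# assuming the total number of rows is known beforehand. Table name,
# ORDER BY column, and OFFSET/LIMIT syntax are hypothetical and
# database-dependent.
def offset_limit_queries(table, order_col, total_rows, num_partitions):
    base = total_rows // num_partitions
    queries = []
    offset = 0
    for k in range(num_partitions):
        # spread the remainder over the first few partitions
        limit = base + (1 if k < total_rows % num_partitions else 0)
        queries.append(
            "SELECT * FROM %s ORDER BY %s LIMIT %d OFFSET %d"
            % (table, order_col, limit, offset)
        )
        offset += limit
    return queries
```

Note that without an index on the ordering column, large OFFSET values themselves can be slow, since many databases still scan past the skipped rows.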


