[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564730#comment-17564730 ]
Pablo Langa Blanco commented on SPARK-39623:
--------------------------------------------

I think the problem here is a misunderstanding of how lowerBound and upperBound work. These options only affect how Spark generates the partitions; all the data is still returned, so they do not act as a filter. For example, with this configuration

{code:java}
.option("numPartitions", "4")
.option("partitionColumn", "t")
.option("lowerBound", "2022-07-10 00:00:00")
.option("upperBound", "2022-07-10 23:59:00")
{code}

the expected filters of the four partition queries are

{code:sql}
WHERE "t" < '2022-07-10 05:59:45' or "t" is null
WHERE "t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30'
WHERE "t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15'
WHERE "t" >= '2022-07-10 17:59:15'
{code}
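To make the stride arithmetic concrete, here is a small self-contained Python sketch. It is an illustration only, not Spark's actual implementation: the range is cut into numPartitions equal strides, the first partition is left unbounded below (and also picks up NULLs), and the last is left unbounded above.

{code:python}
# Illustrative only: approximates how the JDBC source derives the
# partition predicates above from numPartitions, lowerBound and upperBound.
from datetime import datetime

num_partitions = 4
lower = datetime.fromisoformat("2022-07-10 00:00:00")
upper = datetime.fromisoformat("2022-07-10 23:59:00")

stride = (upper - lower) / num_partitions  # 5h 59m 45s in this example

for i in range(num_partitions):
    lo = lower + i * stride
    hi = lower + (i + 1) * stride
    if i == 0:
        # first partition: no lower bound, so rows *before* lowerBound
        # (and NULLs) still land here
        print(f"WHERE \"t\" < '{hi}' or \"t\" is null")
    elif i == num_partitions - 1:
        # last partition: no upper bound, so rows *after* upperBound land here
        print(f"WHERE \"t\" >= '{lo}'")
    else:
        print(f"WHERE \"t\" >= '{lo}' AND \"t\" < '{hi}'")
{code}

Running it prints exactly the four WHERE clauses above, which is why rows outside [lowerBound, upperBound) are not dropped: they all simply fall into the first or last partition.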
If you want to filter the data, you can do it as you have shown, or with something like this:

{code:java}
df.where(col("t") >= lit("2022-07-10 00:00:00"))
{code}

Spark then pushes this filter down and generates these partitions:

{code:sql}
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" < '2022-07-10 05:59:45' or "t" is null)
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 17:59:15')
{code}

> partitioning by datestamp leads to wrong query on backend?
> ----------------------------------------------------------
>
>                 Key: SPARK-39623
>                 URL: https://issues.apache.org/jira/browse/SPARK-39623
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Dmitry
>            Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what seems to me to be a bug, but maybe I am just not understanding something.
> My goal is to run data analysis on a Spark cluster. The data is stored in a PostgreSQL DB; the tables contain timestamped entries (timestamp with time zone).
> The code looks like:
> {code:python}
> import time
> from pyspark.sql import SparkSession
> spark = SparkSession \
>     .builder \
>     .appName("foo") \
>     .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
>     .getOrCreate()
> df = spark.read \
>     .format("jdbc") \
>     .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>     .option("dbtable", "billing") \
>     .option("user", "user") \
>     .option("driver", "org.postgresql.Driver") \
>     .option("numPartitions", "4") \
>     .option("partitionColumn", "datestamp") \
>     .option("lowerBound", "2022-01-01 00:00:00") \
>     .option("upperBound", "2022-06-26 23:59:59") \
>     .option("fetchsize", 1000000) \
>     .load()
> t0 = time.time()
> print("Number of entries is =====> ", df.count(), " Time to execute ", time.time()-t0)
> ...
> {code}
> datestamp is a timestamp with time zone. I see this query on the DB backend:
> {code:java}
> SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or "datestamp" is null
> {code}
> The table is huge and entries go way back before 2022-01-02 11:59:59. So what ends up happening is that all workers but one complete, and the remaining one keeps processing that query, which, to me, looks like it wants to fetch all the data before 2022-01-02 11:59:59. That is not what I intended.
> I remedied this by changing to:
> {code:python}
> .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But it feels kludgy. Am I doing something wrong, or is there a bug in the way partitioning queries are generated?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)