[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564730#comment-17564730 ]
Pablo Langa Blanco commented on SPARK-39623:
--------------------------------------------

I think the problem here is a misunderstanding of how lowerBound and upperBound work. These options only affect how Spark generates the partitions; all the data is still returned, so they do not act as a filter. For example, with this configuration

{code:java}
.option("numPartitions", "4")
.option("partitionColumn", "t")
.option("lowerBound", "2022-07-10 00:00:00")
.option("upperBound", "2022-07-10 23:59:00")
{code}

the expected filters of the four partition queries are

{code:sql}
WHERE "t" < '2022-07-10 05:59:45' or "t" is null
WHERE "t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30'
WHERE "t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15'
WHERE "t" >= '2022-07-10 17:59:15'
{code}
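To make the stride arithmetic concrete, here is a small self-contained Python sketch. It is an illustration only, not Spark's actual implementation: the range is cut into numPartitions equal strides, the first partition is left unbounded below (and also picks up NULLs), and the last is left unbounded above.

{code:python}
# Illustrative only: approximates how the JDBC source derives the
# partition predicates above from numPartitions, lowerBound and upperBound.
from datetime import datetime

num_partitions = 4
lower = datetime.fromisoformat("2022-07-10 00:00:00")
upper = datetime.fromisoformat("2022-07-10 23:59:00")

stride = (upper - lower) / num_partitions  # 5h 59m 45s in this example

for i in range(num_partitions):
    lo = lower + i * stride
    hi = lower + (i + 1) * stride
    if i == 0:
        # first partition: no lower bound, so rows *before* lowerBound
        # (and NULLs) still land here
        print(f"WHERE \"t\" < '{hi}' or \"t\" is null")
    elif i == num_partitions - 1:
        # last partition: no upper bound, so rows *after* upperBound land here
        print(f"WHERE \"t\" >= '{lo}'")
    else:
        print(f"WHERE \"t\" >= '{lo}' AND \"t\" < '{hi}'")
{code}

Running it prints exactly the four WHERE clauses above, which is why rows outside [lowerBound, upperBound) are not dropped: they all simply fall into the first or last partition.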
If you want to filter the data, you can do it as you have shown, or with something like this:

{code:java}
df.where(col("t") >= lit("2022-07-10 00:00:00"))
{code}

Spark then pushes this filter down and generates these partitions:

{code:sql}
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" < '2022-07-10 05:59:45' or "t" is null)
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 17:59:15')
{code}

> partitioning by datestamp leads to wrong query on backend?
> ----------------------------------------------------------
>
>                 Key: SPARK-39623
>                 URL: https://issues.apache.org/jira/browse/SPARK-39623
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Dmitry
>            Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what seems to me to be a bug, but maybe I am just not understanding something.
> My goal is to run data analysis on a Spark cluster. The data is stored in a PostgreSQL DB; the tables contain timestamped entries (timestamp with time zone).
> The code looks like:
> {code:python}
> import time
> from pyspark.sql import SparkSession
> spark = SparkSession \
>     .builder \
>     .appName("foo") \
>     .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
>     .getOrCreate()
> df = spark.read \
>     .format("jdbc") \
>     .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>     .option("dbtable", "billing") \
>     .option("user", "user") \
>     .option("driver", "org.postgresql.Driver") \
>     .option("numPartitions", "4") \
>     .option("partitionColumn", "datestamp") \
>     .option("lowerBound", "2022-01-01 00:00:00") \
>     .option("upperBound", "2022-06-26 23:59:59") \
>     .option("fetchsize", 1000000) \
>     .load()
> t0 = time.time()
> print("Number of entries is =====> ", df.count(), " Time to execute ", time.time()-t0)
> ...
> {code}
> datestamp is a timestamp with time zone. I see this query on the DB backend:
> {code:java}
> SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or "datestamp" is null
> {code}
> The table is huge and entries go way back before 2022-01-02 11:59:59. So what ends up happening is that all workers but one complete, and the remaining one keeps processing that query, which, to me, looks like it wants to fetch all the data before 2022-01-02 11:59:59. That is not what I intended.
> I remedied this by changing to:
> {code:python}
> .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But it feels kludgy. Am I doing something wrong, or is there a bug in the way partitioning queries are generated?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)