[jira] [Commented] (SPARK-39623) partitioning by datestamp leads to wrong query on backend?

2022-07-10 Thread Pablo Langa Blanco (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17564730#comment-17564730 ]

Pablo Langa Blanco commented on SPARK-39623:


I think the problem here is a misunderstanding of how lowerBound and upperBound
work.
These options only affect how Spark generates the partitions; all of the data is
still returned. They do not act as a filter.

For example, with this configuration:

{code:java}
.option("numPartitions", "4")
.option("partitionColumn", "t")
.option("lowerBound", "2022-07-10 00:00:00")
.option("upperBound", "2022-07-10 23:59:00")
{code}


The expected filters of the generated partition queries are:

{code:sql}
WHERE "t" < '2022-07-10 05:59:45' or "t" is null
WHERE "t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30'
WHERE "t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15'
WHERE "t" >= '2022-07-10 17:59:15'
{code}
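
These boundaries are just the range between lowerBound and upperBound split into numPartitions equal strides. As a quick back-of-the-envelope check in plain Python (not Spark code; the values are copied from the options above):

{code:python}
from datetime import datetime

lower = datetime.fromisoformat("2022-07-10 00:00:00")
upper = datetime.fromisoformat("2022-07-10 23:59:00")
num_partitions = 4

# Roughly how Spark derives the stride used in the partition WHERE clauses
stride = (upper - lower) / num_partitions  # 5:59:45 for this range
for i in range(1, num_partitions):
    print(lower + stride * i)
# 2022-07-10 05:59:45
# 2022-07-10 11:59:30
# 2022-07-10 17:59:15
{code}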


If you want to filter the data, you can do it as you have shown, or with
something like this:


{code:python}
from pyspark.sql.functions import col, lit

df.where(col("t") >= lit("2022-07-10 00:00:00"))
{code}


Spark then pushes this filter down and generates these partition queries:


{code:sql}
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" < '2022-07-10 05:59:45' or "t" is null)
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 05:59:45' AND "t" < '2022-07-10 11:59:30')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 11:59:30' AND "t" < '2022-07-10 17:59:15')
WHERE (("t" IS NOT NULL) AND ("t" >= '2022-07-10 00:00:00.0')) AND ("t" >= '2022-07-10 17:59:15')
{code}
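
Putting the two together, a minimal sketch (assuming the same spark session and the placeholder connection details from the issue): the bound options only shape the partition strides, while the explicit where() supplies the real filter.

{code:python}
from pyspark.sql.functions import col, lit

# numPartitions/lowerBound/upperBound control parallelism only;
# the .where() at the end is what actually filters the rows.
df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://example.org:5432/postgres") \
    .option("dbtable", "billing") \
    .option("user", "user") \
    .option("driver", "org.postgresql.Driver") \
    .option("numPartitions", "4") \
    .option("partitionColumn", "t") \
    .option("lowerBound", "2022-07-10 00:00:00") \
    .option("upperBound", "2022-07-10 23:59:00") \
    .load() \
    .where(col("t") >= lit("2022-07-10 00:00:00"))
{code}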


> partitioning by datestamp leads to wrong query on backend?
> ----------------------------------------------------------
>
> Key: SPARK-39623
> URL: https://issues.apache.org/jira/browse/SPARK-39623
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Dmitry
>Priority: Major
>
> Hello,
> I am new to Apache Spark, so please bear with me. I would like to report what
> seems to me a bug, but maybe I am just not understanding something.
> My goal is to run data analysis on a Spark cluster. Data is stored in a
> PostgreSQL DB. Tables contain timestamped entries (timestamp with time
> zone).
> The code looks like:
> {code:python}
> import time
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession \
>     .builder \
>     .appName("foo") \
>     .config("spark.jars", "/opt/postgresql-42.4.0.jar") \
>     .getOrCreate()
>
> df = spark.read \
>     .format("jdbc") \
>     .option("url", "jdbc:postgresql://example.org:5432/postgres") \
>     .option("dbtable", "billing") \
>     .option("user", "user") \
>     .option("driver", "org.postgresql.Driver") \
>     .option("numPartitions", "4") \
>     .option("partitionColumn", "datestamp") \
>     .option("lowerBound", "2022-01-01 00:00:00") \
>     .option("upperBound", "2022-06-26 23:59:59") \
>     .option("fetchsize", 100) \
>     .load()
>
> t0 = time.time()
> print("Number of entries is => ", df.count(), " Time to execute ", time.time() - t0)
> ...
> {code}
> datestamp is a timestamp with time zone.
> I see this query on the DB backend:
> {code:sql}
> SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or "datestamp" is null
> {code}
> The table is huge and entries go way back before 2022-01-02 11:59:59. So what
> ends up happening is that all workers but one complete, and the remaining one
> continues to process that query, which, to me, looks like it wants to get all
> the data before 2022-01-02 11:59:59. Which is not what I intended.
> I remedied this by changing to:
> {code:python}
>  .option("dbtable", "(select * from billinginfo where datestamp > '2022-01-01 00:00:00') as foo") \
> {code}
> And that seems to have solved the issue. But this seems kludgy. Am I doing
> something wrong, or is there a bug in the way partitioning queries are
> generated?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-39623) partitioning by datestamp leads to wrong query on backend?

2022-07-05 Thread Dmitry (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562695#comment-17562695 ]

Dmitry commented on SPARK-39623:


It is quoted in my issue; here it is again.

I see this query on the DB backend:
{code:sql}
SELECT 1 FROM billinginfo WHERE "datestamp" < '2022-01-02 11:59:59.9375' or "datestamp" is null
{code}







[jira] [Commented] (SPARK-39623) partitioning by datestamp leads to wrong query on backend?

2022-07-04 Thread Hyukjin Kwon (Jira)


[ https://issues.apache.org/jira/browse/SPARK-39623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17562336#comment-17562336 ]

Hyukjin Kwon commented on SPARK-39623:
--

You can actually check the logs to see which SQL is being sent to the backend.
It would be much easier to assess the issue if you shared them.
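
For instance, a minimal way to see the generated query from the Spark side, without digging through the database's own logs, is to print the physical plan:

{code:python}
# For a JDBC read, the physical plan includes the JDBCRelation scan
# and its PushedFilters, i.e. the predicates sent to the database.
df.explain()
{code}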



