Re: Spark SQL check timestamp with other table and update a column.

2020-11-21 Thread Khalid Mammadov

Hi,

I am not sure if you were writing pseudo-code or real SQL, but there were a few issues in the query: "left" is missing the JOIN keyword, there is a period instead of a comma after a.col2, and the join condition compares b.col1 with itself (b.col1 < b.col1).


I have reproduced your example in the Spark REPL and everything worked as expected; the result is the one you need.


Please see below full code:

## Spark 3.0.0
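For reference, the session below reads two pipe-delimited files with a header row. Files along these lines reproduce your tables (one assumption on my side: the truncated "+" timezone suffix from your post is dropped so the values parse with the default timestamp format):

tab1:
col1|col2|col3
2020-11-17T20:50:57.777|1|

tab2:
col1|col2|col3
2020-11-17T21:19:06.508|1|win
2020-11-17T20:49:06.244|1|win
2020-11-17T20:19:13.484|1|Win

The empty trailing field in tab1 is read as null by the CSV reader.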

>>> a = spark.read.csv("tab1", sep="|", header=True, schema="col1 TIMESTAMP, col2 INT, col3 STRING")

>>> a.printSchema()
root
 |-- col1: timestamp (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)


>>> a.show()
+--------------------+----+----+
|                col1|col2|col3|
+--------------------+----+----+
|2020-11-17 20:50:...|   1|null|
+--------------------+----+----+

>>> b = spark.read.csv("tab2", sep="|", header=True, schema="col1 TIMESTAMP, col2 INT, col3 STRING")

>>> b.printSchema()
root
 |-- col1: timestamp (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)


>>> b.show()
+--------------------+----+----+
|                col1|col2|col3|
+--------------------+----+----+
|2020-11-17 21:19:...|   1| win|
|2020-11-17 20:49:...|   1| win|
|2020-11-17 20:19:...|   1| Win|
+--------------------+----+----+

>>> a.createOrReplaceTempView("table1")
>>> b.createOrReplaceTempView("table2")

>>> res = spark.sql("""
... select a.col1,a.col2, b.col1, b.col2, coalesce(a.col3,b.col3) as col3
... from table1 a *join* table2 b
... on (a.col2=b.col2) and (*a.col1 < b.col1*)
... """)
>>> res.show()
+--------------------+----+--------------------+----+----+
|                col1|col2|                col1|col2|col3|
+--------------------+----+--------------------+----+----+
|2020-11-17 20:50:...|   1|2020-11-17 21:19:...|   1| win|
+--------------------+----+--------------------+----+----+
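For completeness, the same join can be written with the DataFrame API instead of SQL. A minimal sketch using the dataframes above (the aliases just disambiguate the duplicated column names):

>>> from pyspark.sql import functions as F
>>> res2 = (a.alias("a")
...     .join(b.alias("b"),
...           (F.col("a.col2") == F.col("b.col2")) & (F.col("a.col1") < F.col("b.col1")))
...     .select("a.col1", "a.col2", "b.col1", "b.col2",
...             F.coalesce("a.col3", "b.col3").alias("col3")))
>>> res2.show()

It produces the same single row as the SQL version.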

Regards,

Khalid

On 19/11/2020 05:13, anbutech wrote:

select a.col1,a.col2.coalesce(a.col3,b.col3) as col3
from table1 a left table2 b
on (a.col2=b.col2) and (b.col1 < b.col1)


Spark SQL check timestamp with other table and update a column.

2020-11-18 Thread anbutech
Hi Team,

I want to update col3 in table1 if col1 from table2 is less than col1 in table1, updating each record in table1. I am not getting the correct output.

Table 1:
col1|col2|col3
2020-11-17T20:50:57.777+|1|null

Table 2:
col1|col2|col3
2020-11-17T21:19:06.508+|1|win
2020-11-17T20:49:06.244+|1|win
2020-11-17T20:19:13.484+|1|Win

SQL tried:

select a.col1,a.col2.coalesce(a.col3,b.col3) as col3 
from table1 a left table2 b
on (a.col2=b.col2) and (b.col1 < b.col1)

Output: I am getting the following.

2020-11-17T20:50:57.777+|1|Win 2020-11-17T21:19:06.508+|1|win
2020-11-17T20:50:57.777+|1|Win 2020-11-17T20:49:06.244+|1|win

I'm looking for only the second record in the output.

The issue here is that the above query returns an extra record whenever col1 from table2 is less than col1 in table1.

expected output:

2020-11-17T20:50:57.777+|1|Win 2020-11-17T20:49:06.244+|1|win

How do we achieve that correctly? I have many records like this.
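One way to keep only the closest earlier table2 record per table1 row would be to rank the matches with a window function and take the first one. A minimal sketch, assuming the two tables are registered as temp views table1 and table2 and that "closest" means the latest b.col1 that is still earlier than a.col1:

>>> res = spark.sql("""
... select col1, col2, col3 from (
...     select a.col1, a.col2, coalesce(a.col3, b.col3) as col3,
...            row_number() over (partition by a.col1, a.col2 order by b.col1 desc) as rn
...     from table1 a join table2 b
...     on a.col2 = b.col2 and b.col1 < a.col1
... ) t where rn = 1
... """)

For the sample data this keeps only the 2020-11-17T20:49:06 match, which is the record in the expected output.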

Thanks


