Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread ayan guha
Hi. Option 1: You can write back to the processed queue and add some additional info, like the last time tried and a retry counter. Option 2: Load unpaired transactions into a staging table in Postgres. Modify the streaming job of the "created" flow to try to pair any unpaired transactions and clear them by moving them to the main table.
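A minimal PySpark sketch of Option 2, landing each micro-batch of unpaired transactions in a Postgres staging table via foreachBatch. The staging table name, connection settings, and checkpoint path are illustrative assumptions, not details from the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("staging-sketch").getOrCreate()

    # Unpaired "processed" events; in practice this stream would already be
    # filtered down to the transactions that found no "created" counterpart.
    unpaired = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions-processed")
        .load()
        .selectExpr("CAST(value AS STRING) AS payload"))

    def to_staging(batch_df, batch_id):
        # Land the batch in the staging table; the modified "created" flow
        # later pairs these rows and moves them to the main table.
        (batch_df.write
            .format("jdbc")
            .option("url", "jdbc:postgresql://localhost:5432/payments")
            .option("dbtable", "staging_unpaired")
            .option("user", "spark")
            .option("password", "secret")
            .mode("append")
            .save())

    query = (unpaired.writeStream
        .foreachBatch(to_staging)
        .option("checkpointLocation", "/tmp/ckpt/staging")
        .start())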

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
One alternative I can think of is that you publish your orphaned transactions to another topic from the main Spark job. You create a new DF based on the orphaned transactions: result = orphanedDF ... .writeStream .outputMode('complete'…
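A hedged completion of the truncated snippet above. The orphanedDF construction, the topic name transactions-orphaned, and the connection details are placeholders; note also that outputMode('complete') only applies to full aggregations, so a plain republish would use append:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orphan-republish").getOrCreate()

    # Stand-in for orphanedDF; in practice it would contain only the
    # transactions that could not be paired.
    orphanedDF = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions-processed")
        .load())

    result = (orphanedDF
        .selectExpr("CAST(value AS STRING) AS value")  # Kafka sink expects a 'value' column
        .writeStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("topic", "transactions-orphaned")
        .option("checkpointLocation", "/tmp/ckpt/orphans")
        .outputMode("append")
        .start())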

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
I mean... I guess? But I don't really have Airflow here, and I didn't really want to fall back to a "batch"-kind of approach with Airflow. I'd rather use a Dead Letter Queue approach instead (like I mentioned, another topic for the failed ones, which is later consumed and pumps the messages back to…

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Well, this is a matter of using journal entries. What you can do is write those "orphaned" transactions that you cannot pair through transaction_id to a journal table in your Postgres DB. Then you can pair them with the entries in the relevant Postgres table. If the essence is not time…
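A sketch of that pairing step as a small batch job. All table names (journal_orphans, transactions_created, transactions_main) and connection details are assumed for illustration; deleting the paired rows from the journal would be a separate SQL step:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("journal-pairing").getOrCreate()
    jdbc_url = "jdbc:postgresql://localhost:5432/payments"
    props = {"user": "spark", "password": "secret",
             "driver": "org.postgresql.Driver"}

    journal = spark.read.jdbc(jdbc_url, "journal_orphans", properties=props)
    created = spark.read.jdbc(jdbc_url, "transactions_created", properties=props)

    # Journal entries that now have a matching "created" row can be moved
    # into the main table (left_semi keeps only the journal's columns).
    paired = journal.join(created, "transaction_id", "left_semi")
    paired.write.jdbc(jdbc_url, "transactions_main", mode="append",
                      properties=props)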

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
That is exactly the case, Sebastian! - In practice, that "created" means "*authorized*", but I still cannot deduct anything from the customer balance, - the "processed" means I can safely deduct the transaction_amount from the customer balance, - and the "refunded" means I must give the transaction…

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Sebastian Piu
So in payment systems you have something similar, I think. You have an authorisation, then the actual transaction, and maybe a refund some time in the future. You want to proceed with a transaction only if you've seen the auth, but in an eventually consistent system this might not always happen. You…

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
I'm terribly sorry, Mich. That was my mistake. The timestamps are not the same (I copy-pasted without realizing that; I'm really sorry for the confusion). Please assume NONE of the following transactions are in the database yet. *transactions-created:* { "transaction_id": 1, "amount": 1000, "timestamp": …

spark perform slow in eclipse rich client platform

2021-07-09 Thread jianxu
Hi there. Wonder if anyone might have experience with running a Spark app from Eclipse Rich Client Platform in Java. The same Spark app code runs much slower from Eclipse Rich Client Platform than from normal Java in Eclipse without Rich Client Platform. Appreciate any input.

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
One second. The topic called transactions_processed is streaming through Spark. Transactions-created are created at the same time (the same timestamp), but you have NOT received them and they don't yet exist in your DB, because presumably your relational database is too slow to ingest them? You do a…

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
Thanks for the quick reply! I'm not sure I got the idea correctly... but from what I'm understanding, wouldn't that actually end up the same way? Because this is the current scenario: *transactions-processed:* { "transaction_id": 1, "timestamp": "2020-04-04 11:01:00" } { "transaction_id": 2, "timestamp": …

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Thanks for the details. Can you read these in the same app? For example, this is PySpark, but it serves the purpose: read topic "newtopic" in one micro-batch and the other topic "md" in another micro-batch. try: # process topic --> newtopic streamingNewtopic = self.spark…
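A hedged reconstruction of that truncated PySpark snippet: both topics read in the same application as separate streaming queries, keeping the topic names "newtopic" and "md" from the message, with the per-batch processing reduced to a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("two-topics-one-app").getOrCreate()

    def read_topic(topic):
        return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", topic)
            .load()
            .selectExpr("CAST(value AS STRING) AS payload"))

    streamingNewtopic = read_topic("newtopic")
    streamingMd = read_topic("md")

    # Placeholder processing: each topic gets its own query and checkpoint.
    newtopicQuery = (streamingNewtopic.writeStream
        .foreachBatch(lambda df, epoch_id: df.show(truncate=False))
        .option("checkpointLocation", "/tmp/ckpt/newtopic")
        .start())

    mdQuery = (streamingMd.writeStream
        .foreachBatch(lambda df, epoch_id: df.show(truncate=False))
        .option("checkpointLocation", "/tmp/ckpt/md")
        .start())

    spark.streams.awaitAnyTermination()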

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
Hello! Sure thing! I'm reading them *separately*; both are apps written with Scala + Spark Structured Streaming. I feel like I missed some details on my original thread (sorry, it was past 4 AM and it was getting frustrating). Please let me try to clarify some points: *Transactions Created Consumer*…

Re: [Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Mich Talebzadeh
Can you please clarify whether you are reading these two topics separately or within the same Scala or Python script in Spark Structured Streaming? HTH

[Spark Structured Streaming] retry/replay failed messages

2021-07-09 Thread Bruno Oliveira
Hello guys, I've been struggling with this for some days now, without success, so I would highly appreciate any enlightenment. The simplified scenario is the following: - I've got 2 topics in Kafka (it's already like that in production; can't change it): transactions-created and transactions-processed…
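For context, a minimal sketch of the pattern the thread converges on: consume transactions-processed, split each micro-batch into records whose transaction_id already exists in the database and records that do not, and route the unmatched ones to a retry topic. Connection details, table names, and the retry topic are illustrative assumptions:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pair-or-retry").getOrCreate()
    jdbc_url = "jdbc:postgresql://localhost:5432/payments"
    props = {"user": "spark", "password": "secret",
             "driver": "org.postgresql.Driver"}

    processed = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "transactions-processed")
        .load()
        .select(
            F.get_json_object(F.col("value").cast("string"),
                              "$.transaction_id").alias("transaction_id"),
            F.col("value").cast("string").alias("payload")))

    def pair_or_retry(batch_df, batch_id):
        created = spark.read.jdbc(jdbc_url, "transactions_created",
                                  properties=props)
        # Rows with a matching "created" record go to the main table...
        matched = batch_df.join(created, "transaction_id", "left_semi")
        matched.write.jdbc(jdbc_url, "transactions_main", mode="append",
                           properties=props)
        # ...and the rest go back to Kafka for a later attempt.
        unmatched = batch_df.join(created, "transaction_id", "left_anti")
        (unmatched.select(F.col("payload").alias("value"))
            .write.format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("topic", "transactions-retry")
            .save())

    query = (processed.writeStream
        .foreachBatch(pair_or_retry)
        .option("checkpointLocation", "/tmp/ckpt/pair")
        .start())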

Spark problem

2021-07-09 Thread shashank patel
Hi, I need a solution for an incremental count. For 100 requests it is giving the count in 3 responses (4, 49, 100). I want the count like this: (1, 2, 3, 4, 5, 6, 7, 8, …, 98, 99, 100). I am using Spark SQL (groupBy with count). Regards, Shashank
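If the goal is a per-row running count, a window function over an ordering column produces it in batch Spark SQL; in streaming, each trigger can only emit the aggregate as of that micro-batch, which is why the counts arrive as (4, 49, 100) rather than 1..100. A hedged batch sketch, with the event_time ordering column and toy data assumed:

    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("running-count").getOrCreate()

    # Toy data standing in for the 100 requests.
    requests = spark.createDataFrame(
        [(i, f"req-{i}") for i in range(1, 101)],
        ["event_time", "request_id"])

    # Running count 1, 2, 3, ..., 100, ordered by the assumed time column.
    w = (Window.orderBy("event_time")
         .rowsBetween(Window.unboundedPreceding, Window.currentRow))
    running = requests.withColumn("running_count", F.count("*").over(w))
    running.show(5)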