Re: COVID-19 Data [DISCUSSION]

2020-04-12 Thread Teemu Heikkilä
analysis and visualisation, and it's probably a good choice for the task. If you want to go lower level, like with Spark, and you are familiar with Python, pandas could be a good library to investigate. br, Teemu Heikkilä te...@emblica.com +358 40 0963509 Emblica | The data engineering co
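For illustration, a minimal pandas sketch of the kind of workflow suggested above; the file name and the "date"/"cases" column names are hypothetical placeholders, not a reference to a specific dataset.

    # Minimal pandas sketch (illustrative): load a COVID-19 case CSV and plot daily totals.
    # "covid19_cases.csv" and the column names are hypothetical placeholders.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("covid19_cases.csv", parse_dates=["date"])

    # Aggregate to daily totals and draw a simple time series.
    daily = df.groupby("date")["cases"].sum()
    daily.plot(title="Daily reported cases")
    plt.show()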

Re: reporting use case

2019-04-04 Thread Teemu Heikkilä
Based on your answers, I would consider using the update stream to update the actual snapshots, i.e. by joining the data. Of course, it now depends on how the update stream has been implemented and how to get the data into Spark. Could you tell a little bit more about that? - Teemu
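A hedged sketch of the "join the update stream onto the snapshot" idea, assuming both sides share a key column and schema; all paths, table names and the "id" key are hypothetical.

    # Hedged sketch: merge one batch of the update stream into a snapshot by key,
    # preferring the update row where one exists. Paths and column names are illustrative.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("snapshot-update").getOrCreate()

    snapshot = spark.read.parquet("/data/snapshot")            # current snapshot, keyed by "id"
    updates = spark.read.parquet("/data/updates/latest")       # one batch of the update stream

    # Full outer join; take the update value when present, otherwise keep the snapshot value.
    merged = (snapshot.alias("s")
              .join(updates.alias("u"), F.col("s.id") == F.col("u.id"), "full_outer")
              .select(*[F.coalesce(F.col("u." + c), F.col("s." + c)).alias(c)
                        for c in snapshot.columns]))

    merged.write.mode("overwrite").parquet("/data/snapshot_next")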

Re: spark optimized pagination

2018-06-11 Thread Teemu Heikkilä
So you are now serving the data on demand through Spark? I suggest you change your API to query Cassandra instead and store the results from Spark back there; that way you only have to process the whole dataset once, and Cassandra is well suited for that kind of workload. -T
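A hedged sketch of that pattern: precompute the result set once in Spark and write it to Cassandra, so the API can paginate against Cassandra instead of re-running Spark per request. It assumes the DataStax Spark Cassandra Connector is on the classpath (added via --packages); the keyspace, table and column names are hypothetical.

    # Hedged sketch: write precomputed results to Cassandra for the API to page through.
    # Requires the DataStax Spark Cassandra Connector; names below are illustrative.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("precompute-results")
             .config("spark.cassandra.connection.host", "cassandra-host")
             .getOrCreate())

    results = spark.read.parquet("/data/raw").groupBy("user_id").count()

    (results.write
        .format("org.apache.spark.sql.cassandra")
        .options(keyspace="reports", table="user_counts")
        .mode("append")
        .save())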

Re: Writing a DataFrame is taking too long and huge space

2018-03-09 Thread Teemu Heikkilä
Sounds like you're doing something more than just writing the same file back to disk; what does your preprocessing consist of? Sometimes you can save a lot of space by using other formats, but here we're talking about an over 200x increase in file size, so depending on the transformations applied to the data you might
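To illustrate the "other formats" point, a hedged sketch comparing a plain text write with a compressed columnar write; the input path and format are placeholders for whatever the preprocessed DataFrame actually is.

    # Hedged sketch: the same DataFrame written as CSV vs. compressed Parquet.
    # Columnar encoding plus compression is usually far smaller on disk; paths are illustrative.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-output").getOrCreate()
    df = spark.read.json("/data/input")   # stand-in for the preprocessed DataFrame

    # Row-oriented, uncompressed by default.
    df.write.mode("overwrite").csv("/data/out_csv")

    # Columnar and compressed.
    df.write.mode("overwrite").option("compression", "snappy").parquet("/data/out_parquet")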

Re: Measuring cluster utilization of a streaming job

2017-11-14 Thread Teemu Heikkilä
Without knowing anything about your pipeline, the best estimate of the resources needed is to run the job with the same ingestion rate as the normal production load. With Kafka you can enable backpressure, so under high load your latency will just increase but you don't have to have capacity for
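A minimal sketch of enabling backpressure for a Spark Streaming (DStream) job reading from Kafka, as mentioned above; the rate values are illustrative, not recommendations.

    # Hedged sketch: run a capacity test with backpressure enabled so overload shows up
    # as increased latency rather than failures. Rate limits below are illustrative.
    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("capacity-test")
            .set("spark.streaming.backpressure.enabled", "true")
            # Cap the Kafka read rate while the rate estimator warms up.
            .set("spark.streaming.backpressure.initialRate", "10000")
            .set("spark.streaming.kafka.maxRatePerPartition", "20000"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, batchDuration=10)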

[Spark Structured Streaming] Changing partitions of (flat)MapGroupsWithState

2017-11-08 Thread Teemu Heikkilä
I have a Spark Structured Streaming job and I'm crunching through a few terabytes of data. I'm using the file stream reader and it works flawlessly; I can adjust its partitioning with spark.default.parallelism. However, I'm doing sessionization on the data after loading it, and I'm currently
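For context, a hedged sketch of this kind of pipeline. In Structured Streaming the stage after groupByKey, which feeds (flat)MapGroupsWithState, is partitioned by spark.sql.shuffle.partitions rather than spark.default.parallelism; flatMapGroupsWithState itself is a Scala/Java API, so a plain stateful aggregation stands in here, and the schema, paths and values are illustrative assumptions.

    # Hedged sketch: file stream source plus a stateful step whose parallelism is governed
    # by spark.sql.shuffle.partitions (set before the query's first run). Values are illustrative.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = (SparkSession.builder
             .appName("sessionization")
             .config("spark.sql.shuffle.partitions", "400")
             .getOrCreate())

    schema = StructType([
        StructField("user", StringType()),
        StructField("ts", TimestampType()),
    ])

    events = spark.readStream.schema(schema).json("/data/incoming")

    # Stateful aggregation after the shuffle runs with 400 tasks here.
    sessions = events.groupBy("user").count()

    query = (sessions.writeStream
             .outputMode("update")
             .format("console")
             .start())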