lysis and visualisation, and it's probably a good choice for the task. If you want to go lower level, like with Spark, and you are familiar with Python, pandas could be a good library to investigate.
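Just as an illustration (the file and column names below are made up), a quick exploration in pandas is usually only a few lines:

import pandas as pd

# load a sample of the data (hypothetical file and column names)
df = pd.read_csv("events_sample.csv", parse_dates=["timestamp"])

# quick aggregation and plot for exploration (needs matplotlib installed)
daily = df.groupby(df["timestamp"].dt.date)["value"].sum()
daily.plot(kind="line", title="Daily totals")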
br,
Teemu Heikkilä
te...@emblica.com
+358 40 0963509
Emblica | The data engineering co
Based on your answers, I would consider using the update stream to update the actual snapshots, i.e. by joining the data.
Of course, how to get the data into Spark now depends on how the update stream has been implemented.
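For example, assuming the updates arrive through something like Kafka (just an assumption, adjust to your setup; the topic, field and path names below are made up), a rough stream-static join sketch could look like this:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, LongType

spark = SparkSession.builder.appName("snapshot-updates").getOrCreate()

# hypothetical schema for the update messages
update_schema = (StructType()
    .add("entity_id", StringType())
    .add("field", StringType())
    .add("value", StringType())
    .add("updated_at", LongType()))

# the update stream, here assumed to come from a Kafka topic
# (needs the spark-sql-kafka-0-10 package on the classpath)
updates = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "entity-updates")
    .load()
    .select(from_json(col("value").cast("string"), update_schema).alias("u"))
    .select("u.*"))

# the current snapshots as a static table, e.g. Parquet on disk or S3
snapshots = spark.read.parquet("/data/snapshots")

# stream-static join: enrich every update with its current snapshot row
joined = updates.join(snapshots, on="entity_id", how="left")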
Could you tell a little bit more about that?
- Teemu
> On 4 Apr 2019, at 22.23,
So you are now providing the data on-demand through Spark?
I suggest you change your API to query from Cassandra and store the results from Spark back there. That way you only have to process the whole dataset once, and Cassandra is well suited to that kind of workload.
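Roughly something like this, assuming the DataStax spark-cassandra-connector is on the classpath (the keyspace and table names below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-results").getOrCreate()

# `results` stands in for whatever DataFrame your Spark job produces
results = spark.createDataFrame(
    [("user-1", 42), ("user-2", 7)],
    ["user_id", "session_count"])

# write the processed results to Cassandra via the connector
(results.write
    .format("org.apache.spark.sql.cassandra")
    .options(table="session_results", keyspace="analytics")
    .mode("append")
    .save())

Then your API just queries Cassandra directly instead of kicking off Spark on demand.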
-T
> On 10 Jun
Sounds like you're doing something other than just writing the same file back to disk; what does your preprocessing consist of?
Sometimes you can save lots of space by using other formats, but now we're talking about an over 200x increase in file size, so depending on the transformations applied to the data you might
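For reference, if the output is currently plain text or JSON, writing it out as compressed Parquet instead is often a big win (the paths below are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-output").getOrCreate()

# hypothetical input path; use whatever format your preprocessing emits
df = spark.read.json("/data/preprocessed_json")

# Parquet with snappy compression is usually far smaller than plain text/JSON
df.write.mode("overwrite").parquet("/data/preprocessed_parquet", compression="snappy")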
Without knowing anything about your pipeline, the best estimate of the resources needed is to run the job with the same ingestion rate as the normal production load.
With Kafka you can enable back pressure, so under high load your latency will just increase, but you don't have to have capacity for
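For example, with the DStream API that's a single config flag, and with Structured Streaming's Kafka source you can cap how much each micro-batch reads so the backlog simply waits in Kafka (broker and topic names below are made up):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("rate-limited-ingest")
    # DStream-based jobs: let Spark adapt the receive rate automatically
    .config("spark.streaming.backpressure.enabled", "true")
    .getOrCreate())

# Structured Streaming's Kafka source: limit how much is read per micro-batch;
# under load, latency grows instead of the cluster needing peak capacity
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # made-up broker address
    .option("subscribe", "events")                     # made-up topic
    .option("maxOffsetsPerTrigger", 100000)
    .load())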
I have a Spark structured streaming job and I'm crunching through a few terabytes of data.
I'm using the file stream reader and it works flawlessly; I can adjust the partitioning of that with spark.default.parallelism.
However, I'm doing sessionization for the data after loading it and I'm currently