Hi,
I am using Spark 2.3.2 and I am facing issues with data locality: even after
setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL.
Can someone help me with this?
Thank you
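For reference, a minimal sketch of how the locality wait settings are usually applied (the 30s values are illustrative, not from the original post). Note that a bare number such as 200 is typically parsed as milliseconds, so an explicit unit avoids surprises:

import org.apache.spark.sql.SparkSession

// Illustrative values only; tune for your cluster.
val spark = SparkSession.builder()
  .appName("locality-wait-example")
  // Global wait before the scheduler falls back to a worse locality level.
  .config("spark.locality.wait", "30s")
  // Per-level overrides; explicit units avoid the bare-number ambiguity.
  .config("spark.locality.wait.node", "30s")
  .config("spark.locality.wait.rack", "30s")
  .getOrCreate()

Also worth checking: if no executor runs on a node that actually holds the data blocks, NODE_LOCAL is impossible and the scheduler can never do better than RACK_LOCAL regardless of the wait settings.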
Do you really want to build all of that and open yourself to bugs when you
can just use foreachBatch? Here are your options:
1. Build it yourself
// Read the last committed offsets from some external store
prevOffsets = readOffsets()
latestOffsets = getOffsets()  // latest offsets currently in Kafka
// Batch-read just the new range (the real option names are
// "startingOffsets" / "endingOffsets", given as JSON per topic-partition)
df = spark.read.format("kafka")
  .option("startingOffsets", prevOffsets)
  .option("endingOffsets", latestOffsets).load()
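For comparison, a sketch of the foreachBatch route mentioned above (requires Spark 2.4+; writeWithExistingLogic, the server, topic and paths are placeholders, not from the thread):

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("foreach-batch-sketch").getOrCreate()

// Hypothetical helper written for static DataFrames.
def writeWithExistingLogic(df: DataFrame): Unit =
  df.write.format("parquet").mode("append").save("/tmp/out")

val kafkaDf = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder
  .option("subscribe", "my-topic")                 // placeholder
  .load()

kafkaDf.writeStream
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Each micro-batch is a plain static DataFrame, so existing
    // batch writer code can be reused unchanged.
    writeWithExistingLogic(batchDf)
  }
  .option("checkpointLocation", "/tmp/checkpoints/foreach-batch-sketch")
  .trigger(Trigger.Once())  // process what is new since the last checkpoint, then stop
  .start()
  .awaitTermination()

Offset tracking is handled by the checkpoint, so none of the manual offset bookkeeping from option 1 is needed.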
Thanks Anil, I think that’s the approach I will take.
Hi Burak,
That was a possibility to think about, but my team has custom DataFrame
writer functions we would like to use; unfortunately, they were written with
static DataFrames in mind. I do see there is a foreachBatch write mode, but my
Hi Ruijing,
Why do you not want to use structured streaming here? This is exactly why
structured streaming + Trigger.Once was built, precisely so that you don't
have to build that solution yourself.
You also get exactly-once semantics if you use the built-in sinks.
Best,
Burak
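For what it's worth, a minimal sketch of the setup Burak describes, assuming a Kafka source and the built-in Parquet file sink (server, topic and paths are placeholders). The checkpoint plus the file sink's commit log is what provides the exactly-once guarantee:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-once-sketch").getOrCreate()

val df = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")  // placeholder
  .option("subscribe", "my-topic")                 // placeholder
  .load()

// Trigger.Once processes everything new since the last checkpoint and then
// stops, so this can be scheduled like an ordinary batch job.
df.writeStream
  .format("parquet")  // built-in sink with transactional file commits
  .option("path", "/data/out")                                    // placeholder
  .option("checkpointLocation", "/data/checkpoints/trigger-once") // placeholder
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()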
Hi,
I am using Spark on EMR, and I was hoping to use their optimised committer,
but it looks like it will not be used if
"spark.sql.sources.partitionOverwriteMode" is set to 'dynamic'.
What are the best practices to use in this case?
The renaming phase in S3 is very slow, and the bottleneck
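Not an answer, but for concreteness, a sketch of the two settings involved; the EMRFS committer property name below is taken from the AWS EMR docs as I recall them and should be treated as an assumption, as should the example path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-overwrite-example")
  // Dynamic partition overwrite: only partitions present in the incoming data
  // are replaced; on EMR this reportedly falls back to the rename-based committer.
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  // EMRFS S3-optimized committer flag (EMR-specific; assumed property name).
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()

// Illustrative write; the bucket/path is a placeholder.
// df.write.mode("overwrite").partitionBy("dt").parquet("s3://bucket/table/")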
I would have to check, but in principle it could be done by checking the
streaming logs: once you detect when a shuffle operation starts and ends, you
can work out the total operation time.
https://stackoverflow.com/questions/27276884/what-is-shuffle-read-shuffle-write-in-apache-spark
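A sketch of one way to capture those numbers programmatically rather than reading them off the UI, using a SparkListener that sums per-task shuffle bytes per stage (a rough sketch, not specific to any particular job):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}
import org.apache.spark.sql.SparkSession
import scala.collection.mutable

val spark = SparkSession.builder().appName("shuffle-metrics-sketch").getOrCreate()

// Accumulate shuffle read/write bytes per stage as tasks finish.
val shuffleRead = mutable.Map[Int, Long]().withDefaultValue(0L)
val shuffleWrite = mutable.Map[Int, Long]().withDefaultValue(0L)

spark.sparkContext.addSparkListener(new SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {  // metrics can be missing for failed tasks
      shuffleRead(taskEnd.stageId) += m.shuffleReadMetrics.totalBytesRead
      shuffleWrite(taskEnd.stageId) += m.shuffleWriteMetrics.bytesWritten
    }
  }
})

// ... run a job containing a shuffle, then inspect the maps, e.g.:
// shuffleWrite.toSeq.sortBy(_._1).foreach { case (stage, bytes) =>
//   println(s"stage $stage shuffle write: $bytes bytes") }

These should correspond to the values behind the "Shuffle Read" / "Shuffle Write" columns in the stages view, so they can serve as ground truth for a cost model.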
Dear Spark contributors,
I'm searching for a way to model Spark shuffle cost, and I wonder if there
are mathematical formulas to compute the "shuffle read" and "shuffle write"
sizes shown in the stages view of the Spark UI.
If there aren't, are there any references that would give me a head start on this?