[ spark-streaming ] - Data Locality issue

2020-02-04 Thread Karthik Srinivas
Hi, I am using Spark 2.3.2 and I am facing issues due to data locality: even after setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL. Can someone help me with this? Thank you
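For reference, a minimal sketch of how the locality-wait settings are usually passed (values are illustrative, not a recommendation). These are time properties, so spelling out a unit such as "30s" avoids any ambiguity about how a bare "200" is interpreted:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("locality-wait-example")
  // Global wait before the scheduler falls back to a less local level.
  .config("spark.locality.wait", "30s")
  // Per-level overrides; these default to spark.locality.wait when unset.
  .config("spark.locality.wait.node", "30s")
  .config("spark.locality.wait.rack", "30s")
  .getOrCreate()
```

Note that if no executor is co-located with the data, no amount of waiting changes the achievable locality level.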

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
Do you really want to build all of that and open yourself to bugs when you can just use foreachBatch? Here are your options: 1. Build it yourself // Read offsets from some store prevOffsets = readOffsets() latestOffsets = getOffsets() df = spark.read.format("kafka").option("startOffsets",
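As a hedged sketch of option 1 ("build it yourself"), completing the idea in the quoted snippet: readOffsets(), getOffsets() and saveOffsets() below stand in for whatever offset store is used and are not Spark APIs; broker, topic and paths are placeholders. The batch Kafka source takes startingOffsets/endingOffsets.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-manual-offsets").getOrCreate()

// Hypothetical helpers standing in for your own offset store.
def readOffsets(): String = """{"my-topic":{"0":0}}"""   // stub: last committed offsets (JSON per topic/partition)
def getOffsets(): String  = "latest"                      // stub: end of the range to read
def saveOffsets(offsets: String): Unit = ()               // stub: persist for the next run

val prevOffsets   = readOffsets()
val latestOffsets = getOffsets()

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")       // placeholder broker
  .option("subscribe", "my-topic")                        // placeholder topic
  .option("startingOffsets", prevOffsets)
  .option("endingOffsets", latestOffsets)
  .load()

df.write.mode("append").parquet("/path/to/output")        // placeholder sink
saveOffsets(latestOffsets)                                // commit only after the write succeeds
```

The fragility Burak is pointing at lives in the last two lines: if the job dies between the write and saveOffsets, the next run re-reads the same range.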

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Ruijing Li
Thanks Anil, I think that’s the approach I will take. Hi Burak, that was a possibility to think about, but my team has custom dataframe writer functions we would like to use; unfortunately they were written with static dataframes in mind. I do see there is a foreachBatch write mode, but my
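For what it's worth, foreachBatch (Spark 2.4+) hands each micro-batch to your function as an ordinary, static DataFrame, so a writer written for static DataFrames can often be reused unchanged. A hedged sketch, where customWrite stands in for the team's existing writer and broker/topic/paths are placeholders:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Stand-in for an existing writer that expects a static DataFrame.
def customWrite(df: DataFrame): Unit =
  df.write.mode("append").parquet("/path/to/output")       // placeholder sink

val spark = SparkSession.builder().appName("reuse-static-writer").getOrCreate()

// A typed function value sidesteps the foreachBatch overload ambiguity seen
// with some Scala 2.12 builds.
val writeBatch: (DataFrame, Long) => Unit =
  (batchDf, _) => customWrite(batchDf)                      // batchDf is a plain DataFrame here

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")         // placeholder broker
  .option("subscribe", "my-topic")                          // placeholder topic
  .load()
  .writeStream
  .option("checkpointLocation", "/path/to/checkpoint")      // placeholder path
  .foreachBatch(writeBatch)
  .start()
```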

Data locality

2020-02-04 Thread Karthik Srinivas
Hi all, I am using Spark 2.3.2 and I am facing issues due to data locality: even after setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL. Can someone help me with this? Thank you

Re: Best way to read batch from Kafka and Offsets

2020-02-04 Thread Burak Yavuz
Hi Ruijing, Why do you not want to use structured streaming here? This is exactly why structured streaming + Trigger.Once was built, just so that you don't build that solution yourself. You also get exactly-once semantics if you use the built-in sinks. Best, Burak
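A hedged sketch of the Trigger.Once pattern Burak describes: a streaming query that reads whatever is new in Kafka since the last checkpoint, writes it with a built-in file sink, and stops. Broker, topic and paths are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("trigger-once-batch").getOrCreate()

spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")         // placeholder broker
  .option("subscribe", "my-topic")                          // placeholder topic
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")                                        // built-in sink, exactly-once to the output path
  .option("path", "/path/to/output")                        // placeholder path
  .option("checkpointLocation", "/path/to/checkpoint")      // Kafka offsets are tracked here
  .trigger(Trigger.Once())                                  // process available data, then stop
  .start()
  .awaitTermination()
```

Scheduling this as a periodic job gives batch-style runs without hand-rolled offset bookkeeping.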

Committer to use if "spark.sql.sources.partitionOverwriteMode": 'dynamic'

2020-02-04 Thread edge7
Hi, I am using Spark on EMR and was hoping to use their optimised committer, but it looks like it is not used when "spark.sql.sources.partitionOverwriteMode" is set to 'dynamic'. What are the best practices in this case? The renaming phase in S3 is very slow, and the bottleneck
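For context, a minimal sketch of what a dynamic partition overwrite looks like on the Spark side; whether the EMRFS S3-optimized committer kicks in for this mode is an EMR-specific question and is not shown here. Paths and the partition column are placeholders.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-partition-overwrite")
  // Only the partitions present in the incoming data are overwritten.
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .getOrCreate()

val df = spark.read.parquet("s3://bucket/input/")           // placeholder input

df.write
  .mode("overwrite")        // with "dynamic", this no longer truncates the whole table
  .partitionBy("dt")        // placeholder partition column
  .parquet("s3://bucket/output/")
```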

Re: shuffle mathematic formulat

2020-02-04 Thread Aironman DirtDiver
I would have to check, but in principle it could be done by inspecting the streaming logs: once you detect when a shuffle operation starts and ends, you know the total operation time. https://stackoverflow.com/questions/27276884/what-is-shuffle-read-shuffle-write-in-apache-spark
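As an alternative to scraping logs or the UI, the same numbers can be pulled programmatically. A hedged sketch using a SparkListener; the metric names match TaskMetrics in Spark 2.x/3.x, and the printing is illustrative:

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class ShuffleMetricsListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val m = taskEnd.taskMetrics
    if (m != null) {
      val read  = m.shuffleReadMetrics.totalBytesRead    // bytes this task read during a shuffle
      val write = m.shuffleWriteMetrics.bytesWritten     // bytes this task wrote for a shuffle
      println(s"stage=${taskEnd.stageId} shuffleRead=$read shuffleWrite=$write")
    }
  }
}

// Register it on an existing SparkContext, e.g.:
// sc.addSparkListener(new ShuffleMetricsListener())
```

Summing these per stage reproduces the "Shuffle Read" / "Shuffle Write" columns of the stages view.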

shuffle mathematic formulat

2020-02-04 Thread asma zgolli
Dear Spark contributors, I'm searching for a way to model Spark shuffle cost, and I wonder if there are mathematical formulas to compute the "shuffle read" and "shuffle write" sizes shown in the stages view of the Spark UI. If there aren't, are there any references to get a head start on this?