Hi
I had a similar problem. For me, using the RDD StatCounter helped a lot.
Check out
http://stackoverflow.com/questions/41169873/spark-dynamic-dag-is-a-lot-slower-and-different-from-hard-coded-dag
and
http://stackoverflow.com/questions/41445571/spark-migrate-sql-window-function-to-rdd-for-better
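In case it helps, here is a minimal sketch of that StatCounter approach, assuming a single numeric column named "value" (made up for the example); RDD.stats() gives all the basic statistics in one pass:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("statcounter-demo").getOrCreate()

# Toy frame standing in for the real data; "value" is a made-up column.
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

# One pass over the RDD gives count, mean, stdev, min and max together,
# instead of running a separate aggregation for each statistic.
stats = df.rdd.map(lambda row: row["value"]).stats()
print(stats.count(), stats.mean(), stats.stdev(), stats.min(), stats.max())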
Hi guys,
I have this situation:
1. A data frame with 22 columns.
2. I need to add some columns (feature engineering) using the existing
columns; 12 columns will be added for each column in a list.
3. I created a loop, but at the 5th item (column) the loop starts to go
very slow in the join part, I can observe
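Not knowing your exact expressions, here is a hedged sketch of the usual fix for this: accumulate the derived columns as Column expressions and apply them in a single select(), instead of joining the frame back to itself on every loop iteration. The column names and formulas below are made-up placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-loop-demo").getOrCreate()

# Toy frame; in your case this would be the 22-column frame.
df = spark.createDataFrame([(1, 10.0, 20.0), (2, 30.0, 40.0)],
                           ["id", "a", "b"])

feature_cols = ["a", "b"]  # the list you loop over
derived = []
for c in feature_cols:
    # Accumulate Column expressions instead of joining per iteration;
    # your real 12 expressions per column would go here.
    derived.append((F.col(c) * 2).alias(c + "_x2"))
    derived.append(F.log(F.col(c) + 1).alias(c + "_log"))

# One select resolves everything at once, so the plan stays shallow.
result = df.select("*", *derived)
result.show()

If some features really do require a join per iteration, then on Spark 2.1+ checkpointing the frame every few iterations (spark.sparkContext.setCheckpointDir(...) then df = df.checkpoint()) truncates the lineage that otherwise makes each successive join slower to plan.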
What I missed: also try increasing the number of partitions using repartition().
On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote:
It does not look like a Scala vs. Python thing. How big is your audience data
store? Can it be broadcast?
What memory footprint are you seeing? At what point is YARN killing it?
Depending on that, you may want to tweak the number of partitions of the
input dataset and increase the number of executors.
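For what it's worth, a small sketch of both suggestions together, with hypothetical table and column names: repartition the large input before the expensive stage, and mark the small audience table with a broadcast hint so the join itself needs no shuffle:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Stand-ins for the real tables; names and sizes are made up.
events = spark.range(0, 1000000).withColumnRenamed("id", "user_id")
audience = spark.createDataFrame([(1, "a"), (2, "b")], ["user_id", "segment"])

# Spread the big side over more partitions before the expensive stage...
events = events.repartition(200, "user_id")

# ...and hint that the small side fits in each executor's memory, so it is
# shipped to the executors instead of shuffling the big side for the join.
joined = events.join(F.broadcast(audience), "user_id")
joined.explain()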
Hi everybody,
I am using Apache Spark Streaming with a TCP connector to receive data.
I have a Python application that connects to a sensor and creates a TCP
server that waits for a connection from Apache Spark, and then sends JSON
data through this socket.
How can I manage to join many independent
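The question is cut off in the digest, but for the receiving side a minimal PySpark sketch would look like the following, assuming the sensor app listens on localhost:9999 (hypothetical host/port) and writes one JSON object per line:

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="tcp-json-demo")
ssc = StreamingContext(sc, 5)  # 5-second batches

# One string per newline-terminated record from the sensor's TCP server.
lines = ssc.socketTextStream("localhost", 9999)
records = lines.map(json.loads)
records.pprint()

# For several independent sensors, build one DStream per host/port and
# merge them with ssc.union(stream1, stream2, ...) before processing.

ssc.start()
ssc.awaitTermination()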