Hi
I had a similar problem. For me, using the RDD StatCounter helped a lot.
Check out
http://stackoverflow.com/questions/41169873/spark-dynamic-dag-is-a-lot-slower-and-different-from-hard-coded-dag
and
http://stackoverflow.com/questions/41445571/spark-migrate-sql-window-function-to-rdd-for-better
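In case it helps, here is a minimal sketch of that StatCounter approach, assuming a single numeric column named "value" (made up for the example); RDD.stats() gives all the basic statistics in one pass:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("statcounter-demo").getOrCreate()

# Toy frame standing in for the real data; "value" is a made-up column.
df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

# One pass over the RDD gives count, mean, stdev, min and max together,
# instead of running a separate aggregation for each statistic.
stats = df.rdd.map(lambda row: row["value"]).stats()
print(stats.count(), stats.mean(), stats.stdev(), stats.min(), stats.max())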
Hi guys,
I have this situation:
1. A data frame with 22 columns.
2. I need to add some columns (feature engineering) using the existing
columns; 12 columns will be added for each column in a list.
3. I created a loop, but at the 5th item (column) the loop starts to go
very slow in the join part, I can observe
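Not knowing your exact expressions, here is a hedged sketch of the usual fix for this: accumulate the derived columns as Column expressions and apply them in a single select(), instead of joining the frame back to itself on every loop iteration. The column names and formulas below are made-up placeholders:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-loop-demo").getOrCreate()

# Toy frame; in your case this would be the 22-column frame.
df = spark.createDataFrame([(1, 10.0, 20.0), (2, 30.0, 40.0)],
                           ["id", "a", "b"])

feature_cols = ["a", "b"]  # the list you loop over
derived = []
for c in feature_cols:
    # Accumulate Column expressions instead of joining per iteration;
    # your real 12 expressions per column would go here.
    derived.append((F.col(c) * 2).alias(c + "_x2"))
    derived.append(F.log(F.col(c) + 1).alias(c + "_log"))

# One select resolves everything at once, so the plan stays shallow.
result = df.select("*", *derived)
result.show()

If some features really do require a join per iteration, then on Spark 2.1+ checkpointing the frame every few iterations (spark.sparkContext.setCheckpointDir(...) then df = df.checkpoint()) truncates the lineage that otherwise makes each successive join slower to plan.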
What I missed: also try increasing the number of partitions using repartition().
On Sun, 16 Apr 2017 at 11:06 am, ayan guha wrote:
It does not look like a Scala vs. Python thing. How big is your audience data
store? Can it be broadcast?
What memory footprint are you seeing? At what point is YARN killing it?
Depending on that, you may want to tweak the number of partitions of the
input dataset and increase the number of executors.
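For what it's worth, a small sketch of both suggestions together, with hypothetical table and column names: repartition the large input before the expensive stage, and mark the small audience table with a broadcast hint so the join itself needs no shuffle:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Stand-ins for the real tables; names and sizes are made up.
events = spark.range(0, 1000000).withColumnRenamed("id", "user_id")
audience = spark.createDataFrame([(1, "a"), (2, "b")], ["user_id", "segment"])

# Spread the big side over more partitions before the expensive stage...
events = events.repartition(200, "user_id")

# ...and hint that the small side fits in each executor's memory, so it is
# shipped to the executors instead of shuffling the big side for the join.
joined = events.join(F.broadcast(audience), "user_id")
joined.explain()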
Hi everybody,
I am using Apache Spark Streaming with a TCP connector to receive data.
I have a Python application that connects to a sensor and creates a TCP
server that waits for a connection from Apache Spark, and then sends JSON
data through this socket.
How can I manage to join many independent
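The question is cut off in the digest, but for the receiving side a minimal PySpark sketch would look like the following, assuming the sensor app listens on localhost:9999 (hypothetical host/port) and writes one JSON object per line:

import json
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="tcp-json-demo")
ssc = StreamingContext(sc, 5)  # 5-second batches

# One string per newline-terminated record from the sensor's TCP server.
lines = ssc.socketTextStream("localhost", 9999)
records = lines.map(json.loads)
records.pprint()

# For several independent sensors, build one DStream per host/port and
# merge them with ssc.union(stream1, stream2, ...) before processing.

ssc.start()
ssc.awaitTermination()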