DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello folks, Recently I have noticed unexpectedly big network traffic between Driver Program and Worker node. During debugging I have figured out that it is caused by following block of code —— Java ——— — DataFrame etpvRecords = context.sql(" SOME SQL query here"); Mapper m = new

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
I noticed that toJavaRDD causes a computation on the DataFrame, so is it considered an action, even though logically it's a transformation? On Nov 4, 2015 6:51 PM, "Aliaksei Tsyvunchyk" wrote: > Hello folks, > > Recently I have noticed unexpectedly big network traffic

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hello Romi, Do you mean that in my particular case I’m causing computation on dataFrame or it is regular behavior of DataFrame.toJavaRDD ? If it’s regular behavior, do you know which approach could be used to perform make/reduce on dataFrame without causing it to load all data to driver

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Romi Kuntsman
In my program I move between RDD and DataFrame several times. I know that the entire data of the DF doesn't go into the driver because it wouldn't fit there. But calling toJavaRDD does cause computation. Check the number of partitions you have on the DF and RDD... On Nov 4, 2015 7:54 PM,

Re: DataFrame.toJavaRDD cause fetching data to driver, is it expected ?

2015-11-04 Thread Aliaksei Tsyvunchyk
Hi Romi, Thank for pointing me. I quite new in Spark and not sure how it can help when I’ll check number of partitions in DF and RDD, so if you can give me some explanation it would be really helpful. Link to documentation will also help. > On Nov 4, 2015, at 1:05 PM, Romi Kuntsman